Data Requirements for KumoRFM

This guide outlines the data requirements and best practices for working with KumoRFM (Kumo Relational Foundation Model). Understanding these requirements is essential for creating high-quality datasets that maximize KumoRFM’s predictive capabilities.

Introduction

KumoRFM operates on relational data organized as interconnected tables that form a graph. The workflow starts with a set of pandas.DataFrame objects, which are transformed into LocalTable objects and assembled into a LocalGraph. Proper data preparation ensures optimal model performance and reliable predictions.

Key Terms and Concepts

Before diving into the technical details, it’s important to understand the key terms used throughout this guide:

pandas DataFrame

A two-dimensional labeled data structure in pandas, similar to a spreadsheet or SQL table. Data frames are the starting point for all KumoRFM data preparation workflows. A collection of data frames connected by primary key/foreign key relationships defines a relational database.
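As a minimal sketch of what a primary key/foreign key relationship between two data frames looks like, the snippet below checks both sides of the link using plain pandas. The table names, column names, and values are hypothetical and chosen only for illustration; this is not part of the KumoRFM API.

```python
import pandas as pd

# Hypothetical example tables; names and values are illustrative only.
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "user_id": [1, 2, 1],  # foreign key pointing at users.user_id
})

# A primary key should identify each row uniquely and contain no missing values.
pkey_is_valid = bool(users["user_id"].is_unique and users["user_id"].notna().all())

# Every foreign key value should reference an existing primary key value.
orphaned_orders = orders[~orders["user_id"].isin(users["user_id"])]

print(pkey_is_valid)
print(len(orphaned_orders))
```

Running checks like these before building a graph makes broken links visible early, while the data is still easy to inspect in pandas.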

pandas dtype

The data type of a pandas.Series or pandas.DataFrame column (e.g., int64, float64, object, bool). These represent how pandas stores and processes the data internally.
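The following sketch shows how to inspect and adjust pandas dtypes with standard pandas calls. The data frame and its column names are made up for this example.

```python
import pandas as pd

# Hypothetical data frame; column names and values are illustrative only.
df = pd.DataFrame({
    "user_id": pd.Series([1, 2, 3], dtype="int64"),
    "score": pd.Series([0.5, 0.7, 0.9], dtype="float64"),
    "active": pd.Series([True, False, True], dtype="bool"),
})

# Map each column name to the string name of its pandas dtype.
dtype_names = {column: str(dtype) for column, dtype in df.dtypes.items()}
print(dtype_names)

# Cast a column explicitly when the stored dtype is not the one you want.
df["user_id"] = df["user_id"].astype("int32")
print(df["user_id"].dtype)
```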

Kumo Dtype (kumoapi.typing.Dtype)

KumoRFM’s representation of physical data storage types (e.g., Dtype.int, Dtype.string, Dtype.float). These are mapped from pandas dtypes and determine how data is processed by the foundation model.

Kumo Stype (kumoapi.typing.Stype)

Semantic types that define how the data should be interpreted by the foundation model (e.g., Stype.numerical, Stype.categorical, Stype.ID). These determine what preprocessing and modeling techniques are applied to each column.

LocalTable (kumoai.experimental.rfm.LocalTable)

A wrapper around a pandas.DataFrame that includes metadata such as column types, the primary key, and the time column. Each table can have at most one primary key and at most one time column, but it can contain many foreign keys (references to the primary keys of other tables). A LocalTable is the fundamental building block for defining KumoRFM graphs.
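Since a table has at most one time column, it can help to see which columns are datetime-typed before wrapping a data frame. The sketch below uses only pandas and assumes, as in the Getting Started example, that timestamps are stored as datetime values; the table and column names are hypothetical.

```python
import pandas as pd

# Hypothetical table in which timestamps arrive as plain strings.
events = pd.DataFrame({
    "event_id": [1, 2, 3],
    "created_at": ["2023-01-01", "2023-01-02", "2023-01-03"],
})

# Parse the strings into datetime values.
events["created_at"] = pd.to_datetime(events["created_at"])

# List the datetime-typed columns, i.e. the candidates for a time column.
datetime_columns = [
    column
    for column in events.columns
    if pd.api.types.is_datetime64_any_dtype(events[column])
]
print(datetime_columns)
```

If more than one column shows up here, choosing which one serves as the time column is a decision to make explicitly.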

LocalGraph (kumoai.experimental.rfm.LocalGraph)

A collection of interconnected LocalTable objects representing the relational structure of your data. The LocalGraph defines how tables relate to each other through primary/foreign key relationships. How we connect the tables is a modeling decision that is important for the performance of the foundation model.

Understanding the distinction between dtype (physical storage) and stype (semantic meaning) is crucial: a column with Dtype.string could have Stype.categorical (for category labels) or Stype.text (for natural language), leading to completely different preprocessing approaches. The other important modeling decision is the structure of the graph, which affects both the performance of KumoRFM on the data and which predictions can be defined with PQL (see Querying KumoRFM).
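To make the dtype/stype distinction concrete, here is a rough heuristic that separates label-like string columns from free-text ones. It is an illustration only: the function guess_string_semantics, its thresholds, and the sample columns are assumptions made for this sketch, and it does not reflect how KumoRFM actually infers semantic types.

```python
import pandas as pd


def guess_string_semantics(column: pd.Series, max_categories: int = 10) -> str:
    """Illustrative heuristic only; not KumoRFM's inference logic."""
    values = column.dropna().astype(str)
    # Average number of whitespace-separated words per value.
    avg_words = values.str.split().str.len().mean()
    # Few distinct, short values look like category labels.
    if values.nunique() <= max_categories and avg_words <= 2:
        return "categorical"
    return "text"


# Both columns hold strings, but they call for different semantic types.
plan = pd.Series(["free", "pro", "free", "enterprise"])
review = pd.Series([
    "Arrived quickly and works as described.",
    "The battery drains far too fast for daily use.",
])

plan_guess = guess_string_semantics(plan)
review_guess = guess_string_semantics(review)
print(plan_guess, review_guess)
```

The point of the sketch is that physical storage alone cannot settle the question: both columns hold strings, yet only their content suggests how they should be interpreted.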

Guide Structure

This guide is organized into focused sections for easy navigation:

Getting Started

For a complete end-to-end workflow, here’s the typical process:

import pandas as pd
import kumoai.experimental.rfm as rfm

# 1. Prepare your pandas DataFrames with proper dtypes
df_users = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int64'),
    'name': pd.Series(['Alice', 'Bob', 'Charlie'], dtype='string'),
    'age': pd.Series([25, 30, 35], dtype='int32')
})

df_transactions = pd.DataFrame({
    'transaction_id': pd.Series([1, 2, 3], dtype='int64'),
    'user_id': pd.Series([1, 2, 1], dtype='int64'),
    'amount': pd.Series([100.0, 250.0, 75.0], dtype='float64'),
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})

# 2. Create LocalGraph (automatically creates tables and infers metadata)
graph = rfm.LocalGraph.from_data({
    'users': df_users,
    'transactions': df_transactions
})

# 3. Validate the graph
graph.validate()

# 4. Use with KumoRFM
model = rfm.KumoRFM(graph)

# The query must reference the table by its graph name ('users')
model.predict("PREDICT users.age FOR users.user_id=1")

This example demonstrates the core workflow. For detailed explanations of each step, refer to the specific sections linked above.