Data Requirements for KumoRFM#
This guide outlines the data requirements and best practices for working with KumoRFM (Kumo Relational Foundation Model).
Understanding these requirements is essential for creating high-quality datasets that maximize KumoRFM’s predictive capabilities.
Introduction#
KumoRFM operates on relational data organized as interconnected tables forming a graph structure.
The process starts with a set of pandas.DataFrame objects, which are transformed into LocalTable objects and assembled into a LocalGraph.
Proper data preparation ensures optimal model performance and reliable predictions.
Key Terms and Concepts#
Before diving into the technical details, it’s important to understand the key terms used throughout this guide:
- pandas DataFrame
  A two-dimensional labeled data structure in pandas, similar to a spreadsheet or SQL table. Data frames are the starting point for all KumoRFM data preparation workflows. A collection of data frames connected by primary key/foreign key (pkey/fkey) relationships defines a relational database.
- pandas dtype
  The data type of a pandas.Series or pandas.DataFrame column (e.g., int64, float64, object, bool). These represent how pandas stores and processes the data internally.
- Kumo Dtype (kumoapi.typing.Dtype)
  KumoRFM’s representation of physical data storage types (e.g., Dtype.int, Dtype.string, Dtype.float). These are mapped from pandas dtypes and determine how data is processed by the foundation model.
- Kumo Stype (kumoapi.typing.Stype)
  Semantic types that define how the data should be interpreted by the foundation model (e.g., Stype.numerical, Stype.categorical, Stype.ID). These determine what preprocessing and modeling techniques are applied to each column.
- LocalTable (kumoai.experimental.rfm.LocalTable)
  A wrapper around a pandas.DataFrame that includes metadata such as column types, the primary key, and the time column. Each table can have at most one primary key and at most one time column, but it can contain many foreign keys (primary keys of other tables). A LocalTable is the fundamental building block for defining KumoRFM graphs.
- LocalGraph (kumoai.experimental.rfm.LocalGraph)
  A collection of interconnected LocalTable objects representing the relational structure of your data. The LocalGraph defines how tables relate to each other through primary/foreign key relationships. How the tables are connected is a modeling decision that matters for the performance of the foundation model.
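Because a graph is defined through primary/foreign key relationships, it can help to sanity-check those keys in pandas before building any tables. The following is a minimal sketch using only pandas; the table and column names are hypothetical, and the checks are a manual pre-flight step rather than part of the KumoRFM API.

```python
import pandas as pd

# Hypothetical example tables; the names are for illustration only.
users = pd.DataFrame({"user_id": [1, 2, 3]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],
    "user_id": [1, 2, 1, 4],  # user 4 has no matching row in `users`
})

# A primary key should be unique and contain no missing values.
pkey_ok = bool(users["user_id"].is_unique and users["user_id"].notna().all())

# A foreign key should only reference primary key values that exist.
orphans = transactions[~transactions["user_id"].isin(users["user_id"])]

print("primary key valid:", pkey_ok)
print("orphaned foreign key rows:")
print(orphans)
```

Rows listed under the orphaned output point at a primary key that does not exist; whether to drop or repair them depends on the dataset.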
Understanding the distinction between dtype (physical storage) and stype (semantic meaning) is crucial: a column with Dtype.string could have Stype.categorical (for category labels) or Stype.text (for natural language), leading to completely different preprocessing approaches.
The other important modeling decision is the structure of the graph: it affects both the performance of KumoRFM on the data and which predictions can be defined with PQL (see Querying KumoRFM).
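To make the dtype/stype distinction concrete, the sketch below builds two columns that share the same pandas string dtype but play different semantic roles. The `guess_semantic_role` helper and its 20-character threshold are hypothetical and purely illustrative; they are not how KumoRFM infers semantic types.

```python
import pandas as pd

df = pd.DataFrame({
    # Short, repeated labels: a categorical-looking column.
    "plan": pd.Series(["free", "pro", "free", "pro"], dtype="string"),
    # Long, unique sentences: a free-text-looking column.
    "review": pd.Series([
        "Arrived quickly and works as described.",
        "The battery drains faster than I expected.",
        "Setup was confusing but support helped.",
        "Great value for the price.",
    ], dtype="string"),
})

# Both columns have the same physical storage type.
same_dtype = df["plan"].dtype == df["review"].dtype


def guess_semantic_role(col: pd.Series, max_label_len: float = 20.0) -> str:
    """Crude, hypothetical heuristic based on mean string length."""
    mean_len = float(col.str.len().mean())
    return "categorical" if mean_len <= max_label_len else "text"


for name in df.columns:
    print(name, df[name].dtype, guess_semantic_role(df[name]))
```

The physical type alone cannot separate the two columns, which is why the semantic type is worth reviewing explicitly for string data.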
Guide Structure#
This guide is organized into focused sections for easy navigation:
Getting Started#
For a complete end-to-end workflow, here’s the typical process:
import pandas as pd
import kumoai.experimental.rfm as rfm

# 1. Prepare your pandas DataFrames with proper dtypes
df_users = pd.DataFrame({
    'user_id': pd.Series([1, 2, 3], dtype='int64'),
    'name': pd.Series(['Alice', 'Bob', 'Charlie'], dtype='string'),
    'age': pd.Series([25, 30, 35], dtype='int32'),
})
df_transactions = pd.DataFrame({
    'transaction_id': pd.Series([1, 2, 3], dtype='int64'),
    'user_id': pd.Series([1, 2, 1], dtype='int64'),
    'amount': pd.Series([100.0, 250.0, 75.0], dtype='float64'),
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
})

# 2. Create LocalGraph (automatically creates tables and infers metadata)
graph = rfm.LocalGraph.from_data({
    'users': df_users,
    'transactions': df_transactions,
})

# 3. Validate the graph
graph.validate()

# 4. Use with KumoRFM (the query must reference the table by its name, `users`)
model = rfm.KumoRFM(graph)
model.predict("PREDICT users.age FOR users.user_id=1")
This example demonstrates the core workflow. For detailed explanations of each step, refer to the specific sections linked above.
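Step 1 of the workflow assumes the data frames already carry proper dtypes, which is often not the case for data loaded from CSV files, where every column may arrive as strings. The following pandas-only sketch shows one way to normalize such a frame before building a graph; the `raw` frame is hypothetical and mirrors the transactions table from the example.

```python
import pandas as pd

# Hypothetical raw data in which every column was read as strings.
raw = pd.DataFrame({
    "transaction_id": ["1", "2", "3"],
    "amount": ["100.0", "250.0", "75.0"],
    "timestamp": ["2023-01-01", "2023-01-02", "2023-01-03"],
})

# Convert each column to the physical type that matches its content.
clean = raw.assign(
    transaction_id=pd.to_numeric(raw["transaction_id"]),
    amount=pd.to_numeric(raw["amount"]),
    timestamp=pd.to_datetime(raw["timestamp"]),
)

print(clean.dtypes)
```

Converting up front keeps numeric and datetime columns from being treated as plain strings further down the pipeline.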