kumoai.experimental.rfm#
KumoRFM (Kumo Relational Foundation Model) is an experimental feature that provides an interface to query relational data using a pre-trained foundation model. Unlike traditional machine learning approaches that require feature engineering and model training, KumoRFM can generate predictions directly from raw relational data using predictive query language (PQL) queries.
LocalTable | A table backed by a pandas.DataFrame.
LocalGraph | A graph of LocalTable objects.
KumoRFM | The Kumo Relational Foundation Model (RFM) from the KumoRFM: A Foundation Model for In-Context Learning on Relational Data paper.
Note
KumoRFM is currently in an experimental phase. The API may change in future releases.
Overview#
KumoRFM consists of three main components:
LocalTable: A pandas.DataFrame wrapper that manages metadata including semantic types, primary keys, and time columns.
LocalGraph: A collection of LocalTable objects with edges defining relationships between tables.
KumoRFM: The main interface to query the relational foundation model.
Workflow#
The typical KumoRFM workflow follows these steps:
Data Preparation: Load your relational data into pandas.DataFrame objects.
Table Creation: Create LocalTable objects from your data frames.
Graph Construction: Build a LocalGraph that defines relationships between tables.
Model Initialization: Initialize KumoRFM with your graph.
Querying: Execute predictive queries to get predictions, explanations, or evaluations.
Quick Example#
Here’s a simple example showing how to use KumoRFM with e-commerce data:
import pandas as pd
from kumoai.experimental.rfm import LocalTable, LocalGraph, KumoRFM

# Load your tables into memory:
users_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'created_at': pd.date_range('2023-01-01', periods=5),
    'age': [25, 30, 35, 40, 45],
})
orders_df = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'user_id': [1, 2, 1, 3, 4],
    'amount': [100.0, 250.0, 75.0, 300.0, 150.0],
    'order_date': pd.date_range('2023-02-01', periods=5),
})

# Create a graph from tables:
graph = LocalGraph.from_data({
    'users': users_df,
    'orders': orders_df,
})

# Initialize KumoRFM:
rfm = KumoRFM(graph)

# Query the model for a user id that exists in users_df:
query = "PREDICT COUNT(orders.*, 0, 30, days)>0 FOR users.user_id=1"
result = rfm.predict(query)

# Result is a pandas DataFrame with prediction probabilities, e.g.:
print(result)
# user_id  COUNT(orders.*, 0, 30, days) > 0
#       1                              0.85
Query Language#
KumoRFM uses the predictive query language (PQL) for making predictions. For a broader introduction to PQL, see the Querying KumoRFM tutorial.
The KumoRFM PQL syntax differs slightly from the general PQL syntax: the user must specify the entity (or entities) to make predictions for, while the general PQL structure stays the same:
PREDICT <aggregation_expression> FOR <entity_specification>
The entities for each query can be specified in one of two ways:
By specifying a single entity id, e.g., users.user_id=1
By specifying a tuple of entity ids, e.g., users.user_id IN (1, 2, 3)
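The two entity forms above can also be assembled programmatically. The following is a minimal sketch rather than part of the kumoai API: `entity_spec` and `build_query` are hypothetical helpers introduced here for illustration, and the sketch assumes integer entity ids, as in the examples on this page.

```python
from typing import Iterable, Union


def entity_spec(table: str, key: str, ids: Union[int, Iterable[int]]) -> str:
    """Format the FOR clause for a single id or for several ids.

    Hypothetical helper: assumes integer ids, as in the examples above.
    """
    if isinstance(ids, int):
        return f"{table}.{key}={ids}"
    id_list = [str(i) for i in ids]
    if not id_list:
        raise ValueError("at least one entity id is required")
    return f"{table}.{key} IN ({', '.join(id_list)})"


def build_query(target: str, table: str, key: str,
                ids: Union[int, Iterable[int]]) -> str:
    """Combine the aggregation expression and the entity specification."""
    return f"PREDICT {target} FOR {entity_spec(table, key, ids)}"


target = "COUNT(orders.*, 0, 30, days)>0"
single = build_query(target, "users", "user_id", 1)
multi = build_query(target, "users", "user_id", [1, 2, 3])

print(single)  # PREDICT COUNT(orders.*, 0, 30, days)>0 FOR users.user_id=1
print(multi)   # PREDICT COUNT(orders.*, 0, 30, days)>0 FOR users.user_id IN (1, 2, 3)
```

Either string can then be passed to `rfm.predict` in the same way as the hand-written query in the Quick Example.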
Classes#
LocalTable#
A LocalTable represents a single table backed by a pandas.DataFrame with rich metadata support.
Key features:
Metadata Management: Automatic inference of data types and semantic types
Primary Key Support: Specify or auto-detect primary keys
Time Column Support: Handle temporal data with designated time columns
Validation: Comprehensive validation of table structure and metadata
table = LocalTable(df=df, name="users")
# Optional: Infer metadata automatically
table.infer_metadata()
# Verify metadata:
print(table.metadata)
# Modify column metadata:
table.primary_key = "user_id"
table["age"].stype = "numerical"
LocalGraph#
A LocalGraph represents relationships between multiple LocalTable objects, similar to a relational database schema.
Construction Methods: Create from tables or directly from data frames
Relationship Management: Define and manage edges between tables
Automatic Link Inference: Intelligent detection of foreign key relationships
Graph Validation: Ensure graph structure meets requirements before using with KumoRFM
# Create from tables:
graph = LocalGraph(tables=[users_table, orders_table]).infer_links()
# Or create directly from data:
graph = LocalGraph.from_data({
    'users': users_df,
    'orders': orders_df,
})
# Manual relationship management:
graph.link('orders', 'user_id', 'users')
graph.unlink('orders', 'user_id', 'users')
# Validation:
graph.validate()
KumoRFM#
The main KumoRFM class provides the interface to query the relational foundation model.
# Initialize with local graph:
rfm = KumoRFM(graph)
# Query the model:
query = "PREDICT COUNT(orders.*, 0, 30, days)>0 FOR users.user_id=1"
result = rfm.predict(query)
print(result)
# user_id  COUNT(orders.*, 0, 30, days) > 0
#       1                              0.85
rfm.predict(query, explain=True)
metrics = rfm.evaluate(query)
Best Practices#
Note
For a more detailed introduction to data preparation, see Data Requirements for KumoRFM.
Data Preparation#
Clean Data: Ensure your pd.DataFrame objects are clean with no duplicate column names
Consistent Types: Use consistent data types across related columns
Consistent Column Names: Ensure column names are consistent across related tables
Primary Keys: Include a primary key column in each table if possible
Time Columns: Each table should have at most one time column (if possible)
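Several of the checks above can be run with plain pandas before any tables are handed to KumoRFM. The sketch below is illustrative only: `check_tables` is a hypothetical helper and not part of kumoai, and it covers just duplicate column names, primary key uniqueness, key naming, and key dtype consistency for one parent/child pair.

```python
import pandas as pd


def check_tables(parent, primary_key, child, foreign_key):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper for illustration, not part of the kumoai API.
    """
    problems = []
    # Clean data: no duplicate column names in either frame.
    for label, df in (("parent", parent), ("child", child)):
        if df.columns.duplicated().any():
            problems.append(f"{label}: duplicate column names")
    if problems:
        return problems  # column lookups below would be ambiguous
    # Primary keys: values should identify rows uniquely.
    if not parent[primary_key].is_unique:
        problems.append(f"parent: '{primary_key}' contains duplicate values")
    # Consistent column names across related tables.
    if primary_key != foreign_key:
        problems.append(f"key names differ: '{primary_key}' vs '{foreign_key}'")
    # Consistent types across related columns (compared by dtype name).
    if str(parent[primary_key].dtype) != str(child[foreign_key].dtype):
        problems.append("dtype mismatch between primary and foreign key")
    return problems


users_df = pd.DataFrame({"user_id": [1, 2, 3], "age": [25, 30, 35]})
orders_df = pd.DataFrame({"order_id": [101, 102], "user_id": [1, 2]})

ok = check_tables(users_df, "user_id", orders_df, "user_id")
print(ok)  # []

# Storing the foreign key as strings breaks type consistency.
bad_orders_df = orders_df.assign(user_id=orders_df["user_id"].astype(str))
bad = check_tables(users_df, "user_id", bad_orders_df, "user_id")
print(bad)
```

Returning a list of messages instead of raising on the first failure makes it possible to report every issue for a table pair in one pass.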
Graph Design#
Metapath Lengths: Keep metapath lengths reasonable (ideally 2-3 hops)
  Longer paths may lead to performance issues and less interpretable results
  If your relational schema is very complex, it might be worth splitting it into multiple graphs
Meaningful Relationships: Ensure that the inferred relationships are meaningful and correct
Validation: Always validate your graph before using it with KumoRFM
Size Limits: There is an upper size limit on context. If you hit this limit, reduce your graph to the most meaningful tables and/or columns
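One way to act on the metapath guidance is to count how many links separate each table from the table you predict for. The sketch below is a rough, library-independent illustration: `hops_from` is a hypothetical helper, the table names other than `users` and `orders` are made up, and it assumes that a metapath's length can be approximated by the number of links on the shortest path between two tables.

```python
from collections import deque


def hops_from(entity_table, edges):
    """Return {table: hops} using breadth-first search over undirected links.

    Hypothetical helper: `edges` is a list of (child_table, parent_table) pairs.
    """
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    hops = {entity_table: 0}
    queue = deque([entity_table])
    while queue:
        table = queue.popleft()
        for nxt in neighbors.get(table, ()):
            if nxt not in hops:
                hops[nxt] = hops[table] + 1
                queue.append(nxt)
    return hops


# Illustrative schema; only users and orders appear elsewhere on this page.
edges = [
    ("orders", "users"),
    ("order_items", "orders"),
    ("order_items", "products"),
    ("products", "suppliers"),
]
hops = hops_from("users", edges)
too_far = sorted(table for table, h in hops.items() if h > 3)

print(hops["products"])  # 3
print(too_far)           # ['suppliers']
```

Tables that land beyond the 2-3 hop range are natural candidates to drop or to move into a separate graph.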
Querying#
Start Simple: Begin with basic COUNT queries before moving to complex aggregations.
Time Windows: Use appropriate time windows for temporal queries.
Entity Specification: Be specific about which entities you are predicting for.