kumoai.experimental.rfm#

KumoRFM (Kumo Relational Foundation Model) is an experimental feature that provides a powerful interface to query relational data using a pre-trained foundation model. Unlike traditional machine learning approaches that require feature engineering and model training, KumoRFM can generate predictions directly from raw relational data using PQL queries.

LocalTable

A table backed by a pandas.DataFrame.

LocalGraph

A graph of LocalTable objects, akin to relationships between tables in a relational database.

KumoRFM

The Kumo Relational Foundation model (RFM) from the KumoRFM: A Foundation Model for In-Context Learning on Relational Data paper.

Note

KumoRFM is currently in experimental phase. The API may change in future releases.

Overview#

KumoRFM consists of three main components:

  1. LocalTable: A pandas.DataFrame wrapper that manages metadata including semantic types, primary keys, and time columns.

  2. LocalGraph: A collection of LocalTable objects with edges defining relationships between tables.

  3. KumoRFM: The main interface to query the relational foundation model.

Workflow#

The typical KumoRFM workflow follows these steps:

  1. Data Preparation: Load your relational data into pandas.DataFrame objects.

  2. Table Creation: Create LocalTable objects from your data frames.

  3. Graph Construction: Build a LocalGraph that defines relationships between tables.

  4. Model Initialization: Initialize KumoRFM with your graph.

  5. Querying: Execute predictive queries to get predictions, explanations or evaluations.

Quick Example#

Here’s a simple example showing how to use KumoRFM with e-commerce data:

import pandas as pd
from kumoai.experimental.rfm import LocalTable, LocalGraph, KumoRFM

# Load your tables into memory:
users_df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'created_at': pd.date_range('2023-01-01', periods=5),
    'age': [25, 30, 35, 40, 45]
})

orders_df = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'user_id': [1, 2, 1, 3, 4],
    'amount': [100.0, 250.0, 75.0, 300.0, 150.0],
    'order_date': pd.date_range('2023-02-01', periods=5)
})

# Create a graph from tables:
graph = LocalGraph.from_data({
    'users': users_df,
    'orders': orders_df
})

# Initialize KumoRFM:
rfm = KumoRFM(graph)

# Query the model
query = "PREDICT COUNT(orders.*, 0, 30, days)>0 FOR users.user_id=0"
result = rfm.predict(query)

# Result is a pandas DataFrame with prediction probabilities
print(result)  # user_id  COUNT(orders.*, 0, 30, days) > 0
               # 1        0.85

Query Language#

KumoRFM uses the predictive query language (PQL) for making predictions. For a broader introduction to PQL, see the Querying KumoRFM tutorial. The KumoRFM PQL syntax differs slightly from the general PQL syntax. The user must specify the entity (or entities) to make predictions for, while the general PQL structure stays the same:

PREDICT <aggregation_expression> FOR <entity_specification>

The entities for each query can be specified in one of two ways:

  • By specifying a single entity id, e.g., users.user_id=1

  • By specifying a tuple of entity ids, e.g., users.user_id IN (1, 2, 3 )

Classes#

LocalTable#

A LocalTable represents a single table backed by a pandas.DataFrame with rich metadata support:

Key features:

  • Metadata Management: Automatic inference of data types and semantic types

  • Primary Key Support: Specify or auto-detect primary keys

  • Time Column Support: Handle temporal data with designated time columns

  • Validation: Comprehensive validation of table structure and metadata

table = LocalTable(df=df, name="users")

# Optional: Infer metadata automatically
table.infer_metadata()

# Verify metadata:
print(table.metadata)

# Modify column metadata:
table.primary_key = "user_id"
table["age"].stype = "numerical"

LocalGraph#

A LocalGraph represents relationships between multiple LocalTable objects, similar to a relational database schema.

  • Construction Methods: Create from tables or directly from data frames

  • Relationship Management: Define and manage edges between tables

  • Automatic Link Inference: Intelligent detection of foreign key relationships

  • Graph Validation: Ensure graph structure meets requirements before using with KumoRFM

# Create from tables:
graph = LocalGraph(tables=[users_table, orders_table]).infer_links()

# Or create directly from data:
graph = LocalGraph.from_data({
    'users': users_df,
    'orders': orders_df,
})

# Manual relationship management:
graph.link('orders', 'user_id', 'users')
graph.unlink('orders', 'user_id', 'users')

# Validation:
graph.validate()

KumoRFM#

The main KumoRFM class provides the interface to query the relational foundation model.

# Initialize with local graph:
rfm = KumoRFM(graph)

# Query the model:
query = "PREDICT COUNT(orders.*, 0, 30, days)>0 FOR users.user_id=0"
result = rfm.predict(query)
print(result)  # user_id  COUNT(orders.*, 0, 30, days) > 0
               # 1        0.85

rfm.predict(query, explain=True)
metrics = rfm.evaluate(query)

Best Practices#

Note

For a more detailed introduction into data preparation, see Data Requirements for KumoRFM.

Data Preparation#

  1. Clean Data: Ensure your pd.DataFrame objects are clean with no duplicate column names

  2. Consistent Types: Use consistent data types across related columns

  3. Consistent Column Names: Ensure column names are consistent across related tables

  4. Primary Keys: Include a primary key column in each table if possible

  5. Time Columns: Each table should have at most one time column (if possible)

Graph Design#

  1. Metapath lengths: Keep metapath lengths reasonable (ideally 2-3 hops)

    • Longer paths may lead to performance issues and less interpretable results

    • If your relational schema is very complex, it might be worth splitting it into multiple graphs

  2. Meaningful Relationships: Ensure that the inferred relationships are meaningful/correct

  3. Validation: Always validate your graph before using with KumoRFM.

  4. Size Limits: There is an upper size limit on context. If you hit this limit, reduce your graph to the most meaningful tables and or columns.

Querying#

  1. Start Simple: Begin with basic COUNT queries before moving to complex aggregations.

  2. Time Windows: Use appropriate time windows for temporal queries.

  3. Entity Specification: Be specific about which entities you are predicting for.