Best Practices and Guidelines

This section outlines the best practices for preparing high-quality datasets for KumoRFM, incorporating key insights from LocalTable and LocalGraph design patterns.

Table Structure and Entity Design

One entity or event per table:

# Good: Separate tables for different entities
users = df[['user_id', 'user_name', 'user_email']]
transactions = df[['transaction_id', 'user_id', 'amount', 'timestamp']]

# Avoid: Mixing entities in one table
mixed_data = df[['user_id', 'user_name', 'transaction_id', 'amount']]
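When the source is a single denormalized export, the entity table usually needs de-duplicating after the column split. The following is a minimal pandas sketch of that step; the sample frame and its values are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical denormalized export: one row per transaction, user fields repeated.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "user_name": ["Alice", "Alice", "Bob"],
    "user_email": ["a@example.com", "a@example.com", "b@example.com"],
    "transaction_id": [10, 11, 12],
    "amount": [100.0, 50.0, 200.0],
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Entity table: one row per user, so repeated user rows are collapsed.
users = (
    df[["user_id", "user_name", "user_email"]]
    .drop_duplicates(subset="user_id")
    .reset_index(drop=True)
)

# Event table: one row per transaction, keeping user_id as the foreign key.
transactions = df[["transaction_id", "user_id", "amount", "timestamp"]].copy()
```

Keeping `user_id` in the event table is what lets the two tables be linked again later.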

Single time column per table:

# Good: Split a table that has multiple timestamps so each table keeps one time column
policies = df[['policy_id', 'user_id', 'start_date', 'policy_type']]
claims = df[['claim_id', 'policy_id', 'claim_date', 'claim_amount']]

# Avoid: Multiple time columns in one table
mixed_times = df[['policy_id', 'start_date', 'claim_date', 'end_date']]
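One way to confirm the split worked is to count the datetime columns in each resulting frame. This is a sketch under assumed data; `datetime_columns` is a hypothetical helper, not part of KumoRFM.

```python
import pandas as pd

# Hypothetical flat export with two timestamps per row.
df = pd.DataFrame({
    "policy_id": [1, 1, 2],
    "user_id": [7, 7, 8],
    "policy_type": ["auto", "auto", "home"],
    "start_date": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-06-01"]),
    "claim_id": [100, 101, 102],
    "claim_date": pd.to_datetime(["2022-03-05", "2022-09-10", "2022-07-20"]),
    "claim_amount": [500.0, 1200.0, 300.0],
})

# One row per policy, with start_date as its only time column.
policies = (
    df[["policy_id", "user_id", "start_date", "policy_type"]]
    .drop_duplicates(subset="policy_id")
    .reset_index(drop=True)
)

# One row per claim, with claim_date as its only time column.
claims = df[["claim_id", "policy_id", "claim_date", "claim_amount"]].copy()


def datetime_columns(frame):
    """Return the names of all datetime-typed columns in a frame."""
    return list(frame.select_dtypes(include="datetime").columns)
```

Here `datetime_columns(policies)` and `datetime_columns(claims)` each return a single column name, which is the property the guideline asks for.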

Handle many-to-many relationships with junction tables:

# Good: Junction table pattern
users = df[['user_id', 'user_name']]
skills = df[['skill_id', 'skill_name']]
user_skills = df[['user_skill_id', 'user_id', 'skill_id', 'proficiency']]
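If the many-to-many data arrives as a list-valued column, the junction table can be derived with `DataFrame.explode`. The sketch below assumes a hypothetical `raw` frame with a `skills` list column and generates surrogate ids with a simple counter; real data may already carry its own ids.

```python
import pandas as pd

# Hypothetical input: each user row carries a list of skill names.
raw = pd.DataFrame({
    "user_id": [1, 2],
    "user_name": ["Alice", "Bob"],
    "skills": [["python", "sql"], ["sql"]],
})

users = raw[["user_id", "user_name"]].copy()

# One row per (user, skill) pair.
pairs = (
    raw[["user_id", "skills"]]
    .explode("skills")
    .rename(columns={"skills": "skill_name"})
)

# Entity table for skills with a generated surrogate key.
skills = pairs[["skill_name"]].drop_duplicates().reset_index(drop=True)
skills.insert(0, "skill_id", list(range(1, len(skills) + 1)))

# Junction table: its own primary key plus the two foreign keys.
user_skills = (
    pairs.merge(skills, on="skill_name")[["user_id", "skill_id"]]
    .reset_index(drop=True)
)
user_skills.insert(0, "user_skill_id", list(range(1, len(user_skills) + 1)))
```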

Data Preparation

Set dtypes at the pandas.DataFrame level before creating tables:

# Good: Set proper pandas dtypes first
import pandas as pd
import kumoai.experimental.rfm as rfm

df['user_id'] = df['user_id'].astype('int64')
df['category'] = df['category'].astype('string')
df['timestamp'] = pd.to_datetime(df['timestamp'])
table = rfm.LocalTable(df, "my_table")
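Raw exports often contain values that do not parse cleanly, so it can help to count the failures while converting. This is a pandas-only sketch with made-up data; using `errors="coerce"` is one possible policy, not something KumoRFM requires.

```python
import pandas as pd

# Hypothetical raw export where every column arrives as strings.
df = pd.DataFrame({
    "user_id": ["1", "2", "3"],
    "category": ["a", "b", "a"],
    "timestamp": ["2023-01-01", "2023-01-02", "not a date"],
})

df["user_id"] = df["user_id"].astype("int64")
df["category"] = df["category"].astype("string")

# errors="coerce" turns unparseable values into NaT instead of raising.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")

# Number of rows whose timestamp could not be parsed.
bad_timestamps = int(df["timestamp"].isna().sum())
```

A non-zero `bad_timestamps` is a signal to inspect the source data before building a table from it.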

Use consistent naming conventions:

# Good: Consistent foreign key naming
users.user_id, transactions.user_id, profiles.user_id

# Avoid: Inconsistent naming
users.id, transactions.uid, profiles.customer_id
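Inconsistent key names can be normalized with `DataFrame.rename` before any tables are created. The frames below are hypothetical stand-ins for the "Avoid" case above.

```python
import pandas as pd

# Hypothetical frames that name the same key three different ways.
users = pd.DataFrame({"id": [1, 2]})
transactions = pd.DataFrame({"uid": [1, 1, 2]})
profiles = pd.DataFrame({"customer_id": [1, 2]})

# Rename every variant to the shared name user_id.
users = users.rename(columns={"id": "user_id"})
transactions = transactions.rename(columns={"uid": "user_id"})
profiles = profiles.rename(columns={"customer_id": "user_id"})

key_names = {frame.columns[0] for frame in (users, transactions, profiles)}
```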

Ensure unique primary keys:

# Validate that the primary key is unique and has no missing values
assert df['user_id'].nunique() == len(df)
assert df['user_id'].notna().all()
# If the key is not unique, Kumo will automatically drop duplicates internally!
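Bare assertions stop at the first failure without saying how many rows are affected. The sketch below reports the counts and then de-duplicates explicitly; `primary_key_report` is a hypothetical helper, and keeping the first occurrence is an assumed policy that may not match your data.

```python
import pandas as pd


def primary_key_report(df, key):
    """Count the rows that would break a primary key on `key`."""
    null_rows = int(df[key].isna().sum())
    # keep=False marks every row that belongs to a duplicated group.
    duplicate_rows = int(df[key].dropna().duplicated(keep=False).sum())
    return {"null_rows": null_rows, "duplicate_rows": duplicate_rows}


# Hypothetical table with one duplicated key and one missing key.
df = pd.DataFrame({
    "user_id": [1, 2, 2, None],
    "name": ["a", "b", "c", "d"],
})
report = primary_key_report(df, "user_id")

# Resolve the problems explicitly rather than relying on implicit dropping.
clean = (
    df.dropna(subset=["user_id"])
    .drop_duplicates(subset="user_id", keep="first")
    .reset_index(drop=True)
)
```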

Semantic Type Assignment

Choose meaningful semantic types:

# IDs should use ID stype
table['user_id'].stype = 'ID'

# Text descriptions should use text stype
table['description'].stype = 'text'

# Limited categories should use categorical stype
table['status'].stype = 'categorical'

# Numerical measurements should use numerical stype
table['amount'].stype = 'numerical'

# Important: Inspect the metadata before proceeding
print(table.metadata)
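A rough pre-check on the DataFrame can flag columns whose semantic type deserves a second look. The heuristic below is purely illustrative: `suggest_stype`, the `_id` suffix rule, and the `max_categories` threshold are assumptions of this sketch and do not reflect how KumoRFM infers types. It only covers the four types named above and returns None for datetime columns.

```python
import pandas as pd


def suggest_stype(series, max_categories=20):
    """Rough guess at a semantic type; returns None when out of scope."""
    if pd.api.types.is_datetime64_any_dtype(series):
        return None  # time columns are outside this sketch
    if str(series.name).endswith("_id"):
        return "ID"
    if pd.api.types.is_bool_dtype(series):
        return "categorical"
    if pd.api.types.is_numeric_dtype(series):
        return "numerical"
    if series.nunique(dropna=True) <= max_categories:
        return "categorical"
    return "text"


# Hypothetical sample data.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "amount": [9.99, 20.0, 5.5],
    "status": ["open", "closed", "open"],
    "description": ["late delivery", "wrong size", "damaged box"],
})

# A tiny threshold is used here only because the sample has three rows.
suggestions = {col: suggest_stype(df[col], max_categories=2) for col in df.columns}
```

Comparing `suggestions` against the printed metadata is one way to spot a column that was typed differently than expected.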

Graph Construction

Design for meaningful relationships:

# Good: Meaningful entity relationships
graph = rfm.LocalGraph.from_data({
    'users': users_df,           # Entity table
    'transactions': transactions_df,  # Event table linked to users
    'products': products_df      # Entity table linked via transactions
})
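The summary at the end of this page also asks for a single connected component, meaning every table should be reachable from every other through some chain of links. The sketch below checks that property on a hand-written list of links. `is_connected` is a hypothetical helper; it does not read anything from a LocalGraph object.

```python
from collections import defaultdict, deque


def is_connected(tables, links):
    """Return True if every table is reachable from the first one via links."""
    if not tables:
        return True
    adjacency = defaultdict(set)
    for left, right in links:
        adjacency[left].add(right)
        adjacency[right].add(left)

    # Breadth-first search starting from the first table.
    seen = {tables[0]}
    queue = deque([tables[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return set(tables) <= seen


tables = ["users", "transactions", "products"]
# Assumed links: transactions references both users and products.
links = [("transactions", "users"), ("transactions", "products")]

connected = is_connected(tables, links)
disconnected = is_connected(tables, [("transactions", "users")])
```

In the second call `products` has no link to the other tables, so the check returns False.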

Ensure prediction-ready structure:

# Consider which PQL queries you want to run
# Example: "PREDICT COUNT(transactions.*, 0, 30, days) FOR users.user_id=1"
# Requires: users -> transactions relationship via user_id
# Requires: a time column on the transactions table
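Both requirements can be checked on the DataFrames before a graph is built. The following pandas sketch uses hypothetical data in which one transaction points at a user that does not exist.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "user_id": [1, 1, 3],  # user 3 is missing from the users table
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Requirement 1: every foreign key value should exist in the parent table.
orphans = transactions[~transactions["user_id"].isin(users["user_id"])]

# Requirement 2: the event table should have a real datetime column.
has_time_column = pd.api.types.is_datetime64_any_dtype(transactions["timestamp"])
```

Rows in `orphans` cannot be linked to any user, so they are worth fixing or dropping deliberately.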

Common Data Modeling Patterns

Entity-Event Pattern

# Users (entity) with transactions (events)
users = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'user_id': [1, 1, 2],
    'amount': [100, 50, 200],
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})
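To make the entity-event shape concrete, the sketch below counts each user's transactions in the 30 days after an anchor time using plain pandas. It only illustrates the kind of per-entity aggregate this layout supports. The anchor date and the window boundaries (exclusive start, inclusive end) are assumptions of this sketch, not a description of how KumoRFM evaluates a query.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["Alice", "Bob"]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "user_id": [1, 1, 2],
    "amount": [100, 50, 200],
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Assumed anchor time and window: (anchor, anchor + 30 days].
anchor = pd.Timestamp("2023-01-02")
window_end = anchor + pd.Timedelta(days=30)

in_window = transactions[
    (transactions["timestamp"] > anchor)
    & (transactions["timestamp"] <= window_end)
]

# Count events per user; users with no events in the window get 0.
counts = (
    in_window.groupby("user_id")
    .size()
    .reindex(users["user_id"].tolist(), fill_value=0)
)
```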

Hierarchical Entities

# Multiple levels: company -> department -> employee
companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})
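A quick way to check that each level of the hierarchy links to the next is to join upward with `validate="many_to_one"`, which raises if a parent key is duplicated. This is a pandas sketch over the same sample data.

```python
import pandas as pd

companies = pd.DataFrame({"company_id": [1, 2], "company_name": ["ACME", "TechCorp"]})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "company_id": [1, 1],
    "dept_name": ["Engineering", "Sales"],
})
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "dept_id": [1, 1, 2],
    "emp_name": ["Alice", "Bob", "Charlie"],
})

# Walk employee -> department -> company; a left join keeps every employee.
org = (
    employees
    .merge(departments, on="dept_id", how="left", validate="many_to_one")
    .merge(companies, on="company_id", how="left", validate="many_to_one")
)

# Employees whose chain is broken would show a missing company_name.
unlinked = int(org["company_name"].isna().sum())
```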

Junction Table for Many-to-Many

# Products and categories with many-to-many relationship
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Laptop', 'Mouse']})
categories = pd.DataFrame({'category_id': [1, 2], 'category_name': ['Electronics', 'Accessories']})
product_categories = pd.DataFrame({
    'product_category_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'category_id': [1, 2, 2]
})
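Joining through the junction table recovers the many-to-many pairs, which is a useful sanity check that both foreign keys resolve. The sketch below uses the same sample data and plain pandas merges.

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "product_name": ["Laptop", "Mouse"]})
categories = pd.DataFrame({
    "category_id": [1, 2],
    "category_name": ["Electronics", "Accessories"],
})
product_categories = pd.DataFrame({
    "product_category_id": [1, 2, 3],
    "product_id": [1, 1, 2],
    "category_id": [1, 2, 2],
})

# Resolve both foreign keys of the junction table to readable names.
resolved = (
    product_categories
    .merge(products, on="product_id")
    .merge(categories, on="category_id")
)

# Sorted category names per product.
per_product = (
    resolved.groupby("product_name")["category_name"]
    .apply(sorted)
    .to_dict()
)
```

If an inner join like this returns fewer rows than the junction table holds, some junction rows reference a product or category that does not exist.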

Summary

Following these best practices helps keep your KumoRFM datasets well-structured and validated:

Table Design:

One entity or event per table
Single time column per table
Unique primary keys with consistent naming
Junction tables for many-to-many relationships

Data Preparation:

Set proper pandas dtypes before creating tables
Use meaningful semantic types (ID, categorical, text, numerical)
Validate metadata and semantic types before proceeding

Graph Structure:

Design meaningful entity relationships
Consider PQL query requirements in your structure
Ensure the graph forms a single connected component
Test with validation workflow

These patterns will help you create robust, queryable datasets that work effectively with KumoRFM’s predictive capabilities.
