Best Practices and Guidelines#
This section outlines best practices for preparing high-quality datasets for KumoRFM, incorporating key insights from the LocalTable and LocalGraph design patterns.
Table Structure and Entity Design#
One entity or event per table:
# Good: Separate tables for different entities
users = df[['user_id', 'user_name', 'user_email']]
transactions = df[['transaction_id', 'user_id', 'amount', 'timestamp']]

# Avoid: Mixing entities in one table
mixed_data = df[['user_id', 'user_name', 'transaction_id', 'amount']]
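When the source data arrives as a single denormalized export, the entity table usually needs deduplication after the column split, since user attributes repeat on every transaction row. The following is a minimal sketch of that step; the sample frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical denormalized export: one row per transaction, with the
# user's attributes repeated on every row.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "user_name": ["Alice", "Alice", "Bob"],
    "user_email": ["alice@example.com", "alice@example.com", "bob@example.com"],
    "transaction_id": [10, 11, 12],
    "amount": [100.0, 50.0, 200.0],
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Entity table: collapse repeated rows so there is one row per user.
users = (
    df[["user_id", "user_name", "user_email"]]
    .drop_duplicates(subset="user_id")
    .reset_index(drop=True)
)

# Event table: one row per transaction, keeping user_id as the foreign key.
transactions = df[["transaction_id", "user_id", "amount", "timestamp"]].copy()

print(users)
print(transactions)
```

Note that `drop_duplicates(subset="user_id")` keeps the first row per user; if user attributes can change over time, deciding which row to keep is a modeling choice this sketch does not address.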
Single time column per table:
# Good: Split tables with multiple timestamps
policies = df[['policy_id', 'user_id', 'start_date', 'policy_type']]
claims = df[['claim_id', 'policy_id', 'claim_date', 'claim_amount']]

# Avoid: Multiple time columns in one table
mixed_times = df[['policy_id', 'start_date', 'claim_date', 'end_date']]
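To spot tables that break the single-time-column guideline, one option is to count datetime-typed columns per frame before creating any tables. This is a small sketch using only pandas; the helper name `datetime_columns` and the sample frames are assumptions for illustration, and it only detects columns that already have a datetime dtype.

```python
import pandas as pd

def datetime_columns(df: pd.DataFrame) -> list:
    """Return the names of all columns that hold datetime values."""
    return [
        col for col in df.columns
        if pd.api.types.is_datetime64_any_dtype(df[col])
    ]

# Hypothetical frame that mixes two event timestamps in one table.
mixed_times = pd.DataFrame({
    "policy_id": [1, 2],
    "start_date": pd.to_datetime(["2023-01-01", "2023-02-01"]),
    "claim_date": pd.to_datetime(["2023-03-01", "2023-04-01"]),
})
claims = mixed_times[["policy_id", "claim_date"]]

for name, frame in {"mixed_times": mixed_times, "claims": claims}.items():
    cols = datetime_columns(frame)
    if len(cols) > 1:
        print(f"{name}: multiple time columns {cols}, consider splitting")
```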
Handle many-to-many relationships with junction tables:
# Good: Junction table pattern
users = df[['user_id', 'user_name']]
skills = df[['skill_id', 'skill_name']]
user_skills = df[['user_skill_id', 'user_id', 'skill_id', 'proficiency']]
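Many-to-many data often starts out as a list-valued column rather than as three tidy tables. The sketch below shows one way to derive the junction table from such a column with `DataFrame.explode`; the `raw` frame, the surrogate keys, and the omission of `proficiency` are simplifications made for this example.

```python
import pandas as pd

# Hypothetical raw data: each user carries a list of skill names.
raw = pd.DataFrame({
    "user_id": [1, 2],
    "user_name": ["Alice", "Bob"],
    "skills": [["python", "sql"], ["sql"]],
})

users = raw[["user_id", "user_name"]].copy()

# One row per (user, skill) pair.
pairs = (
    raw[["user_id", "skills"]]
    .explode("skills")
    .rename(columns={"skills": "skill_name"})
)

# Entity table for skills, with a surrogate primary key.
skills = pairs[["skill_name"]].drop_duplicates().reset_index(drop=True)
skills.insert(0, "skill_id", list(range(1, len(skills) + 1)))

# Junction table: its own primary key plus one foreign key per side.
user_skills = pairs.merge(skills, on="skill_name")[["user_id", "skill_id"]]
user_skills = user_skills.reset_index(drop=True)
user_skills.insert(0, "user_skill_id", list(range(1, len(user_skills) + 1)))

print(user_skills)
```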
Data Preparation#
Set dtypes at the pandas.DataFrame level before creating tables:
# Good: Set proper pandas dtypes first
df['user_id'] = df['user_id'].astype('int64')
df['category'] = df['category'].astype('string')
df['timestamp'] = pd.to_datetime(df['timestamp'])
table = rfm.LocalTable(df, "my_table")
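Real data frequently contains values that do not parse cleanly, so it can help to surface them during conversion instead of letting the cast fail outright. The sketch below uses `pd.to_datetime(..., errors="coerce")` to find unparseable timestamps; the sample frame is hypothetical, and what to do with the bad rows (fix, drop, or keep as missing) is left open.

```python
import pandas as pd

# Hypothetical raw frame in which every column was read as a string.
df = pd.DataFrame({
    "user_id": ["1", "2", "3"],
    "category": ["a", "b", "a"],
    "timestamp": ["2023-01-01", "2023-01-02", "not a date"],
})

df["user_id"] = df["user_id"].astype("int64")
df["category"] = df["category"].astype("string")

# Coerce unparseable values to NaT, then report the rows that failed.
parsed = pd.to_datetime(df["timestamp"], errors="coerce")
bad_rows = df.loc[parsed.isna() & df["timestamp"].notna()]
df["timestamp"] = parsed

print(bad_rows)
print(df.dtypes)
```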
Use consistent naming conventions:
# Good: Consistent foreign key naming
users.user_id, transactions.user_id, profiles.user_id

# Avoid: Inconsistent naming
users.id, transactions.uid, profiles.customer_id
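If the source tables already use inconsistent key names, they can be aligned with `DataFrame.rename` before any tables are created. A brief sketch, with made-up frames mirroring the "avoid" case above:

```python
import pandas as pd

# Hypothetical frames that name the same key three different ways.
users = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
transactions = pd.DataFrame({"transaction_id": [10, 11], "uid": [1, 2]})
profiles = pd.DataFrame({"profile_id": [100, 101], "customer_id": [1, 2]})

# Align every table on a single key name.
users = users.rename(columns={"id": "user_id"})
transactions = transactions.rename(columns={"uid": "user_id"})
profiles = profiles.rename(columns={"customer_id": "user_id"})

# The shared column name now makes the relationship explicit.
shared = set(users.columns) & set(transactions.columns) & set(profiles.columns)
print(shared)
```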
Ensure unique primary keys:
# Validate uniqueness
assert df['user_id'].nunique() == len(df)
assert df['user_id'].notna().all()
# Otherwise, Kumo will automatically drop duplicates internally!
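A bare `assert` only says that something is wrong. When cleaning data, a count of the offending rows is often more useful. The helper below is one possible sketch; the function name `check_primary_key` and the shape of its report are assumptions made for this example.

```python
import pandas as pd

def check_primary_key(df: pd.DataFrame, key: str) -> dict:
    """Count missing keys and rows whose key value occurs more than once."""
    nulls = int(df[key].isna().sum())
    # keep=False marks every member of a duplicate group, not just the repeats.
    duplicate_rows = int(df[key].dropna().duplicated(keep=False).sum())
    return {"nulls": nulls, "duplicate_rows": duplicate_rows}

# Hypothetical frame with one missing key and one duplicated key.
df = pd.DataFrame({
    "user_id": [1, 2, 2, None],
    "user_name": ["Alice", "Bob", "Bobby", "Carol"],
})

report = check_primary_key(df, "user_id")
print(report)
```

A report with both counts at zero corresponds to the two assertions above passing.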
Semantic Type Assignment#
Choose meaningful semantic types:
# IDs should use ID stype
table['user_id'].stype = 'ID'

# Text descriptions should use text stype
table['description'].stype = 'text'

# Limited categories should use categorical stype
table['status'].stype = 'categorical'

# Numerical measurements should use numerical stype
table['amount'].stype = 'numerical'

# Important: Validate metadata before proceeding:
print(table.metadata)
Graph Construction#
Design for meaningful relationships:
# Good: Meaningful entity relationships
graph = rfm.LocalGraph.from_data({
    'users': users_df,                # Entity table
    'transactions': transactions_df,  # Event table linked to users
    'products': products_df           # Entity table linked via transactions
})
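Meaningful relationships also imply that no table is left isolated, which is what the summary's "single connected component" item refers to. The sketch below checks this on a plain description of the schema using only the standard library; it does not call the Kumo API, and the table names and links (including the unlinked `audit_log`) are hypothetical.

```python
from collections import defaultdict, deque

def connected_components(tables, links):
    """Group tables into components, treating each link as undirected."""
    adjacency = defaultdict(set)
    for child, parent in links:
        adjacency[child].add(parent)
        adjacency[parent].add(child)

    seen = set()
    components = []
    for start in tables:
        if start in seen:
            continue
        # Breadth-first search from each table not yet visited.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

# Hypothetical schema: (child_table, parent_table) foreign-key links.
tables = ["users", "transactions", "products", "audit_log"]
links = [("transactions", "users"), ("transactions", "products")]

components = connected_components(tables, links)
if len(components) > 1:
    print(f"Graph is split into {len(components)} components: {components}")
```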
Ensure prediction-ready structure:
# Consider what PQL queries you want to run
# Example: "PREDICT COUNT(transactions.*, 0, 30, days) FOR users.user_id=1"
# Requires: users -> transactions relationship via user_id
# Requires: A timestamp column in the transactions table
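Both requirements listed above can be checked on the DataFrames themselves before a graph is built. The following sketch looks for transactions whose `user_id` has no matching user and confirms that the time column has a datetime dtype; the sample data, including the orphaned `user_id` 3, is invented for illustration.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "user_id": [1, 1, 3],  # user 3 does not exist in users
    "timestamp": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
})

# Requirement 1: every foreign key should point at an existing user.
orphans = transactions.loc[~transactions["user_id"].isin(users["user_id"])]

# Requirement 2: the event table needs a real datetime column.
has_time = pd.api.types.is_datetime64_any_dtype(transactions["timestamp"])

print(f"orphaned transactions: {len(orphans)}, datetime column: {has_time}")
```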
Common Data Modeling Patterns#
Entity-Event Pattern
import pandas as pd

# Users (entity) with transactions (events)
users = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'user_id': [1, 1, 2],
    'amount': [100, 50, 200],
    # Parse to datetime so the time column has a proper dtype
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})
Hierarchical Entities
# Multiple levels: company -> department -> employee
companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})
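A quick way to sanity-check a hierarchy like this is to follow the foreign keys from the lowest level to the top with plain pandas merges. The sketch below rebuilds the three frames so it runs on its own and counts employees per company; it is only a consistency check on the data, not part of any Kumo workflow.

```python
import pandas as pd

companies = pd.DataFrame({"company_id": [1, 2], "company_name": ["ACME", "TechCorp"]})
departments = pd.DataFrame({
    "dept_id": [1, 2],
    "company_id": [1, 1],
    "dept_name": ["Engineering", "Sales"],
})
employees = pd.DataFrame({
    "emp_id": [1, 2, 3],
    "dept_id": [1, 1, 2],
    "emp_name": ["Alice", "Bob", "Charlie"],
})

# Walk the chain employee -> department -> company via the foreign keys.
joined = employees.merge(departments, on="dept_id").merge(companies, on="company_id")

# Inner joins drop companies with no departments, so they do not appear here.
headcount = joined.groupby("company_name").size()
print(headcount)
```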
Junction Table for Many-to-Many
# Products and categories with many-to-many relationship
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Laptop', 'Mouse']})
categories = pd.DataFrame({'category_id': [1, 2], 'category_name': ['Electronics', 'Accessories']})
product_categories = pd.DataFrame({
    'product_category_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'category_id': [1, 2, 2]
})
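To confirm that a junction table encodes the intended relationships, it can be resolved back into readable names with two merges. This sketch repeats the frames above so that it is self-contained; the dictionary at the end is just one convenient way to inspect the result.

```python
import pandas as pd

products = pd.DataFrame({"product_id": [1, 2], "product_name": ["Laptop", "Mouse"]})
categories = pd.DataFrame({
    "category_id": [1, 2],
    "category_name": ["Electronics", "Accessories"],
})
product_categories = pd.DataFrame({
    "product_category_id": [1, 2, 3],
    "product_id": [1, 1, 2],
    "category_id": [1, 2, 2],
})

# Resolve both foreign keys of the junction table into names.
resolved = (
    product_categories
    .merge(products, on="product_id")
    .merge(categories, on="category_id")
)

# Collect the category names attached to each product.
by_product = {
    name: sorted(group["category_name"])
    for name, group in resolved.groupby("product_name")
}
print(by_product)
```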
Summary#
Following these best practices will help ensure your KumoRFM datasets are well-structured, validated, and optimized for performance:
Table Design:
One entity or event per table
Single time column per table
Unique primary keys with consistent naming
Junction tables for many-to-many relationships
Data Preparation:
Set proper pandas dtypes before creating tables
Use meaningful semantic types (ID, categorical, text, numerical)
Validate metadata and semantic types before proceeding
Graph Structure:
Design meaningful entity relationships
Consider PQL query requirements in your structure
Ensure the graph forms a single connected component
Test the graph with a validation workflow
These patterns will help you create robust, queryable datasets that work effectively with KumoRFM’s predictive capabilities.