Best Practices and Guidelines

This section outlines the best practices for preparing high-quality datasets for KumoRFM, incorporating key insights from LocalTable and LocalGraph design patterns.

Table Structure and Entity Design

  1. One entity or event per table:

    # Good: Separate tables for different entities
    users = df[['user_id', 'user_name', 'user_email']]
    transactions = df[['transaction_id', 'user_id', 'amount', 'timestamp']]
    
    # Avoid: Mixing entities in one table
    mixed_data = df[['user_id', 'user_name', 'transaction_id', 'amount']]
    
  2. Single time column per table:

    # Good: Split tables with multiple timestamps
    policies = df[['policy_id', 'user_id', 'start_date', 'policy_type']]
    claims = df[['claim_id', 'policy_id', 'claim_date', 'claim_amount']]
    
    # Avoid: Multiple time columns in one table
    mixed_times = df[['policy_id', 'start_date', 'claim_date', 'end_date']]
    
  3. Handle many-to-many relationships with junction tables:

    # Good: Junction table pattern
    users = df[['user_id', 'user_name']]
    skills = df[['skill_id', 'skill_name']]
    user_skills = df[['user_skill_id', 'user_id', 'skill_id', 'proficiency']]
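The "one entity or event per table" split from item 1 can be sketched with plain pandas. This is a minimal sketch, assuming a hypothetical denormalized export named `raw` in which user attributes repeat on every transaction row; the frame and its values are illustrative and not part of any KumoRFM API.

```python
import pandas as pd

# Hypothetical denormalized export: user attributes repeat on every transaction row.
raw = pd.DataFrame({
    'user_id': [1, 1, 2],
    'user_name': ['Alice', 'Alice', 'Bob'],
    'transaction_id': [10, 11, 12],
    'amount': [100.0, 50.0, 200.0],
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
})

# Entity table: one row per user, so drop the repeated rows.
users = raw[['user_id', 'user_name']].drop_duplicates().reset_index(drop=True)

# Event table: one row per transaction, with the foreign key and a single time column.
transactions = raw[['transaction_id', 'user_id', 'amount', 'timestamp']].copy()
```

If `drop_duplicates` still leaves more than one row per `user_id`, the user attributes conflict across rows and need to be reconciled before the entity table is used.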
    

Data Preparation

  1. Set dtypes at the pandas.DataFrame level before creating tables:

    # Good: Set proper pandas dtypes first
    df['user_id'] = df['user_id'].astype('int64')
    df['category'] = df['category'].astype('string')
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    table = rfm.LocalTable(df, "my_table")
    
  2. Use consistent naming conventions:

    # Good: Consistent foreign key naming
    users.user_id, transactions.user_id, profiles.user_id
    
    # Avoid: Inconsistent naming
    users.id, transactions.uid, profiles.customer_id
    
  3. Ensure unique primary keys:

    # Validate that the primary key is unique and has no missing values.
    # Otherwise, Kumo will automatically drop duplicates internally!
    assert df['user_id'].is_unique
    assert df['user_id'].notna().all()
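The assertions in item 3 stop at the first failure without showing which rows are at fault. As a rough sketch, a small helper can return the offending rows instead. `check_primary_key` is a hypothetical name introduced here for illustration; it is plain pandas and not part of KumoRFM.

```python
import pandas as pd

def check_primary_key(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return the rows whose `key` is missing or duplicated (empty if the key is valid)."""
    missing = df[key].isna()
    # keep=False flags every member of a duplicate group, not just the later ones.
    duplicated = df[key].duplicated(keep=False) & ~missing
    return df[missing | duplicated]

# Illustrative data: user_id 2 appears twice and one row has no user_id.
users = pd.DataFrame({
    'user_id': [1, 2, 2, None],
    'user_name': ['Alice', 'Bob', 'Bobby', 'Carol'],
})

bad_rows = check_primary_key(users, 'user_id')
print(bad_rows)
```

Inspecting `bad_rows` before building the table makes it an explicit decision which duplicate to keep, rather than leaving it to the automatic de-duplication mentioned above.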
    

Semantic Type Assignment

  1. Choose meaningful semantic types:

    # IDs should use ID stype
    table['user_id'].stype = 'ID'
    
    # Text descriptions should use text stype
    table['description'].stype = 'text'
    
    # Limited categories should use categorical stype
    table['status'].stype = 'categorical'
    
    # Numerical measurements should use numerical stype
    table['amount'].stype = 'numerical'
    
    # Important: Validate metadata before proceeding:
    print(table.metadata)
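Before assigning semantic types, it can help to look at how each column actually behaves. The sketch below summarizes dtype and cardinality per column with plain pandas. `summarize_columns` is a hypothetical helper and the example frame is made up; it does not reproduce how KumoRFM infers semantic types.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, distinct count, and distinct ratio to inform stype choices."""
    n_unique = df.nunique()
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_unique': n_unique,
        'unique_ratio': n_unique / len(df),
    })

# Illustrative data only.
df = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'status': ['active', 'active', 'churned', 'active'],
    'description': ['likes hiking', 'new signup', 'cancelled plan', 'power user'],
    'amount': [10.5, 20.0, 7.25, 99.9],
})

summary = summarize_columns(df)
print(summary)
```

As a rule of thumb, a string column with few distinct values (such as `status`) is a candidate for the categorical stype, while free-form strings that are almost all distinct (such as `description`) fit the text stype better. The final choice still depends on what the column means.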
    

Graph Construction

  1. Design for meaningful relationships:

    # Good: Meaningful entity relationships
    graph = rfm.LocalGraph.from_data({
        'users': users_df,           # Entity table
        'transactions': transactions_df,  # Event table linked to users
        'products': products_df      # Entity table linked via transactions
    })
    
  2. Ensure prediction-ready structure:

    # Consider what PQL queries you want to run
    # Example: "PREDICT COUNT(transactions.*, 0, 30, days) FOR users.user_id=1"
    # Requires: users -> transactions relationship via user_id
    # Requires: a time column on the transactions table
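The two requirements listed for the example query can be checked on the DataFrames before a graph is built. This is a minimal pandas sketch with illustrative data; the variable names `orphans` and `has_time_column` are introduced here and are not KumoRFM API.

```python
import pandas as pd

users = pd.DataFrame({'user_id': [1, 2]})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'user_id': [1, 1, 3],  # user 3 does not exist in `users`
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03']),
})

# Foreign-key check: transactions whose user_id has no matching row in users.
orphans = transactions[~transactions['user_id'].isin(users['user_id'])]

# Time-column check: the event table's timestamp should have a datetime dtype.
has_time_column = pd.api.types.is_datetime64_any_dtype(transactions['timestamp'])

print(orphans)
print(has_time_column)
```

Rows in `orphans` cannot be attributed to any user, so they are worth fixing or dropping before running a per-user prediction.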
    

Common Data Modeling Patterns

Entity-Event Pattern

# Users (entity) with transactions (events)
users = pd.DataFrame({'user_id': [1, 2], 'name': ['Alice', 'Bob']})
transactions = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'user_id': [1, 1, 2],
    'amount': [100, 50, 200],
    'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
})

Hierarchical Entities

# Multiple levels: company -> department -> employee
companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})
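To confirm that the hierarchy is wired correctly, each employee can be resolved up to its company by following the foreign keys. The sketch below reuses the three frames above and relies on the `validate` argument of `DataFrame.merge` to fail loudly if a parent key is not unique; `resolved` is a name introduced here for illustration.

```python
import pandas as pd

companies = pd.DataFrame({'company_id': [1, 2], 'company_name': ['ACME', 'TechCorp']})
departments = pd.DataFrame({'dept_id': [1, 2], 'company_id': [1, 1], 'dept_name': ['Engineering', 'Sales']})
employees = pd.DataFrame({'emp_id': [1, 2, 3], 'dept_id': [1, 1, 2], 'emp_name': ['Alice', 'Bob', 'Charlie']})

# Follow the foreign keys upward: employee -> department -> company.
# validate='many_to_one' raises if dept_id or company_id is duplicated in the parent table.
resolved = (
    employees
    .merge(departments, on='dept_id', how='left', validate='many_to_one')
    .merge(companies, on='company_id', how='left', validate='many_to_one')
)

print(resolved[['emp_name', 'dept_name', 'company_name']])
```

A missing `company_name` in `resolved` would indicate a broken link somewhere along the chain.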

Junction Table for Many-to-Many

# Products and categories with many-to-many relationship
products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Laptop', 'Mouse']})
categories = pd.DataFrame({'category_id': [1, 2], 'category_name': ['Electronics', 'Accessories']})
product_categories = pd.DataFrame({
    'product_category_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'category_id': [1, 2, 2]
})
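The junction table above can be read back into a per-product view by joining through it. This is a minimal pandas sketch using the same three frames; `pairs` and `categories_per_product` are names introduced here for illustration.

```python
import pandas as pd

products = pd.DataFrame({'product_id': [1, 2], 'product_name': ['Laptop', 'Mouse']})
categories = pd.DataFrame({'category_id': [1, 2], 'category_name': ['Electronics', 'Accessories']})
product_categories = pd.DataFrame({
    'product_category_id': [1, 2, 3],
    'product_id': [1, 1, 2],
    'category_id': [1, 2, 2]
})

# Each junction row points at exactly one product and one category.
pairs = (
    product_categories
    .merge(products, on='product_id', validate='many_to_one')
    .merge(categories, on='category_id', validate='many_to_one')
)

# Collect the category names for every product.
categories_per_product = (
    pairs.groupby('product_name')['category_name'].apply(sorted).to_dict()
)
print(categories_per_product)
```

Because the joins are inner joins, a junction row that references an unknown product or category is silently dropped. Comparing `len(pairs)` with `len(product_categories)` is a quick way to detect that.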

Summary

Following these best practices will help ensure your KumoRFM datasets are well-structured, validated, and optimized for performance:

Table Design:

  • One entity or event per table

  • Single time column per table

  • Unique primary keys with consistent naming

  • Junction tables for many-to-many relationships

Data Preparation:

  • Set proper pandas dtypes before creating tables

  • Use meaningful semantic types (ID, categorical, text, numerical)

  • Validate metadata and semantic types before proceeding

Graph Structure:

  • Design meaningful entity relationships

  • Consider PQL query requirements in your structure

  • Ensure the graph forms a single connected component

  • Test with validation workflow
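The connected-component point can be checked independently of any library. The sketch below treats tables as nodes and foreign-key links as undirected edges, then groups the tables with a breadth-first search. The `tables` and `links` lists are written by hand for illustration and are an assumption; they are not read from a LocalGraph object.

```python
from collections import defaultdict, deque

def connected_components(tables, links):
    """Group table names into connected components, treating each link as undirected."""
    neighbors = defaultdict(set)
    for src, dst in links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen, components = set(), []
    for start in tables:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# Hypothetical schema: `promotions` has no link to any other table.
tables = ['users', 'transactions', 'products', 'promotions']
links = [('transactions', 'users'), ('transactions', 'products')]  # (foreign-key table, referenced table)

components = connected_components(tables, links)
print(components)
```

More than one component means some table is isolated from the rest of the schema. In this example `promotions` would need a foreign key into another table, or should be left out of the graph.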

These patterns will help you create robust, queryable datasets that work effectively with KumoRFM’s predictive capabilities.