Understanding Graph Definitions

LocalGraph objects represent the relational structure between your tables. A good graph depends on well-prepared tables: correct dtypes, stypes, primary keys, and time columns in each individual table are essential.

Graph Structure and Metadata

A LocalGraph holds two types of information:

Tables: The collection of LocalTable objects containing your data
Edges: The relational metadata defining how tables connect through primary/foreign key relationships

The edges are the crucial metadata that transforms individual tables into a connected relational structure, enabling KumoRFM to understand and leverage relationships in your data.
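
Conceptually, each edge is a triple: the table holding the foreign key, the foreign key column, and the table whose primary key it references. The sketch below models that idea with plain Python dataclasses. `Edge` and `GraphSketch` are illustrative stand-ins invented for this example, not the kumoai classes, and the real LocalGraph may store this information differently.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """Illustrative stand-in: one foreign key pointing at another table's primary key."""
    src_table: str  # table holding the foreign key column
    fkey: str       # name of the foreign key column in src_table
    dst_table: str  # table whose primary key is referenced


@dataclass
class GraphSketch:
    """Illustrative stand-in for the two kinds of information a graph holds."""
    tables: dict = field(default_factory=dict)  # table name -> list of row dicts
    edges: list = field(default_factory=list)   # list of Edge

    def neighbors(self, table: str) -> set:
        """Tables directly connected to `table` by any edge, in either direction."""
        out = set()
        for e in self.edges:
            if e.src_table == table:
                out.add(e.dst_table)
            elif e.dst_table == table:
                out.add(e.src_table)
        return out


sketch = GraphSketch(
    tables={"users": [], "products": [], "transactions": []},
    edges=[
        Edge("transactions", "user_id", "users"),
        Edge("transactions", "product_id", "products"),
    ],
)
connected = sketch.neighbors("transactions")
```

Without the two edges, the three tables would be isolated. With them, `transactions` connects `users` and `products`.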

Graph Construction Methods

You can construct a LocalGraph in two ways:

import kumoai.experimental.rfm as rfm

# Method 1: Utility function (recommended for most cases)
# Automatically creates tables from data frames, infers metadata, and finds links
graph = rfm.LocalGraph.from_data({
    'users': df_users,
    'products': df_products,
    'transactions': df_transactions
})

# Method 2: Manual construction from pre-configured table objects
tables = [users_table, products_table, transactions_table]
graph = rfm.LocalGraph(tables=tables)
graph.infer_links()  # or define links manually

The utility function LocalGraph.from_data() is often preferred because it:

Creates LocalTable objects from your data frames
Calls infer_metadata() on each table (see Understanding Table Definitions)
Automatically infers links between tables based on column names

Link Inference and Naming Conventions

Link inference is based on column names, making consistent naming conventions crucial for automatic graph construction:

# For example, these column patterns create automatic links:
# transactions.user_id -> users.user_id (or users.id)
# orders.product_id -> products.product_id (or products.id)
# reviews.customer_id -> customers.customer_id (or customers.id)

# View inferred edges
for edge in graph.edges:
    print(f"{edge.src_table}.{edge.fkey} -> {edge.dst_table}")

Best practice: Use one foreign key name per relationship (e.g., always user_id, rather than a mix of user_id, uid, and customer_id for the same relationship).
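
To see why naming matters, here is a simplified name-matching heuristic in plain Python. It is only a sketch of the general idea. The function `infer_links_by_name` and its matching rules are assumptions made for illustration, not the algorithm that infer_links() actually uses.

```python
def infer_links_by_name(columns, primary_keys):
    """Guess (src_table, fkey, dst_table) links purely from column names.

    columns: table name -> list of column names
    primary_keys: table name -> primary key column name
    Self-references (a table linking to itself) are skipped in this sketch.
    """
    links = []
    for src, cols in columns.items():
        for col in cols:
            if col == primary_keys.get(src):
                continue  # a table's own primary key is not a foreign key
            for dst, pkey in primary_keys.items():
                if dst == src:
                    continue
                # Match the exact primary key name, or "<singular>_id" against "id"
                singular = dst[:-1] if dst.endswith("s") else dst
                if col == pkey or (pkey == "id" and col == f"{singular}_id"):
                    links.append((src, col, dst))
    return links


columns = {
    "users": ["user_id", "name"],
    "products": ["id", "title"],
    "transactions": ["transaction_id", "user_id", "product_id", "uid"],
}
primary_keys = {"users": "user_id", "products": "id", "transactions": "transaction_id"}

links = infer_links_by_name(columns, primary_keys)
```

In this toy input, `user_id` and `product_id` are matched, while the inconsistently named `uid` column is not. Columns like that are the ones that need a manual link.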

Manual Link Management

If you cannot rename columns to follow consistent patterns, you can add links manually:

# Add specific edge
graph.link(src_table="transactions", fkey="user_id", dst_table="users")

# Remove edge
graph.unlink(src_table="transactions", fkey="user_id", dst_table="users")
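
Before adding a link by hand, it can help to confirm that the foreign key values actually resolve to primary keys in the target table. The following is a minimal sketch in plain Python over lists of row dicts. The helper `unmatched_foreign_keys` and the `uid` column are hypothetical and are not part of kumoai.

```python
def unmatched_foreign_keys(src_rows, fkey, dst_rows, pkey):
    """Return the sorted foreign key values in src_rows with no matching primary key in dst_rows."""
    known = {row[pkey] for row in dst_rows}
    return sorted({
        row[fkey]
        for row in src_rows
        if row[fkey] is not None and row[fkey] not in known
    })


users = [{"user_id": 201}, {"user_id": 202}]
transactions = [
    {"transaction_id": 1, "uid": 201},
    {"transaction_id": 2, "uid": 202},
    {"transaction_id": 3, "uid": 999},  # dangling reference: no such user
]

missing = unmatched_foreign_keys(transactions, "uid", users, "user_id")
```

An empty result suggests the column is a plausible foreign key. A non-empty result points to rows worth cleaning up before the link is added.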

What Makes a Good Graph

A good LocalGraph should have:

Well-prepared tables: Each table has correct metadata and is split according to best practices (see Understanding Table Definitions)
Meaningful links: Edges represent real relationships between tables, not just technical connections
Well-defined entities: Each table represents either a single entity or a single event, not a mix of both
Prediction-ready structure: The graph structure limits which queries can be expressed in PQL (see Querying KumoRFM), so confirm that the PQL queries you want to run are possible with your graph

Working around the limitations

Multiple entities in a single table

Tables that mix data from multiple entities should be split for better graph structure. Think about each table as representing a single entity type or event. Here’s an example:

import pandas as pd

# Original table mixing transaction, bank, and user data
mixed_data = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'bank_id': [101, 102, 101],
    'user_id': [201, 202, 203],
    'transaction_amount': [100.0, 250.0, 75.0],
    'transaction_type': ['deposit', 'withdrawal', 'transfer'],
    'bank_name': ['Chase', 'Wells Fargo', 'Chase'],
    'bank_routing': ['123456', '789012', '123456'],
    'user_name': ['Alice', 'Bob', 'Charlie'],
    'user_email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']
})

# Split into three entity-focused tables

# 1. Transactions table (transaction-specific data)
transactions = mixed_data[['transaction_id', 'bank_id', 'user_id', 'transaction_amount', 'transaction_type']].copy()

# 2. Banks table (bank-specific data)
banks = mixed_data[['bank_id', 'bank_name', 'bank_routing']].drop_duplicates()

# 3. Users table (user-specific data)
users = mixed_data[['user_id', 'user_name', 'user_email']].drop_duplicates()

# Create graph with proper entity relationships
graph = rfm.LocalGraph.from_data({
    'transactions': transactions,
    'banks': banks,
    'users': users
})
# Result: transactions.bank_id -> banks.bank_id and transactions.user_id -> users.user_id
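
Dropping duplicate rows only yields a unique key if every row for a given entity carries identical attributes. If the same bank_id appears with two spellings of the bank name, both rows survive and the key is no longer unique. The sketch below, in plain Python with a hypothetical helper `conflicting_keys`, shows one way to detect such keys before building the graph.

```python
from collections import defaultdict


def conflicting_keys(rows, key, attrs):
    """Return keys that appear with more than one distinct combination of attrs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(tuple(row[a] for a in attrs))
    return sorted(k for k, combos in seen.items() if len(combos) > 1)


rows = [
    {"bank_id": 101, "bank_name": "Chase", "bank_routing": "123456"},
    {"bank_id": 102, "bank_name": "Wells Fargo", "bank_routing": "789012"},
    {"bank_id": 101, "bank_name": "Chase Bank", "bank_routing": "123456"},  # spelling drift
]

conflicts = conflicting_keys(rows, "bank_id", ["bank_name", "bank_routing"])
```

Any key reported here needs a decision about which attribute values to keep before the table can serve as an entity table with a unique primary key.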

Many-to-many relationships

KumoRFM only supports primary-foreign key relationships (one-to-many). Many-to-many relationships require a junction table to break them into two one-to-many relationships:

# Problem: Table with many-to-many data stored as lists/comma-separated values
user_skills_combined = pd.DataFrame({
    'user_id': [1, 2, 3],
    'user_name': ['Alice', 'Bob', 'Charlie'],
    'skills': [['Python', 'SQL'], ['SQL', 'Machine Learning'], ['Python', 'Machine Learning']],
    'proficiency_levels': [['expert', 'beginner'], ['intermediate', 'advanced'], ['expert', 'expert']]
})

# This structure cannot create proper foreign key relationships in KumoRFM

# Solution: Normalize into three tables with junction table

# 1. Users table (entity table)
users = user_skills_combined[['user_id', 'user_name']].copy()

# 2. Skills table (entity table)
# Sort the distinct skills so that skill_id assignment is reproducible across runs
unique_skills = sorted({
    skill
    for skill_list in user_skills_combined['skills']
    for skill in skill_list
})

skills = pd.DataFrame({
    'skill_id': range(1, len(unique_skills) + 1),
    'skill_name': unique_skills
})

# 3. Junction table (breaks many-to-many into two one-to-many)
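# Map each skill name to its id once, instead of filtering the DataFrame per row
skill_id_by_name = dict(zip(skills['skill_name'], skills['skill_id']))

user_skills_records = []
for _, row in user_skills_combined.iterrows():
    for skill, proficiency in zip(row['skills'], row['proficiency_levels']):
        skill_id = skill_id_by_name[skill]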
        user_skills_records.append({
            'user_skill_id': len(user_skills_records) + 1,
            'user_id': row['user_id'],
            'skill_id': skill_id,
            'proficiency_level': proficiency
        })

user_skills = pd.DataFrame(user_skills_records)

# Create graph with proper one-to-many relationships
graph = rfm.LocalGraph.from_data({
    'users': users,
    'skills': skills,
    'user_skills': user_skills
})
# Result: user_skills.user_id -> users.user_id and user_skills.skill_id -> skills.skill_id

This normalization allows proper foreign key relationships and stores relationship-specific attributes (like proficiency_level) in the junction table.
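
To make the "two one-to-many relationships" concrete, the sketch below walks a junction table in plain Python: first from skills to junction rows, then from junction rows to users. The row values and skill ids are illustrative, and `users_sharing_skill_with` is a hypothetical helper, not part of kumoai.

```python
from collections import defaultdict

# Junction rows: each one points at exactly one user and exactly one skill.
user_skills = [
    {"user_id": 1, "skill_id": 2, "proficiency_level": "expert"},
    {"user_id": 1, "skill_id": 3, "proficiency_level": "beginner"},
    {"user_id": 2, "skill_id": 3, "proficiency_level": "intermediate"},
    {"user_id": 2, "skill_id": 1, "proficiency_level": "advanced"},
    {"user_id": 3, "skill_id": 2, "proficiency_level": "expert"},
    {"user_id": 3, "skill_id": 1, "proficiency_level": "expert"},
]

# Hop 1: skill -> junction rows (one-to-many), collected as user ids per skill.
users_by_skill = defaultdict(set)
for row in user_skills:
    users_by_skill[row["skill_id"]].add(row["user_id"])


def users_sharing_skill_with(user_id):
    """Hop 2: follow the junction rows back to users who share at least one skill."""
    shared = set()
    for members in users_by_skill.values():
        if user_id in members:
            shared |= members
    shared.discard(user_id)
    return shared


peers = users_sharing_skill_with(1)
```

Each hop follows a single foreign key, which is the only kind of relationship the graph needs to represent.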
