Understanding Table Definitions

A LocalTable object wraps a pandas.DataFrame with metadata about its columns, primary key, and time column. The semantic types are required metadata, while the primary key and time column are optional. Each table can have at most one primary key and at most one time column, but it can contain many foreign keys (columns that reference the primary keys of other tables).
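To make these roles concrete, here is a minimal pandas-only sketch of two DataFrames as they might look before being wrapped. The table names, column names, and values are hypothetical and chosen purely for illustration; no Kumo API is called.

```python
import pandas as pd

# Hypothetical example data; names and values are illustrative only.
df_users = pd.DataFrame({
    "user_id": [1, 2, 3],   # primary key of "users"
    "age": [34, 27, 45],
})

df_transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12, 13],  # primary key of "transactions"
    "user_id": [1, 1, 3, 2],             # foreign key referencing users.user_id
    "amount": [9.99, 24.50, 5.00, 12.75],
    # The single time column of "transactions"
    "timestamp": pd.to_datetime(
        ["2024-01-05", "2024-01-09", "2024-02-01", "2024-02-14"]
    ),
})

# A primary key identifies each row exactly once.
pk_is_unique = bool(df_transactions["transaction_id"].is_unique)

# Every foreign key value should point at an existing primary key value.
fk_is_valid = bool(df_transactions["user_id"].isin(df_users["user_id"]).all())
```

Here "transactions" has one primary key, one time column, and one foreign key, while "users" has only a primary key.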

Dtype and Metadata Inference

When creating a LocalTable, column dtypes and stypes are automatically inferred from the underlying data based on the pandas data type and heuristics. The key metadata that needs to be properly set includes:

  • Stypes: Semantic types that determine model processing behavior

  • Primary key: Unique identifier for the table (optional but recommended)

  • Time column: Temporal column for time-based operations (optional)

The LocalTable.infer_metadata() method automates much of this process:

  • Primary key detection: Uses heuristics to suggest potential primary keys based on column names, uniqueness, and data patterns

  • Time column detection: Identifies columns with temporal data types or time-related naming patterns

import kumoai.experimental.rfm as rfm
from kumoapi.typing import Stype  # assumed import path for the Stype enum

# df is an existing pandas.DataFrame holding the users data

# Automatic inference (recommended for initial setup)
table = rfm.LocalTable(df, "users").infer_metadata()

# Manual override of inferred metadata when needed
table['user_id'].stype = Stype.ID  # Override inferred stype
table.primary_key = "user_id"      # Override inferred primary key
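The detection heuristics described above can be approximated in plain pandas. The sketch below is not the library's actual implementation: guess_primary_key and guess_time_column are hypothetical helpers, the "_id" naming rule is an assumption, and time detection here looks only at the dtype, not at column names.

```python
from typing import Optional

import pandas as pd


def guess_primary_key(df: pd.DataFrame) -> Optional[str]:
    """Return the first column that looks like a primary key, or None.

    Assumed rule: the name is "id" or ends in "_id", and the values
    are non-null and unique.
    """
    for name in df.columns:
        lowered = str(name).lower()
        looks_like_id = lowered == "id" or lowered.endswith("_id")
        col = df[name]
        if looks_like_id and col.notna().all() and col.is_unique:
            return name
    return None


def guess_time_column(df: pd.DataFrame) -> Optional[str]:
    """Return the first datetime-typed column, or None."""
    for name in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[name]):
            return name
    return None


# Hypothetical example data
df_example = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_date": pd.to_datetime(["2023-01-01", "2023-03-15", "2023-07-30"]),
    "age": [34, 27, 45],
})

suggested_pk = guess_primary_key(df_example)
suggested_time = guess_time_column(df_example)
```

Because any such heuristic can guess wrong (for example, a unique foreign key column that happens to end in "_id"), the inferred metadata is worth reviewing and overriding where needed.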

Basic Table Creation

import pandas as pd
import kumoai.experimental.rfm as rfm

# df_users and df_transactions are existing pandas DataFrames

# Create table with automatic metadata inference
users_table = rfm.LocalTable(
    df=df_users,
    table_name="users"
).infer_metadata()

# Create table with explicit metadata
transactions_table = rfm.LocalTable(
    df=df_transactions,
    table_name="transactions",
    primary_key="transaction_id",
    time_column="timestamp"
)

Inspecting Table Metadata

# Access table metadata
print(f"Primary key: {users_table.primary_key}")
print(f"Time column: {users_table.time_column}")
print(f"Columns: {[col.name for col in users_table.columns]}")

# View metadata summary
metadata_df = users_table.metadata
print(metadata_df)

# Check column information
print(f"Column: {users_table['age'].name}")
print(f"Dtype: {users_table['age'].dtype}")
print(f"Stype: {users_table['age'].stype}")

What Makes a Good Table

A good LocalTable should have:

  • Clean dtypes: Set proper pandas dtypes at the DataFrame level before table creation

  • Meaningful stypes: ID columns use Stype.ID, categorical data uses Stype.categorical, text uses Stype.text, etc.

  • Unique primary key: Non-null, no duplicates, uniquely identifies each row, preferably stored as an integer

  • Consistent naming: Foreign key columns use the same name as the primary key they reference

  • Single time column: One temporal column when temporal data is available
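Several of these criteria can be checked on the DataFrame before it is wrapped in a LocalTable. The following is a minimal sketch using only pandas; check_table is a hypothetical helper, not part of the Kumo API, and the example data is made up to show typical problems.

```python
from typing import List, Optional

import pandas as pd


def check_table(
    df: pd.DataFrame,
    primary_key: Optional[str] = None,
    time_column: Optional[str] = None,
) -> List[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if primary_key is not None:
        pk = df[primary_key]
        if pk.isna().any():
            problems.append(f"primary key '{primary_key}' contains nulls")
        if pk.duplicated().any():
            problems.append(f"primary key '{primary_key}' contains duplicates")
        if not pd.api.types.is_integer_dtype(pk):
            problems.append(f"primary key '{primary_key}' is not an integer dtype")
    if time_column is not None:
        if not pd.api.types.is_datetime64_any_dtype(df[time_column]):
            problems.append(f"time column '{time_column}' is not a datetime dtype")
    return problems


# Hypothetical raw data with two issues: a duplicate key and string timestamps.
raw = pd.DataFrame({
    "transaction_id": [10, 11, 11],
    "timestamp": ["2024-01-05", "2024-01-09", "2024-02-01"],
})
issues_before = check_table(raw, "transaction_id", "timestamp")

# Fix duplicates and dtypes at the DataFrame level before creating the table.
clean = raw.drop_duplicates(subset="transaction_id").copy()
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
issues_after = check_table(clean, "transaction_id", "timestamp")
```

Running a check like this first keeps data problems separate from metadata problems, so anything the inference step gets wrong is easier to spot.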