# Evaluation
KumoRFM provides an evaluation mode that automatically measures
prediction quality by performing a train/test split on context examples and
computing relevant metrics.
## Running an Evaluation
Use `KumoRFM.evaluate()` with the same PQL syntax as `KumoRFM.predict()`:

```python
metrics = model.evaluate(
    "PREDICT COUNT(orders.*, 0, 30, days) > 0 FOR users.user_id=1",
    run_mode="FAST",
)
print(metrics)
```
The evaluation collects context examples, splits them into in-context (training) and test sets, generates predictions for the test set, and computes metrics comparing predictions to actual outcomes.
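The four steps above can be sketched in plain Python. This is an illustrative stand-in, not KumoRFM's internals: the example data and the majority-vote "model" are assumptions made purely to show the split-predict-score procedure.

```python
from datetime import date

# Hypothetical labeled context examples: (anchor_time, actual_label).
examples = [
    (date(2024, 1, 1), 1), (date(2024, 1, 8), 0),
    (date(2024, 1, 15), 1), (date(2024, 1, 22), 1),
    (date(2024, 1, 29), 0), (date(2024, 2, 5), 1),
]

# 1. Split: earlier examples serve as in-context "training" data,
#    later ones are held out as the test set.
split = len(examples) * 2 // 3
context, test = examples[:split], examples[split:]

# 2. Predict each test label. A real model conditions on the context;
#    a majority-vote baseline stands in for it here.
majority = round(sum(label for _, label in context) / len(context))
preds = [majority for _ in test]

# 3. Compare predictions to actual outcomes with a task metric.
accuracy = sum(p == y for p, (_, y) in zip(preds, test)) / len(test)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.50
```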
You can also use the `EVALUATE` keyword in the query string directly:

```python
metrics = model.evaluate(
    "EVALUATE PREDICT COUNT(orders.*, 0, 30, days) FOR users.user_id=1"
)
```
## Available Metrics
The metrics returned depend on the detected task type:
| Task Type | Supported Metrics |
|---|---|
| Binary Classification | `accuracy`, `precision`, `recall`, `f1`, `auc` |
| Multi-Class Classification | `accuracy`, `precision`, `recall`, `f1` |
| Regression / Forecasting | `mae`, `mape`, `mse`, `rmse`, `smape`, `r2` |
You can specify which metrics to compute:
```python
metrics = model.evaluate(
    "PREDICT SUM(orders.price, 0, 30, days) FOR items.item_id=42",
    metrics=["mae", "rmse", "r2"],
)
```
## Evaluation Parameters
The `KumoRFM.evaluate()` method accepts the same parameters as
`KumoRFM.predict()`, plus:

- `metrics`: A list of metric names to compute. If not specified, all applicable metrics for the task type are computed.
The `run_mode`, `anchor_time`, `num_hops`, and other parameters work
identically to `KumoRFM.predict()`. See Configuration for details
on run modes.
## Evaluation with `TaskTable`
For advanced use cases, you can construct a `TaskTable` explicitly and
use `KumoRFM.evaluate_task()`:

```python
task = TaskTable(
    task_type="binary_classification",
    context_df=context_dataframe,
    pred_df=prediction_dataframe,
    entity_table_name="users",
    entity_column="user_id",
    target_column="target",
    time_column="timestamp",
)
metrics = model.evaluate_task(task)
```
This gives you full control over the train/test split and context construction.
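As a sketch of what such inputs could look like, the pandas DataFrames below mirror the entity, time, and target column names used in the `TaskTable` example above. The concrete values and dates are invented for illustration and carry no fixed schema beyond those column names.

```python
import pandas as pd

# Hypothetical labeled context: entities observed at earlier anchor
# times, each with a known binary outcome.
context_dataframe = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-02-01", "2024-02-01"]),
    "target": [1, 0, 0, 1],
})

# Held-out entities to score: same columns, later anchor time, with
# ground-truth targets available for metric computation.
prediction_dataframe = pd.DataFrame({
    "user_id": [5, 6],
    "timestamp": pd.to_datetime(["2024-03-01", "2024-03-01"]),
    "target": [1, 0],
})
```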
## Interpreting Results
The evaluation returns a dictionary mapping metric names to values:
```python
>>> metrics = model.evaluate(query)
>>> print(metrics)
{'mae': 12.5, 'rmse': 15.3, 'r2': 0.82}
```
Higher values are better for `r2`, `accuracy`, `precision`, `recall`,
`f1`, and `auc`. Lower values are better for `mae`, `mape`, `mse`,
`rmse`, and `smape`.
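When comparing two evaluation runs, the direction rules above can be encoded directly. This is a small helper sketch, not part of the KumoRFM API; the `improved` function and the example metric values are assumptions for illustration.

```python
# Metric directions, as documented above.
HIGHER_IS_BETTER = {"r2", "accuracy", "precision", "recall", "f1", "auc"}
LOWER_IS_BETTER = {"mae", "mape", "mse", "rmse", "smape"}

def improved(metric: str, old: float, new: float) -> bool:
    """Return True if `new` beats `old` for the given metric."""
    if metric in HIGHER_IS_BETTER:
        return new > old
    if metric in LOWER_IS_BETTER:
        return new < old
    raise ValueError(f"unknown metric: {metric}")

# Hypothetical results from two evaluation runs.
baseline = {"mae": 12.5, "rmse": 15.3, "r2": 0.82}
candidate = {"mae": 11.0, "rmse": 16.0, "r2": 0.85}
for name in baseline:
    verdict = "improved" if improved(name, baseline[name], candidate[name]) else "regressed"
    print(f"{name}: {verdict}")
```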