kumoai.pquery.TrainingTable#

class kumoai.pquery.TrainingTable[source]#

Bases: object

A training table in the Kumo platform. A training table can be initialized from the job ID of a completed training table generation job.

import kumoai

# Create a Training Table from a training table generation job. Note
# that the job ID passed here must be in a completed state:
training_table = kumoai.TrainingTable("gen-traintable-job-...")

# Read the training table as a Pandas DataFrame:
training_df = training_table.data_df()

# Get URLs to download the training table:
training_download_urls = training_table.data_urls()

# Add a weight column to the training table; see
# `kumo-sdk.examples.datasets.weighted_train_table.py`
# for a more detailed example.
# 1. Export the training table:
connector = kumoai.S3Connector("s3_path")
training_table.export(TrainingTableExportConfig(
    output_types={'training_table'},
    output_connector=connector,
    output_table_name="<any_name>"))
# 2. Assume the weight column was added to the exported table
# and the result saved to the same S3 path as "<mod_table>":
training_table.update(SourceTable("<mod_table>", connector),
                      TrainingTableSpec(weight_col="weight"))
Parameters:

job_id (str) – ID of the training table generation job which generated this training table.

__init__(job_id)[source]#
data_urls()[source]#

Returns a list of URLs that can be used to view generated training table data. The list will contain more than one element if the table is partitioned; paths will be relative to the location of the Kumo data plane.

Return type:

List[str]

data_df()[source]#

Returns a DataFrame object representing the generated training data.

Return type:

DataFrame

Warning

This method will load the full training table into memory as a DataFrame object. If you are working on a machine with limited resources, please use data_urls() instead to download the data and perform analysis per-partition.
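A per-partition pass over the generated data might look like the following sketch. The helper name, the `reader` callable, and the row-count aggregation are illustrative (not part of the kumoai API); the partition file format, and hence the right reader, depends on your deployment:

```python
import pandas as pd

def count_rows_per_partition(urls, reader=pd.read_parquet):
    """Process the training table one partition at a time so that only a
    single partition is ever resident in memory. ``reader`` is any callable
    that loads one partition path/URL into a DataFrame."""
    counts = {}
    for url in urls:
        part = reader(url)       # load a single partition
        counts[url] = len(part)  # per-partition work goes here
    return counts

# Usage (sketch):
#   urls = training_table.data_urls()
#   counts = count_rows_per_partition(urls)
```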

export(output_config, non_blocking=True)[source]#

Export the training table to the connector specified in the output config. Use the exported table to add a weight column, then call update() to register the modified training table.

Parameters:
  • output_config (TrainingTableExportConfig) – The output configuration to write the training table.

  • non_blocking (bool) – If True, the method will return a future object ArtifactExportJob representing the export job. If False, the method will block until the export job is complete and return ArtifactExportResult.

Return type:

Union[ArtifactExportJob, ArtifactExportResult]
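Because the return type depends on non_blocking, calling code that wants the final result in either mode can normalize it. This is only a sketch: it assumes (hypothetically) that the ArtifactExportJob future exposes a blocking `result()`-style accessor; check the ArtifactExportJob reference for the actual method name:

```python
def resolve_export(export_return):
    """Normalize export()'s return value. Assumes, hypothetically, that the
    non-blocking ArtifactExportJob future exposes a result() method that
    blocks until completion; an ArtifactExportResult is returned as-is."""
    if hasattr(export_return, "result"):
        return export_return.result()  # hypothetical accessor on the job future
    return export_return
```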

validate_custom_table(source_table_type, train_table_mod)[source]#

Validates the modified training table.

Parameters:
  • source_table_type (Union[S3SourceTable, SnowflakeSourceTable, DatabricksSourceTable, BigQuerySourceTable, GlueSourceTable]) – The source table to be used as the modified training table.

  • train_table_mod (TrainingTableSpec) – The modification specification.

Raises:

ValueError – If the modified training table is invalid.

Return type:

None

update(source_table, train_table_mod, validate=True)[source]#

Sets the source_table as the modified training table.

Note

The only allowed modification is the addition of a weight column. Any other modification might lead to unintended errors during training. Furthermore, negative or NA weight values are not supported.

The custom training table is ingested during trainer.fit() and is used as the training table.

Parameters:
  • source_table (SourceTable) – The source table to be used as the modified training table.

  • train_table_mod (TrainingTableSpec) – The modification specification.

  • validate (bool) – Whether to validate the modified training table. This can be slow for large tables.

Return type:

Self
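The offline step between export() and update() is adding the weight column to the exported table. A minimal pandas sketch of that step is below; the helper name and `weight_fn` are illustrative, and the non-negative/non-NA check mirrors the constraint stated in the note above:

```python
import pandas as pd

def add_weight_column(train_df, weight_fn):
    """Add a 'weight' column for use with TrainingTableSpec(weight_col="weight").
    Weights must be non-negative and non-NA, so invalid values are rejected
    here rather than surfacing later during training. ``weight_fn`` maps each
    row of the exported training table to a float weight."""
    out = train_df.copy()
    out["weight"] = train_df.apply(weight_fn, axis=1).astype(float)
    if out["weight"].isna().any() or (out["weight"] < 0).any():
        raise ValueError("weights must be non-negative and non-NA")
    return out
```

After writing the result back to the connector's path, pass it to update() via SourceTable together with TrainingTableSpec(weight_col="weight"), as in the class-level example.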