kumoai.graph.Table#

class kumoai.graph.Table[source]#

Bases: object

A table represents metadata information for a table in a Kumo Graph.

Whereas a SourceTable is simply a reference to a table behind a backing Connector, a Table fully specifies the metadata (including selected source columns, column data types and semantic types, and relational constraint information) necessary to train a PredictiveQuery on a graph of tables. A Table can be constructed either explicitly or with the convenience method from_source_table().

import kumoai

# Define connector to source data:
connector = kumoai.S3Connector('s3://...')

# Create table using `from_source_table`:
customer = kumoai.Table.from_source_table(
    source_table=connector['customer'],
    primary_key='CustomerID',
)

# Create a table by constructing it directly:
customer = kumoai.Table(
    source_table=connector['customer'],
    columns=[kumoai.Column(name='CustomerID', dtype='string', stype='ID')],
    primary_key='CustomerID',
)

# Infer any missing metadata in the table, from source table
# properties:
print("Current metadata: ", customer.metadata)
customer.infer_metadata()

# Validate the table configuration, for use in Kumo downstream models:
customer.validate(verbose=True)

# Fetch statistics from a snapshot of this table (this method will
# take a table snapshot, and as a result may have high latency):
customer.get_stats(wait_for="minimal")
Parameters:
  • source_table (SourceTable) – The source table this Kumo table is created from.

  • columns (Optional[List[Union[SourceColumn, Column]]]) – The selected columns of the source table that are part of this Kumo table. Note that each column must specify its data type and semantic type; see the Column documentation for more information.

  • primary_key (Optional[str]) – The primary key of the table, if present. The primary key must exist in the columns argument.

  • time_column (Optional[str]) – The time column of the table, if present. The time column must exist in the columns argument.

  • end_time_column (Optional[str]) – The end time column of the table, if present. The end time column must exist in the columns argument.

__init__(source_table, columns=None, primary_key=None, time_column=None, end_time_column=None)[source]#
static from_source_table(source_table, column_names=None, primary_key=None, time_column=None, end_time_column=None)[source]#

Creates a Kumo Table from a source table. If no column names are specified, all source columns are included by default.

Parameters:
  • source_table (SourceTable) – The SourceTable object that this table is constructed on.

  • column_names (Optional[List[str]]) – A list of columns to include from the source table; if not specified, all columns are included by default.

  • primary_key (Optional[str]) – The name of the primary key of this table, if it exists.

  • time_column (Optional[str]) – The name of the time column of this table, if it exists.

  • end_time_column (Optional[str]) – The name of the end time column of this table, if it exists.

Return type:

Table

print_definition()[source]#

Prints the full definition for this table; this definition can be copied-and-pasted verbatim to re-create this table.

Return type:

None

has_column(name)[source]#

Returns True if this table has a column named name; False otherwise.

Return type:

bool

column(name)[source]#

Returns the column named name in this table, or raises a KeyError if no such column is present.

Raises:

KeyError – if name is not present in this table.

Return type:

Column
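Because column() raises a KeyError for an unknown name, a lookup that should tolerate missing columns can check has_column() first. The following is a minimal sketch of that pattern, not part of kumoai: find_column and the _StubTable stand-in are hypothetical names, used only so that the snippet runs without a Kumo connection.

```python
from typing import Any, Dict, Optional


def find_column(table: Any, name: str) -> Optional[Any]:
    """Return table.column(name), or None when the column is absent."""
    if table.has_column(name):
        return table.column(name)
    return None


class _StubTable:
    """Hypothetical stand-in exposing only has_column() and column()."""

    def __init__(self, columns: Dict[str, Any]) -> None:
        self._columns = columns

    def has_column(self, name: str) -> bool:
        return name in self._columns

    def column(self, name: str) -> Any:
        # Raises KeyError for unknown names, mirroring the documented behavior.
        return self._columns[name]


stub = _StubTable({"CustomerID": "id-column"})
found = find_column(stub, "CustomerID")
missing = find_column(stub, "Age")
```

With a real Table, the same helper would be called as find_column(customer, 'CustomerID').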

property columns: List[Column]#

Returns a list of Column objects that represent the columns in this table.

add_column(*args, **kwargs)[source]#

Adds a Column to this table. A column can either be added by directly specifying its configuration in this call, or by creating a Column object and passing it as an argument.

Return type:

None

Example

>>> import kumoai
>>> table = kumoai.Table(source_table=...)  
>>> table.add_column(name='col1', dtype='string')  
>>> table.add_column(kumoai.Column('col2', 'int'))  
remove_column(name)[source]#

Removes a Column from this table.

Raises:

KeyError – if name is not present in this table.

Return type:

Self

has_primary_key()[source]#

Returns True if this table has a primary key; False otherwise.

Return type:

bool

property primary_key: Optional[Column]#

The primary key column of this table.

The getter returns the primary key column of this table, or None if no such primary key is present.

The setter sets a column as a primary key on this table, and raises a ValueError if the primary key has a non-ID semantic type.

has_time_column()[source]#

Returns True if this table has a time column; False otherwise.

Return type:

bool

property time_column: Optional[Column]#

The time column of this table.

The getter returns the time column of this table, or None if no such time column is present.

The setter sets a column as a time column on this table, and raises a ValueError if the time column is the same as the end time column, or has a non-timestamp semantic type.

has_end_time_column()[source]#

Returns True if this table has an end time column; False otherwise.

Return type:

bool

property end_time_column: Optional[Column]#

The end time column of this table.

The getter returns the end time column of this table, or None if no such column is present.

The setter sets a column as the end time column on this table, and raises a ValueError if the end time column is the same as the time column, or has a non-timestamp semantic type.

property metadata: DataFrame#

Returns a DataFrame object containing Kumo metadata information about the columns in this table.

The returned dataframe has columns name, dtype, stype, is_primary_key, is_time_column, and is_end_time_column, which provide an aggregate view of the properties of the columns of this table.

Example

>>> import kumoai
>>> table = kumoai.Table(source_table=...)  
>>> table.add_column(name='CustomerID', dtype='float64', stype='ID')  
>>> table.metadata  
    name        dtype       stype    is_time_column  is_end_time_column
0   CustomerID  float64     ID       False           False
infer_metadata()[source]#

Infers all metadata for this table’s specified columns, including the column data types, semantic types, timestamp formats, primary keys, and time/end-time columns.

Return type:

Self

Note

This method in-place modifies the Table object.

Note

By default, inferred information does not override manually specified information.

validate(verbose=True)[source]#

Validates a Table to ensure that all relevant metadata is specified for a table to be used in a downstream Graph and PredictiveQuery.

Concretely, validation ensures that all columns have valid data and semantic types, with respect to the table’s source data. For example, if a text column is assigned a dtype of "int", this method will raise an exception detailing the mismatch. Similarly, if a column cannot be cast from its source data type to the specified data type (e.g. "int" to "binary"), this method will raise an exception.

Warning

Data type validation is performed on a sample of table data. A valid response may not indicate your entire data source is configured correctly.

Parameters:

verbose (bool) – Whether to log non-error output of this validation.

Return type:

Self

Example

>>> import kumoai
>>> table = kumoai.Table(...)  
>>> table.validate()  
Raises:

ValueError – if validation fails.
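Since validate() signals failure by raising a ValueError, callers that check several tables in a loop may prefer a result value over an exception. The wrapper below is a sketch under that assumption; check_table and _StubTable are hypothetical names that are not part of kumoai, and the stub exists only so that the snippet runs on its own.

```python
from typing import Any, Optional, Tuple


def check_table(table: Any) -> Tuple[bool, Optional[str]]:
    """Call table.validate() and return (ok, error_message)."""
    try:
        table.validate(verbose=False)
    except ValueError as exc:
        # validate() is documented to raise ValueError when validation fails.
        return False, str(exc)
    return True, None


class _StubTable:
    """Hypothetical stand-in for a Table; fails when given an error message."""

    def __init__(self, error: Optional[str] = None) -> None:
        self._error = error

    def validate(self, verbose: bool = True) -> "_StubTable":
        if self._error is not None:
            raise ValueError(self._error)
        return self


ok_result = check_table(_StubTable())
bad_result = check_table(_StubTable("dtype mismatch"))
```

Keep in mind that a passing result only reflects the sampled data, as the warning above notes.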

property snapshot_id: Optional[TableSnapshotID]#

Returns the snapshot ID of this table’s snapshot, if a snapshot has been taken. Returns None otherwise.

Warning

This property currently only returns a snapshot ID if a snapshot has been taken in this session.

snapshot(*, force_refresh=False, non_blocking=False)[source]#

Takes a snapshot of this table’s underlying data, and returns a unique identifier for this snapshot.

The snapshot functionality allows one to freeze a table in time, so that underlying data changes do not require Kumo to re-process the data. This allows for fast iterative machine learning model development, on a consistent set of input data.

Warning

Please note that snapshots are intended to freeze tables in time, and not to allow for “time-traveling” to an earlier version of data with a prior snapshot. In particular, this means that a table can only have one version of a snapshot, which represents the latest snapshot taken for that table.

Note

If you are using Kumo as a Snowpark Container Services native application, please note that snapshot is a no-op for all non-view tables.

Parameters:
  • force_refresh (bool) – Indicates whether a snapshot should be taken, if one already exists in Kumo. If False, a previously existing snapshot may be re-used. If True, a new snapshot is always taken.

  • non_blocking (bool) – Whether this operation should return immediately after creating the snapshot, or await completion of the snapshot. If True, the snapshot will proceed in the background, and will be used for any downstream job.

Return type:

TableSnapshotID
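One way to combine snapshot() with the snapshot_id property is to take a snapshot only when none has been recorded in the current session. This is a sketch of that pattern rather than documented kumoai behavior; ensure_snapshot and _StubTable are hypothetical names, and the stub returns a placeholder string instead of a real TableSnapshotID.

```python
from typing import Any


def ensure_snapshot(table: Any) -> Any:
    """Return the session's existing snapshot ID, or take a blocking snapshot."""
    existing = table.snapshot_id
    if existing is not None:
        return existing
    # No snapshot recorded in this session: take one and wait for completion.
    return table.snapshot(force_refresh=False, non_blocking=False)


class _StubTable:
    """Hypothetical stand-in that counts how often snapshot() is called."""

    def __init__(self) -> None:
        self.snapshot_id = None
        self.calls = 0

    def snapshot(self, *, force_refresh: bool = False,
                 non_blocking: bool = False) -> str:
        self.calls += 1
        self.snapshot_id = "snapshot-stub"
        return self.snapshot_id


stub = _StubTable()
first = ensure_snapshot(stub)
second = ensure_snapshot(stub)
```

The second call reuses the recorded ID, so the stub's snapshot() runs only once.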

get_stats(wait_for=None)[source]#

Returns all currently computed statistics on the latest snapshot of this table. If a snapshot on this table has not been taken, this method will take a snapshot.

Note

Table statistics are computed in multiple stages after ingestion is complete. These stages are called minimal and full; minimal statistics are always computed before full statistics.

Parameters:

wait_for (Optional[str]) – Whether this operation should block until statistics are available. This argument can take one of three values: None, which indicates that the method should return immediately with whatever statistics are present; "minimal", which indicates that the method should return when the minimum, maximum, and fraction-of-NA-values statistics are present; or "full", which indicates that the method should return when all computed statistics are present.

Return type:

Dict[str, Dict[str, Any]]
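The return type is a nested dictionary, and which statistics it contains depends on the stage that has completed. A quick way to see what is available is to list the inner keys for each outer key. The sketch below assumes that outer keys are column names and inner keys are statistic names, which this page does not state; stat_names_by_column is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
from typing import Any, Dict, List


def stat_names_by_column(
    stats: Dict[str, Dict[str, Any]],
) -> Dict[str, List[str]]:
    """Map each outer key to the sorted list of its inner keys."""
    return {outer: sorted(inner) for outer, inner in stats.items()}


# Made-up sample shaped like Dict[str, Dict[str, Any]]; the key names are
# assumptions and not real Kumo output.
sample = {"CustomerID": {"min": 1, "max": 9}, "Age": {}}
summary = stat_names_by_column(sample)
```

With a real Table, the helper would be applied as stat_names_by_column(customer.get_stats(wait_for="minimal")).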

save()[source]#

Saves a table to Kumo, returning a unique ID for this table. The unique ID can later be used to load the table object.

Return type:

str

Example

>>> import kumoai
>>> table = kumoai.Table(...)  
>>> table.save()  
table-xxx
save_as_template(name)[source]#

Saves a table as a named, re-usable template to Kumo, and returns the saved name as a response. This method can be used to “templatize” / name a table configuration for ease of future reusability.

Parameters:

name (str) – The name of the template to save the table as. If the name is already associated with another table, that table will be overwritten.

Return type:

Self

Example

>>> import kumoai
>>> table = kumoai.Table(...)  
>>> table.save_as_template("name")  
>>> loaded = kumoai.Table.load("name")  
>>> loaded == table  
True
classmethod load(table_id_or_template)[source]#

Loads a table from either a table ID or a named template. Returns a Table object that contains the loaded table along with its columns and other metadata.

Return type:

Table