kumoai.graph.Table
- class kumoai.graph.Table
Bases: object
A table represents metadata information for a table in a Kumo Graph.

Whereas a SourceTable is simply a reference to a table behind a backing Connector, a Table fully specifies the relevant metadata (including selected source columns, column data type and semantic type, and relational constraint information) necessary to train a PredictiveQuery on a graph of tables. A table can either be constructed explicitly, or with the convenience method from_source_table().

    import kumoai

    # Define connector to source data:
    connector = kumoai.S3Connector('s3://...')

    # Create table using `from_source_table`:
    customer = kumoai.Table.from_source_table(
        source_table=connector['customer'],
        primary_key='CustomerID',
    )

    # Create a table by constructing it directly:
    customer = kumoai.Table(
        source_table=connector['customer'],
        columns=[kumoai.Column(name='CustomerID', dtype='string', stype='ID')],
        primary_key='CustomerID',
    )

    # Infer any missing metadata in the table, from source table
    # properties:
    print("Current metadata: ", customer.metadata)
    customer.infer_metadata()

    # Validate the table configuration, for use in Kumo downstream models:
    customer.validate(verbose=True)

    # Fetch statistics from a snapshot of this table (this method will
    # take a table snapshot, and as a result may have high latency):
    customer.get_stats(wait_for="minimal")
- Parameters:
  - source_table (SourceTable) – The source table this Kumo table is created from.
  - columns (Optional[List[Union[SourceColumn, Column]]]) – The selected columns of the source table that are part of this Kumo table. Note that each column must specify its data type and semantic type; see the Column documentation for more information.
  - primary_key (Optional[str]) – The primary key of the table, if present. The primary key must exist in the columns argument.
  - time_column (Optional[str]) – The time column of the table, if present. The time column must exist in the columns argument.
  - end_time_column (Optional[str]) – The end time column of the table, if present. The end time column must exist in the columns argument.
- __init__(source_table, columns=None, primary_key=None, time_column=None, end_time_column=None)
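The parameter descriptions above require that the primary key, time column, and end time column each exist in the columns argument. The following is a minimal standalone sketch of that constraint, not the kumoai implementation: the helper name check_key_columns is hypothetical, and raising ValueError is an assumption, since the reference does not state which exception the constructor raises.

```python
from typing import List, Optional


def check_key_columns(
    columns: List[str],
    primary_key: Optional[str] = None,
    time_column: Optional[str] = None,
    end_time_column: Optional[str] = None,
) -> None:
    """Hypothetical helper: raise if a key column is not a selected column."""
    roles = {
        "primary_key": primary_key,
        "time_column": time_column,
        "end_time_column": end_time_column,
    }
    for role, name in roles.items():
        # A role left as None is absent and needs no check.
        if name is not None and name not in columns:
            # Assumption: ValueError; the real exception type is not documented here.
            raise ValueError(f"{role}={name!r} is not in columns {columns}")


# Both key columns are among the selected columns, so this call passes silently.
check_key_columns(
    ["CustomerID", "SignupDate"],
    primary_key="CustomerID",
    time_column="SignupDate",
)
```

A call such as check_key_columns(["CustomerID"], time_column="SignupDate") would raise in this sketch, because the time column was not selected.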
- static from_source_table(source_table, column_names=None, primary_key=None, time_column=None, end_time_column=None)
Creates a Kumo Table from a source table. If no column names are specified, all source columns are included by default.
- Parameters:
  - source_table (SourceTable) – The SourceTable object that this table is constructed from.
  - column_names (Optional[List[str]]) – A list of columns to include from the source table; if not specified, all columns are included by default.
  - primary_key (Optional[str]) – The name of the primary key of this table, if it exists.
  - time_column (Optional[str]) – The name of the time column of this table, if it exists.
  - end_time_column (Optional[str]) – The name of the end time column of this table, if it exists.
- Return type: Table
- print_definition()
Prints the full definition for this table; this definition can be copied and pasted verbatim to re-create this table.
- Return type: None
- has_column(name)
Returns True if this table has a column named name; False otherwise.
- Return type: bool
- column(name)
Returns the column named name in this table, or raises a KeyError if no such column is present.
- property columns: List[Column]
Returns a list of Column objects that represent the columns in this table.
- add_column(*args, **kwargs)
Adds a Column to this table. A column can either be added by directly specifying its configuration in this call, or by creating a Column object and passing it as an argument.
- Return type:
Example
>>> import kumoai
>>> table = kumoai.Table(source_table=...)
>>> table.add_column(name='col1', dtype='string')
>>> table.add_column(kumoai.Column('col2', 'int'))
- remove_column(name)
Removes a Column from this table.
- Raises: KeyError – if name is not present in this table.
- Return type: Self
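The column accessors described above (has_column, column, add_column, remove_column) behave much like a name-keyed mapping. The sketch below models that behaviour with a plain dictionary so it can run without a Kumo connection. ColumnRegistry is a hypothetical stand-in, not the kumoai Table class, and it stores simple dicts rather than Column objects.

```python
from typing import Dict


class ColumnRegistry:
    """Hypothetical stand-in mirroring the documented column accessors."""

    def __init__(self) -> None:
        self._columns: Dict[str, Dict[str, str]] = {}

    def has_column(self, name: str) -> bool:
        return name in self._columns

    def column(self, name: str) -> Dict[str, str]:
        # Mirrors the documented KeyError for a missing column.
        if name not in self._columns:
            raise KeyError(f"Column '{name}' is not present")
        return self._columns[name]

    def add_column(self, name: str, dtype: str) -> None:
        self._columns[name] = {"name": name, "dtype": dtype}

    def remove_column(self, name: str) -> "ColumnRegistry":
        # Mirrors the documented KeyError, and returns self like the
        # documented Self return type, which allows call chaining.
        if name not in self._columns:
            raise KeyError(f"Column '{name}' is not present")
        del self._columns[name]
        return self


registry = ColumnRegistry()
registry.add_column("col1", "string")
registry.add_column("col2", "int")
registry.remove_column("col1").remove_column("col2")
```

Returning the object from remove_column is what makes the chained call on the last line possible.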
- has_primary_key()
Returns True if this table has a primary key; False otherwise.
- Return type: bool
- property primary_key: Optional[Column]
The primary key column of this table.
The getter returns the primary key column of this table, or None if no such primary key is present.
The setter sets a column as the primary key on this table, and raises a ValueError if the primary key has a non-ID semantic type.
- has_time_column()
Returns True if this table has a time column; False otherwise.
- Return type: bool
- property time_column: Optional[Column]
The time column of this table.
The getter returns the time column of this table, or None if no such time column is present.
The setter sets a column as the time column on this table, and raises a ValueError if the time column is the same as the end time column, or has a non-timestamp semantic type.
- has_end_time_column()
Returns True if this table has an end time column; False otherwise.
- Return type: bool
- property end_time_column: Optional[Column]
The end time column of this table.
The getter returns the end time column of this table, or None if no such column is present.
The setter sets a column as the end time column on this table, and raises a ValueError if the end time column is the same as the time column, or has a non-timestamp semantic type.
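The setter rules for the primary key, time column, and end time column can be summarised in a small standalone model. This is a sketch under stated assumptions, not the kumoai implementation: KeyRoles is a hypothetical class, it tracks column names rather than Column objects, and the semantic type strings "ID" and "timestamp" are assumed spellings.

```python
from typing import Dict, Optional


class KeyRoles:
    """Hypothetical stand-in; maps column names to semantic types."""

    def __init__(self, stypes: Dict[str, str]) -> None:
        self._stypes = stypes
        self._primary_key: Optional[str] = None
        self._time_column: Optional[str] = None
        self._end_time_column: Optional[str] = None

    def _check_time(self, name: str, other: Optional[str]) -> None:
        # Shared rule for both time roles: distinct columns, timestamp stype.
        if name == other:
            raise ValueError("Time and end time columns must differ")
        if self._stypes[name] != "timestamp":
            raise ValueError(f"Column '{name}' must have a timestamp stype")

    @property
    def primary_key(self) -> Optional[str]:
        return self._primary_key

    @primary_key.setter
    def primary_key(self, name: str) -> None:
        if self._stypes[name] != "ID":
            raise ValueError(f"Primary key '{name}' must have an ID stype")
        self._primary_key = name

    @property
    def time_column(self) -> Optional[str]:
        return self._time_column

    @time_column.setter
    def time_column(self, name: str) -> None:
        self._check_time(name, self._end_time_column)
        self._time_column = name

    @property
    def end_time_column(self) -> Optional[str]:
        return self._end_time_column

    @end_time_column.setter
    def end_time_column(self, name: str) -> None:
        self._check_time(name, self._time_column)
        self._end_time_column = name


roles = KeyRoles({"CustomerID": "ID", "Start": "timestamp", "End": "timestamp"})
roles.primary_key = "CustomerID"
roles.time_column = "Start"
roles.end_time_column = "End"
```

In this model, assigning "Start" to end_time_column after it is already the time column raises a ValueError, as does assigning a timestamp column as the primary key.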
- property metadata: DataFrame
Returns a DataFrame object containing Kumo metadata information about the columns in this table.
The returned dataframe has columns name, dtype, stype, is_primary_key, is_time_column, and is_end_time_column, which provide an aggregate view of the properties of the columns of this table.
Example
>>> import kumoai
>>> table = kumoai.Table(source_table=...)
>>> table.add_column(name='CustomerID', dtype='float64', stype='ID')
>>> table.metadata
         name    dtype stype  is_time_column  is_end_time_column
0  CustomerID  float64    ID           False               False
- infer_metadata()
Infers all metadata for this table’s specified columns, including the column data types, semantic types, timestamp formats, primary keys, and time/end-time columns.
- Return type: Self
Note
This method modifies the Table object in place.
Note
By default, inferred information does not override manually specified information.
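The second note above, that inferred information does not override manually specified information, amounts to filling only the unset fields. The sketch below illustrates that merge rule on plain dictionaries. The function merge_inferred and the field names are hypothetical, and unlike infer_metadata() this sketch returns a new dictionary instead of modifying its input.

```python
from typing import Dict, Optional


def merge_inferred(
    specified: Dict[str, Optional[str]],
    inferred: Dict[str, str],
) -> Dict[str, Optional[str]]:
    """Hypothetical merge: inferred values fill only fields left as None."""
    merged = dict(specified)
    for key, value in inferred.items():
        # A manually specified (non-None) value always wins.
        if merged.get(key) is None:
            merged[key] = value
    return merged


column = {"dtype": "string", "stype": None}
print(merge_inferred(column, {"dtype": "int", "stype": "ID"}))
# prints {'dtype': 'string', 'stype': 'ID'}
```

The manually set dtype survives, while the missing stype is filled from the inferred values.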
- validate(verbose=True)
Validates a Table to ensure that all relevant metadata is specified for a table to be used in a downstream Graph and PredictiveQuery.
Concretely, validation ensures that all columns have valid data and semantic types, with respect to the table’s source data. For example, if a text column is assigned a dtype of "int", this method will raise an exception detailing the mismatch. Similarly, if a column cannot be cast from its source data type to the specified data type (e.g. "int" to "binary"), this method will raise an exception.
Warning
Data type validation is performed on a sample of table data. A successful validation does not guarantee that your entire data source is configured correctly.
- Parameters:
  - verbose (bool) – Whether to log non-error output of this validation.
- Return type: Self
Example
>>> import kumoai
>>> table = kumoai.Table(...)
>>> table.validate()
- Raises: ValueError – if validation fails.
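The warning above matters in practice: a check that inspects only a sample can pass even when values outside the sample are invalid. The sketch below demonstrates this with a hypothetical validate_int_sample helper. How Kumo actually draws its sample is not documented here, so taking the first sample_size values is purely an assumption for illustration.

```python
from typing import List


def can_cast_to_int(value: str) -> bool:
    """Return True if the string parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def validate_int_sample(values: List[str], sample_size: int) -> None:
    """Hypothetical check: only the first `sample_size` values are inspected."""
    for value in values[:sample_size]:
        if not can_cast_to_int(value):
            raise ValueError(f"Cannot cast {value!r} to 'int'")


data = ["1", "2", "3", "not-a-number"]

# Passes: the invalid value lies outside the three-value sample.
validate_int_sample(data, sample_size=3)
```

Raising sample_size to 4 makes the same call fail with a ValueError, which is why a passing validation should be read as a sanity check and not as proof that every row is well formed.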
- property snapshot_id: Optional[TableSnapshotID]
Returns the snapshot ID of this table’s snapshot, if a snapshot has been taken. Returns None otherwise.
Warning
This property currently only returns a snapshot ID if a snapshot has been taken in this session.
- snapshot(*, force_refresh=False, non_blocking=False)
Takes a snapshot of this table’s underlying data, and returns a unique identifier for this snapshot.
The snapshot functionality allows one to freeze a table in time, so that underlying data changes do not require Kumo to re-process the data. This allows for fast iterative machine learning model development, on a consistent set of input data.
Warning
Please note that snapshots are intended to freeze tables in time, and not to allow for “time-traveling” to an earlier version of data with a prior snapshot. In particular, this means that a table can only have one version of a snapshot, which represents the latest snapshot taken for that table.
Note
If you are using Kumo as a Snowpark Container Services native application, please note that snapshot is a no-op for all non-view tables.
- Parameters:
  - force_refresh (bool) – Indicates whether a snapshot should be taken, if one already exists in Kumo. If False, a previously existing snapshot may be re-used. If True, a new snapshot is always taken.
  - non_blocking (bool) – Whether this operation should return immediately after creating the snapshot, or await completion of the snapshot. If True, the snapshot will proceed in the background, and will be used for any downstream job.
- Return type: TableSnapshotID
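The interaction between force_refresh and the single-snapshot rule can be modelled in a few lines. This is a standalone sketch, not the kumoai implementation: SnapshotStore and its ID format are hypothetical, and the sketch always re-uses an existing snapshot when force_refresh is False, whereas the reference only says one "may" be re-used.

```python
import itertools
from typing import Optional


class SnapshotStore:
    """Hypothetical stand-in: holds at most one snapshot ID per table."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self.snapshot_id: Optional[str] = None

    def snapshot(self, *, force_refresh: bool = False) -> str:
        if self.snapshot_id is None or force_refresh:
            # A new snapshot replaces the previous one; no history is kept,
            # so there is no way to go back to an earlier snapshot.
            self.snapshot_id = f"snapshot-{next(self._counter)}"
        return self.snapshot_id


store = SnapshotStore()
first = store.snapshot()                       # takes a new snapshot
reused = store.snapshot()                      # re-uses the existing one
refreshed = store.snapshot(force_refresh=True) # replaces it
```

After the last call, only the refreshed ID remains available, which mirrors the warning that snapshots do not support "time-traveling".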
- get_stats(wait_for=None)
Returns all currently computed statistics on the latest snapshot of this table. If a snapshot on this table has not been taken, this method will take a snapshot.
Note
Table statistics are computed in multiple stages after ingestion is complete. These stages are called minimal and full; minimal statistics are always computed before full statistics.
- Parameters:
  - wait_for (Optional[str]) – Whether this operation should block until statistics are available. This argument can take one of three values: None, which indicates that the method should return immediately with whatever statistics are present; "minimal", which indicates that the method should return when the minimum, maximum, and fraction of NA values statistics are present; or "full", which indicates that the method should return when all computed statistics are present.
- Return type:
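The three wait_for values describe an ordered set of stages, so the blocking behaviour can be sketched as a polling loop over that order. The code below is an illustrative model only: poll_stats, fetch_stage, max_polls, and poll_seconds are hypothetical names, and the real method returns statistics, not a stage label.

```python
import time
from typing import Callable, Optional

# Stage order: no statistics yet < minimal < full.
STAGE_ORDER = {None: 0, "minimal": 1, "full": 2}


def poll_stats(
    fetch_stage: Callable[[], Optional[str]],
    wait_for: Optional[str] = None,
    max_polls: int = 10,
    poll_seconds: float = 0.0,
) -> Optional[str]:
    """Hypothetical poller: return the first stage that satisfies wait_for."""
    if wait_for not in STAGE_ORDER:
        raise ValueError(f"wait_for must be None, 'minimal' or 'full', got {wait_for!r}")
    for _ in range(max_polls):
        stage = fetch_stage()
        # With wait_for=None every stage qualifies, so the first poll returns.
        if STAGE_ORDER[stage] >= STAGE_ORDER[wait_for]:
            return stage
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statistics did not reach stage {wait_for!r}")


# Simulated backend that progresses one stage per poll.
stages = iter([None, "minimal", "full"])
reached = poll_stats(lambda: next(stages), wait_for="minimal")
```

Because minimal statistics are always computed before full statistics, waiting for "full" in this model implies the minimal ones are present as well.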
- save()
Saves a table to Kumo, returning a unique ID for this table. The unique ID can later be used to load the table object.
- Return type:
Example
>>> import kumoai
>>> table = kumoai.Table(...)
>>> table.save()
table-xxx
- save_as_template(name)
Saves a table as a named, re-usable template to Kumo, and returns the saved name as a response. This method can be used to “templatize” / name a table configuration for easy reuse.
- Parameters:
  - name (str) – The name of the template to save the table as. If the name is already associated with another table, that table will be overwritten.
- Return type: Self
Example
>>> import kumoai
>>> table = kumoai.Table(...)
>>> table.save_as_template("name")
>>> loaded = kumoai.Table.load("name")
>>> loaded == table
True