kumoai.graph.Graph#

class kumoai.graph.Graph[source]#

Bases: object

A graph defines the relationships between a set of Kumo tables, akin to relationships between tables in a relational database. Creating a graph is the final step of data definition in Kumo; after a graph is created, you are ready to write a PredictiveQuery and train a predictive model.

import kumoai

# Define connector to source data:
connector = kumoai.S3Connector('s3://...')

# Create Kumo Tables. See Table documentation for more information:
customer = kumoai.Table(...)
stock = kumoai.Table(...)
transaction = kumoai.Table(...)

# Create a graph:
graph = kumoai.Graph(
    # These are the tables that participate in the graph: the keys of this
    # dictionary are the names of the tables, and the values are the Table
    # objects that correspond to these names:
    tables={
        'customer': customer,
        'stock': stock,
        'transaction': transaction,
    },

    # These are the edges that define the primary key / foreign key
    # relationships between the tables defined above. Here, `src_table`
    # is the table that has the foreign key `fkey`, which maps to the
    # table `dst_table`'s primary key:
    edges=[
        dict(src_table='transaction', fkey='StockCode', dst_table='stock'),
        dict(src_table='transaction', fkey='CustomerID', dst_table='customer'),
    ],
)

# Validate the graph configuration, for use in Kumo downstream models:
graph.validate(verbose=True)

# Visualize the graph:
graph.visualize()

# Fetch the statistics of the tables in this graph (this method will
# take a graph snapshot, and as a result may have high latency):
graph.get_table_stats(wait_for="minimal")

# Fetch link health statistics (this method will
# take a graph snapshot, and as a result may have high latency):
graph.get_edge_stats(non_blocking=False)
Parameters:
  • tables (Optional[Dict[str, Table]]) – The tables in the graph, represented as a dictionary mapping unique table names (within the context of this graph) to the Table definition for the table.

  • edges (Optional[List[Edge]]) – The edges (relationships) between the tables in the graph. Edges must specify the source table, foreign key, and destination table that they link.

__init__(tables=None, edges=None)[source]#
property id: str#

Returns the unique ID for this graph, determined from its schema and the schemas of the tables and columns that it contains. Two graphs that differ in any of their constituent tables or columns are guaranteed to have different identifiers.
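The idea of a schema-derived identifier can be illustrated with a small sketch. This is not Kumo's actual ID algorithm, which this page does not describe. The schema_fingerprint helper and the schema dictionaries below are hypothetical. The sketch only shows how hashing a canonical serialization gives equal IDs for equal schemas and (up to hash collisions) different IDs when any table or column differs.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical JSON form of a (hypothetical) graph schema."""
    # sort_keys makes the serialization independent of dict insertion order.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


schema_a = {
    "tables": {
        "customer": {"CustomerID": "int"},
        "transaction": {"CustomerID": "int", "Price": "float"},
    },
    "edges": [["transaction", "CustomerID", "customer"]],
}

# Same content as schema_a, written with a different key order.
schema_b = {
    "edges": [["transaction", "CustomerID", "customer"]],
    "tables": {
        "transaction": {"Price": "float", "CustomerID": "int"},
        "customer": {"CustomerID": "int"},
    },
}

# A deep copy of schema_a in which one column type is changed.
schema_c = json.loads(json.dumps(schema_a))
schema_c["tables"]["transaction"]["Price"] = "int"

print(schema_fingerprint(schema_a) == schema_fingerprint(schema_b))
print(schema_fingerprint(schema_a) == schema_fingerprint(schema_c))
```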

save()[source]#

Saves a graph to Kumo, returning a unique ID for this graph. The unique ID can later be used to load the graph object.

Return type:

str

Example

>>> import kumoai
>>> graph = kumoai.Graph(...)  
>>> graph.save()  
graph-xxx
save_as_template(name)[source]#

Saves a graph as a named, re-usable template to Kumo, and returns the saved name as a response. This method can be used to “templatize” / name a graph configuration for ease of future reusability.

Parameters:

name (str) – The name of the template to save the graph as. If the name is already associated with another graph, that graph will be overwritten.

Return type:

str

Example

>>> import kumoai
>>> graph = kumoai.Graph(...)  
>>> graph.save_as_template("name")  
>>> loaded = kumoai.Graph.load("name")  
>>> loaded == graph  
True
classmethod load(graph_id_or_template)[source]#

Loads a graph from either a graph ID or a named template. Returns a Graph object that contains the loaded graph along with its associated tables, columns, etc.

Return type:

Graph

property snapshot_id: Optional[GraphSnapshotID]#

Returns the snapshot ID of this graph’s snapshot, if a snapshot has been taken. Returns None otherwise.

Warning

This function currently only returns a snapshot ID if a snapshot has been taken in this session.

snapshot(*, force_refresh=False, non_blocking=False)[source]#

Takes a snapshot of this graph’s underlying data, and returns a unique identifier for this snapshot.

This is equivalent to taking a snapshot of each constituent table in the graph. For more information, please see the documentation for Table.snapshot().

Warning

Please familiarize yourself with the warnings for this method in Table before proceeding.

Parameters:
  • force_refresh (bool) – Indicates whether a snapshot should be taken, if one already exists in Kumo. If False, a previously existing snapshot may be re-used. If True, a new snapshot is always taken.

  • non_blocking (bool) – Whether this operation should return immediately after creating the snapshot, or await completion of the snapshot. If True, the snapshot will proceed in the background, and will be used for any downstream job.

Raises:

RuntimeError – if non_blocking is set to False and the graph snapshot fails.

Return type:

GraphSnapshotID

get_table_stats(wait_for=None)[source]#

Returns all currently computed statistics on the latest snapshot of this graph. If a snapshot on this graph has not been taken, this method will take a snapshot.

Note

Graph statistics are computed in multiple stages after ingestion is complete. These stages are called minimal and full; minimal statistics are always computed before full statistics.

Parameters:

wait_for (Optional[str]) – Whether this operation should block until statistics are available. This argument can take one of three values: None, which indicates that the method should return immediately with whatever statistics are present; "minimal", which indicates that the method should return when the minimum, maximum, and fraction-of-NA statistics are present; or "full", which indicates that the method should return when all computed statistics are present.

Return type:

Dict[str, Dict[str, Any]]
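The blocking behavior of wait_for can be pictured as a polling loop over the two statistics stages. The sketch below is an assumption-laden illustration, not the SDK's implementation. wait_for_stats, fake_fetch, and the (stage, stats) return shape are all hypothetical, and the backend is simulated with a fixed sequence of responses.

```python
import time

# Ordering of the stages: no statistics < minimal < full.
STAGE_ORDER = {None: 0, "minimal": 1, "full": 2}


def wait_for_stats(fetch, wait_for=None, poll_interval=0.0, max_polls=100):
    """Poll `fetch` until the requested statistics stage is available.

    `fetch` is a hypothetical callable returning (stage, stats), where
    stage is None, "minimal", or "full".
    """
    if wait_for not in STAGE_ORDER:
        raise ValueError(f"Unsupported wait_for value: {wait_for!r}")
    for _ in range(max_polls):
        stage, stats = fetch()
        # With wait_for=None the first response is returned immediately.
        if STAGE_ORDER[stage] >= STAGE_ORDER[wait_for]:
            return stats
        time.sleep(poll_interval)
    raise TimeoutError("Statistics did not reach the requested stage.")


# Simulated backend: nothing yet, then minimal, then full statistics.
_responses = iter([
    (None, {}),
    ("minimal", {"Price": {"min": 0.5, "max": 9.0}}),
    ("full", {"Price": {"min": 0.5, "max": 9.0, "mean": 3.2}}),
])


def fake_fetch():
    return next(_responses)


minimal_stats = wait_for_stats(fake_fetch, wait_for="minimal")
print(minimal_stats)
```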

get_edge_stats(*, non_blocking=False)[source]#

Retrieves edge health statistics for the edges in a graph, if these statistics have been computed by a graph snapshot.

Edge health statistics are returned in a GraphHealthStats object, and contain information about the match rate between primary key / foreign key relationships between the tables in the graph.

Parameters:

non_blocking (bool) – Whether this operation should return immediately after querying edge statistics (returning None if statistics are not available), or await completion of statistics computation.

Return type:

Optional[GraphHealthStats]
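To make "match rate" concrete, the following sketch computes one plausible definition: the fraction of non-null foreign key values that also appear among the destination table's primary keys. The exact definition used by GraphHealthStats is not stated here, so treat fkey_match_rate and the sample values as hypothetical.

```python
def fkey_match_rate(fkey_values, pkey_values):
    """Fraction of non-null foreign key values found among the primary keys."""
    pkeys = set(pkey_values)
    non_null = [value for value in fkey_values if value is not None]
    if not non_null:
        return 0.0
    matched = sum(1 for value in non_null if value in pkeys)
    return matched / len(non_null)


# Primary keys of a hypothetical `customer` table.
customer_ids = [1, 2, 3]
# Foreign keys of a hypothetical `transaction` table; 4 has no match.
transaction_customer_ids = [1, 1, 2, 4, None]

rate = fkey_match_rate(transaction_customer_ids, customer_ids)
print(rate)
```

A low rate under a definition like this one would suggest that many rows in the source table point at keys that are missing from the destination table.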

has_table(name)[source]#

Returns True if a table with the given name is present in this graph.

Return type:

bool

table(name)[source]#

Returns a table in this Kumo Graph.

Raises:

KeyError – if no such table is present.

Return type:

Table

add_table(name, table)[source]#

Adds a table to this Kumo Graph.

Raises:

KeyError – if a table with the same name already exists in this graph.

Return type:

Graph

remove_table(name)[source]#

Removes a table from this graph.

Raises:

KeyError – if no such table is present.

Return type:

Self

property tables: Dict[str, Table]#

Returns a dictionary mapping table names to the Table objects that are contained in this graph.

infer_metadata()[source]#

Infers metadata for every table in this graph by running metadata inference on each table individually. For more information, please see the documentation for Table.infer_metadata().

Return type:

Graph

Links two tables (src_table and dst_table) from the foreign key fkey in the source table to the primary key in the destination table. These edges are treated bidirectionally in Kumo.

Parameters:
Raises:

ValueError – if the edge is already present in the graph, if the source table does not exist in the graph, if the destination table does not exist in the graph, if the source key does not exist in the source table, or if the primary key of the source table is being treated as a foreign key.

Return type:

Graph
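The error conditions listed above can be sketched with a small in-memory stand-in. ToyGraph and its table dictionaries are hypothetical and exist only to show the order of checks. They are not the kumoai implementation, and the real method may report these conditions differently.

```python
class ToyGraph:
    """In-memory stand-in used only for illustration; not the kumoai Graph."""

    def __init__(self, tables):
        # Maps a table name to {"columns": set of names, "primary_key": name or None}.
        self.tables = tables
        self.edges = []

    def link(self, src_table, fkey, dst_table):
        edge = (src_table, fkey, dst_table)
        if edge in self.edges:
            raise ValueError(f"Edge {edge} is already present.")
        if src_table not in self.tables:
            raise ValueError(f"Source table '{src_table}' does not exist.")
        if dst_table not in self.tables:
            raise ValueError(f"Destination table '{dst_table}' does not exist.")
        src = self.tables[src_table]
        if fkey not in src["columns"]:
            raise ValueError(f"Column '{fkey}' does not exist in '{src_table}'.")
        if fkey == src["primary_key"]:
            raise ValueError(f"Primary key '{fkey}' cannot be used as a foreign key.")
        self.edges.append(edge)
        return self  # returning the graph allows chained calls


toy = ToyGraph({
    "customer": {"columns": {"CustomerID"}, "primary_key": "CustomerID"},
    "transaction": {"columns": {"CustomerID", "Price"}, "primary_key": None},
})
toy.link("transaction", "CustomerID", "customer")

try:
    toy.link("transaction", "CustomerID", "customer")  # duplicate edge
except ValueError as exc:
    duplicate_error = str(exc)
    print(duplicate_error)
```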

Removes an edge added to a Kumo Graph.

Parameters:
Raises:

ValueError – if the edge is not present in the graph.

Return type:

Graph

property edges: List[Edge]#

Returns a list of all Edge objects that represent links in this graph.

validate(verbose=True)[source]#

Validates a Graph to ensure that all relevant metadata is specified for its Tables and Edges.

Concretely, validation ensures that all tables are valid (see Table.validate() for more information), and that edges properly link primary keys and foreign keys between valid tables. It additionally ensures that the primary and foreign keys linked by an edge are of the same data type, so that unexpected mismatches do not occur within the Kumo platform.

Example

>>> import kumoai
>>> graph = kumoai.Graph(...)  
>>> graph.validate()  
ValidationResponse(warnings=[], errors=[])
Parameters:

verbose (bool) – Whether to log non-error output of this validation.

Raises:

ValueError – if validation fails.

Return type:

Self
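The data type check described above can be sketched independently of the SDK. The helper check_edge_dtypes and the plain-dictionary table descriptions below are hypothetical. The sketch only illustrates comparing each edge's foreign key type with the destination table's primary key type, and does not reproduce what validate() does internally.

```python
def check_edge_dtypes(tables, edges):
    """Return one error message per edge whose key data types differ."""
    errors = []
    for src_table, fkey, dst_table in edges:
        fkey_dtype = tables[src_table]["dtypes"][fkey]
        pkey = tables[dst_table]["primary_key"]
        pkey_dtype = tables[dst_table]["dtypes"][pkey]
        if fkey_dtype != pkey_dtype:
            errors.append(
                f"{src_table}.{fkey} ({fkey_dtype}) does not match "
                f"{dst_table}.{pkey} ({pkey_dtype})"
            )
    return errors


tables = {
    "customer": {"primary_key": "CustomerID", "dtypes": {"CustomerID": "int"}},
    "stock": {"primary_key": "StockCode", "dtypes": {"StockCode": "string"}},
    "transaction": {
        "primary_key": None,
        # StockCode is deliberately typed differently from stock.StockCode.
        "dtypes": {"CustomerID": "int", "StockCode": "int"},
    },
}
edges = [
    ("transaction", "CustomerID", "customer"),
    ("transaction", "StockCode", "stock"),
]

errors = check_edge_dtypes(tables, edges)
print(errors)
```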

visualize(path=None, show_cols=True)[source]#

Visualizes the tables and edges in this graph using the graphviz library.

Parameters:
  • path (Union[str, BytesIO, None]) – An optional local path to write the produced image to. If None, the image will not be written to disk.

  • show_cols (bool) – Whether to show all columns of every table in the graph. If False, will only show the primary key, foreign key(s), time column, and end time column of each table.

Return type:

Graph

Returns:

A graphviz.Graph instance representing the visualized graph.