kumoai.graph.Graph
- class kumoai.graph.Graph
Bases: object
A graph defines the relationships between a set of Kumo tables, akin to relationships between tables in a relational database. Creating a graph is the final step of data definition in Kumo; after a graph is created, you are ready to write a PredictiveQuery and train a predictive model.

    import kumoai

    # Define connector to source data:
    connector = kumoai.S3Connector('s3://...')

    # Create Kumo Tables. See Table documentation for more information:
    customer = kumoai.Table(...)
    stock = kumoai.Table(...)
    transaction = kumoai.Table(...)

    # Create a graph:
    graph = kumoai.Graph(
        # These are the tables that participate in the graph: the keys of this
        # dictionary are the names of the tables, and the values are the Table
        # objects that correspond to these names:
        tables={
            'customer': customer,
            'stock': stock,
            'transaction': transaction,
        },
        # These are the edges that define the primary key / foreign key
        # relationships between the tables defined above. Here, `src_table`
        # is the table that has the foreign key `fkey`, which maps to the
        # table `dst_table`'s primary key:
        edges=[
            dict(src_table='transaction', fkey='StockCode', dst_table='stock'),
            dict(src_table='transaction', fkey='CustomerID', dst_table='customer'),
        ],
    )

    # Validate the graph configuration, for use in Kumo downstream models:
    graph.validate(verbose=True)

    # Visualize the graph:
    graph.visualize()

    # Fetch the statistics of the tables in this graph (this method will
    # take a graph snapshot, and as a result may have high latency):
    graph.get_table_stats(wait_for="minimal")

    # Fetch link health statistics (this method will
    # take a graph snapshot, and as a result may have high latency):
    graph.get_edge_stats(non_blocking=False)
- Parameters:
tables (Optional[Dict[str, Table]]) – The tables in the graph, represented as a dictionary mapping unique table names (within the context of this graph) to the Table definition for the table.
edges (Optional[List[Edge]]) – The edges (relationships) between the tables in the graph. Edges must specify the source table, foreign key, and destination table that they link.
- property id: str
Returns the unique ID for this graph, determined from its schema and the schemas of the tables and columns that it contains. Two graphs with any differences in their constituent tables or columns are guaranteed to have unique identifiers.
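To make the idea of a schema-derived identifier concrete, the sketch below hashes a canonical description of a graph's tables, columns, and edges. This is not Kumo's algorithm: `schema_fingerprint`, the dictionary layout, and the `graph-` prefix on the hash are assumptions made for illustration. It only shows why a change to any table or column produces a different ID (with a hash, up to a negligible collision probability).

```python
import hashlib
import json


def schema_fingerprint(tables, edges):
    """Return a deterministic ID for a graph schema (hypothetical helper).

    `tables` maps table name -> {column name: dtype string};
    `edges` is a list of (src_table, fkey, dst_table) tuples.
    """
    canonical = json.dumps(
        {"tables": tables, "edges": sorted(edges)},
        sort_keys=True,  # key order must not affect the ID
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "graph-" + digest[:16]


tables = {
    "customer": {"CustomerID": "int"},
    "transaction": {"CustomerID": "int", "Price": "float"},
}
edges = [("transaction", "CustomerID", "customer")]

# Identical schemas give identical IDs.
same = schema_fingerprint(tables, edges) == schema_fingerprint(dict(tables), list(edges))

# Changing a single column dtype changes the ID.
changed_tables = {**tables, "customer": {"CustomerID": "str"}}
differs = schema_fingerprint(tables, edges) != schema_fingerprint(changed_tables, edges)
```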
- save()
Saves a graph to Kumo, returning a unique ID for this graph. The unique ID can later be used to load the graph object.
- Return type:
Example
>>> import kumoai
>>> graph = kumoai.Graph(...)
>>> graph.save()
graph-xxx
- save_as_template(name)
Saves a graph as a named, re-usable template to Kumo, and returns the saved name as a response. This method can be used to “templatize” / name a graph configuration for ease of future reusability.
- Parameters:
name (str) – The name of the template to save the graph as. If the name is already associated with another graph, that graph will be overwritten.
- Return type:
Self
Example
>>> import kumoai
>>> graph = kumoai.Graph(...)
>>> graph.save_as_template("name")
>>> loaded = kumoai.Graph.load("name")
>>> loaded == graph
True
- classmethod load(graph_id_or_template)
Loads a graph from either a graph ID or a named template. Returns a Graph object that contains the loaded graph along with its associated tables, columns, etc.
- Return type:
- property snapshot_id: Optional[GraphSnapshotID]
Returns the snapshot ID of this graph’s snapshot, if a snapshot has been taken. Returns None otherwise.
Warning
This function currently only returns a snapshot ID if a snapshot has been taken in this session.
- snapshot(*, force_refresh=False, non_blocking=False)
Takes a snapshot of this graph’s underlying data, and returns a unique identifier for this snapshot.
This is equivalent to taking a snapshot for each constituent table in the graph. For more information, please see the documentation for snapshot().
Warning
Please familiarize yourself with the warnings for this method in Table before proceeding.
- Parameters:
force_refresh (bool) – Indicates whether a snapshot should be taken, if one already exists in Kumo. If False, a previously existing snapshot may be re-used. If True, a new snapshot is always taken.
non_blocking (bool) – Whether this operation should return immediately after creating the snapshot, or await completion of the snapshot. If True, the snapshot will proceed in the background, and will be used for any downstream job.
- Raises:
RuntimeError – if non_blocking is set to False and the graph snapshot fails.
- Return type:
GraphSnapshotID
- get_table_stats(wait_for=None)
Returns all currently computed statistics on the latest snapshot of this graph. If a snapshot on this graph has not been taken, this method will take a snapshot.
Note
Graph statistics are computed in multiple stages after ingestion is complete. These stages are called minimal and full; minimal statistics are always computed before full statistics.
- Parameters:
wait_for (Optional[str]) – Whether this operation should block until statistics are available. This argument can take one of three values: None, which indicates that the method should return immediately with whatever statistics are present; "minimal", which indicates that the method should return when the minimum, maximum, and fraction of NA values statistics are present; or "full", which indicates that the method should return when all computed statistics are present.
- Return type:
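The three wait_for values can be read as an ordering of stages: nothing yet, then minimal, then full. The sketch below models that blocking behaviour with a simple polling loop. It is an illustration only, not how kumoai is implemented: `wait_for_stage`, `get_stage`, and the timeout parameters are hypothetical names.

```python
import time

# Stages in the order they become available; None means "nothing yet".
_STAGE_ORDER = {None: 0, "minimal": 1, "full": 2}


def wait_for_stage(get_stage, wait_for=None, poll_seconds=0.01, timeout_seconds=1.0):
    """Poll `get_stage()` until it reaches `wait_for` (hypothetical helper).

    Returns the stage observed when polling stopped. With wait_for=None the
    first observation is returned immediately.
    """
    if wait_for not in _STAGE_ORDER:
        raise ValueError(f"wait_for must be None, 'minimal' or 'full', got {wait_for!r}")
    deadline = time.monotonic() + timeout_seconds
    while True:
        stage = get_stage()
        if _STAGE_ORDER[stage] >= _STAGE_ORDER[wait_for]:
            return stage
        if time.monotonic() >= deadline:
            raise TimeoutError(f"statistics did not reach {wait_for!r} in time")
        time.sleep(poll_seconds)


# Simulated backend: statistics advance one stage per poll.
_progress = iter([None, "minimal", "full"])
observed = wait_for_stage(lambda: next(_progress), wait_for="minimal")
```

Because minimal statistics are always computed before full statistics, a caller waiting for "minimal" may also receive a result that is already at the full stage.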
- get_edge_stats(*, non_blocking=False)
Retrieves edge health statistics for the edges in a graph, if these statistics have been computed by a graph snapshot.
Edge health statistics are returned in a GraphHealthStats object, and contain information about the match rate between primary key / foreign key relationships between the tables in the graph.
- Parameters:
non_blocking (bool) – Whether this operation should return immediately after querying edge statistics (returning None if statistics are not available), or await completion of statistics computation.
- Return type:
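As a rough intuition for what a match rate between a foreign key and a primary key measures, the sketch below computes the fraction of non-null foreign key values that appear among the primary key values. The exact definition used in GraphHealthStats may differ; `fkey_match_rate` and the sample values are assumptions made for illustration.

```python
def fkey_match_rate(fkey_values, pkey_values):
    """Fraction of non-null foreign key values found among the primary keys.

    Returns None when there are no non-null foreign key values.
    """
    pkeys = set(pkey_values)
    non_null = [value for value in fkey_values if value is not None]
    if not non_null:
        return None
    matched = sum(1 for value in non_null if value in pkeys)
    return matched / len(non_null)


customer_ids = [1, 2, 3]                    # primary key column of `customer`
transaction_customer = [1, 1, 2, 4, None]   # foreign key column of `transaction`

# Three of the four non-null foreign key values have a matching customer.
rate = fkey_match_rate(transaction_customer, customer_ids)
```

A low match rate on an edge usually means many rows in the source table point at keys that do not exist in the destination table, which is worth checking before training.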
- remove_table(name)
Removes a table from this graph.
- Raises:
KeyError – if no such table is present.
- Return type:
Self
- property tables: Dict[str, Table]
Returns a dictionary mapping table names to the Table objects that are contained in this graph.
- infer_metadata()
Infers metadata for the tables in this Graph, by inferring the metadata of each table in the graph. For more information, please see the documentation for infer_metadata().
- Return type:
- link(*args, **kwargs)
Links two tables (src_table and dst_table) from the foreign key fkey in the source table to the primary key in the destination table. These edges are treated bidirectionally in Kumo.
- Parameters:
*args (Union[str, Edge, None]) – Any arguments to construct a kumoai.graph.Edge, or a kumoai.graph.Edge itself.
**kwargs (str) – Any keyword arguments to construct a kumoai.graph.Edge.
- Raises:
ValueError – if the edge is already present in the graph, if the source table does not exist in the graph, if the destination table does not exist in the graph, if the source key does not exist in the source table, or if the primary key of the source table is being treated as a foreign key.
- Return type:
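The failure conditions listed under Raises can be pictured as a sequence of checks run before an edge is added. The sketch below reproduces them over plain dictionaries; it does not use or mirror kumoai's internals, and `check_new_edge`, the table layout, and the `TxnID` column are hypothetical.

```python
def check_new_edge(tables, edges, src_table, fkey, dst_table):
    """Raise ValueError for the failure cases of link() (illustrative only).

    `tables` maps table name -> {"primary_key": str, "columns": set of str};
    `edges` is the list of (src_table, fkey, dst_table) tuples already linked.
    """
    edge = (src_table, fkey, dst_table)
    if edge in edges:
        raise ValueError(f"edge {edge} is already present")
    if src_table not in tables:
        raise ValueError(f"source table {src_table!r} is not in the graph")
    if dst_table not in tables:
        raise ValueError(f"destination table {dst_table!r} is not in the graph")
    if fkey not in tables[src_table]["columns"]:
        raise ValueError(f"column {fkey!r} is not in table {src_table!r}")
    if fkey == tables[src_table]["primary_key"]:
        raise ValueError(f"{fkey!r} is the primary key of {src_table!r}")
    return edge


tables = {
    "customer": {"primary_key": "CustomerID", "columns": {"CustomerID", "Age"}},
    "transaction": {"primary_key": "TxnID", "columns": {"TxnID", "CustomerID"}},
}
edges = []
edges.append(check_new_edge(tables, edges, "transaction", "CustomerID", "customer"))

# Linking the same edge a second time is rejected.
try:
    check_new_edge(tables, edges, "transaction", "CustomerID", "customer")
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```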
- validate(verbose=True)
Validates a Graph to ensure that all relevant metadata is specified for its Tables and Edges.
Concretely, validation ensures that all tables are valid (see validate() for more information), and that edges properly link primary keys and foreign keys between valid tables. It additionally ensures that primary and foreign keys between tables in an edge are of the same data type, so that unexpected mismatches do not occur within the Kumo platform.
Example
>>> import kumoai
>>> graph = kumoai.Graph(...)
>>> graph.validate()
ValidationResponse(warnings=[], errors=[])
- Parameters:
verbose (bool) – Whether to log non-error output of this validation.
- Raises:
ValueError – if validation fails.
- Return type:
Self
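The data type check between primary and foreign keys can be sketched as a comparison over each edge. The code below works on plain dictionaries rather than kumoai objects, and is only a guess at the shape of such a check: `find_key_dtype_mismatches`, the table layout, and the dtype strings are hypothetical.

```python
def find_key_dtype_mismatches(tables, edges):
    """Return one message per edge whose foreign and primary key dtypes differ.

    `tables` maps table name -> {"primary_key": str, "dtypes": {column: dtype}};
    `edges` is a list of (src_table, fkey, dst_table) tuples.
    """
    problems = []
    for src_table, fkey, dst_table in edges:
        pkey = tables[dst_table]["primary_key"]
        fkey_dtype = tables[src_table]["dtypes"][fkey]
        pkey_dtype = tables[dst_table]["dtypes"][pkey]
        if fkey_dtype != pkey_dtype:
            problems.append(
                f"{src_table}.{fkey} ({fkey_dtype}) != {dst_table}.{pkey} ({pkey_dtype})"
            )
    return problems


tables = {
    "stock": {"primary_key": "StockCode", "dtypes": {"StockCode": "string"}},
    "customer": {"primary_key": "CustomerID", "dtypes": {"CustomerID": "int"}},
    "transaction": {
        "primary_key": None,
        # StockCode is stored as an int here, but as a string in `stock`.
        "dtypes": {"StockCode": "int", "CustomerID": "int"},
    },
}
edges = [
    ("transaction", "StockCode", "stock"),
    ("transaction", "CustomerID", "customer"),
]

mismatches = find_key_dtype_mismatches(tables, edges)
```

In this example only the `transaction` to `stock` edge is reported, since its two key columns have different data types.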
- visualize(path=None, show_cols=True)
Visualizes the tables and edges in this graph using the graphviz library.
- Parameters:
path (Union[str, BytesIO, None]) – An optional local path to write the produced image to. If None, the image will not be written to disk.
show_cols (bool) – Whether to show all columns of every table in the graph. If False, will only show the primary key, foreign key(s), time column, and end time column of each table.
- Return type:
Graph
- Returns:
A graphviz.Graph instance representing the visualized graph.