kumoai.connector.SourceTable

class kumoai.connector.SourceTable

Bases: object

A source table is a reference to a table stored behind a backing Connector. It can be used to examine basic information about raw data connected to Kumo, including a sample of the table’s rows, basic statistics, and column data type information.

Once you are ready to use a table as part of a Graph, you may create a Table object from this source table, which includes additional specifying information (including column semantic types and column constraint information).

Parameters:
  • name (str) – The name of this table in the backing connector.

  • connector (Connector) – The connector containing this table.

Note

Source tables can also be augmented with large language models to introduce contextual embeddings for language features. To do so, please consult add_llm().

Example

>>> import kumoai
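>>> connector = kumoai.S3Connector(root_dir='s3://...')
>>> # Look up the source table through the connector ...
>>> articles_src = connector['articles']
>>> # ... or construct it explicitly from its name and connector.
>>> articles_src = kumoai.SourceTable('articles', connector)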

__init__(name, connector)

property column_dict: Dict[str, SourceColumn]

Returns the names of the columns in this table along with their SourceColumn information.

property columns: List[SourceColumn]

Returns a list of the SourceColumn metadata of the columns in this table.
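The two properties expose the same metadata in different shapes. The sketch below illustrates that relationship, assuming each column object carries a `name` attribute. `FakeSourceColumn` is a hypothetical stand-in, and its fields are assumptions rather than the actual SourceColumn definition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeSourceColumn:
    """Hypothetical stand-in for a SourceColumn (fields are assumed)."""
    name: str
    dtype: str


# Shape of `columns`: a list of column metadata objects.
columns: List[FakeSourceColumn] = [
    FakeSourceColumn("article_id", "int"),
    FakeSourceColumn("prod_name", "string"),
]

# Shape of `column_dict`: the same objects, keyed by column name.
column_dict: Dict[str, FakeSourceColumn] = {col.name: col for col in columns}

assert column_dict["prod_name"] is columns[1]
```

The dictionary form is convenient for looking up one column by name, while the list form preserves a simple iteration order.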

head(num_rows=5)

Returns the first num_rows rows of this source table by reading data from the backing connector.

Parameters:

num_rows (int) – The number of rows to select. If num_rows is larger than the number of available rows, all rows will be returned.

Return type:

DataFrame

Returns:

The first num_rows rows of the source table as a DataFrame.
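The row-count behavior described above mirrors ordinary slicing. The following is a minimal sketch that uses a plain list of dicts as a stand-in for the table. The helper `head_rows` and the sample data are hypothetical and not part of the kumoai API; the real method reads from the backing connector and returns a DataFrame.

```python
from typing import Any, Dict, List


def head_rows(rows: List[Dict[str, Any]], num_rows: int = 5) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for the truncation behavior of head()."""
    # Slicing never raises when num_rows exceeds len(rows);
    # it simply returns every available row.
    return rows[:num_rows]


sample = [{"article_id": i, "prod_name": f"item-{i}"} for i in range(3)]

first_two = head_rows(sample, num_rows=2)   # two rows
all_rows = head_rows(sample, num_rows=10)   # only three rows exist, so all three
```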

add_llm(model, api_key, template, output_dir, output_column_name, output_table_name, dimensions=None, *, non_blocking=False)

Experimental method that returns a new source table with an additional column computed by an LLM, such as an OpenAI embedding model. Please refer to the example script for more details.

Note

Currently, LLM embedding only works for SourceTable objects stored in S3.

Note

Your api_key is encrypted once we receive it and is decrypted only just before we call the OpenAI text embeddings API.

Note

Please keep track of your token usage in the OpenAI Dashboard. If the number of tokens in the data exceeds the limit, the backend will raise an error and no result will be produced.

Warning

This method only supports text embedding for data with fewer than ~6 million tokens. The number of tokens is estimated by following this guide.
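Before calling this method, it may help to sanity-check the size of your text against the ~6 million token limit. The sketch below uses a rough rule of thumb of about four characters per token for English text. That ratio is an assumption for illustration and is not necessarily the estimate the backend applies; `estimate_tokens`, `TOKEN_LIMIT`, and `CHARS_PER_TOKEN` are hypothetical names.

```python
import math
from typing import Iterable

TOKEN_LIMIT = 6_000_000   # approximate limit quoted above
CHARS_PER_TOKEN = 4       # assumed rule of thumb for English text


def estimate_tokens(texts: Iterable[str]) -> int:
    """Hypothetical helper: rough token count from character length."""
    return sum(math.ceil(len(text) / CHARS_PER_TOKEN) for text in texts)


rendered = ["The product Strap top in the Womens section", "Short text"]
total = estimate_tokens(rendered)
within_limit = total < TOKEN_LIMIT
```

A character-based estimate is only a coarse guide; for an exact count, use a tokenizer that matches the chosen model.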

Warning

This method is still experimental. Please consult with your Kumo POC before using it.

Parameters:
  • model (str) – The LLM model name, e.g., OpenAI’s "text-embedding-3-small".

  • api_key (str) – The API key to call the LLM service.

  • template (str) – A template string whose rendered text is passed to the LLM. For example, "{A1} and {A2}" will fuse columns A1 and A2 into a single string.

  • output_dir (str) – The S3 directory to store the output.

  • output_column_name (str) – The name of the column that stores the LLM output.

  • output_table_name (str) – The output table name.

  • dimensions (Optional[int]) – The desired LLM embedding dimension.

  • non_blocking (bool) – Whether to make this function non-blocking.

Return type:

Union[SourceTable, LLMSourceTableFuture]

Example

>>> import kumoai
>>> connector = kumoai.S3Connector(root_dir='s3://...')  
>>> articles_src = connector['articles']  
>>> articles_src_future = \
    connector["articles"].add_llm(
        model="text-embedding-3-small",
        api_key=YOUR_OPENAI_API_KEY,
        template=("The product {prod_name} in the {section_name} section "
                  "is categorized as {product_type_name} "
                  "and has the following description: {detail_desc}"),
        output_dir=YOUR_OUTPUT_DIR,
        output_column_name="embedding_column",
        output_table_name="articles_emb",
        dimensions=256,
        non_blocking=True,
    )
>>> articles_src_future.status()  # check the status of the job
>>> articles_src_future.cancel()  # optionally cancel the job
>>> articles_src = articles_src_future.result()  # retrieve the resulting source table
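
To illustrate how a template fuses columns into a single string, the sketch below renders a template against one row using Python's built-in string formatting. It only mimics the placeholder substitution described for the `template` parameter; how the backend actually renders rows is not specified here, and the helper `render_template` and the sample row are hypothetical.

```python
from typing import Any, Mapping


def render_template(template: str, row: Mapping[str, Any]) -> str:
    """Hypothetical helper: fill {column} placeholders from one row."""
    # format_map looks up each placeholder name in the row and raises
    # KeyError if the template references a column the row lacks.
    return template.format_map(row)


row = {"A1": "red", "A2": "cotton", "unused": 42}
fused = render_template("{A1} and {A2}", row)
print(fused)  # red and cotton
```

Columns that the template does not reference are simply ignored, so the row can carry more fields than the template uses.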