kumoai.connector.SourceTable#
- class kumoai.connector.SourceTable[source]#
Bases: object

A source table is a reference to a table stored behind a backing Connector. It can be used to examine basic information about raw data connected to Kumo, including a sample of the table’s rows, basic statistics, and column data type information. Once you are ready to use a table as part of a Graph, you may create a Table object from this source table, which includes additional specifying information (including column semantic types and column constraint information).

- Parameters:
  - name (str) – The name of this table in the backing connector.
  - connector (Connector) – The connector containing this table.
Note
Source tables can also be augmented with large language models to introduce contextual embeddings for language features. To do so, please consult add_llm().

Example
>>> import kumoai
>>> connector = kumoai.S3Connector(root_dir='s3://...')
>>> articles_src = connector['articles']
>>> articles_src = kumoai.SourceTable('articles', connector)
- property column_dict: Dict[str, SourceColumn]#
Returns the names of the columns in this table along with their SourceColumn information.
- property columns: List[SourceColumn]#
Returns a list of the SourceColumn metadata of the columns in this table.
- head(num_rows=5)[source]#
Returns the first num_rows rows of this source table by reading data from the backing connector.
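Conceptually, head-style sampling just takes the first num_rows rows. The sketch below illustrates this with a plain Python list as a stand-in; the real method reads from the backing connector (and the shape of its return value is not specified here), so this is only an illustration of the semantics, not Kumo's implementation.

```python
def head(rows, num_rows=5):
    """Stand-in for SourceTable.head(): return the first num_rows
    rows. The real method reads from the backing connector; this
    list-based version exists only to illustrate the semantics."""
    return rows[:num_rows]

# A toy "table" of 100 rows.
table_rows = [{"id": i} for i in range(100)]

sample = head(table_rows)  # default: first 5 rows
assert len(sample) == 5
assert head(table_rows, num_rows=2) == [{"id": 0}, {"id": 1}]
```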
- add_llm(model, api_key, template, output_dir, output_column_name, output_table_name, dimensions=None, *, non_blocking=False)[source]#
Experimental method that returns a new source table including a column computed via an LLM, such as an OpenAI embedding model. Please refer to the example script for more details.
Note
Currently, LLM embedding is only supported for a SourceTable stored in S3.

Note
Your api_key is encrypted as soon as we receive it, and it is decrypted only immediately before the call to the OpenAI text embedding service.

Note
Please keep track of your token usage in the OpenAI Dashboard. If the number of tokens in the data exceeds the limit, the backend will raise an error and no result will be produced.

Warning
This method only supports text embedding for data with fewer than ~6 million tokens. The number of tokens is estimated by following this guide.

Warning
This method is still experimental. Please consult with your Kumo POC before using it.
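To sanity-check your data against the ~6 million token limit before calling add_llm(), you can make a rough local estimate. The ~4 characters/token heuristic below is a common rule of thumb for English text and is an assumption for illustration only; it is not Kumo's exact counting method (the guide referenced above governs the actual estimate).

```python
def estimate_tokens(texts):
    """Rough token estimate using the common ~4 characters/token
    heuristic for English text. This is an assumption for
    illustration; it is not Kumo's exact counting method."""
    return sum(len(t) for t in texts) // 4

# Example: check a column of product descriptions against the limit.
descriptions = ["A soft cotton t-shirt in navy blue."] * 1000
total = estimate_tokens(descriptions)
assert total < 6_000_000  # well under the documented ~6M-token cap
```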
- Parameters:
  - model (str) – The LLM model name, e.g., OpenAI’s "text-embedding-3-small".
  - api_key (str) – The API key used to call the LLM service.
  - template (str) – A template string passed to the LLM. For example, "{A1} and {A2}" will fuse columns A1 and A2 into a single string.
  - output_dir (str) – The S3 directory in which to store the output.
  - output_column_name (str) – The name of the output column produced by the LLM.
  - output_table_name (str) – The name of the output table.
  - dimensions (Optional[int]) – The desired LLM embedding dimension.
  - non_blocking (bool) – Whether to make this function non-blocking.
- Return type:
Example
>>> import kumoai
>>> connector = kumoai.S3Connector(root_dir='s3://...')
>>> articles_src = connector['articles']
>>> articles_src_future = connector["articles"].add_llm(
...     model="text-embedding-3-small",
...     api_key=YOUR_OPENAI_API_KEY,
...     template=("The product {prod_name} in the {section_name} section "
...               "is categorized as {product_type_name} "
...               "and has the following description: {detail_desc}"),
...     output_dir=YOUR_OUTPUT_DIR,
...     output_column_name="embedding_column",
...     output_table_name="articles_emb",
...     dimensions=256,
...     non_blocking=True,
... )
>>> articles_src_future.status()
>>> articles_src_future.cancel()
>>> articles_src = articles_src_future.result()