kumoai.connector.SourceTable#
- class kumoai.connector.SourceTable[source]#
Bases: object

A source table is a reference to a table stored behind a backing Connector. It can be used to examine basic information about raw data connected to Kumo, including a sample of the table's rows, basic statistics, and column data type information.

Once you are ready to use a table as part of a Graph, you may create a Table object from this source table, which includes additional specifying information (including column semantic types and column constraint information).

- Parameters:
  name (str) – The name of this table in the backing connector.
  connector (Connector) – The connector containing this table.
Note
Source tables can also be augmented with large language models to introduce contextual embeddings for language features. To do so, please consult add_llm().

Example
>>> import kumoai
>>> connector = kumoai.S3Connector(root_dir='s3://...')
>>> articles_src = connector['articles']
>>> articles_src = kumoai.SourceTable('articles', connector)
- property column_dict: Dict[str, SourceColumn]#
Returns the names of the columns in this table along with their SourceColumn information.
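To illustrate the shape of the returned mapping, here is a minimal local sketch. The `SourceColumn` stand-in below is hypothetical (its field names `name` and `dtype` are assumptions for illustration only); in practice the real objects come from the Kumo SDK and this property is read, not constructed by hand.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical stand-in for kumoai.connector.SourceColumn; field names
# here are assumed purely for illustration.
@dataclass
class SourceColumn:
    name: str
    dtype: str

# Assumed shape of the mapping returned by `column_dict`:
# column name -> column metadata.
column_dict: Dict[str, SourceColumn] = {
    "price": SourceColumn(name="price", dtype="float"),
    "prod_name": SourceColumn(name="prod_name", dtype="string"),
}

# Typical read-only use: inspect each column's reported data type.
for name, col in column_dict.items():
    print(f"{name}: {col.dtype}")
```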
- property columns: List[SourceColumn]#
Returns a list of the SourceColumn metadata of the columns in this table.
- head(num_rows=5)[source]#
Returns the first num_rows rows of this source table by reading data from the backing connector.
- add_llm(model, api_key, template, output_dir, output_column_name, output_table_name, dimensions=None, *, non_blocking=False)[source]#
Experimental method that returns a new source table including a column computed via an LLM, such as an OpenAI embedding model. Please refer to the example script for more details.
Note
LLM embedding currently only works for a SourceTable stored in S3.

Note

Your api_key will be encrypted once we receive it, and it is only decrypted just before we call the OpenAI text embedding service.

Note

Please keep track of the token usage in the OpenAI Dashboard. If the number of tokens in the data exceeds the limit, the backend will raise an error and no result will be produced.
Warning
This method only supports text embedding with data that has fewer than ~6 million tokens. The number of tokens is estimated by following this guide.
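Before submitting, you can sanity-check that your data fits under the limit with a rough pre-flight estimate. The sketch below uses the common ~4 characters-per-token rule of thumb for English text; this is an assumption, not the exact estimation method from the linked guide (the true count depends on the model's tokenizer):

```python
def estimate_tokens(texts):
    """Rough token estimate using the common ~4 characters-per-token
    rule of thumb for English text. This is an approximation only;
    the exact count depends on the model's tokenizer."""
    total_chars = sum(len(t) for t in texts)
    return total_chars // 4

# Example: 100,000 rows of fused template strings.
rows = ["A cozy knitted sweater in soft lambswool."] * 100_000
estimated = estimate_tokens(rows)

# Stay well below the ~6 million token limit mentioned above.
print(f"estimated tokens: {estimated:,}")
```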
Warning
This method is still experimental. Please consult with your Kumo POC before using it.
- Parameters:
  model (str) – The LLM model name, e.g., OpenAI's "text-embedding-3-small".
  api_key (str) – The API key used to call the LLM service.
  template (str) – A template string to be passed to the LLM. For example, "{A1} and {A2}" will fuse columns A1 and A2 into a single string.
  output_dir (str) – The S3 directory in which to store the output.
  output_column_name (str) – The name of the output column for the LLM.
  output_table_name (str) – The output table name.
  dimensions (Optional[int]) – The desired LLM embedding dimension.
  non_blocking (bool) – Whether to make this function non-blocking.
- Return type:
Union[SourceTable, LLMSourceTableFuture]
Example
>>> import kumoai
>>> connector = kumoai.S3Connector(root_dir='s3://...')
>>> articles_src = connector['articles']
>>> articles_src_future = connector["articles"].add_llm(
...     model="text-embedding-3-small",
...     api_key=YOUR_OPENAI_API_KEY,
...     template=("The product {prod_name} in the {section_name} section "
...               "is categorized as {product_type_name} "
...               "and has the following description: {detail_desc}"),
...     output_dir=YOUR_OUTPUT_DIR,
...     output_column_name="embedding_column",
...     output_table_name="articles_emb",
...     dimensions=256,
...     non_blocking=True,
... )
>>> articles_src_future.status()
>>> articles_src_future.cancel()
>>> articles_src = articles_src_future.result()
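The template string above uses Python format-style placeholders filled from each row's columns. The following is a minimal local sketch of that fusion step, based on the "{A1} and {A2}" description in the Parameters section; the per-row substitution behavior is an assumption, and no Kumo or OpenAI calls are involved:

```python
# A row as it might come from the 'articles' source table:
# a mapping from column name to value. The values are made up.
row = {
    "prod_name": "Strap top",
    "section_name": "Womens Everyday Basics",
    "product_type_name": "Vest top",
    "detail_desc": "Jersey top with narrow shoulder straps.",
}

template = ("The product {prod_name} in the {section_name} section "
            "is categorized as {product_type_name} "
            "and has the following description: {detail_desc}")

# Assumed behavior: placeholders are substituted per row, and the fused
# string is what gets sent to the embedding model.
fused = template.format(**row)
print(fused)
```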