abacusai.dataset
Classes
ApplicationConnectorDatasetConfig – An abstract class for dataset configs specific to application connectors.
DatasetDocumentProcessingConfig – Document processing configuration for dataset imports.
DataType – Generic enumeration.
DocumentProcessingConfig – Document processing configuration.
ParsingConfig – Custom config for dataset parsing.
DatasetColumn – A schema description for a column.
DatasetVersion – A specific version of a dataset.
RefreshSchedule – A refresh schedule for an object. Defines when the next version of the object will be created.
Dataset – A dataset reference.
Module Contents
- class abacusai.dataset.ApplicationConnectorDatasetConfig
Bases:
abacusai.api_class.dataset.DatasetConfig
An abstract class for dataset configs specific to application connectors.
- Parameters:
application_connector_type (enums.ApplicationConnectorType) – The type of application connector
- application_connector_type: abacusai.api_class.enums.ApplicationConnectorType
- classmethod _get_builder()
- class abacusai.dataset.DatasetDocumentProcessingConfig
Bases:
DocumentProcessingConfig
Document processing configuration for dataset imports.
- Parameters:
extract_bounding_boxes (bool) – Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False.
ocr_mode (OcrMode) – OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True.
use_full_ocr (bool) – Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
remove_header_footer (bool) – Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
remove_watermarks (bool) – Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
convert_to_markdown (bool) – Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
page_text_column (str) – Name of the output column which contains the extracted text for each page. If not provided, no column will be created.
- class abacusai.dataset.DataType
Bases:
ApiEnum
Generic enumeration.
Derive from this class to define new enumerations.
- INTEGER = 'integer'
- FLOAT = 'float'
- STRING = 'string'
- DATE = 'date'
- DATETIME = 'datetime'
- BOOLEAN = 'boolean'
- LIST = 'list'
- STRUCT = 'struct'
- NULL = 'null'
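The member values above are plain strings, so a raw type name can be mapped back to a member by value. The following is a minimal sketch using a local stand-in enum built on Python's standard `enum.Enum`; `DataTypeSketch` and `parse_data_type` are hypothetical names, nothing here is imported from `abacusai`, and the trim-and-lowercase step is an assumption of the sketch rather than documented behaviour of `ApiEnum`.

```python
from enum import Enum


class DataTypeSketch(Enum):
    """Local stand-in that mirrors the string values listed above."""
    INTEGER = 'integer'
    FLOAT = 'float'
    STRING = 'string'
    DATE = 'date'
    DATETIME = 'datetime'
    BOOLEAN = 'boolean'
    LIST = 'list'
    STRUCT = 'struct'
    NULL = 'null'


def parse_data_type(raw: str) -> DataTypeSketch:
    """Look a member up by value after trimming and lowercasing the input.

    The normalisation is a convenience of this sketch only.
    """
    return DataTypeSketch(raw.strip().lower())


if __name__ == '__main__':
    print(parse_data_type(' DateTime '))
```

An unknown value raises `ValueError`, which is the standard `Enum` lookup behaviour.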
- class abacusai.dataset.DocumentProcessingConfig
Bases:
abacusai.api_class.abstract.ApiClass
Document processing configuration.
- Parameters:
extract_bounding_boxes (bool) – Whether to perform OCR and extract bounding boxes. If False, no OCR will be done but only the embedded text from digital documents will be extracted. Defaults to False.
ocr_mode (OcrMode) – OCR mode. There are different OCR modes available for different kinds of documents and use cases. This option only takes effect when extract_bounding_boxes is True.
use_full_ocr (bool) – Whether to perform full OCR. If True, OCR will be performed on the full page. If False, OCR will be performed on the non-text regions only. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
remove_header_footer (bool) – Whether to remove headers and footers. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
remove_watermarks (bool) – Whether to remove watermarks. By default, it will be decided automatically based on the OCR mode and the document type. This option only takes effect when extract_bounding_boxes is True.
convert_to_markdown (bool) – Whether to convert extracted text to markdown. Defaults to False. This option only takes effect when extract_bounding_boxes is True.
- ocr_mode: abacusai.api_class.enums.OcrMode
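Several of the options above only take effect when extract_bounding_boxes is True. The sketch below illustrates that dependency with a local dataclass; `DocProcessingSketch`, `OCR_DEPENDENT`, and `effective_options` are hypothetical names, `ocr_mode` is modelled as a plain optional string instead of the real `OcrMode` enum, and none of this is the library's own implementation.

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional


@dataclass
class DocProcessingSketch:
    """Local stand-in with the parameters documented above."""
    extract_bounding_boxes: bool = False
    ocr_mode: Optional[str] = None          # stand-in for the OcrMode enum
    use_full_ocr: Optional[bool] = None     # None = decided automatically
    remove_header_footer: bool = False
    remove_watermarks: Optional[bool] = None  # None = decided automatically
    convert_to_markdown: bool = False


# Options documented as taking effect only when extract_bounding_boxes is True.
OCR_DEPENDENT = (
    'ocr_mode',
    'use_full_ocr',
    'remove_header_footer',
    'remove_watermarks',
    'convert_to_markdown',
)


def effective_options(cfg: DocProcessingSketch) -> Dict[str, Any]:
    """Return only the options that would take effect for this config."""
    options = asdict(cfg)
    if not cfg.extract_bounding_boxes:
        for name in OCR_DEPENDENT:
            options.pop(name)
    return options


if __name__ == '__main__':
    print(effective_options(DocProcessingSketch(convert_to_markdown=True)))
    print(effective_options(
        DocProcessingSketch(extract_bounding_boxes=True, convert_to_markdown=True)
    ))
```

In the first call convert_to_markdown is dropped because OCR is disabled; in the second it is kept.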
- class abacusai.dataset.ParsingConfig
Bases:
abacusai.api_class.abstract.ApiClass
Custom config for dataset parsing.
- Parameters:
- class abacusai.dataset.DatasetColumn(client, name=None, dataType=None, detectedDataType=None, featureType=None, detectedFeatureType=None, originalName=None, validDataTypes=None, timeFormat=None, timestampFrequency=None)
Bases:
abacusai.return_class.AbstractApiClass
A schema description for a column.
- Parameters:
client (ApiClient) – An authenticated API Client instance
name (str) – The unique name of the column.
dataType (str) – The underlying data type of each column.
detectedDataType (str) – The detected data type of the column.
featureType (str) – Feature type of the column.
detectedFeatureType (str) – The detected feature type of the column.
originalName (str) – The original name of the column.
validDataTypes (list[str]) – The valid data type options for this column.
timeFormat (str) – The detected time format of the column.
timestampFrequency (str) – The detected frequency of the timestamps in the dataset.
- __repr__()
Return repr(self).
- class abacusai.dataset.DatasetVersion(client, datasetVersion=None, status=None, datasetId=None, size=None, rowCount=None, fileInspectMetadata=None, createdAt=None, error=None, incrementalQueriedAt=None, uploadId=None, mergeFileSchemas=None, databaseConnectorConfig=None, applicationConnectorConfig=None, invalidRecords=None)
Bases:
abacusai.return_class.AbstractApiClass
A specific version of a dataset.
- Parameters:
client (ApiClient) – An authenticated API Client instance
datasetVersion (str) – The unique identifier of the dataset version.
status (str) – The current status of the dataset version
datasetId (str) – A reference to the Dataset this dataset version belongs to.
size (int) – The size in bytes of the file.
rowCount (int) – Number of rows in the dataset version.
fileInspectMetadata (dict) – Metadata about the file’s inspection, for example the detected delimiter for CSV files.
createdAt (str) – The timestamp this dataset version was created.
error (str) – If status is FAILED, this field will be populated with an error.
incrementalQueriedAt (str) – If the dataset version is from an incremental dataset, this is the last entry of timestamp column when the dataset version was created.
uploadId (str) – If the dataset version is being uploaded, this is the reference to the Upload.
mergeFileSchemas (bool) – If the merge file schemas policy is enabled.
databaseConnectorConfig (dict) – The database connector query used to retrieve data for this version.
applicationConnectorConfig (dict) – The application connector used to retrieve data for this version.
invalidRecords (str) – Invalid records in the dataset version
- __repr__()
Return repr(self).
- to_dict()
Get a dict representation of the parameters in this class
- Returns:
The dict value representation of the class parameters
- Return type:
- get_metrics(selected_columns=None, include_charts=False, include_statistics=True)
Get metrics for a specific dataset version.
- Parameters:
- Returns:
The metrics for the specified Dataset version.
- Return type:
- refresh()
Calls describe and refreshes the current object’s fields
- Returns:
The current object
- Return type:
- describe()
Retrieves a full description of the specified dataset version, including its ID, name, source type, and other attributes.
- Parameters:
dataset_version (str) – Unique string identifier associated with the dataset version.
- Returns:
The dataset version.
- Return type:
- get_logs()
Retrieves the dataset import logs.
- Parameters:
dataset_version (str) – The unique version ID of the dataset version.
- Returns:
The logs for the specified dataset version.
- Return type:
- wait_for_import(timeout=900)
Waits until the dataset version is imported.
- Parameters:
timeout (int) – The time allowed for the call to finish. If the call does not finish within the allotted time, it times out.
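The waiting behaviour described above can be pictured as a polling loop with a deadline. This is only a sketch of the general pattern, not the library's implementation: `wait_until_done`, `poll_interval`, and the terminal status names `'COMPLETE'` and `'FAILED'` are assumptions, and the status source is passed in as a callable so the loop runs without an API client.

```python
import time
from typing import Callable, Iterable


def wait_until_done(
    get_status: Callable[[], str],
    timeout: float = 900,
    poll_interval: float = 5.0,
    terminal: Iterable[str] = ('COMPLETE', 'FAILED'),  # assumed status names
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until a terminal status appears or the deadline passes."""
    terminal_set = set(terminal)
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in terminal_set:
            return status
        if clock() >= deadline:
            raise TimeoutError(f'still {status!r} after {timeout} seconds')
        sleep(poll_interval)


if __name__ == '__main__':
    # Simulated sequence of statuses; no network access involved.
    statuses = iter(['IMPORTING', 'INSPECTING', 'COMPLETE'])
    print(wait_until_done(lambda: next(statuses), sleep=lambda _: None))
```

Injecting `sleep` and `clock` keeps the loop easy to exercise in tests without real delays.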
- class abacusai.dataset.RefreshSchedule(client, refreshPolicyId=None, nextRunTime=None, cron=None, refreshType=None, error=None)
Bases:
abacusai.return_class.AbstractApiClass
A refresh schedule for an object. Defines when the next version of the object will be created.
- Parameters:
client (ApiClient) – An authenticated API Client instance
refreshPolicyId (str) – The unique identifier of the refresh policy
nextRunTime (str) – The next run time of the refresh policy. If null, the policy is paused.
cron (str) – A cron-style string that describes when this refresh policy is to be executed, in UTC.
refreshType (str) – The type of refresh that will be run
error (str) – An error message for the last pipeline run of a policy
- __repr__()
Return repr(self).
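To make the cron field concrete, the sketch below splits a cron-style string into named fields. It assumes the conventional five-field layout (minute, hour, day of month, month, day of week); the source does not state which cron dialect the service accepts, and `split_cron` and `CRON_FIELDS` are hypothetical names.

```python
from typing import Dict

# Conventional five-field cron layout (an assumption of this sketch).
CRON_FIELDS = ('minute', 'hour', 'day_of_month', 'month', 'day_of_week')


def split_cron(cron: str) -> Dict[str, str]:
    """Split a five-field cron string into a field-name -> expression mapping."""
    parts = cron.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f'expected {len(CRON_FIELDS)} fields, got {len(parts)}: {cron!r}'
        )
    return dict(zip(CRON_FIELDS, parts))


if __name__ == '__main__':
    # Under the five-field convention: 06:00 every Monday (UTC per the docs above).
    print(split_cron('0 6 * * 1'))
```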
- class abacusai.dataset.AbstractApiClass(client, id)
- __eq__(other)
Return self==value.
- _get_attribute_as_dict(attribute)
- class abacusai.dataset.Dataset(client, datasetId=None, sourceType=None, dataSource=None, createdAt=None, ignoreBefore=None, ephemeral=None, lookbackDays=None, databaseConnectorId=None, databaseConnectorConfig=None, connectorType=None, featureGroupTableName=None, applicationConnectorId=None, applicationConnectorConfig=None, incremental=None, isDocumentset=None, extractBoundingBoxes=None, mergeFileSchemas=None, referenceOnlyDocumentset=None, schema={}, refreshSchedules={}, latestDatasetVersion={}, parsingConfig={}, documentProcessingConfig={})
Bases:
abacusai.return_class.AbstractApiClass
A dataset reference.
- Parameters:
client (ApiClient) – An authenticated API Client instance
datasetId (str) – The unique identifier of the dataset.
sourceType (str) – The source of the Dataset. EXTERNAL_SERVICE, UPLOAD, or STREAMING.
dataSource (str) – Location of data. It may be a URI such as an s3 bucket or the database table.
createdAt (str) – The timestamp at which this dataset was created.
ignoreBefore (str) – The timestamp at which all previous events are ignored when training.
ephemeral (bool) – The dataset is ephemeral and not used for training.
lookbackDays (int) – Specific to streaming datasets, this specifies how many days’ worth of data to include when generating a snapshot. A value of 0 leaves this selection to the system.
databaseConnectorId (str) – The Database Connector used.
databaseConnectorConfig (dict) – The database connector query used to retrieve data.
connectorType (str) – The type of connector used to get this dataset: FILE or DATABASE.
featureGroupTableName (str) – The table name of the dataset’s feature group
applicationConnectorId (str) – The Application Connector used.
applicationConnectorConfig (dict) – The application connector query used to retrieve data.
incremental (bool) – If dataset is an incremental dataset.
isDocumentset (bool) – If dataset is a documentset.
extractBoundingBoxes (bool) – Signifies whether to extract bounding boxes out of the documents. Only valid if is_documentset is True.
mergeFileSchemas (bool) – If the merge file schemas policy is enabled.
referenceOnlyDocumentset (bool) – Signifies whether to save the data reference only. Only valid if is_documentset is True.
latestDatasetVersion (DatasetVersion) – The latest version of this dataset.
schema (DatasetColumn) – List of resolved columns.
refreshSchedules (RefreshSchedule) – List of schedules that determine when the next version of the dataset will be created.
parsingConfig (ParsingConfig) – The parsing config used for dataset.
documentProcessingConfig (DocumentProcessingConfig) – The document processing config used for dataset (when is_documentset is True).
- __repr__()
Return repr(self).
- to_dict()
Get a dict representation of the parameters in this class
- Returns:
The dict value representation of the class parameters
- Return type:
- create_version_from_file_connector(location=None, file_format=None, csv_delimiter=None, merge_file_schemas=None, parsing_config=None)
Creates a new version of the specified dataset.
- Parameters:
location (str) – External URI to import the dataset from. If not specified, the last location will be used.
file_format (str) – File format to be used. If not specified, the service will try to detect the file format.
csv_delimiter (str) – If the file format is CSV, use a specific CSV delimiter.
merge_file_schemas (bool) – Signifies if the merge file schema policy is enabled.
parsing_config (ParsingConfig) – Custom config for dataset parsing.
- Returns:
The new Dataset Version created.
- Return type:
- create_version_from_database_connector(object_name=None, columns=None, query_arguments=None, sql_query=None)
Creates a new version of the specified dataset.
- Parameters:
object_name (str) – The name/ID of the object in the service to query. If not specified, the last name will be used.
columns (str) – The columns to query from the external service object. If not specified, the last columns will be used.
query_arguments (str) – Additional query arguments to filter the data. If not specified, the last arguments will be used.
sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns, and query_arguments.
- Returns:
The new Dataset Version created.
- Return type:
- create_version_from_application_connector(dataset_config=None)
Creates a new version of the specified dataset.
- Parameters:
dataset_config (ApplicationConnectorDatasetConfig) – Dataset config for the application connector. If any of the fields are not specified, the last values will be used.
- Returns:
The new Dataset Version created.
- Return type:
- create_version_from_upload(file_format=None)
Creates a new version of the specified dataset using a local file upload.
- create_version_from_document_reprocessing(document_processing_config=None)
Creates a new dataset version for a source docstore dataset with the provided document processing configuration. This does not re-import the data but uses the same data which is imported in the latest dataset version and only performs document processing on it.
- Parameters:
document_processing_config (DatasetDocumentProcessingConfig) – The document processing configuration to use for the new dataset version. If not specified, the document processing configuration from the source dataset will be used.
- Returns:
The new dataset version created.
- Return type:
- snapshot_streaming_data()
Snapshots the current data in the streaming dataset.
- Parameters:
dataset_id (str) – The unique ID associated with the dataset.
- Returns:
The new Dataset Version created by taking a snapshot of the current data in the streaming dataset.
- Return type:
- set_column_data_type(column, data_type)
Set a Dataset’s column type.
- set_streaming_retention_policy(retention_hours=None, retention_row_count=None, ignore_records_before_timestamp=None)
Sets the streaming retention policy.
- get_schema()
Retrieves the column schema of a dataset.
- Parameters:
dataset_id (str) – Unique string identifier of the dataset schema to look up.
- Returns:
List of column schema definitions.
- Return type:
- set_database_connector_config(database_connector_id, object_name=None, columns=None, query_arguments=None, sql_query=None)
Sets database connector config for a dataset. This method is currently only supported for streaming datasets.
- Parameters:
database_connector_id (str) – Unique String Identifier of the Database Connector to import the dataset from.
object_name (str) – If applicable, the name/ID of the object in the service to query.
columns (str) – The columns to query from the external service object.
query_arguments (str) – Additional query arguments to filter the data.
sql_query (str) – The full SQL query to use when fetching data. If present, this parameter will override object_name, columns and query_arguments.
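The override rule for sql_query can be sketched as a small precedence check. `resolve_query_source` is a hypothetical helper written for illustration; it only models the documented rule that a provided sql_query takes precedence over object_name, columns, and query_arguments, and says nothing about how the service applies it internally.

```python
from typing import Dict, Optional


def resolve_query_source(
    object_name: Optional[str] = None,
    columns: Optional[str] = None,
    query_arguments: Optional[str] = None,
    sql_query: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Return the settings that would be used to fetch data.

    If sql_query is present it overrides the other three parameters.
    """
    if sql_query:
        return {'sql_query': sql_query}
    return {
        'object_name': object_name,
        'columns': columns,
        'query_arguments': query_arguments,
    }


if __name__ == '__main__':
    # The table and column names below are made up for the example.
    print(resolve_query_source(object_name='orders', columns='id, total'))
    print(resolve_query_source(object_name='orders', sql_query='SELECT id FROM orders'))
```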
- refresh()
Calls describe and refreshes the current object’s fields
- Returns:
The current object
- Return type:
- describe()
Retrieves a full description of the specified dataset, with attributes such as its ID, name, source type, etc.
- list_versions(limit=100, start_after_version=None)
Retrieves a list of all dataset versions for the specified dataset.
- Parameters:
- Returns:
A list of dataset versions.
- Return type:
- delete()
Deletes the specified dataset from the organization.
- Parameters:
dataset_id (str) – Unique string identifier of the dataset to delete.
- wait_for_import(timeout=900)
Waits until the dataset is imported.
- Parameters:
timeout (int) – The time allowed for the call to finish. If the call does not finish within the allotted time, it times out.
- wait_for_inspection(timeout=None)
Waits until the dataset is completely inspected.
- Parameters:
timeout (int) – The time allowed for the call to finish. If the call does not finish within the allotted time, it times out.
- get_status()
Gets the status of the latest dataset version.
- Returns:
A string describing the status of a dataset (importing, inspecting, complete, etc.).
- Return type:
- describe_feature_group()
Gets the feature group attached to the dataset.
- Returns:
A feature group object.
- Return type:
- create_refresh_policy(cron)
Creates a refresh policy for the dataset.
- Parameters:
cron (str) – A cron style string to set the refresh time.
- Returns:
The refresh policy object.
- Return type:
- list_refresh_policies()
Gets the refresh policies in a list.
- Returns:
A list of refresh policy objects.
- Return type:
List[RefreshPolicy]
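The Dataset methods documented above can be chained into a simple re-import workflow. The sketch below calls only method names and keyword arguments listed on this page (create_version_from_file_connector, wait_for_import, get_status) on any object that provides them. `reimport_and_report` and `FakeDataset` are hypothetical: the fake stands in for a real Dataset so the example runs without credentials, and its `'COMPLETE'` status string and the example URI are placeholders, not documented values.

```python
from typing import List, Optional, Tuple


def reimport_and_report(dataset, location: Optional[str] = None, timeout: int = 900) -> str:
    """Create a new version from a file connector, wait for it, then report status."""
    dataset.create_version_from_file_connector(location=location)
    dataset.wait_for_import(timeout=timeout)
    return dataset.get_status()


class FakeDataset:
    """Hypothetical stand-in that records calls instead of contacting a service."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, object]] = []

    def create_version_from_file_connector(self, location=None, file_format=None,
                                           csv_delimiter=None, merge_file_schemas=None,
                                           parsing_config=None):
        self.calls.append(('create_version_from_file_connector', location))

    def wait_for_import(self, timeout=900):
        self.calls.append(('wait_for_import', timeout))

    def get_status(self):
        return 'COMPLETE'  # placeholder status for the sketch


if __name__ == '__main__':
    fake = FakeDataset()
    print(reimport_and_report(fake, location='s3://example-bucket/data.csv'))
    print(fake.calls)
```

With a real client, the same function would be passed a Dataset object obtained from an authenticated ApiClient.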