satif_ai.standardize
astandardize
async def astandardize(
datasource: Datasource,
output_path: SDIFPath,
*,
overwrite: bool = False,
sdif_schema: Optional[Union[FilePath, Dict[str, Any]]] = None,
tidy_adapter: Union[bool, TidyAdapter] = False,
config: Optional[Dict[str, Any]] = None,
standardizer: Optional[AsyncStandardizer] = None,
mcp_server: Optional[Any] = None,
mcp_session: Optional[Any] = None,
llm_model: Optional[str] = None) -> StandardizationResult
Asynchronously standardizes a datasource into a single, canonical SDIF SQLite file.
This function serves as the primary entry point for the SATIF standardization layer. It orchestrates the conversion of various input file formats (e.g., CSV, Excel, PDF) from the provided datasource into a unified SDIF (Standard Data Interchange Format) SQLite file. The process may involve AI-driven parsing, schema adaptation, and data tidying, depending on the configuration and the capabilities of the underlying standardizer.
Arguments:
datasource
- The source of the data to be standardized. This can be a single file path (str or Path), a list of file paths, or other datasource types supported by the chosen standardizer.
output_path
- The path (str or Path) where the output SDIF SQLite database file will be created (e.g., "./output/my_data.sdif").
overwrite
- If True, an existing SDIF file atoutput_path
will be overwritten. Defaults to False.
sdif_schema
- Optional. Path to an SDIF schema definition file (e.g., a JSON file) or a dictionary representing the schema. If provided, the standardization process (specifically if using the defaultAIStandardizer
) may attempt to adapt the data to this target schema.
tidy_adapter
- Optional. If True, a defaultTidyAdapter
may be used. Alternatively, a specificTidyAdapter
instance can be provided to perform data tidying processes (e.g., cleaning, normalization, restructuring tables). If False (default), no explicit tidying step is initiated by this top-level function, though underlying standardizers might perform their own internal tidying. The specifics depend on the standardizer's capabilities.
config
- Optional. A dictionary for advanced or standardizer-specific configurations. This config is passed directly to theoutput_path
0 method of the chosen standardizer.
output_path
1 - Optional. An instance of anoutput_path
2 subclass. If provided, this instance will be used for standardization. This allows for using pre-configured or custom standardizers. If None, a defaultAIStandardizer
is instantiated usingoutput_path
4,output_path
5,output_path
6,sdif_schema
, andtidy_adapter
.
output_path
4 - Optional. The MCP (Model Coordination Platform) server instance. Used ifoutput_path
1 is None for the defaultAIStandardizer
.
output_path
5 - Optional. The MCP session or transport object. Used ifoutput_path
1 is None for the defaultAIStandardizer
.
output_path
6 - Optional. The language model to be used by the defaultAIStandardizer
if nooutput_path
1 instance is provided (e.g., "gpt-4o"). Each standardizer may have its own default model.Returns:
A
overwrite
8 object containing:
output_path
: The absoluteoutput_path
0 to the created or updated SDIF database file.
output_path
1: An optional dictionary detailing configurations used for each processed input file, if applicable and returned by the standardizer.Raises:
output_path
2 - If thedatasource
(or parts of it) does not exist.output_path
4 - Ifoutput_path
exists andoverwrite
is False.output_path
7 - If input arguments are invalid (e.g., unsupported datasource type).output_path
8 - For general errors during the standardization process. Specific exceptions may also be raised by the underlying standardizer implementation.