satif_sdk.standardizers.csv

CSVStandardizer Objects

class CSVStandardizer(Standardizer)

Standardizer for one or multiple CSV files into a single SDIF database.

Transforms CSV data into the SDIF format, handling single or multiple files. Default CSV parsing options (delimiter, encoding, header, skip_rows, skip_columns) are set during initialization. These defaults can be overridden on a per-file basis when calling the standardize method. Includes basic type inference for columns (INTEGER, REAL, TEXT).

Attributes:

default_delimiter Optional[str] - Default CSV delimiter character. If None, attempts auto-detection.

default_encoding Optional[str] - Default file encoding. If None, attempts auto-detection.

default_has_header bool - Default assumption whether CSV files have a header row.

default_skip_rows SkipRowsConfig - Raw config for rows to skip, validated from constructor.

default_skip_columns SkipColumnsConfig - Raw config for columns to skip, validated from constructor.

descriptions Optional[Union[str, List[Optional[str]]]] - Descriptions for the data sources.

table_names Optional[Union[str, List[Optional[str]]]] - Target table names in the SDIF database.

file_configs Optional[Union[Dict[str, CSVFileConfig], List[Optional[CSVFileConfig]]]] - File-specific configuration overrides.

column_definitions ColumnDefinitionsConfig - Column definitions for the data sources.

init

def __init__(
    delimiter: Optional[str] = None,
    encoding: Optional[str] = None,
    has_header: bool = True,
    skip_rows: SkipRowsConfig = 0,
    skip_columns: SkipColumnsConfig = None,
    descriptions: Optional[Union[str, List[Optional[str]]]] = None,
    table_names: Optional[Union[str, List[Optional[str]]]] = None,
    column_definitions: ColumnDefinitionsConfig = None,
    file_configs: Optional[Union[Dict[str, CSVFileConfig],
                                 List[Optional[CSVFileConfig]]]] = None)

Initialize the CSV standardizer with default and task-specific configurations.

Arguments:

delimiter - Default CSV delimiter character. If None, attempts auto-detection. If auto-detection fails, defaults to ',' with a warning.

encoding - Default file encoding. If None, attempts auto-detection using charset-normalizer. If auto-detection fails, defaults to 'utf-8' with a warning.

has_header - Default assumption whether CSV files have a header row.

skip_rows - Rows to skip. Can be:

An int: Skips the first N rows.

A List[int] or Set[int]: Skips rows by their specific 0-based index (negative indices count from end). Defaults to 0 (skip no rows). Non-negative indices only for positive specification.

skip_columns - Columns to skip. Can be:

An int or str: Skip a single column by 0-based index or name.

A encoding0 or encoding1 containing int or str: Skip multiple columns by index or name. Column names are only effective if encoding4. Non-negative indices only. Defaults to None (skip no columns).

encoding5 - A single description for all sources, or a list of descriptions (one per input file expected in standardize). If None, descriptions are omitted. Used for encoding6.

encoding7 - A single table name (used as a base if multiple files), a list of table names (one per input file expected in standardize), or None. If None, table names are derived from input filenames.

encoding8 - Optional configuration overrides. Can be a single dict applied to all files, or a list of dicts (one per file expected in standardize, use None in list to apply defaults). Keys in the dict can include 'delimiter', 'encoding', 'has_header', 'skip_rows', 'skip_columns', 'description', 'table_name', 'column_definitions'. These override the defaults set above.

encoding9 - Provides explicit definitions for columns, overriding automatic header processing or inference. This allows renaming columns, selecting specific columns, and providing descriptions. Types are still inferred. Can be:

A has_header0: Defines columns for a single table. If multiple input files are processed and this single list is provided, it's applied to each. Each has_header1 is a dict:

has_header2{"original_identifier"has_header3

has_header2original_identifierhas_header2 str - Name or 0-based index (as str) in the CSV.

has_header2final_column_namehas_header2 str - Desired name in the SDIF table.

has_header2descriptionhas_header2 str, optional - Column description.

A skip_rows0: Maps final table names to their column specs. Useful when encoding7 are known and you want to define columns per table.

A skip_rows2: A list corresponding to each input file. Each element can be skip_rows3 (use default handling), a has_header0 for that file's table, or a skip_rows0 if that file might map to specific table names (though CSV standardizer typically creates one table per file).

If skip_rows3 (default), columns are derived from CSV header or generated, and types inferred.

standardize

def standardize(datasource: Datasource,
                output_path: SDIFPath,
                *,
                overwrite: bool = False) -> StandardizationResult

Standardize one or more CSV files into a single SDIF database file, using configurations provided during initialization or overridden per file.

Arguments:

datasource - A single file path (str or Path) or a list of file paths for the CSV files to be standardized.

output_path - The path (str or Path) where the output SDIF database file will be created.

overwrite - If True, an existing SDIF file at output_path will be overwritten. Defaults to False (raises an error if file exists).

Returns:

A StandardizationResult object containing the path to the created SDIF file and a dictionary of the final configurations used for each processed input file.

Raises:

FileNotFoundError - If an input CSV file is not found.

ValueError - If input parameters are invalid (e.g., no input datasource, input path is not a file).

TypeError - If datasource type is incorrect. Various other exceptions from underlying CSV parsing or database operations can also be raised if critical errors occur.

CSVStandardizer Objects​

__init__​

standardize​

CSVStandardizer Objects

init

standardize