satif_sdk.standardizers.csv
CSVStandardizer Objects
class CSVStandardizer(Standardizer)
Standardizes one or more CSV files into a single SDIF database.
Transforms CSV data into the SDIF format, handling single or multiple files. Default CSV parsing options (delimiter, encoding, header, skip_rows, skip_columns) are set during initialization. These defaults can be overridden on a per-file basis when calling the standardize method. Includes basic type inference for columns (INTEGER, REAL, TEXT).
Attributes:
default_delimiter
Optional[str] - Default CSV delimiter character. If None, attempts auto-detection.
default_encoding
Optional[str] - Default file encoding. If None, attempts auto-detection.
default_has_header
bool - Default assumption whether CSV files have a header row.
default_skip_rows
SkipRowsConfig - Raw config for rows to skip, validated from constructor.
default_skip_columns
SkipColumnsConfig - Raw config for columns to skip, validated from constructor.
descriptions
Optional[Union[str, List[Optional[str]]]] - Descriptions for the data sources.
table_names
Optional[Union[str, List[Optional[str]]]] - Target table names in the SDIF database.
file_configs
Optional[Union[Dict[str, CSVFileConfig], List[Optional[CSVFileConfig]]]] - File-specific configuration overrides.
column_definitions
ColumnDefinitionsConfig - Column definitions for the data sources.
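The class description mentions basic type inference for columns (INTEGER, REAL, TEXT) without spelling out the rules. As a rough sketch of what such inference could look like, the helper below classifies a column of raw CSV strings using only the standard library. The function name `infer_column_type` and its handling of empty cells are assumptions for illustration, not the library's actual implementation.

```python
from typing import Iterable, Optional


def infer_column_type(values: Iterable[Optional[str]]) -> str:
    """Return 'INTEGER', 'REAL', or 'TEXT' for a column of raw CSV strings.

    Hypothetical helper: the rules CSVStandardizer really applies
    (for example, how it treats empty cells) may differ.
    """
    seen_value = False
    all_int = True
    for raw in values:
        if raw is None or raw.strip() == "":
            continue  # assumption: empty cells do not influence the type
        seen_value = True
        text = raw.strip()
        try:
            int(text)
            continue  # still compatible with INTEGER
        except ValueError:
            all_int = False
        try:
            float(text)
        except ValueError:
            return "TEXT"  # one non-numeric value makes the column TEXT
    if not seen_value:
        return "TEXT"  # assumption: an all-empty column falls back to TEXT
    return "INTEGER" if all_int else "REAL"
```

The sketch picks the narrowest type that fits every non-empty value, so a single decimal value widens a column to REAL and a single non-numeric value widens it to TEXT.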
__init__
def __init__(
    delimiter: Optional[str] = None,
    encoding: Optional[str] = None,
    has_header: bool = True,
    skip_rows: SkipRowsConfig = 0,
    skip_columns: SkipColumnsConfig = None,
    descriptions: Optional[Union[str, List[Optional[str]]]] = None,
    table_names: Optional[Union[str, List[Optional[str]]]] = None,
    column_definitions: ColumnDefinitionsConfig = None,
    file_configs: Optional[Union[Dict[str, CSVFileConfig],
                                 List[Optional[CSVFileConfig]]]] = None)
Initialize the CSV standardizer with default and task-specific configurations.
Arguments:
delimiter
- Default CSV delimiter character. If None, attempts auto-detection. If auto-detection fails, defaults to ',' with a warning.
encoding
- Default file encoding. If None, attempts auto-detection using charset-normalizer. If auto-detection fails, defaults to 'utf-8' with a warning.
has_header
- Default assumption whether CSV files have a header row.
skip_rows
- Rows to skip. Can be:
  - An int: Skips the first N rows.
  - A List[int] or Set[int]: Skips rows by their specific 0-based index (negative indices count from end).
  Defaults to 0 (skip no rows). Non-negative indices only for positive specification.
skip_columns
- Columns to skip. Can be:
  - An int or str: Skip a single column by 0-based index or name.
  - A List or Set containing int or str: Skip multiple columns by index or name.
  Column names are only effective if has_header is True. Non-negative indices only. Defaults to None (skip no columns).
descriptions
- A single description for all sources, or a list of descriptions (one per input file expected in standardize). If None, descriptions are omitted. Used for the data source descriptions.
table_names
- A single table name (used as a base if multiple files), a list of table names (one per input file expected in standardize), or None. If None, table names are derived from input filenames.
file_configs
- Optional configuration overrides. Can be a single dict applied to all files, or a list of dicts (one per file expected in standardize; use None in the list to apply defaults). Keys in the dict can include 'delimiter', 'encoding', 'has_header', 'skip_rows', 'skip_columns', 'description', 'table_name', 'column_definitions'. These override the defaults set above.
column_definitions
- Provides explicit definitions for columns, overriding automatic header processing or inference. This allows renaming columns, selecting specific columns, and providing descriptions. Types are still inferred. Can be:
  - A list of column specs: Defines columns for a single table. If multiple input files are processed and this single list is provided, it is applied to each. Each column spec is a dict with the keys:
    - original_identifier (str): Name or 0-based index (as str) in the CSV.
    - final_column_name (str): Desired name in the SDIF table.
    - description (str, optional): Column description.
  - A dict: Maps final table names to their column specs. Useful when table_names are known and you want to define columns per table.
  - A list with one element per input file: Each element can be None (use default handling), a list of column specs for that file's table, or a dict of table names to column specs if that file might map to specific table names (though the CSV standardizer typically creates one table per file).
  - None (default): Columns are derived from the CSV header or generated, and types are inferred.
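To make the argument shapes described above more concrete, here is a hedged sketch of a constructor configuration. The keyword names come from the documented `__init__` signature and the module path from the page title; every value (column names, table names, the description text) is a made-up example, and the helper `build_standardizer` is hypothetical. How the library combines these options in edge cases is not covered by this page, so treat this as an illustration of shapes rather than a verified recipe.

```python
from typing import Any, Dict

# Keyword arguments following the documented __init__ signature.
# All column, table, and description values are invented examples.
standardizer_kwargs: Dict[str, Any] = {
    "delimiter": None,      # None: attempt auto-detection
    "encoding": "utf-8",
    "has_header": True,
    "skip_rows": 2,         # int form: skip the first 2 rows
    "descriptions": "Monthly sales exports",
    "table_names": ["sales_january", "sales_february"],
    # A single list of column specs, applied to each input file.
    "column_definitions": [
        {
            "original_identifier": "Cust ID",
            "final_column_name": "customer_id",
            "description": "Customer identifier",
        },
        {"original_identifier": "Amt", "final_column_name": "amount"},
    ],
    # One entry per input file; None keeps the defaults for that file.
    "file_configs": [None, {"delimiter": ";", "skip_rows": 0}],
}


def build_standardizer() -> Any:
    """Create the standardizer; requires the satif_sdk package."""
    # Imported lazily so the configuration above can be inspected
    # even where satif_sdk is not installed.
    from satif_sdk.standardizers.csv import CSVStandardizer

    return CSVStandardizer(**standardizer_kwargs)
```

Keeping the configuration in a plain dict makes it easy to review or serialize before the standardizer is actually constructed.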
standardize
def standardize(datasource: Datasource,
                output_path: SDIFPath,
                *,
                overwrite: bool = False) -> StandardizationResult
Standardize one or more CSV files into a single SDIF database file, using configurations provided during initialization or overridden per file.
Arguments:
datasource
- A single file path (str or Path) or a list of file paths for the CSV files to be standardized.
output_path
- The path (str or Path) where the output SDIF database file will be created.
overwrite
- If True, an existing SDIF file at output_path will be overwritten. Defaults to False (raises an error if the file exists).
Returns:
A StandardizationResult object containing the path to the created SDIF file and a dictionary of the final configurations used for each processed input file.
Raises:
FileNotFoundError
- If an input CSV file is not found.
ValueError
- If input parameters are invalid (e.g., no input datasource, input path is not a file).
TypeError
- If datasource type is incorrect.
Various other exceptions from underlying CSV parsing or database operations can also be raised if critical errors occur.
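A call to standardize might be wrapped as in the sketch below, which handles the three documented exception types and lets anything else propagate. The wrapper name `run_standardization` and its log-and-return-None policy are assumptions chosen for illustration; it accepts any object that exposes the documented `standardize(datasource, output_path, *, overwrite=False)` method.

```python
import logging
from pathlib import Path
from typing import Any, Optional, Sequence, Union

logger = logging.getLogger(__name__)

PathLike = Union[str, Path]


def run_standardization(
    standardizer: Any,
    csv_files: Sequence[PathLike],
    output_path: PathLike,
    overwrite: bool = False,
) -> Optional[Any]:
    """Call standardizer.standardize and log the documented error types.

    Hypothetical wrapper: returns the standardization result on success,
    or None if one of the documented exceptions was raised.
    """
    try:
        return standardizer.standardize(
            list(csv_files), output_path, overwrite=overwrite
        )
    except FileNotFoundError as exc:
        logger.error("Input CSV file not found: %s", exc)
    except (ValueError, TypeError) as exc:
        logger.error("Invalid standardization input: %s", exc)
    # Other exceptions (CSV parsing, database errors) are left to propagate.
    return None
```

With satif_sdk installed, this could be invoked as `run_standardization(CSVStandardizer(), ["a.csv", "b.csv"], "output.sdif")`, where the file names are placeholders.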