satif_sdk.standardizers.xlsx
XLSXFileConfig Objects
class XLSXFileConfig(FileConfig)
Configuration settings applied to a single XLSX file during standardization.
XLSXStandardizer Objects
class XLSXStandardizer(Standardizer)
Standardizer for one or multiple Excel (.xlsx) files/sheets into an SDIF database.
Transforms data from specified sheets within Excel files into the SDIF format. Default options (sheet_name/index, header_row, skip_rows, skip_columns) are set during initialization. These defaults can be overridden on a per-file basis when calling the
standardize
method using thefile_configs
parameter. Infers SQLite types (INTEGER, REAL, TEXT) from pandas dtypes.If
column_definitions
are provided for a file, they take precedence for selecting, renaming, and describing columns. Otherwise, headers are taken from the Excel sheet (respectingheader_row
,skip_rows
, andskip_columns
).Attributes:
default_sheet_name
Union[str, int] - Default sheet identifier (name or 0-based index).default_header_row
int - Default 0-based index for the header row.default_skip_rows
int - Default number of rows to skip before the header row.default_skip_columns
Set[Union[str, int]] - Default names or 0-based indices of columns to skip. Primarily intended for names with Excel.file_configs
0 Optional[Union[str, List[Optional[str]]]] - Default descriptions for data sources.file_configs
1 Optional[Union[str, List[Optional[str]]]] - Default target table names.column_definitions
ColumnDefinitionsConfig - Default column definitions. file_configs (Optional[Union[Dict[str, XLSXFileConfig], List[Optional[XLSXFileConfig]]]]): Default file-specific configuration overrides.
__init__
def __init__(
sheet_name: Optional[Union[str, int]] = None,
header_row: int = DEFAULT_HEADER_ROW,
skip_rows: int = DEFAULT_SKIP_ROWS,
skip_columns: Optional[List[Union[str, int]]] = None,
descriptions: Optional[Union[str, List[Optional[str]]]] = None,
table_names: Optional[Union[str, List[Optional[str]]]] = None,
column_definitions: ColumnDefinitionsConfig = None,
file_configs: Optional[Union[Dict[str, XLSXFileConfig],
List[Optional[XLSXFileConfig]]]] = None)
Initialize the XLSX standardizer with default and task-specific configurations.
Arguments:
sheet_name
- Default sheet to read (name as str, 0-based index as int). If None, defaults to the first sheet (index 0).header_row
- Default 0-based row index to use as column headers.skip_rows
- Default number of rows to skip at the beginning of the sheet before the header row.skip_columns
- Default list of column names (exact match, case-sensitive) or 0-based integer indices to exclude from the standardization. If using indices, they refer to the column order afterheader_row
andskip_rows
are applied by pandas.descriptions
- A single description for all sources, or a list of descriptions (one per input file expected in standardize). If None, descriptions are omitted. Used forsdif_sources.source_description
. This can be overridden bydescription
key infile_configs
.header_row
0 - A single table name (used as a base if multiple files), a list of table names (one per input file expected in standardize), or None. If None, table names are derived from sheet names (or filenames if sheet name unavailable). This can be overridden byheader_row
1 key infile_configs
.header_row
3 - Default column definitions to precisely control column selection, renaming, and descriptions.header_row
4 inheader_row
5 maps to original Excel header names. Types are still inferred from data.file_configs
- Optional configuration overrides. Can be a single dict applied to all files, or a list of dicts (one per file expected in standardize, use None in list to apply defaults). Keys in the dict can include 'sheet_name', 'header_row', 'skip_rows', 'skip_columns', 'description', 'table_name', 'column_definitions'. These override the defaults set above for the specific file.
standardize
def standardize(datasource: Datasource,
output_path: SDIFPath,
*,
overwrite: bool = False) -> StandardizationResult
Standardize one or more Excel files into a single SDIF database file.
Reads a specified sheet from each input Excel file and stores its data in a corresponding table within the output SDIF database.
Arguments:
datasource
- A single path or a list of paths to the input Excel file(s) (.xlsx).
output_path
- The path for the output SDIF database file.
overwrite
- If True, overwrite the output SDIF file if it exists.Returns:
A StandardizationResult object containing the path to the created SDIF file and a dictionary of the final configurations used for each processed input file.
Raises:
ValueError
- If input files are invalid, list arguments stored in the instance have incorrect lengths compared to datasource, config values are invalid, or pandas/database errors occur.FileNotFoundError
- If an input Excel file does not exist.ImportError
- If 'pandas' or 'openpyxl' is not installed.RuntimeError
- For errors during Excel parsing or database operations.