CSV Standardizer
The CSVStandardizer is a component that transforms CSV files into SDIF. It handles complex CSV parsing scenarios with extensive configuration options for delimiter detection, encoding, header handling, row/column skipping, column renaming, and type inference.
1. Basic Usage
from satif_sdk.standardizers.csv import CSVStandardizer
from pathlib import Path
# Initialize with default settings
standardizer = CSVStandardizer()
# Standardize a single CSV file
result = standardizer.standardize(
    datasource="data.csv",
    output_path="standardized_data.sdif",
    overwrite=True
)
print(f"Standardization complete. Output at: {result.output_path}")
2. Initialization Parameters
The CSVStandardizer accepts numerous parameters to customize its behavior:
standardizer = CSVStandardizer(
    delimiter=",",                # CSV delimiter character
    encoding="utf-8",             # File encoding
    has_header=True,              # Whether files have header rows
    skip_rows=0,                  # Rows to skip
    skip_columns=None,            # Columns to skip
    descriptions=None,            # Table descriptions
    table_names=None,             # Output table names
    column_definitions=None,      # Column specifications
    file_configs=None             # Per-file configuration overrides
)
2.1 Core Parsing Parameters
- 
delimiter: Character that separates fields in the CSV- If 
None, auto-detection is attempted usingclevercsv.Sniffer - Defaults to 
,if auto-detection fails - Example values: 
,,;,\t,| 
 - If 
 - 
encoding: Character encoding of the input files- If 
None, auto-detection is attempted usingcharset-normalizer - Defaults to 
utf-8if auto-detection fails - Common values: 
utf-8,latin-1,cp1252,utf-16 
 - If 
 - 
has_header: Whether CSV files have a header row (default:True)- If 
True, the first non-skipped row is treated as column names - If 
False, column names are generated ascolumn_0,column_1, etc. 
 - If 
 
2.2 Skip Configuration
- 
skip_rows: Rows to exclude from processing- Integer: Skip the first N rows (e.g., 
skip_rows=3) - List/Set of integers: Skip specific rows by 0-based index (e.g., 
skip_rows=[0, 5, 10]) - Supports negative indices to count from the end (e.g., 
skip_rows=[0, 1, -1]to skip first two rows and last row) 
 - Integer: Skip the first N rows (e.g., 
 - 
skip_columns: Columns to exclude from processing- Integer: Skip a specific column by 0-based index (e.g., 
skip_columns=0) - String: Skip a column by its name (only if 
has_header=True) (e.g.,skip_columns="timestamp") - List/Set of integers/strings: Skip multiple columns (e.g., 
skip_columns=["id", 2, "notes"]) 
 - Integer: Skip a specific column by 0-based index (e.g., 
 
2.3 Naming and Metadata
- 
descriptions: Descriptions for tables in the SDIF output- Single string: Same description for all tables
 - List of strings: One description per input file
 - Used in the SDIF metadata
 
 - 
table_names: Names for tables in the SDIF output- Single string: Base name for all tables (will be suffixed with index for multiple files)
 - List of strings: One name per input file
 - If 
None, names are derived from input file names 
 
2.4 Column Definitions
column_definitions: Explicit column specifications to control output- List of column specs:
column_definitions=[
{"original_identifier": "Customer ID", "final_column_name": "customer_id", "description": "Unique customer identifier"},
{"original_identifier": "2", "final_column_name": "purchase_amount", "description": "Amount in USD"}
] - Dictionary mapping table names to column specs:
column_definitions={
"customers": [
{"original_identifier": "ID", "final_column_name": "customer_id"},
{"original_identifier": "Name", "final_column_name": "customer_name"}
]
} - List of definitions (one per input file)
 
- List of column specs:
 
2.5 Per-File Configuration
file_configs: Overrides defaults for specific files- Dictionary of configurations (applies to all files):
file_configs={
"delimiter": ";",
"encoding": "latin-1",
"has_header": False
} - List of configurations (one per input file):
file_configs=[
{"delimiter": ",", "encoding": "utf-8"},
{"delimiter": ";", "encoding": "latin-1", "skip_rows": 3},
None # Use defaults for third file
] 
- Dictionary of configurations (applies to all files):
 
3. Type Inference
The CSVStandardizer automatically infers column data types from a sample of the data:
- INTEGER: If all non-empty values in the sample can be parsed as integers
 - REAL: If all non-empty values in the sample can be parsed as floating-point numbers
 - TEXT: If any value cannot be parsed as a number or if explicitly specified
 
The type inference process:
- Examines up to 100 rows (configurable via 
SAMPLE_SIZEconstant) - Initially considers all possible types (INTEGER, REAL, TEXT) for each column
 - Progressively eliminates types that don't match the data
 - Follows a type hierarchy: INTEGER > REAL > TEXT (prioritizes more specific types)
 - Falls back to TEXT if any ambiguity exists
 
4. Multiple File Handling
The standardizer can process multiple CSV files into a single SDIF database:
# Process multiple files
result = standardizer.standardize(
    datasource=["customers.csv", "orders.csv", "products.csv"],
    output_path="combined_data.sdif"
)
When standardizing multiple files:
- Each file becomes a separate table in the SDIF output
 - Table names are automatically derived and made unique:
- First file: Uses the base name
 - Subsequent files: Base name with suffix 
_1,_2, etc. 
 - File-specific configurations can be provided via the 
file_configsparameter 
5. Advanced Usage Examples
5.1 Auto-detection with Fallbacks
# Let the standardizer auto-detect parameters with fallbacks
standardizer = CSVStandardizer(
    delimiter=None,  # Try to auto-detect
    encoding=None,   # Try to auto-detect
    has_header=True
)
5.2 Skipping Metadata and Comments
# Skip header metadata and footer rows
standardizer = CSVStandardizer(
    skip_rows=[0, 1, 2, -1, -2],  # Skip first 3 rows and last 2 rows
)
# Skip metadata rows at the beginning
standardizer = CSVStandardizer(
    skip_rows=5,  # Skip first 5 rows
)
5.3 Column Selection and Renaming
# Select specific columns and rename them
standardizer = CSVStandardizer(
    column_definitions=[
        {"original_identifier": "ID", "final_column_name": "customer_id"},
        {"original_identifier": "First Name", "final_column_name": "first_name"},
        {"original_identifier": "Last Name", "final_column_name": "last_name"},
        # Skip other columns by not including them
    ]
)
5.4 Different Configurations for Multiple Files
# Configure each file differently
standardizer = CSVStandardizer(
    # File-specific configurations
    file_configs=[
        {
            "delimiter": ",",
            "encoding": "utf-8",
            "has_header": True,
            "skip_rows": 2
        },
        {
            "delimiter": ";",
            "encoding": "latin-1",
            "has_header": False
        }
    ],
    # Table names
    table_names=["customers", "orders"],
    # Table descriptions
    descriptions=[
        "Customer master data",
        "Order transactions"
    ]
)
# Standardize multiple files
result = standardizer.standardize(
    datasource=["customers.csv", "orders.csv"],
    output_path="data.sdif"
)
5.5 Advanced Column Definitions
# Define columns with descriptions and position-based identifiers
standardizer = CSVStandardizer(
    has_header=False,
    column_definitions=[
        {"original_identifier": "0", "final_column_name": "id", "description": "Unique identifier"},
        {"original_identifier": "1", "final_column_name": "name", "description": "Full name"},
        {"original_identifier": "3", "final_column_name": "email", "description": "Contact email"}
        # Note: Column at position 2 is skipped
    ]
)
6. Output
The standardization process returns a StandardizationResult object with:
output_path: Path to the generated SDIF filefile_configs: Configurations used for each processed file
result = standardizer.standardize(...)
print(f"Output SDIF file: {result.output_path}")
print(f"Configurations used: {result.file_configs}")
7. Error Handling
The standardizer provides comprehensive error handling:
FileNotFoundError: If input files don't existValueError: For invalid parameters or configurationsTypeError: For incorrect argument typesUnicodeDecodeError: For encoding issues
try:
    result = standardizer.standardize(
        datasource="data.csv",
        output_path="output.sdif"
    )
except FileNotFoundError:
    print("Input file not found")
except ValueError as e:
    print(f"Invalid parameter: {e}")
except UnicodeDecodeError:
    print("Encoding issue - try specifying the correct encoding")
8. CSV Format Heuristics
The standardizer employs several heuristics to handle common CSV format issues:
- Blank Line Handling: Blank lines after the initial skip count are automatically skipped
 - Mismatched Column Counts: Rows with too few columns have missing values treated as NULL; extra columns are ignored
 - Header Name Collisions: Duplicate column names are disambiguated with numeric suffixes
 - Case-Insensitive Column Matching: When resolving column names for skipping, falls back to case-insensitive matching
 - Name Sanitization: Column names are sanitized for SQL compatibility (replacing special characters, ensuring valid identifiers)
 
9. Performance Considerations
- For large files with negative skip indices, a preliminary pass is required to count total rows
 - Type inference uses a configurable sample size (default: 100 rows) to balance accuracy and performance
 - Automatic encoding and delimiter detection read only small samples from the beginning of files
 - Memory usage is optimized by processing files row-by-row rather than loading the entire file