SDIF Database
The SDIFDatabase is a component for creating, managing, and interacting with SDIF (.sqlite) files. It provides a comprehensive API for storing a data source's structured tabular data, JSON objects, binary media, and semantic relationships in a single SQLite-based file.
1. Basic Usage
The simplest way to use SDIFDatabase is through the context manager pattern:
from sdif_db import SDIFDatabase
from pathlib import Path
import pandas as pd
# Creating a new SDIF database
with SDIFDatabase(path="my_dataset.sdif", overwrite=True) as db:
    # Add a data source
    source_id = db.add_source(file_name="original_data.csv", file_type="csv",
                              description="Sales data from Q1 2023")
    # Create a table with your data
    df = pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["Product A", "Product B", "Product C"],
        "price": [19.99, 24.99, 15.50]
    })
    # Write the data to an SDIF table
    db.write_dataframe(df=df, table_name="products", source_id=source_id,
                       description="Product catalog")
# Reading from an existing SDIF database
with SDIFDatabase(path="my_dataset.sdif", read_only=True) as db:
    # Read a table into a DataFrame
    products_df = db.read_table("products")
    # Execute a custom SQL query
    query_result = db.query("SELECT * FROM products WHERE price < 20")
    # Get the database schema
    schema = db.get_schema()
2. Database Initialization
The SDIFDatabase constructor accepts parameters to control how the database is opened and initialized.
from sdif_db import SDIFDatabase
from pathlib import Path
# Create or open an existing SDIF file
db = SDIFDatabase(
    path="data.sdif",              # Path to the SDIF SQLite file (str or Path)
    overwrite=False,               # If True, overwrite the file if it exists
    read_only=False,               # If True, open in read-only mode
    schema_name="db1"              # Schema name when attached in connections
)
Parameters:
- `path`: The path to the SDIF SQLite file (string or `Path`).
- `overwrite`: If `True`, overwrite the file if it exists. Ignored when `read_only=True`. Default: `False`.
- `read_only`: If `True`, open in read-only mode. Will raise an error if the file doesn't exist. Default: `False`.
- `schema_name`: Schema name to use when the database is attached in a connection. Default: `"db1"`.
Initialization Behavior:
- If opened in read-only mode (`read_only=True`), the file must exist, and no write operations are allowed.
- If opened in read-write mode (default) and the file exists:
  - With `overwrite=True`: The existing file will be deleted and a new one created.
  - With `overwrite=False`: The existing file will be opened for reading and writing.
- If opened in read-write mode and the file doesn't exist, a new file will be created with the required SDIF metadata tables.
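These rules can be sketched in plain Python with `sqlite3` (a simplified illustration of the behavior described above, not the library's actual implementation; the `sdif_properties` table here merely stands in for the real SDIF metadata tables):

```python
import sqlite3
from pathlib import Path

def open_sdif(path, overwrite=False, read_only=False):
    """Illustrates the open/initialize decision rules described above."""
    p = Path(path)
    if read_only:
        if not p.exists():
            raise FileNotFoundError(f"Cannot open missing file read-only: {p}")
        # SQLite's URI mode=ro enforces read-only at the connection level.
        return sqlite3.connect(f"file:{p}?mode=ro", uri=True)
    if p.exists() and overwrite:
        p.unlink()  # delete the existing file and start fresh
    conn = sqlite3.connect(p)
    # A new (or freshly overwritten) file would get its SDIF metadata
    # tables created here; one bare table stands in for them in this sketch.
    conn.execute("CREATE TABLE IF NOT EXISTS sdif_properties (key TEXT, value TEXT)")
    return conn
```

Opening in `mode=ro` means even a buggy caller cannot write: SQLite itself rejects the statement.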
3. Adding Data
3.1 Adding a Data Source
Before adding any data to an SDIF file, you must register a source.
source_id = db.add_source(
    file_name="sales_2023.xlsx",   # Original file name
    file_type="xlsx",              # Original file type
    description="Sales data for 2023 fiscal year"  # Optional description
)
Parameters:
- `file_name`: The name of the original file.
- `file_type`: The type of the original file (e.g., "csv", "xlsx", "json").
- `description`: (Optional) Description of the source.
Returns:
- The `source_id` of the inserted source, which you'll need when adding tables, objects, or media.
3.2 Creating Tables
Create a table structure to store tabular data:
# Define columns with types and properties
columns = {
    "id": {"type": "INTEGER", "primary_key": True, "description": "Unique identifier"},
    "name": {"type": "TEXT", "not_null": True, "description": "Product name"},
    "category": {"type": "TEXT", "description": "Product category"},
    "price": {"type": "REAL", "description": "Current price in USD"},
    "in_stock": {"type": "INTEGER", "description": "Whether item is in stock", "original_column_name": "available"}
}
actual_table_name = db.create_table(
    table_name="products",         # Desired table name
    columns=columns,               # Column definitions
    source_id=source_id,           # Source ID from add_source
    description="Product catalog",  # Optional description
    original_identifier="Products", # Optional original identifier (e.g., sheet name)
    if_exists="fail"               # Behavior if table exists: 'fail', 'replace', or 'add'
)
Parameters:
- `table_name`: The name for the table (must not start with "sdif_").
- `columns`: Dictionary mapping column names to their properties.
- `source_id`: The source ID (from `add_source`).
- `description`: (Optional) Description of the table.
- `original_identifier`: (Optional) Original identifier for the table.
- `if_exists`: Strategy for handling table name conflicts:
  - `"fail"` (default): Raise an error if the table exists.
  - `"replace"`: Drop the existing table and recreate it.
  - `"add"`: Create a new table with a unique suffixed name (e.g., "table_name_1").
Column Definition Properties:
- `type`: SQLite type (e.g., "TEXT", "INTEGER", "REAL", "BLOB", "NUMERIC").
- `primary_key`: (Optional) If `True`, the column is part of the primary key.
- `not_null`: (Optional) If `True`, the column cannot contain NULL values.
- `unique`: (Optional) If `True`, column values must be unique.
- `foreign_key`: (Optional) Dictionary with keys:
  - `"table"`: Target table name.
  - `"column"`: Target column name.
  - `"on_delete"`: (Optional) Action on delete (e.g., "CASCADE", "SET NULL").
  - `"on_update"`: (Optional) Action on update.
- `description`: (Optional) Column description.
- `original_column_name`: (Optional) Original name of the column.
Returns:
- The actual name of the created table (which might differ from the requested name if `if_exists="add"` and a conflict occurred).
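The `"add"` strategy amounts to probing for the first free suffixed name. A hypothetical helper (`unique_table_name` is not part of the API) captures the idea:

```python
def unique_table_name(desired: str, existing: set[str]) -> str:
    """Return the desired name, or the first free suffixed variant if taken."""
    if desired not in existing:
        return desired
    i = 1
    while f"{desired}_{i}" in existing:
        i += 1
    return f"{desired}_{i}"
```

This is why capturing the return value of `create_table` matters: the table you actually got may be `products_1`, not `products`.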
3.3 Inserting Data
Insert data into a previously created table:
# Prepare data as a list of dictionaries
data = [
    {"id": 1, "name": "Widget A", "category": "Widgets", "price": 19.99, "in_stock": 1},
    {"id": 2, "name": "Gadget B", "category": "Gadgets", "price": 24.99, "in_stock": 1},
    {"id": 3, "name": "Thingamajig C", "category": "Misc", "price": 15.50, "in_stock": 0}
]
# Insert the data
db.insert_data(table_name="products", data=data)
Parameters:
- `table_name`: The name of the target table.
- `data`: List of dictionaries mapping column names to values.
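Under the hood, an insert like this maps naturally onto a parameterized `executemany`. A self-contained `sqlite3` sketch (a hypothetical helper, not the library's code) illustrates the pattern:

```python
import sqlite3

def insert_rows(conn: sqlite3.Connection, table_name: str, data: list[dict]) -> None:
    """Insert a list of dicts using one parameterized statement."""
    if not data:
        return
    columns = list(data[0].keys())
    placeholders = ", ".join("?" for _ in columns)
    sql = f'INSERT INTO "{table_name}" ({", ".join(columns)}) VALUES ({placeholders})'
    # executemany binds each row's values safely, avoiding SQL injection.
    conn.executemany(sql, [tuple(row[c] for c in columns) for row in data])
    conn.commit()
```

Parameter binding (the `?` placeholders) is what keeps arbitrary values, including quotes and NULs, safe to insert.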
3.4 Writing DataFrames Directly
When working with pandas DataFrames, you can use write_dataframe to handle table creation and data insertion in one step:
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Product A", "Product B", "Product C"],
    "price": [19.99, 24.99, 15.50],
    "in_stock": [True, True, False]
})
# Write the DataFrame to a table
db.write_dataframe(
    df=df,                         # DataFrame to write
    table_name="products",         # Desired table name
    source_id=source_id,           # Source ID
    description="Product catalog", # Optional description
    if_exists="fail",              # Behavior if table exists: 'fail', 'replace', 'append'
    columns_metadata={             # Optional metadata for columns
        "name": {"description": "Product name"},
        "price": {"description": "Price in USD", "original_column_name": "unit_price"}
    }
)
Parameters:
- `df`: The pandas DataFrame to write.
- `table_name`: The name for the new table.
- `source_id`: The source ID (from `add_source`).
- `description`: (Optional) Description for the table.
- `original_identifier`: (Optional) Original identifier for the table.
- `if_exists`: Behavior if the table already exists:
  - `"fail"` (default): Raise an error if the table exists.
  - `"replace"`: Drop the existing table and recreate it.
  - `"append"`: Append data to the existing table (not fully implemented yet).
- `columns_metadata`: (Optional) Dictionary with column metadata.
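When creating the table, column types must be inferred from the DataFrame's dtypes. A minimal mapping (an illustrative guess at such an inference step, not the library's actual rules) might look like:

```python
import pandas as pd

def sqlite_type_for(dtype) -> str:
    """Map a pandas dtype to a plausible SQLite column type."""
    if pd.api.types.is_bool_dtype(dtype) or pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"  # SQLite has no boolean type; booleans become 0/1
    if pd.api.types.is_float_dtype(dtype):
        return "REAL"
    return "TEXT"  # strings and anything else fall back to TEXT

df = pd.DataFrame({"id": [1], "price": [19.99], "name": ["A"], "in_stock": [True]})
types = {col: sqlite_type_for(df[col].dtype) for col in df.columns}
```

This also explains why `in_stock` round-trips as `1`/`0` integers rather than Python booleans.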
3.5 Adding JSON Objects
Store structured JSON data:
# Define a JSON object
config = {
    "settings": {
        "theme": "dark",
        "notifications": True,
        "refresh_interval": 60
    },
    "permissions": ["read", "write", "admin"]
}
# Add the JSON object to the SDIF file
db.add_object(
    object_name="app_config",      # Unique name
    json_data=config,              # JSON-serializable data
    source_id=source_id,           # Source ID
    description="Application configuration", # Optional description
    schema_hint={                  # Optional JSON schema
        "type": "object",
        "properties": {
            "settings": {"type": "object"},
            "permissions": {"type": "array", "items": {"type": "string"}}
        }
    }
)
Parameters:
- `object_name`: A unique name for the object.
- `json_data`: The data to store (will be converted to a JSON string).
- `source_id`: The source ID (from `add_source`).
- `description`: (Optional) Description of the object.
- `schema_hint`: (Optional) JSON schema as a dictionary.
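Internally the object is stored as a JSON string. A plain `sqlite3` round-trip shows the serialize-on-write, parse-on-read pattern that `parse_json=True` implies (the `objects` table layout here is hypothetical):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (object_name TEXT PRIMARY KEY, json_data TEXT)")

config = {"settings": {"theme": "dark"}, "permissions": ["read", "write"]}
# Serialize to a JSON string on the way in...
conn.execute("INSERT INTO objects VALUES (?, ?)", ("app_config", json.dumps(config)))

# ...and parse it back on the way out (what parse_json=True would do).
raw = conn.execute(
    "SELECT json_data FROM objects WHERE object_name = ?", ("app_config",)
).fetchone()[0]
restored = json.loads(raw)
```

Note that JSON round-trips turn tuples into lists and dictionary keys into strings, so store data accordingly.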
3.6 Adding Binary Media
Store binary data such as images, audio, or any other binary content:
# Read binary data
with open("logo.png", "rb") as f:
    image_data = f.read()
# Add the media to the SDIF file
db.add_media(
    media_name="company_logo",     # Unique name
    media_data=image_data,         # Binary data (bytes)
    media_type="image",            # Type: image, audio, video, binary
    source_id=source_id,           # Source ID
    description="Company logo image", # Optional description
    original_format="png",         # Optional format information
    technical_metadata={           # Optional technical details
        "width": 512,
        "height": 512,
        "color_mode": "RGB"
    }
)
Parameters:
- `media_name`: A unique name for the media.
- `media_data`: The binary data (must be bytes).
- `media_type`: The type of media (e.g., "image", "audio", "video", "binary").
- `source_id`: The source ID (from `add_source`).
- `description`: (Optional) Description of the media.
- `original_format`: (Optional) Original format (e.g., "png", "mp3").
- `technical_metadata`: (Optional) Technical metadata as a dictionary.
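Binary payloads map naturally onto SQLite BLOB columns. A minimal round-trip (with a hypothetical `media` table layout) shows that the bytes come back unchanged:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (media_name TEXT PRIMARY KEY, media_data BLOB)")

# A stand-in for real image bytes (the first bytes of a PNG header plus padding).
payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
conn.execute("INSERT INTO media VALUES (?, ?)", ("company_logo", payload))

row = conn.execute(
    "SELECT media_data FROM media WHERE media_name = ?", ("company_logo",)
).fetchone()
retrieved = row[0]  # comes back as bytes, byte-for-byte identical
```

This is why `media_data` must be `bytes`: passing a `str` would store TEXT, not a BLOB.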
3.7 Adding Semantic Links
Create semantic relationships between elements:
# Create a semantic link (e.g., relating a product to its image)
db.add_semantic_link(
    link_type="reference",         # Type of link
    from_element_type="table",     # Source element type
    from_element_spec={            # Source element specification
        "table": "products",
        "column": "id",
        "value": 1
    },
    to_element_type="media",       # Target element type
    to_element_spec={              # Target element specification
        "media_name": "product_1_image"
    },
    description="Product image reference" # Optional description
)
Parameters:
- `link_type`: The type of link (e.g., "annotation", "reference", "logical_foreign_key").
- `from_element_type`: Type of the source element (one of "table", "column", "object", "media", "json_path", "source").
- `from_element_spec`: Specification of the source element (as a dictionary).
- `to_element_type`: Type of the target element (same set of values).
- `to_element_spec`: Specification of the target element (as a dictionary).
- `description`: (Optional) Description of the semantic link.
4. Reading Data
4.1 Reading Tables
Read tabular data into a pandas DataFrame:
# Read an entire table
products_df = db.read_table("products")
# Process the data
filtered_products = products_df[products_df["price"] < 20]
Parameters:
- `table_name`: The name of the table to read.
Returns:
- A pandas DataFrame containing the table data.
4.2 Querying with SQL
Execute SQL queries for custom data retrieval:
# Execute a query and get results as a DataFrame
results_df = db.query(
    plain_sql="SELECT p.name, p.price FROM products p WHERE p.category = 'Widgets' ORDER BY p.price DESC",
    return_format="dataframe"  # Options: "dataframe" or "dict"
)
# Execute a query and get results as a list of dictionaries
results_dict = db.query(
    plain_sql="SELECT COUNT(*) as count, category FROM products GROUP BY category",
    return_format="dict"
)
Parameters:
- `plain_sql`: The SQL query string to execute (SELECT statements only).
- `return_format`: The desired return format:
  - `"dataframe"` (default): Returns a pandas DataFrame.
  - `"dict"`: Returns a list of dictionary rows.
Returns:
- Results in the specified format (pandas DataFrame or list of dictionaries).

Security Note:
- The `query` method performs checks to ensure only read-only operations are executed, preventing modifications to the database.
- Opening the database in read-only mode (`read_only=True`) provides the strongest protection against unintended changes.
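One plausible way to enforce such a read-only guard is SQLite's authorizer callback, which vetoes write operations at the engine level (a sketch of the general technique, not necessarily how SDIFDatabase implements its checks):

```python
import sqlite3

def read_only_authorizer(action, *args):
    """Deny any operation except reads; SQLite consults this per statement."""
    allowed = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}
    return sqlite3.SQLITE_OK if action in allowed else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
conn.set_authorizer(read_only_authorizer)  # from here on, reads only
```

With the authorizer installed, SELECT statements succeed while INSERT/UPDATE/DELETE statements fail to even prepare, which is stronger than string-matching the SQL.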
4.3 Retrieving JSON Objects
Get stored JSON objects:
# Get and parse an object
app_config = db.get_object(
    object_name="app_config",      # Object name
    parse_json=True                # Parse JSON string to Python object
)
# Access the data
if app_config:
    theme = app_config["json_data"]["settings"]["theme"]
    permissions = app_config["json_data"]["permissions"]
Parameters:
- `object_name`: The name of the object to retrieve.
- `parse_json`: If `True` (default), parse the JSON data into Python objects.
Returns:
- A dictionary containing object data and metadata, including:
  - `"json_data"`: The stored data (parsed if `parse_json=True`).
  - `"source_id"`: The source ID.
  - `"description"`: Object description.
  - `"schema_hint"`: Schema information (parsed if `parse_json=True`).
4.4 Retrieving Binary Media
Get stored binary media:
# Get media item
logo = db.get_media(
    media_name="company_logo",     # Media name
    parse_json=True                # Parse technical_metadata JSON
)
# Use the binary data
if logo:
    image_bytes = logo["media_data"]
    image_format = logo["original_format"]
    width = logo["technical_metadata"]["width"]
    # Example: save to file
    with open(f"retrieved_logo.{image_format}", "wb") as f:
        f.write(image_bytes)
Parameters:
- `media_name`: The name of the media item to retrieve.
- `parse_json`: If `True` (default), parse the technical metadata into a Python object.
Returns:
- A dictionary containing media data and metadata, including:
  - `"media_data"`: The binary data (as bytes).
  - `"source_id"`: The source ID.
  - `"media_type"`: Type of media.
  - `"description"`: Media description.
  - `"original_format"`: Original format information.
  - `"technical_metadata"`: Technical metadata (parsed if `parse_json=True`).
5. Metadata and Analysis
5.1 Listing Resources
Get information about available resources:
# List all sources
sources = db.list_sources()
# List all tables
tables = db.list_tables()
# List all objects
objects = db.list_objects()
# List all media items
media_items = db.list_media()
# List all semantic links
links = db.list_semantic_links()
5.2 Getting Schema Information
Retrieve the complete database schema:
# Get the full database schema
schema = db.get_schema()
# Access schema components
sdif_properties = schema["sdif_properties"]
sources = schema["sources"]
table_info = schema["tables"]["products"]
objects_info = schema["objects"]
media_info = schema["media"]
links = schema["semantic_links"]
# Example: Get column information for a specific table
columns = schema["tables"]["products"]["columns"]
foreign_keys = schema["tables"]["products"]["foreign_keys"]
Returns:
- A comprehensive dictionary with the complete database structure, including:
  - Global properties
  - Sources
  - Tables (with columns, types, constraints, metadata)
  - Objects metadata (excluding actual data)
  - Media metadata (excluding binary data)
  - Semantic links
5.3 Data Sampling and Analysis
Generate statistical analysis of the data:
# Get sample data and analysis
analysis = db.get_sample_analysis(
    num_sample_rows=5,             # Number of sample rows per table
    top_n_common_values=10,        # Number of most common values to report
    include_objects=True,          # Include object metadata
    include_media=True             # Include media metadata
)
# Access analysis components
tables_analysis = analysis["tables"]
products_analysis = analysis["tables"]["products"]
sample_rows = products_analysis["sample_rows"]
column_stats = products_analysis["column_analysis"]["price"]
# Check numeric statistics for a column (if available)
if "numeric_summary" in column_stats:
    min_price = column_stats["numeric_summary"]["min"]
    max_price = column_stats["numeric_summary"]["max"]
    avg_price = column_stats["numeric_summary"]["mean"]
Parameters:
- `num_sample_rows`: The number of random rows to sample from each table.
- `top_n_common_values`: The number of most frequent distinct values to report per column.
- `include_objects`: If `True`, includes a list of object names and descriptions.
- `include_media`: If `True`, includes a list of media names and descriptions.
Returns:
- A detailed analysis dictionary containing:
  - Table information (row counts, sample rows)
  - Column analyses (types, null percentages, distinct counts, common values)
  - Numeric summaries for numeric columns (min, max, mean, median, etc.)
  - Optional object and media information
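The per-column analysis can be approximated with plain pandas. This sketch (with field names chosen to mirror the description above, not guaranteed to match the actual output keys) shows where each statistic comes from:

```python
import pandas as pd

def analyze_column(series: pd.Series, top_n: int = 3) -> dict:
    """Approximate the per-column statistics described above."""
    info = {
        "null_percentage": round(100 * series.isna().mean(), 2),
        "distinct_count": int(series.nunique()),
        "common_values": series.value_counts().head(top_n).to_dict(),
    }
    if pd.api.types.is_numeric_dtype(series):
        # Numeric summaries only make sense for numeric columns.
        info["numeric_summary"] = {
            "min": series.min(), "max": series.max(),
            "mean": series.mean(), "median": series.median(),
        }
    return info

prices = pd.Series([19.99, 24.99, 15.50, 24.99], name="price")
stats = analyze_column(prices)
```

Text columns simply omit the numeric summary, which is why the documentation example checks for `"numeric_summary"` before reading it.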
6. Error Handling
SDIFDatabase provides specific exceptions for various error conditions:
import sqlite3
from sdif_db import SDIFDatabase
try:
    with SDIFDatabase("data.sdif") as db:
        # Attempt operations
        source_id = db.add_source("data.csv", "csv")
        db.create_table("products", {...}, source_id)
except FileNotFoundError:
    # Handle case where file doesn't exist in read_only mode
    print("The SDIF file was not found.")
except PermissionError:
    # Handle permission errors (e.g., writing to read_only database)
    print("Operation not permitted. Check if database is read-only.")
except ValueError as e:
    # Handle validation errors (e.g., invalid table/column names)
    print(f"Validation error: {e}")
except sqlite3.IntegrityError as e:
    # Handle integrity constraint violations
    print(f"Data integrity error: {e}")
except sqlite3.Error as e:
    # Handle other SQLite errors
    print(f"Database error: {e}")
except Exception as e:
    # Catch other unexpected errors
    print(f"Unexpected error: {e}")
    # Potentially log the full stack trace for debugging
Common Exception Types:
- `FileNotFoundError`: When trying to open a non-existent file in read-only mode.
- `PermissionError`: When attempting to write to a read-only database.
- `ValueError`: For validation errors in input parameters.
- `TypeError`: For type-related errors (e.g., providing bytes where str is expected).
- `sqlite3.IntegrityError`: For constraint violations (e.g., duplicate unique values).
- `sqlite3.OperationalError`: For SQL syntax errors or operational issues.
- `sqlite3.Error`: General SQLite database errors.
7. Database Lifecycle Management
7.1 Opening and Closing
Best practices for managing database connections:
# Recommended: Using the context manager
with SDIFDatabase("data.sdif") as db:
    # Operations inside the block
    db.add_source(...)
    # Connection is automatically closed when leaving the block
# Alternative: Manual opening and closing
db = SDIFDatabase("data.sdif")
try:
    # Operations
    db.add_source(...)
finally:
    # Always close the connection
    db.close()
7.2 Modifications to Existing Data
Handle changes to existing database content:
# Drop a table and its metadata
db.drop_table("outdated_table")
# Replace a table with new data
db.create_table("products", columns, source_id, if_exists="replace")
# Note: db.query() accepts read-only SELECT statements only, so in-place
# updates are typically done by rebuilding the table, e.g. with
# create_table(..., if_exists="replace") followed by insert_data.