SDIF Database

The SDIFDatabase is a component for creating, managing, and interacting with SDIF (.sqlite) files. It provides a comprehensive API for storing structured tabular data, JSON objects, binary media, and semantic relationships of a datasource in a single SQLite-based file.

1. Basic Usage

The simplest way to use SDIFDatabase is through the context manager pattern:

from sdif_db import SDIFDatabase
from pathlib import Path
import pandas as pd

# Creating a new SDIF database
with SDIFDatabase(path="my_dataset.sdif", overwrite=True) as db:
    # Add a data source
    source_id = db.add_source(file_name="original_data.csv", file_type="csv",
                              description="Sales data from Q1 2023")

    # Create a table with your data
    df = pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["Product A", "Product B", "Product C"],
        "price": [19.99, 24.99, 15.50]
    })

    # Write the data to an SDIF table
    db.write_dataframe(df=df, table_name="products", source_id=source_id,
                       description="Product catalog")

# Reading from an existing SDIF database
with SDIFDatabase(path="my_dataset.sdif", read_only=True) as db:
    # Read a table into a DataFrame
    products_df = db.read_table("products")

    # Execute a custom SQL query
    query_result = db.query("SELECT * FROM products WHERE price < 20")

    # Get the database schema
    schema = db.get_schema()

2. Database Initialization

The SDIFDatabase constructor accepts parameters to control how the database is opened and initialized.

from sdif_db import SDIFDatabase
from pathlib import Path

# Create or open an existing SDIF file
db = SDIFDatabase(
    path="data.sdif",              # Path to the SDIF SQLite file (str or Path)
    overwrite=False,               # If True, overwrite the file if it exists
    read_only=False,               # If True, open in read-only mode
    schema_name="db1"              # Schema name when attached in connections
)

Parameters:

path: The path to the SDIF SQLite file (string or Path).
overwrite: If True, overwrite the file if it exists. Ignored when read_only=True. Default: False.
read_only: If True, open in read-only mode. Will raise an error if the file doesn't exist. Default: False.
schema_name: Schema name to use when the database is attached in a connection. Default: "db1".

Initialization Behavior:

If opened in read-only mode (read_only=True), the file must exist, and no write operations are allowed.
If opened in read-write mode (default) and the file exists:
- With overwrite=True: The existing file will be deleted and a new one created.
- With overwrite=False: The existing file will be opened for reading and writing.
If opened in read-write mode and the file doesn't exist, a new file will be created with the required SDIF metadata tables.

3. Adding Data

3.1 Adding a Data Source

Before adding any data to an SDIF file, you must register a source.

source_id = db.add_source(
    file_name="sales_2023.xlsx",   # Original file name
    file_type="xlsx",              # Original file type
    description="Sales data for 2023 fiscal year"  # Optional description
)

Parameters:

file_name: The name of the original file.
file_type: The type of the original file (e.g., "csv", "xlsx", "json").
description: (Optional) Description of the source.

Returns:

The source_id of the inserted source, which you'll need when adding tables, objects, or media.

3.2 Creating Tables

Create a table structure to store tabular data:

# Define columns with types and properties
columns = {
    "id": {"type": "INTEGER", "primary_key": True, "description": "Unique identifier"},
    "name": {"type": "TEXT", "not_null": True, "description": "Product name"},
    "category": {"type": "TEXT", "description": "Product category"},
    "price": {"type": "REAL", "description": "Current price in USD"},
    "in_stock": {"type": "INTEGER", "description": "Whether item is in stock", "original_column_name": "available"}
}

actual_table_name = db.create_table(
    table_name="products",         # Desired table name
    columns=columns,               # Column definitions
    source_id=source_id,           # Source ID from add_source
    description="Product catalog",  # Optional description
    original_identifier="Products", # Optional original identifier (e.g., sheet name)
    if_exists="fail"               # Behavior if table exists: 'fail', 'replace', or 'add'
)

Parameters:

table_name: The name for the table (must not start with "sdif_").
columns: Dictionary mapping column names to their properties.
source_id: The source ID (from add_source).
description: (Optional) Description of the table.
original_identifier: (Optional) Original identifier for the table.
if_exists: Strategy for handling table name conflicts:
- "fail" (default): Raise an error if table exists.
- "replace": Drop existing table and recreate.
- "add": Create a new table with a unique suffixed name (e.g., "table_name_1").

Column Definition Properties:

type: SQLite type (e.g., "TEXT", "INTEGER", "REAL", "BLOB", "NUMERIC").
primary_key: (Optional) If True, column is part of primary key.
not_null: (Optional) If True, column cannot contain NULL values.
unique: (Optional) If True, column values must be unique.
foreign_key: (Optional) Dictionary with keys:
- "table": Target table name
- "column": Target column name
- "on_delete": (Optional) Action on delete (e.g., "CASCADE", "SET NULL")
- "on_update": (Optional) Action on update
description: (Optional) Column description.
original_column_name: (Optional) Original name of the column.

Returns:

The actual name of the created table (which might be different if if_exists="add" and a conflict occurred).

3.3 Inserting Data

Insert data into a previously created table:

# Prepare data as a list of dictionaries
data = [
    {"id": 1, "name": "Widget A", "category": "Widgets", "price": 19.99, "in_stock": 1},
    {"id": 2, "name": "Gadget B", "category": "Gadgets", "price": 24.99, "in_stock": 1},
    {"id": 3, "name": "Thingamajig C", "category": "Misc", "price": 15.50, "in_stock": 0}
]

# Insert the data
db.insert_data(table_name="products", data=data)

Parameters:

table_name: The name of the target table.
data: List of dictionaries mapping column names to values.

3.4 Writing DataFrames Directly

When working with pandas DataFrames, you can use write_dataframe to handle table creation and data insertion in one step:

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Product A", "Product B", "Product C"],
    "price": [19.99, 24.99, 15.50],
    "in_stock": [True, True, False]
})

# Write the DataFrame to a table
db.write_dataframe(
    df=df,                         # DataFrame to write
    table_name="products",         # Desired table name
    source_id=source_id,           # Source ID
    description="Product catalog", # Optional description
    if_exists="fail",              # Behavior if table exists: 'fail', 'replace', 'append'
    columns_metadata={             # Optional metadata for columns
        "name": {"description": "Product name"},
        "price": {"description": "Price in USD", "original_column_name": "unit_price"}
    }
)

Parameters:

df: The pandas DataFrame to write.
table_name: The name for the new table.
source_id: The source ID (from add_source).
description: (Optional) Description for the table.
original_identifier: (Optional) Original identifier for the table.
if_exists: Behavior if the table already exists:
- "fail" (default): Raise an error if table exists.
- "replace": Drop existing table and recreate.
- "append": Append data to existing table (not fully implemented yet).
columns_metadata: (Optional) Dictionary with column metadata.

3.5 Adding JSON Objects

Store structured JSON data:

# Define a JSON object
config = {
    "settings": {
        "theme": "dark",
        "notifications": True,
        "refresh_interval": 60
    },
    "permissions": ["read", "write", "admin"]
}

# Add the JSON object to the SDIF file
db.add_object(
    object_name="app_config",      # Unique name
    json_data=config,              # JSON-serializable data
    source_id=source_id,           # Source ID
    description="Application configuration", # Optional description
    schema_hint={                  # Optional JSON schema
        "type": "object",
        "properties": {
            "settings": {"type": "object"},
            "permissions": {"type": "array", "items": {"type": "string"}}
        }
    }
)

Parameters:

object_name: A unique name for the object.
json_data: The data to store (will be converted to a JSON string).
source_id: The source ID (from add_source).
description: (Optional) Description of the object.
schema_hint: (Optional) JSON schema as a dictionary.

3.6 Adding Binary Media

Store binary data such as images, audio, or any other binary content:

# Read binary data
with open("logo.png", "rb") as f:
    image_data = f.read()

# Add the media to the SDIF file
db.add_media(
    media_name="company_logo",     # Unique name
    media_data=image_data,         # Binary data (bytes)
    media_type="image",            # Type: image, audio, video, binary
    source_id=source_id,           # Source ID
    description="Company logo image", # Optional description
    original_format="png",         # Optional format information
    technical_metadata={           # Optional technical details
        "width": 512,
        "height": 512,
        "color_mode": "RGB"
    }
)

Parameters:

media_name: A unique name for the media.
media_data: The binary data (must be bytes).
media_type: The type of media (e.g., "image", "audio", "video", "binary").
source_id: The source ID (from add_source).
description: (Optional) Description of the media.
original_format: (Optional) Original format (e.g., "png", "mp3").
technical_metadata: (Optional) Technical metadata as a dictionary.

3.7 Adding Semantic Links

Create semantic relationships between elements:

# Create a semantic link (e.g., relating a product to its image)
db.add_semantic_link(
    link_type="reference",         # Type of link
    from_element_type="table",     # Source element type
    from_element_spec={            # Source element specification
        "table": "products",
        "column": "id",
        "value": 1
    },
    to_element_type="media",       # Target element type
    to_element_spec={              # Target element specification
        "media_name": "product_1_image"
    },
    description="Product image reference" # Optional description
)

Parameters:

link_type: The type of link (e.g., "annotation", "reference", "logical_foreign_key").
from_element_type: Type of source element (one of "table", "column", "object", "media", "json_path", "source").
from_element_spec: Specification of the source element (as a dictionary).
to_element_type: Type of target element (one of "table", "column", "object", "media", "json_path", "source").
to_element_spec: Specification of the target element (as a dictionary).
description: (Optional) Description of the semantic link.

4. Reading Data

4.1 Reading Tables

Read tabular data into a pandas DataFrame:

# Read an entire table
products_df = db.read_table("products")

# Process the data
filtered_products = products_df[products_df["price"] < 20]

Parameters:

table_name: The name of the table to read.

Returns:

A pandas DataFrame containing the table data.

4.2 Querying with SQL

Execute SQL queries for custom data retrieval:

# Execute a query and get results as a DataFrame
results_df = db.query(
    plain_sql="SELECT p.name, p.price FROM products p WHERE p.category = 'Widgets' ORDER BY p.price DESC",
    return_format="dataframe"  # Options: "dataframe" or "dict"
)

# Execute a query and get results as a list of dictionaries
results_dict = db.query(
    plain_sql="SELECT COUNT(*) as count, category FROM products GROUP BY category",
    return_format="dict"
)

Parameters:

plain_sql: The SQL query string to execute (SELECT statements only).
return_format: The desired return format:
- "dataframe" (default): Returns a pandas DataFrame.
- "dict": Returns a list of dictionary rows.

Returns:

Results in the specified format (pandas DataFrame or list of dictionaries).

Security Note:

The query method performs checks to ensure only read-only operations are executed, preventing modifications to the database.
Opening the database in read-only mode (read_only=True) provides the strongest protection against unintended changes.

4.3 Retrieving JSON Objects

Get stored JSON objects:

# Get and parse an object
app_config = db.get_object(
    object_name="app_config",      # Object name
    parse_json=True                # Parse JSON string to Python object
)

# Access the data
if app_config:
    theme = app_config["json_data"]["settings"]["theme"]
    permissions = app_config["json_data"]["permissions"]

Parameters:

object_name: The name of the object to retrieve.
parse_json: If True (default), parse the JSON data into Python objects.

Returns:

A dictionary containing object data and metadata, including:
- "json_data": The stored data (parsed if parse_json=True).
- "source_id": The source ID.
- "description": Object description.
- "schema_hint": Schema information (parsed if parse_json=True).

4.4 Retrieving Binary Media

Get stored binary media:

# Get media item
logo = db.get_media(
    media_name="company_logo",     # Media name
    parse_json=True                # Parse technical_metadata JSON
)

# Use the binary data
if logo:
    image_bytes = logo["media_data"]
    image_format = logo["original_format"]
    width = logo["technical_metadata"]["width"]

    # Example: save to file
    with open(f"retrieved_logo.{image_format}", "wb") as f:
        f.write(image_bytes)

Parameters:

media_name: The name of the media item to retrieve.
parse_json: If True (default), parse the technical metadata into a Python object.

Returns:

A dictionary containing media data and metadata, including:
- "media_data": The binary data (as bytes).
- "source_id": The source ID.
- "media_type": Type of media.
- "description": Media description.
- "original_format": Original format information.
- "technical_metadata": Technical metadata (parsed if parse_json=True).

5. Metadata and Analysis

5.1 Listing Resources

Get information about available resources:

# List all sources
sources = db.list_sources()

# List all tables
tables = db.list_tables()

# List all objects
objects = db.list_objects()

# List all media items
media_items = db.list_media()

# List all semantic links
links = db.list_semantic_links()

5.2 Getting Schema Information

Retrieve the complete database schema:

# Get the full database schema
schema = db.get_schema()

# Access schema components
sdif_properties = schema["sdif_properties"]
sources = schema["sources"]
table_info = schema["tables"]["products"]
objects_info = schema["objects"]
media_info = schema["media"]
links = schema["semantic_links"]

# Example: Get column information for a specific table
columns = schema["tables"]["products"]["columns"]
foreign_keys = schema["tables"]["products"]["foreign_keys"]

Returns:

A comprehensive dictionary with the complete database structure, including:
- Global properties
- Sources
- Tables (with columns, types, constraints, metadata)
- Objects metadata (excluding actual data)
- Media metadata (excluding binary data)
- Semantic links

5.3 Data Sampling and Analysis

Generate statistical analysis of the data:

# Get sample data and analysis
analysis = db.get_sample_analysis(
    num_sample_rows=5,             # Number of sample rows per table
    top_n_common_values=10,        # Number of most common values to report
    include_objects=True,          # Include object metadata
    include_media=True             # Include media metadata
)

# Access analysis components
tables_analysis = analysis["tables"]
products_analysis = analysis["tables"]["products"]
sample_rows = products_analysis["sample_rows"]
column_stats = products_analysis["column_analysis"]["price"]

# Check numeric statistics for a column (if available)
if "numeric_summary" in column_stats:
    min_price = column_stats["numeric_summary"]["min"]
    max_price = column_stats["numeric_summary"]["max"]
    avg_price = column_stats["numeric_summary"]["mean"]

Parameters:

num_sample_rows: The number of random rows to sample from each table.
top_n_common_values: The number of most frequent distinct values to report per column.
include_objects: If True, includes a list of object names and descriptions.
include_media: If True, includes a list of media names and descriptions.

Returns:

A detailed analysis dictionary containing:
- Table information (row counts, sample rows)
- Column analyses (types, null percentages, distinct counts, common values)
- Numeric summaries for numeric columns (min, max, mean, median, etc.)
- Optional object and media information

6. Error Handling

SDIFDatabase provides specific exceptions for various error conditions:

from sdif_db import SDIFDatabase
from pathlib import Path

try:
    with SDIFDatabase("data.sdif") as db:
        # Attempt operations
        source_id = db.add_source("data.csv", "csv")
        db.create_table("products", {...}, source_id)

except FileNotFoundError:
    # Handle case where file doesn't exist in read_only mode
    print("The SDIF file was not found.")

except PermissionError:
    # Handle permission errors (e.g., writing to read_only database)
    print("Operation not permitted. Check if database is read-only.")

except ValueError as e:
    # Handle validation errors (e.g., invalid table/column names)
    print(f"Validation error: {e}")

except sqlite3.IntegrityError as e:
    # Handle integrity constraint violations
    print(f"Data integrity error: {e}")

except sqlite3.Error as e:
    # Handle other SQLite errors
    print(f"Database error: {e}")

except Exception as e:
    # Catch other unexpected errors
    print(f"Unexpected error: {e}")
    # Potentially log the full stack trace for debugging

Common Exception Types:

FileNotFoundError: When trying to open a non-existent file in read-only mode.
PermissionError: When attempting to write to a read-only database.
ValueError: For validation errors in input parameters.
TypeError: For type-related errors (e.g., providing bytes where str is expected).
sqlite3.IntegrityError: For constraint violations (e.g., duplicate unique values).
sqlite3.OperationalError: For SQL syntax errors or operational issues.
sqlite3.Error: General SQLite database errors.

7. Database Lifecycle Management

7.1 Opening and Closing

Best practices for managing database connections:

# Recommended: Using the context manager
with SDIFDatabase("data.sdif") as db:
    # Operations inside the block
    db.add_source(...)
    # Connection is automatically closed when leaving the block

# Alternative: Manual opening and closing
db = SDIFDatabase("data.sdif")
try:
    # Operations
    db.add_source(...)
finally:
    # Always close the connection
    db.close()

7.2 Modifications to Existing Data

Handle changes to existing database content:

# Drop a table and its metadata
db.drop_table("outdated_table")

# Replace a table with new data
db.create_table("products", columns, source_id, if_exists="replace")

# Update data (using standard SQL)
db.query("UPDATE products SET price = price * 1.1 WHERE category = 'Widgets'")

1. Basic Usage​

2. Database Initialization​

3. Adding Data​

3.1 Adding a Data Source​

3.2 Creating Tables​

3.3 Inserting Data​

3.4 Writing DataFrames Directly​

3.5 Adding JSON Objects​

3.6 Adding Binary Media​

3.7 Adding Semantic Links​

4. Reading Data​

4.1 Reading Tables​

4.2 Querying with SQL​

4.3 Retrieving JSON Objects​

4.4 Retrieving Binary Media​

5. Metadata and Analysis​

5.1 Listing Resources​

5.2 Getting Schema Information​

5.3 Data Sampling and Analysis​

6. Error Handling​

7. Database Lifecycle Management​

7.1 Opening and Closing​

7.2 Modifications to Existing Data​

1. Basic Usage

2. Database Initialization

3. Adding Data

3.1 Adding a Data Source

3.2 Creating Tables

3.3 Inserting Data

3.4 Writing DataFrames Directly

3.5 Adding JSON Objects

3.6 Adding Binary Media

3.7 Adding Semantic Links

4. Reading Data

4.1 Reading Tables

4.2 Querying with SQL

4.3 Retrieving JSON Objects

4.4 Retrieving Binary Media

5. Metadata and Analysis

5.1 Listing Resources

5.2 Getting Schema Information

5.3 Data Sampling and Analysis

6. Error Handling

7. Database Lifecycle Management

7.1 Opening and Closing

7.2 Modifications to Existing Data