SDIF

Standardized Data Interoperable Format (SDIF) is the standardized intermediate representation at the core of SATIF, implemented as a SQLite database file.

Core Design Principle

SDIF eliminates format styling from source files to focus exclusively on the substantive data and relationships.

It aims to:

Preserve source information while imposing structure
Create a consistent interface for transformation logic

Data Organization

SDIF organizes content into three fundamental data categories:

1. Tables

Tables represent normalized, schematized data like:

Rows from CSV/Excel sheets
Extracted tabular data from PDFs
Normalized collections from JSON/XML

Every table includes full metadata about its origin and structure, with columns properly typed and constrained (PRIMARY KEY, FOREIGN KEY, etc.).

2. Objects (JSON)

Non-tabular structured content stored as queryable JSON:

Document sections and paragraphs
Headers, footers, and page metadata
Sparse, irregular, or hierarchical data
Configuration blocks and property collections

Objects maintain their internal structure while gaining rich queryability through SQLite's JSON functions.

3. Media Assets

Binary content preserved with contextual metadata:

Images embedded in documents
Audio/video content
PDF page renderings
Other binary resources

Each asset is stored alongside technical metadata (dimensions, format, etc.) and semantic descriptions.

Metadata Layer

SDIF enriches raw data with comprehensive metadata:

Source Attribution: Every data element tracks its original source file and location
Semantic Descriptions: AI-generated explanations of data meaning
Technical Context: Data types, formats and constraints
Annotations: Additional insights attached to any data element

This metadata enables AI to understand both structure and meaning, facilitating more accurate transformations.

Relationship Model

SDIF captures both explicit and implicit relationships between data elements:

SQL Foreign Keys: Native database relationships between tables
Semantic Links: Connections between heterogeneous elements (e.g., an object referencing a table)

Technical Implementation

The format consists of system tables with a sdif_ prefix (objects, metadata, sources, relationships) and data tables without this prefix. Please refer to the RFC for more detailed information.

Why SQLite ?

SDIF's SQLite foundation offers key advantages for the transformation pipeline:

LLM Fluent: SQL is well-understood by LLMs, enabling them to efficiently query and transform data
Memory Efficiency: SQLite reads only required data, it does not load the entire dataset into memory
Targeted Access: SQL queries extract only needed data without parsing entire files
Transfer Efficiency: Single-file design simplifies transmission between systems
Compression Ratio: 2-4x reduction with standard compression tools (gzip, zstd)
Cross-File Joins: Multiple SQLite files can be attached and queried together
Reliability: Battle-tested, zero-configuration embedded database with no external dependencies

Core Design Principle​

Data Organization​

1. Tables​

2. Objects (JSON)​

3. Media Assets​

Metadata Layer​

Relationship Model​

Technical Implementation​

Why SQLite ?​