Reader Registry System¶

This document explains Dift’s centralized dataset reader registry architecture, connector routing system, and extensibility model.

The reader registry is foundational to Dift’s internal architecture: new connectors can be added without changing the comparison engine, which stays connector-agnostic.

Why the Reader Registry Exists¶

Earlier versions of Dift handled dataset routing directly inside a single loading workflow.

As connector support expanded, this created several architectural problems:

connector logic became tightly coupled
routing logic became difficult to maintain
validation behavior became duplicated
adding connectors required core modifications
plugin preparation became difficult

The reader registry architecture solves these issues by introducing a centralized routing and registration system.

Design Goals¶

The reader registry system was designed to provide:

centralized connector routing
modular dataset readers
reusable validation workflows
dynamic reader registration
future plugin preparation
connector isolation
scalable connector architecture

High-Level Architecture¶

                 ┌────────────────────┐
                 │   CLI / Config     │
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐
                 │  Reader Registry   │
                 └─────────┬──────────┘
                           ▼
        ┌────────────────────────────────────┐
        │              Readers              │
        ├────────────────────────────────────┤
        │ LocalFileReader                   │
        │ DuckDBReader                      │
        │ SQLReader                         │
        │ BigQueryReader                    │
        └────────────────────────────────────┘
                           ▼
                 ┌────────────────────┐
                 │ Polars DataFrame   │
                 └────────────────────┘

Core Files¶

The registry architecture is implemented primarily in:

dift/io/
├── base_reader.py
├── registry.py
├── readers.py
├── sql_reader.py
├── duckdb_reader.py
└── bigquery_reader.py

Base Reader Interface¶

File:

dift/io/base_reader.py

All dataset readers implement a shared interface.

Example:

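import polars as pl


class BaseReader:
    def can_handle(self, source: str) -> bool:
        """Return True if this reader supports the given source."""
        ...

    def read(self, source: str) -> pl.DataFrame:
        """Load the source and return it as a Polars DataFrame."""
        ...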

Why a Shared Interface Matters¶

The shared interface standardizes:

connector behavior
routing logic
dataset loading contracts
validation expectations

This allows the registry to treat all readers consistently.

Reader Responsibilities¶

Each reader is responsible for:

determining whether it supports a source
validating connector input
loading datasets
raising actionable errors

Readers should NOT:

perform comparisons
generate reports
calculate risk
handle CLI orchestration
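
To make this division of responsibilities concrete, here is a minimal sketch of a reader that covers all four duties. `CsvReader` and `ReaderError` are hypothetical names used only for illustration and are not part of Dift. The sketch uses only the standard library and returns a list of rows so that it runs without extra dependencies, whereas a real Dift reader returns a polars.DataFrame.

```python
import csv
from pathlib import Path


class ReaderError(Exception):
    """Hypothetical error type for actionable reader failures."""


class CsvReader:
    """Illustrative reader only; this is not Dift's LocalFileReader."""

    def can_handle(self, source: str) -> bool:
        # Duty 1: decide whether this reader supports the source.
        return source.lower().endswith(".csv")

    def read(self, source: str) -> list[dict[str, str]]:
        path = Path(source)
        # Duty 2: validate the input before touching any data.
        if not path.is_file():
            # Duty 4: say what went wrong and how to fix it.
            raise ReaderError(
                f"CSV file not found: {path}. "
                "Check the path or pass an absolute path."
            )
        # Duty 3: load the dataset. Rows are a stand-in here; a real
        # Dift reader would return a polars.DataFrame instead.
        with path.open(newline="") as handle:
            return list(csv.DictReader(handle))
```

Note what is absent: nothing in the class compares datasets, formats reports, scores risk, or parses CLI arguments.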

Reader Registry¶

File:

dift/io/registry.py

The registry acts as the central connector routing system.

Responsibilities include:

registering readers
discovering compatible readers
prioritizing readers
centralizing routing logic

Registry Example¶

Example registration:

registry = ReaderRegistry()

registry.register(LocalFileReader())
registry.register(SQLReader())
registry.register(DuckDBReader())

Reader Resolution Example¶

Example routing workflow:

reader = registry.get_reader(source)
df = reader.read(source)

Reader Discovery Flow¶

Current routing flow:

Source Input
      ↓
Registry Iteration
      ↓
Reader.can_handle()
      ↓
Compatible Reader
      ↓
Reader.read()
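
The flow above can be sketched as a first-match loop. The method names `register` and `get_reader` come from the examples earlier in this document. The internals, the `NoReaderError` type, and the `PrefixReader` stub are assumptions made for illustration, not Dift's actual implementation.

```python
from typing import Protocol


class Reader(Protocol):
    """Structural type matching the shared reader interface."""

    def can_handle(self, source: str) -> bool: ...

    def read(self, source: str): ...


class NoReaderError(LookupError):
    """Hypothetical error raised when no reader accepts a source."""


class ReaderRegistry:
    def __init__(self) -> None:
        # Readers are kept in registration order.
        self._readers: list[Reader] = []

    def register(self, reader: Reader) -> None:
        self._readers.append(reader)

    def get_reader(self, source: str) -> Reader:
        # Registry iteration: ask each reader in turn, first match wins.
        for reader in self._readers:
            if reader.can_handle(source):
                return reader
        raise NoReaderError(
            f"No registered reader can handle source: {source!r}"
        )


class PrefixReader:
    """Stub reader that claims sources starting with a given prefix."""

    def __init__(self, prefix: str, label: str) -> None:
        self.prefix = prefix
        self.label = label

    def can_handle(self, source: str) -> bool:
        return source.startswith(self.prefix)

    def read(self, source: str) -> str:
        # Stand-in for real loading; returns a description, not data.
        return f"{self.label} stub loaded {source}"
```

With two stubs registered for `duckdb://` and `bigquery://`, `get_reader("bigquery://project.dataset.table")` returns the second stub, and an unrecognized source raises `NoReaderError` instead of silently picking a reader.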

Current Built-In Readers¶

Dift currently includes:

Reader	Purpose
LocalFileReader	Local files
DuckDBReader	DuckDB databases
SQLReader	SQL databases
BigQueryReader	BigQuery warehouses

LocalFileReader¶

Responsibilities:

local path validation
file extension handling
filesystem checks
local dataset loading

Supported formats:

CSV
Parquet
Excel
JSON
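
File extension handling for these formats might look like the sketch below. The `SUPPORTED_FORMATS` mapping and the `detect_format` helper are hypothetical, and the exact extensions Dift accepts (for example `.xls` or `.ndjson`) are not stated here, so treat the mapping as an assumption.

```python
from pathlib import Path

# Assumed extension-to-format mapping; Dift's actual list may differ.
SUPPORTED_FORMATS = {
    ".csv": "csv",
    ".parquet": "parquet",
    ".xlsx": "excel",
    ".json": "json",
}


def detect_format(source: str) -> str:
    """Map a local path to a format name based on its extension."""
    suffix = Path(source).suffix.lower()
    try:
        return SUPPORTED_FORMATS[suffix]
    except KeyError:
        supported = ", ".join(sorted(SUPPORTED_FORMATS))
        # Name the offending extension and list the accepted ones.
        raise ValueError(
            f"Unsupported file extension {suffix!r} for {source!r}. "
            f"Supported extensions: {supported}."
        ) from None
```

Lowercasing the suffix means `Orders.CSV` and `orders.csv` route to the same loader.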

DuckDBReader¶

Responsibilities:

DuckDB URI parsing
local database access
analytical table loading

Example URI:

duckdb:///warehouse.duckdb:customers
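
A minimal sketch of parsing that URI follows. It assumes the format is `duckdb:///<database path>:<table>`, inferred from the single example above; `parse_duckdb_uri` and `DuckDBTarget` are hypothetical names, and Dift's real parser may accept more variants.

```python
from typing import NamedTuple

DUCKDB_PREFIX = "duckdb:///"


class DuckDBTarget(NamedTuple):
    database_path: str
    table: str


def parse_duckdb_uri(uri: str) -> DuckDBTarget:
    """Split an assumed duckdb:///<path>:<table> URI into its parts."""
    if not uri.startswith(DUCKDB_PREFIX):
        raise ValueError(
            f"Invalid DuckDB URI {uri!r}: "
            f"expected it to start with {DUCKDB_PREFIX!r}."
        )
    remainder = uri[len(DUCKDB_PREFIX):]
    # Split on the last colon so a colon earlier in the path is kept
    # as part of the path.
    database_path, separator, table = remainder.rpartition(":")
    if not separator or not database_path or not table:
        raise ValueError(
            f"Invalid DuckDB URI {uri!r}: expected "
            "duckdb:///<database path>:<table>, for example "
            "duckdb:///warehouse.duckdb:customers."
        )
    return DuckDBTarget(database_path, table)
```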

SQLReader¶

Responsibilities:

SQLAlchemy connector integration
SQL URI parsing
dependency guidance
database loading

Supported systems include:

SQLite
PostgreSQL
MySQL
Redshift
Snowflake

BigQueryReader¶

Responsibilities:

BigQuery URI parsing
warehouse extraction
Google Cloud authentication workflows

Example URI:

bigquery://project.dataset.table
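
Parsing this URI could be sketched as follows, assuming exactly three dot-separated parts after the scheme. `parse_bigquery_uri` and `BigQueryTable` are hypothetical names, and the sketch performs no authentication or API calls.

```python
from typing import NamedTuple

BIGQUERY_PREFIX = "bigquery://"


class BigQueryTable(NamedTuple):
    project: str
    dataset: str
    table: str


def parse_bigquery_uri(uri: str) -> BigQueryTable:
    """Split an assumed bigquery://project.dataset.table URI."""
    if not uri.startswith(BIGQUERY_PREFIX):
        raise ValueError(
            f"Invalid BigQuery URI {uri!r}: "
            f"expected it to start with {BIGQUERY_PREFIX!r}."
        )
    parts = uri[len(BIGQUERY_PREFIX):].split(".")
    # Require exactly three non-empty components.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"Invalid BigQuery URI {uri!r}: "
            "expected bigquery://project.dataset.table."
        )
    return BigQueryTable(*parts)
```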

Unified Dataset Contract¶

All readers return:

polars.DataFrame

This single return type is what keeps the comparison engine connector-agnostic: it operates only on Polars DataFrames, regardless of where the data came from.

The comparison engine never needs to understand:

SQLAlchemy
DuckDB
warehouse APIs
cloud authentication
connector-specific behavior

Why Connector Isolation Matters¶

Connector isolation improves:

maintainability
testability
scalability
plugin preparation
dependency management

Validation Philosophy¶

Readers are responsible for validating their own input and for producing error messages that tell the user how to fix the problem.

Examples include:

unsupported format guidance
invalid URI guidance
missing dependency guidance
connector troubleshooting hints

Good Validation Example¶

PostgreSQL support requires psycopg2.

Install it with:
  pip install psycopg2-binary
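
A message like this can be produced by checking for the driver before using it. The sketch below is one possible approach using the standard library's `importlib.util.find_spec`; the helper name `require_dependency` and its parameters are hypothetical, not part of Dift.

```python
import importlib.util


def require_dependency(module_name: str, feature: str, pip_name: str) -> None:
    """Raise an actionable ImportError if a top-level module is missing."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{feature} support requires {module_name}.\n\n"
            f"Install it with:\n  pip install {pip_name}"
        )
```

Calling `require_dependency("psycopg2", "PostgreSQL", "psycopg2-binary")` at the top of a reader's loading path surfaces the install hint before any connection attempt is made.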

Poor Validation Example¶

ValueError: failed

Reader Prioritization¶

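Reader registration order matters: the registry iterates over readers in the order they were registered and routes to the first one that can handle the source.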

Example:

registry.register(SQLReader())
registry.register(LocalFileReader())

Specialized readers should generally be registered before generic readers, so that a generic reader does not claim a source that a specialized reader is meant to handle.

Why Prioritization Exists¶

Some source patterns may overlap, so more than one reader may report that it can handle the same source.

Prioritization ensures:

deterministic routing
predictable behavior
extensibility safety
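
The effect of ordering can be shown with two overlapping stub readers. This is a sketch under stated assumptions: `SQLiteStub`, `CatchAllStub`, and `resolve` are hypothetical, and `resolve` stands in for the registry's first-match lookup over an ordered list.

```python
class SQLiteStub:
    """Specialized stub: only claims sqlite:// sources."""

    name = "sqlite"

    def can_handle(self, source: str) -> bool:
        return source.startswith("sqlite://")


class CatchAllStub:
    """Generic stub: claims every source, so it overlaps with all others."""

    name = "catch-all"

    def can_handle(self, source: str) -> bool:
        return True


def resolve(readers, source: str):
    """Return the first reader, in list order, that accepts the source."""
    return next(r for r in readers if r.can_handle(source))


source = "sqlite:///app.db"
specialized_first = resolve([SQLiteStub(), CatchAllStub()], source)
generic_first = resolve([CatchAllStub(), SQLiteStub()], source)

print(specialized_first.name)  # sqlite
print(generic_first.name)      # catch-all
```

Both orderings are deterministic, but only the first routes the source to the reader that understands it, which is why the generic reader belongs last.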

Centralized Routing Benefits¶

Without a registry:

CLI
 └── Large conditional routing logic

With a registry:

CLI
 └── Registry
      └── Readers

Benefits include:

cleaner architecture
modular connectors
future scalability

Dynamic Registration¶

Readers can be registered dynamically.

Example:

registry.register(MyCustomReader())

This is foundational for future plugin support.

Current Routing Workflow¶

Current workflow:

CLI
  ↓
Registry
  ↓
Reader
  ↓
Polars DataFrame
  ↓
Comparison Engine

Future Plugin Preparation¶

The registry architecture is intentionally designed to support future plugin ecosystems.

Potential future structure:

dift/plugins/
├── databricks/
├── kafka/
├── s3/
├── spark/
└── custom/

Future Plugin Workflow¶

Potential future behavior:

registry.load_plugins()

Potential capabilities:

dynamic plugin discovery
optional connectors
third-party integrations
enterprise extensions

Optional Connector Loading¶

Future connectors may become separately installable.

Examples:

pip install dift-snowflake
pip install dift-kafka

Benefits:

reduced dependency bloat
modular ecosystems
isolated integrations

Connector Metadata (Future)¶

Future reader metadata may include:

from dataclasses import dataclass


@dataclass
class ReaderMetadata:
    name: str
    version: str
    supported_sources: list[str]

Potential uses:

capability discovery
debugging
plugin inspection
enterprise tooling

Registry Testing¶

Registry testing validates:

reader routing
prioritization
extensibility behavior
connector isolation
dynamic registration

Error Handling Philosophy¶

Readers should raise errors that are:

actionable
readable
connector-aware
beginner-friendly

Design Philosophy¶

The reader registry architecture prioritizes:

modularity
extensibility
maintainability
connector scalability
plugin readiness

Future Goals¶

Planned future improvements include:

plugin auto-discovery
capability inspection
async loading
streaming connectors
remote connector support
distributed loading

Architectural Benefits¶

The registry architecture enables:

scalable connector growth
cleaner maintenance
future ecosystem expansion
enterprise extensibility
community integrations
