Skip to content

Reader Registry System

This document explains Dift’s centralized dataset reader registry architecture, connector routing system, and extensibility model.

The reader registry is a foundational part of Dift’s internal architecture because it enables scalable connector growth while keeping the comparison engine connector-agnostic.


Why the Reader Registry Exists

Earlier versions of Dift handled dataset routing directly inside a single loading workflow.

As connector support expanded, this created several architectural problems:

  • connector logic became tightly coupled
  • routing logic became difficult to maintain
  • validation behavior became duplicated
  • adding connectors required core modifications
  • plugin preparation became difficult

The reader registry architecture solves these issues by introducing a centralized routing and registration system.


Design Goals

The reader registry system was designed to provide:

  • centralized connector routing
  • modular dataset readers
  • reusable validation workflows
  • dynamic reader registration
  • future plugin preparation
  • connector isolation
  • scalable connector architecture

High-Level Architecture

                 ┌────────────────────┐
                 │   CLI / Config     │
                 └─────────┬──────────┘
                           ▼
                 ┌────────────────────┐
                 │  Reader Registry   │
                 └─────────┬──────────┘
                           ▼
        ┌────────────────────────────────────┐
        │              Readers              │
        ├────────────────────────────────────┤
        │ LocalFileReader                   │
        │ DuckDBReader                      │
        │ SQLReader                         │
        │ BigQueryReader                    │
        └────────────────────────────────────┘
                           ▼
                 ┌────────────────────┐
                 │ Polars DataFrame   │
                 └────────────────────┘

Core Files

The registry architecture is implemented primarily in:

dift/io/
├── base_reader.py
├── registry.py
├── readers.py
├── sql_reader.py
├── duckdb_reader.py
└── bigquery_reader.py

Base Reader Interface

File:

dift/io/base_reader.py

All dataset readers implement a shared interface.

Example:

class BaseReader:
    def can_handle(self, source: str) -> bool:
        ...

    def read(self, source: str):
        ...

Why a Shared Interface Matters

The shared interface standardizes:

  • connector behavior
  • routing logic
  • dataset loading contracts
  • validation expectations

This allows the registry to treat all readers consistently.


Reader Responsibilities

Each reader is responsible for:

  • determining whether it supports a source
  • validating connector input
  • loading datasets
  • raising actionable errors

Readers should NOT:

  • perform comparisons
  • generate reports
  • calculate risk
  • handle CLI orchestration

Reader Registry

File:

dift/io/registry.py

The registry acts as the central connector routing system.

Responsibilities include:

  • registering readers
  • discovering compatible readers
  • prioritizing readers
  • centralizing routing logic

Registry Example

Example registration:

registry = ReaderRegistry()

registry.register(LocalFileReader())
registry.register(SQLReader())
registry.register(DuckDBReader())

Reader Resolution Example

Example routing workflow:

reader = registry.get_reader(source)
df = reader.read(source)

Reader Discovery Flow

Current routing flow:

Source Input
      ↓
Registry Iteration
      ↓
Reader.can_handle()
      ↓
Compatible Reader
      ↓
Reader.read()

Current Built-In Readers

Dift currently includes:

Reader Purpose
LocalFileReader Local files
DuckDBReader DuckDB databases
SQLReader SQL databases
BigQueryReader BigQuery warehouses

LocalFileReader

Responsibilities:

  • local path validation
  • file extension handling
  • filesystem checks
  • local dataset loading

Supported formats:

  • CSV
  • Parquet
  • Excel
  • JSON

DuckDBReader

Responsibilities:

  • DuckDB URI parsing
  • local database access
  • analytical table loading

Example URI:

duckdb:///warehouse.duckdb:customers

SQLReader

Responsibilities:

  • SQLAlchemy connector integration
  • SQL URI parsing
  • dependency guidance
  • database loading

Supported systems include:

  • SQLite
  • PostgreSQL
  • MySQL
  • Redshift
  • Snowflake

BigQueryReader

Responsibilities:

  • BigQuery URI parsing
  • warehouse extraction
  • Google Cloud authentication workflows

Example URI:

bigquery://project.dataset.table

Unified Dataset Contract

All readers return:

polars.DataFrame

This is extremely important because the comparison engine remains completely connector-agnostic.

The comparison engine never needs to understand:

  • SQLAlchemy
  • DuckDB
  • warehouse APIs
  • cloud authentication
  • connector-specific behavior

Why Connector Isolation Matters

Connector isolation improves:

  • maintainability
  • testability
  • scalability
  • plugin preparation
  • dependency management

Validation Philosophy

Readers are responsible for actionable validation behavior.

Examples include:

  • unsupported format guidance
  • invalid URI guidance
  • missing dependency guidance
  • connector troubleshooting hints

Good Validation Example

PostgreSQL support requires psycopg2.

Install it with:
  pip install psycopg2-binary

Poor Validation Example

ValueError: failed

Reader Prioritization

Reader registration order matters.

Example:

registry.register(SQLReader())
registry.register(LocalFileReader())

Specialized readers should generally be registered before generic readers.


Why Prioritization Exists

Some source patterns may overlap.

Prioritization ensures:

  • deterministic routing
  • predictable behavior
  • extensibility safety

Centralized Routing Benefits

Without a registry:

CLI
 └── Large conditional routing logic

With a registry:

CLI
 └── Registry
      └── Readers

Benefits include:

  • cleaner architecture
  • modular connectors
  • future scalability

Dynamic Registration

Readers can be registered dynamically.

Example:

registry.register(MyCustomReader())

This is foundational for future plugin support.


Current Routing Workflow

Current workflow:

CLI
  ↓
Registry
  ↓
Reader
  ↓
Polars DataFrame
  ↓
Comparison Engine

Future Plugin Preparation

The registry architecture is intentionally designed to support future plugin ecosystems.

Potential future structure:

dift/plugins/
├── databricks/
├── kafka/
├── s3/
├── spark/
└── custom/

Future Plugin Workflow

Potential future behavior:

registry.load_plugins()

Potential capabilities:

  • dynamic plugin discovery
  • optional connectors
  • third-party integrations
  • enterprise extensions

Optional Connector Loading

Future connectors may become separately installable.

Examples:

pip install dift-snowflake
pip install dift-kafka

Benefits:

  • reduced dependency bloat
  • modular ecosystems
  • isolated integrations

Connector Metadata (Future)

Future reader metadata may include:

class ReaderMetadata:
    name: str
    version: str
    supported_sources: list[str]

Potential uses:

  • capability discovery
  • debugging
  • plugin inspection
  • enterprise tooling

Registry Testing

Registry testing validates:

  • reader routing
  • prioritization
  • extensibility behavior
  • connector isolation
  • dynamic registration

Error Handling Philosophy

Readers should raise errors that are:

  • actionable
  • readable
  • connector-aware
  • beginner-friendly

Design Philosophy

The reader registry architecture prioritizes:

  • modularity
  • extensibility
  • maintainability
  • connector scalability
  • plugin readiness

Future Goals

Planned future improvements include:

  • plugin auto-discovery
  • capability inspection
  • async loading
  • streaming connectors
  • remote connector support
  • distributed loading

Architectural Benefits

The registry architecture enables:

  • scalable connector growth
  • cleaner maintenance
  • future ecosystem expansion
  • enterprise extensibility
  • community integrations

Related Developer Docs

See also:

  • architecture.md
  • plugin-preparation.md
  • report-system.md
  • testing.md
  • codebase-overview.md