Skip to content

Roadmap

This roadmap outlines the planned evolution of Dift from an open-source dataset comparison CLI into a comprehensive data trust and validation platform.

The roadmap reflects both completed milestones and future platform direction.


Vision

Dift aims to become the open-source standard for:

  • dataset regression testing
  • data drift monitoring
  • ML data validation
  • warehouse trust validation
  • automated data quality enforcement
  • data deployment validation
  • dataset observability

Roadmap Philosophy

Dift development focuses on:

  • trust-first validation
  • automation-friendly workflows
  • scalable architecture
  • warehouse interoperability
  • extensibility
  • developer experience
  • enterprise readiness
  • open ecosystem growth

v0.1.0 — Foundation Release

Initial public release focused on core comparison workflows.

Core Comparison Engine

  • [x] Schema comparison
  • [x] Row-level comparison
  • [x] Null analysis
  • [x] Duplicate detection
  • [x] Risk scoring
  • [x] Console reporting

Dataset Support

  • [x] CSV support
  • [x] Parquet support
  • [x] Excel support
  • [x] JSON support

Reporting

  • [x] Console reports
  • [x] JSON reports
  • [x] Basic CSV reports

Developer Experience

  • [x] CLI workflows
  • [x] Initial testing setup
  • [x] Initial documentation

v0.2.1 — Reporting & Validation Improvements

Focused on usability, reporting quality, and validation consistency.

Reporting Improvements

  • [x] Better JSON report structure
  • [x] Better CSV summaries
  • [x] HTML report generation
  • [x] Excel report generation

Validation Improvements

  • [x] Better validation errors
  • [x] Clearer unsupported file messages
  • [x] Better CLI help output

CLI Improvements

  • [x] Better command examples
  • [x] Improved terminal formatting
  • [x] Cleaner workflows

Developer Experience

  • [x] Improved testing coverage
  • [x] Better report consistency
  • [x] Validation regression testing

v0.3.0 — Reusable Workflow System

Focused on configuration-driven workflows and reusable validation systems.

Configuration System

Config File Support

  • [x] YAML configuration support
  • [x] TOML configuration support
  • [x] JSON configuration support
  • [x] Config validation support

Dataset Config Workflows

  • [x] Dataset paths inside config files
  • [x] CLI override support
  • [x] Reusable validation workflows

Reusable Threshold Configs

  • [x] Numeric drift thresholds
  • [x] Categorical shift thresholds
  • [x] Outlier thresholds
  • [x] Column-level threshold overrides
  • [x] Dataset-specific threshold profiles

Environment-Based Configs

  • [x] Development/staging/production configs
  • [x] Environment variable support
  • [x] Environment-aware comparison workflows

Reporting Improvements

  • [x] Improved HTML reports
  • [x] Improved Excel reports
  • [x] Better metadata support
  • [x] Output directory support

Developer Experience

  • [x] Better validation diagnostics
  • [x] Improved CLI UX
  • [x] Better testing workflows

v0.5.0 — Drift Intelligence & Automation

Major expansion into drift detection and automation workflows.

Drift Detection

Numeric Drift

  • [x] Mean shift detection
  • [x] Standard deviation drift
  • [x] Range shift detection
  • [x] Configurable drift thresholds
  • [x] Severity classification

Categorical Drift

  • [x] New categorical value detection
  • [x] Removed categorical value detection
  • [x] Frequency distribution shifts
  • [x] Severity classification

Outlier Detection

  • [x] IQR outlier detection
  • [x] Outlier spike detection
  • [x] Outlier percentage tracking
  • [x] Risk integration

Automation Features

Scheduled Comparisons

  • [x] Scheduled dataset checks
  • [x] Cron-friendly execution
  • [x] Time-based comparison workflows
  • [x] Scheduled report generation

CLI Automation Workflows

  • [x] Non-interactive CLI support
  • [x] Automation-friendly exit codes
  • [x] Pipeline integration support
  • [x] CI/CD-friendly execution

Batch Dataset Comparison

  • [x] Multi-dataset comparison support
  • [x] Folder-based comparisons
  • [x] Batch report generation
  • [x] Recursive dataset discovery

Comparison History

  • [x] Historical comparison tracking
  • [x] Drift trend analysis
  • [x] Historical risk tracking
  • [x] Historical report retention

Reporting Improvements

Better Excel Reports

  • [x] Severity color coding
  • [x] Conditional formatting
  • [x] Improved worksheet layouts
  • [x] Better readability styling
  • [x] Summary dashboards
  • [x] Risk highlighting

Better HTML Reports

  • [x] Visual summary cards
  • [x] Severity badges
  • [x] Drift highlighting
  • [x] Responsive layouts
  • [x] Risk dashboards

JSON Reporting Improvements

  • [x] Stable JSON schema
  • [x] Better API compatibility
  • [x] Machine-readable metadata

Data Trust & Validation

Risk Engine Improvements

  • [x] Explainable risk scoring
  • [x] Risk contribution summaries
  • [x] Risk weighting configuration
  • [x] Column-level risk scoring

v0.6.0 — Connectors & Extensible Architecture

Major architectural release introducing database and warehouse workflows.

Database Support

SQL Database Integration

  • [x] Direct database-to-database comparison
  • [x] Table-to-table comparison support
  • [x] Query-based dataset comparison
  • [x] Connection string support
  • [x] CLI database input support
  • [x] Cross-database comparison support

PostgreSQL Connector

  • [x] PostgreSQL table reader
  • [x] PostgreSQL query reader
  • [x] Schema inference support
  • [x] Secure connection handling
  • [x] PostgreSQL schema comparison support

MySQL Connector

  • [x] MySQL table reader
  • [x] Query-based comparisons
  • [x] Type compatibility handling
  • [x] MySQL schema comparison support

SQLite Connector

  • [x] SQLite local database support
  • [x] SQLite query support
  • [x] Lightweight comparison workflows
  • [x] File-based database comparison

DuckDB Support

  • [x] Native DuckDB integration
  • [x] DuckDB query support
  • [x] Analytical dataset support
  • [x] Parquet interoperability
  • [x] Local analytics workflow support

Data Warehouse Support

Snowflake Connector

  • [x] Snowflake authentication support
  • [x] Warehouse query execution
  • [x] Large-scale dataset comparison
  • [x] Snowflake schema support

BigQuery Connector

  • [x] BigQuery dataset comparison
  • [x] Service account authentication
  • [x] Query-based workflows
  • [x] BigQuery table extraction

Redshift Connector

  • [x] Redshift warehouse support
  • [x] Efficient table extraction
  • [x] Warehouse schema compatibility

Developer Experience

Testing Improvements

  • [x] Connector integration tests
  • [x] Cross-format consistency tests
  • [x] Warehouse mock testing
  • [x] End-to-end workflow testing

CLI Improvements

  • [x] Better help messages
  • [x] Clearer validation errors
  • [x] Progress indicators
  • [x] Better terminal formatting
  • [x] Improved error diagnostics

Plugin Preparation

  • [x] Extensible reader interfaces
  • [x] Connector registry architecture
  • [x] Internal plugin preparation
  • [x] Hook system preparation

Documentation Improvements

  • [x] Better CLI examples
  • [x] Database integration guides
  • [x] CI/CD setup examples
  • [x] Contribution examples

Internal Architecture

  • [x] Reader registry system
  • [x] Shared reader abstractions
  • [x] Modular connector loading
  • [x] Plugin-safe interfaces
  • [x] Connector isolation preparation

v0.7.0 — Scale & Performance

Focused on scalability and advanced statistical analysis.

Performance Optimization

  • [ ] Chunked dataset processing
  • [ ] Streaming comparisons
  • [ ] Parallel processing
  • [ ] Memory optimization
  • [ ] Large dataset optimization
  • [ ] Lazy loading workflows
  • [ ] Faster schema comparison

Testing Improvements

  • [ ] Performance benchmarks
  • [ ] Regression test suite
  • [ ] Large dataset tests
  • [ ] Stress testing
  • [ ] Connector reliability tests

Better Statistical Analysis

  • [ ] Quantile drift detection
  • [ ] Percentile comparison
  • [ ] Correlation drift detection
  • [ ] Distribution similarity scoring
  • [ ] Statistical confidence scoring
  • [ ] Population Stability Index (PSI)
  • [ ] KL divergence support
  • [ ] Jensen-Shannon divergence support

Large Dataset Features

  • [ ] Billion-row comparison preparation
  • [ ] Sampling-based comparisons
  • [ ] Approximate diff algorithms
  • [ ] Distributed comparison preparation

Smart Drift Analysis

  • [ ] Auto-threshold recommendations
  • [ ] Adaptive drift scoring
  • [ ] Dynamic severity classification
  • [ ] Smart anomaly grouping

NoSQL Support

  • [ ] MongoDB connector
  • [ ] Collection comparison
  • [ ] Aggregation pipeline comparison
  • [ ] Nested document flattening
  • [ ] JSON schema inference

v0.8.0 — ML & Observability

Focused on ML workflows, governance, and observability.

ML & Data Science Features

  • [ ] ML feature drift analysis
  • [ ] Feature importance drift
  • [ ] Dataset quality scoring
  • [ ] Training vs production comparison
  • [ ] Label distribution analysis
  • [ ] Training-serving skew detection
  • [ ] Feature health summaries

Time-Series Support

  • [ ] Time-series dataset comparison
  • [ ] Trend shift detection
  • [ ] Rolling window analysis
  • [ ] Seasonal drift analysis
  • [ ] Time-aware anomaly detection

Advanced Risk Engine

  • [ ] Configurable weighted scoring
  • [ ] Custom risk policies
  • [ ] Rule-based validation
  • [ ] Risk explainability
  • [ ] Severity calibration
  • [ ] Risk scoring plugins

Observability Features

  • [ ] Drift monitoring workflows
  • [ ] Historical drift tracking
  • [ ] Drift trend visualization
  • [ ] Risk monitoring dashboards

Governance Features

  • [ ] Dataset audit trails
  • [ ] Validation history
  • [ ] Compliance-oriented reporting
  • [ ] Approval workflow preparation

v0.9.0 — Collaboration & Platform Integrations

Focused on ecosystem integration and collaborative workflows.

CI/CD & DevOps Integration

  • [ ] GitHub Actions integration
  • [ ] GitLab CI integration
  • [ ] Jenkins integration
  • [ ] Pre-deployment data validation
  • [ ] dbt workflow integration
  • [ ] Native Airflow integration

Alerting & Notifications

  • [ ] Slack alerts
  • [ ] Email notifications
  • [ ] Webhook support
  • [ ] Drift alert summaries
  • [ ] Severity-based alerts
  • [ ] Scheduled notifications

Reporting Improvements

  • [ ] Interactive HTML reports
  • [ ] Dashboard-style reports
  • [ ] Historical comparison tracking
  • [ ] Exportable charts
  • [ ] Trend dashboards
  • [ ] Executive summary reports

Collaboration Features

  • [ ] Shared report publishing
  • [ ] Team comparison workflows
  • [ ] Shared configuration profiles
  • [ ] Comparison annotations

Data Observability Features

  • [ ] Continuous drift monitoring
  • [ ] Health score tracking
  • [ ] Data freshness indicators
  • [ ] Trust trend analysis

v1.0.0 — Enterprise Platform

Focused on platform maturity, stability, and ecosystem expansion.

Enterprise Readiness

  • [ ] Stable public APIs
  • [ ] Plugin architecture
  • [ ] Extension system
  • [ ] Enterprise documentation
  • [ ] Long-term support structure
  • [ ] Stable configuration system
  • [ ] Version compatibility guarantees

Ecosystem Expansion

  • [ ] Python SDK
  • [ ] REST API service
  • [ ] Web UI dashboard
  • [ ] Cloud deployment support
  • [ ] Containerized deployments
  • [ ] Hosted execution support

Open Source Growth

  • [ ] Contributor templates
  • [ ] Community plugin registry
  • [ ] Official benchmarking datasets
  • [ ] Comprehensive documentation portal
  • [ ] Community integrations
  • [ ] Public example gallery

Reliability & Stability

  • [ ] Full regression coverage
  • [ ] Production hardening
  • [ ] Backward compatibility guarantees
  • [ ] Release automation
  • [ ] Security review workflows
  • [ ] Long-term maintenance processes

Data Trust Platform Vision

  • [ ] Trust-first validation workflows
  • [ ] Enterprise-grade risk analysis
  • [ ] Unified dataset trust scoring
  • [ ] Automated validation pipelines
  • [ ] Cross-platform data trust ecosystem

Beyond v1.0

Long-term ecosystem possibilities:

Distributed & Cloud Workflows

  • [ ] Distributed execution engine
  • [ ] Spark integration
  • [ ] Databricks integration
  • [ ] Kubernetes-native execution
  • [ ] Serverless validation workflows

Streaming & Real-Time Validation

  • [ ] Kafka support
  • [ ] Streaming dataset validation
  • [ ] Real-time drift detection
  • [ ] Continuous trust scoring

AI-Assisted Validation

  • [ ] AI-generated validation suggestions
  • [ ] Smart drift explanations
  • [ ] Automatic anomaly summarization
  • [ ] Natural language trust reporting

Enterprise Governance

  • [ ] Policy-as-code validation
  • [ ] Data contract enforcement
  • [ ] Governance dashboards
  • [ ] Enterprise compliance workflows

Long-Term Vision

Dift aims to become:

The open-source standard for dataset trust validation and automated data quality enforcement.

The long-term goal is to build a scalable ecosystem for:

  • regression testing
  • warehouse validation
  • ML data monitoring
  • observability
  • deployment trust checks
  • automated governance
  • enterprise-grade data trust workflows