Automation

Dift is designed to support automation-first data validation workflows.

This document explains how to integrate Dift into:

  • CI/CD pipelines
  • cron jobs
  • scheduled workflows
  • Airflow
  • Jenkins
  • GitHub Actions
  • Prefect
  • Dagster
  • enterprise monitoring systems

Why Automation Matters

Modern data systems continuously evolve.

Bad data can silently break:

  • dashboards
  • ML models
  • ETL pipelines
  • reports
  • warehouse transformations
  • production analytics

Dift helps automate trust validation before bad data propagates downstream.


Core Automation Features

Dift automation features include:

  • reusable profiles
  • scheduled comparisons
  • strict exit codes
  • quiet mode
  • no-color mode
  • comparison history
  • batch workflows
  • reusable configs
  • environment-aware execution

Automation Philosophy

Dift automation workflows prioritize:

  • reproducibility
  • machine-readable outputs
  • predictable execution
  • CI/CD compatibility
  • non-interactive workflows

Non-Interactive CLI Execution

Dift fully supports non-interactive execution.

Example:

dift old.csv new.csv \
  --key customer_id \
  --strict-exit-codes \
  --quiet \
  --no-color

This is ideal for:

  • cron jobs
  • CI/CD systems
  • scheduled monitoring
  • automated warehouse checks

Strict Exit Codes

Strict exit codes allow Dift to fail workflows automatically when risky drift is detected.

Enable strict mode:

dift old.csv new.csv \
  --strict-exit-codes

Exit Code Mapping

Exit Code   Meaning
0           Low-risk comparison
1           Medium-risk drift detected
2           High-risk drift detected
3           Runtime error or failed comparison
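
A wrapper script can turn these codes into a pass/fail decision. The following is a minimal Python sketch, not part of Dift itself. `EXIT_MEANINGS` copies the mapping above; `describe`, `should_block`, and the `max_allowed` parameter are hypothetical names chosen for illustration.

```python
# Mapping copied from the exit code table above.
EXIT_MEANINGS = {
    0: "low-risk comparison",
    1: "medium-risk drift detected",
    2: "high-risk drift detected",
    3: "runtime error or failed comparison",
}


def describe(code: int) -> str:
    """Return a readable label for a Dift strict exit code."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def should_block(code: int, max_allowed: int = 0) -> bool:
    """Decide whether a pipeline should stop.

    max_allowed is a hypothetical policy knob: 0 blocks on any drift,
    1 tolerates medium-risk drift. Runtime errors (3) and unknown codes
    always block.
    """
    if code not in EXIT_MEANINGS or code == 3:
        return True
    return code > max_allowed
```

With the default policy, `should_block(1)` is true, while `should_block(1, max_allowed=1)` lets medium-risk drift through.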

Why Strict Exit Codes Matter

Strict exit codes allow systems to automatically:

  • fail CI jobs
  • stop deployments
  • block ETL workflows
  • trigger alerts
  • enforce validation gates

Quiet Mode

Suppress non-error output:

dift old.csv new.csv --quiet

Useful for:

  • automation logs
  • scheduled workflows
  • CI environments

No-Color Mode

Disable ANSI terminal colors:

dift old.csv new.csv --no-color

Useful for:

  • plain-text logging systems
  • CI logs
  • centralized observability platforms

Recommended Automation Command

Recommended production automation workflow:

dift prod.csv staging.csv \
  --key customer_id \
  --strict-exit-codes \
  --quiet \
  --no-color
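
When the command is launched from Python rather than a shell, it can be assembled as an argument list. This is a sketch under the assumption that `dift` is on the `PATH`; `build_command` and `run_validation` are hypothetical helper names, and only the flags shown above are used.

```python
import subprocess


def build_command(old: str, new: str, key: str) -> list[str]:
    """Assemble the recommended automation command as an argument list."""
    return [
        "dift", old, new,
        "--key", key,
        "--strict-exit-codes",
        "--quiet",
        "--no-color",
    ]


def run_validation(old: str, new: str, key: str) -> int:
    """Run Dift and return its exit code without raising on non-zero."""
    completed = subprocess.run(build_command(old, new, key), check=False)
    return completed.returncode
```

Passing a list instead of a shell string avoids quoting problems when file names contain spaces.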

Reusable Profiles

Profiles simplify recurring workflows.

Create a profile:

dift profile create nightly-check \
  --old prod.csv \
  --new candidate.csv \
  --key customer_id \
  --report html

Run the profile:

dift profile run nightly-check

Why Profiles Matter

Profiles help standardize:

  • nightly validations
  • production checks
  • warehouse monitoring
  • recurring comparisons

Scheduled Comparisons

Dift can generate cron entries for saved profiles.

Generate a cron command:

dift schedule cron nightly-check

Example output:

0 2 * * * dift profile run nightly-check --history --strict-exit-codes

Custom Schedule Times

Generate custom schedules:

dift schedule cron nightly-check \
  --hour 5 \
  --minute 30
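
The output for custom times is not shown here. As a rough guide, the Python sketch below formats a daily entry in the same shape as the example output above, using the standard cron field order (minute, hour, day of month, month, day of week). `cron_entry` is a hypothetical helper, and the assumption that Dift emits exactly this line for `--hour 5 --minute 30` is unverified.

```python
def cron_entry(profile: str, hour: int = 2, minute: int = 0) -> str:
    """Format a daily cron line in the same shape as the example output."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Cron field order: minute hour day-of-month month day-of-week.
    return (
        f"{minute} {hour} * * * "
        f"dift profile run {profile} --history --strict-exit-codes"
    )
```

Under that assumption, `cron_entry("nightly-check", hour=5, minute=30)` yields a line starting with `30 5 * * *`, meaning 05:30 every day.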

Saved Schedules

Create reusable schedules:

dift schedule create daily-check \
  --profile nightly-check \
  --cron "0 2 * * *"

List Schedules

dift schedule list

Run Saved Schedule

dift schedule run daily-check

Delete Schedule

dift schedule delete daily-check

Cron Integration

Open cron editor:

crontab -e

Add generated schedule:

0 2 * * * dift profile run nightly-check --history --strict-exit-codes

Linux/macOS Workflow Example

Example nightly validation. A crontab entry must fit on a single line, because cron does not support backslash line continuations:

0 2 * * * dift profile run production-check --history --strict-exit-codes --quiet --no-color

Windows Task Scheduler

Use generated Dift commands directly inside:

  • Windows Task Scheduler
  • PowerShell automation
  • scheduled batch workflows

Example:

dift profile run nightly-check --history --strict-exit-codes

GitHub Actions Integration

Example workflow:

name: Dift Validation

on:
  push:

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Install Dift
        run: pip install dift-cli

      - name: Run Validation
        run: |
          dift old.csv new.csv \
            --key customer_id \
            --strict-exit-codes \
            --quiet \
            --no-color

Jenkins Integration

Example Jenkins pipeline step:

stage('Validate Data') {
    steps {
        sh '''
        dift prod.csv staging.csv \
          --key customer_id \
          --strict-exit-codes
        '''
    }
}

Airflow Integration

Dift can run as a BashOperator task in an Airflow DAG. With strict exit codes enabled, a non-zero exit code fails the task.

Example:

from airflow.operators.bash import BashOperator

validate_data = BashOperator(
    task_id="validate_data",
    bash_command="""
    dift old.csv new.csv \
      --key customer_id \
      --strict-exit-codes
    """
)

Airflow Use Cases

Common Airflow workflows:

  • ETL validation
  • warehouse regression testing
  • production dataset checks
  • scheduled monitoring

Prefect Integration

Example:

from prefect import flow
import subprocess

@flow
def validate():
    subprocess.run(
        [
            "dift",
            "old.csv",
            "new.csv",
            "--strict-exit-codes"
        ],
        check=True,
    )

Dagster Integration

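Example, wrapped in a Dagster op so it can run as part of a job:

from dagster import op
import subprocess

@op
def validate_data():
    # check=True raises CalledProcessError on any non-zero exit code,
    # which fails the op.
    subprocess.run(
        [
            "dift",
            "old.csv",
            "new.csv",
            "--strict-exit-codes",
        ],
        check=True,
    )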

Batch Comparison Automation

Batch workflows are ideal for:

  • warehouse snapshots
  • multi-table ETL validation
  • large monitoring workflows

Example:

dift batch \
  --old-dir data/old \
  --new-dir data/new \
  --strict-exit-codes
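
How Dift pairs files across the two directories is not specified here. As a pre-flight check before a batch run, the Python sketch below lists which files exist on both sides, assuming same-name pairing. `pair_datasets` and the `*.csv` pattern are illustrative assumptions, not documented Dift behavior.

```python
from pathlib import Path


def pair_datasets(old_dir: str, new_dir: str, pattern: str = "*.csv"):
    """Pair files that share a name in both directories.

    Returns (pairs, only_old, only_new). Same-name pairing is an
    assumption made for this sketch, not documented Dift behavior.
    """
    old = {p.name: p for p in Path(old_dir).glob(pattern)}
    new = {p.name: p for p in Path(new_dir).glob(pattern)}
    shared = sorted(old.keys() & new.keys())
    pairs = [(old[name], new[name]) for name in shared]
    only_old = sorted(old.keys() - new.keys())
    only_new = sorted(new.keys() - old.keys())
    return pairs, only_old, only_new
```

A non-empty `only_old` or `only_new` list is a cheap early signal that a snapshot is incomplete.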

Batch HTML Reports

Example:

dift batch \
  --old-dir data/old \
  --new-dir data/new \
  --report html \
  --output-dir reports/batch

Comparison History

Persist historical validation runs:

dift old.csv new.csv \
  --history

Why History Tracking Matters

History tracking supports:

  • trend analysis
  • recurring risk visibility
  • long-term monitoring
  • compliance workflows

History Directory

Default history file location:

.dift/history/history.jsonl

Custom History Location

Example:

dift old.csv new.csv \
  --history \
  --history-dir reports/history
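
Because the history file uses the JSON Lines format, downstream scripts can read it one record at a time. The record fields are not documented in this section, so the Python sketch below treats each line as an opaque JSON object. `load_history` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_history(path: str = ".dift/history/history.jsonl") -> list[dict]:
    """Read one JSON object per non-empty line; return [] if the file is missing."""
    history_file = Path(path)
    if not history_file.exists():
        return []
    records = []
    with history_file.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

With a custom `--history-dir`, pass the path of the `.jsonl` file inside that directory; the file name used there is presumably the same as the default, but that is an assumption.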

Environment-Aware Automation

Dift supports environment-specific configs.

Example:

environments:
  development:
    threshold: 0.2

  production:
    threshold: 0.05

Run:

dift --config config.yaml --env production
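
How Dift merges an environment block with top-level settings is not described here. The Python sketch below shows one plausible reading, in which values under the selected environment override top-level values. `resolve_settings` is a hypothetical helper, and `CONFIG` is the example above written as the dict a YAML parser would produce.

```python
def resolve_settings(config: dict, env: str) -> dict:
    """Overlay the selected environment block on the top-level settings."""
    environments = config.get("environments", {})
    if env not in environments:
        raise KeyError(f"unknown environment: {env}")
    # Start from every top-level key except the environments block itself.
    settings = {k: v for k, v in config.items() if k != "environments"}
    settings.update(environments[env])
    return settings


# The example config above as a plain dict.
CONFIG = {
    "environments": {
        "development": {"threshold": 0.2},
        "production": {"threshold": 0.05},
    }
}
```

Under this reading, `resolve_settings(CONFIG, "production")` gives a threshold of 0.05 and `"development"` gives 0.2.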

Environment Variables

Dift supports environment variable interpolation.

Example:

old_dataset: ${OLD_DATASET}
new_dataset: ${NEW_DATASET}
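
To illustrate what `${NAME}` interpolation means, here is a minimal Python sketch of the substitution. It is not Dift's implementation. `interpolate` is a hypothetical helper, and raising an error for an unset variable is an assumption about the desired behavior.

```python
import os
import re

# Matches ${NAME} where NAME is a typical environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value: str, env=None) -> str:
    """Replace ${NAME} placeholders; fail loudly when NAME is unset."""
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(substitute, value)
```

For example, with `OLD_DATASET=data/old.csv` exported, `interpolate("${OLD_DATASET}")` returns `data/old.csv`.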

CI/CD-Friendly Configs

Configs help centralize automation behavior.

Example:

report: json
threshold: 0.1

Run:

dift --config ci_config.yaml

Machine-Readable Reporting

Recommended formats for automation:

  • JSON
  • CSV

Example:

dift old.csv new.csv \
  --report json \
  --output report.json

JSON Reporting Benefits

JSON reports support:

  • APIs
  • observability systems
  • custom dashboards
  • downstream automation
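
A downstream consumer can write the report to disk and load it with the standard library. The report schema is not documented in this section, so the Python sketch below makes no assumptions about field names. `report_command` and `load_report` are hypothetical helpers that use only the flags shown above.

```python
import json


def report_command(old: str, new: str, output: str) -> list[str]:
    """Build the JSON-report command as an argument list for subprocess.run."""
    return ["dift", old, new, "--report", "json", "--output", output]


def load_report(path: str):
    """Parse a JSON report written by a previous Dift run."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Inspecting the loaded object once, for example by printing it, is a reasonable first step before wiring specific fields into dashboards or alerts.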

Progress Indicators

Dift includes lightweight progress indicators for long-running workflows.

Progress visibility includes:

  • dataset loading
  • warehouse queries
  • report generation
  • comparison execution

Progress Design Goals

Progress indicators are intentionally:

  • lightweight
  • automation-safe
  • non-intrusive
  • readable

Connector Automation Workflows

Automation works across:

  • local datasets
  • DuckDB
  • PostgreSQL
  • MySQL
  • BigQuery
  • Snowflake
  • Redshift

Example SQL Workflow

dift postgresql://user:password@localhost:5432/db:old \
     postgresql://user:password@localhost:5432/db:new \
     --strict-exit-codes
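
In automation, credentials usually come from the environment rather than a hard-coded string. The Python sketch below builds a source string in the shape shown above. `postgres_source` is a hypothetical helper, and whether Dift decodes percent-encoded credentials is an assumption to verify against the connector documentation.

```python
from urllib.parse import quote


def postgres_source(user, password, host, port, database, table) -> str:
    """Build a source string in the postgresql://...:table shape shown above."""
    # Percent-encode credentials so characters like '@' or ':' do not
    # break the URL. Whether Dift decodes them is an assumption.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}:{table}"
    )
```

A caller might read the password from a variable such as `os.environ["PGPASSWORD"]` and pass the resulting strings as the two positional arguments to `dift`.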

Example BigQuery Workflow

dift bigquery://analytics.sales.old \
     bigquery://analytics.sales.new \
     --strict-exit-codes

Example DuckDB Workflow

dift duckdb:///warehouse.duckdb:old \
     duckdb:///warehouse.duckdb:new \
     --strict-exit-codes

Automation Best Practices

Recommended best practices:

  • use strict exit codes
  • use reusable profiles
  • use quiet mode in CI
  • persist history
  • use JSON reporting
  • standardize configs

Enterprise Workflow Goals

Dift automation workflows are designed to support:

  • deployment gating
  • warehouse trust validation
  • ML dataset regression testing
  • production monitoring
  • scheduled data quality enforcement

Design Philosophy

The Dift automation architecture prioritizes:

  • reproducibility
  • CI/CD compatibility
  • predictable execution
  • enterprise readiness
  • automation scalability

Related Documentation

See also:

  • configuration.md
  • profiles.md
  • history.md
  • connectors/sql.md
  • developer/architecture.md
  • developer/plugin-preparation.md