Quick Start

This guide helps you get started with Dift quickly.

You will learn how to:

  • compare datasets
  • detect drift
  • generate reports
  • use configuration files
  • run batch comparisons
  • automate workflows

Your First Comparison

Compare two CSV files:

dift examples/old.csv examples/new.csv --key customer_id

This command compares the two files and checks for:

  • schema changes
  • row changes
  • null spikes
  • duplicate spikes
  • drift patterns
  • outlier changes

Understanding the --key

The --key option defines the column used to match rows across datasets.

Example:

--key customer_id

Typical keys:

  • customer_id
  • order_id
  • transaction_id
  • product_id

Example Output

╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: MEDIUM      │
╰─────────────────────────╯

Warnings

Numeric drift:
'revenue'
mean shift 900.00%
(high, threshold 0.1)

Outlier spike:
'revenue' increased by 100.00%
(high)

Categorical shift:
'segment' max frequency shift 60.00%
(high)
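One plausible reading of a "max frequency shift" figure like the one above is the largest change in any category's share of rows between the two datasets. The Python sketch below works through that interpretation. The formula is an assumption rather than Dift's documented definition, and `max_frequency_shift` and the sample data are hypothetical.

```python
from collections import Counter


def max_frequency_shift(old_values, new_values):
    """Largest absolute change in relative frequency across all categories.

    Assumed definition for illustration; both inputs must be non-empty.
    """
    old_counts = Counter(old_values)
    new_counts = Counter(new_values)
    categories = set(old_counts) | set(new_counts)
    return max(
        abs(new_counts[c] / len(new_values) - old_counts[c] / len(old_values))
        for c in categories
    )


# "retail" drops from 80% to 20% of rows, "enterprise" rises from 20% to 80%.
old_segment = ["retail"] * 8 + ["enterprise"] * 2
new_segment = ["retail"] * 2 + ["enterprise"] * 8
shift = max_frequency_shift(old_segment, new_segment)
print(f"{shift:.2%}")  # 60.00%
```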

Generate Reports


JSON Report

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report json \
  --output report.json

CSV Report

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report csv \
  --output report.csv

Excel Report

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report excel \
  --output report.xlsx

HTML Report

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --report html \
  --output report.html

HTML Templates

Customize HTML report appearance:

dift examples/old.csv examples/new.csv \
  --report html \
  --template dark \
  --output report.html

Available templates:

  • default
  • clean
  • compact
  • enterprise
  • dark

Drift Thresholds

Control drift sensitivity using --threshold.

Default threshold:

0.1

Example:

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --threshold 0.2

Lower values detect smaller changes.

Higher values reduce sensitivity.
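The effect of the threshold can be sketched in Python. This assumes drift is measured as a relative mean shift and flagged when it exceeds the threshold. The exact formula Dift uses is not stated here, so the functions `relative_mean_shift` and `exceeds_threshold` are hypothetical.

```python
from statistics import mean


def relative_mean_shift(old_values, new_values):
    """Relative change of the mean, as a fraction of the old mean.

    Assumed metric for illustration; the old mean must be non-zero.
    """
    old_mean = mean(old_values)
    new_mean = mean(new_values)
    return abs(new_mean - old_mean) / abs(old_mean)


def exceeds_threshold(shift, threshold=0.1):
    """Flag drift when the shift is larger than the threshold."""
    return shift > threshold


old_revenue = [90, 100, 110]   # mean 100
new_revenue = [105, 115, 125]  # mean 115
shift = relative_mean_shift(old_revenue, new_revenue)

print(f"{shift:.2%}")                          # 15.00%
print(exceeds_threshold(shift, threshold=0.1))  # True
print(exceeds_threshold(shift, threshold=0.2))  # False
```

Under these assumptions, a 15% shift is reported at the default threshold of 0.1 but ignored at 0.2.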


Output Directory Support

Save reports into a directory with auto-generated filenames:

dift examples/old.csv examples/new.csv \
  --report html \
  --output-dir reports/

Generated filenames include:

  • dift_report.json
  • dift_report.csv
  • dift_report.xlsx
  • dift_report.html

Using Configuration Files

Run comparisons using reusable configuration files.


YAML Example

old_dataset: examples/old.csv
new_dataset: examples/new.csv
key: customer_id
threshold: 0.1
report: html

Run:

dift --config examples/config_sample.yaml
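The configuration keys above line up with the command-line options shown earlier in this guide. The Python sketch below makes that correspondence explicit by turning a configuration dictionary into an equivalent argument list. It assumes each key maps to the option of the same name. The function `config_to_argv` is hypothetical and is not part of Dift.

```python
def config_to_argv(config):
    """Build the command line assumed to be equivalent to a config mapping.

    Illustration only: assumes `key`, `threshold`, and `report` map to
    --key, --threshold, and --report.
    """
    argv = ["dift", config["old_dataset"], config["new_dataset"]]
    for option in ("key", "threshold", "report"):
        if option in config:
            argv += [f"--{option}", str(config[option])]
    return argv


config = {
    "old_dataset": "examples/old.csv",
    "new_dataset": "examples/new.csv",
    "key": "customer_id",
    "threshold": 0.1,
    "report": "html",
}
command = config_to_argv(config)
print(" ".join(command))
```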

Environment-Based Configs

Select a named environment from a configuration file:

dift --config examples/config_env.yaml --env production

Useful for:

  • development
  • staging
  • production
  • CI/CD workflows

Saved Profiles

Create reusable comparison workflows.


Create Profile

dift profile create nightly-check \
  --old examples/old.csv \
  --new examples/new.csv \
  --key customer_id \
  --report html

Run Profile

dift profile run nightly-check

Batch Comparisons

Compare multiple dataset pairs automatically.


Example

dift batch \
  --old-dir data/old \
  --new-dir data/new \
  --key customer_id

Useful for:

  • ETL validation
  • warehouse monitoring
  • scheduled quality checks
  • multi-table validation
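How dataset pairs are formed from the two directories is not described here. A natural assumption is that files with the same name are compared with each other. The Python sketch below shows that pairing logic. The function `pair_datasets` and the sample filenames are hypothetical.

```python
import tempfile
from pathlib import Path


def pair_datasets(old_dir, new_dir, pattern="*.csv"):
    """Pair files that share a filename in both directories.

    Assumed pairing rule for illustration; unmatched files are skipped.
    """
    old_files = {p.name: p for p in Path(old_dir).glob(pattern)}
    new_files = {p.name: p for p in Path(new_dir).glob(pattern)}
    shared = sorted(old_files.keys() & new_files.keys())
    return [(old_files[name], new_files[name]) for name in shared]


# Build a throwaway directory layout to demonstrate the pairing.
with tempfile.TemporaryDirectory() as root:
    old_dir = Path(root, "old")
    new_dir = Path(root, "new")
    old_dir.mkdir()
    new_dir.mkdir()
    for name in ("customers.csv", "orders.csv"):
        (old_dir / name).write_text("customer_id\n1\n")
    # orders.csv is missing from the new directory, so it has no pair.
    (new_dir / "customers.csv").write_text("customer_id\n1\n")

    pairs = pair_datasets(old_dir, new_dir)
    paired_names = [old.name for old, _ in pairs]

print(paired_names)  # ['customers.csv']
```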

Comparison History

Save historical comparison results:

dift examples/old.csv examples/new.csv \
  --key customer_id \
  --history

View history:

dift history list

Automation-Friendly Mode

Use Dift inside:

  • CI/CD pipelines
  • Airflow
  • Jenkins
  • Prefect
  • Dagster
  • cron jobs

Strict Exit Codes

dift prod.csv staging.csv \
  --key customer_id \
  --strict-exit-codes

Exit codes:

Code  Meaning
0     Low risk
1     Medium risk
2     High risk
3     Runtime error
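In a pipeline, these exit codes can drive a pass/fail decision. The Python sketch below wraps the command shown above with `subprocess.run` and interprets the result using the table. The helper names and the `max_allowed` policy are hypothetical choices, not part of Dift.

```python
import subprocess

# Meanings taken from the exit code table above.
RISK_BY_EXIT_CODE = {
    0: "low risk",
    1: "medium risk",
    2: "high risk",
    3: "runtime error",
}


def describe_exit_code(code):
    """Translate an exit code into the meaning listed in the table."""
    return RISK_BY_EXIT_CODE.get(code, f"unexpected exit code {code}")


def should_fail_pipeline(code, max_allowed=1):
    """Fail when the code is above the highest tolerated level.

    The default tolerates low and medium risk; this policy is an assumption.
    """
    return code > max_allowed


def run_dift(old, new, key):
    """Run the comparison with strict exit codes and return the exit code."""
    completed = subprocess.run(
        ["dift", old, new, "--key", key, "--strict-exit-codes"]
    )
    return completed.returncode


# Example usage (requires dift to be installed):
#   code = run_dift("prod.csv", "staging.csv", "customer_id")
#   print(describe_exit_code(code))
#   if should_fail_pipeline(code):
#       raise SystemExit(code)
```

Setting `max_allowed=0` would make the pipeline fail on medium risk as well.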

Quiet Mode

dift old.csv new.csv --quiet

Disable Colors

dift old.csv new.csv --no-color

DuckDB Example

dift duckdb:///warehouse.duckdb:customers_old \
     duckdb:///warehouse.duckdb:customers_new \
     --key customer_id

BigQuery Example

dift bigquery://analytics.sales.orders_old \
     bigquery://analytics.sales.orders_new \
     --key order_id

PostgreSQL Example

dift postgresql://user:password@localhost:5432/sales:customers_old \
     postgresql://user:password@localhost:5432/sales:customers_new \
     --key customer_id

MySQL Example

dift mysql+pymysql://user:password@localhost:3306/sales:orders_old \
     mysql+pymysql://user:password@localhost:3306/sales:orders_new \
     --key order_id

Snowflake Example

dift snowflake://user:password@account/db/schema?warehouse=compute_wh:orders_old \
     snowflake://user:password@account/db/schema?warehouse=compute_wh:orders_new \
     --key order_id

Redshift Example

dift redshift+redshift_connector://user:password@cluster.region.redshift.amazonaws.com:5439/dev:orders_old \
     redshift+redshift_connector://user:password@cluster.region.redshift.amazonaws.com:5439/dev:orders_new \
     --key order_id
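Most of the connector examples above follow the shape `<connection URL>:<table>`. The Python sketch below splits such a source string at the last colon. This parsing rule is inferred from the examples, not taken from Dift's code, and the function `split_source` is hypothetical. The BigQuery example uses a different shape, so the sketch does not handle it.

```python
def split_source(source):
    """Split '<connection URL>:<table>' at the last colon.

    Inferred format for illustration; raises ValueError when no table
    name follows the connection URL.
    """
    connection, separator, table = source.rpartition(":")
    # A "/" after the last colon means the colon belongs to the URL itself
    # (scheme or port), so no table name was given.
    if not separator or not table or "/" in table:
        raise ValueError(f"no table name found in {source!r}")
    return connection, table


print(split_source("duckdb:///warehouse.duckdb:customers_old"))
# ('duckdb:///warehouse.duckdb', 'customers_old')

print(split_source("postgresql://user:password@localhost:5432/sales:customers_old"))
# ('postgresql://user:password@localhost:5432/sales', 'customers_old')
```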

Common Use Cases

ETL Validation

dift before.csv after.csv

ML Dataset Drift Detection

dift train_v1.csv train_v2.csv --threshold 0.1

Production vs Staging Validation

dift prod.csv staging.csv --key id

Historical Monitoring

dift prod.csv staging.csv \
  --key customer_id \
  --history

Next Steps

Continue with:

  • Usage Guide
  • Reports
  • Configuration
  • Thresholds
  • Automation
  • Connectors
  • Developer Architecture