The ImageNet of data quality — the standard benchmark for data quality tools, covering detection, transformation, entity resolution, and pipeline orchestration.
Every data quality tool claims to be the best. But there's no standard way to compare them. DQBench fixes that with:
- Four benchmark categories — Detect, Transform, ER, and Pipeline
- Three difficulty tiers — basics, realistic, and adversarial
- Ground truth — every planted issue is documented with affected rows
- Fair scoring — recall AND precision matter (no gaming by flagging everything)
- One number — DQBench Score (0-100) for easy comparison
- 20-line integration — implement one method to benchmark any tool
The repo now also includes an OCR Company benchmark for post-OCR company-name confidence and correction quality.
```bash
pip install dqbench

# Run detection benchmark with GoldenCheck
pip install goldencheck
dqbench run goldencheck

# Run ER benchmark with GoldenMatch
pip install goldenmatch
dqbench run goldenmatch

# Run pipeline benchmark with GoldenPipe
pip install goldenpipe
dqbench run goldenpipe

# Run with a custom adapter
dqbench run --adapter my_adapter.py
```

| Category | What it measures | Example tools |
|---|---|---|
| Detect | Find data quality issues in a dataset | GoldenCheck, Great Expectations, Pandera, Soda Core |
| Transform | Clean, normalize, and repair data | GoldenFlow, dbt, pandas |
| ER | Entity resolution — deduplicate and link records | GoldenMatch, Splink, Dedupe |
| Pipeline | End-to-end pipeline orchestration and quality gates | GoldenPipe, Airflow, Prefect |
| OCR Company | OCR company-name confidence, review, and correction quality | OCR confidence/correction tools |
| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | Score |
|---|---|---|---|---|---|
| GoldenCheck | zero-config | 84.9% | 80.0% | 57.6% | 72.00 |
| Pandera | best-effort | 36.4% | 38.1% | 25.0% | 32.51 |
| Soda Core | best-effort | 38.1% | 23.5% | 13.3% | 22.36 |
| Great Expectations | best-effort | 36.4% | 23.5% | 12.5% | 21.68 |
| Great Expectations | auto-profiled | 22.2% | 42.1% | 0.0% | 21.29 |
| Soda Core | auto-profiled | 0.0% | 11.1% | 6.2% | 6.94 |
| All other tools | zero-config | 0.0% | 0.0% | 0.0% | 0.00 |

GoldenCheck's zero-config discovery outperforms every competitor's hand-written rules.
| Tool | Mode | T1 F1 | T2 F1 | T3 F1 | Score |
|---|---|---|---|---|---|
| GoldenMatch | with LLM | 92.6% | 97.8% | 94.1% | 95.30 |
| GoldenMatch | without LLM | — | — | — | 77.21 |
GoldenMatch with LLM achieves a 95.30 DQBench ER Score across all three tiers.
Cost estimate: ~$0.15-0.30 per full run (3 tiers) with LLM scoring. Without LLM: free, ~23s total. With LLM: ~$0.25, ~670s total. LLM scoring is optional and activates automatically when `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` is set.

External validation: GoldenMatch also scores 75.0% F1 on BPID (Amazon's adversarial PII deduplication benchmark, EMNLP 2024), matching Ditto (75.2%) with zero training data. See the benchmark writeup.
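The tier F1s above are pairwise scores over predicted duplicate pairs. A minimal sketch of how pairwise precision/recall/F1 can be computed from predicted and ground-truth pairs (an illustration, not DQBench's internal scorer — pair normalization and tie handling are assumptions here):

```python
def pairwise_f1(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> dict:
    """Pairwise precision/recall/F1 over unordered record-ID pairs."""
    def norm(pairs):
        # Treat (a, b) and (b, a) as the same pair
        return {tuple(sorted(p)) for p in pairs}

    pred, truth = norm(predicted), norm(gold)
    tp = len(pred & truth)  # correctly predicted duplicate pairs
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Flagging everything drives precision (and therefore F1) down, which is what makes the score hard to game.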
Run the comparisons yourself:

```bash
# Detect benchmark
pip install dqbench goldencheck great_expectations pandera soda-core
dqbench run all

# ER benchmark
pip install dqbench goldenmatch
dqbench run goldenmatch
```

| Tier | Rows | Columns | Domain | Difficulty |
|---|---|---|---|---|
| 1 — Basics | 5,000 | 20 | Customer DB | Obvious errors, baseline |
| 2 — Realistic | 50,000 | 30 | E-commerce | Subtle issues + false positive traps |
| 3 — Adversarial | 100,000 | 50 | Healthcare | Encoding traps, semantic errors, cross-column logic |
Each tier has columns WITH planted issues and columns WITHOUT (false positive traps). Tools that flag clean columns lose precision points.
| Metric | Description |
|---|---|
| Recall | % of planted-issue columns detected |
| Precision | % of flagged columns that actually have issues |
| F1 | Harmonic mean of recall and precision |
| FPR | Clean columns incorrectly flagged (WARNING/ERROR only) |
| DQBench Score | Tier1 F1 × 20% + Tier2 F1 × 40% + Tier3 F1 × 40% |
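Putting the metrics together, the weighted score is straightforward to reproduce — a sketch (the rounding in the published tables is ours to guess):

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; 0.0 when both are zero."""
    return 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0

def dqbench_score(t1_f1: float, t2_f1: float, t3_f1: float) -> float:
    """Weighted DQBench Score: tier 1 counts 20%, tiers 2 and 3 count 40% each."""
    return 0.2 * t1_f1 + 0.4 * t2_f1 + 0.4 * t3_f1

# GoldenCheck's published tier F1s (84.9 / 80.0 / 57.6) land within
# rounding of its 72.00 leaderboard score.
dqbench_score(84.9, 80.0, 57.6)
```

The heavier weight on tiers 2 and 3 means a tool cannot buy a high score by acing only the easy tier.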
Implement one class to benchmark any tool:

```python
from pathlib import Path

from dqbench.adapters.base import DQBenchAdapter
from dqbench.models import DQBenchFinding


class MyToolAdapter(DQBenchAdapter):
    @property
    def name(self) -> str:
        return "MyTool"

    @property
    def version(self) -> str:
        return "1.0.0"

    def validate(self, csv_path: Path) -> list[DQBenchFinding]:
        # Run your tool on the CSV and return a list of DQBenchFinding objects
        return [
            DQBenchFinding(
                column="email",
                severity="error",  # "error", "warning", or "info"
                check="format",    # what kind of issue
                message="Invalid email format",
                confidence=0.9,    # optional, 0.0-1.0
            )
        ]
```

Then run:

```bash
dqbench run --adapter my_adapter.py
```

To benchmark an entity resolution tool, implement the EntityResolutionAdapter interface:
```python
import polars as pl

from dqbench.adapters.er_base import EntityResolutionAdapter
from dqbench.models import ERPrediction


class MyERAdapter(EntityResolutionAdapter):
    @property
    def name(self) -> str:
        return "MyERTool"

    @property
    def version(self) -> str:
        return "1.0.0"

    def resolve(self, df: pl.DataFrame) -> list[ERPrediction]:
        # Given a DataFrame with potential duplicates,
        # return predicted duplicate pairs
        return [
            ERPrediction(
                record_id_a="row_001",
                record_id_b="row_042",
                confidence=0.95,  # 0.0-1.0
                match=True,       # True = predicted duplicate
            )
        ]
```

Then run:

```bash
dqbench run --adapter my_er_adapter.py
```

| Command | Description |
|---|---|
| `dqbench run <adapter>` | Run benchmark |
| `dqbench run --adapter <path>` | Run with custom adapter file |
| `dqbench run <adapter> --tier 2` | Run specific tier only |
| `dqbench run <adapter> --json` | JSON output |
| `dqbench run goldenmatch` | Run ER benchmark with GoldenMatch |
| `dqbench run goldenpipe` | Run Pipeline benchmark with GoldenPipe |
| `dqbench run placeholder --adapter <path>` | Run a custom OCR Company adapter |
| `dqbench generate` | Generate/cache detection datasets |
| `dqbench generate --er` | Generate ER benchmark datasets |
| `dqbench generate --pipeline` | Generate Pipeline benchmark datasets |
| `dqbench generate --ocr-company` | Generate OCR Company benchmark datasets |
| `dqbench generate --all` | Generate datasets for all categories |
| `dqbench generate --force` | Regenerate datasets |
| Category | Tiers | Tests | Description |
|---|---|---|---|
| Detect | 3 | 83 | Data quality issue detection |
| Transform | 3 | — | Data cleaning and normalization |
| ER | 3 | — | Entity resolution and deduplication |
| Pipeline | 3 | — | End-to-end pipeline orchestration |
| OCR Company | 3 | — | OCR company-name confidence and correction |
This benchmark measures company-name OCR scoring quality rather than generic data validation.
It ships with deterministic synthetic company OCR tiers and scores:
- confidence separation between clean and corrupted names
- false review rate on clean names
- review recall on corrupted names
- weakest-token hit rate
- correction exact-hit rate
- correction improvement rate
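Two of these metrics can be sketched with a simple threshold-based review policy — a hypothetical illustration (the threshold and the exact metric definitions DQBench uses are assumptions):

```python
def review_metrics(confidences: list[float], is_corrupted: list[bool],
                   review_threshold: float = 0.8) -> dict:
    """Flag names below the threshold for review, then score the flags.

    NOTE: illustrative sketch; DQBench's actual definitions may differ.
    """
    flagged = [c < review_threshold for c in confidences]
    clean = [f for f, bad in zip(flagged, is_corrupted) if not bad]
    corrupted = [f for f, bad in zip(flagged, is_corrupted) if bad]
    return {
        # false review rate: clean names wrongly sent to review
        "false_review_rate": sum(clean) / len(clean) if clean else 0.0,
        # review recall: corrupted names correctly sent to review
        "review_recall": sum(corrupted) / len(corrupted) if corrupted else 0.0,
    }
```

A good scorer keeps false review rate low on clean names while keeping review recall high on corrupted ones — the two pull against each other, which is what confidence separation measures.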
Generate OCR Company datasets:

```bash
dqbench generate --ocr-company
```

Run with a custom OCR Company adapter:

```bash
dqbench run placeholder --adapter examples/ocr_company_adapter.py
```

5 categories, 15 tiers, 164+ tests.
| Adapter | Tool | Category | Modes | Install |
|---|---|---|---|---|
| `goldencheck` | GoldenCheck | Detect | zero-config | `pip install goldencheck` |
| `gx-zero`, `gx-auto`, `gx-best` | Great Expectations | Detect | zero / auto / best-effort | `pip install great_expectations` |
| `pandera-zero`, `pandera-auto`, `pandera-best` | Pandera | Detect | zero / auto / best-effort | `pip install pandera` |
| `soda-zero`, `soda-auto`, `soda-best` | Soda Core | Detect | zero / auto / best-effort | `pip install soda-core` |
| `goldenmatch` | GoldenMatch | ER | with-LLM / without-LLM | `pip install goldenmatch` |
| `goldenpipe` | GoldenPipe | Pipeline | default | `pip install goldenpipe` |
Want to add your tool? See CONTRIBUTING.md.
- Datasets are generated deterministically (`random.seed(42)`, stdlib only)
- Canonical datasets committed as release artifacts
- Version-locked: published benchmark versions are immutable
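Seeded, stdlib-only generation of the kind described above can be sketched as (illustrative, not DQBench's actual generator):

```python
import random

def generate_rows(n: int, seed: int = 42) -> list[dict]:
    """Seeded stdlib-only generation: the same seed always yields the same rows."""
    rng = random.Random(seed)  # isolated RNG so other code can't disturb the stream
    return [{"id": i, "value": rng.randint(0, 999)} for i in range(n)]

# Two runs with the same seed produce identical datasets
assert generate_rows(5) == generate_rows(5)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the stream reproducible even if other code calls `random` in between.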
MIT
From the maker of GoldenCheck, GoldenMatch, GoldenFlow, and GoldenPipe.