Diff-Claim Audit — Reproducibility Artifact

Reviewer-runnable artifact for "When 'Where Do Two Models Differ' Does Not Reproduce: A Reliability Audit and 113 GB Benchmark for Black-Box LLM Divergence Claims" (ICDM 2026, Applied Track).

Main reliability results require no LLM API calls and no GPU. The full 113 GB response cache is hosted separately (see Data access below); nothing in this repository requires downloading it.

code/        Research pipeline: black-box divergence scoring + BO/Siamese search.
             Env-specific paths; needs the 113 GB cache + a GPU + env keys to
             RE-RUN from scratch. Released for transparency, not as the repro path.
data/        Expert taxonomies + fixed generation prompts (small, non-PHI):
               icd_blocks.csv                 ICD-10 chapters I–XXII (Tabular List)
               NGSS_Table_All_3C_utf8.csv     NGSS strands
               phrases_train_scenarios_only.json   fixed GPT-3.5 generation scenarios
tables/      Source-data CSV/JSON for every main reliability table/figure
             (each paper number is computed from a file here).
mini_repro/  Standalone CPU-only reproduction (NO cache, NO API). Runs in seconds.
technical_report/  Companion technical report (techreport.pdf + source): the full
             TV target-alignment and tree-Wasserstein = weighted-l1 derivations
             that the main paper defers.
CHECKSUMS_release.txt   SHA-256 of every file in this repo.
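
The release manifest can be checked before running anything. The sketch below assumes CHECKSUMS_release.txt uses the common `sha256sum` layout (hex digest, whitespace, relative path on each line), which this README does not confirm. `sha256_of` and `verify_manifest` are illustrative names, not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return the relative paths whose digest is missing or does not match.

    Assumption: sha256sum-style lines, "<hex digest>  <relative path>".
    """
    failures = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, rel_path = line.partition(" ")
        # sha256sum prefixes the path with "*" when it hashed in binary mode.
        rel_path = rel_path.strip().lstrip("*")
        target = Path(root) / rel_path
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

Calling `verify_manifest("CHECKSUMS_release.txt")` from the repository root returns an empty list when every listed file matches. If the manifest lists itself, that entry cannot match and should be ignored.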

Quick start (mini-reproduction, ~4 s on a laptop CPU)

cd mini_repro
pip install numpy pandas scipy        # only these three
bash run.sh                           # or: python3 reliability_validity_link.py

This reproduces the reliability ↔ validity coupling (Spearman ρ ≈ 0.99) on planted ground truth, which is the result behind the paper's positive control (§Discussion, R4). When split-half reliability is high, the planted truth is recovered (validity); when reliability is near chance, nothing is recovered. Compare your output to mini_repro/EXPECTED_OUTPUT.txt.
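
For readers who want the idea before running the script, here is a minimal sketch of a planted-truth experiment of this kind. It is not the contents of reliability_validity_link.py. The Gaussian noise model, the 22 categories, the noise grid, and every function name are assumptions made for illustration, and the printed value will not necessarily match EXPECTED_OUTPUT.txt.

```python
import numpy as np
from scipy.stats import spearmanr


def reliability_and_validity(truth, noise_sd, n_queries, rng):
    """One simulated audit: noisy per-query scores around a planted truth vector."""
    scores = truth + rng.normal(0.0, noise_sd, size=(n_queries, truth.size))
    order = rng.permutation(n_queries)
    half_a = scores[order[: n_queries // 2]].mean(axis=0)
    half_b = scores[order[n_queries // 2:]].mean(axis=0)
    reliability, _ = spearmanr(half_a, half_b)           # agreement between random halves
    validity, _ = spearmanr(scores.mean(axis=0), truth)  # agreement with the planted truth
    return reliability, validity


def coupling(seed=0, n_categories=22, n_queries=200, n_repeats=20):
    """Sweep the noise level and rank-correlate mean reliability with mean validity."""
    rng = np.random.default_rng(seed)
    truth = rng.normal(size=n_categories)        # hypothetical planted divergence profile
    noise_levels = np.geomspace(0.5, 30.0, 10)   # from nearly clean to mostly noise
    mean_rel, mean_val = [], []
    for noise_sd in noise_levels:
        runs = [reliability_and_validity(truth, noise_sd, n_queries, rng)
                for _ in range(n_repeats)]
        rel, val = np.mean(runs, axis=0)         # average over repeats at this noise level
        mean_rel.append(rel)
        mean_val.append(val)
    rho, _ = spearmanr(mean_rel, mean_val)
    return rho, mean_rel, mean_val


rho, rel, val = coupling()
print(f"reliability-validity Spearman rho = {rho:.2f}")
```

Both quantities fall as noise grows, so their rank correlation across noise levels is high. That is the qualitative pattern the mini-reproduction reports.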

Paper table/figure → source data (in `tables/`)

| Paper element | Source file(s) |
| --- | --- |
| Fig. 1 regime map / R1 dichotomy | `r1_testretest_summary.csv`, `g1_aggregate.csv` |
| R2 variance budget (≈97% noise) | `r2_observed_splithalf_icc5.csv` |
| R2 rank-1 / low-rank | `r3_lowrank.csv` |
| R3 chapter rank reliability (ρ ≈ 0.36) | `r10_chapter_rank.csv` |
| R3 metric robustness (Table III) | `p1_chapter_split_half_summary.csv`, `p1_anchor_certificate.json` |
| R4 criterion validity (MMLU/ARC) | `r_criterion_validity.csv` (with HuggingFace leaderboard provenance) |
| R4 budget–reliability law | `e2_budget_reliability_law_v2.csv`, `p0_budget_reliability_with_spearman_brown.csv` |
| Evidence Map / certificate | `p0_evidence_map_certificate.csv` |
| Controls (pos/neg) | `p0_real_positive_negative_controls.csv` |
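
The file name `p0_budget_reliability_with_spearman_brown.csv` suggests that the budget–reliability law is compared against the Spearman-Brown prediction. The sketch below shows only the standard textbook formula. How the paper applies it, and which columns the CSV contains, are not stated in this README, so the function names and the example numbers are illustrative assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the query budget is multiplied by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def split_half_to_full(split_half_r):
    """Step a split-half correlation up to full-length reliability (factor of 2)."""
    return spearman_brown(split_half_r, 2.0)


def budget_factor_for(current_r, target_r):
    """Budget multiplier needed to move from current_r to target_r (inverse formula)."""
    return target_r * (1.0 - current_r) / (current_r * (1.0 - target_r))


# Illustrative arithmetic only, not a number from the paper:
# a split-half correlation of 0.5 steps up to 2/3 at full length.
print(round(split_half_to_full(0.5), 3))
```

The inverse form is useful for planning. Under the formula's assumptions (parallel, equally noisy queries), it estimates how many times larger a query budget must be to reach a target reliability.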

Data access (113 GB cache)

The full cache (13 models × 2 taxonomies, with per-query logs) is available to reviewers and will be deposited with SHA-256 checksums. It is not required to verify any table above (the source-data CSVs already carry the computed values). To regenerate a SHA-256 manifest of the cache yourself:

find <cache>/ -name 'data*.pqt' -print0 | xargs -0 sha256sum > CHECKSUMS_cache.txt
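
On systems without `find`, `xargs`, or `sha256sum`, an equivalent manifest can be produced in Python. This is a sketch, not a script shipped in the repository. `write_cache_manifest` is a hypothetical name, and it emits lines in sorted path order, whereas `find` guarantees no particular order.

```python
import hashlib
from pathlib import Path


def write_cache_manifest(cache_dir, out_path="CHECKSUMS_cache.txt", pattern="data*.pqt"):
    """Write sha256sum-style lines for every file under cache_dir matching pattern.

    Returns the number of files hashed.
    """
    lines = []
    for path in sorted(Path(cache_dir).rglob(pattern)):
        if not path.is_file():
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            # Read in 1 MiB chunks so multi-gigabyte shards are never held in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        lines.append(f"{digest.hexdigest()}  {path.as_posix()}")
    Path(out_path).write_text("\n".join(lines) + ("\n" if lines else ""), encoding="utf-8")
    return len(lines)
```

Because the digests are the same either way, a manifest written like this can be compared with one produced by the shell command after sorting both files.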

Notes

code/configure.py reads OPENAI_API_KEY / HF_API_TOKEN from the environment; no secrets are committed.
Submission is single-blind, so this repository may be linked directly from the paper.
