Reviewer-runnable artifact for "When 'Where Do Two Models Differ' Does Not Reproduce: A Reliability Audit and 113 GB Benchmark for Black-Box LLM Divergence Claims" (ICDM 2026, Applied Track).
Main reliability results require no LLM API calls and no GPU. The full 113 GB response cache is hosted separately (see Data access below); nothing in this repository requires downloading it.
- `code/`: Research pipeline: black-box divergence scoring plus BO/Siamese search. Paths are environment-specific; re-running from scratch needs the 113 GB cache, a GPU, and environment keys. Released for transparency, not as the reproduction path.
- `data/`: Expert taxonomies and fixed generation prompts (small, non-PHI):
  - `icd_blocks.csv`: ICD-10 chapters I–XXII (Tabular List)
  - `NGSS_Table_All_3C_utf8.csv`: NGSS strands
  - `phrases_train_scenarios_only.json`: fixed GPT-3.5 generation scenarios
- `tables/`: Source-data CSV/JSON for every main reliability table and figure (each number in the paper is computed from a file here).
- `mini_repro/`: Standalone CPU-only reproduction (no cache, no API). Runs in seconds.
- `technical_report/`: Companion technical report (`techreport.pdf` plus source): the full TV target-alignment and tree-Wasserstein = weighted-l1 derivations that the main paper defers.
- `CHECKSUMS_release.txt`: SHA-256 of every file in this repo.
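The release manifest can be checked with a few lines of standard-library Python. The sketch below assumes `CHECKSUMS_release.txt` uses the `sha256sum` layout (`<hex digest>  <relative path>` per line); that layout is an assumption, not something this README confirms, and the helper names `sha256_of` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return (relative_path, reason) pairs for entries that fail verification.

    ASSUMPTION: each non-empty line is '<hex digest>  <relative path>'.
    """
    failures = []
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, _, rel = line.strip().partition(" ")
        rel = rel.strip().lstrip("*")  # sha256sum marks binary mode with '*'
        target = Path(root) / rel
        if not target.is_file():
            failures.append((rel, "missing"))
        elif sha256_of(target) != expected.lower():
            failures.append((rel, "mismatch"))
    return failures


if __name__ == "__main__":
    manifest = Path("CHECKSUMS_release.txt")
    if manifest.is_file():
        bad = verify_manifest(manifest)
        print("all files verified" if not bad else bad)
```

Run it from the repository root; an empty failure list means every listed file matched its recorded digest.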
cd mini_repro
pip install numpy pandas scipy # only these three
bash run.sh # or: python3 reliability_validity_link.py

The script reproduces the reliability ↔ validity coupling (Spearman ρ ≈ 0.99) on planted ground truth. This is the result behind the paper's positive control (§Discussion, R4): high split-half reliability recovers the planted truth (validity), whereas near-chance reliability recovers nothing. Compare your output to mini_repro/EXPECTED_OUTPUT.txt.
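To make the idea of a planted-truth control concrete, here is a minimal simulation sketch. It is not the shipped `reliability_validity_link.py` and will not reproduce the paper's number; the Gaussian noise model, the 22 categories, the query budget, and every function name are illustrative assumptions.

```python
import numpy as np
from scipy.stats import spearmanr


def split_half_and_validity(truth, noise_sd, n_queries, rng):
    """Return (split-half reliability, validity) for one simulated run."""
    # ASSUMPTION: each query observes the planted signal plus Gaussian noise.
    obs = truth[None, :] + rng.normal(0.0, noise_sd, size=(n_queries, truth.size))
    half = n_queries // 2
    first, second = obs[:half].mean(axis=0), obs[half:].mean(axis=0)
    reliability, _ = spearmanr(first, second)          # agreement between halves
    validity, _ = spearmanr(obs.mean(axis=0), truth)   # agreement with planted truth
    return reliability, validity


def coupling(noise_levels, n_categories=22, n_queries=200, n_reps=20, seed=0):
    """Average reliability and validity per noise level, then correlate them."""
    rng = np.random.default_rng(seed)
    truth = rng.normal(size=n_categories)  # planted ground truth
    rel, val = [], []
    for sd in noise_levels:
        runs = [split_half_and_validity(truth, sd, n_queries, rng)
                for _ in range(n_reps)]
        r, v = np.mean(runs, axis=0)
        rel.append(r)
        val.append(v)
    rho, _ = spearmanr(rel, val)
    return np.array(rel), np.array(val), rho


if __name__ == "__main__":
    levels = np.geomspace(0.5, 200.0, 8)  # low noise to near-pure noise
    rel, val, rho = coupling(levels)
    for sd, r, v in zip(levels, rel, val):
        print(f"noise_sd={sd:7.2f}  reliability={r:+.2f}  validity={v:+.2f}")
    print(f"Spearman rho(reliability, validity) = {rho:.2f}")
```

At low noise both quantities sit near 1; as noise grows they fall together, which is the coupling the positive control relies on.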
| Paper element | Source file(s) |
|---|---|
| Fig. 1 regime map / R1 dichotomy | r1_testretest_summary.csv, g1_aggregate.csv |
| R2 variance budget (≈97 % noise) | r2_observed_splithalf_icc5.csv |
| R2 rank-1 / low-rank | r3_lowrank.csv |
| R3 chapter rank reliability (ρ≈0.36) | r10_chapter_rank.csv |
| R3 metric robustness (Table III) | p1_chapter_split_half_summary.csv, p1_anchor_certificate.json |
| R4 criterion validity (MMLU/ARC) | r_criterion_validity.csv (with HuggingFace leaderboard provenance) |
| R4 budget–reliability law | e2_budget_reliability_law_v2.csv, p0_budget_reliability_with_spearman_brown.csv |
| Evidence Map / certificate | p0_evidence_map_certificate.csv |
| Controls (pos/neg) | p0_real_positive_negative_controls.csv |
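One of the budget–reliability source files is named after the Spearman-Brown correction. As background, the standard Spearman-Brown prophecy formula predicts reliability after lengthening a measurement by a factor k: r_k = k·r / (1 + (k − 1)·r). The sketch below implements that textbook formula and its inverse; whether the paper's budget–reliability law takes exactly this form is not stated in this README, and the function names are hypothetical.

```python
def spearman_brown(r, k):
    """Predicted reliability when the measurement is lengthened by factor k."""
    return k * r / (1.0 + (k - 1.0) * r)


def budget_factor_for(r, target):
    """Lengthening factor k needed to raise reliability r to `target`.

    Obtained by solving target = k*r / (1 + (k - 1)*r) for k.
    """
    if not (0.0 < r < 1.0 and 0.0 < target < 1.0):
        raise ValueError("r and target must lie strictly between 0 and 1")
    return target * (1.0 - r) / (r * (1.0 - target))


if __name__ == "__main__":
    r = 0.5  # illustrative value only, not taken from the paper
    print(spearman_brown(r, 2))       # doubling the budget
    print(budget_factor_for(r, 0.9))  # budget multiple needed to reach 0.9
```

With the illustrative r = 0.5, doubling the budget gives 2/3, and reaching 0.9 needs a ninefold budget, which shows how quickly the required budget grows as the target approaches 1.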
The full cache (13 models × 2 taxonomies, with per-query logs) is available to
reviewers and will be deposited with SHA-256 checksums. It is not required to
verify any table above (the source-data CSVs already carry the computed values).
To regenerate a SHA-256 manifest of the cache yourself:
find <cache>/ -name 'data*.pqt' -print0 | xargs -0 sha256sum > CHECKSUMS_cache.txt
- `code/configure.py` reads `OPENAI_API_KEY` / `HF_API_TOKEN` from the environment; no secrets are committed.
- Submission is single-blind, so this repository may be linked directly from the paper.