Elina73/diffclaim-audit

Diff-Claim Audit — Reproducibility Artifact

Reviewer-runnable artifact for "When 'Where Do Two Models Differ' Does Not Reproduce: A Reliability Audit and 113 GB Benchmark for Black-Box LLM Divergence Claims" (ICDM 2026, Applied Track).

Main reliability results require no LLM API calls and no GPU. The full 113 GB response cache is hosted separately (see Data access below); nothing in this repository requires downloading it.

Contents

code/        Research pipeline: black-box divergence scoring + BO/Siamese search.
             Env-specific paths; needs the 113 GB cache + a GPU + env keys to
             RE-RUN from scratch. Released for transparency, not as the repro path.
data/        Expert taxonomies + fixed generation prompts (small, non-PHI):
               icd_blocks.csv                 ICD-10 chapters I–XXII (Tabular List)
               NGSS_Table_All_3C_utf8.csv     NGSS strands
               phrases_train_scenarios_only.json   fixed GPT-3.5 generation scenarios
tables/      Source-data CSV/JSON for every main reliability table/figure
             (each paper number is computed from a file here).
mini_repro/  Standalone CPU-only reproduction (NO cache, NO API). Runs in seconds.
technical_report/  Companion technical report (techreport.pdf + source): the full
             TV target-alignment and tree-Wasserstein = weighted-l1 derivations
             that the main paper defers.
CHECKSUMS_release.txt   SHA-256 of every file in this repo.
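
To check a downloaded copy against CHECKSUMS_release.txt, a short script along the following lines could be used. It is a minimal sketch, not part of the artifact: it assumes the manifest uses the usual `sha256sum` layout (`<hex digest>  <relative path>` per line), and the helper names `sha256_of` and `verify_manifest` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return (relative path, reason) pairs for entries that fail to verify.

    ASSUMPTION: each non-empty line is '<64 hex chars>  <relative path>',
    i.e. the format written by `sha256sum`.
    """
    failures = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.strip().lstrip("*")  # '*' marks binary mode in sha256sum output
        target = Path(root) / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif sha256_of(target) != expected.lower():
            failures.append((name, "mismatch"))
    return failures


if __name__ == "__main__":
    manifest = Path("CHECKSUMS_release.txt")
    if manifest.is_file():
        for name, reason in verify_manifest(manifest):
            print(f"{reason}: {name}")
```

Run from the repository root, the script prints one line per missing or mismatching file and nothing if every entry verifies.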

Quick start (mini-reproduction, ~4 s on a laptop CPU)

cd mini_repro
pip install numpy pandas scipy        # only these three
bash run.sh                           # or: python3 reliability_validity_link.py

The script reproduces the reliability ↔ validity coupling (Spearman ρ ≈ 0.99) on planted ground truth. This is the result behind the paper's positive control (§Discussion, R4): high split-half reliability recovers the planted truth (validity), whereas near-chance reliability recovers nothing. Compare your output with mini_repro/EXPECTED_OUTPUT.txt.
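
For intuition about what a reliability ↔ validity coupling on planted ground truth means, the toy simulation below may help. It is an illustrative sketch only, not the code in mini_repro/ and not the paper's setup: the Gaussian truth, the noise levels, the item count, and every function name are assumptions chosen for the example. It uses only the standard library.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """0-based ranks; ties are not averaged (fine for continuous data)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for rank, idx in enumerate(order):
        out[idx] = float(rank)
    return out


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def simulate(noise_levels, n_items=2000, seed=0):
    """Return (reliabilities, validities), one entry per noise level."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(n_items)]  # planted ground truth
    reliabilities, validities = [], []
    for sigma in noise_levels:
        half_a = [t + rng.gauss(0.0, sigma) for t in truth]
        half_b = [t + rng.gauss(0.0, sigma) for t in truth]
        pooled = [(a + b) / 2.0 for a, b in zip(half_a, half_b)]
        reliabilities.append(pearson(half_a, half_b))  # split-half agreement
        validities.append(pearson(pooled, truth))      # recovery of planted truth
    return reliabilities, validities


if __name__ == "__main__":
    rel, val = simulate([0.25, 0.5, 1.0, 2.0, 4.0])
    for r, v in zip(rel, val):
        print(f"reliability={r:.3f}  validity={v:.3f}")
    print(f"Spearman rho across noise levels: {spearman(rel, val):.3f}")
```

As the noise grows, the two halves agree less with each other and the pooled estimate tracks the planted truth less well, so the two quantities move together across conditions.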

Paper table/figure → source data (in tables/)

| Paper element | Source file(s) |
| --- | --- |
| Fig. 1 regime map / R1 dichotomy | r1_testretest_summary.csv, g1_aggregate.csv |
| R2 variance budget (≈97 % noise) | r2_observed_splithalf_icc5.csv |
| R2 rank-1 / low-rank | r3_lowrank.csv |
| R3 chapter rank reliability (ρ≈0.36) | r10_chapter_rank.csv |
| R3 metric robustness (Table III) | p1_chapter_split_half_summary.csv, p1_anchor_certificate.json |
| R4 criterion validity (MMLU/ARC) | r_criterion_validity.csv (with HuggingFace leaderboard provenance) |
| R4 budget–reliability law | e2_budget_reliability_law_v2.csv, p0_budget_reliability_with_spearman_brown.csv |
| Evidence Map / certificate | p0_evidence_map_certificate.csv |
| Controls (pos/neg) | p0_real_positive_negative_controls.csv |
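
One of the budget–reliability files is named after the Spearman-Brown formula. For readers unfamiliar with it, the classical prediction is sketched below. This is the textbook formula only; how the paper applies it is not stated in this README, and the function names and input values are hypothetical.

```python
def spearman_brown(reliability, factor):
    """Predicted reliability after multiplying the measurement budget by `factor`.

    Classical Spearman-Brown formula: k * r / (1 + (k - 1) * r).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return factor * reliability / (1.0 + (factor - 1.0) * reliability)


def factor_needed(reliability, target):
    """Budget multiplier needed to move `reliability` up to `target` (inverse formula)."""
    if not (0.0 < reliability < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliability and target must lie strictly between 0 and 1")
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


if __name__ == "__main__":
    # Illustrative inputs only; these are not values taken from the paper.
    print(spearman_brown(0.5, 2))   # doubling the budget at reliability 0.5
    print(factor_needed(0.5, 0.8))  # multiplier needed to reach 0.8 from 0.5
```

Under this formula, doubling the budget at a reliability of 0.5 gives 2/3, and reaching 0.8 from 0.5 takes four times the budget.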

Data access (113 GB cache)

The full cache (13 models × 2 taxonomies, with per-query logs) is available to reviewers and will be deposited with SHA-256 checksums. It is not required to verify any table above, because the source-data CSVs already carry the computed values. To regenerate a SHA-256 manifest of the cache yourself:

find <cache>/ -name 'data*.pqt' -print0 | xargs -0 sha256sum > CHECKSUMS_cache.txt
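
Where `find`, `xargs -0` or `sha256sum` are unavailable (for example on Windows), the manifest command above can be approximated in Python. The sketch below is an assumption-laden equivalent rather than part of the artifact: `build_manifest` and `write_manifest` are hypothetical names, and unlike `find` it emits paths in sorted order.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(cache_root, pattern="data*.pqt"):
    """Return sha256sum-style lines for files under cache_root whose name matches pattern."""
    lines = []
    for path in sorted(Path(cache_root).rglob(pattern)):
        if path.is_file():
            lines.append(f"{sha256_of(path)}  {path.as_posix()}")
    return lines


def write_manifest(cache_root, out_path="CHECKSUMS_cache.txt"):
    """Write the manifest to out_path and return the number of files hashed."""
    lines = build_manifest(cache_root)
    Path(out_path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Calling `write_manifest("<cache>")` with the actual cache location produces a file in the same `<digest>  <path>` layout that `sha256sum` writes.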

Notes

  • code/configure.py reads OPENAI_API_KEY / HF_API_TOKEN from the environment; no secrets are committed.
  • Submission is single-blind, so this repository may be linked directly from the paper.
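
For the first note above: before attempting a full re-run, it may be convenient to confirm that both environment variables are set. The helper below is a hypothetical pre-flight check, not the contents of code/configure.py; it assumes both keys are needed, which this README does not state explicitly.

```python
import os

# ASSUMPTION: both variables named in the note are required for a full re-run.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_API_TOKEN")


def missing_keys(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [key for key in REQUIRED_KEYS if not env.get(key)]


if __name__ == "__main__":
    absent = missing_keys()
    if absent:
        print("Missing environment variables: " + ", ".join(absent))
    else:
        print("All required environment variables are set.")
```

The check reports only variable names, never their values, so its output is safe to paste into a bug report.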
