This v2 supplementary supersedes the v1 release of 2026-04-29, adding two audit appendices that strengthen the main paper's self-evaluation disclosure.
| Path | Description |
|---|---|
| `reflexbench.jsonl` | All 80 evaluation prompts (20 scenarios × 4 OD levels) |
| `scenarios/` | Per-scenario JSON descriptions, organised by domain |
| `responses/` | Raw model responses for the 9 evaluated public LLMs |
| `scoring/` | LLM-as-judge scoring scripts and rubric definitions |
| `ablations/moe_vs_dense/` | MoE vs. dense ablation scores (App. A of the main paper) |
| `audit/` [NEW] | SHA-256 disjointness check between the ReflexBench scenarios and the Trained MoE training corpus (supports App. C; a verification sketch follows this table) |
| `judge_robustness/` [NEW] | Pairwise Cohen's kappa for the 3 judges, Fleiss' kappa, and a per-judge bias permutation test (supports App. B) |
| Claim | Path |
|---|---|
| Trained MoE arm did not see ReflexBench at training time | `audit/scenario_corpus_disjoint.json` |
| LLM judge agreement is substantial (Fleiss' kappa = 0.69; computation sketched below) | `judge_robustness/inter_judge_kappa.json` |
| 9-model ranking is judge-independent (Kendall's tau >= 0.86) | `judge_robustness/inter_judge_kappa.json` |
| MoE provides an architectural premium over the dense baseline | `ablations/moe_vs_dense/moe_vs_dense_scores.csv` |
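For reference when reading `judge_robustness/inter_judge_kappa.json`, a minimal sketch of the two agreement statistics above: Fleiss' kappa over an items-by-categories count matrix, and Kendall's tau between two judges' model rankings. The toy data are illustrative only, not drawn from this release.

```python
import numpy as np
from scipy.stats import kendalltau

def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count matrix.

    counts[i, j] = number of judges assigning item i to category j;
    every row must sum to the same number of judges n.
    """
    n_items, _ = counts.shape
    n_judges = counts.sum(axis=1)[0]
    # Per-item agreement: fraction of judge pairs that agree on the item.
    p_i = (np.square(counts).sum(axis=1) - n_judges) / (n_judges * (n_judges - 1))
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_judges)
    p_e = np.square(p_j).sum()
    return (p_i.mean() - p_e) / (1 - p_e)

# Toy data: 5 items scored by 3 judges into 3 categories (illustrative only).
counts = np.array([[3, 0, 0],
                   [0, 3, 0],
                   [2, 1, 0],
                   [0, 2, 1],
                   [0, 0, 3]])
print(round(fleiss_kappa(counts), 3))  # 0.595 on this toy matrix

# Rank stability: Kendall's tau between two judges' rankings of the 9 models.
judge_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical ranking from judge A
judge_b = [1, 3, 2, 4, 5, 7, 6, 8, 9]  # judge B swaps two adjacent pairs
tau, _ = kendalltau(judge_a, judge_b)
print(round(tau, 3))  # 0.889: two discordant pairs out of 36
```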
This supplementary contains no author identifiers: all file paths are relative, and no usernames, machine names, or training-cluster identifiers are present. The anonymised repository is available at https://anonymous.4open.science/r/reflexbench-24E0