Publish incumbent-comparison scorecards (schema + pointer + validator) by PavelMakarchuk · Pull Request #4 · PolicyEngine/populace-benchmarks

PavelMakarchuk · 2026-06-15T14:53:55Z

Fixes #3.

What

Gives the US incumbent comparison a machine-readable, validated scorecard artifact so downstream consumers (the calibration-diagnostics dashboard) can read the populace-vs-enhanced-CPS head-to-head without re-running the harness.

- archive/us/populace-us-2024-9f1260b-20260611/scorecard.json: the first published scorecard, status: archived (reconstructed from that release's sound_ecps_replacement_comparison.json, since the comparison was correctly dropped from live populace in populace#37). Carries the promotion metrics (full/holdout loss, unweighted MSRE), per-target win/loss/tie (populace 1,040 vs eCPS 2,613 of 3,704), per-family loss breakdown, and top movers.
- benchmarks/us/incumbent-comparison/latest.json: pointer to the current scorecard (scorecard_path + candidate_release_id), so consumers don't hard-code a path (mirrors the populace#9 latest.json pattern).
- scorecard.schema.json: the published contract (JSON Schema draft-07).
- tools/validate_scorecard.py: stdlib-only validator covering required keys, promotion-metric types, and the consistency the schema can't express: win counts sum to the target count, and candidate_beats_baseline agrees with the loss values. Verified that it rejects both kinds of inconsistency.
- CI (.github/workflows/validate-scorecards.yml) runs the validator on every PR; this is the repo's first gate.
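
For readers who want a concrete picture of the two cross-field checks, here is a minimal sketch. It is not the code in tools/validate_scorecard.py: the key names `per_target`, `candidate_wins`, `baseline_wins`, `ties`, `n_targets`, `promotion_metrics`, `candidate_full_loss` and `baseline_full_loss` are hypothetical, and only `candidate_beats_baseline` is taken from the description above. It also assumes that a lower loss is better and that wins, losses and ties partition the targets.

```python
from typing import Any


def consistency_errors(scorecard: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means both checks pass."""
    errors: list[str] = []

    # Check 1: wins + losses + ties must account for every target.
    counts = scorecard["per_target"]  # hypothetical key name
    total = counts["candidate_wins"] + counts["baseline_wins"] + counts["ties"]
    if total != counts["n_targets"]:
        errors.append(
            f"win/loss/tie counts sum to {total}, expected {counts['n_targets']}"
        )

    # Check 2: the boolean verdict must agree with the loss values.
    # Assumption: lower loss is better, so the candidate "beats" the
    # baseline exactly when its loss is strictly smaller.
    metrics = scorecard["promotion_metrics"]  # hypothetical key name
    expected = metrics["candidate_full_loss"] < metrics["baseline_full_loss"]
    if metrics["candidate_beats_baseline"] != expected:
        errors.append(
            f"candidate_beats_baseline is {metrics['candidate_beats_baseline']}, "
            f"but the losses imply {expected}"
        )

    return errors


# Win counts are the ones quoted above; the tie count is derived as
# 3704 - 1040 - 2613 under the partition assumption. The loss values
# are placeholders for illustration, not figures from the scorecard.
example = {
    "per_target": {
        "candidate_wins": 1040,
        "baseline_wins": 2613,
        "ties": 51,
        "n_targets": 3704,
    },
    "promotion_metrics": {
        "candidate_full_loss": 1.0,
        "baseline_full_loss": 2.0,
        "candidate_beats_baseline": True,
    },
}
print(consistency_errors(example))
```

Returning a list of messages rather than raising on the first failure lets a CI run report every inconsistency in one pass.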

Boundary

Respects the repo's rules: no candidate is built or discovered from a working dir; the scorecard is a committed result keyed to the certified pinned-production-ecps-2024 incumbent, and archived status is explicit so it isn't mistaken for a fresh promotion-valid run.

Validation

$ python tools/validate_scorecard.py
OK: 2 file(s) valid (1 scorecard(s), 1 pointer(s)).

Consumer

The dashboard's incumbent-comparison view reads this artifact (a companion PR points its live fetch at latest.json, falling back to its committed snapshot until this merges).
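
A consumer's pointer resolution with a snapshot fallback could look roughly like the sketch below. This is not the dashboard's code: the function name `load_current_scorecard` and the `snapshot` parameter are hypothetical, it reads from a local checkout rather than fetching over HTTP, and it assumes `scorecard_path` is relative to the repository root.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Pointer location as described in this PR.
POINTER = Path("benchmarks/us/incumbent-comparison/latest.json")


def load_current_scorecard(
    repo_root: Path, snapshot: Optional[Path] = None
) -> dict[str, Any]:
    """Follow latest.json to the current scorecard.

    If the pointer or its target is missing or malformed, fall back to a
    committed snapshot when one is given; otherwise re-raise the error.
    """
    try:
        pointer = json.loads((repo_root / POINTER).read_text())
        # Assumption: scorecard_path is relative to the repository root.
        scorecard_file = repo_root / pointer["scorecard_path"]
        return json.loads(scorecard_file.read_text())
    except (OSError, KeyError, json.JSONDecodeError):
        if snapshot is None:
            raise
        return json.loads(snapshot.read_text())
```

Because the caller only knows the pointer path, a later scorecard can be published by updating latest.json without any change on the consumer side.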

🤖 Generated with Claude Code

Closes the gap that left the US incumbent comparison without a machine-readable artifact: the per-target populace-vs-enhanced-CPS result only survived on the older populace-us releases and was dropped from live populace in PolicyEngine/populace#37.

- archive/us/populace-us-2024-9f1260b-20260611/scorecard.json: the first published scorecard (status "archived"), reconstructed from that release's sound_ecps_replacement_comparison.json. Carries promotion metrics (full/holdout loss, unweighted MSRE), per-target win/loss/tie, per-family loss breakdown, top movers.
- benchmarks/us/incumbent-comparison/latest.json: pointer to the most recent scorecard, so consumers resolve it without hard-coding a path.
- scorecard.schema.json: the published contract (draft-07).
- tools/validate_scorecard.py: stdlib validator covering required keys, promotion metrics, and consistency the schema can't express (win counts sum to the target count; candidate_beats_baseline agrees with the losses). Verified it rejects both inconsistencies.
- CI runs the validator on every PR.

Fixes #3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

PavelMakarchuk requested a review from MaxGhenis June 15, 2026 14:53

PavelMakarchuk wants to merge 1 commit into main from publish-incumbent-scorecard
