Publish incumbent-comparison scorecards (schema + pointer + validator)#4

Draft
PavelMakarchuk wants to merge 1 commit into main from publish-incumbent-scorecard
@PavelMakarchuk

Fixes #3.

What

Gives the US incumbent comparison a machine-readable, validated scorecard artifact so downstream consumers (the calibration-diagnostics dashboard) can read the populace-vs-enhanced-CPS head-to-head without re-running the harness.

  • archive/us/populace-us-2024-9f1260b-20260611/scorecard.json — the first published scorecard, status: archived (reconstructed from that release's sound_ecps_replacement_comparison.json, since the comparison was correctly dropped from live populace in populace#37). Carries the promotion metrics (full/holdout loss, unweighted MSRE), per-target win/loss/tie (populace 1,040 vs eCPS 2,613 of 3,704), per-family loss breakdown, and top movers.
  • benchmarks/us/incumbent-comparison/latest.json — pointer to the current scorecard (scorecard_path + candidate_release_id), so consumers don't hard-code a path (mirrors the populace#9 latest.json pattern).
  • scorecard.schema.json — the published contract (JSON Schema draft-07).
  • tools/validate_scorecard.py — stdlib-only validator: required keys, promotion-metric types, and the consistency the schema can't express — win counts sum to the target count, and candidate_beats_baseline agrees with the loss values. Verified it rejects both kinds of inconsistency.
  • CI (.github/workflows/validate-scorecards.yml) runs the validator on every PR — the repo's first gate.
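The two cross-field checks described above can be sketched as follows. This is a minimal illustration, not the contents of tools/validate_scorecard.py: the key names `per_target`, `candidate_wins`, `baseline_wins`, `ties`, `n_targets`, `promotion_metrics`, `candidate_full_loss`, and `baseline_full_loss` are hypothetical (only `candidate_beats_baseline` appears in this PR), and it assumes that lower loss is better. The win counts come from the description; the tie count of 51 is derived from them (3,704 - 1,040 - 2,613), and the loss values are placeholders.

```python
def check_consistency(scorecard: dict) -> list[str]:
    """Return human-readable problems; an empty list means consistent."""
    problems = []

    # Check 1: win/loss/tie counts must add up to the number of targets.
    counts = scorecard["per_target"]  # hypothetical key names throughout
    total = counts["candidate_wins"] + counts["baseline_wins"] + counts["ties"]
    if total != counts["n_targets"]:
        problems.append(
            f"win/loss/tie counts sum to {total}, expected {counts['n_targets']}"
        )

    # Check 2: the boolean flag must agree with the loss values.
    # Assumption: the candidate "beats" the baseline when its loss is lower.
    metrics = scorecard["promotion_metrics"]
    expected = metrics["candidate_full_loss"] < metrics["baseline_full_loss"]
    if metrics["candidate_beats_baseline"] != expected:
        problems.append(
            f"candidate_beats_baseline is {metrics['candidate_beats_baseline']}, "
            f"but the loss values imply {expected}"
        )

    return problems


example = {
    "per_target": {
        "candidate_wins": 1040,  # populace wins, from the PR description
        "baseline_wins": 2613,   # eCPS wins, from the PR description
        "ties": 51,              # derived: 3704 - 1040 - 2613
        "n_targets": 3704,
    },
    "promotion_metrics": {
        "candidate_full_loss": 0.2,  # placeholder, not a real result
        "baseline_full_loss": 0.1,   # placeholder, not a real result
        "candidate_beats_baseline": False,
    },
}

print(check_consistency(example))  # an empty list means no problems found
```

Returning a list of problems rather than raising on the first one lets a validator report every inconsistency in a file in a single CI run.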

Boundary

Respects the repo's rules: no candidate is built or discovered from a working dir; the scorecard is a committed result keyed to the certified pinned-production-ecps-2024 incumbent, and archived status is explicit so it isn't mistaken for a fresh promotion-valid run.

Validation

$ python tools/validate_scorecard.py
OK: 2 file(s) valid (1 scorecard(s), 1 pointer(s)).

Consumer

The dashboard's incumbent-comparison view reads this artifact (a companion PR points its live fetch at latest.json, falling back to its committed snapshot until this merges).
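A consumer's pointer-then-fallback lookup might look like the sketch below. It makes several assumptions that this PR does not confirm: it reads from a local checkout rather than performing the dashboard's live fetch, it treats `scorecard_path` as relative to the repository root, and the function name `load_scorecard` and the `snapshot` argument are hypothetical.

```python
import json
from pathlib import Path


def load_scorecard(repo_root: Path, snapshot: Path) -> dict:
    """Resolve the current scorecard via latest.json, else use a snapshot."""
    pointer_path = (
        repo_root / "benchmarks" / "us" / "incumbent-comparison" / "latest.json"
    )
    try:
        pointer = json.loads(pointer_path.read_text(encoding="utf-8"))
        # Assumption: scorecard_path is relative to the repository root.
        scorecard_path = repo_root / pointer["scorecard_path"]
        return json.loads(scorecard_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError, KeyError):
        # Pointer or scorecard missing or malformed: fall back to the
        # consumer's own committed snapshot.
        return json.loads(snapshot.read_text(encoding="utf-8"))
```

Because the path is read from the pointer at load time, publishing a newer scorecard only requires updating latest.json; the consumer code does not change.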

🤖 Generated with Claude Code

Closes the gap that left the US incumbent comparison without a
machine-readable artifact: the per-target populace-vs-enhanced-CPS result
only survived on the older populace-us releases and was dropped from live
populace in PolicyEngine/populace#37.

- archive/us/populace-us-2024-9f1260b-20260611/scorecard.json: the first
  published scorecard (status "archived"), reconstructed from that release's
  sound_ecps_replacement_comparison.json — promotion metrics (full/holdout
  loss, unweighted MSRE), per-target win/loss/tie, per-family loss
  breakdown, top movers.
- benchmarks/us/incumbent-comparison/latest.json: pointer to the most
  recent scorecard, so consumers resolve it without hard-coding a path.
- scorecard.schema.json: the published contract (draft-07).
- tools/validate_scorecard.py: stdlib validator — required keys, promotion
  metrics, and consistency the schema can't express (win counts sum to the
  target count; candidate_beats_baseline agrees with the losses). Verified
  it rejects both inconsistencies.
- CI runs the validator on every PR.

Fixes #3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@PavelMakarchuk PavelMakarchuk requested a review from MaxGhenis June 15, 2026 14:53

Development

Successfully merging this pull request may close these issues.

Publish a machine-readable incumbent-comparison scorecard JSON at a stable path