Publish incumbent-comparison scorecards (schema + pointer + validator) #4
Draft
PavelMakarchuk wants to merge 1 commit into
Closes the gap that left the US incumbent comparison without a machine-readable artifact: the per-target populace-vs-enhanced-CPS result only survived on the older populace-us releases and was dropped from live populace in PolicyEngine/populace#37.

- archive/us/populace-us-2024-9f1260b-20260611/scorecard.json: the first published scorecard (status "archived"), reconstructed from that release's sound_ecps_replacement_comparison.json. Contains promotion metrics (full/holdout loss, unweighted MSRE), per-target win/loss/tie, per-family loss breakdown, and top movers.
- benchmarks/us/incumbent-comparison/latest.json: pointer to the most recent scorecard, so consumers resolve it without hard-coding a path.
- scorecard.schema.json: the published contract (draft-07).
- tools/validate_scorecard.py: stdlib validator covering required keys, promotion metrics, and consistency the schema can't express (win counts sum to the target count; candidate_beats_baseline agrees with the losses). Verified it rejects both inconsistencies.
- CI runs the validator on every PR.

Fixes #3.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Fixes #3.
What
Gives the US incumbent comparison a machine-readable, validated scorecard artifact so downstream consumers (the calibration-diagnostics dashboard) can read the populace-vs-enhanced-CPS head-to-head without re-running the harness.
- archive/us/populace-us-2024-9f1260b-20260611/scorecard.json: the first published scorecard, with status: archived (reconstructed from that release's sound_ecps_replacement_comparison.json, since the comparison was correctly dropped from live populace in populace#37). Carries the promotion metrics (full/holdout loss, unweighted MSRE), per-target win/loss/tie (populace 1,040 vs eCPS 2,613 of 3,704), per-family loss breakdown, and top movers.
- benchmarks/us/incumbent-comparison/latest.json: pointer to the current scorecard (scorecard_path + candidate_release_id), so consumers don't hard-code a path (mirrors the populace#9 latest.json pattern).
- scorecard.schema.json: the published contract (JSON Schema draft-07).
- tools/validate_scorecard.py: stdlib-only validator covering required keys, promotion-metric types, and the consistency the schema can't express: win counts sum to the target count, and candidate_beats_baseline agrees with the loss values. Verified it rejects both kinds of inconsistency.
- CI (.github/workflows/validate-scorecards.yml) runs the validator on every PR, making it the repo's first gate.

Boundary
Respects the repo's rules: no candidate is built or discovered from a working dir; the scorecard is a committed result keyed to the certified pinned-production-ecps-2024 incumbent, and archived status is explicit so it isn't mistaken for a fresh promotion-valid run.
Validation
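As an illustration only, the two cross-field checks the description attributes to the validator (win counts summing to the target count, and candidate_beats_baseline agreeing with the loss values) could be sketched as below. This is not the code in tools/validate_scorecard.py. Apart from candidate_beats_baseline, every key name here (per_target, n_targets, candidate_full_loss, and so on) is a hypothetical stand-in, the numbers are synthetic, and the sketch assumes "beats" means a strictly lower full loss.

```python
def check_consistency(scorecard: dict) -> list:
    """Return a list of problems; an empty list means the scorecard is consistent.

    Key names are hypothetical; the real schema may differ.
    """
    errors = []

    # Check 1: wins + losses + ties must account for every target.
    counts = scorecard["per_target"]
    decided = counts["candidate_wins"] + counts["baseline_wins"] + counts["ties"]
    if decided != counts["n_targets"]:
        errors.append(
            f"win/loss/tie counts sum to {decided}, expected {counts['n_targets']}"
        )

    # Check 2: the boolean flag must match what the losses imply.
    # Assumption: lower loss is better and a tie does not count as beating.
    metrics = scorecard["promotion_metrics"]
    implied = metrics["candidate_full_loss"] < metrics["baseline_full_loss"]
    if metrics["candidate_beats_baseline"] != implied:
        errors.append(
            f"candidate_beats_baseline is {metrics['candidate_beats_baseline']}, "
            f"but the loss values imply {implied}"
        )

    return errors


# Synthetic example data, not taken from any published scorecard.
example = {
    "per_target": {"candidate_wins": 3, "baseline_wins": 6, "ties": 1, "n_targets": 10},
    "promotion_metrics": {
        "candidate_full_loss": 0.42,
        "baseline_full_loss": 0.57,
        "candidate_beats_baseline": True,
    },
}

print(check_consistency(example))  # → []
```

Neither rule can be written in JSON Schema draft-07, which has no way to compare or sum sibling values, so they have to live in code rather than in scorecard.schema.json.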
Consumer
The dashboard's incumbent-comparison view reads this artifact (a companion PR points its live fetch at latest.json, falling back to its committed snapshot until this merges).
🤖 Generated with Claude Code
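For illustration, a consumer could resolve the scorecard through the pointer roughly as follows, with a fallback to a committed snapshot. This is a minimal sketch, not the dashboard's actual code. The pointer keys scorecard_path and candidate_release_id come from the description above; the function names are hypothetical, and the sketch assumes scorecard_path is relative to the repository root and that the files are read from a local checkout rather than fetched over HTTP.

```python
import json
from pathlib import Path

# Pointer location as named in the PR description.
POINTER = Path("benchmarks/us/incumbent-comparison/latest.json")


def resolve_scorecard(repo_root: Path) -> dict:
    """Follow latest.json to the scorecard it points at.

    Assumption: scorecard_path is relative to the repository root.
    """
    pointer = json.loads((repo_root / POINTER).read_text(encoding="utf-8"))
    scorecard_file = repo_root / pointer["scorecard_path"]
    scorecard = json.loads(scorecard_file.read_text(encoding="utf-8"))
    return {
        "release_id": pointer["candidate_release_id"],
        "scorecard": scorecard,
    }


def load_with_fallback(repo_root: Path, snapshot: dict) -> dict:
    """Use the pointer when it resolves; otherwise return the committed snapshot."""
    try:
        return resolve_scorecard(repo_root)
    except (OSError, KeyError, json.JSONDecodeError):
        # Missing file, missing pointer key, or malformed JSON.
        return snapshot
```

Because the path lives in latest.json, publishing a newer scorecard only requires updating the pointer; consumers written this way need no change.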