Reform validation: populace estimates vs JCT scores #16
Draft
PavelMakarchuk wants to merge 3 commits into
Downstream validation to complement calibration diagnostics: how closely populace-US reproduces the budget effects of reforms the JCT has officially scored (OBBBA and other JCT-scored reforms), tracked release-over-release.

- lib/populace/reforms.ts: pure-HF loader for a new per-release artifact reform_validation.json (schema v1 documented inline), deriving the populace−JCT error per reform, and buildReformHistory for run-over-run.
- API: /api/populace/reforms (one release; 200 + available:false when the artifact isn't published yet) and /api/populace/reforms/history.
- /populace/reforms page: KPIs (reforms scored, mean |error|, within-10%), a populace-vs-JCT table, and a run-over-run trend with sparklines.
- Nav entry; hooks + types; bun tests for error derivation, summary, and the chronological history delta.

The scores are produced by the populace build pipeline (a follow-up PR on PolicyEngine/populace publishes reform_validation.json); the dashboard reads it live and shows a clear empty state until then.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
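To illustrate the "200 + available:false" behaviour, here is a minimal sketch of how a missing artifact could be mapped to a response body. It is not the code in this PR: the names ReformArtifact, ReformsResponse, and toReformsResponse, and every field other than available, are assumptions made for the example.

```typescript
// Hypothetical shapes; the real schema v1 is documented in lib/populace/reforms.ts.
type ReformArtifact = { schema_version: number; reforms: unknown[] };

type ReformsResponse =
  | { available: false; release: string }
  | { available: true; release: string; artifact: ReformArtifact };

// Map a loader result to a status and body. A missing artifact is an expected
// state (the release predates reform_validation.json), so it is reported in
// the body instead of as an HTTP error.
function toReformsResponse(
  release: string,
  artifact: ReformArtifact | null,
): { status: number; body: ReformsResponse } {
  if (artifact === null) {
    return { status: 200, body: { available: false, release } };
  }
  return { status: 200, body: { available: true, release, artifact } };
}
```

Treating "not published yet" as data lets the page render its empty state from a normal response rather than from an error path.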
Sync with the producer schema (PolicyEngine/populace#63): each reform now carries in_sample + period. Out-of-sample reforms (OBBBA provisions the calibration never saw) are the genuine fidelity test; in-sample reforms are JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err); history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge; out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
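A rough sketch of how the out-of-sample-only stats might be computed follows. The three summary field names and in_sample come from the commit message; the DerivedReform row shape, the function name, the choice to report within-10% as a count, and the inclusive 10% threshold are all assumptions for illustration.

```typescript
// Hypothetical derived row; rel_err is assumed to be
// (populace - benchmark) / |benchmark|, or null when it cannot be computed.
type DerivedReform = {
  id: string;
  in_sample: boolean;
  rel_err: number | null;
};

type OutOfSampleSummary = {
  n_out_of_sample: number;
  out_of_sample_within_10pct: number; // assumed to be a count, not a share
  out_of_sample_mean_abs_rel_err: number | null;
};

function summarizeOutOfSample(rows: DerivedReform[]): OutOfSampleSummary {
  // Keep only reforms the calibration never saw.
  const oos = rows.filter((r) => !r.in_sample);
  // Absolute relative errors for the rows where one is defined.
  const errs = oos
    .map((r) => r.rel_err)
    .filter((e): e is number => e !== null)
    .map((e) => Math.abs(e));
  return {
    n_out_of_sample: oos.length,
    out_of_sample_within_10pct: errs.filter((e) => e <= 0.1).length,
    out_of_sample_mean_abs_rel_err:
      errs.length > 0 ? errs.reduce((a, b) => a + b, 0) / errs.length : null,
  };
}
```

Because in-sample rows are filtered out before any statistic is taken, a large in-sample miss cannot move the out-of-sample headline numbers.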
The producer now emits big-provision tax-expenditure reforms (CTC/EITC/CDCC/standard/itemized) benchmarked against JCT or Treasury, plus magnitude-only rows for provisions neither scores (standard deduction, all-itemized). Update the view labels accordingly: "JCT score" → "Benchmark", and explain that some rows show the repeal magnitude only. The loader already handles null benchmark (error/within-10% become null), so those rows flow through as magnitude-only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
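The null-benchmark passthrough can be sketched as below. This is an illustration under assumptions, not the loader itself: the ReformRow and ReformError shapes, the function name deriveError, the inclusive 10% threshold, and the way a zero benchmark is handled are all hypothetical.

```typescript
// Hypothetical input row; field names are illustrative, not the schema v1 names.
type ReformRow = {
  id: string;
  populace: number; // microsimulated budget effect
  benchmark: number | null; // JCT or Treasury score; null for magnitude-only rows
};

type ReformError = {
  error: number | null; // populace - benchmark
  rel_err: number | null; // error / |benchmark|
  within_10pct: boolean | null;
};

function deriveError(row: ReformRow): ReformError {
  // Magnitude-only row: there is nothing to compare against.
  if (row.benchmark === null) {
    return { error: null, rel_err: null, within_10pct: null };
  }
  const error = row.populace - row.benchmark;
  // Zero-score guard (assumed behaviour): a relative error against a zero
  // benchmark is undefined, so only the absolute error is reported.
  if (row.benchmark === 0) {
    return { error, rel_err: null, within_10pct: null };
  }
  const rel_err = error / Math.abs(row.benchmark);
  return { error, rel_err, within_10pct: Math.abs(rel_err) <= 0.1 };
}
```

Dividing by the absolute value of the benchmark keeps the sign of the relative error aligned with the sign of populace − benchmark, for both revenue-raising and revenue-losing reforms.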
What
A new Reform validation view (consumer side) that compares populace-US's microsimulated budget effects for reforms the JCT has officially scored (OBBBA and others) against the official JCT scores, and tracks the gap release-over-release.
This is downstream validation to complement the existing calibration diagnostics: calibration asks "does the dataset reproduce its calibration targets"; this asks "does the dataset reproduce the budget effects of reforms an authority has scored".
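As a rough picture of what tracking the gap release-over-release could involve, the sketch below orders one reform's observations chronologically and computes the change in absolute relative error from the previous release. The HistoryPoint shape, the published date field, the function name withDeltas, and the definition of delta are assumptions for illustration; they are not taken from buildReformHistory.

```typescript
// Hypothetical per-release observation for a single reform.
type HistoryPoint = {
  release: string;
  published: string; // ISO 8601 date string, assumed to sort lexicographically
  rel_err: number | null;
};

type HistoryEntry = HistoryPoint & { delta: number | null };

function withDeltas(points: HistoryPoint[]): HistoryEntry[] {
  // Sort a copy oldest-first so each delta is taken against the prior release.
  const sorted = [...points].sort((a, b) =>
    a.published < b.published ? -1 : a.published > b.published ? 1 : 0,
  );
  return sorted.map((p, i) => {
    const prev = i > 0 ? sorted[i - 1] : undefined;
    // Negative delta means the gap to the benchmark narrowed.
    const delta =
      prev !== undefined && prev.rel_err !== null && p.rel_err !== null
        ? Math.abs(p.rel_err) - Math.abs(prev.rel_err)
        : null;
    return { ...p, delta };
  });
}
```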
How it fits the pure-HF architecture
The dashboard can't run microsimulation, so the scores come from a new per-release artifact, reform_validation.json, published by the populace build pipeline (the producer side is a follow-up PR on PolicyEngine/populace). This PR is the consumer:
- lib/populace/reforms.ts: pure-HF loader; schema v1 documented inline (the producer/consumer contract). Derives populace − JCT error per reform; buildReformHistory assembles per-reform run-over-run series across releases.
- /api/populace/reforms?release= (returns 200 with available:false when a release predates the artifact, not an error) and /api/populace/reforms/history.
- /populace/reforms page: KPIs (reforms scored, mean |error|, within-10%), a populace-vs-JCT table, and a run-over-run trend with sparklines.
State
Until the producer PR lands and a build publishes reform_validation.json, the page shows a clear "not published yet" empty state. Verified live against HF: the endpoint returns available:false for the current release, history is empty, page renders 200.
Tests
bun test: 12 pass (4 new: error derivation, summary counting, chronological history delta, zero-score guard). tsc --noEmit clean.
Follow-up (producer)
PR on PolicyEngine/populace: a build step that scores a fixed set of JCT-scored reforms on each release and publishes reform_validation.json per the schema in reforms.ts.
🤖 Generated with Claude Code