Reform validation: populace estimates vs JCT scores by PavelMakarchuk · Pull Request #16 · PolicyEngine/calibration-diagnostics

PavelMakarchuk · 2026-06-16T04:15:12Z

What

A new Reform validation view (consumer side) that compares populace-US's microsimulated budget effects for JCT-scored reforms (OBBBA and others) against the official JCT scores, and tracks the gap release-over-release.

This is downstream validation to complement the existing calibration diagnostics: calibration asks "does the dataset reproduce its calibration targets"; this asks "does the dataset reproduce the budget effects of reforms an authority has scored".

How it fits the pure-HF architecture

The dashboard can't run microsimulation, so the scores come from a new per-release artifact, reform_validation.json, published by the populace build pipeline (the producer side is a follow-up PR on PolicyEngine/populace). This PR is the consumer:

lib/populace/reforms.ts — pure-HF loader; schema v1 documented inline (the producer/consumer contract). Derives populace − JCT error per reform; buildReformHistory assembles per-reform run-over-run series across releases.
API: /api/populace/reforms?release= (returns 200 with available:false, rather than an error, when a release predates the artifact) and /api/populace/reforms/history.
/populace/reforms page — KPIs (reforms scored, mean |error|, within-10%), a populace-vs-JCT table, and a run-over-run trend with sparklines.
Nav entry; React Query hooks + types; bun tests.
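The error derivation described for reforms.ts can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ReformScore` and `ReformError` shapes, their field names, and the `deriveError` helper are all hypothetical, and the real schema v1 is the one documented inline in lib/populace/reforms.ts. It assumes the relative error is taken against the absolute benchmark value and that "within-10%" means an absolute relative error of at most 0.1.

```typescript
// Hypothetical shapes for illustration; the real schema v1 lives in lib/populace/reforms.ts.
interface ReformScore {
  id: string;
  populace: number;         // microsimulated budget effect (units as published in the artifact)
  benchmark: number | null; // official score; null when no authority scores the provision
}

interface ReformError {
  id: string;
  error: number | null;        // populace − benchmark
  relError: number | null;     // error / |benchmark|
  within10pct: boolean | null; // |relError| <= 0.1
}

function deriveError(r: ReformScore): ReformError {
  // No benchmark: the row is magnitude-only, so every derived field is null.
  if (r.benchmark === null) {
    return { id: r.id, error: null, relError: null, within10pct: null };
  }
  const error = r.populace - r.benchmark;
  // Zero-score guard: a relative error against a zero benchmark is undefined.
  if (r.benchmark === 0) {
    return { id: r.id, error, relError: null, within10pct: null };
  }
  const relError = error / Math.abs(r.benchmark);
  return { id: r.id, error, relError, within10pct: Math.abs(relError) <= 0.1 };
}
```

Returning null rather than 0 or Infinity for the guarded cases keeps undefined ratios out of the mean |error| and within-10% KPIs.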

State

Until the producer PR lands and a build publishes reform_validation.json, the page shows a clear "not published yet" empty state. Verified live against HF: the endpoint returns available:false for the current release, history is empty, page renders 200.
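One way a consumer might model the "200 with available:false" contract is a discriminated union on `available`, so the empty state is an ordinary branch rather than an error path. The sketch below is an assumption about shape only: apart from `available`, the field names (`release`, `reforms`) and the `describeState` helper are hypothetical and not taken from this PR.

```typescript
// Hypothetical response shape; only the `available` flag is named in the PR description.
type ReformsResponse =
  | { available: false; release: string }
  | { available: true; release: string; reforms: { id: string }[] };

// Map a response to the message a page could render.
function describeState(res: ReformsResponse): string {
  if (!res.available) {
    // The artifact has not been published for this release: show the empty state.
    return `Reform validation not published yet for ${res.release}`;
  }
  // Narrowed to the available branch, so `reforms` is safe to read.
  return `${res.reforms.length} reforms scored for ${res.release}`;
}
```

Because the "not published" case arrives as a successful response, a data-fetching hook does not need special error handling for releases that predate the artifact.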

Tests

bun test — 12 pass (4 new: error derivation, summary counting, chronological history delta, zero-score guard). tsc --noEmit clean.

Follow-up (producer)

PR on PolicyEngine/populace: a build step that scores a fixed set of JCT-scored reforms on each release and publishes reform_validation.json per the schema in reforms.ts.

🤖 Generated with Claude Code

Downstream validation to complement calibration diagnostics: how closely populace-US reproduces the budget effects of reforms the JCT has officially scored (OBBBA and other JCT-scored reforms), tracked release-over-release.

- lib/populace/reforms.ts: pure-HF loader for a new per-release artifact reform_validation.json (schema v1 documented inline), deriving the populace−JCT error per reform, and buildReformHistory for run-over-run.
- API: /api/populace/reforms (one release; 200 + available:false when the artifact isn't published yet) and /api/populace/reforms/history.
- /populace/reforms page: KPIs (reforms scored, mean |error|, within-10%), a populace-vs-JCT table, and a run-over-run trend with sparklines.
- Nav entry; hooks + types; bun tests for error derivation, summary, and the chronological history delta.

The scores are produced by the populace build pipeline (a follow-up PR on PolicyEngine/populace publishes reform_validation.json); the dashboard reads it live and shows a clear empty state until then.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
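The run-over-run series and its chronological delta could be assembled along these lines. This is a sketch under stated assumptions, not buildReformHistory itself: the `ReleasePoint` shape, the `publishedAt` ISO timestamp used for ordering, and the `buildSeries` helper are hypothetical.

```typescript
// Hypothetical per-release observation for a single reform.
interface ReleasePoint {
  release: string;
  publishedAt: string;  // ISO 8601 timestamp, assumed to be available for ordering
  error: number | null; // populace − benchmark for this release, null if not derivable
}

interface HistoryPoint extends ReleasePoint {
  delta: number | null; // change in error versus the previous release
}

function buildSeries(points: ReleasePoint[]): HistoryPoint[] {
  // Sort oldest-first on a copy; same-format ISO timestamps compare correctly as strings.
  const sorted = [...points].sort((a, b) =>
    a.publishedAt < b.publishedAt ? -1 : a.publishedAt > b.publishedAt ? 1 : 0
  );
  return sorted.map((p, i) => {
    const prev = i > 0 ? sorted[i - 1] : undefined;
    // A delta needs a previous point and two non-null errors.
    const delta =
      prev !== undefined && prev.error !== null && p.error !== null
        ? p.error - prev.error
        : null;
    return { ...p, delta };
  });
}
```

Sorting inside the helper means callers can pass releases in whatever order the listing returns them.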

Sync with the producer schema (PolicyEngine/populace#63): each reform now carries in_sample + period. Out-of-sample reforms (OBBBA provisions the calibration never saw) are the genuine fidelity test; in-sample reforms are JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err); history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge; out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
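A rough sketch of how the out-of-sample-only stats might be computed is below. The three output names come from the commit message; everything else is an assumption made for illustration: the `ScoredReform` input shape, the `relError` field, the `summarizeOutOfSample` helper, the choice to report within-10% as a count, and the choice to skip rows with no derivable relative error when averaging.

```typescript
// Hypothetical input row: in_sample comes from the schema, relError is an assumed derived field.
interface ScoredReform {
  in_sample: boolean;
  relError: number | null; // (populace − benchmark) / |benchmark|, null if not derivable
}

interface OutOfSampleSummary {
  n_out_of_sample: number;
  out_of_sample_within_10pct: number;            // assumed here to be a count, not a share
  out_of_sample_mean_abs_rel_err: number | null; // null when nothing can be averaged
}

function summarizeOutOfSample(reforms: ScoredReform[]): OutOfSampleSummary {
  // In-sample rows are calibration targets, so they are excluded entirely.
  const oos = reforms.filter((r) => !r.in_sample);
  const absErrs = oos
    .map((r) => r.relError)
    .filter((e): e is number => e !== null)
    .map((e) => Math.abs(e));
  const total = absErrs.reduce((sum, e) => sum + e, 0);
  return {
    n_out_of_sample: oos.length,
    out_of_sample_within_10pct: absErrs.filter((e) => e <= 0.1).length,
    out_of_sample_mean_abs_rel_err: absErrs.length > 0 ? total / absErrs.length : null,
  };
}
```

Filtering before aggregating is what lets a large in-sample miss leave the out-of-sample headline numbers unchanged.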

The producer now emits big-provision tax-expenditure reforms (CTC/EITC/CDCC/standard/itemized) benchmarked against JCT or Treasury, plus magnitude-only rows for provisions neither scores (standard deduction, all-itemized). Update the view labels accordingly: "JCT score" → "Benchmark", and explain that some rows show the repeal magnitude only. The loader already handles null benchmark (error/within-10% become null), so those rows flow through as magnitude-only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

PavelMakarchuk mentioned this pull request Jun 16, 2026

Emit reform_validation.json: dataset budget effects vs JCT scores PolicyEngine/populace#63

Merged

PavelMakarchuk and others added 2 commits June 16, 2026 00:37

PavelMakarchuk wants to merge 3 commits into main from reform-validation
