Emit reform_validation.json: dataset budget effects vs JCT scores #63
Sync with the producer schema (PolicyEngine/populace#63): each reform now carries in_sample + period. Out-of-sample reforms (OBBBA provisions the calibration never saw) are the genuine fidelity test; in-sample reforms are JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err); history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge; out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
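The out-of-sample summary fields named in this commit can be illustrated with a small sketch. It is written in Python for consistency with the producer side, although reforms.ts itself is TypeScript. The row shape, the function name, the choice to count only rows that have both a benchmark and an estimate, and the reading of "within 10%" as an absolute relative error of at most 0.10 are all assumptions, not the dashboard's actual code.

```python
from typing import Optional, TypedDict


class ReformRow(TypedDict):
    # Hypothetical row shape; the real artifact schema may differ.
    id: str
    in_sample: bool
    jct_score: Optional[float]
    populace_estimate: Optional[float]


def out_of_sample_summary(rows: list[ReformRow]) -> dict:
    """Summarize only out-of-sample rows that carry both a benchmark and an estimate."""
    rel_errs = []
    for row in rows:
        if row["in_sample"]:
            continue  # in-sample agreement is expected, so it is excluded here
        score, estimate = row["jct_score"], row["populace_estimate"]
        if score is None or estimate is None or score == 0:
            continue  # no usable benchmark for this row
        rel_errs.append(abs(estimate - score) / abs(score))
    n = len(rel_errs)
    return {
        "n_out_of_sample": n,
        "out_of_sample_within_10pct": sum(err <= 0.10 for err in rel_errs),
        "out_of_sample_mean_abs_rel_err": (sum(rel_errs) / n) if n else None,
    }
```

With this shape, a large in-sample miss leaves the out-of-sample figures unchanged, which is the property the commit's tests are described as asserting.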
Verified end-to-end on the released populace-US dataset
Ran the reforms through policyengine-us 1.334.0 on
Data coverage on the populace dataset (this is what makes the reforms measurable):

The in-sample neutralize path was also verified to construct and run: neutralizing SALT/medical/charitable raises income_tax by $22.4B / $11.3B / $61.9B, the positive tax-expenditure values (the correct convention).

The ~40% under-estimate is the actual validation signal: populace under-captures tip/overtime income relative to JCT's assumptions.

Corrections found while testing: the CTC path
Adds a per-release reform-validation artifact, the downstream counterpart to calibration_diagnostics.json: where calibration measures fit to its targets, this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. The calibration-diagnostics dashboard consumes it.

Two labelled kinds of reform:

- in-sample: the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate, with no extra simulation, flagged in_sample=True so a consumer knows agreement is expected.
- out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json: no-tax-on-tips and no-tax-on-overtime, with their per-FY JCX-35-25 scores). OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert and the provision effect is baseline - reform (sign-comparable to the JCT enactment score), simulated at FY2026.

- packages/populace-build/.../reform_validation.py: ReformValidationSpec, the in-sample/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate() for testing), write_reform_validation.
- obbba_reforms.json: curated out-of-sample set; excludes provisions whose JCT line bundles TCJA extension (SALT/CTC/standard deduction), lacks a standalone line (senior deduction), or isn't modeled (Trump accounts). Documented inline.
- build_us_fiscal_refresh_release.py: writes reform_validation.json after the release H5, adds it to the release manifest; --skip-reform-validation and --skip-out-of-sample-reforms flags.
- 9 unit tests (sign conventions, in-sample-from-calibration, config loading), fake-sim isolated so they need no policyengine-us. ruff clean.

Out-of-sample budget effects populate when a release build runs the OBBBA microsims; the artifact is otherwise the in-sample rows plus null estimates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
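The counterfactual-revert sign convention and the injected simulate() can be sketched as follows. This is a minimal illustration, not the code in reform_validation.py: the Spec class, its fields, the budget_effect helper, and the simulate signature are hypothetical, and the revenue numbers are made up. The non-revert branch (reform minus baseline) is also an assumption, based on the description of neutralize-style repeals as a simulated revenue change.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Spec:
    # Hypothetical stand-in for ReformValidationSpec; fields are illustrative only.
    id: str
    jct_score: Optional[float]
    counterfactual_revert: bool  # True when the provision is already in the baseline


def budget_effect(spec: Spec, simulate: Callable[[Optional[str]], float]) -> float:
    """Return the provision's revenue effect, sign-comparable to its benchmark.

    simulate(None) is assumed to return baseline revenue and simulate(spec.id)
    revenue under the encoded reform.
    """
    baseline = simulate(None)
    reform = simulate(spec.id)
    if spec.counterfactual_revert:
        # The encoded reform removes an enacted provision, so the provision's
        # own effect is baseline - reform.
        return baseline - reform
    # Assumed convention for a direct reform: revenue change relative to baseline.
    return reform - baseline


# Fake simulator with made-up revenue totals: reverting the provision raises revenue by 10.
fake_revenue = {None: 1000.0, "no_tax_on_tips": 1010.0}
tips = Spec(id="no_tax_on_tips", jct_score=-10.0, counterfactual_revert=True)
effect = budget_effect(tips, lambda key: fake_revenue[key])
```

Because the simulator is passed in, a test can exercise the sign flip without importing policyengine-us.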
…mized)

Extends reform validation beyond OBBBA to the major tax provisions, each a neutralize_variable repeal whose simulated revenue change is compared to a published tax-expenditure figure:

- CTC vs JCT $173.8B (JCX-48-24), EITC vs JCT $124.2B, CDCC vs Treasury $3.69B (JCT bundles CDCC with the employer-childcare exclusion).
- Standard deduction and all-itemized-combined carry NO benchmark: both JCT and Treasury treat the standard deduction as baseline, and neither publishes a combined itemized total, so these publish the repeal magnitude only (jct_score is now Optional to support that).
- in_sample flags calibration status honestly: EITC is in-sample (SOI EITC targets), CTC partly, CDCC/standard/itemized out-of-sample. The individual itemized deductions (SALT/mortgage/charitable/medical/QBI) are already validated in-sample, so they aren't duplicated here.

Verified on the released populace_us_2024.h5 (FY2024): CTC $114.9B vs $173.8B (-34%), EITC $96.2B vs $124.2B (-23%), CDCC $3.08B vs $3.69B (-17%), standard $261.6B, all-itemized $88.1B; populace under-captures the big refundable credits, a real validation signal.

tax_expenditure_reforms.json config + tax_expenditure_reform_specs loader, wired into load_default_reform_specs; 2 new tests. ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
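The percentages quoted above follow from a signed relative error against the published figure, with no error reported when the benchmark is absent. The sketch below reproduces that arithmetic from the dollar figures in this commit message; the helper name rel_err and the dictionary layout are illustrative assumptions, not the artifact's schema.

```python
from typing import Optional


def rel_err(estimate: float, benchmark: Optional[float]) -> Optional[float]:
    """Signed relative error of the estimate against the benchmark, or None without one."""
    if benchmark is None or benchmark == 0:
        return None
    return (estimate - benchmark) / benchmark


# (populace estimate, published benchmark) in $B, from the verification note above.
figures = {
    "CTC": (114.9, 173.8),
    "EITC": (96.2, 124.2),
    "CDCC": (3.08, 3.69),
    "standard deduction": (261.6, None),
    "all itemized": (88.1, None),
}

for name, (estimate, benchmark) in figures.items():
    err = rel_err(estimate, benchmark)
    label = "no benchmark, magnitude only" if err is None else f"{err:+.0%}"
    print(f"{name}: ${estimate}B ({label})")
```

Treating a missing benchmark as None rather than zero keeps the two benchmark-free rows out of any error statistics.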
#65 reworked SimpleTaxExpenditureReform: the JCT dollar figure (.value) now lives in the ledger target, not on the reform object. in_sample_reform_specs no longer reads reform.value; instead reform_validation_payload takes in_sample_targets (the calibration target value per id), and the builder supplies it from the calibration result. So an in-sample reform's JCT score is the target it was calibrated to, and its populace estimate is the calibrated final_estimate, both straight from the calibration diagnostics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
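The data flow described here, where both numbers for an in-sample row come from the calibration result, might look like the sketch below. The function name in_sample_rows, the final_estimates argument, the row keys, and the sample values are hypothetical; only the in_sample_targets idea (target value per reform id) comes from the commit message.

```python
from typing import Optional


def in_sample_rows(
    reform_ids: list[str],
    in_sample_targets: dict[str, float],
    final_estimates: dict[str, float],
) -> list[dict[str, object]]:
    """Build in-sample rows without running any simulation.

    The JCT score is the calibration target for the reform id, and the
    populace estimate is the calibrated final estimate for the same id.
    """
    rows: list[dict[str, object]] = []
    for reform_id in reform_ids:
        score: Optional[float] = in_sample_targets.get(reform_id)
        estimate: Optional[float] = final_estimates.get(reform_id)
        rows.append(
            {
                "id": reform_id,
                "in_sample": True,
                "jct_score": score,  # None if the ledger has no target for this id
                "populace_estimate": estimate,
            }
        )
    return rows


# Made-up ids and values, for illustration only.
example = in_sample_rows(
    ["salt", "charitable"],
    in_sample_targets={"salt": 20.0, "charitable": 60.0},
    final_estimates={"salt": 19.5},
)
```

Keeping the target lookup keyed by id means the reform object no longer needs to carry a dollar value at all.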
The counterfactual-revert and null-benchmark tests called the real build_reform(), whose lazy policyengine_core import is absent in the populace-build CI env. Monkeypatch build_reform to a sentinel like the sibling budget-effect test, so the suite stays simulation-injected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Adds a per-release reform-validation artifact, reform_validation.json, the downstream counterpart to calibration_diagnostics.json. Calibration measures fit to its targets; this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. It's consumed by the calibration-diagnostics dashboard (PolicyEngine/calibration-diagnostics#16).

Two labelled kinds of reform

- in-sample: the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate (no extra simulation), flagged in_sample=true so a consumer knows agreement is expected.
- out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json): no-tax-on-tips and no-tax-on-overtime, with their per-fiscal-year JCX-35-25 scores. OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert; the provision effect is baseline − reform (sign-comparable to the JCT enactment score), simulated at FY2026 against JCT's FY2026 line.

Why only two OBBBA provisions (for now)

The curated set deliberately excludes provisions where a clean validation isn't possible (documented inline in obbba_reforms.json):

- provisions whose JCT line bundles TCJA extension (SALT, CTC, standard deduction);
- provisions that lack a standalone JCT line (senior deduction);
- provisions that aren't modeled (Trump accounts).

That leaves tips and overtime: genuinely new provisions whose revert captures the whole provision and whose JCT line is exact.
Files
- packages/populace-build/src/populace/build/us/reform_validation.py: ReformValidationSpec, in/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate()), write_reform_validation.
- .../us/obbba_reforms.json: curated out-of-sample set + JCT citations.
- tools/build_us_fiscal_refresh_release.py: writes reform_validation.json after the release H5 and registers it in the release manifest; adds --skip-reform-validation / --skip-out-of-sample-reforms.

Tests

packages/populace-build/tests/test_reform_validation.py: 9 tests (sign conventions incl. the counterfactual flip, in-sample-from-calibration, shipped-config loading), fake-sim isolated so they need no policyengine-us. Existing test_us_fiscal_targets.py (20) unaffected. ruff clean.

State / follow-up

Out-of-sample budget effects populate when a release build actually runs the OBBBA microsims (build_us_fiscal_refresh_release.py); until then the artifact is the in-sample rows plus null out-of-sample estimates. I have not run a full release build here (needs the base H5 + a calibration run), so the OBBBA parameter paths and the resulting FY2026 magnitudes should be sanity-checked against a real build before merge.

🤖 Generated with Claude Code