Emit reform_validation.json: dataset budget effects vs JCT scores by PavelMakarchuk · Pull Request #63 · PolicyEngine/populace

PavelMakarchuk · 2026-06-16T04:35:02Z

What

Adds a per-release reform-validation artifact, reform_validation.json — the downstream counterpart to calibration_diagnostics.json. Calibration measures fit to its targets; this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. It's consumed by the calibration-diagnostics dashboard (PolicyEngine/calibration-diagnostics#16).

Two labelled kinds of reform

in-sample — the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate (no extra simulation), flagged in_sample=true so a consumer knows agreement is expected.
out-of-sample — OBBBA provisions the calibration never saw (obbba_reforms.json): no-tax-on-tips and no-tax-on-overtime, with their per-fiscal-year JCX-35-25 scores. OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert; the provision effect is baseline − reform (sign-comparable to the JCT enactment score), simulated at FY2026 against JCT's FY2026 line.
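The sign convention for the two kinds of reform can be sketched as follows. This is a minimal illustration, not the code in reform_validation.py: the `simulate` signature, the scenario names, and the revenue figures are all assumptions made up for the example.

```python
from typing import Callable

# Hypothetical signature: maps a scenario name to total revenue in $B.
# The real injected simulate() in reform_validation.py may look different.
Simulate = Callable[[str], float]


def provision_effect(simulate: Simulate, counterfactual_revert: bool) -> float:
    """Budget effect of a provision, signed like a JCT enactment score."""
    baseline = simulate("baseline")
    reform = simulate("reform")
    if counterfactual_revert:
        # The provision is already in the baseline and the "reform" removes it,
        # so enacting the provision moves revenue from `reform` to `baseline`.
        return baseline - reform
    # Ordinary reform: the provision is what the reform adds.
    return reform - baseline


def fake_simulate(scenario: str) -> float:
    # Illustrative numbers only: reverting the provision raises revenue by $6B.
    return {"baseline": 2500.0, "reform": 2506.0}[scenario]


effect = provision_effect(fake_simulate, counterfactual_revert=True)
print(effect)  # negative: the provision loses revenue, like a negative JCT score
```

With the flip, a revenue-losing provision that is already part of the baseline comes out negative, which is the same sign JCT reports for enacting it.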

Why only two OBBBA provisions (for now)

The curated set deliberately excludes provisions where a clean validation isn't possible — documented inline in obbba_reforms.json:

SALT cap, CTC, standard deduction — the JCX-35-25 line bundles TCJA extension + enhancement, so a parameter revert can't be isolated to the JCT figure.
Senior bonus deduction — no standalone JCX-35-25 line (it's netted inside the personal-exemption termination line).
Trump accounts — not modeled in policyengine-us.
Estate exemption — clean parameter, but estate tax rarely fires in microdata.

That leaves tips and overtime: genuinely new provisions whose revert captures the whole provision and whose JCT line is exact.

Files

packages/populace-build/src/populace/build/us/reform_validation.py — ReformValidationSpec, in/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate()), write_reform_validation.
.../us/obbba_reforms.json — curated out-of-sample set + JCT citations.
tools/build_us_fiscal_refresh_release.py — writes reform_validation.json after the release H5 and registers it in the release manifest; adds --skip-reform-validation / --skip-out-of-sample-reforms.
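A rough sketch of how a payload builder with an injected simulator might behave, including the null out-of-sample estimates described under "State / follow-up". The `Spec` class, `build_payload`, and the `populace_estimate` key are hypothetical stand-ins; only `in_sample`, `period`, and `jct_score` are names that appear in this PR, and the SALT numbers are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Spec:
    """Hypothetical stand-in for ReformValidationSpec; field names are assumed."""
    reform_id: str
    jct_score: float  # $B
    in_sample: bool
    period: int


def build_payload(
    specs: List[Spec],
    final_estimates: Dict[str, float],
    simulate: Optional[Callable[[Spec], float]] = None,
) -> dict:
    rows = []
    for spec in specs:
        if spec.in_sample:
            # In-sample: reuse the calibration's own estimate, no simulation.
            estimate = final_estimates.get(spec.reform_id)
        elif simulate is not None:
            # Out-of-sample: run the (injected) microsimulation.
            estimate = simulate(spec)
        else:
            # No simulator supplied: keep the row, leave the estimate null.
            estimate = None
        rows.append(
            {
                "id": spec.reform_id,
                "in_sample": spec.in_sample,
                "period": spec.period,
                "jct_score": spec.jct_score,
                "populace_estimate": estimate,
            }
        )
    return {"reforms": rows}


specs = [
    Spec("salt_deduction", 20.0, True, 2024),     # illustrative values, not from the PR
    Spec("no_tax_on_tips", -10.12, False, 2026),  # JCT FY2026 score quoted in this PR
]
payload = build_payload(specs, final_estimates={"salt_deduction": 19.5})
text = json.dumps(payload, indent=2)
print(text)
```

Injecting the simulator keeps the payload logic testable without policyengine-us: a test passes a fake callable, and a build that skips the microsims passes nothing.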

Tests

packages/populace-build/tests/test_reform_validation.py — 9 tests (sign conventions incl. the counterfactual flip, in-sample-from-calibration, shipped-config loading), fake-sim isolated so they need no policyengine-us. Existing test_us_fiscal_targets.py (20) unaffected. ruff clean.

State / follow-up

Out-of-sample budget effects populate when a release build actually runs the OBBBA microsims (build_us_fiscal_refresh_release.py); until then the artifact is the in-sample rows plus null out-of-sample estimates. I have not run a full release build here (needs the base H5 + a calibration run), so the OBBBA parameter paths and the resulting FY2026 magnitudes should be sanity-checked against a real build before merge.

🤖 Generated with Claude Code

Sync with the producer schema (PolicyEngine/populace#63): each reform now carries in_sample + period. Out-of-sample reforms (OBBBA provisions the calibration never saw) are the genuine fidelity test; in-sample reforms are JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err); history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge; out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
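The out-of-sample-only summary might be computed along these lines. The dashboard code itself is TypeScript (reforms.ts); this is a Python sketch of the same idea, and it assumes that `out_of_sample_within_10pct` is a count rather than a share and that rows carry a hypothetical `populace_estimate` key. The in-sample row below is invented; the two out-of-sample rows use the figures reported in this PR.

```python
from typing import Dict, List, Optional


def out_of_sample_summary(reforms: List[dict], tolerance: float = 0.10) -> Dict[str, Optional[float]]:
    """Summarise only rows the calibration never saw (assumed semantics)."""
    scored = [
        r
        for r in reforms
        if not r["in_sample"]
        and r["populace_estimate"] is not None
        and r["jct_score"] not in (None, 0)
    ]
    rel_errs = [
        abs(r["populace_estimate"] - r["jct_score"]) / abs(r["jct_score"])
        for r in scored
    ]
    return {
        "n_out_of_sample": len(scored),
        "out_of_sample_within_10pct": sum(e <= tolerance for e in rel_errs),
        "out_of_sample_mean_abs_rel_err": (sum(rel_errs) / len(rel_errs)) if rel_errs else None,
    }


reforms = [
    # Invented in-sample row with a large miss; it must not affect the summary.
    {"id": "in_sample_example", "in_sample": True, "jct_score": 100.0, "populace_estimate": 10.0},
    # Out-of-sample rows, $B, as reported in this PR for FY2026.
    {"id": "no_tax_on_tips", "in_sample": False, "jct_score": -10.12, "populace_estimate": -6.27},
    {"id": "no_tax_on_overtime", "in_sample": False, "jct_score": -32.81, "populace_estimate": -17.67},
]
summary = out_of_sample_summary(reforms)
print(summary)
```

Filtering before aggregating is what keeps an in-sample miss from leaking into the headline KPIs.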

PavelMakarchuk · 2026-06-16T05:11:26Z

Verified end-to-end on the released populace-US dataset

Ran the reforms through policyengine-us 1.334.0 on populace_us_2024.h5 (the live HF release), FY2026. Both out-of-sample provisions construct as proper counterfactual reverts and produce real, correctly-signed budget effects:

Reform	populace	JCT FY2026 (JCX-35-25)	error
No tax on tips	−$6.27B	−$10.12B	+38% (under)
No tax on overtime	−$17.67B	−$32.81B	+46% (under)
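The error column is consistent with a signed relative error of the form (populace − JCT) / |JCT|. The formula is inferred from the reported figures rather than taken from the code, so treat this as a check of the arithmetic only.

```python
def signed_rel_error(populace: float, jct: float) -> float:
    """Signed relative error; positive means populace sits above the JCT score."""
    return (populace - jct) / abs(jct)


# (populace estimate, JCT FY2026 score), both in $B, from the table above.
rows = {
    "No tax on tips": (-6.27, -10.12),
    "No tax on overtime": (-17.67, -32.81),
}

errors_pct = {
    name: round(100 * signed_rel_error(populace, jct))
    for name, (populace, jct) in rows.items()
}
print(errors_pct)  # {'No tax on tips': 38, 'No tax on overtime': 46}
```

Because both scores are negative, a positive error means the populace estimate is smaller in magnitude, which is why the table marks these rows as "under".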

Data coverage on the populace dataset (this is what makes the reforms measurable): tip_income = $136.9B, fsla_overtime_premium = $118.3B (6,578 records). Note: the overtime reform is a no-op on the default CPS, where fsla_overtime_premium is an unimputed input equal to 0. populace imputes it, so the reform can be validated on populace.

The in-sample neutralize path was also verified to construct and run (neutralizing SALT/medical/charitable raises income_tax by $22.4B / $11.3B / $61.9B — the positive tax-expenditure values, correct convention).

The 38% to 46% under-estimate is the actual validation signal: populace under-captures tip/overtime income relative to JCT's assumptions.

Corrections found while testing: the CTC path gov.irs.credits.ctc.amount.base is a bracket ParameterScale, not a scalar — it can't be set with a flat value (it was already excluded for TCJA-bundling, now also confirmed unencodable as written). Remaining gap: in-sample budget effects come from a live calibration final_estimate, which a full release build produces — not exercised here.

Adds a per-release reform-validation artifact, the downstream counterpart to calibration_diagnostics.json: where calibration measures fit to its targets, this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. The calibration-diagnostics dashboard consumes it.

Two labelled kinds of reform:

- in-sample: the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate (no extra simulation), flagged in_sample=True so a consumer knows agreement is expected.
- out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json: no-tax-on-tips and no-tax-on-overtime, with their per-FY JCX-35-25 scores). OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert and the provision effect is baseline - reform (sign-comparable to the JCT enactment score), simulated at FY2026.

- packages/populace-build/.../reform_validation.py: ReformValidationSpec, the in-sample/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate() for testing), write_reform_validation.
- obbba_reforms.json: curated out-of-sample set; excludes provisions whose JCT line bundles TCJA extension (SALT/CTC/standard deduction), lacks a standalone line (senior deduction), or isn't modeled (Trump accounts). Documented inline.
- build_us_fiscal_refresh_release.py: writes reform_validation.json after the release H5, adds it to the release manifest; --skip-reform-validation and --skip-out-of-sample-reforms flags.
- 9 unit tests (sign conventions, in-sample-from-calibration, config loading), fake-sim isolated so they need no policyengine-us. ruff clean.

Out-of-sample budget effects populate when a release build runs the OBBBA microsims; the artifact is otherwise the in-sample rows plus null estimates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…mized) Extends reform validation beyond OBBBA to the major tax provisions, each a neutralize_variable repeal whose simulated revenue change is compared to a published tax-expenditure figure:

- CTC vs JCT $173.8B (JCX-48-24), EITC vs JCT $124.2B, CDCC vs Treasury $3.69B (JCT bundles CDCC with the employer-childcare exclusion).
- Standard deduction and all-itemized-combined carry NO benchmark: both JCT and Treasury treat the standard deduction as baseline, and neither publishes a combined itemized total, so these publish the repeal magnitude only (jct_score is now Optional to support that).
- in_sample flags calibration status honestly: EITC is in-sample (SOI EITC targets), CTC partly, CDCC/standard/itemized out-of-sample. The individual itemized deductions (SALT/mortgage/charitable/medical/QBI) are already validated in-sample, so they aren't duplicated here.

Verified on the released populace_us_2024.h5 (FY2024): CTC $114.9B vs $173.8B (-34%), EITC $96.2B vs $124.2B (-23%), CDCC $3.08B vs $3.69B (-17%), standard $261.6B, all-itemized $88.1B. populace under-captures the big refundable credits, a real validation signal.

tax_expenditure_reforms.json config + tax_expenditure_reform_specs loader, wired into load_default_reform_specs; 2 new tests. ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
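A small sketch of how an optional benchmark could be handled when computing the error, using the FY2024 figures quoted in this commit message. The function name and row layout are hypothetical; only the idea that `jct_score` may be absent comes from the commit.

```python
from typing import List, Optional, Tuple


def rel_error(estimate: float, benchmark: Optional[float]) -> Optional[float]:
    """Signed relative error, or None when no benchmark is published."""
    if benchmark is None:
        return None
    return (estimate - benchmark) / abs(benchmark)


# (reform, populace estimate $B, published benchmark $B or None).
rows: List[Tuple[str, float, Optional[float]]] = [
    ("CTC", 114.9, 173.8),
    ("EITC", 96.2, 124.2),
    ("CDCC", 3.08, 3.69),
    ("Standard deduction", 261.6, None),  # magnitude only, no benchmark
    ("All itemized", 88.1, None),         # magnitude only, no benchmark
]

report = {}
for name, estimate, benchmark in rows:
    err = rel_error(estimate, benchmark)
    report[name] = None if err is None else round(100 * err)

print(report)
# {'CTC': -34, 'EITC': -23, 'CDCC': -17, 'Standard deduction': None, 'All itemized': None}
```

Returning None instead of a placeholder number lets a consumer distinguish "no benchmark exists" from "the estimate matched exactly".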

#65 reworked SimpleTaxExpenditureReform: the JCT dollar figure (.value) now lives in the ledger target, not on the reform object. in_sample_reform_specs no longer reads reform.value; instead reform_validation_payload takes in_sample_targets (the calibration target value per id), and the builder supplies it from the calibration result. So an in-sample reform's JCT score is the target it was calibrated to, and its populace estimate is the calibrated final_estimate; both come straight from the calibration diagnostics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The counterfactual-revert and null-benchmark tests called the real build_reform(), whose lazy policyengine_core import is absent in the populace-build CI env. Monkeypatch build_reform to a sentinel like the sibling budget-effect test, so the suite stays simulation-injected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

PavelMakarchuk and others added 3 commits June 16, 2026 12:35

PavelMakarchuk force-pushed the reform-validation branch from 6bca838 to 1f17db7 Compare June 16, 2026 16:39

PavelMakarchuk marked this pull request as ready for review June 16, 2026 17:34

PavelMakarchuk merged commit 7ef04bc into main Jun 16, 2026
4 checks passed

PavelMakarchuk merged 4 commits into main from reform-validation
