Emit reform_validation.json: dataset budget effects vs JCT scores #63

Merged
PavelMakarchuk merged 4 commits into main from reform-validation
Jun 16, 2026

Conversation

@PavelMakarchuk

What

Adds a per-release reform-validation artifact, reform_validation.json — the downstream counterpart to calibration_diagnostics.json. Calibration measures fit to its targets; this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. It's consumed by the calibration-diagnostics dashboard (PolicyEngine/calibration-diagnostics#16).

Two labelled kinds of reform

  • in-sample — the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate (no extra simulation), flagged in_sample=true so a consumer knows agreement is expected.
  • out-of-sample — OBBBA provisions the calibration never saw (obbba_reforms.json): no-tax-on-tips and no-tax-on-overtime, with their per-fiscal-year JCX-35-25 scores. OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert; the provision effect is baseline − reform (sign-comparable to the JCT enactment score), simulated at FY2026 against JCT's FY2026 line.
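
The counterfactual-revert sign convention can be sketched as follows. This is a minimal illustration, not the PR's code: `provision_effect`, `fake_simulate`, the reform label, and the revenue figures are all hypothetical, and `simulate` is assumed to return total income-tax revenue for a given reform (or for the baseline when passed `None`).

```python
from typing import Callable, Optional

# Assumed shape of the injected simulator: reform label (None = baseline) -> revenue.
Simulate = Callable[[Optional[str]], float]


def provision_effect(simulate: Simulate, revert_reform: str) -> float:
    """Budget effect of an already-enacted provision via a counterfactual revert.

    The baseline already contains the provision, so the reform removes it.
    baseline - reform is then the revenue change from enacting the provision,
    which has the same sign as an enactment score.
    """
    baseline_revenue = simulate(None)
    reverted_revenue = simulate(revert_reform)
    return baseline_revenue - reverted_revenue


def fake_simulate(reform: Optional[str]) -> float:
    # Made-up revenue totals (in $B): reverting the provision raises revenue.
    revenue = {None: 2_000.0, "revert_no_tax_on_tips": 2_006.0}
    return revenue[reform]


effect = provision_effect(fake_simulate, "revert_no_tax_on_tips")
print(effect)  # negative: enacting the provision reduced revenue
```

Because the revert raises revenue relative to the baseline, the difference comes out negative, matching the sign of a revenue-losing enactment score.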

Why only two OBBBA provisions (for now)

The curated set deliberately excludes provisions where a clean validation isn't possible — documented inline in obbba_reforms.json:

  • SALT cap, CTC, standard deduction — the JCX-35-25 line bundles TCJA extension + enhancement, so a parameter revert can't be isolated to the JCT figure.
  • Senior bonus deduction — no standalone JCX-35-25 line (it's netted inside the personal-exemption termination line).
  • Trump accounts — not modeled in policyengine-us.
  • Estate exemption — clean parameter, but estate tax rarely fires in microdata.

That leaves tips and overtime: genuinely new provisions whose revert captures the whole provision and whose JCT line is exact.

Files

  • packages/populace-build/src/populace/build/us/reform_validation.py: ReformValidationSpec, in/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate()), write_reform_validation.
  • .../us/obbba_reforms.json — curated out-of-sample set + JCT citations.
  • tools/build_us_fiscal_refresh_release.py — writes reform_validation.json after the release H5 and registers it in the release manifest; adds --skip-reform-validation / --skip-out-of-sample-reforms.

Tests

packages/populace-build/tests/test_reform_validation.py: 9 tests (sign conventions incl. the counterfactual flip, in-sample-from-calibration, shipped-config loading), isolated behind a fake simulation so they need no policyengine-us. The existing test_us_fiscal_targets.py (20 tests) is unaffected. ruff clean.

State / follow-up

Out-of-sample budget effects populate when a release build actually runs the OBBBA microsims (build_us_fiscal_refresh_release.py); until then the artifact is the in-sample rows plus null out-of-sample estimates. I have not run a full release build here (needs the base H5 + a calibration run), so the OBBBA parameter paths and the resulting FY2026 magnitudes should be sanity-checked against a real build before merge.
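
A rough sketch of how the artifact could carry null out-of-sample estimates when no microsimulation runs. `ReformSpec` and `build_payload` are hypothetical stand-ins (the real ReformValidationSpec and reform_validation_payload may have different fields and signatures), and the in-sample numbers are made up; only the −10.12 tips score comes from this PR.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ReformSpec:  # hypothetical stand-in for the PR's spec class
    reform_id: str
    in_sample: bool
    jct_score: float                  # $B
    estimate: Optional[float] = None  # known up front for in-sample rows


def build_payload(specs, simulate: Optional[Callable[[str], float]] = None) -> dict:
    """Build rows; out-of-sample estimates stay None unless simulate is injected."""
    rows = []
    for spec in specs:
        estimate = spec.estimate
        if estimate is None and simulate is not None:
            estimate = simulate(spec.reform_id)
        rows.append(
            {
                "id": spec.reform_id,
                "in_sample": spec.in_sample,
                "jct_score": spec.jct_score,
                "estimate": estimate,
            }
        )
    return {"reforms": rows}


specs = [
    ReformSpec("example_in_sample_reform", True, 10.0, estimate=9.8),  # made-up values
    ReformSpec("no_tax_on_tips", False, -10.12),
]
payload = build_payload(specs)        # no simulator injected
text = json.dumps(payload, indent=2)  # Python None serializes as JSON null
print(text)
```

Passing a `simulate` callable later fills the same rows in place of the nulls, so the schema does not change between a dry run and a full release build.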

🤖 Generated with Claude Code

PavelMakarchuk added a commit to PolicyEngine/calibration-diagnostics that referenced this pull request Jun 16, 2026
Sync with the producer schema (PolicyEngine/populace#63): each reform now
carries in_sample + period. Out-of-sample reforms (OBBBA provisions the
calibration never saw) are the genuine fidelity test; in-sample reforms are
JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats
  (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err);
  history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge;
  out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
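
The out-of-sample summary lives in reforms.ts; the Python sketch below only illustrates the idea of excluding in-sample rows from the headline statistics. The row layout, the choice to report `out_of_sample_within_10pct` as a count, and all numbers are assumptions.

```python
def out_of_sample_summary(reforms, tolerance=0.10):
    """Summarize absolute relative error over out-of-sample rows only."""
    errors = [
        abs((r["estimate"] - r["jct_score"]) / r["jct_score"])
        for r in reforms
        if not r["in_sample"] and r["estimate"] is not None and r["jct_score"]
    ]
    n = len(errors)
    return {
        "n_out_of_sample": n,
        "out_of_sample_within_10pct": sum(e <= tolerance for e in errors),
        "out_of_sample_mean_abs_rel_err": sum(errors) / n if n else None,
    }


# Made-up rows: the in-sample row misses badly but must not affect the summary.
reforms = [
    {"id": "a", "in_sample": True, "estimate": 50.0, "jct_score": 100.0},
    {"id": "b", "in_sample": False, "estimate": 95.0, "jct_score": 100.0},
    {"id": "c", "in_sample": False, "estimate": 80.0, "jct_score": 100.0},
]
summary = out_of_sample_summary(reforms)
print(summary)
```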
@PavelMakarchuk

Verified end-to-end on the released populace-US dataset

Ran the reforms through policyengine-us 1.334.0 on populace_us_2024.h5 (the live HF release), FY2026. Both out-of-sample provisions construct as proper counterfactual reverts and produce real, correctly-signed budget effects:

| Reform | populace | JCT FY2026 (JCX-35-25) | error |
|---|---|---|---|
| No tax on tips | −$6.27B | −$10.12B | +38% (under) |
| No tax on overtime | −$17.67B | −$32.81B | +46% (under) |
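
The error column is consistent with a relative error of the form (populace − JCT) / |JCT|. That formula is an assumption inferred from the figures rather than taken from the code; `relative_error` is a hypothetical helper.

```python
def relative_error(estimate: float, benchmark: float) -> float:
    """Signed relative error of the estimate against the benchmark."""
    return (estimate - benchmark) / abs(benchmark)


# (populace, JCT FY2026) in $B, from the table above.
rows = {
    "No tax on tips": (-6.27, -10.12),
    "No tax on overtime": (-17.67, -32.81),
}

formatted = {
    name: f"{relative_error(populace, jct):+.0%}"
    for name, (populace, jct) in rows.items()
}
for name, err in formatted.items():
    print(f"{name}: {err}")
```

With both values negative, a positive error means the populace estimate is smaller in magnitude than the JCT score, which is why the rows are labelled "under".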

Data coverage on the populace dataset (this is what makes the reforms measurable): tip_income = $136.9B, fsla_overtime_premium = $118.3B (6,578 records). Note: overtime is a no-op on the default CPS, where fsla_overtime_premium is an unimputed input and therefore 0, but populace imputes it, so the reform validates on populace.

The in-sample neutralize path was also verified to construct and run (neutralizing SALT/medical/charitable raises income_tax by $22.4B / $11.3B / $61.9B — the positive tax-expenditure values, correct convention).

The 38% and 46% under-estimates are the actual validation signal: populace under-captures tip/overtime income relative to JCT's assumptions.

Corrections found while testing: the CTC path gov.irs.credits.ctc.amount.base is a bracket ParameterScale, not a scalar — it can't be set with a flat value (it was already excluded for TCJA-bundling, now also confirmed unencodable as written). Remaining gap: in-sample budget effects come from a live calibration final_estimate, which a full release build produces — not exercised here.

PavelMakarchuk and others added 3 commits June 16, 2026 12:35
Adds a per-release reform-validation artifact, the downstream counterpart to
calibration_diagnostics.json: where calibration measures fit to its targets,
this measures how closely the calibrated dataset reproduces the budget effects
of JCT-scored reforms. The calibration-diagnostics dashboard consumes it.

Two labelled kinds of reform:
- in-sample: the JCT tax-expenditure reforms that are themselves calibration
  targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the
  calibration's own final_estimate — no extra simulation — flagged
  in_sample=True so a consumer knows agreement is expected.
- out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json:
  no-tax-on-tips and no-tax-on-overtime, with their per-FY JCX-35-25 scores).
  OBBBA is baked into the policyengine-us baseline, so each is encoded as a
  counterfactual revert and the provision effect is baseline - reform
  (sign-comparable to the JCT enactment score), simulated at FY2026.

- packages/populace-build/.../reform_validation.py: ReformValidationSpec, the
  in-sample/out-of-sample spec builders, reform_validation_payload (microsim
  isolated behind an injected simulate() for testing), write_reform_validation.
- obbba_reforms.json: curated out-of-sample set; excludes provisions whose JCT
  line bundles TCJA extension (SALT/CTC/standard deduction), lacks a standalone
  line (senior deduction), or isn't modeled (Trump accounts) — documented inline.
- build_us_fiscal_refresh_release.py: writes reform_validation.json after the
  release H5, adds it to the release manifest; --skip-reform-validation and
  --skip-out-of-sample-reforms flags.
- 9 unit tests (sign conventions, in-sample-from-calibration, config loading),
  fake-sim isolated so they need no policyengine-us. ruff clean.

Out-of-sample budget effects populate when a release build runs the OBBBA
microsims; the artifact is otherwise the in-sample rows plus null estimates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…mized)

Extends reform validation beyond OBBBA to the major tax provisions, each a
neutralize_variable repeal whose simulated revenue change is compared to a
published tax-expenditure figure:

- CTC vs JCT $173.8B (JCX-48-24), EITC vs JCT $124.2B, CDCC vs Treasury $3.69B
  (JCT bundles CDCC with the employer-childcare exclusion).
- Standard deduction and all-itemized-combined carry NO benchmark: both JCT and
  Treasury treat the standard deduction as baseline, and neither publishes a
  combined itemized total — so these publish the repeal magnitude only
  (jct_score is now Optional to support that).
- in_sample flags calibration status honestly: EITC is in-sample (SOI EITC
  targets), CTC partly, CDCC/standard/itemized out-of-sample. The individual
  deduction reforms (SALT/mortgage/charitable/medical/QBI) are already
  validated in-sample, so they aren't duplicated here.

Verified on the released populace_us_2024.h5 (FY2024): CTC $114.9B vs $173.8B
(-34%), EITC $96.2B vs $124.2B (-23%), CDCC $3.08B vs $3.69B (-17%), standard
$261.6B, all-itemized $88.1B — populace under-captures the big refundable
credits, a real validation signal.

tax_expenditure_reforms.json config + tax_expenditure_reform_specs loader,
wired into load_default_reform_specs; 2 new tests. ruff clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
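
A small sketch of how an optional benchmark might flow through the error calculation, using the FY2024 figures quoted in this commit message. The helper name and the (estimate − benchmark) / |benchmark| formula are assumptions; the PR only states that jct_score is Optional.

```python
from typing import Optional


def relative_error(estimate: float, jct_score: Optional[float]) -> Optional[float]:
    """Signed relative error, or None when no benchmark is published."""
    if jct_score is None:
        return None
    return (estimate - jct_score) / abs(jct_score)


# (populace estimate, benchmark) in $B; None marks "repeal magnitude only".
verified = {
    "CTC": (114.9, 173.8),
    "EITC": (96.2, 124.2),
    "CDCC": (3.08, 3.69),
    "standard deduction": (261.6, None),
    "all itemized": (88.1, None),
}

report = {}
for name, (estimate, benchmark) in verified.items():
    err = relative_error(estimate, benchmark)
    report[name] = "no benchmark" if err is None else f"{err:+.0%}"
    print(f"{name}: {report[name]}")
```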
#65 reworked SimpleTaxExpenditureReform — the JCT dollar figure (.value) now
lives in the ledger target, not on the reform object. in_sample_reform_specs no
longer reads reform.value; instead reform_validation_payload takes
in_sample_targets (the calibration target value per id), and the builder
supplies it from the calibration result. So an in-sample reform's JCT score is
the target it was calibrated to, and its populace estimate is the calibrated
final_estimate — both straight from the calibration diagnostics.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
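
The reworked data flow might look roughly like this: both numbers for an in-sample row come from the calibration result, keyed by reform id. The function name, dictionary layout, ids, and values are hypothetical; only the idea that `in_sample_targets` supplies the JCT score and `final_estimate` supplies the populace estimate comes from the commit message.

```python
def in_sample_rows(reform_ids, in_sample_targets, final_estimates):
    """Pair each in-sample reform with its calibration target and final estimate."""
    return [
        {
            "id": reform_id,
            "in_sample": True,
            "jct_score": in_sample_targets[reform_id],  # target it was calibrated to
            "estimate": final_estimates[reform_id],     # calibrated final_estimate
        }
        for reform_id in reform_ids
    ]


# Made-up ids and values ($B), standing in for calibration diagnostics.
targets = {"reform_a": 20.0, "reform_b": 60.0}
estimates = {"reform_a": 19.5, "reform_b": 61.2}
rows = in_sample_rows(["reform_a", "reform_b"], targets, estimates)
print(rows)
```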
The counterfactual-revert and null-benchmark tests called the real
build_reform(), whose lazy policyengine_core import is absent in the
populace-build CI env. Monkeypatch build_reform to a sentinel like the
sibling budget-effect test, so the suite stays simulation-injected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
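
The sentinel technique can be sketched without pytest. The namespace below is a hypothetical stand-in for the reform_validation module, `budget_effect` is an invented caller, and the ImportError merely imitates the missing policyengine_core import; the actual tests use pytest's monkeypatch rather than unittest.mock.

```python
import types
from unittest import mock

# Hypothetical stand-in for the module under test.
reform_validation = types.SimpleNamespace()


def build_reform(spec):
    # Imitates the lazy import failing in an environment without policyengine_core.
    raise ImportError("policyengine_core is not installed in this environment")


def budget_effect(spec, simulate):
    # Looks up build_reform on the namespace at call time, so a patch takes effect.
    reform = reform_validation.build_reform(spec)
    return simulate(reform)


reform_validation.build_reform = build_reform
reform_validation.budget_effect = budget_effect

SENTINEL = object()
seen = []


def fake_simulate(reform):
    seen.append(reform)  # record what the simulator was handed
    return -6.27


# Replace build_reform with a mock that returns the sentinel for the test's duration.
with mock.patch.object(reform_validation, "build_reform", return_value=SENTINEL):
    result = reform_validation.budget_effect({"id": "no_tax_on_tips"}, fake_simulate)

print(result, seen[0] is SENTINEL)
```

The fake simulator never inspects the reform object, so an opaque sentinel is enough to check that the value returned by build_reform is passed through unchanged.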
@PavelMakarchuk PavelMakarchuk marked this pull request as ready for review June 16, 2026 17:34
@PavelMakarchuk PavelMakarchuk merged commit 7ef04bc into main Jun 16, 2026
4 checks passed