populace-build: reference-pinned, recorded, reconstructable parity (part of #19) by MaxGhenis · Pull Request #21 · PolicyEngine/populace

MaxGhenis · 2026-06-14T18:00:32Z

Fixes part of PolicyEngine/populace-benchmarks#1 — the pin + record + reconstructable runner half. The data-gap closure (missing PUF-derived credit inputs) and the CI drift-check job remain open in PolicyEngine/populace-benchmarks#1.

What PolicyEngine/populace-benchmarks#1 found

The certified US default passed its build-time parity gate, but the verdict was not reproducible, on three counts:

The parity reference was unpinned. Parity ran against a working-copy eCPS, not a recorded revision, so gaps=0 silently rotted the moment the eCPS changed (PolicyEngine/populace-benchmarks#1 reproduced exactly that: the certified default fell behind a newer eCPS on 10 layers).
The release manifest recorded no gate result — no gaps count, no reference identity, no skipped-layer count. The verdict wasn't in the artifact.
The runner was deleted from HEAD in fda3838 (packages/populace-data/build/us/check_parity.py). The gate library (populace.build.parity_gate) survived, but nothing invoked it against a reference — the contract was only reconstructable from git history.

What this PR adds

A new module packages/populace-build/src/populace/build/parity_reference.py that re-homes the simulation-level parity runner into the package and makes it reference-pinned and recorded. It reuses the surviving parity_gate for the judging logic (does not reimplement it):

ReferenceSpec: a frozen dataclass identifying the reference eCPS by sha256 (mandatory) plus either a Hugging Face repo + revision or a local path. An unpinned reference is refused at construction, since that is the exact bug described in PolicyEngine/populace-benchmarks#1. from_local_file() hashes via the shared trace.sha256_file, so the reference is hashed by the same algorithm the build's TRACE provenance uses for every other artifact.
judge_parity(candidate_shares, reference_shares, reference_spec, *, known_gaps=(), skipped=()) — pure, fully unit-tested. Delegates the verdict to parity_gate (failure lines verbatim, not reinvented) and records in GateResult.details: the reference identity (reference: sha256 + repo/revision or path + kind), skipped / skipped_layers, and candidate_populated_layers alongside the base gate's reference_populated_layers / gaps / exempted. Runs with no 355 MB dataset and no policyengine_us — it takes precomputed share dicts. This is the core separation: judge + record reference identity (pure, tested) vs gather shares (needs the sim).
gather_candidate_shares(reference_layers, *, year, tax_benefit_system, calculate) — ports check_parity.py's simulation loop: skip variables the engine does not register or that are non-annual; non-zero share per variable; pop structural weights (household_weight, person_weight); a failed calculate is recorded as a 0.0 candidate share (a real gap), not a skip — matching the deleted runner. The Microsimulation is isolated behind an injected calculate callable so the skip rules and weight-popping are unit-testable without policyengine_us or a dataset.
reference_layers(path, *, year) — reads the eCPS's flat var/YEAR HDF5 layers (port of stored_layers), skipping string/object columns and filtering by year correctly. This fixes a latent defect in the original: an off-year dataset (takes_up_aca_if_eligible/2025) was mis-parsed into a phantom layer named "2025". Verified against the real frozen reference (enhanced_cps_2024_hf_main.h5): reads 236 true 2024 numeric layers (the original's 237 included that one phantom key).
run_parity_against_reference(candidate_path, reference, *, year, known_gaps=()) — thin orchestrator wiring gather → judge: resolves/hashes the reference to a pinned ReferenceSpec, reads its layers, builds the candidate sim (the only path that imports policyengine_us), injects sim.calculate(var, year).values as the seam, and returns the recorded GateResult.
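The "refuse an unpinned reference at construction" rule can be illustrated with a minimal sketch. This is not the module's code: `PinnedReference` and the local `sha256_file` helper are hypothetical stand-ins for `ReferenceSpec` and `trace.sha256_file`, and the field names and error messages are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so a large dataset is never read whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class PinnedReference:
    """Hypothetical stand-in for ReferenceSpec; field names are assumed."""

    sha256: Optional[str]
    repo: Optional[str] = None
    revision: Optional[str] = None
    path: Optional[str] = None

    def __post_init__(self) -> None:
        # An unpinned reference (no content hash) is refused outright.
        if not self.sha256:
            raise ValueError("unpinned reference: sha256 is required")
        # A Hugging Face location needs both the repo and the revision.
        if (self.repo is None) != (self.revision is None):
            raise ValueError("a Hugging Face reference needs repo and revision")
        # Some location must be given: repo + revision, or a local path.
        if self.repo is None and self.path is None:
            raise ValueError("need repo + revision, or a local path")

    @classmethod
    def from_local_file(cls, path: Path) -> "PinnedReference":
        """Pin a local file by hashing its bytes."""
        return cls(sha256=sha256_file(path), path=str(path))
```

Validating in `__post_init__` means no caller can hold an instance that lacks a hash, so later code that records the reference identity never has to re-check it.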

Tested vs deferred

Tested (TDD, 25 tests, no network, no large files):

ReferenceSpec rejects construction without a sha256 (and None sha256, and a HF spec missing repo/revision); from_local_file hashes the bytes.
judge_parity: gap when the reference populates a layer the candidate-shares show 0; pass when the candidate populates; known_gaps exemption honored; reference identity (sha256/repo/revision/path) recorded in details; layer/skipped counts recorded; failure text identical to parity_gate's.
gather_candidate_shares: with a fake tax_benefit_system stub (annual + monthly + unregistered vars) and an injected fake sim, the skip logic, structural-weight popping, and the failed-calculate-is-a-gap rule.
reference_layers: flat var/YEAR reader, string-column skipping, year filtering (no off-year/phantom-key leak), and the entity/variable single-year layout (real h5py, tiny in-memory files).
Round-trip: a recorded GateResult serializes via GateReport.to_manifest() into a manifest-style dict carrying the reference identity, and is JSON-serializable.
Package re-export from populace.build.
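The year-filtering behavior covered by the reference_layers tests can be sketched without h5py by working on the key names alone. The helper below is hypothetical (`layers_for_year` is not a function in the PR) and assumes keys are plain `variable/YEAR` strings; it only illustrates that an off-year key is dropped rather than counted or renamed.

```python
from typing import Iterable, Set


def layers_for_year(keys: Iterable[str], year: int) -> Set[str]:
    """Return the variable names that have a layer stored for `year`.

    Assumes each key looks like 'variable/YEAR'; anything else is ignored.
    """
    wanted = str(year)
    layers: Set[str] = set()
    for key in keys:
        variable, separator, period = key.rpartition("/")
        if not separator or not variable:
            continue  # not a variable/YEAR key
        if period != wanted:
            continue  # off-year layer: excluded, never turned into a name
        layers.add(variable)
    return layers


keys = [
    "employment_income/2024",
    "takes_up_aca_if_eligible/2025",  # off-year entry
    "household_weight/2024",
]
layers = layers_for_year(keys, 2024)
```

With these inputs, `layers` holds the two 2024 variables only; neither `takes_up_aca_if_eligible` nor a bare `"2025"` appears in it.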

Deferred to follow-up (still tracked in PolicyEngine/populace-benchmarks#1, noted in the module docstring):

The CI drift-check that re-runs run_parity_against_reference against the latest published eCPS on every build so reference drift fails loudly. Not built here.
Closing the data gaps (the 6 PUF-derived credit/penalty inputs, the self_employed_health_insurance_ald upstream, the 3 reported aggregates) — separate impute-stage work.

Documentation note

The repo has no docs/ tree or changelog tooling, so the release-manifest contract is documented in the module docstring: the produced GateResult should travel with the release via GateReport.to_manifest() (already content-hashed into the build's TRACE TRO by trace.py), and the CI drift-check is the companion follow-up.
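As a rough illustration of a verdict that travels with the release, the sketch below judges precomputed share dicts and returns a JSON-serializable record carrying the reference identity. It does not use `parity_gate`, `GateResult`, or `GateReport.to_manifest()`; the function name `judge_and_record`, the key names, and the choice to record gaps as a list are all assumptions.

```python
import json
from typing import Dict, Iterable, Mapping


def judge_and_record(
    candidate_shares: Mapping[str, float],
    reference_shares: Mapping[str, float],
    reference_identity: Mapping[str, str],
    known_gaps: Iterable[str] = (),
) -> Dict[str, object]:
    """Flag layers the reference populates but the candidate leaves at zero."""
    exempt = set(known_gaps)
    populated = sorted(v for v, share in reference_shares.items() if share > 0)
    empty = [v for v in populated if candidate_shares.get(v, 0.0) == 0.0]
    gaps = [v for v in empty if v not in exempt]
    exempted = [v for v in empty if v in exempt]
    return {
        "passed": not gaps,
        "details": {
            # Identity of what was judged against, so the verdict is auditable.
            "reference": dict(reference_identity),
            "reference_populated_layers": len(populated),
            "gaps": gaps,
            "exempted": exempted,
        },
    }


record = judge_and_record(
    candidate_shares={"employment_income": 0.6, "some_credit": 0.0},
    reference_shares={"employment_income": 0.7, "some_credit": 0.1},
    reference_identity={"sha256": "0" * 64, "path": "reference.h5"},
)
manifest_text = json.dumps(record, sort_keys=True)
```

Because the record holds only strings, numbers, booleans, lists, and dicts, it survives a `json.dumps` / `json.loads` round trip unchanged, which is the property a manifest entry needs.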

Deviation from spec

gather_candidate_shares takes an injected calculate callable (and the tax_benefit_system) rather than a candidate_path. Building the Microsimulation from a path inside gather would make it untestable without policyengine_us — which the spec explicitly forbids for the test suite. The path → sim step lives in the run_parity_against_reference orchestrator instead, which is a faithful reading of "the sim dependency is isolated here" + "inject the sim via a callable." Same intent, the seam just sits one call out.
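The seam described above can be sketched as follows. This is a simplified illustration, not the PR's function: `gather_shares` takes a plain set of annual variable names in place of a `tax_benefit_system`, returns the skipped names alongside the shares, and drops the structural weights up front; all of those are assumptions. The point is that the gathering logic only ever sees a `calculate` callable, so a test can pass a fake.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Set, Tuple

STRUCTURAL_WEIGHTS = ("household_weight", "person_weight")


def gather_shares(
    reference_layers: Iterable[str],
    *,
    annual_variables: Set[str],
    calculate: Callable[[str], Sequence[float]],
) -> Tuple[Dict[str, float], List[str]]:
    """Compute the non-zero share per variable through an injected callable."""
    shares: Dict[str, float] = {}
    skipped: List[str] = []
    for variable in reference_layers:
        if variable in STRUCTURAL_WEIGHTS:
            continue  # weights are structural, not parity layers
        if variable not in annual_variables:
            skipped.append(variable)  # unregistered or non-annual
            continue
        try:
            values = list(calculate(variable))
        except Exception:
            shares[variable] = 0.0  # a failed calculate is a gap, not a skip
            continue
        nonzero = sum(1 for value in values if value != 0)
        shares[variable] = nonzero / len(values) if values else 0.0
    return shares, skipped


def fake_calculate(name: str) -> Sequence[float]:
    """Stand-in for the real simulation; raises for one variable."""
    if name == "broken_variable":
        raise RuntimeError("simulated failure")
    return {"employment_income": [0.0, 10.0, 20.0, 0.0]}[name]


shares, skipped = gather_shares(
    ["employment_income", "household_weight", "monthly_variable", "broken_variable"],
    annual_variables={"employment_income", "broken_variable"},
    calculate=fake_calculate,
)
```

In the real orchestrator the callable would wrap the simulation (the PR describes injecting `sim.calculate(var, year).values`); here the fake exercises the skip rule, the weight handling, and the failed-calculate rule with no dataset.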

🤖 Generated with Claude Code

…art of #19)

Issue #19 found the certified US parity verdict was unreproducible on three counts: the parity reference eCPS was unpinned (gaps=0 was judged against a working-copy eCPS that since drifted), the release manifest recorded no gate result (no gaps count, reference identity, or skipped-layer count), and the runner that produced the verdict was deleted from HEAD in fda3838 (only the gate *library* survived). So "parity 0" was neither reproducible nor drift-detectable.

Add packages/populace-build/src/populace/build/parity_reference.py, which re-homes the simulation-level parity runner into the package and pins + records what it judges against:

- ReferenceSpec: a frozen dataclass identifying the reference eCPS by sha256 (mandatory) plus either a Hugging Face repo+revision or a local path. An unpinned reference is refused at construction: it is the exact #19 bug. from_local_file() hashes via the shared trace.sha256_file (one hash definition across the build).
- judge_parity(): pure. Reuses the surviving populace.build.parity_gate for the verdict (failure lines verbatim, not reinvented) and records the reference identity + gap/skip/populated-layer counts in GateResult.details. Runs with no dataset and no policyengine_us; it takes precomputed share dicts. This is the "judge + record reference identity" half, separated from "gather shares".
- gather_candidate_shares(): ports check_parity.py's simulation loop (skip vars not engine-registered or non-annual; non-zero share per var; pop structural weights; a failed sim.calculate is a recorded gap, not a skip), with the Microsimulation isolated behind an injected calculate callable so the skip rules are unit-testable without policyengine_us or a 355MB dataset.
- reference_layers(): reads the eCPS's flat var/YEAR HDF5 layers (port of stored_layers), skipping string/object columns and (fixing a latent defect in the original) filtering by year correctly so an off-year dataset (e.g. takes_up_aca_if_eligible/2025) is neither counted under the wrong year nor recorded with a year token as its variable name. Verified against the real frozen reference: 236 true 2024 numeric layers (the original's 237 included one such phantom "2025" key).
- run_parity_against_reference(): the thin orchestrator that wires the real sim in (the only path that imports policyengine_us), hashes/pins the reference, and records the verdict.

The module docstring states the release manifest should carry this GateResult (via GateReport.to_manifest(), already content-hashed into the build TRO by trace.py) and that a CI drift-check re-running parity against the latest eCPS is the follow-up, not built here.

TDD: 25 tests covering the sha256 requirement, gap/pass/known-gap exemption, recorded reference identity (sha256/repo/revision) and counts, the gather skip logic + structural-weight popping over a fake sim, the H5 var/YEAR reader (including year filtering and the entity/variable layout), and a manifest round-trip. No network, no large files. Re-exported from populace.build alongside the gates.

Scope: does NOT close the PUF-derived credit-input data gaps or build the CI drift-check job; both remain tracked in #19.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

MaxGhenis · 2026-06-14T19:50:22Z

Closing out of live Populace. Incumbent/eCPS comparison and reference-drift harnesses now belong in PolicyEngine/populace-benchmarks; the related issues were transferred there as PolicyEngine/populace-benchmarks#1 and #2.

MaxGhenis force-pushed the pin-parity-reference branch from 59c90a2 to 12ac387 on June 14, 2026 18:03

This was referenced Jun 14, 2026

EPIC: certified populace-us default correctness (gates not fully wired / reference rotted) #28 (Closed)

Fix Populace release and export gates #37 (Merged)

MaxGhenis closed this Jun 14, 2026

MaxGhenis deleted the pin-parity-reference branch June 14, 2026 19:50

MaxGhenis wants to merge 1 commit into main from pin-parity-reference
