populace-build: reference-pinned, recorded, reconstructable parity (part of #19) #21

Closed
MaxGhenis wants to merge 1 commit into main from pin-parity-reference

@MaxGhenis MaxGhenis commented Jun 14, 2026

Fixes part of PolicyEngine/populace-benchmarks#1 — the pin + record + reconstructable runner half. The data-gap closure (missing PUF-derived credit inputs) and the CI drift-check job remain open in PolicyEngine/populace-benchmarks#1.

What PolicyEngine/populace-benchmarks#1 found

The certified US default passed its build-time parity gate, but the verdict was not reproducible, on three counts:

  1. The parity reference was unpinned. Parity ran against a working-copy eCPS, not a recorded revision, so gaps=0 silently rotted the moment the eCPS changed. PolicyEngine/populace-benchmarks#1 ("Certified populace-us drifts behind current eCPS on 10 parity layers; pin the parity reference + restore the gate runner") reproduced exactly that: the certified default fell behind a newer eCPS on 10 layers.
  2. The release manifest recorded no gate result — no gaps count, no reference identity, no skipped-layer count. The verdict wasn't in the artifact.
  3. The runner was deleted from HEAD in fda3838 (packages/populace-data/build/us/check_parity.py). The gate library (populace.build.parity_gate) survived, but nothing invoked it against a reference — the contract was only reconstructable from git history.

What this PR adds

A new module packages/populace-build/src/populace/build/parity_reference.py that re-homes the simulation-level parity runner into the package and makes it reference-pinned and recorded. It reuses the surviving parity_gate for the judging logic (does not reimplement it):

  • ReferenceSpec — frozen dataclass identifying the reference eCPS by sha256 (mandatory) plus either a Hugging Face repo + revision or a local path. An unpinned reference is refused at construction, because that is the exact bug reported in PolicyEngine/populace-benchmarks#1. from_local_file() hashes via the shared trace.sha256_file, so the reference is hashed by the same algorithm the build's TRACE provenance uses for every other artifact.

  • judge_parity(candidate_shares, reference_shares, reference_spec, *, known_gaps=(), skipped=()) — pure, fully unit-tested. Delegates the verdict to parity_gate (failure lines verbatim, not reinvented) and records in GateResult.details: the reference identity (reference: sha256 + repo/revision or path + kind), skipped / skipped_layers, and candidate_populated_layers alongside the base gate's reference_populated_layers / gaps / exempted. Runs with no 355 MB dataset and no policyengine_us — it takes precomputed share dicts. This is the core separation: judge + record reference identity (pure, tested) vs gather shares (needs the sim).

  • gather_candidate_shares(reference_layers, *, year, tax_benefit_system, calculate) — ports check_parity.py's simulation loop: skip variables the engine does not register or that are non-annual; non-zero share per variable; pop structural weights (household_weight, person_weight); a failed calculate is recorded as a 0.0 candidate share (a real gap), not a skip — matching the deleted runner. The Microsimulation is isolated behind an injected calculate callable so the skip rules and weight-popping are unit-testable without policyengine_us or a dataset.

  • reference_layers(path, *, year) — reads the eCPS's flat var/YEAR HDF5 layers (port of stored_layers), skipping string/object columns and filtering by year correctly. This fixes a latent defect in the original: an off-year dataset (takes_up_aca_if_eligible/2025) was mis-parsed into a phantom layer named "2025". Verified against the real frozen reference (enhanced_cps_2024_hf_main.h5): reads 236 true 2024 numeric layers (the original's 237 included that one phantom key).

  • run_parity_against_reference(candidate_path, reference, *, year, known_gaps=()) — thin orchestrator wiring gather → judge: resolves/hashes the reference to a pinned ReferenceSpec, reads its layers, builds the candidate sim (the only path that imports policyengine_us), injects sim.calculate(var, year).values as the seam, and returns the recorded GateResult.
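As an illustration of the pinning rule described for ReferenceSpec above, here is a minimal sketch. It is not the module's actual code: the field names (`repo`, `revision`, `path`), the error messages, and the local `sha256_file` helper (a stand-in for the shared `trace.sha256_file`) are assumptions. Only the rules themselves come from the description: sha256 is mandatory, and the spec needs either a Hugging Face repo + revision or a local path.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hypothetical stand-in for trace.sha256_file: hex digest of a file's bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large reference file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class ReferenceSpec:
    """Assumed field layout; only the pinning rules are taken from the PR text."""

    sha256: str | None
    repo: str | None = None
    revision: str | None = None
    path: str | None = None

    def __post_init__(self) -> None:
        # An unpinned reference (no content hash) is refused at construction.
        if not self.sha256:
            raise ValueError("unpinned reference: sha256 is mandatory")
        has_hf = self.repo is not None and self.revision is not None
        has_local = self.path is not None
        if not (has_hf or has_local):
            raise ValueError("need a Hugging Face repo + revision, or a local path")

    @classmethod
    def from_local_file(cls, path: str | Path) -> ReferenceSpec:
        """Pin a local file by hashing its bytes."""
        path = Path(path)
        return cls(sha256=sha256_file(path), path=str(path))
```

Refusing the spec in `__post_init__` means no code path can reach the judging step holding an unpinned reference; the frozen dataclass keeps the recorded identity from being mutated afterwards.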

Tested vs deferred

Tested (TDD, 25 tests, no network, no large files):

  • ReferenceSpec rejects construction without a sha256 (and None sha256, and a HF spec missing repo/revision); from_local_file hashes the bytes.
  • judge_parity: gap when the reference populates a layer the candidate-shares show 0; pass when the candidate populates; known_gaps exemption honored; reference identity (sha256/repo/revision/path) recorded in details; layer/skipped counts recorded; failure text identical to parity_gate's.
  • gather_candidate_shares: with a fake tax_benefit_system stub (annual + monthly + unregistered vars) and an injected fake sim, the skip logic, structural-weight popping, and the failed-calculate-is-a-gap rule.
  • reference_layers: flat var/YEAR reader, string-column skipping, year filtering (no off-year/phantom-key leak), and the entity/variable single-year layout (real h5py, tiny in-memory files).
  • Round-trip: a recorded GateResult serializes via GateReport.to_manifest() into a manifest-style dict carrying the reference identity, and is JSON-serializable.
  • Package re-export from populace.build.
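The year-filtering behaviour tested for reference_layers can be sketched on key names alone. This is a simplified illustration, not the module's reader: the real function reads HDF5 datasets and also skips string/object columns, while the hypothetical `layers_for_year` helper below only shows how a flat var/YEAR key can be split so that an off-year key is dropped rather than leaking in as a phantom layer.

```python
def layers_for_year(keys, year):
    """Return the variable names whose flat 'var/YEAR' key matches `year`.

    Hypothetical helper: operates on key strings only, not on an HDF5 file.
    """
    wanted = str(year)
    names = []
    for key in keys:
        # Split on the last slash: everything before it is the variable name,
        # everything after it is the period token.
        variable, separator, period = key.rpartition("/")
        if not separator or not variable:
            continue  # not a var/YEAR key at all
        if period != wanted:
            continue  # off-year layer: excluded, never renamed to its year token
        names.append(variable)
    return sorted(names)


keys = [
    "employment_income/2024",
    "takes_up_aca_if_eligible/2025",  # the off-year dataset named in the PR
    "household_weight/2024",
]
layers = layers_for_year(keys, 2024)
```

With these inputs, `layers` contains only the two 2024 variables; neither `takes_up_aca_if_eligible` nor a layer named `"2025"` appears.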

Deferred to follow-up (still tracked in PolicyEngine/populace-benchmarks#1, noted in the module docstring):

  • The CI drift-check that re-runs run_parity_against_reference against the latest published eCPS on every build so reference drift fails loudly. Not built here.
  • Closing the data gaps (the 6 PUF-derived credit/penalty inputs, the self_employed_health_insurance_ald upstream, the 3 reported aggregates) — separate impute-stage work.

Documentation note

The repo has no docs/ tree or changelog tooling, so the release-manifest contract is documented in the module docstring: the produced GateResult should travel with the release via GateReport.to_manifest() (already content-hashed into the build's TRACE TRO by trace.py), and the CI drift-check is the companion follow-up.

Deviation from spec

gather_candidate_shares takes an injected calculate callable (and the tax_benefit_system) rather than a candidate_path. Building the Microsimulation from a path inside gather would make it untestable without policyengine_us — which the spec explicitly forbids for the test suite. The path → sim step lives in the run_parity_against_reference orchestrator instead, which is a faithful reading of "the sim dependency is isolated here" + "inject the sim via a callable." Same intent, the seam just sits one call out.
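To make the seam concrete, here is a minimal sketch of a gather step driven by an injected callable. It is an assumption-laden illustration, not the PR's code: `definition_periods` (a plain mapping of variable name to period) stands in for the `tax_benefit_system` argument, the return shape (shares plus a skipped list) is hypothetical, and "popping" structural weights is read here as leaving them out of the comparison.

```python
from __future__ import annotations

from typing import Callable, Iterable, Mapping, Sequence

STRUCTURAL_WEIGHTS = ("household_weight", "person_weight")


def gather_candidate_shares(
    reference_layers: Iterable[str],
    *,
    year: int,
    definition_periods: Mapping[str, str],  # hypothetical stand-in for tax_benefit_system
    calculate: Callable[[str, int], Sequence[float]],
) -> tuple[dict[str, float], list[str]]:
    """Return (non-zero share per variable, skipped variable names)."""
    shares: dict[str, float] = {}
    skipped: list[str] = []
    for variable in reference_layers:
        if variable in STRUCTURAL_WEIGHTS:
            continue  # structural weights are not judged as parity layers
        if definition_periods.get(variable) != "year":
            skipped.append(variable)  # unregistered or non-annual: skip, not a gap
            continue
        try:
            values = list(calculate(variable, year))
        except Exception:
            shares[variable] = 0.0  # a failed calculate is a real gap, not a skip
            continue
        nonzero = sum(1 for value in values if value != 0)
        shares[variable] = nonzero / len(values) if values else 0.0
    return shares, skipped


def fake_calculate(variable: str, year: int) -> Sequence[float]:
    """Test double replacing the simulation's calculate."""
    if variable == "broken_credit":
        raise RuntimeError("simulated calculate failure")
    return {"employment_income": [0.0, 10.0, 20.0, 0.0]}[variable]


shares, skipped = gather_candidate_shares(
    ["employment_income", "monthly_rent", "not_registered",
     "broken_credit", "household_weight"],
    year=2024,
    definition_periods={
        "employment_income": "year",
        "monthly_rent": "month",
        "broken_credit": "year",
        "household_weight": "year",
    },
    calculate=fake_calculate,
)
```

Because the only simulation dependency is the `calculate` argument, the skip rules and the failed-calculate rule can be exercised with a plain function like `fake_calculate`, with no dataset and no `policyengine_us` import.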

🤖 Generated with Claude Code

…art of #19)

Issue #19 found the certified US parity verdict was unreproducible on three
counts: the parity reference eCPS was unpinned (gaps=0 was judged against a
working-copy eCPS that since drifted), the release manifest recorded no gate
result (no gaps count, reference identity, or skipped-layer count), and the
runner that produced the verdict was deleted from HEAD in fda3838 (only the
gate *library* survived). So "parity 0" was neither reproducible nor
drift-detectable.

Add packages/populace-build/src/populace/build/parity_reference.py, which
re-homes the simulation-level parity runner into the package and pins + records
what it judges against:

- ReferenceSpec: a frozen dataclass identifying the reference eCPS by sha256
  (mandatory) plus either a Hugging Face repo+revision or a local path. An
  unpinned reference is refused at construction — it is the exact #19 bug.
  from_local_file() hashes via the shared trace.sha256_file (one hash
  definition across the build).

- judge_parity(): pure. Reuses the surviving populace.build.parity_gate for
  the verdict (failure lines verbatim, not reinvented) and records the
  reference identity + gap/skip/populated-layer counts in GateResult.details.
  Runs with no dataset and no policyengine_us — it takes precomputed share
  dicts. This is the "judge + record reference identity" half, separated from
  "gather shares".

- gather_candidate_shares(): ports check_parity.py's simulation loop (skip
  vars not engine-registered or non-annual; non-zero share per var; pop
  structural weights; a failed sim.calculate is a recorded gap, not a skip),
  with the Microsimulation isolated behind an injected calculate callable so
  the skip rules are unit-testable without policyengine_us or a 355MB dataset.

- reference_layers(): reads the eCPS's flat var/YEAR HDF5 layers (port of
  stored_layers), skipping string/object columns and — fixing a latent defect
  in the original — filtering by year correctly so an off-year dataset
  (e.g. takes_up_aca_if_eligible/2025) is neither counted under the wrong year
  nor recorded with a year token as its variable name. Verified against the
  real frozen reference: 236 true 2024 numeric layers (the original's 237
  included one such phantom "2025" key).

- run_parity_against_reference(): the thin orchestrator that wires the real
  sim in (the only path that imports policyengine_us), hashes/pins the
  reference, and records the verdict.

The module docstring states the release manifest should carry this GateResult
(via GateReport.to_manifest(), already content-hashed into the build TRO by
trace.py) and that a CI drift-check re-running parity against the latest eCPS
is the follow-up — not built here.

TDD: 25 tests covering the sha256 requirement, gap/pass/known-gap exemption,
recorded reference identity (sha256/repo/revision) and counts, the gather skip
logic + structural-weight popping over a fake sim, the H5 var/YEAR reader
(including year filtering and the entity/variable layout), and a manifest
round-trip. No network, no large files. Re-exported from populace.build
alongside the gates.

Scope: does NOT close the PUF-derived credit-input data gaps or build the CI
drift-check job — both remain tracked in #19.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@MaxGhenis

Closing out of live Populace. Incumbent/eCPS comparison and reference-drift harnesses now belong in PolicyEngine/populace-benchmarks; the related issues were transferred there as PolicyEngine/populace-benchmarks#1 and #2.

@MaxGhenis MaxGhenis closed this Jun 14, 2026
@MaxGhenis MaxGhenis deleted the pin-parity-reference branch June 14, 2026 19:50