Exported enum input person.race ships raw CPS codes, not Race enum members; add an enum_domain gate #36

@MaxGhenis

Summary

The certified populace_us_2024.h5 exports the PolicyEngine-US input variable person.race as raw CPS numeric codes (0, 1, 10, 11, 12, 13, …) for all 160,858 person rows, rather than the Race enum member names PolicyEngine expects (WHITE, BLACK, HISPANIC, OTHER). Every value is out of the enum's domain.

This is the same gate-escape class as EPIC #28: exported_nonzero passes (the codes are not all zero) and parity does not check value domains, so nothing in the suite catches that an exported enum variable holds values PolicyEngine-US cannot interpret.

Reproduction

Loading the sha-verified certified artifact and checking every exported column that maps to a PE variable with possible_values gives:

[person] race: invalid ['0','1','10','11','12','13', …] on 160858/160858 rows;
         valid enum names e.g. ['BLACK','HISPANIC','OTHER','WHITE']

race is the only enum-typed input affected (state_code, filing status, etc. are in-domain). The build maps a raw cps_race numeric (packages/populace-build/src/populace/build/us/sources.py) but never converts it to the Race enum on export.
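
The check behind that report can be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the function name find_out_of_domain, the in-memory dicts standing in for the loaded h5 columns, and the toy values are all assumptions, and it presumes stored values have already been decoded to plain Python values.

```python
from collections import Counter


def find_out_of_domain(columns, enum_domains):
    """Report exported columns holding values outside their enum domain.

    columns: mapping of column name -> list of stored values (hypothetical
        stand-in for the arrays read from the artifact).
    enum_domains: mapping of column name -> set of valid enum member names.
    Returns {column: (sorted invalid values, bad row count, total row count)}.
    """
    report = {}
    for name, valid in enum_domains.items():
        if name not in columns:
            continue  # variable is not exported, nothing to check
        values = [str(v) for v in columns[name]]
        bad = Counter(v for v in values if v not in valid)
        if bad:
            report[name] = (sorted(bad), sum(bad.values()), len(values))
    return report


# Toy data only: six rows, not the real 160,858-row artifact.
columns = {
    "race": [0, 1, 10, 11, 1, 0],
    "state_code": ["CA", "NY", "CA", "TX", "NY", "CA"],
}
enum_domains = {
    "race": {"WHITE", "BLACK", "HISPANIC", "OTHER"},
    "state_code": {"CA", "NY", "TX"},  # truncated for the example
}

report = find_out_of_domain(columns, enum_domains)
for name, (invalid, bad_rows, total) in report.items():
    valid = sorted(enum_domains[name])
    print(f"[person] {name}: invalid {invalid} on {bad_rows}/{total} rows; "
          f"valid enum names e.g. {valid}")
```

In-domain columns such as state_code produce no entry, so an empty report means every checked column is clean.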

Impact

  • Numerical impact is low for race specifically — it is not a federal tax/benefit driver, so the certified aggregates are unaffected.
  • Interop impact is real: a downstream consumer loading the artifact into Microsimulation gets out-of-domain enum values and must coerce them. The CRFB taxation-of-benefits build had to add a sanitize_enum_inputs step that rewrote all 160,858 race values to the enum default before it could simulate. Any consumer hits this.
  • The gate gap is the bigger issue: nothing validates exported enum domains, so a future enum input that does matter (anything tax/benefit-relevant) could ship as raw codes and pass certification the same way.
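
To make the interop cost concrete, here is a rough sketch of the kind of coercion a downstream consumer ends up writing. It is not the CRFB sanitize_enum_inputs code, whose implementation is not shown in this issue; the function name coerce_to_enum_default is hypothetical, and the default member passed in the example is an arbitrary placeholder rather than a confirmed PolicyEngine-US default.

```python
def coerce_to_enum_default(values, valid_members, default):
    """Replace every out-of-domain value with `default`.

    Returns (cleaned values, number of values rewritten).
    Hypothetical helper for illustration only.
    """
    if default not in valid_members:
        raise ValueError(f"default {default!r} is not a valid enum member")
    cleaned = []
    rewritten = 0
    for value in values:
        if value in valid_members:
            cleaned.append(value)
        else:
            cleaned.append(default)
            rewritten += 1
    return cleaned, rewritten


RACE_MEMBERS = {"WHITE", "BLACK", "HISPANIC", "OTHER"}

# Toy input: three raw codes and one already-valid member name.
stored = ["0", "1", "BLACK", "10"]
# "WHITE" is a placeholder default chosen for the example.
cleaned, n_rewritten = coerce_to_enum_default(stored, RACE_MEMBERS, "WHITE")
print(cleaned, n_rewritten)
```

Note that a coercion like this discards the underlying information: when every stored value is a raw code, every row collapses to the same default, which is why fixing the export is preferable to sanitizing downstream.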

Fix

  1. Map cps_race to the PolicyEngine-US Race enum at the export step so race ships valid member names.
  2. Add an enum_domain gate to populace.build.gates (and wire it into the build per EPIC #28, "certified populace-us default correctness (gates not fully wired / reference rotted)"): for every exported column whose PE variable has possible_values, assert all stored values are valid enum members. This catches the race defect and any future enum-as-codes regression.
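
The two steps above could look roughly like this. It is a sketch under stated assumptions, not the populace.build.gates API: GateResult, map_codes_to_members, and the signature of enum_domain_gate are hypothetical, and PLACEHOLDER_TABLE is made up for the example and is not the real CPS race coding, which would have to come from the build's source definitions.

```python
from dataclasses import dataclass, field

RACE_MEMBERS = frozenset({"WHITE", "BLACK", "HISPANIC", "OTHER"})


def map_codes_to_members(codes, code_to_member):
    """Step 1: translate raw codes to enum member names, failing loudly
    on any code the table does not cover (no silent defaulting)."""
    unmapped = sorted({c for c in codes if c not in code_to_member})
    if unmapped:
        raise KeyError(f"no enum member mapped for codes: {unmapped}")
    return [code_to_member[c] for c in codes]


@dataclass
class GateResult:
    """Hypothetical result type: pass/fail plus offending values per column."""
    passed: bool
    failures: dict = field(default_factory=dict)


def enum_domain_gate(exported, enum_domains):
    """Step 2: every exported column with an enum domain must hold only
    valid member names."""
    failures = {}
    for name, valid in enum_domains.items():
        if name not in exported:
            continue  # not exported, so nothing to validate
        bad = {v for v in exported[name] if v not in valid}
        if bad:
            failures[name] = sorted(bad, key=str)
    return GateResult(passed=not failures, failures=failures)


# PLACEHOLDER mapping for illustration only; NOT the real CPS coding.
PLACEHOLDER_TABLE = {0: "WHITE", 1: "BLACK", 10: "OTHER"}

raw = [0, 1, 10, 1]
domains = {"race": RACE_MEMBERS}

before = enum_domain_gate({"race": raw}, domains)
after = enum_domain_gate(
    {"race": map_codes_to_members(raw, PLACEHOLDER_TABLE)}, domains
)
print(before.passed, before.failures)
print(after.passed, after.failures)
```

Raising on unmapped codes in step 1, rather than falling back to a default, keeps a new or unexpected CPS code from being silently absorbed; the gate in step 2 then acts as an independent check on whatever the export step produces.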

Relationship to existing issues

Distinct from #27 (drop raw/scratch columns — race is a legitimate PE input to keep, just with correct values) and from #24/#25 (formula-owned over-export — race is a pure input). Belongs under the EPIC #28 "gates not fully wired" theme; the enum_domain gate should run in the re-certification gate suite.
