Avoid live Hugging Face download in county FIPS loader #8650

@daphnehanse11

Problem

policyengine_us/tools/geography/county_helpers.py::load_county_fips_dataset() treats the county FIPS table as a live runtime download. When data/county_fips_2020.csv.gz is absent, it downloads county_fips_2020.csv.gz from the policyengine/policyengine-us-data Hugging Face repo.

That makes baseline household tests and county calculations depend on live network access. A fresh CI runner, fresh install, or offline environment can fail on unrelated policy tests if Hugging Face is unavailable, slow, rate-limited, or otherwise unreachable.

PR #8307 surfaced this in Full Suite - Baseline (irs-household) after merging main. The failing policy test was:

policyengine_us/tests/policy/baseline/household/demographic/geographic/county/county.yaml

Failing job from PR #8307: https://github.com/PolicyEngine/policyengine-us/actions/runs/27633741706/job/81715292464

Local reproduction

Running the baseline household batch locally reproduced the failure when the dataset download was unavailable:

PYTHONPATH=. python policyengine_us/tests/test_batched.py policyengine_us/tests/policy/baseline/household --batches 2

The first batch failed in the county FIPS YAML tests because the dataset could not be downloaded.

Preferred fix

Make county FIPS reference data a packaged, versioned resource rather than a default live download.

Suggested implementation:

  1. Store the county FIPS table inside the package, for example under policyengine_us/tools/geography/data/county_fips_2020.csv.
  2. Load it with importlib.resources, not Path("data"), so it works from wheels and installed packages.
  3. Keep the Hugging Face download only as a dev/update fallback or dataset-refresh path, not the ordinary runtime path.
  4. Update ordinary CI tests to validate the packaged resource and county mapping behavior without network access.
  5. Move any live Hugging Face download coverage behind an explicit integration/network test flag.
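Steps 1 to 3 could be sketched roughly as follows. This is a minimal illustration, not the existing loader: the function name `load_county_fips_rows`, the `fallback` parameter, and the use of the standard-library `csv` module are all assumptions made for the example, and the default package and path simply follow the example location given in step 1.

```python
import csv
import io
from importlib import resources
from typing import Callable, Optional


def load_county_fips_rows(
    package: str = "policyengine_us.tools.geography",  # from the example path in step 1
    data_dir: str = "data",
    resource_name: str = "county_fips_2020.csv",
    fallback: Optional[Callable[[], bytes]] = None,  # hypothetical opt-in refresh hook
) -> list[dict[str, str]]:
    """Return the county FIPS table as a list of row dicts.

    The packaged resource is the ordinary path. The fallback (for example a
    Hugging Face download) runs only if the caller passes one explicitly.
    """
    try:
        # importlib.resources resolves files relative to the installed package,
        # so this does not depend on the current working directory.
        resource = resources.files(package) / data_dir / resource_name
        raw = resource.read_bytes()
    except (ModuleNotFoundError, FileNotFoundError):
        if fallback is None:
            raise
        raw = fallback()

    text = raw.decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))
```

With no `fallback` argument, a missing resource raises immediately instead of reaching for the network, which keeps the default runtime path offline. The CSV would also need to be declared as package data in the build configuration so that it is included in wheels.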

Less complete alternative

CI could prefetch data/county_fips_2020.csv.gz before running the baseline household suite, but that only makes CI less flaky. It would still leave installed/runtime behavior dependent on an undeclared external network fetch.
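For comparison, the prefetch alternative might look like the sketch below, run as a CI step before the test suite. The function name `prefetch_county_fips` is hypothetical, and the sketch assumes the `huggingface_hub` package is installed in CI; the issue does not say whether the repo needs a `repo_type` argument, so none is passed here.

```python
from pathlib import Path


def prefetch_county_fips(
    target: Path = Path("data/county_fips_2020.csv.gz"),
    repo_id: str = "policyengine/policyengine-us-data",
) -> Path:
    """Download the county FIPS file once so later test runs find it locally."""
    if target.exists():
        # Already cached: no network access needed.
        return target

    # Imported lazily so the cached path works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    target.parent.mkdir(parents=True, exist_ok=True)
    # With local_dir set, the file is written to local_dir / filename.
    hf_hub_download(
        repo_id=repo_id,
        filename=target.name,
        local_dir=target.parent,
    )
    return target
```

This only moves the network dependency to an earlier CI step. An installed package that lacks the file would still attempt the download at runtime.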

A local vendored CSV copy fixed the failing shard during PR #8307 debugging, but that broader change was intentionally left out of the Tennessee property tax relief PR to keep that PR scoped.
