Add 5 reform-API regression eval cases (closes #82) by vahid-ahmadi · Pull Request #90 · PolicyEngine/policyengine-uk-chat

vahid-ahmadi · 2026-05-29T09:11:01Z

Summary

Adds 5 Test B scenarios (b6-b10) under evals/scenarios/ that catch silent reform-expressibility regressions — directly motivated by the 2026-05-28 live test where the agent could not express a basic-rate +1pp reform and produced no eval signal. Closes #82.

Path chosen: A (extend #52)

PR #52 (feat/eval-harness) is OPEN, mergeable, and structurally complete (SPEC, scenarios, runner, fixture builder, grader, first results writeup). The right home for #82's cases is the existing harness — this PR branches off feat/eval-harness and adds 5 new scenario YAMLs in the harness's existing shape.

Base branch: feat/eval-harness (rebase onto main once #52 merges).

What the cases assert

Per #82's constraint: direction + order-of-magnitude only — no exact-number pinning. Tax microsim numbers shift with dataset version and parameter year; exact pinning would create flaky tests.

ID	Reform	Direction	Magnitude band
`b6_basic_rate_plus_1pp`	basic rate 20%→21%	positive (revenue raise)	~£5-10bn (midpoint £7.5bn, ±50%)
`b7_pa_to_15k`	PA £12,570→£15,000	negative (cost)	~£15-25bn (midpoint -£20bn, ±50%)
`b8_ni_primary_threshold_plus_1k`	NI primary threshold +£1k	negative (cost)	~£2-5bn (midpoint -£3.5bn, ±60%)
`b9_child_benefit_uprate_10pct`	Child Benefit +10%	negative (cost)	~£1-2bn (midpoint -£1.3bn, ±70%)
`b10_two_band_collapse`	basic + higher → single 25%	unpinned (structural)	magnitude-only check

Each case also uses anchor.must_mention to assert the chat's prose names the reform parameters (basic rate, personal allowance, child benefit, etc.) and anchor.must_not_say to catch sign-flip regressions ("revenue rise" on a costing reform, etc.).

How it slots into the harness

The grader (evals/runner/grade.py:474) already supports expected_approx in fields_to_compare as a fixture-free alternative path — when reference.fixture is omitted, the diff uses the inline expected value. No grader changes needed; the new cases use existing field paths (budget.budgetary_impact, budget.tax_revenue_impact, budget.benefit_spending_impact) that the extractor's FIELD_LABELS already covers.

Each case pins:

dataset: enhanced_frs_2023_24
parameter_year: 2025
chat_settings.model_backend: uk_python
chat_settings.num_runs: 3

evals/README.md is updated with a "Reform-API regression suite" section explaining the two reference shapes (fixture-backed vs expected_approx-only).

What's deliberately out of scope

Not running the cases live (per Add reform-API eval cases to catch silent regressions like the 1pp basic-rate failure #82's "cheap to run" constraint and CI-integration being a separate concern).
No changes to chatbot.py or agent_tools.py (per constraint).
No grader changes — uses existing expected_approx plumbing.
No exact-number pinning — direction + order-of-magnitude only.

Test plan

Wait for Add eval harness scaffold: spec, scenarios, fixtures dir #52 to merge, then this PR rebases onto main.
After rebase, runner picks up the 5 new YAMLs automatically (no registration needed).
First eval pass produces B_results.md entries for b6-b10; assertions should pass when the agent successfully expresses each reform via the PolicyEngine UK API.
Verify YAML parses: python -c "import yaml; from pathlib import Path; [yaml.safe_load(p.read_text()) for p in Path('evals/scenarios').glob('b[6-9]_*.yaml')]" ✅ (done)
Verify b10 (the expected_approx: 0 magnitude-only case) doesn't crash _diff_scalar — the grader's expected != 0 guard handles it; diff entry shows extracted-only as intended.

🤖 Generated with Claude Code

Adds 5 Test B scenarios (b6-b10) that catch silent reform-expressibility regressions like the 2026-05-28 basic-rate +1pp failure. Each pins dataset (enhanced_frs_2023_24) and parameter_year (2025), uses expected_approx with wide tolerance_pct for direction + order-of-magnitude assertions only — no fixture files, no exact-number pinning — and uses anchor.must_mention to assert the chat's prose answer references the reform parameters by name. Branched off feat/eval-harness (PR #52) since #82 explicitly depends on the harness landing. README updated to document the two reference shapes (fixture-backed vs expected_approx-only) and the regression sub-suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-29T09:11:07Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
policyengine-uk-chat	Ready	Preview, Comment	May 29, 2026 9:11am

github-actions · 2026-05-29T09:11:32Z

Beta preview is ready.

Frontend: open preview
Backend: open backend

anth-volk · 2026-06-16T14:03:50Z

@vahid-ahmadi — before this goes further, two things I'd like to understand.

1. What is this actually validating, and does the drift filter undercut it?

Help me with the intent here. The Test B mechanism these extend grades the chat's output against locally-computed policyengine_uk reference numbers, and build_fixtures.py drops any field whose local value diverges more than 10% from the published reference (the drift_report.md rationale being "baseline has moved since publication"). My worry: if we drop the fields where divergence is highest, aren't we removing exactly the cases most likely to surface a real problem — and keeping, by construction, a fixture set that already agrees? After filtering, what is the suite meant to catch: agent reform-expressibility, engine fidelity vs the API/published analyses, or regression over time? I want to be sure the kept set still tests something meaningful rather than passing tautologically.

(For b6–b10 specifically, I see you sidestep this by using expected_approx direction-and-magnitude bands instead of fixtures — which avoids the drift issue, but also means these check only sign and rough scale, not numeric parity. Can you confirm that's the intended scope? It changes what failure they'd actually catch.)

2. Are these still wanted? If so, they'll need to be redone.

The ground has shifted under this PR:

Its base is feat/eval-harness (PR Add eval harness scaffold: spec, scenarios, fixtures dir #52), which is now closed — so the framework this builds on has no path to main.
The eval system that actually landed on main is the typed backend/evaluation/ harness (evals/cases/), not the evals/runner/ + evals/scenarios/ framework this targets.
It also still pins model_backend: uk_python from the closed backend-selector stack (Add model backend selector and Claude Agent SDK runner #51).

So as-is it can't merge. If you still want this coverage, it needs redoing against the merged harness — ported into evals/cases/ in the current case schema (or a new numeric/answer suite), rather than added to the unmerged scenarios framework. If you don't think it's still needed, let's close it.

Happy to do the port myself once you confirm (a) the cases are still wanted and (b) the intent in point 1.

vercel Bot deployed to Preview May 29, 2026 09:11 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add 5 reform-API regression eval cases (closes #82)#90

Add 5 reform-API regression eval cases (closes #82)#90
vahid-ahmadi wants to merge 1 commit into
feat/eval-harnessfrom
feat/reform-api-eval-cases

vahid-ahmadi commented May 29, 2026

Uh oh!

vercel Bot commented May 29, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

anth-volk commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

vahid-ahmadi commented May 29, 2026

Summary

Path chosen: A (extend #52)

What the cases assert

How it slots into the harness

What's deliberately out of scope

Test plan

Uh oh!

vercel Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

anth-volk commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel Bot commented May 29, 2026 •

edited

Loading