Skip to content

Add 5 reform-API regression eval cases (closes #82)#90

Open
vahid-ahmadi wants to merge 1 commit into
feat/eval-harnessfrom
feat/reform-api-eval-cases
Open

Add 5 reform-API regression eval cases (closes #82)#90
vahid-ahmadi wants to merge 1 commit into
feat/eval-harnessfrom
feat/reform-api-eval-cases

Conversation

@vahid-ahmadi

Copy link
Copy Markdown
Collaborator

Summary

Adds 5 Test B scenarios (b6-b10) under evals/scenarios/ that catch silent reform-expressibility regressions — directly motivated by the 2026-05-28 live test where the agent could not express a basic-rate +1pp reform and produced no eval signal. Closes #82.

Path chosen: A (extend #52)

PR #52 (feat/eval-harness) is OPEN, mergeable, and structurally complete (SPEC, scenarios, runner, fixture builder, grader, first results writeup). The right home for #82's cases is the existing harness — this PR branches off feat/eval-harness and adds 5 new scenario YAMLs in the harness's existing shape.

Base branch: feat/eval-harness (rebase onto main once #52 merges).

What the cases assert

Per #82's constraint: direction + order-of-magnitude only — no exact-number pinning. Tax microsim numbers shift with dataset version and parameter year; exact pinning would create flaky tests.

ID Reform Direction Magnitude band
b6_basic_rate_plus_1pp basic rate 20%→21% positive (revenue raise) ~£5-10bn (midpoint £7.5bn, ±50%)
b7_pa_to_15k PA £12,570→£15,000 negative (cost) ~£15-25bn (midpoint -£20bn, ±50%)
b8_ni_primary_threshold_plus_1k NI primary threshold +£1k negative (cost) ~£2-5bn (midpoint -£3.5bn, ±60%)
b9_child_benefit_uprate_10pct Child Benefit +10% negative (cost) ~£1-2bn (midpoint -£1.3bn, ±70%)
b10_two_band_collapse basic + higher → single 25% unpinned (structural) magnitude-only check

Each case also uses anchor.must_mention to assert the chat's prose names the reform parameters (basic rate, personal allowance, child benefit, etc.) and anchor.must_not_say to catch sign-flip regressions ("revenue rise" on a costing reform, etc.).

How it slots into the harness

The grader (evals/runner/grade.py:474) already supports expected_approx in fields_to_compare as a fixture-free alternative path — when reference.fixture is omitted, the diff uses the inline expected value. No grader changes needed; the new cases use existing field paths (budget.budgetary_impact, budget.tax_revenue_impact, budget.benefit_spending_impact) that the extractor's FIELD_LABELS already covers.

Each case pins:

  • dataset: enhanced_frs_2023_24
  • parameter_year: 2025
  • chat_settings.model_backend: uk_python
  • chat_settings.num_runs: 3

evals/README.md is updated with a "Reform-API regression suite" section explaining the two reference shapes (fixture-backed vs expected_approx-only).

What's deliberately out of scope

Test plan

  • Wait for Add eval harness scaffold: spec, scenarios, fixtures dir #52 to merge, then this PR rebases onto main.
  • After rebase, runner picks up the 5 new YAMLs automatically (no registration needed).
  • First eval pass produces B_results.md entries for b6-b10; assertions should pass when the agent successfully expresses each reform via the PolicyEngine UK API.
  • Verify YAML parses: python -c "import yaml; from pathlib import Path; [yaml.safe_load(p.read_text()) for p in Path('evals/scenarios').glob('b[6-9]_*.yaml')]" ✅ (done)
  • Verify b10 (the expected_approx: 0 magnitude-only case) doesn't crash _diff_scalar — the grader's expected != 0 guard handles it; diff entry shows extracted-only as intended.

🤖 Generated with Claude Code

Adds 5 Test B scenarios (b6-b10) that catch silent reform-expressibility
regressions like the 2026-05-28 basic-rate +1pp failure. Each pins dataset
(enhanced_frs_2023_24) and parameter_year (2025), uses expected_approx with
wide tolerance_pct for direction + order-of-magnitude assertions only — no
fixture files, no exact-number pinning — and uses anchor.must_mention to
assert the chat's prose answer references the reform parameters by name.

Branched off feat/eval-harness (PR #52) since #82 explicitly depends on the
harness landing. README updated to document the two reference shapes
(fixture-backed vs expected_approx-only) and the regression sub-suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 29, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
policyengine-uk-chat Ready Ready Preview, Comment May 29, 2026 9:11am

Request Review

@github-actions

Copy link
Copy Markdown

Beta preview is ready.

@anth-volk

Copy link
Copy Markdown
Contributor

@vahid-ahmadi — before this goes further, two things I'd like to understand.

1. What is this actually validating, and does the drift filter undercut it?

Help me with the intent here. The Test B mechanism these extend grades the chat's output against locally-computed policyengine_uk reference numbers, and build_fixtures.py drops any field whose local value diverges more than 10% from the published reference (the drift_report.md rationale being "baseline has moved since publication"). My worry: if we drop the fields where divergence is highest, aren't we removing exactly the cases most likely to surface a real problem — and keeping, by construction, a fixture set that already agrees? After filtering, what is the suite meant to catch: agent reform-expressibility, engine fidelity vs the API/published analyses, or regression over time? I want to be sure the kept set still tests something meaningful rather than passing tautologically.

(For b6b10 specifically, I see you sidestep this by using expected_approx direction-and-magnitude bands instead of fixtures — which avoids the drift issue, but also means these check only sign and rough scale, not numeric parity. Can you confirm that's the intended scope? It changes what failure they'd actually catch.)

2. Are these still wanted? If so, they'll need to be redone.

The ground has shifted under this PR:

So as-is it can't merge. If you still want this coverage, it needs redoing against the merged harness — ported into evals/cases/ in the current case schema (or a new numeric/answer suite), rather than added to the unmerged scenarios framework. If you don't think it's still needed, let's close it.

Happy to do the port myself once you confirm (a) the cases are still wanted and (b) the intent in point 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants