Motivation
A core claim of investigating-bugs is "make the smallest targeted fix; don't bundle
refactoring." That is a scope property, and llm_judge rubrics grade it unreliably. eval-magic
already materializes the fixture in env/, so it can diff the before/after and expose objective
scope metrics.
In the timezone eval, the correct source fix is small and single-locus; a per-consumer nudge
spree (or a symptom-patch-plus-cleanup) is larger and touches more files. A diff metric makes that
difference measurable instead of judged.
Proposed surface
A new assertion type plus benchmark reporting:
{
  "id": "minimal_fix",
  "type": "diff_scope",
  "max_files_touched": 1,
  "max_lines_changed": 8
}
Also always expose the raw metrics per run in benchmark.json, even when no threshold is set, e.g.:
"diff_scope": { "files_touched": 1, "lines_added": 3, "lines_removed": 1, "hunks": 1 }
Optionally make the metrics available as context to an llm_judge rubric (so a judge can weigh
"is this the smallest correct fix" with the numbers in hand).
Semantics
- The diff is computed over the case's files set plus any files the agent created in env/.
- Thresholds are an opt-in assertion; the default behavior is to report the metric, not gate
on it.
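A minimal sketch of how these metrics could be computed, assuming the before and after states of env/ are available as path-to-text mappings. The function name diff_scope and the dict-based snapshots are hypothetical, not existing eval-magic API; it uses Python's difflib with the default three lines of context, so nearby edits merge into one hunk. Counting deleted files as touched is also an assumption, since the semantics above only mention the files set and created files.

```python
import difflib


def diff_scope(before: dict[str, str], after: dict[str, str]) -> dict[str, int]:
    """Compute scope metrics between two snapshots (hypothetical helper).

    `before` and `after` map a relative path to the file's text; a missing
    key means the file does not exist in that snapshot.
    """
    metrics = {"files_touched": 0, "lines_added": 0, "lines_removed": 0, "hunks": 0}
    for path in sorted(set(before) | set(after)):
        old = before.get(path, "").splitlines()
        new = after.get(path, "").splitlines()
        diff = list(difflib.unified_diff(old, new, lineterm=""))
        # A created or deleted empty file yields no diff lines but still counts.
        existence_changed = (path in before) != (path in after)
        if not diff and not existence_changed:
            continue
        metrics["files_touched"] += 1
        # The first two lines are the '---' / '+++' file headers; skip them so
        # content lines that happen to start with '++' or '--' are not miscounted.
        for line in diff[2:]:
            if line.startswith("@@"):
                metrics["hunks"] += 1
            elif line.startswith("+"):
                metrics["lines_added"] += 1
            elif line.startswith("-"):
                metrics["lines_removed"] += 1
    return metrics


if __name__ == "__main__":
    before = {"tz.py": "x\ny\nz\n", "other.py": "same\n"}
    after = {"tz.py": "x\ny1\ny2\ny3\nz\n", "other.py": "same\n"}
    print(diff_scope(before, after))
```

With these two snapshots, one line in tz.py is replaced by three and other.py is unchanged, so the result has the same shape and values as the benchmark.json example above: one file touched, three lines added, one removed, one hunk.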
Acceptance criteria
- benchmark.json includes diff-scope metrics per condition.
- An optional diff_scope assertion can PASS/FAIL on max_files_touched / max_lines_changed.
- Documentation states plainly that "smallest" ≠ "best" — hard gates are a secondary signal,
to be combined with a correctness assertion, never used alone.
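The PASS/FAIL criterion could be evaluated roughly as follows. This is a sketch under stated assumptions: check_diff_scope is a hypothetical name, max_lines_changed is taken to mean lines added plus lines removed (the proposal does not define it), and a threshold that is absent from the assertion is simply not checked.

```python
def check_diff_scope(assertion: dict, metrics: dict) -> tuple[bool, list[str]]:
    """Evaluate a diff_scope assertion against reported metrics (hypothetical helper).

    Returns (passed, reasons); `reasons` lists each exceeded threshold.
    """
    reasons = []

    max_files = assertion.get("max_files_touched")
    if max_files is not None and metrics["files_touched"] > max_files:
        reasons.append(f"files_touched {metrics['files_touched']} > {max_files}")

    # Assumption: "lines changed" is the sum of added and removed lines.
    lines_changed = metrics["lines_added"] + metrics["lines_removed"]
    max_lines = assertion.get("max_lines_changed")
    if max_lines is not None and lines_changed > max_lines:
        reasons.append(f"lines_changed {lines_changed} > {max_lines}")

    return (not reasons, reasons)


if __name__ == "__main__":
    assertion = {"id": "minimal_fix", "type": "diff_scope",
                 "max_files_touched": 1, "max_lines_changed": 8}
    small_fix = {"files_touched": 1, "lines_added": 3, "lines_removed": 1, "hunks": 1}
    print(check_diff_scope(assertion, small_fix))
```

Returning the reasons alongside the verdict keeps the gate explainable in benchmark.json, which matters when it is read next to a correctness assertion rather than on its own.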
Back-compat
Additive. No change to existing suites unless they add the assertion.