feat: diff_scope assertion — measure fix scope (files/lines changed) #116

Description

@slowdini

Motivation

A core claim of investigating-bugs is "make the smallest targeted fix; don't bundle
refactoring." That is a scope property, and llm_judge rubrics grade it unreliably. eval-magic
already materializes the fixture in env/, so it can diff the before/after and expose objective
scope metrics.

In the timezone eval, the correct source fix is small and single-locus; a per-consumer nudge
spree (or a symptom-patch-plus-cleanup) is larger and touches more files. A diff metric makes that
difference measurable instead of judged.

Proposed surface

A new assertion type plus benchmark reporting:

{
  "id": "minimal_fix",
  "type": "diff_scope",
  "max_files_touched": 1,
  "max_lines_changed": 8
}

And always expose the raw metrics per run in benchmark.json (even with no threshold), e.g.:

"diff_scope": { "files_touched": 1, "lines_added": 3, "lines_removed": 1, "hunks": 1 }

Optionally make the metrics available as context to an llm_judge rubric (so a judge can weigh
"is this the smallest correct fix" with the numbers in hand).

Semantics

  • Diff is computed over the case's files set plus any files the agent created in env/.
  • Thresholds are an opt-in assertion; the default behavior is to report the metric, not gate
    on it.
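
A minimal sketch of how these metrics could be derived, assuming the before and after states of env/ are available as in-memory maps from relative path to file text. The function name compute_diff_scope, the snapshot shape, and the tz.py path are hypothetical and not part of eval-magic. Hunks are counted here as contiguous changed regions with zero context lines, which is an assumption; a git-style diff with context lines may merge nearby changes and report fewer hunks.

```python
import difflib
from typing import Dict, Mapping


def compute_diff_scope(before: Mapping[str, str], after: Mapping[str, str]) -> Dict[str, int]:
    """Summarize how much changed between two snapshots of a fixture.

    `before` and `after` map a relative path to that file's text (hypothetical
    shape). A path present on only one side counts as created or deleted.
    """
    metrics = {"files_touched": 0, "lines_added": 0, "lines_removed": 0, "hunks": 0}
    for path in sorted(set(before) | set(after)):
        if before.get(path) == after.get(path):
            continue  # unchanged file: contributes nothing
        metrics["files_touched"] += 1
        old_lines = before.get(path, "").splitlines()
        new_lines = after.get(path, "").splitlines()
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            # Each non-equal opcode is one contiguous change (a zero-context hunk).
            metrics["hunks"] += 1
            metrics["lines_removed"] += i2 - i1
            metrics["lines_added"] += j2 - j1
    return metrics


# One line replaced by three in a single file.
before = {"tz.py": "line1\nline2\nline3\n"}
after = {"tz.py": "line1\nnew_a\nnew_b\nnew_c\nline3\n"}
print(compute_diff_scope(before, after))
# → {'files_touched': 1, 'lines_added': 3, 'lines_removed': 1, 'hunks': 1}
```

Taking the union of paths from both snapshots is what lets files the agent created in env/ show up in files_touched alongside the case's own files.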

Acceptance criteria

  • benchmark.json includes diff-scope metrics per condition.
  • An optional diff_scope assertion can PASS/FAIL on max_files_touched / max_lines_changed.
  • Documentation states plainly that "smallest" ≠ "best" — hard gates are a secondary signal,
    to be combined with a correctness assertion, never used alone.
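
The threshold check in the second criterion could look roughly like the sketch below. It assumes that "lines changed" means lines_added plus lines_removed, which this issue does not define, and that an omitted threshold means that dimension is not gated. The function name check_diff_scope and the shape of the returned result are hypothetical.

```python
from typing import Any, Dict, List, Mapping


def check_diff_scope(assertion: Mapping[str, Any], metrics: Mapping[str, int]) -> Dict[str, Any]:
    """Evaluate a diff_scope assertion against reported metrics.

    Assumption: lines changed = lines_added + lines_removed.
    A threshold that is absent from the assertion is not enforced.
    """
    reasons: List[str] = []

    max_files = assertion.get("max_files_touched")
    if max_files is not None and metrics["files_touched"] > max_files:
        reasons.append(f"files_touched {metrics['files_touched']} > {max_files}")

    lines_changed = metrics["lines_added"] + metrics["lines_removed"]
    max_lines = assertion.get("max_lines_changed")
    if max_lines is not None and lines_changed > max_lines:
        reasons.append(f"lines_changed {lines_changed} > {max_lines}")

    return {"id": assertion.get("id"), "passed": not reasons, "reasons": reasons}


# The assertion and metrics shown under "Proposed surface".
assertion = {"id": "minimal_fix", "type": "diff_scope",
             "max_files_touched": 1, "max_lines_changed": 8}
metrics = {"files_touched": 1, "lines_added": 3, "lines_removed": 1, "hunks": 1}
print(check_diff_scope(assertion, metrics)["passed"])  # → True
```

A pass here says only that the change was small, so a suite would still pair it with a correctness assertion, as the last criterion requires.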

Back-compat

Additive. No change to existing suites unless they add the assertion.
