feat: diff_scope assertion - measure fix scope (files/lines changed) #116

## Motivation

A core claim of `investigating-bugs` is "make the **smallest** targeted fix; don't bundle
refactoring." That is a *scope* property, and `llm_judge` rubrics grade it unreliably. eval-magic
already materializes the fixture in `env/`, so it can diff the fixture before and after the
agent's run and expose objective scope metrics.

In the timezone eval, the correct source fix is small and single-locus; a per-consumer nudge
spree (or a symptom-patch-plus-cleanup) is larger and touches more files. A diff metric makes that
difference measurable instead of judged.

## Proposed surface

A new assertion type plus benchmark reporting:

```json
{
  "id": "minimal_fix",
  "type": "diff_scope",
  "max_files_touched": 1,
  "max_lines_changed": 8
}
```

And always expose the raw metrics per run in `benchmark.json` (even with no threshold), e.g.:

```json
"diff_scope": { "files_touched": 1, "lines_added": 3, "lines_removed": 1, "hunks": 1 }
```

Optionally make the metrics available as context to an `llm_judge` rubric (so a judge can weigh
"is this the smallest *correct* fix" with the numbers in hand).

## Semantics

- The diff is computed over the case's `files` set plus any files the agent created in `env/`.
- Thresholds are an **opt-in** assertion; the default behavior is to *report* the metric, not gate
  on it.
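
As a rough sketch of how the metrics might be computed, the function below compares two in-memory snapshots with Python's standard `difflib`. Everything here is an assumption for illustration: the name `diff_scope`, the `{relative_path: text}` snapshot shape, and the choice to count a hunk as one contiguous change with no context lines. `before` would hold the case's `files` set; `after` would hold the same paths plus anything the agent created.

```python
import difflib
from itertools import islice


def diff_scope(before: dict[str, str], after: dict[str, str]) -> dict[str, int]:
    """Count files, lines, and hunks that differ between two snapshots.

    Hypothetical helper: snapshots map a relative path to the file's text.
    """
    metrics = {"files_touched": 0, "lines_added": 0, "lines_removed": 0, "hunks": 0}
    # The union of paths covers edited, deleted, and agent-created files.
    for path in sorted(set(before) | set(after)):
        old = before.get(path, "").splitlines()
        new = after.get(path, "").splitlines()
        if old == new:
            continue
        metrics["files_touched"] += 1
        # n=0 drops context lines, so each hunk is one contiguous change.
        diff = difflib.unified_diff(old, new, lineterm="", n=0)
        # Skip the two "---" / "+++" file header lines before counting.
        for line in islice(diff, 2, None):
            if line.startswith("@@"):
                metrics["hunks"] += 1
            elif line.startswith("+"):
                metrics["lines_added"] += 1
            elif line.startswith("-"):
                metrics["lines_removed"] += 1
    return metrics


before = {"tz.py": "a\nb\nc\n"}
after = {"tz.py": "a\nB\nc\n", "notes.md": "x\n"}
print(diff_scope(before, after))
```

Because the function only reports numbers, it matches the default behavior described above; gating stays a separate, opt-in step.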

## Acceptance criteria

- `benchmark.json` includes diff-scope metrics per condition.
- An optional `diff_scope` assertion can PASS/FAIL on `max_files_touched` / `max_lines_changed`.
- Documentation states plainly that **"smallest" ≠ "best"** — hard gates are a secondary signal,
  to be combined with a correctness assertion, never used alone.
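
The PASS/FAIL criterion could be evaluated along these lines. This is a sketch under stated assumptions: `check_diff_scope` is a hypothetical name, the metrics dict mirrors the `benchmark.json` example, and "lines changed" is taken to mean added plus removed, which the proposal does not define.

```python
def check_diff_scope(assertion: dict, metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons). Thresholds missing from the assertion are not enforced."""
    # Assumption: max_lines_changed is compared against added + removed.
    observed = {
        "max_files_touched": metrics["files_touched"],
        "max_lines_changed": metrics["lines_added"] + metrics["lines_removed"],
    }
    reasons = [
        f"{key}: {value} > {assertion[key]}"
        for key, value in observed.items()
        if key in assertion and value > assertion[key]
    ]
    return (not reasons, reasons)


assertion = {"id": "minimal_fix", "type": "diff_scope",
             "max_files_touched": 1, "max_lines_changed": 8}
small_fix = {"files_touched": 1, "lines_added": 3, "lines_removed": 1, "hunks": 1}
wide_fix = {"files_touched": 3, "lines_added": 9, "lines_removed": 2, "hunks": 4}

print(check_diff_scope(assertion, small_fix))
print(check_diff_scope(assertion, wide_fix))
```

In line with the documentation criterion, a suite would combine this result with a correctness assertion, for example by passing a run only when both hold.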

## Back-compat

Additive. No change to existing suites unless they add the assertion.

