feat: multi-turn / scripted-clarification for under-specified reports

## Motivation

A debugging skill's largest real-world value shows up when the **bug report is missing detail** —
the disciplined move is to reproduce / ask before fixing. Today every case is a single
prompt → single dispatch → single response, so we can only test fully-specified reports, which is
the easy case. Under-specified reports are exactly where an undisciplined agent guesses wrong and
a disciplined one asks first — but we can't stage that.

Driver: the `investigating-bugs` timezone eval would be sharper if the report could be vague
("the date is wrong for some users") and the missing detail ("they're all in US timezones; it's a
date-only field") were delivered only if the agent asks for it.

## Proposed surface

Let a case define follow-up turns or canned answers:

```json
{
  "prompt": "The due date is wrong for some users. Fix it.",
  "clarification": {
    "answers": "The affected users are all in US timezones; the field is a date-only value (no time).",
    "deliver_when": "agent_asks"
  }
}
```

- `deliver_when: "agent_asks"` — the canned `answers` are fed as a second user turn **only if** the
  agent posed a clarifying question first. Reward asking-before-fixing.
- `deliver_when: "always"` — fixed follow-up turn(s) delivered regardless (for staging a known
  back-and-forth).
- Generalize to an ordered `turns` list for multi-step scripts if useful.

## Semantics

- Requires a second dispatch turn feeding the agent the follow-up, then grading the final state.
- A `transcript_check`/`llm_judge` can assert the clarifying question preceded any edit (i.e. the
  agent asked before fixing), which is the behavior under test.

## Acceptance criteria

- A case can deliver a scripted follow-up turn after the agent's first response.
- `agent_asks` mode delivers the answer only when the agent actually asked, and this is
  observable to assertions.
- Single-turn cases are unaffected when `clarification`/`turns` is absent.

## Back-compat

Additive; absence preserves today's one-shot behavior. Document the extra dispatch cost per case.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: multi-turn / scripted-clarification for under-specified reports #117

Motivation

Proposed surface

Semantics

Acceptance criteria

Back-compat

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat: multi-turn / scripted-clarification for under-specified reports #117

Description

Motivation

Proposed surface

Semantics

Acceptance criteria

Back-compat

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions