Motivation
A debugging skill's largest real-world value shows up when the bug report is missing detail —
the disciplined move is to reproduce / ask before fixing. Today every case is a single
prompt → single dispatch → single response, so we can only test fully-specified reports, which is
the easy case. Under-specified reports are exactly where an undisciplined agent guesses wrong and
a disciplined one asks first — but we can't stage that.
Driver: the investigating-bugs timezone eval would be sharper if the report could be vague
("the date is wrong for some users") and the missing detail ("they're all in US timezones; it's a
date-only field") were delivered only if the agent asks for it.
Proposed surface
Let a case define follow-up turns or canned answers:
{
"prompt": "The due date is wrong for some users. Fix it.",
"clarification": {
"answers": "The affected users are all in US timezones; the field is a date-only value (no time).",
"deliver_when": "agent_asks"
}
}
deliver_when: "agent_asks" — the canned answers are fed as a second user turn only if the
agent posed a clarifying question first. Reward asking-before-fixing.
deliver_when: "always" — fixed follow-up turn(s) delivered regardless (for staging a known
back-and-forth).
- Generalize to an ordered
turns list for multi-step scripts if useful.
Semantics
- Requires a second dispatch turn feeding the agent the follow-up, then grading the final state.
- A
transcript_check/llm_judge can assert the clarifying question preceded any edit (i.e. the
agent asked before fixing), which is the behavior under test.
Acceptance criteria
- A case can deliver a scripted follow-up turn after the agent's first response.
agent_asks mode delivers the answer only when the agent actually asked, and this is
observable to assertions.
- Single-turn cases are unaffected when
clarification/turns is absent.
Back-compat
Additive; absence preserves today's one-shot behavior. Document the extra dispatch cost per case.
Motivation
A debugging skill's largest real-world value shows up when the bug report is missing detail —
the disciplined move is to reproduce / ask before fixing. Today every case is a single
prompt → single dispatch → single response, so we can only test fully-specified reports, which is
the easy case. Under-specified reports are exactly where an undisciplined agent guesses wrong and
a disciplined one asks first — but we can't stage that.
Driver: the
investigating-bugstimezone eval would be sharper if the report could be vague("the date is wrong for some users") and the missing detail ("they're all in US timezones; it's a
date-only field") were delivered only if the agent asks for it.
Proposed surface
Let a case define follow-up turns or canned answers:
{ "prompt": "The due date is wrong for some users. Fix it.", "clarification": { "answers": "The affected users are all in US timezones; the field is a date-only value (no time).", "deliver_when": "agent_asks" } }deliver_when: "agent_asks"— the cannedanswersare fed as a second user turn only if theagent posed a clarifying question first. Reward asking-before-fixing.
deliver_when: "always"— fixed follow-up turn(s) delivered regardless (for staging a knownback-and-forth).
turnslist for multi-step scripts if useful.Semantics
transcript_check/llm_judgecan assert the clarifying question preceded any edit (i.e. theagent asked before fixing), which is the behavior under test.
Acceptance criteria
agent_asksmode delivers the answer only when the agent actually asked, and this isobservable to assertions.
clarification/turnsis absent.Back-compat
Additive; absence preserves today's one-shot behavior. Document the extra dispatch cost per case.