Motivation
Some skills — investigating-bugs is the type case — are quality-gradient: the agent already
produces a result that passes the reported repro, and the skill's value is that the fix is also
correct on paths the report never mentioned. The only rigorous way to grade that is to run a
held-out test (one the agent never saw) against the agent's final fixture state.
Today this is approximated with an llm_judge rubric reasoning about un-run code, which is the
dominant source of flaky/low-confidence gradings on these skills. We want an objective,
deterministic decoy-catcher.
Concrete driver: the new investigating-bugs timezone eval ships a held-out
fixtures/tz-date-only/holdout/tz-matrix.holdout.ts. A "+1 day" or "force local parse" decoy
passes the reported repro but fails the held-out test; only the calendar-date source fix passes.
Proposed surface
A new assertion type that runs a command against the post-agent fixture and grades on exit code
(and optional stdout match):
{
"id": "all_consumers_correct",
"type": "command_check",
"setup_files": ["holdout/tz-matrix.holdout.ts"],
"command": "bun test ./holdout/tz-matrix.holdout.ts",
"expect_exit_code": 0,
"expect_stdout": "(optional regex)"
}
setup_files (optional): files copied into env/ after the agent's final turn and
never exposed to the agent (they must not appear in the case's files). This is what keeps
the test "held out".
command: run in env/ after the agent finishes.
expect_exit_code (default 0) and/or expect_stdout regex → PASS/FAIL.
Semantics
- Runs as the runner's own post-agent step, so it composes with
--guard (the guard governs the
agent's turn; the check is the runner executing a trusted command).
- Counts as a normal assertion in pass-rate aggregation.
- Validation should reject a
command_check whose setup_files overlap the case's files
(that would leak the held-out test to the agent).
Acceptance criteria
- A case can assert on a held-out command's exit code and stdout.
setup_files are present for the command run but absent from the agent's workspace.
eval-magic validate errors on setup_files ∩ files ≠ ∅, and warns (not errors) on
command_check when run by an older binary / a harness that can't execute commands.
- Works under
--guard.
Back-compat
New assertion type; existing suites unaffected. Document the harness requirement (must be able to
execute a shell command in env/ post-run).
Motivation
Some skills —
investigating-bugsis the type case — are quality-gradient: the agent alreadyproduces a result that passes the reported repro, and the skill's value is that the fix is also
correct on paths the report never mentioned. The only rigorous way to grade that is to run a
held-out test (one the agent never saw) against the agent's final fixture state.
Today this is approximated with an
llm_judgerubric reasoning about un-run code, which is thedominant source of flaky/low-confidence gradings on these skills. We want an objective,
deterministic decoy-catcher.
Concrete driver: the new
investigating-bugstimezone eval ships a held-outfixtures/tz-date-only/holdout/tz-matrix.holdout.ts. A "+1 day" or "force local parse" decoypasses the reported repro but fails the held-out test; only the calendar-date source fix passes.
Proposed surface
A new assertion type that runs a command against the post-agent fixture and grades on exit code
(and optional stdout match):
{ "id": "all_consumers_correct", "type": "command_check", "setup_files": ["holdout/tz-matrix.holdout.ts"], "command": "bun test ./holdout/tz-matrix.holdout.ts", "expect_exit_code": 0, "expect_stdout": "(optional regex)" }setup_files(optional): files copied intoenv/after the agent's final turn andnever exposed to the agent (they must not appear in the case's
files). This is what keepsthe test "held out".
command: run inenv/after the agent finishes.expect_exit_code(default0) and/orexpect_stdoutregex → PASS/FAIL.Semantics
--guard(the guard governs theagent's turn; the check is the runner executing a trusted command).
command_checkwhosesetup_filesoverlap the case'sfiles(that would leak the held-out test to the agent).
Acceptance criteria
setup_filesare present for the command run but absent from the agent's workspace.eval-magic validateerrors onsetup_files∩files ≠ ∅, and warns (not errors) oncommand_checkwhen run by an older binary / a harness that can't execute commands.--guard.Back-compat
New assertion type; existing suites unaffected. Document the harness requirement (must be able to
execute a shell command in
env/post-run).