Skip to content

feat: command_check — run a held-out test against the agent's final fixture state #114

Description

@slowdini

Motivation

Some skills — investigating-bugs is the type case — are quality-gradient: the agent already
produces a result that passes the reported repro, and the skill's value is that the fix is also
correct on paths the report never mentioned. The only rigorous way to grade that is to run a
held-out test (one the agent never saw) against the agent's final fixture state.

Today this is approximated with an llm_judge rubric reasoning about un-run code, which is the
dominant source of flaky/low-confidence gradings on these skills. We want an objective,
deterministic decoy-catcher.

Concrete driver: the new investigating-bugs timezone eval ships a held-out
fixtures/tz-date-only/holdout/tz-matrix.holdout.ts. A "+1 day" or "force local parse" decoy
passes the reported repro but fails the held-out test; only the calendar-date source fix passes.

Proposed surface

A new assertion type that runs a command against the post-agent fixture and grades on exit code
(and optional stdout match):

{
  "id": "all_consumers_correct",
  "type": "command_check",
  "setup_files": ["holdout/tz-matrix.holdout.ts"],
  "command": "bun test ./holdout/tz-matrix.holdout.ts",
  "expect_exit_code": 0,
  "expect_stdout": "(optional regex)"
}
  • setup_files (optional): files copied into env/ after the agent's final turn and
    never exposed to the agent (they must not appear in the case's files). This is what keeps
    the test "held out".
  • command: run in env/ after the agent finishes.
  • expect_exit_code (default 0) and/or expect_stdout regex → PASS/FAIL.

Semantics

  • Runs as the runner's own post-agent step, so it composes with --guard (the guard governs the
    agent's turn; the check is the runner executing a trusted command).
  • Counts as a normal assertion in pass-rate aggregation.
  • Validation should reject a command_check whose setup_files overlap the case's files
    (that would leak the held-out test to the agent).

Acceptance criteria

  • A case can assert on a held-out command's exit code and stdout.
  • setup_files are present for the command run but absent from the agent's workspace.
  • eval-magic validate errors on setup_filesfiles ≠ ∅, and warns (not errors) on
    command_check when run by an older binary / a harness that can't execute commands.
  • Works under --guard.

Back-compat

New assertion type; existing suites unaffected. Document the harness requirement (must be able to
execute a shell command in env/ post-run).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions