feat: command_check — run a held-out test against the agent's final fixture state

## Motivation

Some skills — `investigating-bugs` is the type case — are *quality-gradient*: the agent already
produces a result that passes the reported repro, and the skill's value is that the fix is also
correct on paths the report never mentioned. The only rigorous way to grade that is to run a
**held-out** test (one the agent never saw) against the agent's **final** fixture state.

Today this is approximated with an `llm_judge` rubric reasoning about un-run code, which is the
dominant source of flaky/low-confidence gradings on these skills. We want an objective,
deterministic decoy-catcher.

Concrete driver: the new `investigating-bugs` timezone eval ships a held-out
`fixtures/tz-date-only/holdout/tz-matrix.holdout.ts`. A "+1 day" or "force local parse" decoy
passes the reported repro but fails the held-out test; only the calendar-date source fix passes.

## Proposed surface

A new assertion type that runs a command against the post-agent fixture and grades on exit code
(and optional stdout match):

```json
{
  "id": "all_consumers_correct",
  "type": "command_check",
  "setup_files": ["holdout/tz-matrix.holdout.ts"],
  "command": "bun test ./holdout/tz-matrix.holdout.ts",
  "expect_exit_code": 0,
  "expect_stdout": "(optional regex)"
}
```

- `setup_files` (optional): files copied into `env/` **after** the agent's final turn and
  **never** exposed to the agent (they must not appear in the case's `files`). This is what keeps
  the test "held out".
- `command`: run in `env/` after the agent finishes.
- `expect_exit_code` (default `0`) and/or `expect_stdout` regex → PASS/FAIL.

## Semantics

- Runs as the runner's own post-agent step, so it composes with `--guard` (the guard governs the
  agent's turn; the check is the runner executing a trusted command).
- Counts as a normal assertion in pass-rate aggregation.
- Validation should reject a `command_check` whose `setup_files` overlap the case's `files`
  (that would leak the held-out test to the agent).

## Acceptance criteria

- A case can assert on a held-out command's exit code and stdout.
- `setup_files` are present for the command run but absent from the agent's workspace.
- `eval-magic validate` errors on `setup_files` ∩ `files ≠ ∅`, and warns (not errors) on
  `command_check` when run by an older binary / a harness that can't execute commands.
- Works under `--guard`.

## Back-compat

New assertion type; existing suites unaffected. Document the harness requirement (must be able to
execute a shell command in `env/` post-run).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: command_check — run a held-out test against the agent's final fixture state #114

Motivation

Proposed surface

Semantics

Acceptance criteria

Back-compat

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

feat: command_check — run a held-out test against the agent's final fixture state #114

Description

Motivation

Proposed surface

Semantics

Acceptance criteria

Back-compat

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions