diff --git a/.gitignore b/.gitignore
index 0592392..24b97b6 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,6 @@
 /target
 .DS_Store
+
+# eval-magic run artifacts (workspace root + per-env outputs) — churn every run
+.eval-magic/
+.eval-magic-outputs/
diff --git a/Cargo.lock b/Cargo.lock
index 31f8217..dcc9b44 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -275,7 +275,7 @@ dependencies = [
 [[package]]
 name = "eval-magic"
-version = "0.3.4"
+version = "0.4.0"
 dependencies = [
  "anyhow",
  "assert_cmd",
diff --git a/Cargo.toml b/Cargo.toml
index 2b2ed8a..8c0b826 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "eval-magic"
-version = "0.3.4"
+version = "0.4.0"
 edition = "2024"
 description = "One-stop CLI for running skill evals — measure whether an agent skill actually shifts behavior."
 license = "MIT"
diff --git a/README.md b/README.md
index 4edb9d2..4d8ec07 100644
--- a/README.md
+++ b/README.md
@@ -41,7 +41,7 @@ cargo build --release # binary at target/release/eval-magic
 ## How an eval works
-For each test case, the runner sets up two conditions and your agent dispatches a fresh subagent into each, with clean context:
+For each test case, the runner sets up two conditions and a fresh subagent runs each with clean context — *how* that subagent is dispatched (in-session vs. one-shot CLI) is the run-mode axis covered under [Harnesses](#harnesses):
 - **Mode A — new skill:** `with_skill` vs `without_skill`. Validates a brand-new skill beats baseline behavior with no skill loaded.
 - **Mode B — revision (the common case):** `old_skill` vs `new_skill`. Tests a language change to an existing skill — you snapshot the old `SKILL.md`, then run both variants against the same prompts. A negative or zero `delta.pass_rate` is a signal to revert.
@@ -85,24 +85,35 @@ environment.
 ### Mode A — new skill (with vs. without)
 ```bash
-# 1.
Build the iteration's isolated env (arm --guard — see Cost & confirmation).
+# run stages skills into .eval-magic/my-skill/iteration-1/env/, copies
+# fixtures in, and writes RUNBOOK.md. It does NOT dispatch — it prints a handoff.
 # Add --runs to dispatch every eval N times per condition for variance
 # reduction (a per-eval "runs" field in evals.json overrides the flag).
 eval-magic run --guard
-# 2. Your agent dispatches each task in skills-workspace/my-skill/iteration-1/dispatch.json
-# as a fresh subagent (each reads its dispatch_prompt_path and follows it).
+# 2. Enter the isolated env and follow the runbook. cd into iteration-1/env/ and
+# start a fresh agent session there (interactive Claude Code: the staged skills
+# must be present at session start), then say: "Read and follow RUNBOOK.md".
+# That session drives the whole loop below — dispatch → switch-condition →
+# ingest → finalize — and writes benchmark.json into iteration-1/. (Headless:
+# you, a human, follow the same RUNBOOK.md top to bottom; hybrid: the session
+# shells out a `claude -p` / `codex exec` recipe per task.) See Claude Code
+# below for the plugin-isolation and transcript specifics.
-# 3. Assemble records, detect stray writes, grade. ingest auto-resolves the
-# subagents dir from CLAUDE_CODE_SESSION_ID; outside that session pass
-# --session-id or --subagents-dir .
+# Steps 3–5 are driven from inside the runbook — shown here for reference:
+
+# 3. ingest assembles records, detects stray writes, and grades, stopping at the
+# judge hand-off. In-session it auto-resolves transcripts from
+# CLAUDE_CODE_SESSION_ID; hybrid/headless read each task's events file instead.
 eval-magic ingest
 # 4. Dispatch the judge tasks ingest lists, then finalize. If --guard is still
 # armed, finalize reminds you to run teardown-guard before editing source.
 eval-magic finalize
-# 5. Read skills-workspace/my-skill/iteration-1/benchmark.json, then clean up:
+# 5.
Read .eval-magic/my-skill/iteration-1/benchmark.json (the prep session
+# resumes here), then clean up:
 eval-magic teardown
 ```
@@ -120,20 +131,35 @@ If you snapshot *before* editing, omit `--ref` (it then reads the working tree)
 ## The run loop
-A run is one canonical workflow, the same in both modes:
+A run is one canonical workflow. `run` *prepares* an isolated env and hands off; a session entered in that env drives the rest of the loop to `benchmark.json`:
 ```
-run → dispatch agents → ingest → dispatch judges → finalize → teardown
+run (prepare env/ + RUNBOOK.md)
+ └─► [in env/, runbook-driven] dispatch batch A → switch-condition → dispatch batch B
+ → ingest → dispatch judges → finalize ──► benchmark.json
+teardown
 ```
-1. **`run`** builds the iteration workspace, snapshots the `SKILL.md`, stages skills, and emits `dispatch.json` (machine-readable) alongside `dispatch-manifest.md` (human-readable).
-2. **Dispatch agents.** Read `dispatch.json`. Each task object points at a `dispatch_prompt_path` (the full prompt lives in a file so you never reproduce kilobytes inline), an `agent_description` to pass through *verbatim* as the dispatch description, and the exact `run_record_path` / `timing_path`. For each task, dispatch a fresh subagent told to read the file at `dispatch_prompt_path` and follow it exactly. The `agent_description` is namespaced with the iteration and a per-run nonce (`:[:r]:i-`; the `r` segment appears only in multi-run cells, see `run --help` on `--runs`) — passing it through unchanged is what lets transcripts correlate to runs.
-3. **`ingest`** (a fixed-order chain: record-runs → fill-transcripts → detect-stray-writes → grade) assembles each task's `run.json` and `timing.json` from `dispatch.json` + the subagent's `outputs/final-message.md` + the persisted transcript, scans for stray writes, and grades the `transcript_check` assertions. It stops at the judge hand-off, listing a judge task per `llm_judge` assertion.
-4.
**Dispatch judges.** Same pattern as step 2: dispatch a fresh subagent for each judge task to read its prompt file and write its verdict back.
-5. **`finalize`** (grade `--finalize` → aggregate) merges the judge verdicts and writes `benchmark.json`. Read it. If a `--guard` marker is still live, it also reminds you to run `teardown-guard` before editing source.
-6. **`teardown`** disarms the guard, removes the staged skill set, and reclaims the workspace artifacts that are safe to delete.
+1. **`run` prepares — it does not dispatch.** It builds the iteration workspace (`iteration-N/`), snapshots the `SKILL.md`, stages skills into the isolated env `iteration-N/env/` (the agent's cwd), copies fixtures in so it reads like a real repo, emits `dispatch.json` (machine-readable) alongside `dispatch-manifest.md` (human-readable), and writes `RUNBOOK.md` into `env/`. Then it prints a handoff, not a dispatch.
+2. **Enter the isolated env.** `cd` into `iteration-N/env/`, begin a run session there, and say *Read and follow `RUNBOOK.md`*. How you enter differs by run mode (see [Run modes](#run-modes)):
+ - **Interactive (Claude Code):** start a *fresh* Claude Code session in `env/` so the staged skills are present at session start; it dispatches in-session subagents and runs the rest of the loop itself.
+ - **Hybrid (Claude Code / Codex):** an orchestrating session follows `RUNBOOK.md`, shelling out a `claude -p` / `codex exec` recipe per task.
+ - **Headless (Claude Code / Codex):** no session — a human follows the same `RUNBOOK.md`, pasting each recipe and command top to bottom.
+3. **Dispatch agents (runbook-driven).** Read `dispatch.json`. Each task object points at a `dispatch_prompt_path` (the full prompt lives in a file so you never reproduce kilobytes inline), an `agent_description` to pass through *verbatim* as the dispatch description, and the exact `run_record_path` / `timing_path`.
For each task, dispatch a fresh subagent told to read the file at `dispatch_prompt_path` and follow it exactly. The `agent_description` is namespaced with the iteration and a per-run nonce (`:[:r]:i-`; the `r` segment appears only in multi-run cells, see `run --help` on `--runs`) — passing it through unchanged is what lets transcripts correlate to runs.
+4. **`switch-condition` between condition batches.** Conditions run as sequential batches, never interleaved. After joining *all* of the first batch's subagents, run `eval-magic switch-condition --condition ` to remove the off-condition's staged skill from `env/.claude/skills/`, so the next batch can't read it — the read-isolation barrier. When a run has more than one **isolation group** (see below), the runbook also dispatches each group as its own batch and runs `eval-magic reset-batch --group ` between groups — it wipes the shared `env/` working tree and re-seeds it with that group's fixtures, the per-group isolation barrier. The runbook spells out the exact `switch-condition` / `reset-batch` sequence; you never work it out yourself.
+5. **`ingest`** (a fixed-order chain: record-runs → fill-transcripts → detect-stray-writes → grade), run from inside `env/`, assembles each task's `run.json` and `timing.json` from `dispatch.json` + the subagent's `outputs/final-message.md` + the persisted transcript, scans for stray writes, and grades the `transcript_check` assertions. It stops at the judge hand-off, listing a judge task per `llm_judge` assertion.
+6. **Dispatch judges.** Same pattern as step 3: dispatch a fresh subagent for each judge task to read its prompt file and write its verdict back.
+7. **`finalize`** (grade `--finalize` → aggregate) merges the judge verdicts and writes `benchmark.json` into `iteration-N/`, *above* `env/`. Read it. If a `--guard` marker is still live, it also reminds you to run `teardown-guard` before editing source.
+8.
**`teardown`** disarms the guard, removes the staged skill set, and reclaims the workspace artifacts that are safe to delete.
+
+The chains run in-process and stop at the first failure; re-running after a fix is safe — every sub-step skips work that's already done. The individual steps (`record-runs`, `fill-transcripts`, `detect-stray-writes`, `grade`, `aggregate`) remain callable for inspection or recovery. Under the `Cli` mechanism (Claude Code hybrid/headless, Codex), the per-task dispatch recipe lives in `RUNBOOK.md` and `ingest` reads each task's events file (`claude-events.jsonl` / `codex-events.jsonl`) instead of an in-session transcript; un-wired harnesses still write records by hand until their adapters land.
+
+### Isolation grouping (which agents batch together)
-The chains run in-process and stop at the first failure; re-running after a fix is safe — every sub-step skips work that's already done. The individual steps (`record-runs`, `fill-transcripts`, `detect-stray-writes`, `grade`, `aggregate`) remain callable for inspection or recovery. Codex uses the same chain with `--harness codex` after each task captures `outputs/codex-events.jsonl`; un-wired harnesses still write records by hand until their adapters land.
+`run` decides at **setup** time which evals can share an environment and which need their own, writes the plan into `dispatch.json` (a `groups[]` summary plus a per-task `group`/`eval_root`), and the runbook follows it — the executing session does no isolation reasoning itself. By default every eval shares one group (today's behavior, unchanged). Two things create a separate group: evals whose fixtures would clobber each other, and an eval that opts out explicitly with `"isolation": "isolated"` in `evals.json` (use it when an eval's agent *mutates* a fixture another eval reads).
How groups are realized depends on the run mode:
+
+- **Interactive (in-session):** one `env/`; groups are dispatched as sequential batches with a `reset-batch` barrier between them.
+- **Hybrid / headless (Cli):** one env per `(group, condition)` — `iteration-N/env--/` — so each subprocess `cd`s into a fully-isolated cwd. This also gives the Cli path real per-condition isolation: the control arm's env contains no skill at all.
 ## Cost & confirmation
@@ -175,7 +201,7 @@ Read `validity_warnings` **before** trusting any delta — a low skill-invocatio
 Per skill being evaluated, the runner produces this tree (everything but `evals/evals.json` is generated):
 ```
-skills-workspace// # outside the skill directory, gitignore it
+.eval-magic// # outside the skill directory, gitignore it
 snapshots/ # Mode B baselines, persist across iterations