Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -308,7 +308,7 @@ Project-local staged skills live in the isolated env at `env/.claude/skills/`, i

**Where transcripts live.** Claude Code persists subagent transcripts under `~/.claude/projects/<project-slug>/<parent-session-id>/subagents/`. `ingest` auto-resolves this from the `CLAUDE_CODE_SESSION_ID` the orchestrating session exports (deriving `<project-slug>` from the cwd and scanning `projects/*` if needed), so you normally don't pass `--subagents-dir` at all. When running outside that session — or to target a past session — pass `--session-id <id>`, or override the lookup entirely with `--subagents-dir <path>`. Besides out-of-bounds writes, `detect-stray-writes` also flags **live-source reads**: an arm whose subagent read the live skill source instead of its staged copy. That usually means the Skill tool couldn't resolve the staged slug yet and the agent improvised — fatal in revision mode, where the `old_skill` arm then sees new-skill content. Treat a flagged cell's arm as contaminated.

**Dispatching via the Task tool.** `dispatch.json` is a top-level object (`{ skill_name, iteration, run_nonce, …, tasks: [...] }`); iterate `tasks[]`. For each task, dispatch a fresh subagent via the Task tool with the prompt `Read the file at <dispatch_prompt_path> and follow its instructions exactly.` (substituting the task's `dispatch_prompt_path`), and pass `agent_description` *verbatim* as the description — it's namespaced `<eval_id>:<condition>:i<N>-<nonce>`, and passing it unchanged is what lets transcript correlation work. (The Task tool documents `description` as "short", but pass the full string regardless — correlation depends on the exact value.) You do **not** write `run.json`/`timing.json` yourself; the subagent writes `outputs/final-message.md`, and `ingest` (`record-runs`) assembles both records from disk. For a plan-mode-relevant skill, add `--plan-mode` to inject Claude Code's verbatim plan-mode procedure as a `<system-reminder>` operating-context layer.
**Dispatching via the Task tool.** `dispatch.json` is a top-level object (`{ skill_name, iteration, run_nonce, …, tasks: [...] }`); iterate `tasks[]`. For each task, dispatch a fresh subagent via the Task tool with the prompt `Read the file at <dispatch_prompt_path> and follow its instructions exactly.` (substituting the task's `dispatch_prompt_path`), and pass `agent_description` *verbatim* as the description — it's namespaced `<eval_id>:<condition>:i<N>-<nonce>`, and passing it unchanged is what lets transcript correlation work. (The Task tool documents `description` as "short", but pass the full string regardless — correlation depends on the exact value.) You do **not** write `run.json`/`timing.json` yourself; the subagent writes `outputs/final-message.md`, and `ingest` (`record-runs`) assembles both records from disk. For a plan-mode-relevant skill, add `--plan-mode` to inject the shared plan-mode procedure as a `<system-reminder>` operating-context layer.

**Hybrid mode (`--run-mode hybrid`).** Pass `--run-mode hybrid` to dispatch each task through the `claude -p` one-shot CLI instead of in-session subagents — the same shape as Codex's hybrid flow, where an agent session orchestrates while each test/judge shells out to the CLI. `run` then prints (and `dispatch-manifest.md` / `RUNBOOK.md` carry) a `claude -p` recipe per task:

Expand Down Expand Up @@ -371,7 +371,7 @@ eval-magic ingest --harness codex

Judge tasks follow the same model-selection rule. `run --judge-model <id>` becomes the default `model` in `judge-tasks.json`; an individual `llm_judge.model` overrides it. The Codex judge recipe reads each task's resolved `model` and passes `-m "$model"` only when one is present.

`finalize` and `teardown` work the same with `--harness codex`. Codex results are lower fidelity than Claude Code in a few places: `transcript_check` matches parsed `item.completed` entries (`command_execution`, `file_change`, `web_search`, MCP items); the automatic `__skill_invoked` meta-check uses the LLM-judge fallback (Codex's JSONL exposes no deterministic skill-tool event); and `--plan-mode` injects Codex's plan-mode procedure as text rather than launching `codex exec` into the interactive CLI's real `/plan` mode. `--guard` stages a Codex `PreToolUse` hook that blocks out-of-sandbox `Bash` mutations and `apply_patch` targets before they run; `detect-stray-writes` remains the post-run audit. Bias Codex suites toward `llm_judge` assertions for behavior and `transcript_check` for tool events.
`finalize` and `teardown` work the same with `--harness codex`. Codex results are lower fidelity than Claude Code in a few places: `transcript_check` matches parsed `item.completed` entries (`command_execution`, `file_change`, `web_search`, MCP items); the automatic `__skill_invoked` meta-check uses the LLM-judge fallback (Codex's JSONL exposes no deterministic skill-tool event); and `--plan-mode` injects the shared plan-mode procedure as text rather than launching `codex exec` into the interactive CLI's real `/plan` mode. `--guard` stages a Codex `PreToolUse` hook that blocks out-of-sandbox `Bash` mutations and `apply_patch` targets before they run; `detect-stray-writes` remains the post-run audit. Bias Codex suites toward `llm_judge` assertions for behavior and `transcript_check` for tool events.

When running `eval-magic run --harness codex` from inside Codex itself, staging writes `.agents/skills`; adding `--guard` also writes `.codex/hooks.json`. Those project-local Codex config paths are protected by Codex's default workspace-write sandbox, so the runner may need approval/escalation or an external terminal invocation. That approval is Codex's own permission boundary, not something eval-magic bypasses.

Expand All @@ -396,7 +396,7 @@ OpenCode skill names must be lowercase alphanumeric with single-hyphen separator
## Bundled assets

- `schema/` — JSON Schemas for every artifact (`evals`, run records, `grading`, `stray-writes`, `benchmark`, `judge-tasks`); the portable cross-harness contract, embedded in the binary
- `profiles/` — per-harness plan-mode procedure profiles (`--plan-mode`), embedded in the binary
- `profiles/` — the shared plan-mode procedure profile (`--plan-mode`) and runbook templates, embedded in the binary

## Development

Expand Down
4 changes: 2 additions & 2 deletions docs/harness-parity.md
Original file line number Diff line number Diff line change
Expand Up @@ -83,14 +83,14 @@ For each `HarnessAdapter` method below, compare your harness's impl against the
| Skill-eval auto-record (run/timing assembly) | (consumes `parse_transcript_full` / `parse_cli_events_full` + `cli_events_filename`) | `src/pipeline/record_runs.rs` assembles each task's `run.json` + `timing.json` from disk: carry-over fields from `dispatch.json`, `final_message` from `outputs/final-message.md` (falling back to the transcript's final text — the primary path for `claude -p`, which has no `--output-last-message`), and tool invocations/tokens/duration from the parsed transcript. A harness closes this gap by supplying the transcript its parser consumes (the portable fallback — hand-authored records against `run-record.schema.json` — is unchanged) |
| Cli model selection | `cli_model_flag`, `cli_next_steps`, `cli_manifest_section`, `cli_judge_next_steps` | For one-shot CLI dispatch, `run --agent-model` is rendered into the agent command recipe and `run --judge-model` becomes the default model in `judge-tasks.json`; assertion-level `llm_judge.model` remains the most specific override. Codex is the reference: `cli_model_flag` returns `-m`, agent recipes render `codex --ask-for-approval never exec ... -m <model>`, and judge recipes read each task's resolved `model` and pass `-m "$model"` only when present |
| Eval subagent write enforcement | `install_guard` | Opt-in `--guard` stages a pre-tool hook (`src/sandbox/`) that *denies* subagent writes/installs outside the eval sandbox while dispatches run. Portable fallback for every harness: the `eval-magic detect-stray-writes` post-pass (`src/pipeline/detect_stray_writes.rs`) flags out-of-bounds writes from the parsed transcript after the fact |
| Eval plan-mode operating context | `plan_mode_profile`, `render_plan_mode_context` | Opt-in `--plan-mode` injects the harness's `profiles/<harness>/plan-mode.md` (embedded at compile time) as a `<system-reminder>` operating-context layer in every dispatch. Claude Code and Codex profiles exist today; a harness with no profile has no `--plan-mode` and an unchanged dispatch contract |
| Eval plan-mode operating context | `render_plan_mode_context` | Opt-in `--plan-mode` injects the shared `profiles/shared/plan-mode.md` (embedded at compile time) as a `<system-reminder>` operating-context layer in every dispatch — identical text for every harness |
| Harness-details operator guide | (docs, not a trait method) | The README's per-harness section, e.g. [Claude Code](../README.md#claude-code-fully-wired) |

**Note on the transcript adapter (raised bar).** Baseline eval suites use `transcript_check` assertions — deterministic regex checks against a run's tool invocations (e.g. "a test command ran", "the sibling skill was loaded"). These only grade when `parse_transcript` is implemented for your harness. A harness without it still functions: those assertions grade as *unverifiable* and the `llm_judge` assertions carry the substantive measurement. But adapter richness is an explicit parity target, not optional polish — implementing or enriching `parse_transcript*` lets more of a baseline suite grade mechanically. Treat it as a goal to aim at, not a box already checked.

**Note on write enforcement (parity goal).** Eval subagents are instructed to write only inside their `outputs/` dir, but nothing in the portable contract *enforces* it — a misbehaving subagent can edit the real repo or install packages, silently tainting the run. Two layers address this: the portable `detect-stray-writes` post-pass (available to every harness, since it works off the same parsed transcript) and an opt-in harness-native `install_guard` that stages a pre-tool hook to *block* the write before it happens. Claude Code and Codex both wire this through their `PreToolUse` hook surfaces — for Claude Code across *both* mechanisms, since `claude -p` loads the same project-local `settings.local.json` hook from the env cwd each dispatch runs in. **Harness-level tool enforcement is an explicit parity goal, not optional polish.** A harness that can express a pre-tool guard (a hook, a permission rule, a sandboxed cwd) should wire `install_guard`; until then, `detect-stray-writes` is the honest fallback.

**Note on plan-mode fidelity (residual parity goal).** `--plan-mode` injects a harness's *verbatim* plan-mode procedure as operating context, the closest a harness's eval runner can get to reproducing the wild failure where a real plan mode makes loading a skill feel redundant. It is **not** the real mode: it is still text the dispatched subagent reads, not a state the harness places it under, so a pass remains necessary-not-sufficient (the seeding ceiling is explained in the [`slow-powers`](https://github.com/slowdini/slow-powers) `evaluating-skills` skill). A harness that can dispatch an eval subagent *into* its own plan/research mode would close this gap; `--plan-mode` (a profile + renderer) is the approximation every harness can reach in the meantime.
**Note on plan-mode fidelity (residual parity goal).** `--plan-mode` injects a shared, harness-agnostic plan-mode procedure as operating context, the closest a harness's eval runner can get to reproducing the wild failure where a real plan mode makes loading a skill feel redundant. It is **not** the real mode: it is still text the dispatched subagent reads, not a state the harness places it under, so a pass remains necessary-not-sufficient (the seeding ceiling is explained in the [`slow-powers`](https://github.com/slowdini/slow-powers) `evaluating-skills` skill). A harness that can dispatch an eval subagent *into* its own plan/research mode would close this gap; `--plan-mode` (a profile + renderer) is the approximation every harness can reach in the meantime.

Surface your findings inline using this template:

Expand Down
11 changes: 0 additions & 11 deletions profiles/claude-code/plan-mode.md

This file was deleted.

11 changes: 0 additions & 11 deletions profiles/codex/plan-mode.md

This file was deleted.

11 changes: 0 additions & 11 deletions profiles/opencode/plan-mode.md

This file was deleted.

11 changes: 11 additions & 0 deletions profiles/shared/plan-mode.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
Plan mode is active. The user wants to review an approach before any code is written, so you must NOT implement yet: do not edit files, run formatters or commands that rewrite files, apply migrations, generate code, install dependencies, or otherwise change product or system state. You may inspect files and run read-only checks. The only writes permitted are the plan itself and any final-response output the user explicitly requested. This constraint supersedes any other instruction you have received this session.

You are operating inside a plan-mode workflow — a fixed, multi-phase procedure. Work through the phases in order:

1. **Understand.** Read the relevant code, configs, schemas, docs, and tests with read-only tools until you can describe the change concretely. Reuse what already exists rather than proposing new code.
2. **Design.** Decide the implementation approach: the interfaces and data flow it touches, the edge cases and failure modes, and how you will verify it. Prefer existing patterns over new abstractions.
3. **Review.** Re-check the design against the user's request. Remove placeholders and vague steps, confirm every referenced file is real, and resolve material open questions with the user — ask only questions that change the plan.
4. **Propose the plan.** Lay out the finished plan: name the files to change, how each step maps to the goal, and how to verify the result. Do not defer decisions to implementation time.
5. **Hand off.** Present the finished plan for the user's approval and stop. Do not begin implementation until the user has approved it.

Terminal rail: your turn must end in exactly one of two ways — by asking the user the smallest necessary question, or by presenting the finished plan for the user's approval. Do not stop for any other reason and do not begin implementation until the plan is approved.
Loading
Loading