diff --git a/README.md b/README.md index 4d8ec07..36ad482 100644 --- a/README.md +++ b/README.md @@ -308,7 +308,7 @@ Project-local staged skills live in the isolated env at `env/.claude/skills/`, i **Where transcripts live.** Claude Code persists subagent transcripts under `~/.claude/projects///subagents/`. `ingest` auto-resolves this from the `CLAUDE_CODE_SESSION_ID` the orchestrating session exports (deriving `` from the cwd and scanning `projects/*` if needed), so you normally don't pass `--subagents-dir` at all. When running outside that session — or to target a past session — pass `--session-id `, or override the lookup entirely with `--subagents-dir `. Besides out-of-bounds writes, `detect-stray-writes` also flags **live-source reads**: an arm whose subagent read the live skill source instead of its staged copy. That usually means the Skill tool couldn't resolve the staged slug yet and the agent improvised — fatal in revision mode, where the `old_skill` arm then sees new-skill content. Treat a flagged cell's arm as contaminated. -**Dispatching via the Task tool.** `dispatch.json` is a top-level object (`{ skill_name, iteration, run_nonce, …, tasks: [...] }`); iterate `tasks[]`. For each task, dispatch a fresh subagent via the Task tool with the prompt `Read the file at and follow its instructions exactly.` (substituting the task's `dispatch_prompt_path`), and pass `agent_description` *verbatim* as the description — it's namespaced `::i-`, and passing it unchanged is what lets transcript correlation work. (The Task tool documents `description` as "short", but pass the full string regardless — correlation depends on the exact value.) You do **not** write `run.json`/`timing.json` yourself; the subagent writes `outputs/final-message.md`, and `ingest` (`record-runs`) assembles both records from disk. For a plan-mode-relevant skill, add `--plan-mode` to inject Claude Code's verbatim plan-mode procedure as a `` operating-context layer. 
+**Dispatching via the Task tool.** `dispatch.json` is a top-level object (`{ skill_name, iteration, run_nonce, …, tasks: [...] }`); iterate `tasks[]`. For each task, dispatch a fresh subagent via the Task tool with the prompt `Read the file at and follow its instructions exactly.` (substituting the task's `dispatch_prompt_path`), and pass `agent_description` *verbatim* as the description — it's namespaced `::i-`, and passing it unchanged is what lets transcript correlation work. (The Task tool documents `description` as "short", but pass the full string regardless — correlation depends on the exact value.) You do **not** write `run.json`/`timing.json` yourself; the subagent writes `outputs/final-message.md`, and `ingest` (`record-runs`) assembles both records from disk. For a plan-mode-relevant skill, add `--plan-mode` to inject the shared plan-mode procedure as a `` operating-context layer. **Hybrid mode (`--run-mode hybrid`).** Pass `--run-mode hybrid` to dispatch each task through the `claude -p` one-shot CLI instead of in-session subagents — the same shape as Codex's hybrid flow, where an agent session orchestrates while each test/judge shells out to the CLI. `run` then prints (and `dispatch-manifest.md` / `RUNBOOK.md` carry) a `claude -p` recipe per task: @@ -371,7 +371,7 @@ eval-magic ingest --harness codex Judge tasks follow the same model-selection rule. `run --judge-model ` becomes the default `model` in `judge-tasks.json`; an individual `llm_judge.model` overrides it. The Codex judge recipe reads each task's resolved `model` and passes `-m "$model"` only when one is present. -`finalize` and `teardown` work the same with `--harness codex`. 
Codex results are lower fidelity than Claude Code in a few places: `transcript_check` matches parsed `item.completed` entries (`command_execution`, `file_change`, `web_search`, MCP items); the automatic `__skill_invoked` meta-check uses the LLM-judge fallback (Codex's JSONL exposes no deterministic skill-tool event); and `--plan-mode` injects Codex's plan-mode procedure as text rather than launching `codex exec` into the interactive CLI's real `/plan` mode. `--guard` stages a Codex `PreToolUse` hook that blocks out-of-sandbox `Bash` mutations and `apply_patch` targets before they run; `detect-stray-writes` remains the post-run audit. Bias Codex suites toward `llm_judge` assertions for behavior and `transcript_check` for tool events. +`finalize` and `teardown` work the same with `--harness codex`. Codex results are lower fidelity than Claude Code in a few places: `transcript_check` matches parsed `item.completed` entries (`command_execution`, `file_change`, `web_search`, MCP items); the automatic `__skill_invoked` meta-check uses the LLM-judge fallback (Codex's JSONL exposes no deterministic skill-tool event); and `--plan-mode` injects the shared plan-mode procedure as text rather than launching `codex exec` into the interactive CLI's real `/plan` mode. `--guard` stages a Codex `PreToolUse` hook that blocks out-of-sandbox `Bash` mutations and `apply_patch` targets before they run; `detect-stray-writes` remains the post-run audit. Bias Codex suites toward `llm_judge` assertions for behavior and `transcript_check` for tool events. When running `eval-magic run --harness codex` from inside Codex itself, staging writes `.agents/skills`; adding `--guard` also writes `.codex/hooks.json`. Those project-local Codex config paths are protected by Codex's default workspace-write sandbox, so the runner may need approval/escalation or an external terminal invocation. That approval is Codex's own permission boundary, not something eval-magic bypasses. 
@@ -396,7 +396,7 @@ OpenCode skill names must be lowercase alphanumeric with single-hyphen separator ## Bundled assets - `schema/` — JSON Schemas for every artifact (`evals`, run records, `grading`, `stray-writes`, `benchmark`, `judge-tasks`); the portable cross-harness contract, embedded in the binary -- `profiles/` — per-harness plan-mode procedure profiles (`--plan-mode`), embedded in the binary +- `profiles/` — the shared plan-mode procedure profile (`--plan-mode`) and runbook templates, embedded in the binary ## Development diff --git a/docs/harness-parity.md b/docs/harness-parity.md index 15be4bf..fdee31f 100644 --- a/docs/harness-parity.md +++ b/docs/harness-parity.md @@ -83,14 +83,14 @@ For each `HarnessAdapter` method below, compare your harness's impl against the | Skill-eval auto-record (run/timing assembly) | (consumes `parse_transcript_full` / `parse_cli_events_full` + `cli_events_filename`) | `src/pipeline/record_runs.rs` assembles each task's `run.json` + `timing.json` from disk: carry-over fields from `dispatch.json`, `final_message` from `outputs/final-message.md` (falling back to the transcript's final text — the primary path for `claude -p`, which has no `--output-last-message`), and tool invocations/tokens/duration from the parsed transcript. A harness closes this gap by supplying the transcript its parser consumes (the portable fallback — hand-authored records against `run-record.schema.json` — is unchanged) | | Cli model selection | `cli_model_flag`, `cli_next_steps`, `cli_manifest_section`, `cli_judge_next_steps` | For one-shot CLI dispatch, `run --agent-model` is rendered into the agent command recipe and `run --judge-model` becomes the default model in `judge-tasks.json`; assertion-level `llm_judge.model` remains the most specific override. Codex is the reference: `cli_model_flag` returns `-m`, agent recipes render `codex --ask-for-approval never exec ... 
-m `, and judge recipes read each task's resolved `model` and pass `-m "$model"` only when present | | Eval subagent write enforcement | `install_guard` | Opt-in `--guard` stages a pre-tool hook (`src/sandbox/`) that *denies* subagent writes/installs outside the eval sandbox while dispatches run. Portable fallback for every harness: the `eval-magic detect-stray-writes` post-pass (`src/pipeline/detect_stray_writes.rs`) flags out-of-bounds writes from the parsed transcript after the fact | -| Eval plan-mode operating context | `plan_mode_profile`, `render_plan_mode_context` | Opt-in `--plan-mode` injects the harness's `profiles//plan-mode.md` (embedded at compile time) as a `` operating-context layer in every dispatch. Claude Code and Codex profiles exist today; a harness with no profile has no `--plan-mode` and an unchanged dispatch contract | +| Eval plan-mode operating context | `render_plan_mode_context` | Opt-in `--plan-mode` injects the shared `profiles/shared/plan-mode.md` (embedded at compile time) as a `` operating-context layer in every dispatch — identical text for every harness | | Harness-details operator guide | (docs, not a trait method) | The README's per-harness section, e.g. [Claude Code](../README.md#claude-code-fully-wired) | **Note on the transcript adapter (raised bar).** Baseline eval suites use `transcript_check` assertions — deterministic regex checks against a run's tool invocations (e.g. "a test command ran", "the sibling skill was loaded"). These only grade when `parse_transcript` is implemented for your harness. A harness without it still functions: those assertions grade as *unverifiable* and the `llm_judge` assertions carry the substantive measurement. But adapter richness is an explicit parity target, not optional polish — implementing or enriching `parse_transcript*` lets more of a baseline suite grade mechanically. Treat it as a goal to aim at, not a box already checked. 
**Note on write enforcement (parity goal).** Eval subagents are instructed to write only inside their `outputs/` dir, but nothing in the portable contract *enforces* it — a misbehaving subagent can edit the real repo or install packages, silently tainting the run. Two layers address this: the portable `detect-stray-writes` post-pass (available to every harness, since it works off the same parsed transcript) and an opt-in harness-native `install_guard` that stages a pre-tool hook to *block* the write before it happens. Claude Code and Codex both wire this through their `PreToolUse` hook surfaces — for Claude Code across *both* mechanisms, since `claude -p` loads the same project-local `settings.local.json` hook from the env cwd each dispatch runs in. **Harness-level tool enforcement is an explicit parity goal, not optional polish.** A harness that can express a pre-tool guard (a hook, a permission rule, a sandboxed cwd) should wire `install_guard`; until then, `detect-stray-writes` is the honest fallback. -**Note on plan-mode fidelity (residual parity goal).** `--plan-mode` injects a harness's *verbatim* plan-mode procedure as operating context, the closest a harness's eval runner can get to reproducing the wild failure where a real plan mode makes loading a skill feel redundant. It is **not** the real mode: it is still text the dispatched subagent reads, not a state the harness places it under, so a pass remains necessary-not-sufficient (the seeding ceiling is explained in the [`slow-powers`](https://github.com/slowdini/slow-powers) `evaluating-skills` skill). A harness that can dispatch an eval subagent *into* its own plan/research mode would close this gap; `--plan-mode` (a profile + renderer) is the approximation every harness can reach in the meantime. 
+**Note on plan-mode fidelity (residual parity goal).** `--plan-mode` injects a shared, harness-agnostic plan-mode procedure as operating context, the closest a harness's eval runner can get to reproducing the wild failure where a real plan mode makes loading a skill feel redundant. It is **not** the real mode: it is still text the dispatched subagent reads, not a state the harness places it under, so a pass remains necessary-not-sufficient (the seeding ceiling is explained in the [`slow-powers`](https://github.com/slowdini/slow-powers) `evaluating-skills` skill). A harness that can dispatch an eval subagent *into* its own plan/research mode would close this gap; `--plan-mode` (a profile + renderer) is the approximation every harness can reach in the meantime. Surface your findings inline using this template: diff --git a/profiles/claude-code/plan-mode.md b/profiles/claude-code/plan-mode.md deleted file mode 100644 index 9c153a8..0000000 --- a/profiles/claude-code/plan-mode.md +++ /dev/null @@ -1,11 +0,0 @@ -Plan mode is active. The user wants to review an approach before any code is written, so you must NOT execute yet: do not make any edits, do not run any non-read-only tool, and do not change configs or system state. The only file you may write is the plan file. This constraint supersedes any other instruction you have received this session. - -You are operating inside the harness's plan-mode workflow — a fixed, multi-phase procedure. Work through the phases in order: - -1. **Understand.** Read the relevant code and gather context with read-only tools until you can describe the change concretely. Reuse what already exists rather than proposing new code. -2. **Design.** Decide the implementation approach and the trade-offs. -3. **Review.** Re-check the design against the user's request and resolve open questions with the user before finalizing. -4. 
**Write the plan.** Build the plan up incrementally in the plan file — this is the one file you are permitted to write. Name the files to change and how to verify the result. -5. **Hand off.** Call ExitPlanMode to submit the plan for the user's approval. - -Terminal rail: your turn must end in exactly one of two ways — by asking the user a question, or by calling ExitPlanMode to present the finished plan. Do not stop for any other reason and do not begin implementation until the user has approved the plan. The plan-mode workflow already governs how you research, design, and present the work; stay on this rail through to ExitPlanMode. diff --git a/profiles/codex/plan-mode.md b/profiles/codex/plan-mode.md deleted file mode 100644 index 868cf4e..0000000 --- a/profiles/codex/plan-mode.md +++ /dev/null @@ -1,11 +0,0 @@ -Codex plan mode is active. The user wants to review an approach before implementation, so do not implement yet: do not edit repo-tracked files, run formatters that rewrite files, apply migrations, generate code, install dependencies, or otherwise change product state. You may inspect files and run non-mutating checks or tests that write only normal build/cache artifacts. The only permitted task writes are explicit plan artifacts requested by the user or required final-response output files. - -Work through the planning flow in order: - -1. **Ground in the environment.** Read the relevant files, configs, schemas, docs, and tests before asking questions when local inspection can answer them. Prefer existing patterns over new abstractions. -2. **Clarify intent.** Ask only questions that materially change the plan, confirm an important assumption, or choose between real tradeoffs. If a question can be resolved from the repo, inspect instead. -3. **Design the implementation.** Specify the exact behavior, interfaces, data flow, edge cases, failure modes, compatibility concerns, and verification strategy. 
Choose conservative defaults when the repo already points to one clear path. -4. **Review the plan.** Remove placeholders and vague instructions. Ensure every referenced existing file is real, every step maps to the requested goal, and the plan does not defer decisions to implementation time. -5. **Present the plan.** End with exactly one decision-complete `` block. Include a short title, summary, public API/interface changes, test scenarios, and assumptions. Do not begin implementation until the user approves. - -If you do not yet have enough information to produce a decision-complete plan, ask the user the smallest necessary question and stop. If the plan is ready, present only the `` block. diff --git a/profiles/opencode/plan-mode.md b/profiles/opencode/plan-mode.md deleted file mode 100644 index 576acac..0000000 --- a/profiles/opencode/plan-mode.md +++ /dev/null @@ -1,11 +0,0 @@ -OpenCode plan mode is active. The user wants to review an approach before any code is written, so you must NOT implement yet: do not edit repo-tracked files, run formatters that rewrite files, apply migrations, generate code, install dependencies, or otherwise change product state. You may inspect files and run non-mutating checks or tests that write only normal build/cache artifacts. The only permitted task writes are explicit plan artifacts requested by the user or required final-response output files. - -You are operating as OpenCode's plan agent. Work through the planning flow in order: - -1. **Ground in the environment.** Read the relevant files, configs, schemas, docs, and tests before asking questions when local inspection can answer them. Prefer existing patterns over new abstractions. -2. **Clarify intent.** Ask only questions that materially change the plan, confirm an important assumption, or choose between real tradeoffs. -3. 
**Design the implementation.** Specify the exact behavior, interfaces, data flow, edge cases, failure modes, compatibility concerns, and verification strategy. -4. **Review the plan.** Remove placeholders and vague instructions. Ensure every referenced existing file is real, every step maps to the requested goal, and the plan does not defer decisions to implementation time. -5. **Present the plan.** End with exactly one decision-complete plan. Do not begin implementation until the user approves. - -Terminal rail: your turn must end by either asking the user the smallest necessary question or by presenting the finished plan. Do not stop for any other reason and do not begin implementation until the user has approved the plan. diff --git a/profiles/shared/plan-mode.md b/profiles/shared/plan-mode.md new file mode 100644 index 0000000..d1a934f --- /dev/null +++ b/profiles/shared/plan-mode.md @@ -0,0 +1,11 @@ +Plan mode is active. The user wants to review an approach before any code is written, so you must NOT implement yet: do not edit files, run formatters or commands that rewrite files, apply migrations, generate code, install dependencies, or otherwise change product or system state. You may inspect files and run read-only checks. The only writes permitted are the plan itself and any final-response output the user explicitly requested. This constraint supersedes any other instruction you have received this session. + +You are operating inside a plan-mode workflow — a fixed, multi-phase procedure. Work through the phases in order: + +1. **Understand.** Read the relevant code, configs, schemas, docs, and tests with read-only tools until you can describe the change concretely. Reuse what already exists rather than proposing new code. +2. **Design.** Decide the implementation approach: the interfaces and data flow it touches, the edge cases and failure modes, and how you will verify it. Prefer existing patterns over new abstractions. +3. 
**Review.** Re-check the design against the user's request. Remove placeholders and vague steps, confirm every referenced file is real, and resolve material open questions with the user — ask only questions that change the plan. +4. **Propose the plan.** Lay out the finished plan: name the files to change, how each step maps to the goal, and how to verify the result. Do not defer decisions to implementation time. +5. **Hand off.** Present the finished plan for the user's approval and stop. Do not begin implementation until the user has approved it. + +Terminal rail: your turn must end in exactly one of two ways — by asking the user the smallest necessary question, or by presenting the finished plan for the user's approval. Do not stop for any other reason and do not begin implementation until the plan is approved. diff --git a/src/adapters/harness.rs b/src/adapters/harness.rs index 07426c3..570b69e 100644 --- a/src/adapters/harness.rs +++ b/src/adapters/harness.rs @@ -62,9 +62,6 @@ pub trait HarnessAdapter { /// when the staged identifier can't be resolved. fn skill_unresolved_phrase(&self) -> &'static str; - /// The verbatim plan-mode procedure profile bundled for this harness. - fn plan_mode_profile(&self) -> &'static str; - /// Wrap a plan-mode profile as a `` operating-context /// layer. The default usually suffices. 
fn render_plan_mode_context(&self, profile_text: &str) -> String { @@ -222,9 +219,6 @@ impl HarnessAdapter for ClaudeCodeAdapter { fn skill_unresolved_phrase(&self) -> &'static str { "If the Skill tool cannot resolve that identifier" } - fn plan_mode_profile(&self) -> &'static str { - include_str!("../../profiles/claude-code/plan-mode.md") - } fn runbook_template(&self) -> &'static str { include_str!("../../profiles/claude-code/runbook.md") } @@ -317,9 +311,6 @@ impl HarnessAdapter for CodexAdapter { fn skill_unresolved_phrase(&self) -> &'static str { "If it does not load as a Codex skill" } - fn plan_mode_profile(&self) -> &'static str { - include_str!("../../profiles/codex/plan-mode.md") - } fn cli_events_filename(&self) -> Option<&'static str> { Some("codex-events.jsonl") } @@ -404,9 +395,6 @@ impl HarnessAdapter for OpenCodeAdapter { fn skill_unresolved_phrase(&self) -> &'static str { "If it does not load as an OpenCode skill" } - fn plan_mode_profile(&self) -> &'static str { - include_str!("../../profiles/opencode/plan-mode.md") - } fn cli_next_steps(&self, ctx: CliDispatchContext<'_>) -> String { let model_note = if ctx.agent_model.is_some() { " Model selection was recorded as provenance, but the OpenCode adapter has no CLI model flag wired yet." diff --git a/src/cli/args.rs b/src/cli/args.rs index 0bdb0e7..12bf212 100644 --- a/src/cli/args.rs +++ b/src/cli/args.rs @@ -349,13 +349,13 @@ pub struct RunArgs { /// to clobber an existing dir; registered for next-run cleanup. #[arg(long)] pub stage_name: Option, - /// Inject the harness's plan-mode profile as an operating-context layer. + /// Inject the shared plan-mode profile as an operating-context layer. /// - /// Injects the harness's verbatim plan-mode procedure - /// (`profiles//plan-mode.md`) as a `` in every - /// dispatch, identical across arms. Opt-in, for plan-mode-relevant skills. - /// A harness with no profile aborts. It is text the subagent reads, not a - /// real injected mode. 
+ /// Injects the shared, harness-agnostic plan-mode procedure + /// (`profiles/shared/plan-mode.md`) as a `` in every + /// dispatch, identical across arms and harnesses. Opt-in, for + /// plan-mode-relevant skills. It is text the subagent reads, not a real + /// injected mode. #[arg(long)] pub plan_mode: bool, /// Runs per condition cell, for variance reduction (default: 1). diff --git a/src/cli/run/orchestrate/stage.rs b/src/cli/run/orchestrate/stage.rs index 99dfd4e..ce5014d 100644 --- a/src/cli/run/orchestrate/stage.rs +++ b/src/cli/run/orchestrate/stage.rs @@ -32,9 +32,9 @@ pub(super) fn stage_conditions( }; let plan_mode_content = if opts.plan_mode { - let profile = resolve_plan_mode_profile(ctx.harness)?; + let profile = resolve_plan_mode_profile(); println!( - " plan-mode: injecting {} plan-mode profile as operating context (issue #142; necessary-not-sufficient fidelity layer)", + " plan-mode: injecting the shared plan-mode profile as operating context for {} (issue #142; necessary-not-sufficient fidelity layer)", harness_label(ctx.harness) ); Some(profile.to_string()) diff --git a/src/cli/run/util.rs b/src/cli/run/util.rs index 0a5738b..92e27ea 100644 --- a/src/cli/run/util.rs +++ b/src/cli/run/util.rs @@ -129,11 +129,11 @@ pub(crate) fn insession_isolated_handoff(env_dir: &Path) -> String { ) } -/// Resolve the verbatim plan-mode procedure profile for a harness. -/// The profile is a compile-time bundled asset, mirroring the schema embedding in +/// Resolve the shared, harness-agnostic plan-mode procedure profile injected by +/// `--plan-mode`. A compile-time bundled asset, mirroring the schema embedding in /// `validation`. 
-pub(crate) fn resolve_plan_mode_profile(harness: Harness) -> Result<&'static str, RunError> { - Ok(adapter_for(harness).plan_mode_profile()) +pub(crate) fn resolve_plan_mode_profile() -> &'static str { + include_str!("../../../profiles/shared/plan-mode.md") } /// Reject run options not supported by the selected harness's current mechanism. @@ -310,10 +310,13 @@ mod tests { } #[test] - fn opencode_plan_mode_profile_resolves() { - let profile = resolve_plan_mode_profile(Harness::OpenCode).unwrap(); - assert!(profile.contains("OpenCode plan mode is active")); - assert!(profile.contains("plan agent")); + fn plan_mode_profile_is_shared_and_harness_agnostic() { + let profile = resolve_plan_mode_profile(); + assert!(profile.contains("Plan mode is active")); + // Harness-agnostic content: no Claude-specific ExitPlanMode rail or + // Codex-specific block. + assert!(!profile.contains("ExitPlanMode")); + assert!(!profile.contains("")); } #[test] @@ -321,14 +324,6 @@ mod tests { assert_eq!(harness_label(Harness::OpenCode), "opencode"); } - #[test] - fn codex_plan_mode_profile_resolves() { - let profile = resolve_plan_mode_profile(Harness::Codex).unwrap(); - assert!(profile.contains("Codex plan mode is active")); - assert!(profile.contains("")); - assert!(!profile.contains("ExitPlanMode")); - } - #[test] fn base36_roundtrips_small_values() { assert_eq!(to_base36(0), "0"); diff --git a/tests/cli/package.rs b/tests/cli/package.rs index cfc031f..a4a1dca 100644 --- a/tests/cli/package.rs +++ b/tests/cli/package.rs @@ -73,7 +73,7 @@ fn cargo_package_excludes_repo_local_authoring_files() { "LICENSE", "README.md", "schema/evals.schema.json", - "profiles/codex/plan-mode.md", + "profiles/shared/plan-mode.md", "src/main.rs", ] { assert!( diff --git a/tests/run/codex.rs b/tests/run/codex.rs index 75c50bf..f5842a3 100644 --- a/tests/run/codex.rs +++ b/tests/run/codex.rs @@ -154,8 +154,10 @@ fn codex_plan_mode_injects_profile_and_records_flag() { assert!(prompt.contains("## Skills")); } 
assert!(prompt.contains("")); - assert!(prompt.contains("Codex plan mode is active")); - assert!(prompt.contains("")); + // Shared, harness-agnostic profile: same text every harness sees, with no + // Codex-specific block or Claude-specific ExitPlanMode rail. + assert!(prompt.contains("Plan mode is active")); + assert!(!prompt.contains("")); assert!(!prompt.contains("ExitPlanMode")); } } diff --git a/tests/run/opencode.rs b/tests/run/opencode.rs index 073ae87..447aff8 100644 --- a/tests/run/opencode.rs +++ b/tests/run/opencode.rs @@ -141,7 +141,8 @@ fn opencode_plan_mode_injects_profile_and_records_flag() { assert!(prompt.contains("")); } assert!(prompt.contains("")); - assert!(prompt.contains("OpenCode plan mode is active")); + // Shared, harness-agnostic profile: same text every harness sees. + assert!(prompt.contains("Plan mode is active")); assert!(!prompt.contains("ExitPlanMode")); } } diff --git a/tests/run/staging.rs b/tests/run/staging.rs index 0db4828..aafc2b5 100644 --- a/tests/run/staging.rs +++ b/tests/run/staging.rs @@ -148,7 +148,9 @@ fn plan_mode_injects_profile_and_records_flag() { let prompt = read_str(Path::new(task["dispatch_prompt_path"].as_str().unwrap())); assert!(prompt.contains("")); assert!(prompt.contains("Plan mode is active")); - assert!(prompt.contains("ExitPlanMode")); + // The shared profile is harness-agnostic: no Claude-specific ExitPlanMode rail. + assert!(prompt.contains("for the user's approval")); + assert!(!prompt.contains("ExitPlanMode")); } }
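The net effect of this change can be sketched in isolation: one bundled profile, resolved without a harness argument, so every harness receives byte-identical plan-mode context. This is a minimal sketch under stated assumptions, not the crate's actual code. The profile text is an abridged literal standing in for `include_str!("../../../profiles/shared/plan-mode.md")`, the `Harness` enum is redeclared locally, `plan_mode_context` is a hypothetical helper, and the wrapper tag name is passed in by the caller because the diff does not show it.

```rust
/// Abridged stand-in for `profiles/shared/plan-mode.md` (assumption: the real
/// code embeds the full file with `include_str!`; a literal keeps this sketch
/// compilable without the repository on disk).
const SHARED_PLAN_MODE_PROFILE: &str = "Plan mode is active. The user wants to review an approach before any code is written, so you must NOT implement yet.\n\nTerminal rail: your turn must end by asking the user the smallest necessary question, or by presenting the finished plan for the user's approval.\n";

/// Local redeclaration of the harness selector, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Harness {
    ClaudeCode,
    Codex,
    OpenCode,
}

/// Mirrors the new signature in the diff: no harness parameter, no `Result`.
pub fn resolve_plan_mode_profile() -> &'static str {
    SHARED_PLAN_MODE_PROFILE
}

/// Hypothetical renderer: wraps the profile in a caller-supplied tag.
/// The real tag name is not visible in the diff, so it is a parameter here.
pub fn render_plan_mode_context(tag: &str, profile_text: &str) -> String {
    format!("<{0}>\n{1}\n</{0}>", tag, profile_text.trim_end())
}

/// Hypothetical helper: the harness is accepted but deliberately ignored,
/// which is the point of the shared profile. Returns `None` when
/// `--plan-mode` was not requested.
pub fn plan_mode_context(_harness: Harness, plan_mode: bool, tag: &str) -> Option<String> {
    plan_mode.then(|| render_plan_mode_context(tag, resolve_plan_mode_profile()))
}
```

Calling `plan_mode_context` with `Harness::ClaudeCode`, `Harness::Codex`, and `Harness::OpenCode` in turn yields the same string, which is what the updated tests in `tests/run/` check by asserting on `Plan mode is active` and on the absence of `ExitPlanMode`.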