Conversation
Resolves issue #77 (design spike for the isolated-run epic #74): settles the one decision that blocks the rest of the epic and documents it at docs/isolated-run.md.

Decision (option a): one isolated session runs the whole loop with a switch-condition barrier between condition batches. This preserves in-session transcript resolution via CLAUDE_CODE_SESSION_ID (the #79 invariant) and the singular env/ layout, while delivering real read isolation by removing/swapping the staged skill so the control arm has nothing to read.

Rejects (b) separate sessions (reintroduces the cross-session --subagents-dir dance) and (c) one shared env (detect-stray-writes is blind to staged-copy reads inside env/).

The doc pins down the env/ vs iteration-N/ layout, the condition/dispatch model under Claude's subagent cwd inheritance, where eval-magic meta vs the clean env live, how benchmark.json is written above env/ (trusted binary within guard allowedRoots), and the switch-condition barrier contract. The "validate with one real Claude-interactive run" requirement is captured as a checklist for #78/#79 to execute once env/ exists.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs(isolated-run): resolve design spike & add env/dispatch design note (#77)
feat(isolated-runs): runbook artifact
feat(isolated-runs): create isolated env for eval runs
Three mechanical changes that let the isolated session (cwd = iteration-N/env/) drive the dispatch -> ingest -> finalize loop, per #79 (epic #74).

- command_target_args now threads an absolute --workspace-dir so ingest/finalize (and the upcoming switch-condition) resolve the iteration tree above the env instead of defaulting workspace_root to <cwd>/skills-workspace and bailing.
- Per-task dispatch outputs move into env/.eval-magic/outputs/eval-<id>/<cond>[/run-k]/, so the agent-under-test never writes above its cwd (docs/isolated-run.md §8); run.json/timing.json stay above the env in iteration-N/. record_runs is unchanged: it keys off task.outputs_dir.
- dispatch.json tasks[] are grouped by condition (all cond-A, then all cond-B) so the runbook's per-condition batch flow maps to a top-to-bottom walk of tasks[].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
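For readers following along, here is a minimal sketch of the two layout rules described in this commit: the per-task outputs path under env/ and the condition-grouped ordering of tasks[]. The `Task` struct, `outputs_dir`, and `group_by_condition` are hypothetical names made up for illustration; they are not taken from the eval-magic source, and the real `task.outputs_dir` may be computed differently.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical task record; only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Task {
    pub eval_id: String,
    pub condition: String,
    /// `Some(k)` when an eval has several runs, `None` for a single run.
    pub run: Option<u32>,
}

/// Builds `<env_root>/.eval-magic/outputs/eval-<id>/<cond>[/run-k]`.
/// Everything stays below `env_root`, so the agent never writes above its cwd.
pub fn outputs_dir(env_root: &Path, task: &Task) -> PathBuf {
    let mut dir = env_root
        .join(".eval-magic")
        .join("outputs")
        .join(format!("eval-{}", task.eval_id))
        .join(&task.condition);
    if let Some(k) = task.run {
        dir.push(format!("run-{}", k));
    }
    dir
}

/// Reorders tasks so each condition forms one contiguous batch, in the order
/// given by `condition_order`. Unknown conditions sort last.
pub fn group_by_condition(tasks: &mut [Task], condition_order: &[&str]) {
    // `sort_by_key` is stable, so tasks within one condition keep their
    // original relative order.
    tasks.sort_by_key(|t| {
        condition_order
            .iter()
            .position(|c| *c == t.condition)
            .unwrap_or(condition_order.len())
    });
}
```

With tasks ordered this way, a runbook can walk tasks[] top to bottom and place the barrier at the single point where the condition changes.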
Implements the per-condition read-isolation barrier and reframes the interactive runbook to drive the whole loop from the isolated session, per #79 (epic #74).

- New `eval-magic switch-condition --condition <keep>` subcommand: removes the off-condition's staged skill from env/.claude/skills/ between dispatch batches so the next batch cannot read it. Surgical (only the one slug from conditions.json, not cleanup_staged_skills), idempotent, and leaves the sibling guard marker (and an armed guard) intact. Resolves the iteration from --workspace-dir so it runs from cwd=env/. Uniform across new-skill (remove with_skill) and revision (remove old_skill) since both arms stage at run time.
- Refactor the in-session dispatch guidance into shared per-condition fragments (insession_dispatch_batch / insession_switch_command / insession_ingest_command) used by both the runbook tokens and the post-run "Next:" message, so they cannot drift.
- Reframe profiles/claude-code/runbook.md into the batch loop: dispatch cond-A → join → switch-condition → dispatch cond-B → join → ingest → judges → finalize → teardown.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
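As an illustration of the "surgical and idempotent" contract described above, the following is a minimal sketch of removing one staged skill directory. `remove_staged_skill` is a hypothetical helper, not the actual switch-condition implementation; the slug validation is an extra safety assumption of this sketch, and resolving the slug from conditions.json is left out.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes `<env_root>/.claude/skills/<slug>` and nothing else.
/// Returns `Ok(true)` if the directory was removed and `Ok(false)` if it was
/// already absent, so calling it twice in a row is harmless.
pub fn remove_staged_skill(env_root: &Path, slug: &str) -> io::Result<bool> {
    // Assumption of this sketch: a slug is a single path component. Rejecting
    // anything else keeps the removal from escaping the skills directory.
    let is_plain_name = !slug.is_empty()
        && slug != "."
        && slug != ".."
        && !slug.contains(|c: char| c == '/' || c == '\\');
    if !is_plain_name {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "slug must be a single path component",
        ));
    }

    let skill_dir = env_root.join(".claude").join("skills").join(slug);
    match fs::remove_dir_all(&skill_dir) {
        Ok(()) => Ok(true),
        // Already gone: treat as success so the barrier can be re-run.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

Because only the one slug directory is touched, sibling entries under env/.claude/skills/ and anything next to it are left as they were.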
Keep docs/isolated-run.md evergreen with #79:

- §2 Status: #79 has landed: the isolated session drives the whole loop, dispatch outputs live in env/.eval-magic/outputs/, and commands carry an absolute --workspace-dir to resolve from cwd=env/.
- §2 layout table: add the env/.eval-magic/outputs/ row.
- §4: switch-condition removes the off-condition slug (uniform across modes; both arms stage at run time), does not call cleanup_staged_skills, and does not re-arm the guard, superseding the earlier "in-place content swap" sketch.
- §6: the design assumptions are confirmed by the dogfood run.

Holistic README/--help pass for the isolated-run UX is deferred to #84.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Drop the historical "used to materialize as a side effect" narrative and the duplicated meta/env-split note; keep the evergreen rationale.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
feat(isolated-runs): full-loop handoff for the isolated session (#79)
The env (incl. env/.claude/skills/) is now built before the isolated session starts in it, so Claude Code's file-watcher discovery hazard is gone structurally. Remove the workarounds it spawned:

- Drop staging_discovery_warning and staging_plugin_shadow_action (and the skills_dir_preexisted plumbing they were the only consumers of). The plain plugin-shadow banner stays; plugin shadowing is independent of staging.
- Replace the printed dispatch loop (insession_dispatch_next_steps) with a clean handoff (insession_isolated_handoff): the run summary now points the user to cd into env/, start a fresh session, and "Read and follow RUNBOOK.md". The full dispatch → switch-condition → ingest → finalize loop lives only in RUNBOOK.md now.
- Reconcile --no-stage: clarify it still runs inside the isolated env/ and only skips populating the harness skills dir (no behavior change).
- Reframe docs (docs/isolated-run.md, README Claude Code section, --help) to the isolated-env flow; delete the obsolete "Same-session staging gotcha".

Verified: cargo test (0 failures), fmt, clippy -D warnings, zero dead refs.

Closes #80

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-juggling feat(isolated-runs): retire the session-juggling apparatus (#80)
…uard feat(isolated-runs): focus guard flag for new isolated envs
feat(claude): hybrid run mode support
feat(claude): headless run mode support
docs(isolated-runs): encapsulation bug fixes and docs update
Decide at run-setup which evals can share an environment and which need isolation, and write the batching plan into dispatch.json so the executing session does no isolation reasoning itself.

Grouping (src/cli/run/grouping.rs) is deterministic greedy first-fit: fixture conflicts auto-split into separate groups, an eval's new `isolation: isolated` hint forces its own singleton, and everything else shares one group.

Realization is split by dispatch mechanism: in-session keeps one env/ and swaps groups via a new `reset-batch` barrier (full wipe + re-seed) with conditions still on switch-condition; Cli (hybrid/headless) materializes one env per (group, condition), which also closes a latent gap where the control arm's Cli env physically contained the staged skill. Single-group in-session runs stay byte-identical to the pre-grouping shape (no groups/group/eval_root keys, bare env/, unchanged runbook).

dispatch.json gains a groups[] summary plus per-task group/eval_root; the Cli recipes read a per-task eval_root. docs/isolated-run.md (§1-§4, §8), README, and --help updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
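The grouping rule reads roughly like the sketch below. This is an illustration of deterministic greedy first-fit, not the code in src/cli/run/grouping.rs. `EvalSpec`, `plan_groups`, and in particular the conflict definition (two evals conflict when they seed the same fixture path from different sources) are assumptions made for the example; the real conflict test may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical eval description for this sketch.
#[derive(Debug, Clone)]
pub struct EvalSpec {
    pub id: String,
    /// Fixture destination path -> fixture source (assumed representation).
    pub fixtures: BTreeMap<String, String>,
    /// Stands in for the `isolation: isolated` hint.
    pub isolated: bool,
}

#[derive(Debug, Default)]
struct Group {
    evals: Vec<String>,
    fixtures: BTreeMap<String, String>,
    /// A closed group accepts no further evals (used for isolated singletons).
    closed: bool,
}

/// Assumed conflict rule: same destination path, different source.
fn conflicts(group: &Group, eval: &EvalSpec) -> bool {
    eval.fixtures
        .iter()
        .any(|(path, src)| group.fixtures.get(path).map_or(false, |existing| existing != src))
}

/// Greedy first-fit: each eval joins the first open group it does not
/// conflict with, otherwise it opens a new group. Input order decides the
/// result, which is what makes the plan deterministic.
pub fn plan_groups(evals: &[EvalSpec]) -> Vec<Vec<String>> {
    let mut groups: Vec<Group> = Vec::new();
    for eval in evals {
        let slot = if eval.isolated {
            None // isolated evals never share a group
        } else {
            groups.iter().position(|g| !g.closed && !conflicts(g, eval))
        };
        let idx = match slot {
            Some(i) => i,
            None => {
                groups.push(Group { closed: eval.isolated, ..Group::default() });
                groups.len() - 1
            }
        };
        let group = &mut groups[idx];
        group.evals.push(eval.id.clone());
        for (path, src) in &eval.fixtures {
            group.fixtures.insert(path.clone(), src.clone());
        }
    }
    groups.into_iter().map(|g| g.evals).collect()
}
```

Under these assumptions, a run with no fixture conflicts and no isolation hints yields exactly one group, which matches the single-group shape described above.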
feat(run): setup-time isolation grouping for multi-run batches (#90)
…ow (#99)

The isolation-grouping feature (#90/#98) moved the Cli (hybrid/headless) dispatch path from a single `env/` to one env per `(group, condition)`, but left two best-effort mechanisms assuming the old single `env/`. Both are now scoped to the Cli mechanism; the in-session single-`env/` path is unchanged.

- Plugin-shadow preflight: `post_build` scanned `ctx.stage_root` (`env/`), which Cli never creates, so project-local `.claude/settings.json` enabledPlugins were unseen. Hoist the existing `env_targets` and scan the first staged env instead (in-session's first target is `env/`, so byte-identical there).
- teardown/finalize guard checks were cwd-scoped. Add `staged_env_roots` (a directory scan over the iteration dir) and, under Cli, have `teardown` disarm each per-env guard before reclaim and `finalize` detect an armed env guard. The reminder now points at `eval-magic teardown` (cwd-only `teardown-guard` can't reach per-env markers).

Updates the `--guard`/`finalize` --help text and docs/isolated-run.md §4.

Tests: cli_plugin_shadow_preflight_reads_per_env_project_settings, teardown_disarms_per_group_condition_cli_guards, finalize_warns_about_armed_cli_per_env_guard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
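To make the "directory scan over the iteration dir" concrete, here is a rough sketch of what such a scan could look like. It is not the real `staged_env_roots`: `scan_staged_envs` and `armed_guards` are hypothetical names, the rule "a child directory counts as a staged env if it contains `.claude/`" is an assumption, and the guard marker path is passed in as a parameter because its actual name is not stated in this PR.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists the immediate subdirectories of `iteration_dir` that look like
/// staged envs. Assumption: an env is any child directory with a `.claude/`
/// directory inside it. The result is sorted so callers see a stable order.
pub fn scan_staged_envs(iteration_dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut roots = Vec::new();
    for entry in fs::read_dir(iteration_dir)? {
        let path = entry?.path();
        if path.is_dir() && path.join(".claude").is_dir() {
            roots.push(path);
        }
    }
    roots.sort();
    Ok(roots)
}

/// Filters env roots down to those whose guard marker still exists.
/// `marker_rel` is the marker path relative to an env root (hypothetical;
/// supplied by the caller).
pub fn armed_guards(env_roots: &[PathBuf], marker_rel: &Path) -> Vec<PathBuf> {
    env_roots
        .iter()
        .filter(|root| root.join(marker_rel).exists())
        .cloned()
        .collect()
}
```

A teardown step could then iterate over `armed_guards(...)` and disarm each env in turn, while a finalize step only needs to check whether the returned list is non-empty before warning.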
fix(run): walk per-(group,condition) Cli envs for guard + plugin-shadow (#99)
The isolated-runs epic (#74) has shipped: env builder (#78), full-loop handoff (#79), session-juggling retirement (#80), guard re-evaluation (#81), isolation grouping (#90), and Cli multi-env follow-ups (#99) are all closed. The doc was development guidance; its evergreen operating contract already lives in README.md (the run loop, Isolation grouping, the "Discovery is structural now" watcher note, and guard behavior).

- Remove the file's 8 references (README ×3, the switch-condition / reset-batch --help text in args.rs, the runbook.rs module note, and two build.rs comments) so deletion leaves no dangling pointers.
- File the doc's only untracked future-work item, filesystem-level isolation (mount namespaces / overlay / chroot), as #102, linked to epic #74.
- Design rationale (spike alternatives, validation checklist) stays in the closed spikes #77/#90, git history, and code/tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
chore(docs): retire isolated-run.md (#100)
fix(codex): fix flag order
…magic chore(cli): rename artifacts directories
slowdini added a commit that referenced this pull request Jun 21, 2026
Merge pull request #107 from slowdini/dev