Release v0.4.0 by slowdini · Pull Request #107 · slowdini/eval-magic

slowdini · 2026-06-21T20:36:25Z

Release notes

Replace this paragraph with a short narrative for the release. If left unchanged, GitHub's auto-generated notes will be used instead.

Resolves issue #77 (design spike for the isolated-run epic #74): fixes the one decision that blocks the rest of the epic and documents it at docs/isolated-run.md.

Decision (option a): one isolated session runs the whole loop with a switch-condition barrier between condition batches. This preserves in-session transcript resolution via CLAUDE_CODE_SESSION_ID (the #79 invariant) and the singular env/ layout, while delivering real read isolation by removing/swapping the staged skill so the control arm has nothing to read. Rejects (b) separate sessions (reintroduces the cross-session --subagents-dir dance) and (c) one shared env (detect-stray-writes is blind to staged-copy reads inside env/).

The doc fixes the env/ vs iteration-N/ layout, the condition/dispatch model under Claude's subagent cwd inheritance, where eval-magic meta vs the clean env live, how benchmark.json is written above env/ (trusted binary within guard allowedRoots), and the switch-condition barrier contract. The "validate with one real Claude-interactive run" requirement is captured as a checklist for #78/#79 to execute once env/ exists.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

docs(isolated-run): resolve design spike & add env/dispatch design note (#77)

feat(isolated-runs): runbook artifact

feat(isolated-runs): create isolated env for eval runs

Three mechanical changes that let the isolated session (cwd = iteration-N/env/) drive the dispatch -> ingest -> finalize loop, per #79 (epic #74).

- command_target_args now threads an absolute --workspace-dir so ingest/finalize (and the upcoming switch-condition) resolve the iteration tree above the env instead of defaulting workspace_root to <cwd>/skills-workspace and bailing.
- Per-task dispatch outputs move into env/.eval-magic/outputs/eval-<id>/<cond>[/run-k]/, so the agent-under-test never writes above its cwd (docs/isolated-run.md §8); run.json/timing.json stay above the env in iteration-N/. record_runs is unchanged; it keys off task.outputs_dir.
- dispatch.json tasks[] are grouped by condition (all cond-A, then all cond-B) so the runbook's per-condition batch flow maps to a top-to-bottom walk of tasks[].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
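The output layout and task ordering described above can be sketched roughly as follows. This is an illustrative sketch only, not the project's code: the `Task` struct, its fields, and the `outputs_dir` / `group_by_condition` helpers are hypothetical names, and the real dispatch.json schema may differ.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical task record; the real dispatch.json schema may differ.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    eval_id: String,
    condition: String,
    run: Option<u32>,
}

/// Build <env>/.eval-magic/outputs/eval-<id>/<cond>[/run-k] for one task,
/// so everything the agent-under-test writes stays below the env root.
fn outputs_dir(env_root: &Path, task: &Task) -> PathBuf {
    let mut dir = env_root
        .join(".eval-magic")
        .join("outputs")
        .join(format!("eval-{}", task.eval_id))
        .join(&task.condition);
    if let Some(k) = task.run {
        dir.push(format!("run-{k}"));
    }
    dir
}

/// Order tasks so that all tasks of the first condition precede all tasks of
/// the second. The sort is stable, so order within a condition is preserved.
/// Conditions not listed in `condition_order` sort last.
fn group_by_condition(mut tasks: Vec<Task>, condition_order: &[&str]) -> Vec<Task> {
    tasks.sort_by_key(|t| {
        condition_order
            .iter()
            .position(|c| *c == t.condition)
            .unwrap_or(condition_order.len())
    });
    tasks
}
```

With tasks ordered this way, a reader of tasks[] can process one condition's batch, hit the barrier, and continue with the next batch without reordering anything.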

Implements the per-condition read-isolation barrier and reframes the interactive runbook to drive the whole loop from the isolated session, per #79 (epic #74).

- New `eval-magic switch-condition --condition <keep>` subcommand: removes the off-condition's staged skill from env/.claude/skills/ between dispatch batches so the next batch cannot read it. Surgical (only the one slug from conditions.json, not cleanup_staged_skills), idempotent, and leaves the sibling guard marker (and an armed guard) intact. Resolves the iteration from --workspace-dir so it runs from cwd=env/. Uniform across new-skill (remove with_skill) and revision (remove old_skill) since both arms stage at run time.
- Refactor the in-session dispatch guidance into shared per-condition fragments (insession_dispatch_batch / insession_switch_command / insession_ingest_command) used by both the runbook tokens and the post-run "Next:" message, so they cannot drift.
- Reframe profiles/claude-code/runbook.md into the batch loop: dispatch cond-A → join → switch-condition → dispatch cond-B → join → ingest → judges → finalize → teardown.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
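A minimal sketch of the "surgical, idempotent" removal described above, assuming the staged skill is a directory named after its slug under env/.claude/skills/. The function name `remove_staged_skill` is hypothetical and this is not the actual switch-condition implementation; it only illustrates removing one slug while leaving siblings untouched and treating an already-absent directory as success.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove exactly one staged skill directory, <env>/.claude/skills/<slug>.
/// Returns Ok(true) if it was removed and Ok(false) if it was already absent,
/// so calling it twice is harmless. Sibling entries are never touched.
fn remove_staged_skill(env_root: &Path, slug: &str) -> io::Result<bool> {
    let skill_dir = env_root.join(".claude").join("skills").join(slug);
    match fs::remove_dir_all(&skill_dir) {
        Ok(()) => Ok(true),
        // Already gone: treat as success to keep the barrier idempotent.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        Err(e) => Err(e),
    }
}
```

A real command would presumably also validate the slug against conditions.json before touching the filesystem; that step is omitted here.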

Keep docs/isolated-run.md evergreen with #79:

- §2 Status: #79 has landed. The isolated session drives the whole loop, dispatch outputs live in env/.eval-magic/outputs/, and commands carry an absolute --workspace-dir to resolve from cwd=env/.
- §2 layout table: add the env/.eval-magic/outputs/ row.
- §4: switch-condition removes the off-condition slug (uniform across modes; both arms stage at run time), does not call cleanup_staged_skills, and does not re-arm the guard, superseding the earlier "in-place content swap" sketch.
- §6: the design assumptions are confirmed by the dogfood run.

Holistic README/--help pass for the isolated-run UX is deferred to #84.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Drop the historical "used to materialize as a side effect" narrative and the duplicated meta/env-split note; keep the evergreen rationale.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

feat(isolated-runs): full-loop handoff for the isolated session (#79)

The env (incl. env/.claude/skills/) is now built before the isolated session starts in it, so Claude Code's file-watcher discovery hazard is gone structurally. Remove the workarounds it spawned:

- Drop staging_discovery_warning and staging_plugin_shadow_action (and the skills_dir_preexisted plumbing they were the only consumers of). The plain plugin-shadow banner stays; plugin shadowing is independent of staging.
- Replace the printed dispatch loop (insession_dispatch_next_steps) with a clean handoff (insession_isolated_handoff): the run summary now points the user to cd into env/, start a fresh session, and "Read and follow RUNBOOK.md". The full dispatch → switch-condition → ingest → finalize loop lives only in RUNBOOK.md now.
- Reconcile --no-stage: clarify it still runs inside the isolated env/ and only skips populating the harness skills dir (no behavior change).
- Reframe docs (docs/isolated-run.md, README Claude Code section, --help) to the isolated-env flow; delete the obsolete "Same-session staging gotcha".

Verified: cargo test (0 failures), fmt, clippy -D warnings, zero dead refs.

Closes #80

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…-juggling feat(isolated-runs): retire the session-juggling apparatus (#80)

…uard feat(isolated-runs): focus guard flag for new isolated envs

feat(claude): hybrid run mode support

feat(claude): headless run mode support

docs(isolated-runs): encapsulation bug fixes and docs update

Decide at run-setup which evals can share an environment and which need isolation, and write the batching plan into dispatch.json so the executing session does no isolation reasoning itself.

Grouping (src/cli/run/grouping.rs) is deterministic greedy first-fit: fixture conflicts auto-split into separate groups, an eval's new `isolation: isolated` hint forces its own singleton, and everything else shares one group.

Realization is split by dispatch mechanism: in-session keeps one env/ and swaps groups via a new `reset-batch` barrier (full wipe + re-seed) with conditions still on switch-condition; Cli (hybrid/headless) materializes one env per (group, condition), which also closes a latent gap where the control arm's Cli env physically contained the staged skill. Single-group in-session runs stay byte-identical to the pre-grouping shape (no groups/group/eval_root keys, bare env/, unchanged runbook).

dispatch.json gains a groups[] summary plus per-task group/eval_root; the Cli recipes read a per-task eval_root. docs/isolated-run.md (§1-§4, §8), README, and --help updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
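The greedy first-fit grouping described above might look roughly like the sketch below. It is not the code in src/cli/run/grouping.rs: the `Eval` struct, its fields, and `group_evals` are hypothetical, and it assumes that two evals "conflict" whenever they claim the same fixture path, which may be stricter or looser than the real conflict rule.

```rust
use std::collections::BTreeSet;

/// Hypothetical eval description; field names are illustrative only.
struct Eval {
    id: String,
    /// Paths this eval's fixtures would occupy in the environment (assumed).
    fixtures: BTreeSet<String>,
    /// Mirrors the `isolation: isolated` hint.
    isolated: bool,
}

/// Greedy first-fit: walk evals in input order and put each one into the
/// first shared group whose fixtures do not overlap with its own. An
/// isolated eval always opens a new group that nothing else may join.
/// Returns the eval ids per group, in group creation order.
fn group_evals(evals: &[Eval]) -> Vec<Vec<String>> {
    struct Group {
        ids: Vec<String>,
        fixtures: BTreeSet<String>,
        sealed: bool, // true for isolated singletons
    }

    let mut groups: Vec<Group> = Vec::new();
    for eval in evals {
        let slot = if eval.isolated {
            None
        } else {
            groups
                .iter()
                .position(|g| !g.sealed && g.fixtures.is_disjoint(&eval.fixtures))
        };
        match slot {
            Some(i) => {
                groups[i].ids.push(eval.id.clone());
                groups[i].fixtures.extend(eval.fixtures.iter().cloned());
            }
            None => groups.push(Group {
                ids: vec![eval.id.clone()],
                fixtures: eval.fixtures.clone(),
                sealed: eval.isolated,
            }),
        }
    }
    groups.into_iter().map(|g| g.ids).collect()
}
```

Because the walk follows input order and uses no hashing or randomness, the same eval list always yields the same groups, which matches the "deterministic" property the commit message calls out.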

feat(run): setup-time isolation grouping for multi-run batches (#90)

…ow (#99)

The isolation-grouping feature (#90/#98) moved the Cli (hybrid/headless) dispatch path from a single `env/` to one env per `(group, condition)`, but left two best-effort mechanisms assuming the old single `env/`. Both are now scoped to the Cli mechanism; the in-session single-`env/` path is unchanged.

- Plugin-shadow preflight: `post_build` scanned `ctx.stage_root` (`env/`), which Cli never creates, so project-local `.claude/settings.json` enabledPlugins were unseen. Hoist the existing `env_targets` and scan the first staged env instead (in-session's first target is `env/`, so byte-identical there).
- teardown/finalize guard checks were cwd-scoped. Add `staged_env_roots` (a directory scan over the iteration dir) and, under Cli, have `teardown` disarm each per-env guard before reclaim and `finalize` detect an armed env guard. The reminder now points at `eval-magic teardown` (cwd-only `teardown-guard` can't reach per-env markers).

Updates the `--guard`/`finalize` --help text and docs/isolated-run.md §4.

Tests:

- cli_plugin_shadow_preflight_reads_per_env_project_settings
- teardown_disarms_per_group_condition_cli_guards
- finalize_warns_about_armed_cli_per_env_guard

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix(run): walk per-(group,condition) Cli envs for guard + plugin-shadow (#99)

The isolated-runs epic (#74) has shipped: env builder (#78), full-loop handoff (#79), session-juggling retirement (#80), guard re-evaluation (#81), isolation grouping (#90), and Cli multi-env follow-ups (#99) are all closed. The doc was development guidance; its evergreen operating contract already lives in README.md (the run loop, Isolation grouping, the "Discovery is structural now" watcher note, and guard behavior).

- Remove the file's 8 references (README ×3, the switch-condition / reset-batch --help text in args.rs, the runbook.rs module note, and two build.rs comments) so deletion leaves no dangling pointers.
- File the doc's only untracked future-work item, filesystem-level isolation (mount namespaces / overlay / chroot), as #102, linked to epic #74.
- Design rationale (spike alternatives, validation checklist) stays in the closed spikes #77/#90, git history, and code/tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

chore(docs): retire isolated-run.md (#100)

fix(codex): fix flag order

…magic chore(cli): rename artifacts directories

Merge pull request #107 from slowdini/dev

slowdini and others added 30 commits June 18, 2026 20:43

Merge pull request #88 from slowdini/docs/isolated-run-design-spike

48efa1f

docs(isolated-run): resolve design spike & add env/dispatch design note (#77)

feat(isolated-runs): runbook artifact

d27c492

Merge pull request #89 from slowdini/feat/runbook-artifact

eb807f8

feat(isolated-runs): runbook artifact

feat(isolated-runs): create isolated env for eval runs

cbad44d

Merge pull request #91 from slowdini/feat/isolated-run-env-builder

078bfa1

feat(isolated-runs): create isolated env for eval runs

chore(isolated-runs): tighten the per-run output dir comments

660f999

Drop the historical "used to materialize as a side effect" narrative and the duplicated meta/env-split note; keep the evergreen rationale.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Merge pull request #92 from slowdini/feat/isolated-run-full-loop

492b0d5

feat(isolated-runs): full-loop handoff for the isolated session (#79)

Merge pull request #93 from slowdini/feat/isolated-run-retire-session…

8313b9d

…-juggling feat(isolated-runs): retire the session-juggling apparatus (#80)

feat(isolated-runs): focus guard flag for new isolated envs

9d8b8b2

Merge pull request #94 from slowdini/feat/isolated-run-reeval-write-g…

c5705b4

…uard feat(isolated-runs): focus guard flag for new isolated envs

feat(claude): hybrid run mode support

d55bb72

Merge pull request #95 from slowdini/feat/claude-hybrid-dispatch

428ddde

feat(claude): hybrid run mode support

feat(claude): headless run mode support

21ec5bb

Merge pull request #96 from slowdini/feat/headless-run-mode

f7da791

feat(claude): headless run mode support

docs(isolated-runs): encapsulation bug fixes and docs update

1b6c278

Merge pull request #97 from slowdini/docs/isolated-run-final

c47f07d

docs(isolated-runs): encapsulation bug fixes and docs update

Merge pull request #98 from slowdini/feat/isolation-groups

ec385fd

feat(run): setup-time isolation grouping for multi-run batches (#90)

Merge pull request #101 from slowdini/fix/cli-multienv-guard-shadow

9850c3a

fix(run): walk per-(group,condition) Cli envs for guard + plugin-shadow (#99)

Merge pull request #103 from slowdini/chore/retire-isolated-run-doc

d6a4686

chore(docs): retire isolated-run.md (#100)

fix(codex): fix flag order

5cdcae0

Merge pull request #105 from slowdini/fix/codex-approval-flag-order

767bcd1

fix(codex): fix flag order

chore(cli): rename artifacts directories

f636bb7

slowdini and others added 3 commits June 21, 2026 16:35

Merge pull request #106 from slowdini/refactor/workspace-dir-to-eval-…

5fc8cde

…magic chore(cli): rename artifacts directories

chore: bump version to 0.4.0

8770e5e

Merge branch 'main' into dev

aee65ea

slowdini merged commit 5e5aba7 into main Jun 21, 2026
6 checks passed

slowdini added a commit that referenced this pull request Jun 21, 2026

Merge pull request #108 from slowdini/main

de27ec6

Merge pull request #107 from slowdini/dev

slowdini merged 33 commits into main from dev