test: isolated onboarding harness for published @aoagents/ao by i-trytoohard · Pull Request #2049 · AgentWrapper/agent-orchestrator

i-trytoohard · 2026-05-23T18:37:11Z

What

A local end-to-end onboarding test for a published @aoagents/ao npm version — npm install -g → ao start → dashboard serves → ao stop — kept completely isolated from any real ao install/runtime on the machine.

Deliverables

scripts/test-onboarding.sh — configurable: --version latest|nightly|exact, --mode fresh|coexist, --port (auto if omitted), --keep. Clear pass/fail, non-zero exit on failure.
skills/onboarding-test/SKILL.md (+ README entry) — any agent can onboarding-test its own PR/version; documents both modes, isolation guarantees, cleanup.

The GitHub Actions workflow was dropped at the maintainer's request — this is run locally / on demand. The SKILL documents the one-line CI step for wiring it up later. Per-PR source-build onboarding remains covered by the existing "Test Fresh Onboarding" job.

Isolation (all enforced by the script)

Temp npm --prefix first on PATH — real global modules untouched.
Sandbox HOME → separate ~/.agent-orchestrator; sandbox ao stop only ever sees the sandbox running.json.
Auto-allocated free dashboard + terminal WS ports; refuses reserved 3001/14800/14801.
runtime: process (no shared tmux server) — cannot see/kill real tmux sessions.
Throwaway git init repo; non-interactive (AO_CALLER_TYPE=agent + --no-orchestrator, no LLM key needed).
One temp root, removed on trap EXIT (success/failure/interrupt): graceful sandbox ao stop --all, kill recorded daemon pid tree, pkill -f <unique-temp-root> safety net, rm -rf.

coexist mode additionally snapshots the real running.json + tmux ls + listening ports before/after and fails loudly if anything moved.

Two modes

fresh — canonical clean onboarding (clean CI / new machine).
coexist — machine already runs ao; proves zero change to the real install.

Validation (run locally, both modes)

--version 0.8.0 --mode coexist → PASS: dashboard /api/sessions 200, clean stop, and real ao state IDENTICAL before/after (running.json, tmux, ports).
--version 0.8.0 --mode fresh → PASS.
--version nightly --mode fresh → PASS.
Real daemon (pid 49919, port 3001, 5 tmux sessions) verified alive and unchanged after every run; no leftover temp dirs.

Bug the harness caught 🐛

--version latest (0.9.1) → FAIL: a fresh install returns HTTP 500 on the dashboard because the bundled agent-grok plugin throws Cannot find module '../package.json'. This is exactly what cdd1030d (#2040) fixes on main — but that fix is not in the published 0.9.1 release. The harness surfaced a real, ship-blocking onboarding regression in the current published package.

🤖 Generated with Claude Code

Adds a local end-to-end onboarding test for a PUBLISHED npm version of ao: npm install -g → ao start → dashboard serves → ao stop, fully isolated from any real ao install/runtime on the machine. - scripts/test-onboarding.sh: configurable (--version latest|nightly|exact, --mode fresh|coexist, --port, --keep). Isolates via temp npm --prefix, sandbox HOME, auto free ports, runtime: process (no shared tmux), throwaway git repo, AO_CALLER_TYPE=agent, single temp root cleaned on trap EXIT. coexist mode snapshots real running.json + tmux + listening ports before/after and fails if perturbed. - skills/onboarding-test/SKILL.md (+ README entry): lets any agent onboarding-test its own PR/version, documenting both modes, isolation guarantees, and cleanup. Run locally / on demand (not gated per-PR — the published package, not the PR's source, is what it exercises). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-23T18:38:23Z

Test Coverage Report

No TypeScript source files changed in this PR.

greptile-apps · 2026-05-23T18:40:31Z

Greptile Summary

This PR adds a local, on-demand end-to-end onboarding harness that installs a published @aoagents/ao npm version into an isolated temp prefix, starts the dashboard, verifies it serves, and stops cleanly — with an accompanying SKILL.md so agents can invoke it autonomously.

scripts/test-onboarding.sh: New script with two modes (fresh / coexist) and a clean isolation model (sandbox HOME, temp npm prefix, process runtime, throwaway git repo, single trap EXIT cleanup root).
skills/onboarding-test/SKILL.md + skills/README.md: Documents the harness, both modes, isolation guarantees, and how to test unmerged branches via verdaccio.

Confidence Score: 4/5

Safe to merge with a fix to the coexist lsof snapshot — the current implementation will produce spurious failures on any active development machine.

The coexist mode snapshots all listening TCP ports system-wide and diffs them before/after. On an active development machine — the explicit intended audience for coexist — any unrelated service binding or releasing a port during the test window will register as an isolation breach, making the mode unreliable in its primary use case.

scripts/test-onboarding.sh — specifically the lsof snapshot in snapshot_real and the diff -u comparison in the coexist block.

Important Files Changed

Filename	Overview
scripts/test-onboarding.sh	New E2E onboarding test harness. Solid isolation design; coexist mode's system-wide lsof snapshot will produce false positives from unrelated TCP port activity on busy machines.
skills/onboarding-test/SKILL.md	New skill document describing the onboarding test harness — well-written, accurately reflects the script's behavior and isolation guarantees.
skills/README.md	Adds a single table row registering the new onboarding-test skill — no issues.

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
scripts/test-onboarding.sh:183-187
**`coexist` lsof snapshot will produce false positives from unrelated port activity**

The before/after diff covers every TCP LISTEN socket on the entire machine. On an active development machine — the exact use case `coexist` mode is designed for — any other service that binds or releases a port during the ~15–30 s test window (a restarting dev server, a Docker container, `npx` ephemeral installs, etc.) will change the snapshot and fail the check. The false positive cannot be distinguished from a real isolation breach. Filtering the `lsof` snapshot to only ao's specific ports would eliminate the false-positive surface while still catching any actual sandbox leakage.

### Issue 2 of 2
scripts/test-onboarding.sh:79-82
`--port` accepts any string but only guards against three literal reserved values. Passing a non-numeric value (e.g. `--port foo`) silently slips through validation and produces a cryptic failure deep inside `ao start` rather than a clear argument error at the entry point.

```suggestion
case "$MODE" in
  fresh|coexist) ;;
  *) echo "Invalid --mode: $MODE (want fresh|coexist)" >&2; exit 2 ;;
esac

if [ -n "$FIXED_PORT" ]; then
  case "$FIXED_PORT" in
    ''|*[!0-9]*) echo "--port must be a positive integer, got: $FIXED_PORT" >&2; exit 2 ;;
  esac
fi
```

_{Reviews (2): Last reviewed commit: "test: support testing a branch via --reg..." | Re-trigger Greptile}

greptile-apps · 2026-05-23T18:40:34Z

+free_port() {
+  node -e 'const n=require("net");const s=n.createServer();s.listen(0,"127.0.0.1",()=>{const p=s.address().port;s.close(()=>console.log(p));});'
+}


Port TOCTOU and missing duplicate check

free_port() asks the kernel for a free port, closes the server immediately, then returns the number. The port is now unbound — another process on a busy machine can claim it before ao start binds to it. More critically, the three sequential free_port() calls are not checked for duplicates: if the kernel recycles the just-freed port on the next call, PORT, TERM_PORT, and DIRECT_TERM_PORT could collide. ao start would then try to bind two identical ports, fail to start, and the test would produce a misleading "daemon never appeared" error rather than a clear allocation failure.

Prompt To Fix With AI

This is a comment left during a code review. Path: scripts/test-onboarding.sh Line: 151-153 Comment: **Port TOCTOU and missing duplicate check** `free_port()` asks the kernel for a free port, closes the server immediately, then returns the number. The port is now unbound — another process on a busy machine can claim it before `ao start` binds to it. More critically, the three sequential `free_port()` calls are not checked for duplicates: if the kernel recycles the just-freed port on the next call, `PORT`, `TERM_PORT`, and `DIRECT_TERM_PORT` could collide. `ao start` would then try to bind two identical ports, fail to start, and the test would produce a misleading "daemon never appeared" error rather than a clear allocation failure. How can I resolve this? If you propose a fix, please make it concise.

greptile-apps · 2026-05-23T18:40:37Z

+step "Verifying clean shutdown"
+for _ in $(seq 1 40); do
+  kill -0 "$RUNNING_PID" 2>/dev/null || break
+  sleep 0.5
+done
+kill -0 "$RUNNING_PID" 2>/dev/null && die "daemon pid $RUNNING_PID still alive after stop"
+[ -f "$SANDBOX_RUNNING" ] && die "sandbox running.json still present after stop"
+curl -sf "http://127.0.0.1:$PORT/api/sessions" >/dev/null 2>&1 \
+  && die "dashboard still serving on $PORT after stop" || true
+ok "Sandbox daemon stopped, running.json cleared, port released"


kill -0 succeeds for zombie processes

After ao stop signals the daemon, the daemon can briefly enter the zombie state (exited but not yet reaped by its parent). kill -0 <zombie_pid> returns 0 on Linux until the parent calls wait(), so the poll loop runs all 40 iterations (20 s) before the final kill -0 check at line 329 can then pass or fail depending on reap timing. On fast machines the daemon's parent is init/systemd (since it daemonized), which reaps zombies quickly, but on slower machines this can add unexpected wall-clock time to the test.

Prompt To Fix With AI

This is a comment left during a code review. Path: scripts/test-onboarding.sh Line: 324-333 Comment: **`kill -0` succeeds for zombie processes** After `ao stop` signals the daemon, the daemon can briefly enter the zombie state (exited but not yet reaped by its parent). `kill -0 <zombie_pid>` returns 0 on Linux until the parent calls `wait()`, so the poll loop runs all 40 iterations (20 s) before the final `kill -0` check at line 329 can then pass or fail depending on reap timing. On fast machines the daemon's parent is `init`/`systemd` (since it daemonized), which reaps zombies quickly, but on slower machines this can add unexpected wall-clock time to the test. How can I resolve this? If you propose a fix, please make it concise.

…route Adds --registry <url> so an unmerged branch published to a throwaway local registry (verdaccio) can be onboarding-tested with full packaging fidelity, keeping every isolation guarantee. Documents that merged commits are already covered by the auto-published per-commit nightly (--version 0.x.y-nightly-<sha>), and why a source/npm-link build cannot reproduce packaging bugs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: isolated onboarding harness for published @aoagents/ao#2049

test: isolated onboarding harness for published @aoagents/ao#2049
i-trytoohard wants to merge 2 commits into
mainfrom
test/onboarding-harness

i-trytoohard commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026

Uh oh!

greptile-apps Bot commented May 23, 2026 •

edited

Loading

Important Files Changed

Uh oh!

greptile-apps Bot May 23, 2026

Uh oh!

Uh oh!

greptile-apps Bot May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

i-trytoohard commented May 23, 2026

What

Deliverables

Isolation (all enforced by the script)

Two modes

Validation (run locally, both modes)

Bug the harness caught 🐛

Uh oh!

github-actions Bot commented May 23, 2026

Test Coverage Report

Uh oh!

greptile-apps Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

greptile-apps Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

greptile-apps Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 23, 2026 •

edited

Loading