Skip to content

feat(ao): land the end-to-end agent-spawn + SCM-observe chain#73

Closed
harshitsinghbhandari wants to merge 24 commits into
mainfrom
staging
Closed

feat(ao): land the end-to-end agent-spawn + SCM-observe chain#73
harshitsinghbhandari wants to merge 24 commits into
mainfrom
staging

Conversation

@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator

Summary

Lands the full LCM↔SCM end-to-end chain on main. After this merges, ao spawn --prompt "<task>" --project <id> boots a real Claude Code agent in a real git worktree inside a real zellij pane, and the daemon automatically observes the PR the agent opens — CI failures, merge conflicts, and unresolved review comments arrive in the agent's workspace as <workspace>/.ao/inbox/<timestamp>.md files. The chain was verified end-to-end locally: registered project → spawned session → agent created a file on demand.

main already has PR #69 (the SCM provider, squashed). This PR brings everything around it: the agent adapters, the SM wiring, the agent shim, the inbox messenger, the SCM poller, the HTTP route, the ao spawn CLI, and the convenience helper script.

What's in this PR

Agent adapters (was PR #65, @yyovil — superseded by this PR)

  • backend/internal/adapters/agent/ — the rich agent.Agent interface, the Plugin registry, Claude Code and Codex implementations with native hooks
  • docs/agent/README.md, .envrc, flake.nix, flake.lock — design notes and the Nix dev shell

Bridge: agent.Agentports.Agent (was part of PR #70)

  • backend/internal/adapters/agent/portshim/ — wraps the richer interface to satisfy the narrower one the SM expects. POSIX shell-quotes argv into the single-string the legacy port wants.

Workspace repo resolution (was part of PR #70)

  • backend/internal/adapters/workspace/gitworktree/projectresolver/gitworktree.RepoResolver backed by project.Manager.Get. Lives in its own subpackage to keep gitworktree free of the project import.

Inbox-file messenger (was part of PR #70)

  • backend/internal/adapters/messenger/inbox/ports.AgentMessenger writes <workspace>/.ao/inbox/<rfc3339nano>_<hash>.md. Symlink-safe via os.Lstat.

Session Manager wired into the daemon (was part of PR #70)

  • backend/internal/daemon/session_wiring.go — assembles claudecode → portshim, gitworktree over projectresolver, inbox messenger over the sqlite store, and the SM. Reuses the existing zellij runtime, project.Manager, and lifecycle.Manager singletons (one of each, shared between SM and reaper/terminal/httpd).
  • Updates to daemon.go, lifecycle_wiring.go, wiring_test.go to call into it and assert singleton sharing.

ao spawn CLI + POST /api/v1/sessions route (was PR #71, squashed)

  • backend/internal/httpd/controllers/sessions.go + tests — the route. 503 if SM not wired, 400 for missing fields, 201 with {sessionId, workspacePath, runtimeHandle} on success.
  • backend/internal/cli/spawn.go + tests — the cobra subcommand. Reads the daemon's runfile, POSTs the spawn request, prints the result.
  • backend/internal/session/spawner.go — narrow Spawner interface (just Spawn) so the controller can be tested without dragging the SM's collaborators.

SCM poller (was PR #72)

  • backend/internal/observe/scm/poller.go + tests — 30s ticker. Per session: derives owner/repo from project, calls Provider.FindOpenPRForBranch, then Provider.Observe, then hands the result to pr.Manager.ApplyObservation (which writes the PR row + forwards to lifecycle.ApplyPRObservation).
  • backend/internal/daemon/scm_wiring.go — wires it next to the reaper. Reads AO_GITHUB_TOKEN / GITHUB_TOKEN at startup; missing token degrades gracefully (logs and noops).
  • backend/internal/adapters/scm/github/find_branch_pr.go + test — Provider.FindOpenPRForBranch(owner, repo, branch) over GET /pulls?head=…&state=open. Picks the most recently updated PR when multiple match.
  • Small additions to the existing provider.go / client.go / provider_test.go from feat(scm): GitHub provider adapter — Observe(prURL) → PRObservation #69.
  • backend/internal/integration/scm_poller_test.go — end-to-end through real pr.Manager + lifecycle.ApplyPRObservation + capture messenger.

Spawn correctness fix (commit b6c76b4, by aa-46)

  • backend/internal/session/manager.go — defaults cfg.Branch to "ao/" + sessionID when unset. The CLI/API doesn't expose a --branch flag; the SM is the only layer where a per-session default ref can be computed (the session id is assigned by the store inside Spawn). Without this, gitworktree refused empty branches and every spawn returned 500.
  • backend/internal/httpd/controllers/sessions.gowriteSpawnError now slog.Errors the internal error so future failures actually appear in daemon.log instead of being silently dropped. Response body stays generic (no detail leak).
  • Test added in backend/internal/session/manager_test.go pinning the default-branch behavior.

Helper script

  • scripts/ao-here.sh — builds the daemon to backend/bin/ao, starts it if not running, registers $PWD as a project, prints the ready-to-paste ao spawn command. Idempotent on re-registration (handles 409 cleanly).

End-to-end verification

Done locally on this branch (baa386f):

  1. ./scripts/ao-here.sh from the agent-orchestrator repo root — built the daemon, started it, registered ao-agents as a project.
  2. ./backend/bin/ao spawn --project ao-agents --prompt "write a file at .../gotten.md acknowledging" — returned 201 with the session id, workspace path, runtime handle.
  3. zellij attach <handle> — live Claude Code session in the new worktree at <workspacePath>. Claude created gotten.md as instructed.

Tests: 417/418 pass on the integrated tree. The one failure is TestSessionStreamsRealZellijPane, a pre-existing local-env issue where macOS $TMPDIR > 103 chars exceeds zellij's IPC socket limit. Independent of these changes; same failure shows on origin/main.

Known follow-ups (not in this PR)

  • Permissions threading: spawned workers currently block on Claude Code's "Do you want to create X?" interactive prompt because the portshim doesn't propagate Permissions from ports.SpawnConfig to the claudecode adapter's LaunchConfig. Two options to track separately: (a) hardcode --dangerously-skip-permissions for KindWorker in claudecode, or (b) properly thread Permissions through ports.SpawnConfigportshim.AgentConfig → claudecode's LaunchConfig with KindWorker defaulting to bypass. Recommended path is (b).
  • No PR URL column on session rows yet — the SCM poller falls back to branch-based PR discovery via gh /pulls?head=…. Once the session record grows a stored PR URL, the poller should prefer it. Intentionally deferred to a session/sqlc PR.
  • Codex adapter not wired — the codex adapter code is here, but the daemon only constructs claudecode. Codex wiring is a small follow-up.
  • Zellij send-keys pane-ping — agent reads .ao/inbox/ on demand today. A future PR can add a live ping so the agent notices new files faster.

Origin trail (preserved on staging)

These were the sub-PRs that landed on staging in order; they're all here as one cumulative diff against main:

yyovil and others added 23 commits June 1, 2026 05:31
Faithful copy of the agents plugin implementation from yyovil/better-ao
(internal/plugin/ -> backend/internal/plugin/) plus its PRD
(prds/plugins/agents/PRD.md), as a first-iteration proposal for review.

Imports are left at their original github.com/yyovil/better-ao/... paths and
are NOT yet reconciled to this repo's module; see PR description for the
integration deltas (module path, missing internal/utils dependency).

Co-authored-by: Claude <noreply@anthropic.com>
A fresh GitHub SCM provider adapter under
backend/internal/adapters/scm/github/ exposing one method:

  (*Provider).Observe(ctx, prURL) (ports.PRObservation, error)

It performs a REST GET on /repos/{o}/{r}/pulls/{n} for the authoritative
draft/merged/closed/head-SHA, one GraphQL query for the reviewDecision +
mergeStateStatus + statusCheckRollup + unresolved review threads, and
(only for failure-class CheckRuns) a REST GET on
/actions/jobs/{job_id}/logs to splice the last 20 lines of the failed
job into the observation.

The package is the observation primitive; the polling loop, cadence
selection, daemon wiring, persistence and webhook receiver are all
intentionally out of scope (separate PRs / lanes).

Closes #27 — this supersedes PR #28's attempt, which targeted types
(domain.SCMProvider / SCMSnapshot / ports.SCMObserveRequest) that the
PR #62 simplification refactor has since removed. The GraphQL queries
and mergeability composition logic are credited to @whoisasx from
PR #28's provider.go; the package was re-implemented against the
current ports.PRObservation seam (post-#62) rather than rebased.

Bot-author detection uses ONLY GitHub's typed signal (__typename
"Bot" / User.Type "Bot"). The strings.Contains(login, "bot") fallback
from PR #28 was intentionally dropped — aa-18's review flagged it as
a false-positive magnet for logins like "robothon" / "lambot123".

46 table-driven tests against httptest.NewServer cover happy path,
draft, merged, closed (not merged), CI passing/failing/pending,
StatusContext legacy, log-tail extraction (and the best-effort
log-fetch failure case), mergeability mergeable/conflicting/blocked
(including ci-failing → blocked even when GitHub still says CLEAN —
the load-bearing aa-18 contract)/unstable/unknown, review
approved/changes-requested/required/none, bot-author filtering
(including the robothon false-positive guard), unresolved-only
threads, all-bots → empty Comments, ETag-304 cache hit, primary +
secondary rate-limit (with errors.As → *RateLimitError), 401 →
ErrAuthFailed, malformed JSON → Fetched:false, network error →
Fetched:false, Authorization Bearer header injection,
StaticTokenSource blank/whitespace rejection, GHTokenSource memoize
+ invalidate.

Verification:
- go build ./...               clean
- go vet ./...                 clean
- gofmt -l backend/internal/adapters/scm/   clean
- golangci-lint run ./... (v2.12, repo .golangci.yml)   0 issues
- go test -race ./internal/adapters/scm/github/...      46/46 PASS

References:
- aa-18 review of PR #28: ~/.ao/agent-reports/aa-18.md
- aa-26 tracker adapter (sibling Go-adapter pattern): #36 / agent-reports/aa-26.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…x messenger

PR #62 ("simplify session lifecycle and zellij runtime") deleted the Session
Manager wiring from the daemon — every call to session.New() was removed and
only the integration test still constructed one. This brings SM back, end to
end: calling sm.Spawn() now launches a real Claude Code agent in a real git
worktree in a real zellij session, and lifecycle nudges reach the agent via
an inbox file.

Four new pieces:

- adapters/agent/portshim: bridges the richer adapters/agent.Agent interface
  (PR #65, @yyovil) onto the narrower ports.Agent the SM consumes. POSIX
  shell-quoting joins argv into the single string the zellij `sh -lc` wrapper
  expects.
- adapters/workspace/gitworktree/projectresolver: gitworktree.RepoResolver
  backed by project.Manager. Lives in its own subpackage so gitworktree stays
  free of the project import (and the cycle that would create).
- adapters/messenger/inbox: ports.AgentMessenger writing each message as
  <session-workspace>/.ao/inbox/<nano>_<hash>.md. Symlink-safe via os.Lstat
  on the .ao/inbox segments.
- daemon/session_wiring.go: assembles claudecode → portshim, gitworktree
  over projectresolver, inbox messenger over the sqlite store, and the SM
  itself. Reuses the existing zellij runtime / project manager / lcm
  singletons rather than constructing parallel copies.

Daemon-wide singleton sharing (the change of behavior under #62 / #65 +
this PR):

- One zellij.Runtime instance services both the terminal mux and SM.Spawn.
  Two adapters would race on the same socket.
- One lifecycle.Manager instance services both the reaper (runtime liveness
  observations) and the SM (spawn/restore/kill writes). Two LCMs would split
  agent-nudge state.
- One project.Manager instance services both httpd (/api/v1/projects) and
  the gitworktree RepoResolver. Two stores would diverge on cached reads.
- One ports.AgentMessenger services both the LCM (PR-driven reactions:
  CI fail, review feedback, merge conflict) and the SM (Send).
- One *sqlite.Store services CDC, lifecycle, SM, and the inbox workspace
  lookup. Already the case; preserved.

Also promotes the duplicated "agentSessionId" metadata key literal in the
claudecode and codex adapters to a single agent.MetadataKeyAgentSessionID
constant in adapters/agent, which the portshim now uses to populate
Session.Metadata for the underlying adapter's GetRestoreCommand.

What this PR does NOT do (covered by follow-ups γ/δ):
- No HTTP routes for SM (POST /sessions, etc.) — γ
- No `ao session new` CLI — γ
- No SCM poller — δ
- No codex agent wiring (claude-code only) — later
- No zellij send-keys pane-ping — the agent reads its inbox on demand

Tests:
- portshim: 15 table-driven cases (shell quoting, env, restore propagation,
  error fall-through, safe-string short-circuit).
- projectresolver: 4 cases (interface satisfaction, happy path, unknown
  project, degraded project).
- inbox: 9 cases (interface satisfaction, write, dir create, two-distinct,
  unknown session, empty workspace path, symlinked inbox refused, empty
  message, filename shape).
- daemon/wiring_test: SM stack constructed + sharing singletons + messenger
  reaches the same store via SessionMetadata.WorkspacePath end to end.

go test -race / go vet / gofmt clean. The pre-existing
TestSessionStreamsRealZellijPane integration test fails on this host
because \$TMPDIR > 103 chars (zellij IPC socket limit) — also fails on
origin/staging without these changes.

Branched from origin/staging (= main + #65) so claudecode is available;
PR base must be staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires the github.Provider (PR #69) into the daemon as a periodic poller
on top of the SM/lifecycle stack (PR #70). Every 30s the loop lists
alive sessions, branch-discovers each session's open PR, observes it
through the Provider under a 15s per-call deadline, and hands the
result to pr.Manager.ApplyObservation — which transactionally writes
the row and forwards to lifecycle.ApplyPRObservation for CI-failure
log-tail nudges, review-feedback nudges (capped at reviewMaxNudge=3),
and merge-conflict rebase nudges.

Branch discovery is the v1 fallback because sessions don't yet carry a
PR URL field; adding that column is a separate session/sqlc PR. Until
then the poller resolves owner/repo from project.Repo (currently
RepoOriginURL) or git remote get-url origin, then asks GitHub for the
open PR with head = owner:branch.

Error classification follows the spec:
  - ErrRateLimited: short-circuit rest of tick (don't burn through
    remaining sessions while GitHub asks us to back off)
  - ErrAuthFailed:  flip Healthy() to false (sticky — does NOT
    auto-recover, because a 304-cached success doesn't actually
    exercise the token) and continue
  - other:          log warn, continue

No-token environments degrade gracefully: startSCM logs an Info notice
and returns a closed done-channel; Stop is a free call.
Review fixes (PR #71):
- spawn CLI now uses a dedicated 90 s timeout (90 s > server's 60 s
  DefaultRequestTimeout) via context.WithTimeout, and stops sharing
  deps.HTTPClient — that client is sized for fast /healthz/shutdown
  probes (2 s) and was preempting the synchronous Spawn long before the
  daemon could finish provisioning a worktree + zellij pane + agent.
- Harden writeSpawnError so a *project.Error with a non-client Kind
  ("internal", "not_implemented", or anything unknown) falls through to
  the generic 500 SPAWN_FAILED envelope instead of passing the project
  error's Code/Message verbatim to the client. Adds three subtests that
  pin down the opacity contract.

Lint debt cleared (inherited from PRs #65/#70):
- Add doc comments on every exported symbol in the agent / claudecode /
  codex / adapters-registry packages (revive: exported)
- gosec G306/G301: inbox file/dir perms 0644→0600 and 0755→0750
- gosec G703 (path traversal via taint): excluded globally with the same
  rationale as G304 — adapter paths are daemon-config/worktree-derived,
  not user input
- gocritic emptyStringTest: len(strings.TrimSpace(...)) > 0 → != ""
- gocritic paramTypeCombine: combine adjacent same-type params
- errcheck: wrap deferred os.Remove(tmpName) in a closure
- prealloc: preallocate cmd slices on the resume paths
Inherited from PR #72 merging into staging after this branch opened.
golangci-lint v2.12.2 → 0 issues.
* feat(ao): `ao spawn` CLI + POST /api/v1/sessions route

* fix(ao): address PR γ review + clear inherited lint debt

Review fixes (PR #71):
- spawn CLI now uses a dedicated 90 s timeout (90 s > server's 60 s
  DefaultRequestTimeout) via context.WithTimeout, and stops sharing
  deps.HTTPClient — that client is sized for fast /healthz/shutdown
  probes (2 s) and was preempting the synchronous Spawn long before the
  daemon could finish provisioning a worktree + zellij pane + agent.
- Harden writeSpawnError so a *project.Error with a non-client Kind
  ("internal", "not_implemented", or anything unknown) falls through to
  the generic 500 SPAWN_FAILED envelope instead of passing the project
  error's Code/Message verbatim to the client. Adds three subtests that
  pin down the opacity contract.

Lint debt cleared (inherited from PRs #65/#70):
- Add doc comments on every exported symbol in the agent / claudecode /
  codex / adapters-registry packages (revive: exported)
- gosec G306/G301: inbox file/dir perms 0644→0600 and 0755→0750
- gosec G703 (path traversal via taint): excluded globally with the same
  rationale as G304 — adapter paths are daemon-config/worktree-derived,
  not user input
- gocritic emptyStringTest: len(strings.TrimSpace(...)) > 0 → != ""
- gocritic paramTypeCombine: combine adjacent same-type params
- errcheck: wrap deferred os.Remove(tmpName) in a closure
- prealloc: preallocate cmd slices on the resume paths

* fix(lint): lift loop condition in scm poller test (staticcheck QF1006)

Inherited from PR #72 merging into staging after this branch opened.
golangci-lint v2.12.2 → 0 issues.
Convenience script for the typical local-dev flow: cd into a git repo,
run `scripts/ao-here.sh`, and the daemon starts (if not already up),
the cwd is registered as an AO project, and the script prints the
project ID along with a ready-to-run `ao spawn` command.

Idempotent: 409 conflict on existing-path is handled by pulling the
existingProjectId out of the response so re-running just re-uses the
previous registration.
…TH lookup

The PATH version collided with the TypeScript orchestrator CLI (same
binary name 'ao', different commands). This version finds the repo via
the script's own location, builds (or rebuilds when sources are newer)
backend/cmd/ao into backend/bin/ao, and invokes that explicit binary
for both 'start' and 'spawn'. No reliance on PATH; no ambiguity about
which 'ao' is running.

backend/bin/ is already gitignored.
…h is unset

The CLI/API does not expose a branch flag, but the gitworktree adapter
requires a non-empty branch (and cannot have two worktrees on the same
branch). Spawn forwarded cfg.Branch verbatim to workspace.Create, so the
first real `ao spawn` returned 500 with "workspace: gitworktree: branch
is required". The session id is assigned by the store inside Spawn, so
the SM is the only layer where a per-session default ref can be
computed. Explicit branches still win.

Also surface internal SPAWN_FAILED errors to the daemon log via
slog.Error in writeSpawnError. Response stays opaque
("Failed to spawn session") — TestSessionsAPI_Spawn_InternalKindIsOpaque
still passes — but operators get a real error in the log instead of a
713µs 500 with no trace.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The envelope is flat — '.details.existingProjectId', not
'.error.details.existingProjectId' (the 'error' field is a string
kind, not an object). Re-running the script against an
already-registered project now correctly extracts the project id
from the conflict response instead of crashing the helper.

Found by aa-46 while debugging the end-to-end spawn flow.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR lands the complete LCM↔SCM end-to-end chain: from ao spawn --prompt "<task>" through a real Claude Code session in a git worktree, to automatic SCM observation that delivers CI failures, merge conflicts, and unresolved review comments into the agent's inbox. The change is large but well-decomposed — each layer (adapters, portshim, inbox messenger, session wiring, HTTP route, CLI, SCM poller) has tests and clear interface seams.

  • Agent adapters + portshim: New claudecode and codex adapters implement the rich agent.Agent interface; the portshim bridges them onto the narrower ports.Agent the Session Manager consumes by POSIX-quoting the argv into a single shell string. A follow-up is needed to thread Permissions through the portshim (currently dropped, so workers block on Claude's interactive approval prompts).
  • Inbox messenger + session wiring: inbox.Messenger writes timestamped .ao/inbox/*.md files with symlink-safe directory creation; the daemon assembles a single set of singletons (runtime, projects, messenger) shared between the lifecycle manager and session manager.
  • SCM poller: A 30-second polling loop discovers open PRs by branch, fetches observations via GitHub REST+GraphQL, and forwards them to pr.Manager.ApplyObservation. The healthy bit flips on ErrAuthFailed but the loop currently continues to all remaining sessions in that tick rather than short-circuiting as the rate-limit path does.

Confidence Score: 3/5

Safe to merge the infra and wiring; spawned worker sessions will block on Claude Code's interactive approval prompts because Permissions is not threaded through the portshim, and a launch-command generation failure produces a bare shell pane with no diagnostic output.

The portshim silently discards any error from the agent's GetLaunchCommand — no log, no propagation — so if command generation fails the runtime receives an empty string and either creates a bare shell pane (spawn looks like success, no agent runs) or returns an opaque error. Additionally, the portshim drops the Permissions field on every spawn, which is acknowledged in the PR description but means every worker session is immediately interactive-blocked in practice. Both issues touch the single most critical path in the new feature.

backend/internal/adapters/agent/portshim/shim.go (silent error drop + missing Permissions propagation); backend/internal/observe/scm/poller.go (auth-failure should short-circuit the tick like rate-limit does).

Important Files Changed

Filename Overview
backend/internal/adapters/agent/portshim/shim.go New shim bridging richer agent interface to ports.Agent; silently drops GetLaunchCommand errors and does not propagate Permissions.
backend/internal/observe/scm/poller.go New 30s SCM polling loop; correctly short-circuits on rate-limit but continues iterating all sessions on auth failure, causing redundant 401s per tick.
backend/internal/adapters/agent/claudecode/claudecode.go New Claude Code adapter; correct binary resolution, permission-mode mapping, atomic ~/.claude.json trust update, and UUIDv5-based session-id derivation.
backend/internal/adapters/agent/claudecode/hooks.go Idempotent hook installer merging into .claude/settings.local.json; correctly preserves existing user hooks and unrelated settings.
backend/internal/adapters/messenger/inbox/inbox.go Inbox-file AgentMessenger; symlink-safe directory creation and collision-resistant filenames; looks correct.
backend/internal/session/manager.go Added default-branch derivation (ao/+sessionID) in Spawn when Branch is empty; correctly placed after store.CreateSession assigns the ID.
backend/internal/httpd/controllers/sessions.go New POST /sessions route; handles nil SM with 503, validates required fields, and logs internal errors without leaking details to clients.
backend/internal/cli/spawn.go New ao spawn subcommand; reads runfile for daemon address, uses a dedicated 90s-timeout HTTP client separate from the shared probe client.
backend/internal/daemon/session_wiring.go Session stack assembly; correctly passes singleton runtime/projects/messenger instances to the Session Manager.
backend/internal/daemon/scm_wiring.go SCM stack wiring; gracefully degrades when GITHUB_TOKEN is absent; token eagerly validated at startup via NewProvider.
backend/internal/adapters/scm/github/find_branch_pr.go New FindOpenPRForBranch using GET /pulls?head=owner:branch&state=open; picks most-recently-updated match when multiple open PRs share the same head.
scripts/ao-here.sh Helper script; incrementally rebuilds the binary, waits for readiness, and handles 409 re-registration idempotently.

Reviews (1): Last reviewed commit: "fix(ao-here): correct jq path for 409 ex..." | Re-trigger Greptile

Comment on lines +32 to +38
func (s *Shim) GetLaunchCommand(cfg ports.AgentConfig) string {
argv, err := s.agent.GetLaunchCommand(context.Background(), launchConfigFor(cfg))
if err != nil {
return ""
}
return joinShellArgv(argv)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent error drop produces a dangerously empty launch command

When agent.GetLaunchCommand returns an error, the shim silently returns "" with no log output. The Session Manager then passes LaunchCommand: "" to runtime.Create. If the runtime accepts an empty command it starts a bare shell pane and the spawn returns HTTP 201 — an agent was never launched, but the caller has no idea. If the runtime rejects it, the error surfaces as a generic spawn failure with no indication that the real failure was in command generation. Either outcome is observable only through the resulting session behaviour, not from logs or the HTTP response.

Suggested change
func (s *Shim) GetLaunchCommand(cfg ports.AgentConfig) string {
argv, err := s.agent.GetLaunchCommand(context.Background(), launchConfigFor(cfg))
if err != nil {
return ""
}
return joinShellArgv(argv)
}
func (s *Shim) GetLaunchCommand(cfg ports.AgentConfig) string {
argv, err := s.agent.GetLaunchCommand(context.Background(), launchConfigFor(cfg))
if err != nil {
// Log the error so operators can diagnose a spawn that produces an
// empty launch command. The ports.Agent contract has no error return,
// so "" is the only channel for failure — make it at least visible.
slog.Default().Error("portshim: agent.GetLaunchCommand failed, returning empty command",
"session", cfg.SessionID, "err", err)
return ""
}
return joinShellArgv(argv)
}

Comment on lines +67 to +73
func launchConfigFor(cfg ports.AgentConfig) agent.LaunchConfig {
return agent.LaunchConfig{
SessionID: string(cfg.SessionID),
WorkspacePath: cfg.WorkspacePath,
Prompt: cfg.Prompt,
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Permissions field is not propagated to the underlying adapter

launchConfigFor copies SessionID, WorkspacePath, and Prompt from ports.AgentConfig but silently drops cfg.Permissions. As a result every spawned session runs in PermissionModeDefault and the Claude Code adapter emits no --permission-mode flag, so Claude Code falls back to the user's settings.json. For KindWorker sessions this means the interactive "Do you want to create X?" prompt is shown and the agent blocks. The PR description notes this as a known follow-up, but it renders every spawned worker session non-functional until the user manually answers the prompt or it times out.

Comment on lines +246 to +250
case errors.Is(err, scmgithub.ErrAuthFailed):
p.healthy.Store(false)
p.logger.Error("scm poller: auth failed, provider marked unhealthy",
"session", sid, "stage", stage, "err", err)
return false
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Short-circuit the tick on auth failure for the same reason rate-limiting does. A token that is globally invalid will fail on every subsequent session in the same tick, so continuing is only wasteful.

Suggested change
case errors.Is(err, scmgithub.ErrAuthFailed):
p.healthy.Store(false)
p.logger.Error("scm poller: auth failed, provider marked unhealthy",
"session", sid, "stage", stage, "err", err)
return false
case errors.Is(err, scmgithub.ErrAuthFailed):
p.healthy.Store(false)
p.logger.Error("scm poller: auth failed, skipping rest of tick",
"session", sid, "stage", stage, "err", err)
return true

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

@harshitsinghbhandari harshitsinghbhandari marked this pull request as draft June 1, 2026 18:17
…s wiring

Main moved structurally via PR #67 (session→session_manager rename + service
layer for read-model assembly). Staging had the live SM daemon wiring (PR #70)
but for the old session package shape. This merge adopts main's shape and
retargets staging's wiring on top.

Structural decisions:
- Keep main's httpd/api.go shape with controllers.SessionService (drop
  staging's session.Spawner).
- Keep main's rich httpd/controllers/sessions.go (list/get/spawn/restore/kill/
  send/rename/spawnOrchestrator) and its tests.
- Keep main's SCM provider/client/test code: the pagination guard, the
  merge_state_status REST field, and the application/vnd.github+json Accept
  header are intentional safety + correctness fixes. Staging's text/plain
  Accept tweak would 406 the /actions/jobs/{id}/logs endpoint.
- Delete session.Spawner: service.Session is the new controller contract and
  duck-types into controllers.SessionService.

Wiring retarget (daemon/session_wiring.go, daemon/daemon.go):
- buildSessionStack now constructs *session_manager.Manager and wraps it in
  service.NewSession(sm, store); the stack returns *service.Session instead of
  a bare *Manager.
- daemon.Run passes ss.svc into httpd.APIDeps{Sessions: ...}.

Preserved from staging:
- aa-46 default-branch fix in session_manager/manager.go::Spawn (branch
  defaults to "ao/<sessionID>" when SpawnConfig.Branch is empty) plus its two
  pinning tests (TestSpawn_DefaultsBranchPerSession_WhenUnset,
  TestSpawn_HonorsExplicitBranch).
- ao spawn CLI + POST /api/v1/sessions route, ao-here.sh helper, agent
  adapters (claudecode + portshim), inbox messenger, projectresolver,
  gitworktree daemon wiring, SCM poller, find_branch_pr, ao spawn route fix.

Import migration: internal/session → internal/session_manager updated in
daemon/session_wiring.go and adapters/agent/portshim/shim_test.go.

Verification: go build, go vet, gofmt all clean. go test -race ./... reports
406 pass / 1 fail (TestSessionStreamsRealZellijPane — pre-existing macOS
$TMPDIR > 103-char zellij IPC socket length issue, not introduced by merge).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator Author

File-by-file tour (excluding @yyovil's agent adapters from #65)

Skipping .envrc, flake.nix, flake.lock, docs/agent/README.md, backend/internal/adapters/agent/agent.go, backend/internal/adapters/agent/claudecode/**, backend/internal/adapters/agent/codex/**, backend/internal/adapters/registry.go — those are @yyovil's plugin-registry + Claude Code + Codex adapters from #65 and aren't ours to explain.

Bridge: rich agent.Agent (PR #65) ↔ narrow ports.Agent (Session Manager)

  • backend/internal/adapters/agent/portshim/shim.go — wraps an agent.Agent to satisfy ports.Agent. GetLaunchCommand calls the rich interface and POSIX-shell-quotes the returned argv into the single string the Session Manager expects. GetEnvironment returns the AO_SESSION_ID/AO_PROJECT_ID/AO_ISSUE_ID baseline env. GetRestoreCommand joins the rich adapter's restore argv if ok=true, else returns empty (SM re-Spawns instead).
  • backend/internal/adapters/agent/portshim/shim_test.go — table-driven coverage: shell-quoting with paths containing spaces / quotes / $() / backticks, env keys, restore-not-available behaviour, and the var _ ports.Agent = (*Shim)(nil) interface assertion.

Workspace repo resolution

  • backend/internal/adapters/workspace/gitworktree/projectresolver/resolver.go — implements gitworktree.RepoResolver by delegating to project.Manager.Get(projectID) and returning the project's local path. Lives in a subpackage so gitworktree stays free of the project import (avoids cycle).
  • backend/internal/adapters/workspace/gitworktree/projectresolver/resolver_test.go — happy path (known project resolves), unknown id returns error, the var _ interface check.

Inbox-file messenger ("inform agent" pipe)

  • backend/internal/adapters/messenger/inbox/inbox.goports.AgentMessenger. On Send(ctx, sessionID, message): looks up the session's workspace path via an injected store interface, creates <workspace>/.ao/inbox/ if absent, writes the message to <workspace>/.ao/inbox/<rfc3339nano>_<sha256-prefix>.md. Symlink-safe via os.Lstat on .ao and .ao/inbox.
  • backend/internal/adapters/messenger/inbox/inbox_test.go — happy path, dir creation, two-sends-produce-distinct-files, unknown session, empty workspace path, symlinked inbox refused, empty message still written (zero-byte ok), filename shape.

SCM provider — additions on top of #69

  • backend/internal/adapters/scm/github/find_branch_pr.goProvider.FindOpenPRForBranch(owner, repo, branch). Queries GET /repos/{o}/{r}/pulls?head={o}:{branch}&state=open. Picks the most-recently-updated PR when multiple match. Error classification matches Observe.
  • backend/internal/adapters/scm/github/find_branch_pr_test.go — single/no/multi match, empty inputs, rate-limit, auth, and the synthetic PR URL it returns is the one Observe expects.

SCM poller — the 30s tick that drives observations

  • backend/internal/observe/scm/poller.goPoller.Start(ctx) → <-chan struct{}. Mirrors the reaper's goroutine pattern. Per tick: ListAllSessions → skip terminated / branch-less → resolve owner/repo from project.Manager.Get (with shell-out to git remote get-url origin as a fallback) → FindOpenPRForBranch(...)Provider.Observe(prURL) under a 15s per-call deadline → pr.Manager.ApplyObservation(ctx, sessionID, obs). Errors: ErrRateLimited short-circuits the rest of the tick; ErrAuthFailed flips the sticky Healthy() flag AND short-circuits (an invalid token will fail every subsequent session); other errors continue with the next session.
  • backend/internal/observe/scm/poller_test.go — 12 cases including apply, !Fetched skip, no-branch skip, no-PR skip, rate-limit short-circuit, auth-fail-sticky-short-circuit, generic error continue, project-lookup error, per-call deadline, Start/cancel drain, repeat ticks, the parseGitHubRemote URL-parsing table.

ao spawn CLI + POST /api/v1/sessions

  • backend/internal/cli/spawn.go — cobra subcommand. Flags: --prompt (required), --project (required), --agent (defaults to claude-code). Reads the daemon's runfile at <UserConfigDir>/agent-orchestrator/running.json to find the port. POSTs the spawn request. On 201 prints session ID + zellij attach <handle> hint; on 4xx/5xx prints the server's error to stderr and exits non-zero.
  • backend/internal/cli/spawn_test.gohttptest-backed coverage: happy path, server 4xx, server 5xx, network error, daemon not running, missing-required-flag.
  • backend/internal/cli/root.go — one line: registers spawn as a subcommand alongside start / stop / status / doctor.

Daemon wiring — assembles all the singletons

  • backend/internal/daemon/daemon.goRun() is the composition root. Reads config; opens the runfile; opens sqlite store; starts the CDC pipeline; constructs the zellij runtime + terminal manager + project manager; starts httpd with the API deps. Net change here: calls startLifecycle, startSession, startSCM; threads the SM-service into httpd.APIDeps{Sessions: ...}; shutdown order is stop()scmStk.Stop()lcStack.Stop()cdcPipe.Stop() (background goroutines drain after ctx-cancel, before storage closes).
  • backend/internal/daemon/lifecycle_wiring.go — minor changes: takes the messenger into lifecycle.New(store, messenger) instead of lifecycle.New(store, nil), so PR-driven nudges actually reach the inbox messenger.
  • backend/internal/daemon/session_wiring.gostartSession(...). Builds: claudecode plugin → portshim.New(...) (satisfies ports.Agent); gitworktree workspace over projectresolver.New(projectManager); inbox.New(store) messenger; session_manager.New(...) over the same lcm/store/runtime singletons the rest of the daemon uses; wraps the SM in service.NewSession(sm, store). Returns *service.Session so it can flow into httpd.APIDeps{Sessions: ...}.
  • backend/internal/daemon/scm_wiring.gostartSCM(...). Reads AO_GITHUB_TOKEN or GITHUB_TOKEN from env; if missing, logs an Info notice and returns a closed done-channel (graceful no-op for local dev). Otherwise builds scmgithub.NewProvider(...) + scm.New(scm.Deps{...}) and starts the poller.
  • backend/internal/daemon/wiring_test.go — asserts the SM stack is constructed, that the same project.Manager instance backs both httpd and the projectresolver, and that the same *lifecycle.Manager is shared by the reaper and the SM (singleton sharing is load-bearing — two LCMs would split agent-nudge state).

Integration test

  • backend/internal/integration/scm_poller_test.go — boots *sqlite.Store + *lifecycle.Manager + pr.Manager + scm.Poller against an httptest.NewServer returning fixture GitHub responses; inserts a session row; ticks the poller once; asserts the PR row is written, the failing check is recorded, and the lifecycle CI-failure nudge reaches the capture messenger. len(msgs) == 1 catches double-nudge regressions.

Spawn correctness fix (b6c76b4 from aa-46)

  • backend/internal/session_manager/manager.go — adds a default branch when cfg.Branch is unset: "ao/" + string(sessionID). Without this, gitworktree refused empty branches and every spawn returned 500. The session ID is assigned by the store inside Spawn, so the SM is the only layer where a per-session default ref can be computed.
  • backend/internal/session_manager/manager_test.go — two pinning tests: (a) empty cfg.Branch"ao/<sessionID>", (b) explicit cfg.Branch is preserved.

Misc

  • scripts/ao-here.sh — convenience helper. Builds backend/cmd/ao into backend/bin/ao if absent or stale, starts the daemon if not already running, polls /readyz, registers $PWD as a project via POST /api/v1/projects, prints the project ID + a ready-to-paste ao spawn command. Idempotent on re-registration (handles 409 by extracting existingProjectId from the conflict envelope).
  • .gitignore — adds backend/bin/ (the helper script's build output) and a couple of supporting entries.
  • backend/.golangci.yml — a one-line tweak to accommodate one of the new files; nothing structural.

What's already on main and just replayed here

The diff includes the session_manager/ rename + service/ layer that landed on main via #67 — those aren't "ours" to explain in this PR; they're prateek's, and aa-42's reconcile of staging adopted them rather than diverging. The Spawn-route's controllers/sessions.go is also main's (full list/get/spawn/restore/kill/send/rename surface); we wire into it rather than ship our own minimal one.

@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator Author

Closing — yyovil's agent adapter (PR #65, the dependency for this PR's SM wiring) is stale. Staging will keep accumulating our integration work but won't be promoted to main wholesale. Pieces that don't depend on yyovil's adapter (SCM provider already on main via #69, SCM poller, ao spawn CLI/route) can be cherry-picked into main as separate PRs if/when desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants