feat(recover): add agentbox recover to reconnect a box without power-cycling#127
Merged
Conversation
A box's host-side state (the relay's in-memory registry + CloudBoxPoller,
the Hetzner SSH ControlMaster + port forwards, the host Portless aliases,
the detached agent tmux session) is separate from the box and is lost on a
host reboot / relay restart / new CLI process while the sandbox keeps
running. `start`/`unpause` only fix this by power-cycling the box and can't
touch a box missing from local state at all.
`agentbox recover [box]`:
- ensures the host relay is up and rehydrates every box into it,
- calls the new `Provider.reconnect(box)` — the no-power-cycle sibling of
`start`: cloud re-runs `reEnsureCloudBox` (refresh preview URLs, re-open
the Hetzner tunnel, re-register Portless + the relay poller, relaunch
in-box daemons) without `backend.start`; docker re-runs the idempotent
`startBox`,
- relaunches the agent the box was running (resuming, or starting
`box.lastAgent` fresh) and attaches.
Adds `BoxRecord.lastAgent` (claude/codex/opencode), written on every agent
launch (foreground + queued via `recordLastAgent`) — durable, unlike the
in-box session pointers cleared on stop, so recover knows which agent to
bring back.
`recover --provider <cloud> --adopt [ref]` rebuilds local state for a
sandbox missing from this host (from `backend.list()` + the agentbox.name
tag), minting fresh relay/bridge tokens that reach the in-box agent when
reconnect relaunches the ctl daemon (it writes /run/agentbox/relay.env).
Hetzner adoption needs the box's per-host SSH key; a box created elsewhere
can't be controlled and recover says so.
Works across all five providers. Docs updated (cli.mdx, state.md,
host-relay.md, cloud-providers.md). Unit tests for recordLastAgent and the
restoreAgentSessions launch-fresh path; docker reconnect + lastAgent +
fresh-launch verified live (StartedAt unchanged → no power-cycle).
Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
…ession Bugbot: `recover` passed the target agent to `restoreAgentSessions` only to gate the fresh-launch pass, while pass 1 still resumed every resumable agent that had an in-box pointer — so recover could resurrect an unrelated Claude/ Codex session (possibly from a stale pointer) alongside the intended agent. Rework `restoreAgentSessions`: `restoreOnly` (was `launchFresh`) now scopes the whole restore to that one agent — resume it if there's a live/resumable session, else start it fresh — and touches nothing else. `start`/`unpause` (no `restoreOnly`) keep the resume-every-running-agent semantics. A box created before `lastAgent` existed passes `undefined` and so falls back to resume-all. Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Owner
Author
|
bugbot run |
… start
Bugbot: docker `reconnect` always delegated to `startBox` (`docker start`),
which errors on a paused container ("cannot start a paused container"). So
`agentbox recover` on a paused docker box left it frozen while reporting the
relay/portless recovery as success — exec, agent restore, and attach kept
failing until the user ran `unpause`.
Probe state first: paused -> `unpauseBox` (resumes the still-frozen
ctl/dockerd/vnc; the portless alias survives a pause); running/stopped ->
`startBox` (idempotent, relaunches dead daemons + re-aliases portless);
missing/destroyed -> clear error. Mirrors the cloud provider's state-routed
reconnect.
Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Owner
Author
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit 3519112. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Adds
agentbox recover— re-establishes a box's host-side connectivity (relay registry + CloudBoxPoller, Hetzner SSH ControlMaster + port forwards, host Portless aliases, the detached agent session) without power-cycling the box. For after a host reboot, a relay restart, or a fresh CLI process on another machine — situations wherestart/unpausewould needlessly power-cycle (and can't touch a box missing from local state at all).How
Provider.reconnect(box)— the no-power-cycle sibling ofstart. Cloud:probeState; ifrunning, runsreEnsureCloudBoxdirectly (refresh preview URLs, re-open the Hetzner tunnel, re-register Portless + the relay poller, relaunch in-box daemons) and skipsbackend.start; falls back toresume/startonly when paused/stopped. Docker: idempotentstartBox(adocker starton a live container is a no-op).BoxRecord.lastAgent(claude/codex/opencode) — written on every agent launch (foreground + queued, viarecordLastAgent). Durable, unlike the in-box session pointers cleared on stop, so recover knows which agent to relaunch/attach. Only signal available for an adopted box.restoreAgentSessionslaunchFresh— startslastAgentfresh when nothing is resumable (adopted box / cleared pointer; the only path for opencode).backend.list()+ theagentbox.nametag, fresh relay/bridge tokens that reach the in-box agent when reconnect relaunches the ctl daemon (it writes/run/agentbox/relay.env). Hetzner adoption needs the box's per-host SSH key; a box created elsewhere can't be controlled and recover says so.Works across all five providers. The attach tail reuses a new exported
attachToRunningAgentfrom theattachcommand.Docs
cli.mdx(Lifecycle),docs/state.md(lastAgent),docs/host-relay.md(recover flow),docs/cloud-providers.md(reconnect-without-power-cycle).Tests / verification
recordLastAgentread-modify-write;restoreAgentSessionslaunch-fresh path.lastAgentpersistence, fresh agent relaunch.optima-b64ea2994): recover refreshed the tunnel + relay poller; after supplying the in-box relay token (the box predated the relay-env fix), a relay-backedgit pushcarried the box's work to origin.https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Note
Medium Risk
Touches provider lifecycle, relay rehydration, and cloud adoption (new tokens/SSH constraints) across all backends; failures are mostly best-effort per box, but incorrect reconnect could leave boxes unreachable until retry.
Overview
Adds
agentbox recoverso host-side wiring (relay registry, cloud pollers, Hetzner tunnel/portless, in-box daemons) can be rebuilt without restarting a sandbox that is still running—after a host reboot, relay restart, or a new CLI on another machine. Flow:ensureRelay+rehydrateFromState, thenprovider.reconnect(box), then relaunch the last agent and optionally attach (--all,--no-attach,--provider <cloud> --adoptto rebuild local state from a live sandbox).Provider.reconnectis new on docker and cloud: cloud callsreEnsureCloudBoxwhen the sandbox is alreadyrunning(onlystart/resumeif paused/stopped); docker uses idempotentstartBoxorunpausewhen paused.BoxRecord.lastAgentis persisted viarecordLastAgenton every claude/codex/opencode launch (foreground, cloud create, queued jobs).restoreAgentSessionsgainsrestoreOnly: resume that one agent from in-box pointers, or start it fresh (including OpenCode); default behavior forstart/unpausestill resumes all resumable agents with pointers only.attachToRunningAgentis shared byattachand recover’s attach step.Reviewed by Cursor Bugbot for commit 3519112. Configure here.