Skip to content

feat(recover): add agentbox recover to reconnect a box without power-cycling#127

Merged
madarco merged 3 commits into
nightlyfrom
feat/recover-command
Jun 29, 2026
Merged

feat(recover): add agentbox recover to reconnect a box without power-cycling#127
madarco merged 3 commits into
nightlyfrom
feat/recover-command

Conversation

@madarco

@madarco madarco commented Jun 29, 2026

Copy link
Copy Markdown
Owner

What

Adds agentbox recover — re-establishes a box's host-side connectivity (relay registry + CloudBoxPoller, Hetzner SSH ControlMaster + port forwards, host Portless aliases, the detached agent session) without power-cycling the box. For after a host reboot, a relay restart, or a fresh CLI process on another machine — situations where start/unpause would needlessly power-cycle (and can't touch a box missing from local state at all).

agentbox recover [box]                      # current project's box
agentbox recover --all                       # every box in state (no attach)
agentbox recover --no-attach                 # restore only
agentbox recover --provider <cloud> --adopt [id|name]   # rebuild state for a box created elsewhere

How

  • Provider.reconnect(box) — the no-power-cycle sibling of start. Cloud: probeState; if running, runs reEnsureCloudBox directly (refresh preview URLs, re-open the Hetzner tunnel, re-register Portless + the relay poller, relaunch in-box daemons) and skips backend.start; falls back to resume/start only when paused/stopped. Docker: idempotent startBox (a docker start on a live container is a no-op).
  • BoxRecord.lastAgent (claude/codex/opencode) — written on every agent launch (foreground + queued, via recordLastAgent). Durable, unlike the in-box session pointers cleared on stop, so recover knows which agent to relaunch/attach. Only signal available for an adopted box.
  • restoreAgentSessions launchFresh — starts lastAgent fresh when nothing is resumable (adopted box / cleared pointer; the only path for opencode).
  • Adoptbackend.list() + the agentbox.name tag, fresh relay/bridge tokens that reach the in-box agent when reconnect relaunches the ctl daemon (it writes /run/agentbox/relay.env). Hetzner adoption needs the box's per-host SSH key; a box created elsewhere can't be controlled and recover says so.

Works across all five providers. The attach tail reuses a new exported attachToRunningAgent from the attach command.

Docs

cli.mdx (Lifecycle), docs/state.md (lastAgent), docs/host-relay.md (recover flow), docs/cloud-providers.md (reconnect-without-power-cycle).

Tests / verification

  • Unit: recordLastAgent read-modify-write; restoreAgentSessions launch-fresh path.
  • Full suite green (build, typecheck, lint, 649 tests).
  • Docker verified live: relay rehydrate, reconnect with container StartedAt unchanged (no power-cycle), lastAgent persistence, fresh agent relaunch.
  • Exercised live on a real Hetzner box (optima-b64ea2994): recover refreshed the tunnel + relay poller; after supplying the in-box relay token (the box predated the relay-env fix), a relay-backed git push carried the box's work to origin.

https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup


Note

Medium Risk
Touches provider lifecycle, relay rehydration, and cloud adoption (new tokens/SSH constraints) across all backends; failures are mostly best-effort per box, but incorrect reconnect could leave boxes unreachable until retry.

Overview
Adds agentbox recover so host-side wiring (relay registry, cloud pollers, Hetzner tunnel/portless, in-box daemons) can be rebuilt without restarting a sandbox that is still running—after a host reboot, relay restart, or a new CLI on another machine. Flow: ensureRelay + rehydrateFromState, then provider.reconnect(box), then relaunch the last agent and optionally attach (--all, --no-attach, --provider <cloud> --adopt to rebuild local state from a live sandbox).

Provider.reconnect is new on docker and cloud: cloud calls reEnsureCloudBox when the sandbox is already running (only start/resume if paused/stopped); docker uses idempotent startBox or unpause when paused.

BoxRecord.lastAgent is persisted via recordLastAgent on every claude/codex/opencode launch (foreground, cloud create, queued jobs). restoreAgentSessions gains restoreOnly: resume that one agent from in-box pointers, or start it fresh (including OpenCode); default behavior for start/unpause still resumes all resumable agents with pointers only.

attachToRunningAgent is shared by attach and recover’s attach step.

Reviewed by Cursor Bugbot for commit 3519112. Configure here.

A box's host-side state (the relay's in-memory registry + CloudBoxPoller,
the Hetzner SSH ControlMaster + port forwards, the host Portless aliases,
the detached agent tmux session) is separate from the box and is lost on a
host reboot / relay restart / new CLI process while the sandbox keeps
running. `start`/`unpause` only fix this by power-cycling the box and can't
touch a box missing from local state at all.

`agentbox recover [box]`:
  - ensures the host relay is up and rehydrates every box into it,
  - calls the new `Provider.reconnect(box)` — the no-power-cycle sibling of
    `start`: cloud re-runs `reEnsureCloudBox` (refresh preview URLs, re-open
    the Hetzner tunnel, re-register Portless + the relay poller, relaunch
    in-box daemons) without `backend.start`; docker re-runs the idempotent
    `startBox`,
  - relaunches the agent the box was running (resuming, or starting
    `box.lastAgent` fresh) and attaches.

Adds `BoxRecord.lastAgent` (claude/codex/opencode), written on every agent
launch (foreground + queued via `recordLastAgent`) — durable, unlike the
in-box session pointers cleared on stop, so recover knows which agent to
bring back.

`recover --provider <cloud> --adopt [ref]` rebuilds local state for a
sandbox missing from this host (from `backend.list()` + the agentbox.name
tag), minting fresh relay/bridge tokens that reach the in-box agent when
reconnect relaunches the ctl daemon (it writes /run/agentbox/relay.env).
Hetzner adoption needs the box's per-host SSH key; a box created elsewhere
can't be controlled and recover says so.

Works across all five providers. Docs updated (cli.mdx, state.md,
host-relay.md, cloud-providers.md). Unit tests for recordLastAgent and the
restoreAgentSessions launch-fresh path; docker reconnect + lastAgent +
fresh-launch verified live (StartedAt unchanged → no power-cycle).

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
@vercel

vercel Bot commented Jun 29, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
agentbox-web Skipped Skipped Jun 29, 2026 9:09pm

Request Review

Comment thread apps/cli/src/agent-sessions.ts
…ession

Bugbot: `recover` passed the target agent to `restoreAgentSessions` only to
gate the fresh-launch pass, while pass 1 still resumed every resumable agent
that had an in-box pointer — so recover could resurrect an unrelated Claude/
Codex session (possibly from a stale pointer) alongside the intended agent.

Rework `restoreAgentSessions`: `restoreOnly` (was `launchFresh`) now scopes the
whole restore to that one agent — resume it if there's a live/resumable
session, else start it fresh — and touches nothing else. `start`/`unpause`
(no `restoreOnly`) keep the resume-every-running-agent semantics. A box created
before `lastAgent` existed passes `undefined` and so falls back to resume-all.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
@madarco

madarco commented Jun 29, 2026

Copy link
Copy Markdown
Owner Author

bugbot run

Comment thread packages/sandbox-docker/src/docker-provider.ts
… start

Bugbot: docker `reconnect` always delegated to `startBox` (`docker start`),
which errors on a paused container ("cannot start a paused container"). So
`agentbox recover` on a paused docker box left it frozen while reporting the
relay/portless recovery as success — exec, agent restore, and attach kept
failing until the user ran `unpause`.

Probe state first: paused -> `unpauseBox` (resumes the still-frozen
ctl/dockerd/vnc; the portless alias survives a pause); running/stopped ->
`startBox` (idempotent, relaunches dead daemons + re-aliases portless);
missing/destroyed -> clear error. Mirrors the cloud provider's state-routed
reconnect.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
@madarco

madarco commented Jun 29, 2026

Copy link
Copy Markdown
Owner Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 3519112. Configure here.

@madarco madarco merged commit 3c611d4 into nightly Jun 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant