Nightly #132

Merged
madarco merged 41 commits into main from nightly
Jun 30, 2026
madarco (Owner) commented Jun 30, 2026

Note

Medium Risk
Touches cloud reconnect, Hetzner firewall self-heal, SSH config alias semantics, and agent session restore—important for connectivity and git workflows but mostly additive CLI paths with tests on cp/ssh/restore.

Overview
This nightly batch extends the CLI around reconnection, file copy, git, and external-app SSH, plus Codex onboarding and small cloud attach fixes.

agentbox recover rehydrates the relay, calls provider.reconnect (no power-cycle when the sandbox is already up), restores the box’s lastAgent via restoreAgentSessions (resume or fresh start, including OpenCode), and optionally attaches. It supports --all, plus --provider … --adopt for sandboxes missing from local state. Hetzner connect failures can trigger a one-shot firewall sync through withFirewallRepair (on establish paths only, not on mid-session reconnect).

agentbox cp is now variadic: multiple host or box sources with the last path as the destination; upload size limits apply per source. git push --host-only (and --as / --force) lands the box branch in the host repo without hitting a remote.

agentbox shell --ssh-config (Hetzner with a persistent key) writes ~/.ssh/config using the box name as the Host alias (with legacy agentbox-cloud-* cleanup), shared with code / open via ensureCloudSshAlias. Cloud --no-attach agent launches now start detached tmux immediately (aligned with Docker). Agent launches record recordLastAgent for recover.

agentbox install codex (also run as a step of agentbox install) registers the Codex marketplace/plugin and enables it in ~/.codex/config.toml; dev checkouts can use a local marketplace and skill symlinks. Runtime staging adds agentbox-portless-trust so in-box clients trust the Portless CA when the proxy runs in TLS mode. Docs/skills cover recover, multi-cp, host-only push, and Codex/Claude SSH links.

Reviewed by Cursor Bugbot for commit 1d72ceb.

madarco added 30 commits June 26, 2026 11:43
On cloud providers (daytona/hetzner/vercel/e2b) the bare `agentbox
claude`/`codex`/`opencode` commands and `agentbox fork` route through
`cloudAgentCreate`, which on `--no-attach` returned before the agent's
tmux session was ever created. The cloud session is created lazily by the
attach step, so skipping attach skipped the agent entirely — the box came
up with no agent running, contradicting the documented behavior ("create
the box and start the agent session, but do not attach"). Docker was
unaffected: it creates the session before the attach check.

Fix `cloudAgentCreate` to call `cloudAgentStartDetached` in the
`attach === false` branch — the same helper the `-i` queue worker uses,
which starts a detached tmux session and verifies it stayed up (fail-loud
on immediate exit / credential rejection). The `<agent> start` subcommand
cloud branches did the same lazy no-op (printing "started lazily on
attach"); they now resolve args first, then start the detached session in
background mode. Cloud now matches docker on every path.

Verified end-to-end: docker `--no-attach` still starts the session
(regression guard), and e2b `--no-attach` now brings up a live, logged-in
claude tmux session where before it created an empty box.

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
cloudAgentStartDetached launched the agent with raw extraArgs only, so a
background `agentbox <agent> start <box> --no-attach` (and idle-resumed
creates) started a fresh agent instead of resuming the box's recorded
claude/codex session — the interactive cloudAgentAttach path applies
agentResumeArgs when args are empty, but the detached path did not.

Apply the same resume-args resolution in cloudAgentStartDetached so the
detached path is symmetric with attach. The `-i` queue path always seeds a
prompt (non-empty extraArgs), so this no-ops there.
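
A minimal sketch of the symmetric resolution described above, assuming the resume arguments are computed elsewhere. `resolveLaunchArgs` is a hypothetical name, and the real `agentResumeArgs` signature is not shown in this PR:

```typescript
// Hypothetical helper: both the attach path and the detached path would call it.
function resolveLaunchArgs(extraArgs: string[], resumeArgs: string[]): string[] {
  // Explicit args (for example a prompt seeded by the `-i` queue) always win.
  if (extraArgs.length > 0) return extraArgs;
  // Otherwise fall back to resuming the box's recorded session, if there is one.
  return resumeArgs;
}
```

With this shape the `-i` queue path is unchanged, because its `extraArgs` are never empty.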

Found by Cursor Bugbot on PR #116.

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
…te_*.sqlite

We stopped seeding Codex's state_*.sqlite index (commit 2eaf2b4), so Codex
now creates it at startup instead of receiving it pre-uploaded. The create
failed with a permission error because the directory wasn't owned by vscode
(the user the agent runs as). Two distinct ownership defects:

1. The agent home dirs (~/.codex, ~/.claude, ~/.local/share/opencode) were not
   reliably vscode-owned in cloud templates (E2B's base image ships a `node`
   user; the root `npm install -g @openai/codex` bake step left ~/.codex as
   node:node). This breaks even a plain `agentbox codex` start. Fixed with a
   cheap, idempotent create-time chown (ensureAgentHomeDirsOwned) — no re-bake.

2. The upload primitives only chowned the final landed path, not the parent
   directory chain they mkdir -p'd as root. Session-teleport lands a rollout at
   ~/.codex/sessions/YYYY/MM/DD/, leaving that chain root-owned so Codex can't
   write a new rollout / its sqlite index. Mirror the carry.ts parent-walk fix
   in both upload primitives (cloud-cp.ts + docker box-cp.ts), gated on the dest
   being under /home/vscode.

Chowns are name/id-derived (vscode / id -un), not hardcoded 1000, since the
vscode uid varies per provider (docker/hetzner=1000, vercel=1001, e2b=1002).

Claude-Session: https://claude.ai/code/session_0152GmbNW3e7QpXNkQFd3MB2
Bugbot: when an upload's resolved finalPath was exactly /home/vscode, the
`=== BOX_HOME` branch of the gate let the parent walk run with dirname=/home,
reassigning /home itself to the agent user. Gate strictly on
`startsWith(BOX_HOME + '/')` (a trailing segment), matching carry.ts. Applies
to both cloud-cp.ts and docker box-cp.ts; adds a regression test.
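
The strict gate can be sketched as a pure function that lists which parent directories a parent walk would re-own. This is an illustration only: `parentsToChown` is a hypothetical name, and stopping below the home directory itself is an assumption of the sketch.

```typescript
const BOX_HOME = "/home/vscode";

// Returns the parent directories (deepest first) that a root-run `mkdir -p`
// may have created below the box home and that should be re-owned.
function parentsToChown(finalPath: string, boxHome: string = BOX_HOME): string[] {
  // Strict gate: require at least one path segment below the home directory.
  // A bare "/home/vscode" (or a sibling such as "/home/vscodex") is rejected.
  if (!finalPath.startsWith(boxHome + "/")) return [];
  const parents: string[] = [];
  let dir = finalPath.slice(0, finalPath.lastIndexOf("/"));
  // Walk upward, stopping before the home directory itself, so "/home" is never touched.
  while (dir.startsWith(boxHome + "/")) {
    parents.push(dir);
    dir = dir.slice(0, dir.lastIndexOf("/"));
  }
  return parents;
}
```

For a rollout landed under `~/.codex/sessions/YYYY/MM/DD/`, the walk yields the date directories, then `sessions`, then `.codex`, and never reaches `/home`.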

Claude-Session: https://claude.ai/code/session_0152GmbNW3e7QpXNkQFd3MB2
…SH (#118)

* feat(cli): add `agentbox shell --ssh-config` for Codex / Claude desktop SSH

Write a `~/.ssh/config` alias on demand so external apps (the Codex app,
Claude desktop) can connect to a box over plain SSH, and surface the
identity-file path + a Codex deep link.

- New `agentbox shell <box> --ssh-config` (+ `--json`): brings the box
  online, writes the alias, prints alias/host/user/identity + the
  `codex://settings/connections/ssh/add?name=<alias>` link and Claude
  desktop instructions. Gated to providers with a persistent per-box key
  (Hetzner) — Docker/Daytona/Vercel/E2B exit cleanly without writing.
- Extract the shared bring-online → buildAttach → parseSshTarget → write
  alias flow into `cloud-ssh.ts` (`resolveCloudSshTarget` /
  `ensureCloudSshAlias`); reuse it from `code` and `open`.
- Alias is now the box name (clean `ssh <box>` + Codex `name=` param);
  add `readAgentboxSshAlias` and surface `ssh alias` / `ssh identity` in
  cloud `inspect`.
- Document the flow in the agentbox-info and fork skills.

Claude-Session: https://claude.ai/code/session_011VoAz7mUaUGh6dKAvr7kAP

* fix(cli): address Cursor Bugbot findings on SSH-config

- Check SSH support (`buildAttach`) before any lifecycle action so an
  unsupported box (e.g. a stopped Docker box) errors without being
  started; add a `bringOnline` option to skip a redundant lifecycle pass.
- `agentbox code` now passes `bringOnline: false` (it already brings the
  box online + waits) — removes the duplicate resume/start.
- Migrate away legacy `agentbox-cloud-<box>` blocks on write/remove so the
  box-name rename doesn't leave stale Host entries behind.
- Warn (don't fail) when `~/.ssh/config` already has a user-authored
  `Host <box>` that could shadow agentbox's entry.
- Tests for legacy-block migration and conflict detection.
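
As an illustration of the legacy-block migration and conflict detection listed above, here is a minimal sketch over the text of `~/.ssh/config`. The function names are hypothetical, and the sketch treats everything from a `Host` line up to the next `Host` or `Match` line as one block. How agentbox marks its own entries is not shown in this PR, so that part is left out.

```typescript
// Returns the patterns of a `Host` line, or null for any other line.
function hostPatterns(line: string): string[] | null {
  const tokens = line.trim().split(/\s+/);
  return tokens[0]?.toLowerCase() === "host" ? tokens.slice(1) : null;
}

// Drops the legacy `Host agentbox-cloud-<box>` block, keeping everything else verbatim.
function stripLegacyAlias(config: string, box: string): string {
  const legacy = `agentbox-cloud-${box}`;
  const kept: string[] = [];
  let skipping = false;
  for (const line of config.split("\n")) {
    const patterns = hostPatterns(line);
    if (patterns !== null) skipping = patterns.includes(legacy);
    else if (/^\s*match\s/i.test(line)) skipping = false; // a Match line also starts a new block
    if (!skipping) kept.push(line);
  }
  return kept.join("\n");
}

// True when some `Host` line already names the alias (a possible shadowing entry).
function hasHostEntry(config: string, alias: string): boolean {
  return config.split("\n").some((line) => hostPatterns(line)?.includes(alias) ?? false);
}
```

A caller would warn, not fail, when `hasHostEntry` reports an existing entry for the box name.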

Claude-Session: https://claude.ai/code/session_011VoAz7mUaUGh6dKAvr7kAP
…erride.md

The box "system prompt" baked at /etc/claude-code/CLAUDE.md (sandbox facts:
DinD, per-box worktree, push/PR/cp via the host relay, identity in
/etc/agentbox/box.env) previously only reached Claude. Codex got none of it.

Codex loads a global personal-instructions file from CODEX_HOME, first-match
of ~/.codex/AGENTS.override.md then ~/.codex/AGENTS.md (no concat, no @import).
At create time we now regenerate ~/.codex/AGENTS.override.md = sentinel + box
facts (read fresh) + the user's own AGENTS.md / authored override folded in
beneath, so the in-box Codex agent reads the same facts. A line-1 sentinel
makes it idempotent and preserves user content (the host ~/.codex is re-synced
before each seed, restoring the source). No-op when the facts file is absent.
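
The composition rule can be sketched as a pure function. The sentinel text and the function name are hypothetical. The sketch also assumes that an authored override takes precedence over `AGENTS.md`, mirroring Codex's first-match lookup; the real generator is a shell script run in the box.

```typescript
// Hypothetical sentinel; the real line-1 marker is not shown in this PR.
const SENTINEL = "<!-- agentbox: generated AGENTS.override.md -->";

// Returns the new override text, or null for the no-op case (no box facts).
function composeOverride(
  boxFacts: string | null,
  userAgents: string | null,
  existingOverride: string | null,
): string | null {
  if (boxFacts === null) return null;
  // An override that starts with the sentinel is our own output, not user content.
  const authored =
    existingOverride !== null && !existingOverride.startsWith(SENTINEL) ? existingOverride : null;
  const userContent = authored ?? userAgents;
  const sections = [SENTINEL, boxFacts.trim()];
  if (userContent !== null && userContent.trim() !== "") sections.push(userContent.trim());
  return sections.join("\n\n") + "\n";
}
```

Feeding the function its own previous output produces the same text again, which is the idempotence the sentinel is meant to provide.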

One shared generator (buildCodexAgentsOverrideScript) drives both paths:
- docker: seedCodexAgentsOverride() seeds the codex-config volume, called
  after seedCodexHooks() in create.ts (post host rsync).
- cloud (daytona/hetzner/vercel/e2b): ensureCodexAgentsOverride() runs the same
  script in-box via backend.exec, wired into cloud-provider.ts.

Verified: codex debug prompt-input shows the box facts in Codex's model-visible
prompt; compose + authored-override + facts-only + no-op cases all check out.

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
Cloud backends signal script failure via a non-zero CloudExecResult.exitCode
rather than throwing, so the prior try/await/log('seeded') reported success even
when the `set -e` script aborted (perms, missing paths) — a box could boot
without box facts while create logs looked healthy. Read the exitCode and log
the failure (still best-effort, never fails create). Found by Cursor Bugbot.

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
…ritten

The shared generator exits 0 in the no-op case (box-facts file absent) without
writing the override, so a bare exitCode===0 check logged a false "seeded" on
the cloud path. Move the success signal into the script as a stdout marker
(CODEX_OVERRIDE_WROTE_MARKER) printed only after the write; both docker and
cloud now key their "seeded" log off the marker, and cloud logs an explicit
"skipped: box-facts file absent" otherwise. Found by Cursor Bugbot.

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
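
A minimal sketch of how the two fixes above combine: the exit code separates failure from success, and the stdout marker separates a real write from the no-op. The marker value and the `classifySeed` name are placeholders, since only the constant's name appears in the commit message.

```typescript
// Placeholder value; only the constant's name is given in the commit message.
const CODEX_OVERRIDE_WROTE_MARKER = "__CODEX_OVERRIDE_WROTE__";

type SeedOutcome = "seeded" | "skipped" | "failed";

function classifySeed(exitCode: number, stdout: string): SeedOutcome {
  // Cloud backends report script failure through the exit code, not an exception.
  if (exitCode !== 0) return "failed";
  // Exit 0 alone is ambiguous: the script also exits 0 when the facts file is absent.
  return stdout.includes(CODEX_OVERRIDE_WROTE_MARKER) ? "seeded" : "skipped";
}
```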
…lhost works

When the host Portless proxy runs in TLS mode, the symmetric <box>.localhost
URL is served inside the box by a self-signed CA the box doesn't trust:
- hetzner: the in-VPS mirror's own CA at /root/.portless/ca.pem
- docker: the host CA bind-mounted at /home/vscode/.portless/ca.pem

portless proxy start only trusts the CA in the Linux system store, not the box
user's NSS db, so the in-box VNC Chromium window and Playwright (via Codex)
rejected the cert with an HTTPS error. Fix by trusting the CA across every
in-box client:

- New baked helper agentbox-portless-trust: installs a CA into the system store
  (update-ca-certificates) + the box user's NSS db (certutil), idempotent and
  best-effort, prints the system CA path for NODE_EXTRA_CA_CERTS.
- Bake libnss3-tools (certutil) into the hetzner snapshot + docker base image.
- hetzner startInBoxPortless: when tls, run the helper on /root/.portless/ca.pem
  and export NODE_EXTRA_CA_CERTS via /etc/profile.d.
- docker create: when the resolved portless URL is https, run the helper on the
  bind-mounted host CA + drop the same profile.d export.

No-TLS host proxies (the --no-tls -p 1355 default) serve plain http and skip
this entirely. Requires a snapshot re-bake / docker image rebuild to pick up
libnss3-tools + the helper.

Claude-Session: https://claude.ai/code/session_0152GmbNW3e7QpXNkQFd3MB2
…ugin

Installing agentbox left the Codex plugin (the `agentbox`/`agentbox-info` skills)
a manual chore: `codex plugin marketplace add` + `codex plugin add`, then a
toggle in the Codex app. Automate all three.

New `install-codex.ts` (mirrors install-herdr.ts), gated on Codex being present
(~/.codex + `codex` on PATH), best-effort so it never aborts `agentbox install`:
- `codex plugin marketplace add madarco/agentbox` + `codex plugin add agentbox@agentbox`
- enable by default: append `[plugins."agentbox@agentbox"] enabled = true` to
  ~/.codex/config.toml when no such table exists. Codex has no `plugin enable`
  CLI — this is the same key the TUI toggle writes.

Robustness:
- Skips the network re-add when already installed (codex plugin add re-enables a
  deliberately-disabled plugin), so a user's explicit disable is respected;
  --force bypasses to re-enable.
- The config write only appends when the key is absent — never duplicates the
  table (a second table is a TOML parse error) and respects enabled = false.

Wired into the `agentbox install` wizard (runs when Codex is detected) and a
standalone `agentbox install codex`. Adds smol-toml to apps/cli (read-only parse
for the presence check). Unit tests cover the upsert/respect-disable logic; e2e
verified against codex 0.142 (fresh -> enabled, disabled re-run -> stays off,
--force -> re-enabled). Docs: cli.mdx + plugin README.

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
Addresses the CI failure and Cursor Bugbot findings on the enable step:
- commands.test.ts asserted the exact `install` subcommand list — add `codex`.
- The managed-block approach conflicted with the `[plugins."agentbox@agentbox"]`
  table Codex itself writes (`plugin add` / TUI toggle): a second table is a TOML
  parse error, and the strip+regenerate path could (a) leave a stale duplicate
  unwritten and (b) override a disable set inside the block.

Replace it: parse config.toml read-only and
- append a plain `[plugins."agentbox@agentbox"] enabled = true` table only when
  the key is ABSENT (Codex usually writes it itself on `plugin add`);
- respect a present value (enabled defaults true; only explicit `false` is off);
- with --force, flip a disabled entry to true via a targeted in-place line edit
  that preserves the rest of the file (comments/order/formatting).
Always write when the text changed (fixes the "cleanup not written" gap).

E2E re-verified vs codex 0.142: fresh -> enabled; disabled re-run -> stays off
with comments preserved; --force -> re-enabled, file still single-table valid TOML.
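
The append-only-when-absent and targeted-edit rules can be illustrated with plain line matching. This is a simplified sketch, not the shipped logic: the commit uses a read-only TOML parse (smol-toml) for the presence check, while the hypothetical `upsertPluginEnabled` below only recognizes the exact table header on its own line.

```typescript
const TABLE_HEADER = '[plugins."agentbox@agentbox"]';

function upsertPluginEnabled(toml: string, force: boolean): string {
  const lines = toml.split("\n");
  const start = lines.findIndex((line) => line.trim() === TABLE_HEADER);
  if (start === -1) {
    // Key absent: append exactly one table, separated by a blank line.
    const prefix = toml === "" ? "" : toml.endsWith("\n") ? toml + "\n" : toml + "\n\n";
    return `${prefix}${TABLE_HEADER}\nenabled = true\n`;
  }
  // Table present: respect it unless --force was given.
  if (!force) return toml;
  for (let i = start + 1; i < lines.length; i++) {
    const line = lines[i];
    if (line === undefined || line.trim().startsWith("[")) break; // next table reached
    if (/^\s*enabled\s*=\s*false\b/.test(line)) {
      // Targeted in-place edit: only this value changes, trailing comments survive.
      lines[i] = line.replace("false", "true");
      return lines.join("\n");
    }
  }
  return toml; // already enabled (explicitly or by default)
}
```

Because the table is never appended when a header already exists, the file cannot end up with the duplicate table that makes TOML parsing fail.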

Claude-Session: https://claude.ai/code/session_01PTY4KwAeZdAVvgSWxjpYfs
From a source checkout, `agentbox install codex` now points the Codex
marketplace at the local repo (`source_type = "local"`) instead of the
`madarco/agentbox` GitHub slug, re-syncs the bundle skill copy, stages the
plugin from the working tree, then symlinks the staged per-skill SKILL.md back
to the repo so skill edits go live on the next Codex restart. A published npm
install is unaffected (no `plugins/`/`.agents/` ships, so it always uses GitHub).

- New `apps/cli/src/lib/source-checkout.ts`: shared `isSourceCheckout`,
  `resolveHostSkillsDir`, and `resolveDevRepoRoot` (repo root only in a checkout
  carrying the Codex sources, else null). `install.ts` now imports these.
- `--no-dev` forces the published GitHub path even inside a checkout; a
  source-type conflict (local <-> git) is handled by remove-then-add.
- Symlink the skill *files* only — a whole-dir symlink makes Codex report the
  plugin "not installed".
- Tests for the source-checkout helpers and `marketplaceSource`; docs in
  docs/development.md.

Claude-Session: https://claude.ai/code/session_01An9tT8HqjQoGKKVWuYg4bb
…mpts

Tell the in-box agent that `agentbox.yaml` services/containers start
automatically (check with `agentbox-ctl status`) and that web services are
reachable at https://<AGENTBOX_BOX_HOST> from both box and host. Wording is
tailored per provider (local proxy for docker, cloud URL for the cloud ones).

Claude-Session: https://claude.ai/code/session_01An9tT8HqjQoGKKVWuYg4bb
- Attempt the dev skill symlink whenever a staged dir exists, not only when
  `codex plugin add` exits 0: a non-zero exit is treated as "already installed"
  and continues, leaving a staged cache we still want live-symlinked.
- Never symlink into a non-existent manifest-version path: if the manifest
  version dir is absent and the cache doesn't hold exactly one child, bail with
  "staged plugin dir not found" instead of mkdir-ing an orphan tree Codex ignores.

Claude-Session: https://claude.ai/code/session_01An9tT8HqjQoGKKVWuYg4bb
…r on failure

The Claude native installer (`curl https://claude.ai/install.sh | bash`) can
get an intermittent HTTP 403 from Cloudflare on cloud-datacenter egress IPs
(Hetzner among them) under load. The bare `curl | bash` masked it (pipeline
exit = bash's 0), so a 403 silently baked a snapshot with no `claude` — boxes
from it had no agent, so the in-box tmux session died instantly and `attach`
crash-looped on "no server running on /tmp/tmux-1000/default".

Replace the masked install with a `retry_backoff` helper: retry the native
installer 3x with 60s then 240s backoff (~5 min), keep `set -o pipefail`, and
fold `command -v claude` into the retried command so a "succeeded but absent"
result also retries. If all attempts fail, abort the bake (exit 71) rather
than ship a claude-less snapshot. Applied to hetzner / vercel / e2b bake
scripts and the docker Dockerfile (sh/dash plain-loop variant).
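
The retry shape can be sketched as follows. The real `retry_backoff` helper is shell; this TypeScript version is a synchronous illustration, with the sleep injected so the schedule is easy to see and test.

```typescript
// Runs `attempt` once, then once more after each delay, until it succeeds.
// With delays [60, 240] that is three attempts over roughly five minutes.
function retryBackoff(
  attempt: () => boolean,
  delaysSec: number[],
  sleep: (seconds: number) => void,
): boolean {
  if (attempt()) return true;
  for (const delay of delaysSec) {
    sleep(delay);
    if (attempt()) return true;
  }
  // Caller's job: abort the bake (the commit uses exit code 71) instead of shipping.
  return false;
}
```

Here `attempt` stands for the whole retried command, that is, the installer run plus the `command -v claude` check, so an install that exits 0 but leaves no binary still counts as a failure.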

No npm fallback: `npm install -g @anthropic-ai/claude-code` lacks native-only
features and lands at /usr/bin/claude, mismatching the host-seeded
installMethod=native and tripping Claude Code's "missing or broken" doctor
warning.

`prepareHetzner` now special-cases exit 71 with an actionable message (the
generic one showed empty stderr because the install runs `bash -x ... 2>&1 |
tee`, merging stderr into stdout).

Known gap (deferred): the 403 can outlast the ~5-min retry window, so prepare
can still fail and need a manual re-run. The validated reliable fix
(host-proxies the native binary download, places it at ~/.local/bin/claude) is
documented in docs/hertzner_backlog.md as a follow-up.

Claude-Session: https://claude.ai/code/session_019m5WHxP4vmsoXaHUhQdY9e
fix(hetzner): retry Claude native installer with backoff + clear error
Adding a box as a remote host (Codex app, VS Code Remote-SSH, plain
`ssh <box>`) opened the session in /home/vscode instead of the project at
/workspace. SSH config has no start-directory directive, and the only
client-side option (RemoteCommand) would break scp/sftp and VS Code's
remote bootstrap on the same shared alias. So fix it server-side: make
interactive login shells cd into /workspace via the existing
/etc/profile.d/agentbox.sh shim every box installs.

Guarded to interactive shells only (scp/sftp and `ssh box <cmd>` untouched)
and only when still at $HOME (never overrides a caller-chosen dir, e.g.
agentbox's own tmux `-c /workspace`). Applied to all provider shims for
consistency: hetzner, vercel, e2b, and the canonical docker Dockerfile.box.

Claude-Session: https://claude.ai/code/session_01An9tT8HqjQoGKKVWuYg4bb
feat(box): land interactive SSH/login shells in /workspace
The host `~/.codex` is ~1.1 GB and was being synced into boxes almost whole:
~485 MB of macOS aarch64 standalone release binaries (`packages/`), a ~238 MB
plugin app-server runtime (`plugins/.plugin-appserver`), the macOS
`Codex Computer Use.app` bundle (`computer-use/`), host session archives, and
regenerable caches (`.tmp` ~213 MB, `tmp`, `cache`, `vendor_imports`, `sqlite`,
`models_cache.json`). None of it is usable in a Linux box — the in-box codex is
npm-installed and rebuilds these caches on demand.

Exclude all of it from both codex staging paths:
- `CODEX_RSYNC_EXCLUDES` (host-stage.ts) — the cloud bake path (hetzner/vercel/
  e2b/daytona `stageCodexStaticForUpload`). Dry-run: staged tarball 820 MB -> 482 KB.
- the docker `agentbox-codex-config` volume rsync (codex.ts), plus its `rm -rf`
  purge so existing shared volumes get cleaned on the next sync. Verified live:
  a fresh docker box's `~/.codex` dropped 1.5 GB -> 59 MB and `codex exec` still
  returns a real turn.

Config / auth / skills / prompts / rules / memories / plugins are still synced,
so codex keeps working — just without the host-only ballast.

Claude-Session: https://claude.ai/code/session_019m5WHxP4vmsoXaHUhQdY9e
fix(codex): drop heavy host-only artifacts from codex config staging
Cloud boxes (hetzner/daytona/vercel/e2b) have no global env primitive, so
the in-box agent — launched via a tmux login shell — only saw the relay
token through /etc/agentbox/box.env. Commit b9e4ebf made the ctl daemon
overwrite box.env without the token (correctly, to keep secrets out of a
0644 file), which severed the agent's only channel: `agentbox-ctl git push`
failed with "no relay configured".

Persist the per-box relay URL + token to a 0600 /run/agentbox/relay.env
(tmpfs, never snapshotted) written by the daemon once it validates its own
token, and have agentbox-ctl's relay clients (postRpc, RelayClient) fall
back to it when env is absent. agentbox-ctl is the only in-box relay
consumer, so a single on-demand file read fixes every path — the agent and
the host-driven `agentbox git push` on all backends — without spraying the
token into every login shell's env. The bridge token stays daemon-only.

Also stop hetzner cloud-init from copying the relay/bridge tokens into the
0644 box.env (they now travel via relay.env / the daemon process env).
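
The env-then-file fallback might look like the sketch below. The variable names `AGENTBOX_RELAY_URL` and `AGENTBOX_RELAY_TOKEN` and the simple `KEY=value` file format are assumptions for illustration. The file text is passed in so the function stays pure; the real client would read `/run/agentbox/relay.env` on demand.

```typescript
interface RelayEnv { url: string; token: string }

// Hypothetical variable names; the PR does not list them.
const URL_KEY = "AGENTBOX_RELAY_URL";
const TOKEN_KEY = "AGENTBOX_RELAY_TOKEN";

// Parses simple KEY=value lines, ignoring blanks and `#` comments.
function parseEnvFile(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const raw of text.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    out[line.slice(0, eq)] = line.slice(eq + 1);
  }
  return out;
}

function pickRelay(vars: Record<string, string | undefined>): RelayEnv | null {
  const url = vars[URL_KEY];
  const token = vars[TOKEN_KEY];
  return url && token ? { url, token } : null;
}

// Process env wins; the 0600 file is only consulted when env lacks the relay settings.
function resolveRelayEnv(
  env: Record<string, string | undefined>,
  relayFileText: string | null,
): RelayEnv | null {
  return pickRelay(env) ?? (relayFileText === null ? null : pickRelay(parseEnvFile(relayFileText)));
}
```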

Claude-Session: https://claude.ai/code/session_01SAturA5Fs2XHzzondT6DDv
Correct environment.mdx's docker-only box.env claim, and document in
host-relay.md / cloud-providers.md / the hetzner+vercel backlogs that the
cloud relay token now reaches agentbox-ctl via a 0600 /run/agentbox/relay.env
(read by resolveRelayEnv), not login-shell env — guarding the b9e4ebf
regression. Bridge token stays daemon-only.

Claude-Session: https://claude.ai/code/session_01SAturA5Fs2XHzzondT6DDv
fix(cloud): restore relay token for in-box agent via 0600 relay.env
A box's host-side state (the relay's in-memory registry + CloudBoxPoller,
the Hetzner SSH ControlMaster + port forwards, the host Portless aliases,
the detached agent tmux session) is separate from the box and is lost on a
host reboot / relay restart / new CLI process while the sandbox keeps
running. `start`/`unpause` only fix this by power-cycling the box and can't
touch a box missing from local state at all.

`agentbox recover [box]`:
  - ensures the host relay is up and rehydrates every box into it,
  - calls the new `Provider.reconnect(box)` — the no-power-cycle sibling of
    `start`: cloud re-runs `reEnsureCloudBox` (refresh preview URLs, re-open
    the Hetzner tunnel, re-register Portless + the relay poller, relaunch
    in-box daemons) without `backend.start`; docker re-runs the idempotent
    `startBox`,
  - relaunches the agent the box was running (resuming, or starting
    `box.lastAgent` fresh) and attaches.

Adds `BoxRecord.lastAgent` (claude/codex/opencode), written on every agent
launch (foreground + queued via `recordLastAgent`) — durable, unlike the
in-box session pointers cleared on stop, so recover knows which agent to
bring back.

`recover --provider <cloud> --adopt [ref]` rebuilds local state for a
sandbox missing from this host (from `backend.list()` + the agentbox.name
tag), minting fresh relay/bridge tokens that reach the in-box agent when
reconnect relaunches the ctl daemon (it writes /run/agentbox/relay.env).
Hetzner adoption needs the box's per-host SSH key; a box created elsewhere
can't be controlled and recover says so.

Works across all five providers. Docs updated (cli.mdx, state.md,
host-relay.md, cloud-providers.md). Unit tests for recordLastAgent and the
restoreAgentSessions launch-fresh path; docker reconnect + lastAgent +
fresh-launch verified live (StartedAt unchanged → no power-cycle).

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
…ession

Bugbot: `recover` passed the target agent to `restoreAgentSessions` only to
gate the fresh-launch pass, while pass 1 still resumed every resumable agent
that had an in-box pointer — so recover could resurrect an unrelated Claude/
Codex session (possibly from a stale pointer) alongside the intended agent.

Rework `restoreAgentSessions`: `restoreOnly` (was `launchFresh`) now scopes the
whole restore to that one agent — resume it if there's a live/resumable
session, else start it fresh — and touches nothing else. `start`/`unpause`
(no `restoreOnly`) keep the resume-every-running-agent semantics. A box created
before `lastAgent` existed passes `undefined` and so falls back to resume-all.
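
The scoping rule can be sketched as a small planning function. `planRestore` and its types are hypothetical; the real `restoreAgentSessions` also performs the launches.

```typescript
type Agent = "claude" | "codex" | "opencode";
interface RestoreAction { agent: Agent; mode: "resume" | "fresh" }

// `resumable` lists the agents that currently have a live or resumable session.
function planRestore(resumable: Agent[], restoreOnly?: Agent): RestoreAction[] {
  if (restoreOnly === undefined) {
    // start/unpause, or a box created before `lastAgent` existed: resume everything.
    return resumable.map((agent): RestoreAction => ({ agent, mode: "resume" }));
  }
  // recover: touch exactly one agent, resuming it if possible, else starting it fresh.
  return [{ agent: restoreOnly, mode: resumable.includes(restoreOnly) ? "resume" : "fresh" }];
}
```

With `restoreOnly` set, a stale pointer for some other agent can no longer cause an unrelated session to come back.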

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
… start

Bugbot: docker `reconnect` always delegated to `startBox` (`docker start`),
which errors on a paused container ("cannot start a paused container"). So
`agentbox recover` on a paused docker box left it frozen while reporting the
relay/portless recovery as success — exec, agent restore, and attach kept
failing until the user ran `unpause`.

Probe state first: paused -> `unpauseBox` (resumes the still-frozen
ctl/dockerd/vnc; the portless alias survives a pause); running/stopped ->
`startBox` (idempotent, relaunches dead daemons + re-aliases portless);
missing/destroyed -> clear error. Mirrors the cloud provider's state-routed
reconnect.
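
The state routing reads naturally as a switch. A minimal sketch, in which the state names and `routeDockerReconnect` are illustrative and the returned strings merely name the helpers mentioned above:

```typescript
type DockerBoxState = "paused" | "running" | "stopped" | "missing";
type ReconnectStep = "unpauseBox" | "startBox";

function routeDockerReconnect(state: DockerBoxState): ReconnectStep {
  switch (state) {
    case "paused":
      // `docker start` refuses a paused container, so unpause instead.
      return "unpauseBox";
    case "running":
    case "stopped":
      // Idempotent: relaunches dead daemons and re-aliases portless.
      return "startBox";
    case "missing":
      throw new Error("box container is missing or destroyed; cannot reconnect");
  }
}
```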

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
feat(recover): add `agentbox recover` to reconnect a box without power-cycling
madarco added 10 commits June 29, 2026 22:28
…-up)

`launchCloudCtlDaemon` spawned `agentbox-ctl daemon` unconditionally on every
create/start/unpause/resume/recover (via `reEnsureCloudBox`). The daemon has no
singleton guard — when :8788 is already bound it catches EADDRINUSE and keeps
running — so each relaunch leaked another idle daemon (a live Hetzner box had 9,
only one serving the relay). `recover`, which targets already-running boxes, hit
this every time.

Guard the spawn with a liveness probe: a healthy ctl daemon already serving the
box relay's /healthz (port parsed from relayUrl, default 8788) means the launch
is skipped — it already has the right env (stable across a host-only reconnect),
and only a real sandbox restart kills it (probe fails -> relaunch). Probe via
node (guaranteed present; curl may not be).
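
A sketch of the guard's two decisions, assuming the probe result is supplied by the caller. Both function names are hypothetical, and the regular expression is a deliberately small parser that ignores IPv6 literals and userinfo.

```typescript
const DEFAULT_RELAY_PORT = 8788;

// Extracts an explicit port from a relay URL, falling back to the default.
function relayPort(relayUrl: string | undefined): number {
  if (relayUrl === undefined) return DEFAULT_RELAY_PORT;
  const match = /^[a-z][a-z0-9+.-]*:\/\/[^\/:?#]+:(\d+)/i.exec(relayUrl);
  return match?.[1] !== undefined ? Number(match[1]) : DEFAULT_RELAY_PORT;
}

// Launch only when nothing healthy is already serving /healthz on that port.
function shouldLaunchCtlDaemon(healthzOk: boolean): boolean {
  return !healthzOk;
}
```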

Same guard for docker's `launchCtlDaemon` (`pgrep -f 'agentbox-ctl daemon$'`,
end-anchored so it can't match the `sh -c` wrapper): docker doesn't pile up
today because the daemon dies with the container, but the new docker `reconnect`
runs `startBox` against a possibly-running container and would otherwise
double-launch.

Verified live: `recover` twice on the Hetzner box left the daemon count flat and
:8788 still healthy on the original pid.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
fix(ctl): make the in-box ctl-daemon launch idempotent (stop the pile-up)
A Hetzner box's firewall locks SSH to the host's egress IP at create time and is
never re-synced. When the host IP changes (laptop moves networks), every comms op
fails with an opaque `ssh ControlMaster failed … Operation timed out` and the
user has to know to run `agentbox hetzner firewall sync`. Two fixes, both gated
to the connection-failure path so the happy path never pays the egress-detect
cost, and the firewall is re-synced ONLY when the IP actually changed:

1. Hint (read-only): wrap `tunnels.open` in `ensureTunnel` — the one choke point
   all of exec/scp/forward/poller/attach funnel through. On a real mismatch it
   turns the opaque timeout into "firewall allows X but your egress is now Y —
   run `firewall sync`/`recover`". Safe on a checkpoint drop (box merely stopped,
   IP unchanged → no hint).

2. Auto-sync, scoped to connection ESTABLISHMENT only. New optional
   `repairReachability` on CloudBackend/Provider (Hetzner-only): re-syncs the
   firewall to the current egress, but only when it changed (else changed:false).
   A `withFirewallRepair` CLI helper retries the attempt once iff something
   changed, wired at the two establish sites — `recover` (provider.reconnect) and
   the INITIAL attach connect (`_cloud-attach` buildAttach). Deliberately NOT the
   mid-session reconnect closure: a checkpoint stops the box and drops the PTY,
   which must not be mistaken for an IP change. `--no-firewall-sync` opts out on
   recover (shared/untrusted egress).

A short-TTL egress cache avoids probe storms across retries / `recover --all`.
`0.0.0.0/0` (explicit dynamic-IP opt-in) is never hinted or synced.
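
The two pieces can be sketched together. Signatures here are assumptions: the PR names `firewallNeedsSync` and `withFirewallRepair` but does not show them, the real helpers are presumably asynchronous, and matching the egress as a `/32` rule is a simplification.

```typescript
// True only when the firewall is pinned to some address other than the current egress.
function firewallNeedsSync(allowedCidrs: string[], egressIp: string): boolean {
  // Explicit dynamic-IP opt-in: never hint, never sync.
  if (allowedCidrs.includes("0.0.0.0/0")) return false;
  return !allowedCidrs.includes(`${egressIp}/32`);
}

// Retries the establish attempt exactly once, and only if the repair changed something.
function withFirewallRepair<T>(attempt: () => T, repair: () => { changed: boolean }): T {
  try {
    return attempt();
  } catch (err) {
    // Unchanged egress means the failure has another cause (e.g. the box is stopped).
    if (!repair().changed) throw err;
    return attempt();
  }
}
```

Because the repair runs only inside the `catch`, the happy path never pays for egress detection.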

Verified live on a Hetzner box: locking the firewall to a bogus IP makes `shell`
fail with the hint (no auto-repair), `recover` auto-syncs back + reconnects, and
`--no-firewall-sync` leaves it locked. Unit tests cover firewallNeedsSync + the
egress TTL cache.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
…establishes

1. Stale egress cache could mask a real IP change: cut the cache TTL from 60s to
   5s. It only exists to dedup a burst of failure-path probes (poller backoff,
   `recover --all`), not to remember the IP over time — a long TTL would hide the
   very IP change we're detecting.

2. The firewall self-heal wrapped only the final buildAttach, but the resume
   probe and the detached pre-start both connect before it, so a firewall block
   there aborted the attach (or silently dropped the resumed session) before repair ran. Move
   the repair to a single up-front warm-up (`exec true`, Hetzner-only) that opens
   the tunnel + self-heals BEFORE any later establish touch, which then reuse the
   live master. Verified live: a locked firewall is now auto-synced on
   `claude attach` before it connects.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
…re path

Bugbot (round 2): even a 5s-TTL cache could read a just-changed egress IP as
"unchanged" in the firewall comparison and skip the heal — the exact mismatch
this exists to catch. The cache only dedup'd failure-path probes, but the cloud
poller already de-dupes its recover calls and `recover --all` is sequential, so
a fresh `detectEgressIp` in `firewallEgressStatus` won't storm. Remove the cache
entirely; correctness over a marginal probe dedup.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
feat(hetzner): self-heal the per-box firewall on a host egress-IP change
… --host-only)

Add a `--host-only` flag to `agentbox git push <box>` and in-box
`agentbox-ctl git push` that makes the box's branch available in the host's
*local* repo without pushing to any remote — nothing is published online.

The destination branch defaults to the box's current branch name; `--as
<branch>` overrides it and `--force` allows a non-fast-forward overwrite.
`--host-only` is incompatible with `--remote` (exit 64).
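
A minimal sketch of the flag rules just described, with hypothetical names (`planHostOnlyPush`, `PushFlags`) and the current branch supplied by the caller:

```typescript
interface PushFlags { hostOnly: boolean; remote?: string; as?: string; force: boolean }

type HostOnlyPlan =
  | { ok: true; destBranch: string; force: boolean }
  | { ok: false; exitCode: number; message: string };

function planHostOnlyPush(flags: PushFlags, currentBranch: string): HostOnlyPlan {
  if (flags.hostOnly && flags.remote !== undefined) {
    // Usage error: a host-only land has no remote to push to.
    return { ok: false, exitCode: 64, message: "--host-only is incompatible with --remote" };
  }
  // Destination defaults to the box's current branch; --as overrides it.
  return { ok: true, destBranch: flags.as ?? currentBranch, force: flags.force };
}
```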

Because nothing leaves the host, the relay skips the push-confirm gate /
host-initiated-token path entirely. Docker copies the box branch ref within the
shared bind-mounted `.git/` via a self-fetch (`handleGitSaveToHost`); cloud
reuses the push flow's git-bundle pull-back, stopping before the remote push
(`runGitRpc` short-circuit), so all four cloud providers are covered.

The in-box agent system prompt (custom-system-CLAUDE.md, all providers) and the
docs (host-relay, features, web cli/sync-and-git) document the new mode.

Claude-Session: https://claude.ai/code/session_01TmyXca2hNF9TtK6q9MAh1L
Making `--force` a known option on `push` (for `--host-only`) meant Commander
consumed it for every push, but only the host-only path forwarded it — a normal
`agentbox git push <box> --force` silently dropped `--force` (`params.force` is
only honored on the host-only land path). Re-append `--force` to the forwarded
git args on a remote push, in both the host CLI and in-box ctl, so the relay
appends it to `git push <remote> <branch>`. The host CLI's predicted params
hash stays in lockstep with ctl's normalized args tail.

Add a pure unit test for ctl's buildParams covering remote --force re-forwarding
vs host-only params.
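
The re-forwarding rule can be illustrated as below. This is not ctl's actual `buildParams`; the function name, option names, and returned shape are assumptions chosen to show the two branches.

```typescript
interface PushOptions { hostOnly: boolean; force: boolean }

type PushParams =
  | { mode: "host-only"; force: boolean }
  | { mode: "remote"; gitArgs: string[] };

function buildPushParams(opts: PushOptions, forwardedArgs: string[]): PushParams {
  if (opts.hostOnly) {
    // Host-only land: force travels as a param, nothing is forwarded to `git push`.
    return { mode: "host-only", force: opts.force };
  }
  // Remote push: the option parser consumed --force, so put it back on the args tail.
  const gitArgs = [...forwardedArgs];
  if (opts.force && !gitArgs.includes("--force")) gitArgs.push("--force");
  return { mode: "remote", gitArgs };
}
```

Keeping the normalization in one function is what lets the host CLI's predicted params hash stay in step with what ctl sends.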

Verified live (docker): non-ff push rejected without --force (exit 1, remote
unchanged), forced update succeeds with --force; host-only land still works.

Claude-Session: https://claude.ai/code/session_01TmyXca2hNF9TtK6q9MAh1L
`agentbox cp` and the in-box `agentbox-ctl cp toHost|fromHost` took a single
source per call; copying several files meant several invocations (and, from
inside a box, several host approval prompts). Both now accept multiple sources
in one call — list files/dirs before the destination, which must then be a
directory. Wildcards are handled by the invoking shell (host shell for host-side
sources, box shell for in-box sources), so no glob expansion lives in the code
and there's no box->host shell-injection surface.
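
The "sources first, destination last" convention can be sketched as below. `CpPlan` and `splitCpArgs` are hypothetical names; the sketch covers only the form with an explicit destination and leaves any shorter invocation to the real parser.

```typescript
// Hypothetical result of splitting a variadic path list.
interface CpPlan {
  sources: string[];
  dest: string;
  destMustBeDir: boolean; // true once there is more than one source
}

function splitCpArgs(paths: string[]): CpPlan {
  if (paths.length < 2) {
    throw new Error("cp: expected at least one source and a destination");
  }
  const dest = paths[paths.length - 1];
  if (dest === undefined) throw new Error("cp: missing destination");
  const sources = paths.slice(0, -1);
  // The shell has already expanded any wildcards, so every entry is a plain
  // path; this code never interprets glob characters itself.
  return { sources, dest, destMustBeDir: sources.length > 1 };
}
```

Leaving expansion to the invoking shell means a path coming from the box is only ever treated as a literal string on the host.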

- Host CLI: `cp <paths...>` (variadic, arity-split). One arg keeps the
  download-to-cwd back-compat; a single source keeps full docker-cp rename
  semantics. >=2 sources require a directory dest. All sources must be on the
  side opposite the dest, and box sources must name one box. Size guard runs
  per source.
- Providers: `uploadPath`/`downloadPath` take `string[]`. Docker groups sources
  by parent dir (one tar per group), hoists mkdir/parent-chain-chown, chowns
  each landed entry. Cloud loops the single-source primitive serially.
- Wire: `cp.*` RPC carries `{sources[], dest}` with backward-tolerant
  normalization of the legacy `{boxPath, hostPath}` shape; shared in cp-rpc.ts.
- Relay: the cloud `runCpRpc` now re-shells `agentbox cp` like the docker path
  instead of calling the cloud primitives directly — so excludes and the size
  guard are honored on every provider (the cloud path silently dropped them
  before, while the consent prompt advertised them). Consent prompt lists all
  sources.
- Docs/skills/system prompts updated; new unit tests for parseArgs, the cp-rpc
  wire helpers, and cloud multi-source orchestration.
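
Two of the bullets above lend themselves to a short sketch: normalizing the legacy wire shape and grouping sources by parent directory. Everything here is illustrative. The helper names are hypothetical, and the mapping of `boxPath`/`hostPath` to source and destination per direction is an assumption, since the contents of cp-rpc.ts are not shown.

```typescript
type CpDirection = "toHost" | "fromHost";

interface CpWire {
  sources: string[];
  dest: string;
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

// Accepts the new {sources[], dest} shape or the legacy {boxPath, hostPath}.
function normalizeCpWire(direction: CpDirection, raw: Record<string, unknown>): CpWire {
  const sources = raw["sources"];
  const dest = raw["dest"];
  if (isStringArray(sources) && typeof dest === "string") {
    if (sources.length === 0) throw new Error("cp: empty sources");
    return { sources: [...sources], dest };
  }
  const boxPath = raw["boxPath"];
  const hostPath = raw["hostPath"];
  if (typeof boxPath === "string" && typeof hostPath === "string") {
    // Assumption: toHost copies box -> host, fromHost copies host -> box.
    return direction === "toHost"
      ? { sources: [boxPath], dest: hostPath }
      : { sources: [hostPath], dest: boxPath };
  }
  throw new Error("cp: unrecognized params shape");
}

// Parent directory of a POSIX-style path, without relying on node:path.
function parentDir(p: string): string {
  const trimmed = p.length > 1 ? p.replace(/\/+$/, "") : p;
  const i = trimmed.lastIndexOf("/");
  if (i < 0) return ".";
  if (i === 0) return "/";
  return trimmed.slice(0, i);
}

// Groups sources so each parent directory could be archived with one tar call.
function groupByParentDir(sources: string[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const s of sources) {
    const dir = parentDir(s);
    const list = groups[dir] ?? [];
    list.push(s);
    groups[dir] = list;
  }
  return groups;
}
```

Normalizing at the wire boundary lets the rest of the relay handle a single shape regardless of which client version sent the request.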

Verified live (docker): multi-source + wildcard upload, multi-source download,
dest-not-a-dir / box-to-box errors, single-source rename + download-to-cwd
back-compat, and the in-box ctl variadic parsing.

Claude-Session: https://claude.ai/code/session_01XvuW3YwvHzvCrmXMJyC33W
feat(cp): copy multiple files/dirs in one call

vercel Bot commented Jun 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agentbox-web | Skipped | Skipped | Jun 30, 2026 11:33am |

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 1d72ceb. Configure here.

const resume = await agentResumeArgs(provider, box, args.binary);
if (resume) extraArgs = resume;
}
const command = buildCloudAttachInnerCommand(args.binary, extraArgs);

Hetzner firewall skipped detached start

Medium Severity

cloudAgentAttach warms the Hetzner SSH tunnel with withFirewallRepair before exec/buildAttach, but cloudAgentStartDetached—now used for cloud --no-attach, background creates, and queue workers—does not. After a host egress IP change, detached agent startup can fail while an interactive attach on the same box succeeds.
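
The one-shot repair pattern this finding refers to can be sketched generically. This is not the project's `withFirewallRepair`; `withOneShotRepair` and its parameters are hypothetical, showing only the "try, repair once on a connect failure, retry" shape that a detached-start path could reuse.

```typescript
// Hypothetical generic wrapper: run `op`; if it fails with what looks like a
// connect failure, run `repair` exactly once and retry `op` a single time.
async function withOneShotRepair<T>(
  op: () => Promise<T>,
  repair: () => Promise<void>,
  isConnectFailure: (err: unknown) => boolean,
): Promise<T> {
  try {
    return await op();
  } catch (err) {
    // Unrelated errors propagate untouched; only connect failures trigger repair.
    if (!isConnectFailure(err)) throw err;
    await repair();
    // A second failure is surfaced to the caller; there is no retry loop.
    return op();
  }
}
```

Wrapping the connection step of both the interactive and detached paths in the same helper would keep their behavior after an egress-IP change consistent.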

Comment thread apps/cli/src/cloud-ssh.ts
}
const alias = agentboxAliasFor(box.name);
return { alias, host: target.host, user: target.user, identityFile: target.identityFile };
}

SSH config lacks firewall heal

Medium Severity

New resolveCloudSshTarget (used by agentbox shell --ssh-config and shared alias helpers) brings the box online and calls buildAttach without the Hetzner withFirewallRepair pass added to cloudAgentAttach. A stale firewall after an egress IP change can make --ssh-config fail even though attach self-heals.

@madarco madarco merged commit f444457 into main Jun 30, 2026
2 of 3 checks passed