fix(ctl): make the in-box ctl-daemon launch idempotent (stop the pile-up) #128

Merged: madarco merged 1 commit into nightly from fix/ctl-daemon-idempotent on Jun 29, 2026

madarco (Owner) commented Jun 29, 2026

Problem

A live cloud box accumulated many agentbox-ctl daemon processes (one Hetzner box had 9). Only the oldest was actually serving the box-relay port :8788; the rest were orphaned, idle node processes.

launchCloudCtlDaemon spawns the daemon unconditionally on every create / start / unpause / resume / recover (via reEnsureCloudBox). The daemon has no singleton guard — when :8788 is already bound it catches EADDRINUSE and keeps running — so each relaunch leaks another idle daemon. agentbox recover, which targets already-running boxes, hit this every single time. (It is not agentbox attach — attach never launches a daemon.)

Fix

  • Cloud (ctl-launch.ts): guard the spawn with a liveness probe. If a healthy daemon already serves the box relay's /healthz (port parsed from relayUrl, default 8788), the launch is skipped. The running daemon already has the right env, which is stable across a host-only reconnect; only a real sandbox restart kills it, in which case the probe fails and the daemon is relaunched. The probe runs via node (guaranteed present; curl may not be).
  • Docker parity (ctl.ts): same idea with pgrep -f 'agentbox-ctl daemon$' (end-anchored so it matches the daemon, not the sh -c wrapper). Docker doesn't pile up today because the daemon dies with the container, but the new docker reconnect runs startBox on a possibly-running container and would otherwise double-launch.
  • Corrected the stale "Idempotent — the launch* helpers no-op" comment in reEnsureCloudBox to match reality.
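
The port selection and the skip/launch decision described in the Cloud bullet can be sketched roughly as follows. This is a minimal illustration, not the code in ctl-launch.ts: the names `relayProbePort` and `decideLaunch` are hypothetical, and only the default port (8788), the "port parsed from relayUrl" rule, and the "healthy means skip" rule come from the description above.

```typescript
// Hypothetical helpers; names are assumptions made for illustration.
const DEFAULT_RELAY_PORT = 8788;

// Pick the port to probe: the explicit port in relayUrl, else the default.
function relayProbePort(relayUrl?: string): number {
  if (!relayUrl) return DEFAULT_RELAY_PORT;
  try {
    // URL.port is "" when the URL carries no explicit port.
    const parsed = Number.parseInt(new URL(relayUrl).port, 10);
    return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_RELAY_PORT;
  } catch {
    // Unparseable URL: fall back to the default rather than fail the launch.
    return DEFAULT_RELAY_PORT;
  }
}

type LaunchDecision = "skip" | "launch";

// A 200 from /healthz means a healthy daemon is already serving: skip.
// Anything else (no response, non-200) means relaunch.
function decideLaunch(healthzStatus: number | null): LaunchDecision {
  return healthzStatus === 200 ? "skip" : "launch";
}
```

Falling back to the default on any parse problem keeps the guard from blocking a needed launch; the worst case is a probe against the wrong port, which fails and leads to a relaunch.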

Verification

  • Unit: assert the cloud script contains the /healthz skip-guard before the spawn, and the probe port is parsed from relayUrl.
  • Live (Hetzner box): ran recover --no-attach twice — daemon count stayed flat, :8788 still healthy on the original pid.
  • pnpm typecheck, lint, and test are all green.
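
The unit check in the first bullet amounts to an ordering assertion on the generated script text. A rough sketch of that idea, using a hypothetical `guardPrecedesSpawn` helper and made-up sample scripts (the real test and the real script contents are not shown in this PR):

```typescript
// Hypothetical check: the /healthz guard must appear, and appear before the spawn.
function guardPrecedesSpawn(script: string): boolean {
  const guard = script.indexOf("/healthz");
  const spawn = script.indexOf("agentbox-ctl daemon");
  return guard !== -1 && spawn !== -1 && guard < spawn;
}

// Illustrative inputs only; these are not the scripts ctl-launch.ts emits.
const guardedSample = [
  'if node -e "/* probe http://127.0.0.1:8788/healthz */"; then exit 0; fi',
  "nohup agentbox-ctl daemon &",
].join("\n");
const unguardedSample = "nohup agentbox-ctl daemon &";

console.log(guardPrecedesSpawn(guardedSample), guardPrecedesSpawn(unguardedSample));
```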

Note: this prevents new pile-up; existing orphans in a long-lived box clear on its next real restart. Out of scope: recover --adopt against a box with a foreign daemon (different token) needs force-replace — tracked as a follow-up.

https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup


Note

Medium Risk
Changes in-box daemon startup on every start/resume/recover/reconnect path; mis-probes could skip needed relaunches, though fresh sandbox boots still relaunch when health checks fail.

Overview
Stops duplicate agentbox-ctl daemon processes when reEnsureCloudBox runs on an already-live box (notably recover / reconnect), where each unconditional spawn could leave orphaned Node processes after EADDRINUSE on the box-relay port.

Cloud (ctl-launch.ts): before nohup agentbox-ctl daemon, the launch script probes http://127.0.0.1:<port>/healthz via node (port from relayUrl, default 8788). A 200 skips the spawn; only a dead/missing daemon proceeds to mkdir, box.env, and launch.
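
As an illustration of the shape such a launch script might take, the sketch below builds a shell snippet that probes /healthz with `node -e` and exits early on a 200. The function name `buildGuardedLaunchScript`, the 2-second timeout, and the output redirection are assumptions made for this example; the actual script, including its mkdir and box.env steps, is not reproduced here.

```typescript
// Hypothetical builder; not the implementation in ctl-launch.ts.
function buildGuardedLaunchScript(port: number): string {
  // Inline Node probe: exit 0 on HTTP 200, exit 1 on error, non-200, or timeout.
  // It uses only single quotes so it can sit inside a double-quoted shell argument.
  const probeJs =
    `setTimeout(()=>process.exit(1),2000);` + // assumed timeout, not from the PR
    `require('http').get('http://127.0.0.1:${port}/healthz',` +
    `r=>process.exit(r.statusCode===200?0:1))` +
    `.on('error',()=>process.exit(1))`;
  return [
    `if node -e "${probeJs}"; then`,
    `  exit 0`, // healthy daemon already serving: skip the launch
    `fi`,
    `# setup steps (mkdir, box.env) would run here; details are not shown in the PR`,
    `nohup agentbox-ctl daemon >/dev/null 2>&1 &`, // redirection is an assumption
  ].join("\n");
}

console.log(buildGuardedLaunchScript(8788));
```

Putting the probe first means a healthy box pays only the cost of one local HTTP request, while a dead or missing daemon falls through to the normal launch path.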

Docker (ctl.ts): same idea with pgrep -f 'agentbox-ctl daemon$' so reconnect / start on a running container does not double-launch.
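
To see why the end anchor matters: `pgrep -f` matches its pattern against each process's full command line, so an unanchored pattern would also match any shell whose arguments merely contain the daemon command. The sketch below mimics that matching with a JavaScript regular expression. The three command lines are invented examples, not captured from a real box.

```typescript
// Same pattern as the pgrep call; `$` anchors the match to the end of the command line.
const DAEMON_PATTERN = /agentbox-ctl daemon$/;

// Hypothetical command lines, for illustration only.
const commandLines = {
  daemon: "node /usr/local/bin/agentbox-ctl daemon",
  wrapper: "sh -c nohup agentbox-ctl daemon >> /tmp/ctl.log 2>&1 &",
  probe: "sh -c pgrep -f 'agentbox-ctl daemon$' >/dev/null",
};

function looksLikeDaemon(cmdline: string): boolean {
  return DAEMON_PATTERN.test(cmdline);
}

for (const [name, cmdline] of Object.entries(commandLines)) {
  console.log(name, looksLikeDaemon(cmdline));
}
```

Only the first command line ends with `agentbox-ctl daemon`, so it is the only match; the wrapper and the probing shell both have trailing text after the daemon command.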

Comments in reEnsureCloudBox now state that wake paths include reconnect() and that ctl/dockerd/VNC launches are idempotent when already healthy.

Reviewed by Cursor Bugbot for commit 56fe054.

…-up)

`launchCloudCtlDaemon` spawned `agentbox-ctl daemon` unconditionally on every
create/start/unpause/resume/recover (via `reEnsureCloudBox`). The daemon has no
singleton guard — when :8788 is already bound it catches EADDRINUSE and keeps
running — so each relaunch leaked another idle daemon (a live Hetzner box had 9,
only one serving the relay). `recover`, which targets already-running boxes, hit
this every time.

Guard the spawn with a liveness probe: a healthy ctl daemon already serving the
box relay's /healthz (port parsed from relayUrl, default 8788) means the launch
is skipped — it already has the right env (stable across a host-only reconnect),
and only a real sandbox restart kills it (probe fails -> relaunch). Probe via
node (guaranteed present; curl may not be).

Same guard for docker's `launchCtlDaemon` (`pgrep -f 'agentbox-ctl daemon$'`,
end-anchored so it can't match the `sh -c` wrapper): docker doesn't pile up
today because the daemon dies with the container, but the new docker `reconnect`
runs `startBox` against a possibly-running container and would otherwise
double-launch.

Verified live: `recover` twice on the Hetzner box left the daemon count flat and
:8788 still healthy on the original pid.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
vercel Bot commented Jun 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agentbox-web | Skipped | Skipped | Jun 29, 2026 9:29pm |

@madarco madarco merged commit 59f6848 into nightly Jun 29, 2026
4 checks passed