fix(ctl): make the in-box ctl-daemon launch idempotent (stop the pile-up) by madarco · Pull Request #128 · madarco/agentbox

madarco · 2026-06-29T21:28:57Z

Problem

A live cloud box accumulated many agentbox-ctl daemon processes (a Hetzner box had 9). Only the oldest was actually serving the box-relay port :8788; the rest were orphaned idle node processes.

launchCloudCtlDaemon spawns the daemon unconditionally on every create / start / unpause / resume / recover (via reEnsureCloudBox). The daemon has no singleton guard — when :8788 is already bound it catches EADDRINUSE and keeps running — so each relaunch leaks another idle daemon. agentbox recover, which targets already-running boxes, hit this every single time. (It is not agentbox attach — attach never launches a daemon.)

Fix

Cloud (ctl-launch.ts): guard the spawn with a liveness probe. If a healthy daemon already serves the box relay's /healthz (port parsed from relayUrl, default 8788), skip the launch. The running daemon already has the right env (stable across a host-only reconnect); only a real sandbox restart kills it, in which case the probe fails and the daemon is relaunched. Probe via node (guaranteed present; curl may not be).
Docker parity (ctl.ts): same idea with pgrep -f 'agentbox-ctl daemon$' (end-anchored so it matches the daemon, not the sh -c wrapper). Docker doesn't pile up today because the daemon dies with the container, but the new docker reconnect runs startBox on a possibly-running container and would otherwise double-launch.
Corrected the stale "Idempotent — the launch* helpers no-op" comment in reEnsureCloudBox to match reality.
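
The cloud probe described above could look roughly like the following when written out as plain TypeScript. This is a minimal sketch, not the code in ctl-launch.ts: `relayPort`, `isDaemonHealthy`, and the 2-second timeout are hypothetical names and values chosen for illustration.

```typescript
import * as http from "node:http";

const DEFAULT_RELAY_PORT = 8788;

// Hypothetical helper: parse the probe port from relayUrl, defaulting to 8788.
function relayPort(relayUrl?: string): number {
  if (!relayUrl) return DEFAULT_RELAY_PORT;
  try {
    const port = Number(new URL(relayUrl).port);
    // URL.port is "" when the URL omits the port or uses the scheme default.
    return Number.isInteger(port) && port > 0 ? port : DEFAULT_RELAY_PORT;
  } catch {
    return DEFAULT_RELAY_PORT; // unparsable relayUrl
  }
}

// Hypothetical helper: resolves true only when /healthz answers 200 in time.
function isDaemonHealthy(port: number, timeoutMs = 2000): Promise<boolean> {
  return new Promise((resolve) => {
    const req = http.get(
      { host: "127.0.0.1", port, path: "/healthz", timeout: timeoutMs },
      (res) => {
        res.resume(); // drain the body so the socket can close
        resolve(res.statusCode === 200);
      },
    );
    req.on("timeout", () => {
      req.destroy();
      resolve(false);
    });
    req.on("error", () => resolve(false)); // e.g. ECONNREFUSED: no daemon
  });
}
```

A caller would spawn the daemon only when `await isDaemonHealthy(relayPort(relayUrl))` resolves to `false`.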

Verification

Unit: assert the cloud script contains the /healthz skip-guard before the spawn, and the probe port is parsed from relayUrl.
Live (Hetzner box): ran recover --no-attach twice — daemon count stayed flat, :8788 still healthy on the original pid.
pnpm typecheck && lint && test green.
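
The unit check above amounts to two string assertions on the generated script. A minimal sketch, assuming the script contains the literal markers `/healthz` and `nohup agentbox-ctl daemon`; the helper names and the sample script are illustrative, not taken from the repository's tests.

```typescript
// Hypothetical check: the /healthz guard must appear before the spawn line.
function guardPrecedesSpawn(script: string): boolean {
  const guard = script.indexOf("/healthz");
  const spawn = script.indexOf("nohup agentbox-ctl daemon");
  return guard !== -1 && spawn !== -1 && guard < spawn;
}

// Hypothetical check: extract the port the probe targets, if any.
function probedPort(script: string): number | null {
  const m = script.match(/127\.0\.0\.1:(\d+)\/healthz/);
  return m ? Number(m[1]) : null;
}

// Illustrative script text only ("probe" is a placeholder command),
// not the output of ctl-launch.ts.
const sampleScript = [
  "if probe http://127.0.0.1:9100/healthz; then exit 0; fi",
  "nohup agentbox-ctl daemon &",
].join("\n");
```

With a relayUrl carrying port 9100, the test would expect `guardPrecedesSpawn` to hold and `probedPort` to return that port.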

Note: this prevents new pile-up; existing orphans in a long-lived box clear on its next real restart. Out of scope: recover --adopt against a box with a foreign daemon (different token) needs force-replace — tracked as a follow-up.

https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup

Note

Medium Risk
Changes in-box daemon startup on every start/resume/recover/reconnect path; mis-probes could skip needed relaunches, though fresh sandbox boots still relaunch when health checks fail.

Overview
Stops duplicate agentbox-ctl daemon processes when reEnsureCloudBox runs on an already-live box (notably recover / reconnect), where each unconditional spawn could leave orphaned Node processes after EADDRINUSE on the box-relay port.

Cloud (ctl-launch.ts): before nohup agentbox-ctl daemon, the launch script probes http://127.0.0.1:<port>/healthz via node (port from relayUrl, default 8788). A 200 skips the spawn; only a dead/missing daemon proceeds to mkdir, box.env, and launch.
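
As a rough illustration of that ordering, a launch-script builder might emit the probe first and the spawn last. This is a sketch under assumptions: `buildCtlLaunchScript` is a hypothetical name, the output redirection is invented, and the mkdir and box.env steps are omitted.

```typescript
// Hypothetical builder; the real script in ctl-launch.ts also runs mkdir and
// writes box.env before the launch, which this sketch leaves out.
function buildCtlLaunchScript(port: number): string {
  // One-line node probe: exit 0 on HTTP 200, exit 1 on any other status or error.
  const probe =
    `require("http").get("http://127.0.0.1:${port}/healthz",` +
    `r=>process.exit(r.statusCode===200?0:1))` +
    `.on("error",()=>process.exit(1))`;
  return [
    `if node -e '${probe}'; then`,
    `  exit 0 # healthy daemon already bound to the port: skip the spawn`,
    `fi`,
    // Redirection target is an assumption for the sketch.
    `nohup agentbox-ctl daemon >/dev/null 2>&1 &`,
  ].join("\n");
}
```

Because the probe text contains only double quotes, it can sit inside the single-quoted `node -e` argument without extra escaping.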

Docker (ctl.ts): same idea with pgrep -f 'agentbox-ctl daemon$' so reconnect / start on a running container does not double-launch.
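
The effect of the end anchor can be checked in isolation. The sketch below applies the same pattern as a JavaScript RegExp; `pgrep -f` matches an extended regular expression against the full command line, and for a pattern this simple the two behave alike. Both command lines are illustrative assumptions, not captured from a real container.

```typescript
// Same pattern as the pgrep argument, end-anchored.
const DAEMON_PATTERN = /agentbox-ctl daemon$/;

// Illustrative command lines (assumptions for the sketch).
const daemonCmd = "node /usr/local/bin/agentbox-ctl daemon";
const wrapperCmd = "sh -c nohup agentbox-ctl daemon >/tmp/ctl.log 2>&1 &";

// True only when the command line ends with "agentbox-ctl daemon".
function isDaemonProcess(cmdline: string): boolean {
  return DAEMON_PATTERN.test(cmdline);
}
```

Without the `$`, the wrapper's command line would also match, and the guard could skip the launch while only the wrapper was alive.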

Comments in reEnsureCloudBox now state that wake paths include reconnect() and that ctl/dockerd/VNC launches are idempotent when already healthy.

Reviewed by Cursor Bugbot for commit 56fe054.

…-up)

`launchCloudCtlDaemon` spawned `agentbox-ctl daemon` unconditionally on every create/start/unpause/resume/recover (via `reEnsureCloudBox`). The daemon has no singleton guard (when :8788 is already bound it catches EADDRINUSE and keeps running), so each relaunch leaked another idle daemon (a live Hetzner box had 9, only one serving the relay). `recover`, which targets already-running boxes, hit this every time.

Guard the spawn with a liveness probe: a healthy ctl daemon already serving the box relay's /healthz (port parsed from relayUrl, default 8788) means the launch is skipped. It already has the right env (stable across a host-only reconnect), and only a real sandbox restart kills it (probe fails -> relaunch). Probe via node (guaranteed present; curl may not be).

Same guard for docker's `launchCtlDaemon` (`pgrep -f 'agentbox-ctl daemon$'`, end-anchored so it can't match the `sh -c` wrapper): docker doesn't pile up today because the daemon dies with the container, but the new docker `reconnect` runs `startBox` against a possibly-running container and would otherwise double-launch.

Verified live: `recover` twice on the Hetzner box left the daemon count flat and :8788 still healthy on the original pid.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup

vercel · 2026-06-29T21:29:03Z

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment

Project	Deployment	Actions	Updated (UTC)
agentbox-web	Skipped		Jun 29, 2026 9:29pm

madarco merged commit 59f6848 into nightly Jun 29, 2026
4 checks passed

madarco merged 1 commit into nightly from fix/ctl-daemon-idempotent
