fix(ctl): make the in-box ctl-daemon launch idempotent (stop the pile-up) #128
Merged
…-up)

`launchCloudCtlDaemon` spawned `agentbox-ctl daemon` unconditionally on every create/start/unpause/resume/recover (via `reEnsureCloudBox`). The daemon has no singleton guard: when :8788 is already bound it catches EADDRINUSE and keeps running, so each relaunch leaked another idle daemon (a live Hetzner box had 9, only one serving the relay). `recover`, which targets already-running boxes, hit this every time.

Guard the spawn with a liveness probe: a healthy ctl daemon already serving the box relay's /healthz (port parsed from relayUrl, default 8788) means the launch is skipped. It already has the right env (stable across a host-only reconnect), and only a real sandbox restart kills it (probe fails -> relaunch). Probe via node (guaranteed present; curl may not be).

Same guard for docker's `launchCtlDaemon` (`pgrep -f 'agentbox-ctl daemon$'`, end-anchored so it can't match the `sh -c` wrapper): docker doesn't pile up today because the daemon dies with the container, but the new docker `reconnect` runs `startBox` against a possibly-running container and would otherwise double-launch.

Verified live: `recover` twice on the Hetzner box left the daemon count flat and :8788 still healthy on the original pid.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Problem
A live cloud box accumulated many `agentbox-ctl daemon` processes (a Hetzner box had 9), with only the oldest actually serving the box-relay port `:8788`; the rest were orphaned idle node processes.

`launchCloudCtlDaemon` spawns the daemon unconditionally on every create / start / unpause / resume / recover (via `reEnsureCloudBox`). The daemon has no singleton guard: when `:8788` is already bound it catches `EADDRINUSE` and keeps running, so each relaunch leaks another idle daemon. `agentbox recover`, which targets already-running boxes, hit this every single time. (It is not `agentbox attach`; attach never launches a daemon.)

Fix

- Cloud (`ctl-launch.ts`): guard the spawn with a liveness probe. If a healthy daemon already serves the box relay's `/healthz` (port parsed from `relayUrl`, default 8788), skip the launch. It already has the right env (stable across a host-only reconnect); only a real sandbox restart kills it (probe fails → relaunch). Probe via `node` (guaranteed present; `curl` may not be).
- Docker (`ctl.ts`): same idea with `pgrep -f 'agentbox-ctl daemon$'` (end-anchored so it matches the daemon, not the `sh -c` wrapper). Docker doesn't pile up today because the daemon dies with the container, but the new docker `reconnect` runs `startBox` on a possibly-running container and would otherwise double-launch.
- Comments in `reEnsureCloudBox` updated to match reality.

Verification

- `/healthz` skip-guard before the spawn, and the probe port is parsed from `relayUrl`.
- `recover --no-attach` twice: daemon count stayed flat, `:8788` still healthy on the original pid.
- `pnpm typecheck && lint && test` green.

Note: this prevents new pile-up; existing orphans in a long-lived box clear on its next real restart.

Out of scope: `recover --adopt` against a box with a foreign daemon (different token) needs force-replace; tracked as a follow-up.

https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
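For readers without the diff open, the cloud-side guard described above can be sketched roughly as follows. This is an illustrative sketch, not the code in `ctl-launch.ts`: the names `DEFAULT_RELAY_PORT`, `relayPort`, and `buildLaunchScript` are hypothetical, the 2-second probe timeout is an assumption, and the real launch script also handles the mkdir and `box.env` steps, which are omitted here.

```typescript
// Hypothetical names throughout; only the behaviour mirrors the description above.
const DEFAULT_RELAY_PORT = 8788;

// Parse the probe port from relayUrl, falling back to the default when the URL
// is missing, unparsable, or carries no explicit port.
function relayPort(relayUrl?: string): number {
  if (!relayUrl) return DEFAULT_RELAY_PORT;
  try {
    const port = Number(new URL(relayUrl).port);
    return Number.isInteger(port) && port > 0 ? port : DEFAULT_RELAY_PORT;
  } catch {
    return DEFAULT_RELAY_PORT;
  }
}

// Build a shell script that probes /healthz via node and only spawns the
// daemon when the probe fails (non-200, connection error, or timeout).
function buildLaunchScript(relayUrl?: string): string {
  const port = relayPort(relayUrl);
  const probe =
    `require('http').get('http://127.0.0.1:${port}/healthz',` +
    `r=>process.exit(r.statusCode===200?0:1))` +
    `.on('error',()=>process.exit(1))` +
    `.setTimeout(2000,()=>process.exit(1))`; // timeout value is an assumption
  return [
    `if node -e "${probe}"; then`,
    `  exit 0  # healthy daemon already serving /healthz: skip the spawn`,
    `fi`,
    `nohup agentbox-ctl daemon >/dev/null 2>&1 &`,
  ].join("\n");
}

console.log(buildLaunchScript("http://127.0.0.1:8788"));
```

Because the probe runs inside the box and exits non-zero on any failure, a fresh sandbox boot with no listener on the port falls through to the spawn, while a repeated `recover` on a live box exits before it.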
Note
Medium Risk
Changes in-box daemon startup on every start/resume/recover/reconnect path; mis-probes could skip needed relaunches, though fresh sandbox boots still relaunch when health checks fail.
Overview
Stops duplicate `agentbox-ctl daemon` processes when `reEnsureCloudBox` runs on an already-live box (notably `recover` / `reconnect`), where each unconditional spawn could leave orphaned Node processes after `EADDRINUSE` on the box-relay port.

Cloud (`ctl-launch.ts`): before `nohup agentbox-ctl daemon`, the launch script probes `http://127.0.0.1:<port>/healthz` via `node` (port from `relayUrl`, default 8788). A 200 skips the spawn; only a dead/missing daemon proceeds to mkdir, `box.env`, and launch.

Docker (`ctl.ts`): same idea with `pgrep -f 'agentbox-ctl daemon$'` so `reconnect` / start on a running container does not double-launch.

Comments in `reEnsureCloudBox` now state that wake paths include `reconnect()` and that ctl/dockerd/VNC launches are idempotent when already healthy.

Reviewed by Cursor Bugbot for commit 56fe054.
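To illustrate why the Docker-side `pgrep` pattern is end-anchored, here is a small sketch that applies the same pattern to two command lines. It is an assumption-laden illustration: JavaScript's `RegExp` stands in for `pgrep`'s extended regex (they should agree for a pattern this simple), the two command-line strings are hypothetical examples of what `pgrep -f` might see, and `isDaemonCmdline` is an invented helper name.

```typescript
// Same pattern as the pgrep argument; `$` anchors the match to the end of the command line.
const DAEMON_PATTERN = /agentbox-ctl daemon$/;

// Hypothetical helper: true when a full command line looks like the daemon itself.
function isDaemonCmdline(cmdline: string): boolean {
  return DAEMON_PATTERN.test(cmdline);
}

// Hypothetical command lines; the exact shape inside a real container may differ.
const daemon = "node /usr/local/bin/agentbox-ctl daemon";
const wrapper = "sh -c nohup agentbox-ctl daemon >/dev/null 2>&1 &";

console.log(isDaemonCmdline(daemon));  // the daemon ends with "agentbox-ctl daemon"
console.log(isDaemonCmdline(wrapper)); // the wrapper has trailing redirections, so no match
```

Without the trailing `$`, the wrapper's command line would also contain the substring and the guard could mistake a just-started wrapper for a running daemon.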