feat(hetzner): self-heal the per-box firewall on a host egress-IP change#129
Merged
Conversation
A Hetzner box's firewall locks SSH to the host's egress IP at create time and is never re-synced. When the host IP changes (laptop moves networks), every comms op fails with an opaque `ssh ControlMaster failed … Operation timed out` and the user has to know to run `agentbox hetzner firewall sync`. Two fixes, both gated to the connection-failure path so the happy path never pays the egress-detect cost, and the firewall is re-synced ONLY when the IP actually changed: 1. Hint (read-only): wrap `tunnels.open` in `ensureTunnel` — the one choke point all of exec/scp/forward/poller/attach funnel through. On a real mismatch it turns the opaque timeout into "firewall allows X but your egress is now Y — run `firewall sync`/`recover`". Safe on a checkpoint drop (box merely stopped, IP unchanged → no hint). 2. Auto-sync, scoped to connection ESTABLISHMENT only. New optional `repairReachability` on CloudBackend/Provider (Hetzner-only): re-syncs the firewall to the current egress, but only when it changed (else changed:false). A `withFirewallRepair` CLI helper retries the attempt once iff something changed, wired at the two establish sites — `recover` (provider.reconnect) and the INITIAL attach connect (`_cloud-attach` buildAttach). Deliberately NOT the mid-session reconnect closure: a checkpoint stops the box and drops the PTY, which must not be mistaken for an IP change. `--no-firewall-sync` opts out on recover (shared/untrusted egress). A short-TTL egress cache avoids probe storms across retries / `recover --all`. `0.0.0.0/0` (explicit dynamic-IP opt-in) is never hinted or synced. Verified live on a Hetzner box: locking the firewall to a bogus IP makes `shell` fail with the hint (no auto-repair), `recover` auto-syncs back + reconnects, and `--no-firewall-sync` leaves it locked. Unit tests cover firewallNeedsSync + the egress TTL cache. Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
|
The latest updates on your projects. Learn more about Vercel for GitHub. 1 Skipped Deployment
|
…establishes 1. Stale egress cache could mask a real IP change: cut the cache TTL from 60s to 5s. It only exists to dedup a burst of failure-path probes (poller backoff, `recover --all`), not to remember the IP over time — a long TTL would hide the very IP change we're detecting. 2. The firewall self-heal wrapped only the final buildAttach, but the resume probe and the detached pre-start connect first — a firewall block there aborted the attach (or silently dropped the resumed session) before repair ran. Move the repair to a single up-front warm-up (`exec true`, Hetzner-only) that opens the tunnel + self-heals BEFORE any later establish touch, which then reuse the live master. Verified live: a locked firewall is now auto-synced on `claude attach` before it connects. Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Owner
Author
|
bugbot run |
…re path Bugbot (round 2): even a 5s-TTL cache could read a just-changed egress IP as "unchanged" in the firewall comparison and skip the heal — the exact mismatch this exists to catch. The cache only dedup'd failure-path probes, but the cloud poller already de-dupes its recover calls and `recover --all` is sequential, so a fresh `detectEgressIp` in `firewallEgressStatus` won't storm. Remove the cache entirely; correctness over a marginal probe dedup. Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Owner
Author
|
bugbot run |
There was a problem hiding this comment.
✅ Bugbot reviewed your changes and found no new issues!
Comment @cursor review or bugbot run to trigger another review on this PR
Reviewed by Cursor Bugbot for commit ec54498. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
A Hetzner box's per-box Cloud Firewall locks SSH to the host's egress IP at create time and is never re-synced. If the host IP changes (laptop moves networks), the firewall silently drops SSH from the new IP and every comms op —
exec,attach, port-forward, the cloud poller,recover— fails with an opaquessh ControlMaster failed … connect to host <ip> port 22: Operation timed out. The fix exists (agentbox hetzner firewall sync), but nothing hints at it andrecoverdoesn't auto-heal.Fixes
Both gated to the connection-failure path (the happy path never runs the egress-detect curls), and the firewall is re-synced only when the IP actually changed.
Hint (read-only) — wrap
tunnels.openinensureTunnel, the single choke point all of exec/scp/forward/poller/attach funnel through. On a real mismatch it replaces the opaque timeout with "firewall allows X but your egress is now Y — runfirewall sync/recover". Safe on a checkpoint drop (box merely stopped, IP unchanged → no hint).Auto-sync, scoped to connection ESTABLISHMENT only — new optional
repairReachabilityonCloudBackend/Provider(Hetzner-only) re-syncs the firewall to the current egress, returning{changed:false}when it didn't change. AwithFirewallRepairCLI helper retries the attempt once iff something changed, wired at exactly two establish sites:recover→provider.reconnect_cloud-attachbuildAttach…and deliberately not the mid-session reconnect closure — a
checkpoint --set-defaultstops the box and drops the PTY, which must never be mistaken for an IP change.--no-firewall-syncopts out onrecover(shared/untrusted egress).A short-TTL egress cache avoids probe storms across retries /
recover --all. A0.0.0.0/0firewall (explicit dynamic-IP opt-in) is never hinted or synced.Verification
1.2.3.4/32→agentbox shell … -- echofails with the hint and does NOT auto-repair;agentbox recover …auto-syncs back to the real egress + reconnects ("firewall updated: SSH now allowed from … (was 1.2.3.4/32)");--no-firewall-syncleaves it locked (WARN).firewallNeedsSync(match→false, change→true,0.0.0.0/0→false, absent→true) and the egress TTL cache (probe once within window, re-probe after).pnpm typecheck && lint && testgreen.https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
Note
Medium Risk
Changes SSH reachability and firewall rules on connection failures; scoped to Hetzner establish paths with one retry and opt-out, but mis-sync could briefly widen or mis-target firewall rules if egress detection is wrong.
Overview
When a laptop’s public IP changes, Hetzner boxes keep SSH locked to the old egress CIDR, so attach/exec/recover fail with opaque SSH timeouts.
This PR adds connection-establishment-only self-heal: optional
repairReachabilityonCloudBackend/Provider(Hetzner implements it), plus CLI helperwithFirewallRepairthat retries once after a successful firewall sync.Auto-sync runs on
agentbox recover(provider.reconnect) and on initial Hetzner cloud attach (up-frontexec truebefore resume probe /buildAttach). Mid-session attach reconnect deliberately does not sync—the box stopping is not an IP change.recovergains--no-firewall-syncto opt out.On tunnel open failure, Hetzner
ensureTunnelcompares firewall SSH source vs fresh egress and appends a hint to runhetzner firewall syncorrecover. PurefirewallNeedsSync(skips0.0.0.0/0) is unit-tested.Reviewed by Cursor Bugbot for commit ec54498. Configure here.