feat(hetzner): self-heal the per-box firewall on a host egress-IP change#129

Merged: madarco merged 3 commits into nightly from fix/hetzner-firewall-egress-resync on Jun 29, 2026
@madarco madarco commented Jun 29, 2026

Owner

Problem

A Hetzner box's per-box Cloud Firewall locks SSH to the host's egress IP at create time and is never re-synced. If the host IP changes (laptop moves networks), the firewall silently drops SSH from the new IP and every comms op — exec, attach, port-forward, the cloud poller, recover — fails with an opaque ssh ControlMaster failed … connect to host <ip> port 22: Operation timed out. The fix exists (agentbox hetzner firewall sync), but nothing hints at it and recover doesn't auto-heal.

Fixes

Both are gated to the connection-failure path (the happy path never runs the egress-detect curls), and the firewall is re-synced only when the IP has actually changed.

  1. Hint (read-only) — wrap tunnels.open in ensureTunnel, the single choke point all of exec/scp/forward/poller/attach funnel through. On a real mismatch it replaces the opaque timeout with "firewall allows X but your egress is now Y — run firewall sync / recover". Safe on a checkpoint drop (box merely stopped, IP unchanged → no hint).

  2. Auto-sync, scoped to connection ESTABLISHMENT only — new optional repairReachability on CloudBackend/Provider (Hetzner-only) re-syncs the firewall to the current egress, returning {changed:false} when it didn't change. A withFirewallRepair CLI helper retries the attempt once iff something changed, wired at exactly two establish sites:

    • recover → provider.reconnect
    • the initial attach connect → _cloud-attach buildAttach

    …and deliberately not the mid-session reconnect closure — a checkpoint --set-default stops the box and drops the PTY, which must never be mistaken for an IP change. --no-firewall-sync opts out on recover (shared/untrusted egress).

A short-TTL egress cache avoids probe storms across retries / recover --all. A 0.0.0.0/0 firewall (explicit dynamic-IP opt-in) is never hinted or synced.
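
The retry rule in fix 2 (retry once, and only if the repair reports a change) can be sketched as below. This is a minimal illustration and not the PR's code: it is kept synchronous for brevity (the real helper is presumably async), and the `RepairResult` and `RepairOptions` shapes, the parameter order, and the `noFirewallSync` option name are assumptions. Only the names `withFirewallRepair`, `repairReachability`, and the `changed` field come from the description above.

```typescript
// Hypothetical shapes: the PR names `repairReachability` and `{changed:false}`,
// but the exact signatures are assumptions made for this sketch.
interface RepairResult {
  changed: boolean;
}

interface RepairOptions {
  // Stands in for the `--no-firewall-sync` opt-out; the field name is an assumption.
  noFirewallSync?: boolean;
}

function withFirewallRepair<T>(
  attempt: () => T,
  repairReachability: (() => RepairResult) | undefined,
  opts: RepairOptions = {},
): T {
  try {
    return attempt();
  } catch (firstError) {
    // No repair hook (non-Hetzner backend) or explicit opt-out: surface the failure.
    if (!repairReachability || opts.noFirewallSync) throw firstError;
    const result = repairReachability();
    // Nothing changed, so the failure was not an egress-IP mismatch; do not retry.
    if (!result.changed) throw firstError;
    // Exactly one retry; a second failure propagates to the caller.
    return attempt();
  }
}
```

Because the repair runs only inside the `catch`, a successful first attempt never triggers an egress probe, which matches the "happy path never pays" goal stated above.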

Verification

  • Live (Hetzner box): locked the firewall to a bogus 1.2.3.4/32, then:
    • agentbox shell … -- echo fails with the hint and does NOT auto-repair;
    • agentbox recover … auto-syncs back to the real egress + reconnects ("firewall updated: SSH now allowed from … (was 1.2.3.4/32)");
    • --no-firewall-sync leaves it locked (WARN).
  • Unit: firewallNeedsSync (match→false, change→true, 0.0.0.0/0→false, absent→true) and the egress TTL cache (probe once within window, re-probe after).
  • pnpm typecheck && lint && test green.
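
The four unit-tested cases for `firewallNeedsSync` (match→false, change→true, 0.0.0.0/0→false, absent→true) could be satisfied by a pure function along these lines. This is a sketch under assumptions: the parameter shapes are hypothetical, "absent" is taken to mean a missing or empty SSH source list, and only single-host IPv4 `/32` sources are handled.

```typescript
// Explicit "allow from anywhere" opt-in for dynamic IPs; never hinted or synced.
const ANY_IPV4 = "0.0.0.0/0";

// `allowedSources` stands in for the SSH rule's source CIDRs read from the firewall;
// `undefined` or an empty list models a missing rule. Both shapes are assumptions.
function firewallNeedsSync(
  allowedSources: readonly string[] | undefined,
  egressIp: string,
): boolean {
  // absent -> true: there is nothing allowing SSH from this host yet.
  if (!allowedSources || allowedSources.length === 0) return true;
  // 0.0.0.0/0 -> false: the user opted in to an open rule, leave it alone.
  if (allowedSources.indexOf(ANY_IPV4) !== -1) return false;
  // match -> false, change -> true.
  return allowedSources.indexOf(`${egressIp}/32`) === -1;
}
```

Keeping the comparison pure (no network calls, no Hetzner API) is what makes the four cases cheap to unit-test.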

https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup


Note

Medium Risk
Changes SSH reachability and firewall rules on connection failures; scoped to Hetzner establish paths with one retry and opt-out, but mis-sync could briefly widen or mis-target firewall rules if egress detection is wrong.

Overview
When a laptop’s public IP changes, Hetzner boxes keep SSH locked to the old egress CIDR, so attach/exec/recover fail with opaque SSH timeouts.

This PR adds connection-establishment-only self-heal: optional repairReachability on CloudBackend/Provider (Hetzner implements it), plus CLI helper withFirewallRepair that retries once after a successful firewall sync.

Auto-sync runs on agentbox recover (provider.reconnect) and on initial Hetzner cloud attach (up-front exec true before resume probe / buildAttach). Mid-session attach reconnect deliberately does not sync—the box stopping is not an IP change. recover gains --no-firewall-sync to opt out.

On tunnel open failure, Hetzner ensureTunnel compares firewall SSH source vs fresh egress and appends a hint to run hetzner firewall sync or recover. Pure firewallNeedsSync (skips 0.0.0.0/0) is unit-tested.
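
The failure-path hint can be pictured roughly as follows. This is an illustrative, synchronous sketch rather than the actual `ensureTunnel`: the `open` and `status` callbacks, the `EgressStatus` shape, and the `egressHint` helper are hypothetical, and the hint wording only approximates the message quoted in the description.

```typescript
// Hypothetical result of comparing the firewall's SSH source with a fresh egress probe.
interface EgressStatus {
  allowedCidr: string | undefined; // SSH source currently in the firewall, if any
  egressIp: string; // freshly detected egress IP of this host
}

// Returns a hint only on a real mismatch; an open 0.0.0.0/0 rule is never hinted.
function egressHint(status: EgressStatus): string | undefined {
  const { allowedCidr, egressIp } = status;
  if (allowedCidr === undefined || allowedCidr === "0.0.0.0/0") return undefined;
  if (allowedCidr === `${egressIp}/32`) return undefined; // e.g. box merely stopped
  return (
    `firewall allows ${allowedCidr} but your egress is now ${egressIp}; ` +
    "run 'agentbox hetzner firewall sync' or 'agentbox recover'"
  );
}

function ensureTunnel<T>(open: () => T, status: () => EgressStatus): T {
  try {
    return open();
  } catch (err) {
    // The egress probe happens only here, on the failure path.
    const hint = egressHint(status());
    if (hint === undefined) throw err;
    const message = err instanceof Error ? err.message : String(err);
    throw new Error(`${message}\n${hint}`);
  }
}
```

The wrapper is read-only: it never touches the firewall, it only enriches the error that the caller already had to handle.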

Reviewed by Cursor Bugbot for commit ec54498. Configure here.

A Hetzner box's firewall locks SSH to the host's egress IP at create time and is
never re-synced. When the host IP changes (laptop moves networks), every comms op
fails with an opaque `ssh ControlMaster failed … Operation timed out` and the
user has to know to run `agentbox hetzner firewall sync`. Two fixes, both gated
to the connection-failure path so the happy path never pays the egress-detect
cost, and the firewall is re-synced ONLY when the IP actually changed:

1. Hint (read-only): wrap `tunnels.open` in `ensureTunnel` — the one choke point
   all of exec/scp/forward/poller/attach funnel through. On a real mismatch it
   turns the opaque timeout into "firewall allows X but your egress is now Y —
   run `firewall sync`/`recover`". Safe on a checkpoint drop (box merely stopped,
   IP unchanged → no hint).

2. Auto-sync, scoped to connection ESTABLISHMENT only. New optional
   `repairReachability` on CloudBackend/Provider (Hetzner-only): re-syncs the
   firewall to the current egress, but only when it changed (else changed:false).
   A `withFirewallRepair` CLI helper retries the attempt once iff something
   changed, wired at the two establish sites — `recover` (provider.reconnect) and
   the INITIAL attach connect (`_cloud-attach` buildAttach). Deliberately NOT the
   mid-session reconnect closure: a checkpoint stops the box and drops the PTY,
   which must not be mistaken for an IP change. `--no-firewall-sync` opts out on
   recover (shared/untrusted egress).

A short-TTL egress cache avoids probe storms across retries / `recover --all`.
`0.0.0.0/0` (explicit dynamic-IP opt-in) is never hinted or synced.

Verified live on a Hetzner box: locking the firewall to a bogus IP makes `shell`
fail with the hint (no auto-repair), `recover` auto-syncs back + reconnects, and
`--no-firewall-sync` leaves it locked. Unit tests cover firewallNeedsSync + the
egress TTL cache.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup

Comment thread packages/sandbox-hetzner/src/egress-ip.ts Outdated
Comment thread apps/cli/src/commands/_cloud-attach.ts
…establishes

1. Stale egress cache could mask a real IP change: cut the cache TTL from 60s to
   5s. It only exists to dedup a burst of failure-path probes (poller backoff,
   `recover --all`), not to remember the IP over time — a long TTL would hide the
   very IP change we're detecting.

2. The firewall self-heal wrapped only the final buildAttach, but the resume
   probe and the detached pre-start both connect first, so a firewall block there
   aborted the attach (or silently dropped the resumed session) before repair ran. Move
   the repair to a single up-front warm-up (`exec true`, Hetzner-only) that opens
   the tunnel + self-heals BEFORE any later establish touch, which then reuse the
   live master. Verified live: a locked firewall is now auto-synced on
   `claude attach` before it connects.

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
madarco commented Jun 29, 2026

Owner Author

bugbot run

Comment thread packages/sandbox-hetzner/src/backend.ts Outdated
…re path

Bugbot (round 2): even a 5s-TTL cache could read a just-changed egress IP as
"unchanged" in the firewall comparison and skip the heal — the exact mismatch
this exists to catch. The cache only dedup'd failure-path probes, but the cloud
poller already de-dupes its recover calls and `recover --all` is sequential, so
a fresh `detectEgressIp` in `firewallEgressStatus` won't storm. Remove the cache
entirely; correctness over a marginal probe dedup.
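
The stale-read problem described above can be reproduced with a toy TTL cache and an injected clock. Everything here is hypothetical (`makeCachedEgress`, the clock, the example addresses); it only illustrates why a cached egress IP can make a just-changed IP look unchanged.

```typescript
// Hypothetical TTL cache with an injected clock so the race is reproducible.
function makeCachedEgress(
  probe: () => string,
  ttlMs: number,
  now: () => number,
): () => string {
  let cached: { ip: string; at: number } | undefined;
  return () => {
    const t = now();
    if (cached && t - cached.at < ttlMs) return cached.ip; // served from cache
    cached = { ip: probe(), at: t };
    return cached.ip;
  };
}

// Simulated world: the host changes networks 2s after the first probe.
let clock = 0;
let realEgress = "198.51.100.10";
const cachedEgress = makeCachedEgress(() => realEgress, 5000, () => clock);

// The firewall was locked to whatever the first probe saw.
const firewallAllows = `${cachedEgress()}/32`;

clock = 2000;
realEgress = "203.0.113.7"; // IP changes inside the 5s TTL window

// The cached read still returns the old IP, so the comparison says "unchanged".
const staleSaysUnchanged = `${cachedEgress()}/32` === firewallAllows;
// A fresh read sees the new IP, so the comparison detects the mismatch.
const freshSaysUnchanged = `${realEgress}/32` === firewallAllows;
```

In this simulation the cached comparison reports no change while the fresh one does, which is the skipped heal the commit message describes.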

Claude-Session: https://claude.ai/code/session_01Ja5HgEjwyER5BhhFCpPUup
madarco commented Jun 29, 2026

Owner Author

bugbot run

@cursor cursor Bot left a comment

✅ Bugbot reviewed your changes and found no new issues!

Reviewed by Cursor Bugbot for commit ec54498. Configure here.

@madarco madarco merged commit 6c13700 into nightly Jun 29, 2026
4 checks passed