Release v1.0.11: gateway self-heals upstream after backend pod rollout by ZhiXiao-Lin · Pull Request #6 · A3S-Lab/Gateway

ZhiXiao-Lin · 2026-06-22T08:54:17Z

What

Three fixes, bumping a3s-gateway to v1.0.11:

- fix(proxy): self-heal upstream after backend pod rollout (the headline fix)
  - Root cause: hyper's idle connection pool is keyed by hostname, not by resolved IP, and its 90s idle timeout exceeded the 30s passive-health recovery_time. After a backend Deployment rolls (new pod IP), pooled sockets to the dead old pod IP linger and get reused → SendRequest fails → the backend is marked unhealthy → the half-open probe reuses another stale socket → permanent 503 "No healthy backends" until the gateway is restarted.
  - Fix: pool_idle_timeout 90s→5s, TCP keepalive 90s→15s, passive-health recovery_time 30s→10s. Because 5s < 10s, stale sockets have expired from the pool by the time the half-open probe fires, so the probe opens a FRESH connection that re-resolves DNS to the new pod IP. The gateway self-heals within ~10s instead of needing a manual restart.
- fix(k8s): collapse the redundant router key when the Ingress name equals the backend Service name (already live in prod; committed for release parity).
- fix(streaming): SSE idle read_timeout (already on this branch).
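
For reviewers, the timing argument behind the headline fix can be written down as a small check. This is only a sketch under a simplifying assumption: a socket last used at the moment the backend was marked unhealthy stays untouched until the half-open probe fires. The `UpstreamTimeouts` struct and its method are hypothetical names for illustration, not the gateway's actual types.

```rust
use std::time::Duration;

/// Hypothetical grouping of the three tuned values. Field names are
/// illustrative and are not taken from the a3s-gateway source.
#[derive(Debug, Clone, Copy)]
pub struct UpstreamTimeouts {
    pub pool_idle_timeout: Duration,
    pub tcp_keepalive: Duration,
    pub recovery_time: Duration,
}

impl UpstreamTimeouts {
    /// Values before this PR, as stated in the description.
    pub const OLD: Self = Self {
        pool_idle_timeout: Duration::from_secs(90),
        tcp_keepalive: Duration::from_secs(90),
        recovery_time: Duration::from_secs(30),
    };

    /// Values after this PR, as stated in the description.
    pub const NEW: Self = Self {
        pool_idle_timeout: Duration::from_secs(5),
        tcp_keepalive: Duration::from_secs(15),
        recovery_time: Duration::from_secs(10),
    };

    /// Simplified model (assumption): a socket last used when the backend
    /// was marked unhealthy has been idle for exactly `recovery_time` when
    /// the half-open probe fires. It can still be handed out by the pool
    /// only if that idle period is shorter than `pool_idle_timeout`.
    pub fn probe_may_reuse_stale_socket(&self) -> bool {
        self.recovery_time < self.pool_idle_timeout
    }
}
```

Under this model `UpstreamTimeouts::OLD.probe_may_reuse_stale_socket()` is true (30s < 90s) and the same call on `NEW` is false (10s >= 5s), which is the 5s < 10s invariant from the description. A startup assertion along these lines could guard against a later config change silently breaking the invariant.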

Why

Every app deploy (pod roll) was taking the gateway down with {"error":"No healthy backends"} and only a manual gateway restart recovered it. This makes the gateway self-heal.

Verification

cargo check clean. Deploy plan: roll main gateway → verify → roll edge → verify; rollback to 1.0.10 ready.

reqwest's RequestBuilder::timeout caps the WHOLE request including the streamed body, so SSE/chunked responses were hard-killed after the hardcoded 300s regardless of activity — every SSE stream died at 5 min. Move to a client-level read_timeout (idle, reset on every byte): a healthy stream with periodic keep-alive frames (~10s) never trips it and can run indefinitely; only a genuinely silent upstream is reaped. Bump to 1.0.10.
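
The difference between the two timeout semantics can be illustrated by replaying a stream's byte arrival times against each policy. This is a std-only sketch, not reqwest code: `Outcome`, `with_total_timeout` and `with_idle_timeout` are hypothetical helpers, and the 30s idle value in the usage note is an assumption because the commit does not state the configured read_timeout.

```rust
use std::time::Duration;

/// Result of replaying a stream against a timeout policy.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Completed,
    TimedOutAt(Duration),
}

/// Whole-request deadline: fires once `total` has elapsed since the request
/// started, no matter how recently data arrived. `end` is the offset at
/// which the upstream would have closed the stream on its own.
pub fn with_total_timeout(end: Duration, total: Duration) -> Outcome {
    if end > total {
        Outcome::TimedOutAt(total)
    } else {
        Outcome::Completed
    }
}

/// Idle (read) timeout: the clock restarts on every arrival, so it fires
/// only when the gap between consecutive reads exceeds `idle`.
/// `arrivals` are offsets from request start, sorted ascending.
pub fn with_idle_timeout(arrivals: &[Duration], end: Duration, idle: Duration) -> Outcome {
    let mut last = Duration::ZERO;
    for &t in arrivals.iter().chain(std::iter::once(&end)) {
        if t.saturating_sub(last) > idle {
            return Outcome::TimedOutAt(last + idle);
        }
        last = t;
    }
    Outcome::Completed
}
```

With keep-alive frames every 10s over a 600s stream, `with_total_timeout(end, 300s)` reports a timeout at 300s, while `with_idle_timeout(&arrivals, end, 30s)` completes. A stream with no arrivals at all is reaped by the idle policy after 30s.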

…service name image-app-publish names the Ingress and Service identically, so the ns-ingress-svc key doubled (default-arche-arche). Already live in the deployed gateway — committing so the released build matches production.
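
A minimal sketch of the de-duplication, assuming the key is the namespace, Ingress name and Service name joined with `-` as the doubled example `default-arche-arche` suggests. The function name `router_key` and the collapsed form `default-arche` are assumptions for illustration; the gateway's real key builder may differ.

```rust
/// Build a router key from namespace, Ingress name and Service name.
/// Hypothetical helper: when the Ingress and Service share a name, the
/// repeated segment is emitted once instead of twice.
pub fn router_key(namespace: &str, ingress: &str, service: &str) -> String {
    if ingress == service {
        // Identical names: collapse "ns-name-name" to "ns-name".
        format!("{}-{}", namespace, ingress)
    } else {
        format!("{}-{}-{}", namespace, ingress, service)
    }
}
```

For example, `router_key("default", "arche", "arche")` yields `default-arche`, while distinct names such as `("default", "web", "api")` still produce the three-part key `default-web-api`.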

Root cause: hyper's idle connection pool is keyed by hostname, not resolved IP, with a 90s idle timeout > 30s passive-health recovery_time. After a backend Deployment rolls (new pod IP), pooled sockets to the dead old pod IP linger and get reused -> SendRequest fails -> backend marked unhealthy -> the half-open probe reuses another stale socket -> permanent 503 'No healthy backends' until the gateway is restarted. Fix: pool_idle_timeout 90s->5s + TCP keepalive 90s->15s (evict stale sockets before the recovery probe) and passive-health recovery_time 30s->10s. The 5s<10s invariant guarantees the half-open probe opens a FRESH connection that re-resolves DNS to the new pod IP, so the gateway self-heals within ~10s of a rollout instead of needing a manual restart.
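
The passive-health cycle referred to above (healthy, marked unhealthy, half-open probe after recovery_time) can be sketched as a small state machine. This is an illustration under stated assumptions, not the gateway's implementation: `Health` and `PassiveHealth` are hypothetical names, time is passed in explicitly as an offset, and the half-open state does not limit the number of concurrent probes.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    /// Marked down at offset `since`; no traffic until recovery_time passes.
    Unhealthy { since: Duration },
    /// A trial request is allowed through to test the backend.
    HalfOpen,
}

pub struct PassiveHealth {
    pub state: Health,
    pub recovery_time: Duration,
}

impl PassiveHealth {
    pub fn new(recovery_time: Duration) -> Self {
        Self { state: Health::Healthy, recovery_time }
    }

    /// Returns true if a request may be sent to the backend at `now`.
    /// An unhealthy backend moves to HalfOpen once recovery_time has elapsed.
    pub fn allow(&mut self, now: Duration) -> bool {
        match self.state {
            Health::Healthy | Health::HalfOpen => true,
            Health::Unhealthy { since } => {
                if now.saturating_sub(since) >= self.recovery_time {
                    self.state = Health::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// Record the result of a request that finished at `now`.
    /// Any failure (including a failed probe) restarts the recovery window.
    pub fn record(&mut self, now: Duration, ok: bool) {
        self.state = if ok {
            Health::Healthy
        } else {
            Health::Unhealthy { since: now }
        };
    }
}
```

In this model a probe that fails because it reused a stale socket puts the backend straight back into `Unhealthy`, which is the loop the root cause describes. The probe only succeeds once it runs over a fresh connection.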

ZhiXiao-Lin · 2026-06-22T08:56:50Z

Superseded: SSE fix is already on main via #5 (caused the conflict). Reopening a clean PR off current main with just the de-dup + connection-pool self-heal.

RoyLin added 3 commits June 10, 2026 16:01

ZhiXiao-Lin closed this Jun 22, 2026

ZhiXiao-Lin deleted the fix/sse-idle-timeout branch June 22, 2026 09:20

ZhiXiao-Lin wants to merge 3 commits into main from fix/sse-idle-timeout
