fix(loadgen): h1 read-EOF backoff + h2 frame-size/flow-control + h2 reconnect (full-matrix resilience) #63
Merged
A Connection: close server (or one closing mid-response under churn-close) surfaces as a read EOF on the status line or a header line, not just on the write. The write-error path already reconnects + backs off; the two read paths returned the bare error, so a close-after-one-response server spun read-EOFs with no pacing and never re-established a usable conn. drogon collapsed to 0 successful requests from exactly this. Mirror the write path: reconnect for the next request, recordConnectError + backoff only when the server is genuinely down, otherwise reset the backoff.
POST bodies larger than 16384 B were sent as a single oversized DATA frame (FRAME_SIZE_ERROR against a 16384-default server) and bodies larger than the 65535 initial send window overran flow control (FLOW_CONTROL_ERROR / hang to the 5-min deadline). This is the post-64k-h2 failure (64 KiB body = 65536 B, one past the window). Capture the server's SETTINGS_MAX_FRAME_SIZE and SETTINGS_INITIAL_WINDOW_SIZE at handshake; split the body at the server's frame size and pace it against the connection send window (RFC 7540 §6.9.2: starts at 65535) and the per-stream send window, both replenished from WINDOW_UPDATE in readLoop. The writer goroutine is sequential, so the active stream's window is tracked by curStreamID/curStreamWindow. Drop the now-dead h2WriteReq.maxFrame field. Regression test posts a 200000-B body through a strict h2c server advertising 16384/65535 (matching real bench targets, not x/net's lenient 1 MiB defaults); the pre-fix single-frame send fails it with GOAWAY code=6.
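The split-and-pace arithmetic can be illustrated with a small model that does no I/O. This is a sketch under stated assumptions: `sendState`, `nextChunk`, `credit`, and `sendUntilBlocked` are hypothetical names, and the real writer tracks the stream window via curStreamID/curStreamWindow rather than a struct like this.

```go
package main

// sendState is a hypothetical model of the sender's view of one stream.
type sendState struct {
	maxFrame     int // server's SETTINGS_MAX_FRAME_SIZE
	connWindow   int // connection-level send window
	streamWindow int // send window of the active stream
}

// nextChunk returns how many body bytes may go into the next DATA frame:
// the smallest of the remaining body, the frame size limit, and both send
// windows. It debits both windows. A result of 0 means the writer must
// wait for a WINDOW_UPDATE.
func (s *sendState) nextChunk(remaining int) int {
	n := remaining
	for _, limit := range []int{s.maxFrame, s.connWindow, s.streamWindow} {
		if limit < n {
			n = limit
		}
	}
	if n < 0 {
		n = 0 // a window can be negative after a SETTINGS change
	}
	s.connWindow -= n
	s.streamWindow -= n
	return n
}

// credit applies WINDOW_UPDATE increments to the two windows.
func (s *sendState) credit(connInc, streamInc int) {
	s.connWindow += connInc
	s.streamWindow += streamInc
}

// sendUntilBlocked returns the DATA frame sizes that can be sent right now
// and the number of body bytes still waiting for window credit.
func sendUntilBlocked(remaining int, s *sendState) (frames []int, left int) {
	for remaining > 0 {
		n := s.nextChunk(remaining)
		if n == 0 {
			break
		}
		frames = append(frames, n)
		remaining -= n
	}
	return frames, remaining
}
```

With 16384/65535 settings, a 65536-B body goes out as three 16384-B frames plus one 16383-B frame, leaving 1 byte that has to wait for a WINDOW_UPDATE. That is the "one past the window" case named above.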
The h2 client dialed its connections once in New() and never recovered one the server tore down. A server that GOAWAYs or closes connections periodically — hypercorn does, so the fastapi-h2 column hit it on every cell — left the slot permanently dead: readLoop marks the conn closed on GOAWAY and returns, then every DoRequest returns the bare closed-conn error with no pacing. fastapi-h2 logged ~1.1 billion errors / 0 successful requests per 35 s cell from exactly this hot loop (the h2 analog of the h1 churn-close bug). Wrap each connection in an h2ConnSlot (atomic.Pointer to the live conn + single-flight redial mutex + connectBackoff). DoRequest re-dials a dead slot, paced by the slot backoff, swapping the fresh conn in atomically so sibling workers pick it up lock-free. Verified against a live hypercorn h2c server: GET 0 -> 47.6k req (0.4% error), POST-65536-body 0 -> 16k req — both previously zero. Regression test closes a live conn out from under the client and asserts the next requests recover against the still-running server.
Three resilience fixes so the full benchmark matrix survives servers it previously DNF'd on. Each is the loadgen side of a column that produced zero requests in the last full run.
1. h1client: backoff-paced reconnect on read EOF
The write-error path already reconnected + backed off; the read-status and read-header EOF paths returned the bare error. A Connection: close server, or one closing mid-response under churn-close, surfaces as a read EOF, so a close-after-one-response server spun read-EOFs with no pacing and never recovered. drogon collapsed to 0 successful requests (churn-close cell) from exactly this.
2. h2client: honor server MAX_FRAME_SIZE + send-window flow control
POST bodies > 16384 B were sent as a single oversized DATA frame (FRAME_SIZE_ERROR) and bodies > the 65535 initial send window overran flow control — the post-64k-h2 failure (5 h2 columns: aspnet-h2, axum-h2, elysia-h2, hono-h2, hyper-h2). Now: capture the server's SETTINGS_MAX_FRAME_SIZE + SETTINGS_INITIAL_WINDOW_SIZE at handshake, split at the frame size, and pace against the connection + per-stream send windows, replenished from WINDOW_UPDATE. Regression test posts 200000 B through a strict 16384/65535 h2c server; pre-fix single-frame send fails it with GOAWAY code=6.
3. h2client: re-dial connections the server closes/GOAWAYs mid-cell
The h2 client dialed once in New() and never recovered a torn-down conn. A server that GOAWAYs/closes connections periodically — hypercorn does, so fastapi-h2 hit it every cell: ~1.1 billion errors / 0 requests per 35 s cell — left the slot dead and every DoRequest hot-looped the closed-conn error (the h2 analog of #1). Each conn is now an h2ConnSlot (atomic.Pointer + single-flight redial + backoff); DoRequest re-dials a dead slot and swaps the fresh conn in lock-free.
Verification