fix(bench): dial backoff + never-connected fail-fast, ConnectErrors class, warmup stats, calibrated warmup->measure handoff #61
Merged
…4.5 constant)

The Version constant was never bumped past 1.4.5, so Results from v1.4.6+ builds misreport their producer. Resolve the module version via runtime/debug.ReadBuildInfo, which is correct both when loadgen is the main module of a tagged build and when it is consumed as a dependency, falling back to a corrected 1.4.7 constant for (devel)/test builds.
…r class counter
Groundwork for the v3.8 dial-storm fixes: a jittered exponential backoff
(10ms doubling to a 1s cap, abortable via ctx or a close channel), a
fail-fast tracker that turns a never-connected 5s failure streak into a
fatal ErrNeverConnected error shaped like the h1client pre-dial failure
("loadgen: dial: ...: connection refused"), and a process-global
connect-error counter so dial/handshake failures are countable as their
own class alongside the untyped errors total.
…isibility

- Benchmarker now aborts the whole run (warmup or measured phase) the moment a worker surfaces an ErrNeverConnected-wrapped error, instead of burning the configured Duration against a dead target; Run returns the fatal dial error so callers can classify the cell as dnf.
- Result.ConnectErrors and TimeseriesPoint.ConnectErrors expose the dial/handshake-failure error class (additive JSON; existing consumers keep parsing Errors unchanged). Producers wired in follow-up commits.
- Result.Warmup (WarmupStats) snapshots warmup requests/errors/connect errors before the pre-run counter reset, and the CLI logs a warmup summary plus an explicit warning when zero warmup requests succeeded. A 100%-failing warmup was previously invisible.
…-fast

v3.8 post-mortem: when the SUT died, the ws/sse drivers redialled the dead port in a hot loop (~386k dials/sec, 34.7M errors per 90s cell) and a cell that never had a server burned its full duration before being misclassified.

- Connect failures now pace per-worker via the shared jittered exponential backoff (10ms doubling to 1s) instead of redialling at full speed.
- Dials and the upgrade/GET handshake honor ctx cancellation and Config.DialTimeout (both previously ignored mid-dial), and Close() now aborts in-flight dials and backoff sleeps via a client close context.
- If NO stream was ever established and connect attempts have failed for failFastWindow (5s), the driver returns an ErrNeverConnected-wrapped error shaped like the h1client pre-dial failure; Benchmarker.Run aborts so the harness can classify dnf. A client that had live streams keeps retrying with backoff indefinitely.
- Each dial failure is recorded in the connect-error class (Result.ConnectErrors / timeseries / warmup stats).
The reconnect-and-retry path failed fast on a refused dial and returned, letting the worker loop redial immediately — v3.8's crash cell logged 33.1M dial errors at ~370k/s. Failed reconnects now sleep the shared jittered exponential backoff (10ms doubling to 1s, ctx-abortable) before surfacing the error, reset on the next successful reconnect, and count into the connect-error class.
…elines

The warmup->measure boundary used to Store(0) the six mix counters while no workers were running. The calibrated saturation handoff (next commit) keeps workers in flight across that boundary, where a reset would race concurrent Adds and lose increments. markMeasureStart captures baselines instead; stats() subtracts them, so reported counts still cover the measured window only.
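A minimal sketch of the baseline technique for a single counter, assuming Go 1.19+ for `atomic.Uint64`. The method name `markMeasureStart` follows the commit message; the `counter` type and its `measured` method are hypothetical and reduce the six mix counters to one for brevity.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter is monotonic: workers only ever Add, so no increment can be lost
// to a concurrent reset.
type counter struct {
	total atomic.Uint64 // everything since process start
	base  atomic.Uint64 // value of total at the warmup->measure boundary
}

func (c *counter) add(n uint64) { c.total.Add(n) }

// markMeasureStart captures the baseline instead of storing zero.
func (c *counter) markMeasureStart() { c.base.Store(c.total.Load()) }

// measured reports only what was counted after the boundary.
func (c *counter) measured() uint64 { return c.total.Load() - c.base.Load() }

func main() {
	var c counter
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.add(1)
			}
		}()
	}
	c.markMeasureStart() // workers may still be adding; nothing is lost
	wg.Wait()
	warm := c.base.Load()
	fmt.Println("warmup + measured == total:", warm+c.measured() == c.total.Load())
}
```

An increment racing with the boundary lands on one side or the other, but it is always counted exactly once, which a `Store(0)` cannot guarantee.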
…ation mode

v3.9 repro (chain-api-*/celeris-iouring-h1-async, 90s cells): every saturation cell's errors landed exclusively in the first measured second (get-json 22, get-json-1c 53, post-4k 271) with connect_errors=0, while the 20s warmup ran clean at ~508k RPS and the measured window settled at ~595k.

Mechanism: warmup drove only 75% of the workers (so the closed loop never explored the knee and 25% of the keep-alive conns sat idle, free for the server to expire), then the boundary stopped every worker and restarted the full set at t=0. That is a +33% concurrency step plus a phase-aligned restart burst, which the server absorbed by resetting established conns. Steady state never repeats any of that, hence the one-second burst.

The handoff is now calibrated:

- Saturation mode (Rate==0): warmupHot starts the FULL worker set and leaves it running across the boundary. Knee discovery happens during warmup (errors land in Result.Warmup, which is honest calibration); the measured window adopts the same goroutines mid-flight, so it opens at the converged steady rate with no concurrency step, no cold conns, no herd. Short warmups simply continue the same ramp from below inside the window, converging to the true max without overshoot.
- Boundary bookkeeping is swap/baseline based since workers stay hot: the latency recorder is held behind an atomic.Pointer and a fresh one is swapped in at the boundary (one pointer load per recorded request; negligible next to the existing windowMu acquisition), measured errors are reported as errors-errorsBase instead of resetting the counter, and warmup request totals are finalised at end-of-run so per-shard locals unflushed at the swap are not lost.
- Rated mode keeps its stop/start phases (the measured window swaps in the rate scheduler), but its warmup now also runs the full worker set so no conn enters the paced window cold.

Measured-window errors on a healthy server are now structurally zero: no samples are trimmed or edited; the burst simply happens where the calibration phase reports it.
…ion test

- TestSaturationHandoffContinuity: goroutine-identity fingerprinting proves the measured window adopts the warmup workers (no second worker set, full set active from warmup start) and that warmup+measured request accounting is exact across the recorder swap.
- TestRatedHandoffKeepsStopStart: rated mode still drains warmup workers and spawns a fresh ratedWorker set; scheduler semantics are untouched.
- TestSaturationHandoffZeroMeasuredErrors: loopback raw-TCP server with a zero-burst 8k RPS pacer, cold-start shedding, and idle-conn expiry (the three SUT behaviours behind the v3.9 burst). Asserts measured window has zero errors in every timeseries bucket while warmup reports the shedding, and that measured RPS converges to the server limit without understating the warmup-calibrated rate.
Motivation
Two failure modes from the v3.8 bench post-mortem, plus a t=0 burst confirmed in the v3.9 repro:
- … Result.Errors, so the harness could not tell "server slow" from "server gone".
- Version was never bumped past 1.4.5, so every Result from v1.4.6+ builds misreports its producer.

Changes (per commit)
- ae78c64 fix(version): resolve the self-reported version via runtime/debug.ReadBuildInfo (correct both as a tagged main module and as a dependency), falling back to a corrected constant for (devel)/test builds.
- 4a6c18a feat(dial): shared jittered exponential backoff (10ms doubling to 1s, abortable via ctx/close), a fail-fast tracker that turns a never-connected 5s failure streak into a fatal ErrNeverConnected, and a connect-error class counter.
- e26bc79 feat(bench): Benchmarker.Run aborts the run the moment a worker surfaces ErrNeverConnected (no more burning the full Duration against a dead target; callers can classify dnf). Result.ConnectErrors + TimeseriesPoint.ConnectErrors expose the dial/handshake error class (additive JSON). Result.Warmup snapshots warmup requests/errors/connect errors, and the CLI logs a warmup summary plus a warning when zero warmup requests succeeded.
- 7bcd310 fix(ws,sse): backoff-paced, ctx-abortable dials honoring Config.DialTimeout; Close() aborts in-flight dials and backoff sleeps; never-connected fail-fast after 5s; every dial failure counted in the connect-error class. Clients that had live streams keep retrying with backoff indefinitely.
- c902b92 fix(h1client): failed reconnects sleep the shared backoff instead of returning to an immediate redial loop; backoff resets on successful reconnect; failures count into the connect-error class.
- 057f4bf feat(mix): per-protocol counters scoped to the measured window via baselines instead of a racy Store(0) at the boundary (prerequisite for keeping workers in flight across it).
- 48bd709 fix(bench): calibrated warmup->measure handoff. In saturation mode warmup starts the FULL worker set and the measured window adopts the same goroutines mid-flight: knee discovery happens during warmup (errors land in Result.Warmup), with no concurrency step, no restart herd, and no t=0 burst. Boundary bookkeeping is swap/baseline based (atomic latency-recorder swap, error baselines). Rated mode keeps its stop/start phases but warms up the full worker set so no conn enters the paced window cold.
- b194242 test(bench): handoff continuity unit tests (goroutine-identity fingerprinting, exact warmup+measured accounting) plus a shedding-server integration test reproducing the v3.9 burst conditions and asserting zero measured-window errors in every timeseries bucket.

Validation
-race (go test -race -count=1 ./...) and the live-celeris integration matrix (-tags integration -run TestIntegrationH2CMatrix) pass locally on the CI Go version (1.26.4), with golangci-lint v2.12 clean.