fix(bench): dial backoff + never-connected fail-fast, ConnectErrors class, warmup stats, calibrated warmup->measure handoff#61

Merged
FumingPower3925 merged 8 commits into main from fix/dial-backoff-error-classes
Jun 16, 2026

@FumingPower3925

Contributor

Motivation

Two failure modes from the v3.8 bench post-mortem (dial hot-loops and dead-target cells burning their full duration), a t=0 burst confirmed in the v3.9 repro, and a stale version constant:

  • Dial hot-loops against a dead SUT. When the SUT died mid-cell, the ws/sse drivers redialled the dead port at full speed (~386k dials/sec, 34.7M errors in a single 90s cell) and the h1client reconnect path did the same (33.1M dial errors at ~370k/s in the crash cell). A cell whose server never came up burned its entire configured duration before being misclassified. Dial failures were also indistinguishable from request errors in Result.Errors, so the harness could not tell "server slow" from "server gone".
  • t=0 measured-window error bursts. Every saturation cell in the v3.9 repro (chain-api-*/celeris-iouring-h1-async, 90s cells) landed its errors exclusively in the first measured second while warmup ran clean. Cause: warmup drove only 75% of the workers, then the warmup->measure boundary stopped every worker and restarted the full set at t=0 — a +33% concurrency step plus a phase-aligned restart burst that the server absorbed by resetting established conns.
  • Stale version constant. Version was never bumped past 1.4.5, so every Result from v1.4.6+ builds misreports its producer.

Changes (per commit)

  • ae78c64 fix(version): resolve the self-reported version via runtime/debug.ReadBuildInfo (correct both as a tagged main module and as a dependency), falling back to a corrected constant for (devel)/test builds.
  • 4a6c18a feat(dial): shared jittered exponential backoff (10ms doubling to 1s, abortable via ctx/close), a fail-fast tracker that turns a never-connected 5s failure streak into a fatal ErrNeverConnected, and a connect-error class counter.
  • e26bc79 feat(bench): Benchmarker.Run aborts the run the moment a worker surfaces ErrNeverConnected (no more burning the full Duration against a dead target; callers can classify dnf). Result.ConnectErrors + TimeseriesPoint.ConnectErrors expose the dial/handshake error class (additive JSON). Result.Warmup snapshots warmup requests/errors/connect errors, and the CLI logs a warmup summary plus a warning when zero warmup requests succeeded.
  • 7bcd310 fix(ws,sse): backoff-paced, ctx-abortable dials honoring Config.DialTimeout; Close() aborts in-flight dials and backoff sleeps; never-connected fail-fast after 5s; every dial failure counted in the connect-error class. Clients that had live streams keep retrying with backoff indefinitely.
  • c902b92 fix(h1client): failed reconnects sleep the shared backoff instead of returning to an immediate redial loop; backoff resets on successful reconnect; failures count into the connect-error class.
  • 057f4bf feat(mix): per-protocol counters scoped to the measured window via baselines instead of a racy Store(0) at the boundary (prerequisite for keeping workers in flight across it).
  • 48bd709 fix(bench): calibrated warmup->measure handoff. In saturation mode warmup starts the FULL worker set and the measured window adopts the same goroutines mid-flight — knee discovery happens during warmup (errors land in Result.Warmup), no concurrency step, no restart herd, no t=0 burst. Boundary bookkeeping is swap/baseline based (atomic latency-recorder swap, error baselines). Rated mode keeps its stop/start phases but warms up the full worker set so no conn enters the paced window cold.
  • b194242 test(bench): handoff continuity unit tests (goroutine-identity fingerprinting, exact warmup+measured accounting) plus a shedding-server integration test reproducing the v3.9 burst conditions and asserting zero measured-window errors in every timeseries bucket.

Validation

  • Full unit suite with -race (go test -race -count=1 ./...) and the live-celeris integration matrix (-tags integration -run TestIntegrationH2CMatrix) pass locally on the CI Go version (1.26.4) with golangci-lint v2.12 clean.
  • Cluster repro of the v3.9 conditions: saturation cells now report zero measured-window errors (shedding lands in warmup stats), and dead-SUT cells fail fast as dnf instead of logging tens of millions of dial errors.

…4.5 constant)

The Version constant was never bumped past 1.4.5, so Results from v1.4.6+
builds misreport their producer. Resolve the module version via
runtime/debug.ReadBuildInfo — correct both when loadgen is the main module
of a tagged build and when consumed as a dependency — falling back to a
corrected 1.4.7 constant for (devel)/test builds.
…r class counter

Groundwork for the v3.8 dial-storm fixes: a jittered exponential backoff
(10ms doubling to a 1s cap, abortable via ctx or a close channel), a
fail-fast tracker that turns a never-connected 5s failure streak into a
fatal ErrNeverConnected error shaped like the h1client pre-dial failure
("loadgen: dial: ...: connection refused"), and a process-global
connect-error counter so dial/handshake failures are countable as their
own class alongside the untyped errors total.
…isibility

- Benchmarker now aborts the whole run (warmup or measured phase) the
  moment a worker surfaces an ErrNeverConnected-wrapped error, instead of
  burning the configured Duration against a dead target; Run returns the
  fatal dial error so callers can classify the cell as dnf.
- Result.ConnectErrors and TimeseriesPoint.ConnectErrors expose the
  dial/handshake-failure error class (additive JSON; existing consumers
  keep parsing Errors unchanged). Producers wired in follow-up commits.
- Result.Warmup (WarmupStats) snapshots warmup requests/errors/connect
  errors before the pre-run counter reset, and the CLI logs a warmup
  summary plus an explicit warning when zero warmup requests succeeded —
  a 100%-failing warmup was previously invisible.
…-fast

v3.8 post-mortem: when the SUT died, the ws/sse drivers redialled the dead
port in a hot loop (~386k dials/sec, 34.7M errors per 90s cell) and a cell
that never had a server burned its full duration before being
misclassified.

- Connect failures now pace per-worker via the shared jittered exponential
  backoff (10ms doubling to 1s) instead of redialling at full speed.
- Dials and the upgrade/GET handshake honor ctx cancellation and
  Config.DialTimeout (both previously ignored mid-dial), and Close() now
  aborts in-flight dials and backoff sleeps via a client close context.
- If NO stream was ever established and connect attempts have failed for
  failFastWindow (5s), the driver returns an ErrNeverConnected-wrapped
  error shaped like the h1client pre-dial failure; Benchmarker.Run aborts
  so the harness can classify dnf. A client that had live streams keeps
  retrying with backoff indefinitely.
- Each dial failure is recorded in the connect-error class
  (Result.ConnectErrors / timeseries / warmup stats).

The h1client reconnect-and-retry path failed fast on a refused dial and returned,
letting the worker loop redial immediately — v3.8's crash cell logged
33.1M dial errors at ~370k/s. Failed reconnects now sleep the shared
jittered exponential backoff (10ms doubling to 1s, ctx-abortable) before
surfacing the error, reset on the next successful reconnect, and count
into the connect-error class.
…elines

The warmup->measure boundary used to Store(0) the six mix counters while
no workers were running. The calibrated saturation handoff (next commit)
keeps workers in flight across that boundary, where a reset would race
concurrent Adds and lose increments. markMeasureStart captures baselines
instead; stats() subtracts them, so reported counts still cover the
measured window only.
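
The baseline idea in miniature. The `counter` type is a hypothetical reduction of one mix counter; `markMeasureStart` reuses the name above but this is not its real body.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter is a monotonically increasing count with a measured-window baseline.
type counter struct {
	total atomic.Uint64
	base  atomic.Uint64
}

func (c *counter) add(n uint64) { c.total.Add(n) }

// markMeasureStart records the current total. Nothing is ever reset, so
// concurrent add calls cannot lose increments.
func (c *counter) markMeasureStart() { c.base.Store(c.total.Load()) }

// measured returns the count accumulated since markMeasureStart.
func (c *counter) measured() uint64 { return c.total.Load() - c.base.Load() }

func main() {
	var c counter
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.add(1)
			}
		}()
	}
	c.markMeasureStart() // workers may still be adding; no increment is dropped
	wg.Wait()
	fmt.Println(c.total.Load(), c.measured() <= 4000) // prints "4000 true"
}
```

With `Store(0)` in place of the baseline, an `Add` racing the reset could be attributed to the wrong window or overwritten; subtracting a snapshot keeps the total exact.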
…ation mode

v3.9 repro (chain-api-*/celeris-iouring-h1-async, 90s cells): every
saturation cell's errors landed exclusively in the first measured second
(get-json 22, get-json-1c 53, post-4k 271) with connect_errors=0, while
the 20s warmup ran clean at ~508k RPS and the measured window settled at
~595k. Mechanism: warmup drove only 75% of the workers (so the closed
loop never explored the knee and 25% of the keep-alive conns sat idle,
free for the server to expire), then the boundary stopped every worker
and restarted the full set at t=0 — a +33% concurrency step plus a
phase-aligned restart burst that the server absorbed by resetting
established conns. Steady state never repeats any of that, hence the
one-second burst.

The handoff is now calibrated:

- Saturation mode (Rate==0): warmupHot starts the FULL worker set and
  leaves it running across the boundary. Knee discovery happens during
  warmup (errors land in Result.Warmup — honest calibration); the
  measured window adopts the same goroutines mid-flight, so it opens at
  the converged steady rate with no concurrency step, no cold conns, no
  herd. Short warmups simply continue the same ramp from below inside
  the window — converging to the true max without overshoot.
- Boundary bookkeeping is swap/baseline based since workers stay hot:
  the latency recorder is held behind an atomic.Pointer and a fresh one
  is swapped in at the boundary (one pointer load per recorded request;
  negligible next to the existing windowMu acquisition), measured errors
  are reported as errors-errorsBase instead of resetting the counter,
  and warmup request totals are finalised at end-of-run so per-shard
  locals unflushed at the swap are not lost.
- Rated mode keeps its stop/start phases (the measured window swaps in
  the rate scheduler), but its warmup now also runs the full worker set
  so no conn enters the paced window cold.

Measured-window errors on a healthy server are now structurally zero. No
samples are trimmed or edited; the burst simply happens in warmup, where
the calibration phase reports it.
…ion test

- TestSaturationHandoffContinuity: goroutine-identity fingerprinting
  proves the measured window adopts the warmup workers (no second worker
  set, full set active from warmup start) and that warmup+measured
  request accounting is exact across the recorder swap.
- TestRatedHandoffKeepsStopStart: rated mode still drains warmup workers
  and spawns a fresh ratedWorker set — scheduler semantics untouched.
- TestSaturationHandoffZeroMeasuredErrors: loopback raw-TCP server with
  a zero-burst 8k RPS pacer, cold-start shedding, and idle-conn expiry
  (the three SUT behaviours behind the v3.9 burst). Asserts measured
  window has zero errors in every timeseries bucket while warmup reports
  the shedding, and that measured RPS converges to the server limit
  without understating the warmup-calibrated rate.
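
One way goroutine-identity fingerprinting can be done is sketched below. Parsing the ID out of `runtime.Stack` is an assumption about the technique, not necessarily what the test uses, and `runPhases` is a hypothetical reduction of the handoff: the same goroutines record their identity on both sides of the boundary.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
	"sync"
)

// goroutineID parses the ID from the "goroutine N [running]:" stack header.
// Test-only trick; the runtime offers no supported API for this.
func goroutineID() uint64 {
	var buf [64]byte
	n := runtime.Stack(buf[:], false)
	fields := strings.Fields(string(buf[:n]))
	if len(fields) < 2 {
		return 0
	}
	id, _ := strconv.ParseUint(fields[1], 10, 64)
	return id
}

// runPhases starts the workers once and records which goroutines were seen
// in warmup and which in the measured window.
func runPhases(workers int) (warm, meas map[uint64]bool) {
	warm, meas = map[uint64]bool{}, map[uint64]bool{}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		warmDone sync.WaitGroup
	)
	boundary := make(chan struct{})
	wg.Add(workers)
	warmDone.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer wg.Done()
			mu.Lock()
			warm[goroutineID()] = true
			mu.Unlock()
			warmDone.Done()

			<-boundary // the same goroutine continues into the measured window

			mu.Lock()
			meas[goroutineID()] = true
			mu.Unlock()
		}()
	}
	warmDone.Wait()
	close(boundary)
	wg.Wait()
	return warm, meas
}

func main() {
	warm, meas := runPhases(8)
	same := len(warm) == len(meas)
	for id := range meas {
		same = same && warm[id]
	}
	fmt.Println(len(warm), len(meas), same)
}
```

A stop/start handoff would show a second, disjoint set of IDs in the measured map, which is what the rated-mode test expects instead.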
@FumingPower3925 FumingPower3925 merged commit 9e845ac into main Jun 16, 2026
3 checks passed
@FumingPower3925 FumingPower3925 deleted the fix/dial-backoff-error-classes branch June 21, 2026 15:43