bench(v1.5.4): suite redesign — prune saturation, fair h2c, deeper drivers, WS/SSE competitors#199
Open
FumingPower3925 wants to merge 6 commits into
Remove the rows whose RPS converges at the 20G fabric line rate (the harness already flags them network_bound): get-json-8k/16k/64k + post-8k/16k/64k (H1) and get-json-64k-h2 + post-64k-h2. These burned ~288 cells/run (~5h) without differentiating fast adapters. Keep post-1m as the SINGLE documented wire-bound datapoint for the saturation/methodology discussion (not a ranking row). Drop the now-unused post8k/16k/64k payload generators; update the registry/category/body-size test guards. Net static rows: 12 H1 -> 6, 4 H2 -> 2. Funds the driver + WS/SSE depth.
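The convergence is easy to see with back-of-the-envelope arithmetic. The sketch below is not part of the harness; it only divides the 20 Gbps link rate by the body size and ignores headers and TCP/IP framing, so real ceilings sit somewhat lower. The function name is made up for illustration.

```go
package main

import "fmt"

// lineRateCeilingRPS returns the highest request rate a link can carry when
// every request or response moves bodyBytes of payload. Protocol overhead is
// ignored on purpose, so this is an optimistic upper bound.
func lineRateCeilingRPS(linkBitsPerSec float64, bodyBytes int) float64 {
	return linkBitsPerSec / 8 / float64(bodyBytes)
}

func main() {
	const link = 20e9 // 20 Gbps fabric, as stated in the commit message
	for _, kib := range []int{8, 16, 64, 1024} {
		rps := lineRateCeilingRPS(link, kib*1024)
		fmt.Printf("%4d KiB body: about %.0f req/s at line rate\n", kib, rps)
	}
}
```

Once every fast adapter reaches that ceiling, the row measures the link rather than the server, which is why it stops being a ranking signal.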
…(W3 part 1)
Define the 6 new driver-depth scenarios (writes / transaction / range /
pipeline / multiget) + their routes/bodies, and add the unlogged bench_writes
PG table + FixtureRedisWriteKey to services. Verified: scenarios build/vet +
the registry/category tests pass with 10 driver scenarios.
NEXT (W3 part 2): implement the 6 handlers across the 9 driver adapters
(servers/*/driver_handlers.go) — celeris via its native driver/{postgres,
redis,memcached} (Pipeline/BeginTx/GetMulti all confirmed present), Go
competitors via idiomatic pgx/go-redis/gomemcache; standardize the memcached
env var; add conformance routes. Until then these rows 404 (feature branch
only; not runnable yet).
…pters (W3 part 2)
Add pg-write / pg-update-tx (BEGIN-UPDATE-COMMIT) / pg-read-range (N-row) /
redis-set / redis-pipeline (batched GETs) / mc-multiget to every driver
adapter: celeris via its native driver/{postgres,redis,memcached} (Pipeline /
BeginTx / GetMulti), the 8 Go competitors via idiomatic pgx/go-redis/
gomemcache, each matching the adapter's framework idiom. Routes are
/cache-pipeline and /mc-multiget (not /cache/pipeline, /mc/multi) to avoid
colliding with the /cache/:key and /mc/:key param routes.
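To illustrate the collision, here is a toy segment matcher. It is an assumption-laden sketch, not the router of any adapter in this suite: real routers differ (some prefer static segments, some reject the conflict at registration), and `matchParam` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// matchParam reports whether path matches a pattern whose segments are either
// literals or ":name" parameters, and returns the captured parameters.
func matchParam(pattern, path string) (map[string]string, bool) {
	ps := strings.Split(strings.Trim(pattern, "/"), "/")
	xs := strings.Split(strings.Trim(path, "/"), "/")
	if len(ps) != len(xs) {
		return nil, false
	}
	params := map[string]string{}
	for i, seg := range ps {
		if strings.HasPrefix(seg, ":") {
			params[seg[1:]] = xs[i] // parameter segment captures anything
		} else if seg != xs[i] {
			return nil, false // literal segment must match exactly
		}
	}
	return params, true
}

func main() {
	// A nested route is swallowed by the param route...
	fmt.Println(matchParam("/cache/:key", "/cache/pipeline"))
	// ...while the hyphenated single-segment route cannot be.
	fmt.Println(matchParam("/cache/:key", "/cache-pipeline"))
}
```

With the hyphenated form, the new routes have a different segment count from `/cache/:key` and `/mc/:key`, so the ambiguity cannot arise regardless of how a given framework resolves precedence.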
Also standardize the memcached env var: fasthttp/echo/iris read
PROBATORIUM_MC_ADDR while the other 6 read PROBATORIUM_MEMCACHED_ADDR — a
latent bug where validate.yml (which only set MC_ADDR) 503'd the 6
MEMCACHED_ADDR adapters on every memcached cell. All read MEMCACHED_ADDR now;
validate.yml fixed; run_bench_cell.yml dual-set collapsed.
All 9 adapter modules build; scenarios test green (10 driver scenarios).
Widen the WS/SSE grid beyond celeris+gorilla. Each adapter serves the
fixed wire contract (GET /ws ?mode=ws-echo|ws-large-echo|ws-hub, GET
/events text/event-stream, 1ms publish, "payload"/"hello"):
- axum (Rust): axum::extract::ws + response::sse; a single broadcast
tick drives both hub fan-out and SSE. serve_h1 gains .with_upgrades()
(mandatory — without it hyper writes 101 then drops the socket); h2c
serve path untouched.
- hono (Bun): Bun.serve native websocket + SSE ReadableStream on the h1
branch; one 1ms ticker, drop-on-backpressure fan-out.
- starlette (Python): WebSocketRoute + StreamingResponse; the two 1ms
asyncio tickers start per worker via a lifespan; WS rides
uvicorn[standard]'s bundled websockets (no new dep).
Flip Capabilities{WS,SSE} on the three h1 columns. featureSetFor already
projects them and streaming gates on fs.HTTP1, so the -h2 siblings stay
out of the streaming grid. All three were build- and runtime-smoke-
verified locally (101 upgrade, echo round-trip, hub broadcast, SSE
frame, 256 KiB large-echo).
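The drop-on-backpressure fan-out mentioned for the hub can be sketched in Go as a non-blocking send per subscriber. This is an illustration of the technique only, not code from any of the three adapters (which are Rust, TypeScript and Python); the `hub` type is hypothetical and is not safe for concurrent use as written.

```go
package main

import "fmt"

// hub fans one message out to every subscriber without ever blocking the
// publisher: a subscriber whose buffer is full simply misses the message.
type hub struct {
	subs []chan string
}

// subscribe registers a subscriber with a buffer of buf messages.
func (h *hub) subscribe(buf int) <-chan string {
	ch := make(chan string, buf)
	h.subs = append(h.subs, ch)
	return ch
}

// publish delivers msg where possible and returns how many subscribers
// dropped it because their buffer was full.
func (h *hub) publish(msg string) (dropped int) {
	for _, ch := range h.subs {
		select {
		case ch <- msg:
		default:
			dropped++ // slow consumer: drop instead of stalling the tick
		}
	}
	return dropped
}

func main() {
	h := &hub{}
	fast := h.subscribe(8)
	slow := h.subscribe(1)
	dropped := 0
	for i := 0; i < 3; i++ {
		dropped += h.publish("payload")
	}
	fmt.Printf("fast queued=%d slow queued=%d dropped=%d\n", len(fast), len(slow), dropped)
}
```

In a real handler a 1 ms `time.Ticker` would drive `publish`; dropping keeps one stalled client from delaying the tick for everyone else.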
… (W2)
Audit + fix the h2c "fair fight". loadgen adopts the server's advertised
SETTINGS_INITIAL_WINDOW_SIZE as its per-stream upload window, so a small
advertised window throttles POST throughput independent of server speed.
Measured the advertised SETTINGS of every h2c column with a wire probe:
celeris ......................................... 1 MiB window / 1 MiB frame / 100 streams
gin/echo/chi/iris/hertz/stdhttp (Go net/http) ... 1 MiB (already fair)
axum-h2 / hyper-h2 (Rust hyper) ................. 1 MiB (already fair)
aspnet-h2 (Kestrel) ............................. 768 KiB -> 1 MiB
hono-h2 / elysia-h2 (Bun node:http2) ............ 64 KiB -> 1 MiB
fastapi-h2 / starlette-h2 (hypercorn/h2) ........ 64 KiB (see caveat)
So the original "h2c is mostly an artifact" premise is refuted by
measurement: the entire Go field and both Rust columns already advertise
celeris's 1 MiB window. Only Kestrel and the Bun columns lagged; their
adapters now explicitly advertise the 1 MiB / 100-stream profile:
- servers/aspnet/Program.cs: Http2 InitialStream/ConnectionWindowSize
= 1 MiB, MaxStreamsPerConnection = 100.
- servers/{hono,elysia}/src/h2c.ts: node:http2 settings.initialWindowSize
= 1 MiB + maxConcurrentStreams 100, plus session.setLocalWindowSize to
lift the connection window off its 64 KiB default.
DISCLOSED CAVEAT: fastapi-h2/starlette-h2 ride hypercorn, which exposes no
per-stream initial-window knob; they keep the h2-library default and can't
be equalized without a fragile monkeypatch. They are the slowest columns
regardless, so the window is not their binding constraint. Methodology
recorded as a block comment above the h2c columns in servers/servers.go.
All three patched columns re-probed from their rebuilt artifacts: each now
advertises INITIAL_WINDOW=1048576.
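For readers who want to repeat the re-probe by hand, the sketch below decodes a SETTINGS frame payload using the layout from the HTTP/2 specification (a 16-bit identifier followed by a 32-bit value per entry). It is not the probe used in this PR, which is not shown here; `parseSettings` is a hypothetical helper and the sample bytes are hand-built, not captured from any server.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Setting identifiers defined by the HTTP/2 specification.
const (
	settingMaxConcurrentStreams = 0x3
	settingInitialWindowSize    = 0x4
	settingMaxFrameSize         = 0x5
)

// parseSettings decodes the payload of a SETTINGS frame (frame header already
// stripped) into an identifier -> value map.
func parseSettings(payload []byte) (map[uint16]uint32, error) {
	if len(payload)%6 != 0 {
		return nil, errors.New("SETTINGS payload length is not a multiple of 6")
	}
	out := make(map[uint16]uint32, len(payload)/6)
	for i := 0; i < len(payload); i += 6 {
		id := binary.BigEndian.Uint16(payload[i:])
		out[id] = binary.BigEndian.Uint32(payload[i+2:])
	}
	return out, nil
}

func main() {
	// Hand-built sample: 100 streams and a 1 MiB initial window.
	payload := []byte{
		0x00, 0x03, 0x00, 0x00, 0x00, 0x64,
		0x00, 0x04, 0x00, 0x10, 0x00, 0x00,
	}
	s, err := parseSettings(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println("INITIAL_WINDOW =", s[settingInitialWindowSize])
	fmt.Println("MAX_CONCURRENT_STREAMS =", s[settingMaxConcurrentStreams])
	if v, ok := s[settingMaxFrameSize]; ok {
		fmt.Println("MAX_FRAME_SIZE =", v)
	}
}
```

A setting that is absent from the frame keeps its protocol default, so a missing INITIAL_WINDOW_SIZE entry means the 65,535-byte default applies.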
The v1.5.4 redesign reshaped the grid (W1 pruned saturated static rows,
W3 deepened drivers 4->10, W4 added WS/SSE to three columns). A live
`cmd/runner -dry-run -cells '*/*'` now resolves 1111 capability-gated
cells/pass (52 adapters x 44 scenarios), so the stale pins move:
FastRealizedCells: 1257 -> 1111 (fast = 35s/10s window = 19.1h < 24h)
FullRealizedCells: 820 -> 1111 (same realized "*/*" grid as Fast;
the 820/1257 split was pre-existing drift)
HeadlineWeekly's per-cell window shortened 60s/15s -> 40s/12s: the longer
window no longer fits the grown grid in 24h (1111 x 92s = 28.4h), the
shorter one does (1111 x 69s + ~0.7h rated = ~22.0h < 24h). All budget
invariants (TestWeeklyConfigFitsBudget / TestFastFitsWithin24h) pass with
the true count instead of a stale pin. Comment arithmetic updated to match.
v1.5.4 benchmark suite redesign (Part A)
Sharpens the 50-adapter × scenario grid for a thesis-defense audience: cut compute waste, deepen where signal lives, and make every comparison defensible. Grounded throughout in the live v1.5.3 results data.
W1: Prune NIC-saturated / redundant cells (f968b09)
Removed the static rows whose RPS is fabric-bound on the 20 Gbps link (RPS converges → not a ranking signal): `get-json-8k/16k/64k`, `post-8k/16k/64k`, and the two large-body h2 rows (`get-json-64k-h2`, `post-64k-h2`). Static H1 12→6, H2 4→2. Kept `post-1m` as the single documented wire-bound datapoint for the methodology section, and the full concurrency sweep (the io_uring crossover evidence).

W3: Driver depth 4 → 10 (67751c4, 682d853)
Turned the single-GET driver story into a multi-dimensional one: `driver-pg-write` (INSERT), `driver-pg-update-tx` (BEGIN/UPDATE/COMMIT), `driver-pg-read-range` (N-row SELECT), `driver-redis-set`, `driver-redis-pipeline` (the strongest native-vs-go-redis differentiator), `driver-mc-multiget`. Implemented across all 9 driver adapters (celeris native `.Async()` + idiomatic pgx/go-redis/gomemcache). Adds an unlogged `bench_writes` fixture table; also fixes a latent env-var split (`PROBATORIUM_MC_ADDR` → `PROBATORIUM_MEMCACHED_ADDR`) that was 503-ing memcached on half the adapters.

W4: WS/SSE competitors (1137fdf)
Native WebSocket + SSE for axum (Rust), hono (Bun), starlette (Python) matching the fixed wire contract (`/ws ?mode=ws-echo|ws-large-echo|ws-hub`, `/events` text/event-stream, 1 ms publish). Each was build- and runtime-smoke-verified locally (101 upgrade, echo round-trip, hub broadcast, SSE frame, large-echo). Flipped `Capabilities{WS,SSE}` on the three h1 columns.

W2: h2c fair-fight, by measurement (3f9dacc)
Audited the advertised `SETTINGS_INITIAL_WINDOW_SIZE` of every h2c column with a wire probe (loadgen adopts the server's advertised window as its per-stream upload cap, so a small window throttles POSTs independent of server speed). Result refutes the blanket-artifact premise: the entire Go field (incl. hertz) and both Rust columns already advertise celeris's 1 MiB window. Only Kestrel (768 KiB) and the Bun node:http2 columns (64 KiB) lagged; both are now equalized to 1 MiB. Disclosed caveat: the two hypercorn Python columns expose no per-stream window knob (they're the slowest columns regardless). Methodology recorded as a code comment above the h2c columns.

W5: Budget reconcile (c9e8715)
Live `cmd/runner -dry-run` resolves 1111 capability-gated cells/pass (was pinned 1257/820 across two drifted constants). Updated both; retuned the headline window (60s/15s → 40s/12s) so the grown grid still fits 24 h. All budget invariants pass with the true count.

Verification
`go build ./...` + `go test ./...` green. Rust/Bun/Python/C# adapters built and the patched h2c columns re-probed from rebuilt artifacts (all advertise 1 MiB). The macro gate is the next cluster bench run, which produces the v1.5.4 numbers.