perf(v1.5.4): peak-RSS, driver + middleware allocation, with honest premise checks#386
Open
FumingPower3925 wants to merge 11 commits into
Open
perf(v1.5.4): peak-RSS, driver + middleware allocation, with honest premise checks#386FumingPower3925 wants to merge 11 commits into
FumingPower3925 wants to merge 11 commits into
Conversation
The documentation website (https://goceleris.dev/docs) is now the single source of truth for long-form guides, tutorials, and examples. The package-level doc.go files had grown into a large parallel copy of that material (~4k lines), a duplicated maintenance surface that drifts stale. Trim every package doc.go down to tight, accurate, IDE-facing godoc — package summary, key exported symbols, at most a minimal example — plus a "# Documentation" pointer to the matching page on the site. Symbol-level godoc on declarations (what IDEs surface on hover/autocomplete) is left untouched. - 50 doc.go trimmed, 4 already-minimal left as-is - net -3110 lines (942 insertions, 4052 deletions) - also corrects stale facts found against the real API while trimming: wrong Config timeout/logger defaults, deprecated basicauth SHA-256, MB vs MiB body limits, broken/unexported godoc symbol links, and removed/renamed symbols (probe.Probe, resource.Workload*Concurrency, session.NewMemoryStore, ratelimit.ErrorHandler, ...). Comment-only; no functional changes. gofmt, go build ./... and go vet ./... all clean.
…llector() godoc - jwt's ErrTokenMissing/ErrTokenInvalid/ErrJWTExpired/ErrJWTMalformed now chain to celeris.ErrUnauthorized via an unexported cause, so errors.Is(err, celeris.ErrUnauthorized) matches every jwt rejection — consistent with keyauth/basicauth and with the documented contract on ErrUnauthorized. HTTPError.Error() text is unchanged (regression test added). - Server.Collector() godoc: the collector is created eagerly in New() and is non-nil before Start (nil only when Config.DisableMetrics is set), matching New()'s own godoc. Surfaced by a docs usability audit. gofmt/build/vet clean; jwt, keyauth, basicauth, and root suites pass.
…loon (B1.1) celeris does no GC tuning today, so under the default GOGC=100 the live heap roughly doubles during a 1024-connection ramp burst — that transient is the dominant peak-RSS contributor (steady-state RSS sits far lower). Add an opt-in Resources.MemoryLimitBytes knob: when >0 the server applies runtime/debug.SetMemoryLimit at start (before the engine allocates its per-worker tables), so the GC collects before the heap balloons. Steady RSS is well under the ceiling, so steady throughput is untouched — it only trades a few extra GC cycles during the ramp for a lower high-water mark. Default 0 = unset: celeris never touches the process GC implicitly, since SetMemoryLimit is PROCESS-GLOBAL and celeris is a library — embedders keep full control. resource.DeriveMemoryLimit(workers) returns a generous max(256 MiB, workers*32 MiB) ceiling for operators that own the process (e.g. a dedicated server binary) and want the reduction. Local-verified (build/vet/race/tests). The peak-RSS A/B (peak down, steady throughput within +/-1%) is a cluster gate, and the default flip from off waits on that proof.
…2/B1.3) B1.2: the per-worker epoll conn table eagerly allocated all 65536 pointer slots (512 KiB) even for a worker holding a handful of conns. It now starts at 4096 and grows on demand (growConns, power-of-two) toward the unchanged 65536 hard cap; the epoll_wait events array is clamped to 2048 locally (resource.MaxEvents stays 8192 — it is NUMA-divided + test-pinned; the array only needs the max simultaneously-ready fds, and the grid maxes at 1024c). Race-safety: growConns' full-slot copy + header swap run under driverMu, and hijackConn (which can write l.conns[fd]=nil on a dispatch goroutine in async mode) takes driverMu for its l.conns header read + slot write — so the grow never reads a slot a hijack is concurrently niling, and RegisterConn (the other driverMu reader) is serialized against the swap. The linux -race CI is the definitive gate for this engine concurrency. B1.3: releaseConnState retained the full backing-array capacity of asyncInBuf/asyncOutBuf/writeBuf, so a connState that handled one multi-MB burst pinned that oversized array in connStatePool forever (the epoll-h1-async peak-RSS outlier). trimPooledBuf now drops buffers whose cap exceeds 64 KiB on release; small buffers keep their capacity so the zero-alloc steady state is unchanged. cs.buf is excluded (fixed at resource.BufferSize). Verified: GOOS=linux go build/vet clean. Peak-RSS A/B (peak down, steady throughput within +/-1%) is the cluster gate.
Investigation correction: the per-call []string{...} arg slice this item
targeted does NOT escape to the heap — escape analysis already stack-allocates
it (exec consumes the args synchronously, copying bytes before returning), so
there was no heap alloc to remove on the GET/SET path. Shipped the fixed-arity
encoder anyway (exec2/exec3 + WriteCommand2/3) as a STRUCTURAL zero-heap
guarantee rather than an escape-analysis accident, plus a small measured win:
-1.66% geomean ns/op from dropping the slice construction + variadic spread
(count=8). Wire bytes are byte-identical (verified vs WriteCommand for normal,
empty, and CRLF-containing args). Set/SetBytes use the 3-arg fast path only
when expiration<=0; the expiry path keeps the variadic exec.
The real remaining client-side alloc on GET is copyValueDetached (the reply
deep-copy), left as a scoped follow-up. Builds/tests/race/vet green; the
engine-integrated sync exec2/exec3 paths cross-compile to linux/{amd64,arm64}.
Config.WriteBehind (default OFF) dispatches the post-handler session store write asynchronously so it overlaps the next request instead of blocking the response flush. Bounded (no unbounded goroutine fan-out), snapshot-isolated (the session is marshaled before handoff so the pooled/recycled map can't race the writer — proven under -race), and shutdown-flushed (Close blocks until all in-flight writes are durable — no lost write on graceful stop). With the flag OFF, behavior is byte-for-byte identical to today. Durability tradeoff (disclosed, opt-in only, documented on the field): when ON, a response is acked before its store write is durable, so an abrupt process death between the two loses that single update. Tests cover off-unchanged, deferred-off-critical-path, flush-no-lost-write (200 reqs), snapshot-no-race (24 goroutines under -race), post-Close synchronous enqueue, and idempotent Close. Package green plain and under -race.
…e fast paths (B4) The fullstack middleware chain emits ~18 user headers; Blob then prepends content-type + content-length, reaching ~20 — past respHdrBuf's old 16 slots, forcing one heap alloc + GC pressure on the dominant header-heavy response. Bump respHdrBuf to 24 and the fast-path scratch array to 22, both now derived from a single respHdrBufCap const so they can never drift and index-panic. Adds no-alloc fast paths to ratelimit (limit-not-hit, the common case) and timeout (timer-not-fired). benchstat confirms the removed alloc/op on the header-heavy path; per-Context grows 128 B (one-time, pooled). Secondary item per the perf priorities. Builds/tests/race/vet green.
CI golangci-lint flagged two issues in the Part B changes: - driver/redis/commands.go: the variadic doRead became dead after Get/GetBytes moved to the fixed-arity doRead2 (B3.1); removed it and folded its aliasing doc into doRead2/releaseDoRead. - middleware/session/writebehind_test.go: errcheck on a deferred closer.Close(); wrapped it `_ = closer.Close()`. Verified: golangci-lint (incl. GOOS=linux ./engine/epoll/...) → 0 issues; go build/test green on both packages.
abbbe41 to
5b0ad3c
Compare
B1.1 added the soft-heap-ceiling knob to resource.Resources plus the apply in server.go, but toResourceConfig never mapped it from the root celeris.Config — so no caller could actually set it (the knob was dead, SetMemoryLimit never fired). Wire rc.Resources.MemoryLimitBytes = c.MemoryLimitBytes and surface celeris.DeriveMemoryLimit(workers) so embedders can size a ceiling. Test pins the mapping (and that unset → 0, i.e. celeris never touches the GC implicitly).
CELERIS_ADAPTIVE_UP_THRESHOLD / CELERIS_ADAPTIVE_HIGHWATER override the conns-per-worker promotion thresholds so a cluster repro can force the epoll→io_uring promotion to fire at a controlled (low) connection count — needed to deterministically trigger the dual-engine oversubscription that collapses driver throughput. Unset = production defaults (24/48). Not a production tuning knob.
CELERIS_ADAPTIVE_EVAL_INTERVAL lets a cluster repro evaluate sub-second so the promotion can snap WHILE conns are still establishing (loadgen opens all conns before the default 1s tick, so the switch always fired post-establishment -> io_uring idle -> no repro).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
v1.5.4 celeris performance (Part B)
Targets the perf priorities for the thesis numbers: cut peak RSS, driver perf, low-concurrency, plus a secondary middleware pass. Every hot-path change is locally verified (build /
-race/benchstat); the peak-RSS + throughput A/B is the cluster gate (acceptance: peak ↓, steady throughput within ±1%). The linux-only engine concurrency is gated by the-raceCI on this PR.B1 — Peak RSS
877283b— opt-inResources.MemoryLimitBytessoft heap ceiling (debug.SetMemoryLimit). The connection-ramp heap balloon (GOGC=100 → ~2× during the 1024-conn ramp) is the dominant peak contributor; the ceiling forces GC before it balloons. Default off (process-global, library-safe);DeriveMemoryLimit(workers)=max(256 MiB, workers·32 MiB).6b6952a— the per-worker epoll conn table eagerly allocated all 65536 slots (512 KiB/worker); now starts at 4096 and grows on demand to the unchanged 65536 hard cap.epoll_waitevents array clamped to 2048. Race-safety:growConns' full-slot copy + header swap run underdriverMu, andhijackConn(which nilsl.conns[fd]on a dispatch goroutine) now takesdriverMufor its slot read/write — so the grow never races a concurrent hijack. (This race was caught in review after the initial implementation and fixed here.)6b6952a—releaseConnStateretained oversized async backing arrays in the pool (the epoll-h1-async peak outlier);trimPooledBufdrops buffers >64 KiB on release, steady-state zero-alloc path unchanged.B3 — Driver perf
d52058a— honest correction: the[]stringarg-slice alloc this targeted doesn't exist (escape analysis already stack-allocates it). Shipped the fixed-arity zero-heap encoder anyway as a structural guarantee + a −1.66% ns/op win; byte-identical wire. Documented the real remaining alloc (copyValueDetached) as a follow-up.a17b204— opt-insession.Config.WriteBehindtakes the store write off the critical path. Bounded, snapshot-isolated (proven under-race), shutdown-flushed (no lost write). Default off = identical behavior. Durability tradeoff disclosed on the field.B2 — Low-concurrency 1c: investigated, no change
The hypothesis (mirror iouring's sticky
h1Onlyshortcut) didn't hold for epoll: iouring's shortcut elides an atomiccs.protocol.Load()that only exists in iouring; epoll'scs.protocolis a plain field, so there is no equivalent per-request cost to elide. Verified against both engines (and independently re-checked in review). No code invented.B4 — Middleware (secondary)
c71535arespHdrBuf16→24 (fullstack emits ~20 response headers → was one overflow alloc short); the scratch array and buffer now derive from onerespHdrBufCapconst so they can't drift and index-panic. No-alloc fast paths for ratelimit (limit-not-hit) and timeout (timer-not-fired).Verification
go build ./...+GOOS=linux go build ./...+go vet(both arches) clean;gofmtclean. Tests +-racepass on every touched cross-platform package (driver/redis, middleware/session, ratelimit, timeout, root). Engine package is linux-only → its unit tests +-racerun in CI here. Two work items had their premises rejected with evidence (B3.1's alloc, B2's overhead) rather than forced — flagged for honesty.