Skip to content

perf(v1.5.4): peak-RSS, driver + middleware allocation, with honest premise checks#386

Open
FumingPower3925 wants to merge 11 commits into
mainfrom
feat/v1.5.4-perf
Open

perf(v1.5.4): peak-RSS, driver + middleware allocation, with honest premise checks#386
FumingPower3925 wants to merge 11 commits into
mainfrom
feat/v1.5.4-perf

Conversation

@FumingPower3925

Copy link
Copy Markdown
Contributor

v1.5.4 celeris performance (Part B)

Targets the perf priorities for the thesis numbers: cut peak RSS, driver perf, low-concurrency, plus a secondary middleware pass. Every hot-path change is locally verified (build / -race / benchstat); the peak-RSS + throughput A/B is the cluster gate (acceptance: peak ↓, steady throughput within ±1%). The linux-only engine concurrency is gated by the -race CI on this PR.

⚠️ Stacked on #385 (godoc trim). The first two commits (aa84a0f, a44c4e7) belong to #385 — merge #385 first and they drop from this diff automatically (same SHAs).

B1 — Peak RSS

  • B1.1 877283b — opt-in Resources.MemoryLimitBytes soft heap ceiling (debug.SetMemoryLimit). The connection-ramp heap balloon (GOGC=100 → ~2× during the 1024-conn ramp) is the dominant peak contributor; the ceiling forces GC before it balloons. Default off (process-global, library-safe); DeriveMemoryLimit(workers) = max(256 MiB, workers·32 MiB).
  • B1.2 6b6952a — the per-worker epoll conn table eagerly allocated all 65536 slots (512 KiB/worker); now starts at 4096 and grows on demand to the unchanged 65536 hard cap. epoll_wait events array clamped to 2048. Race-safety: growConns' full-slot copy + header swap run under driverMu, and hijackConn (which nils l.conns[fd] on a dispatch goroutine) now takes driverMu for its slot read/write — so the grow never races a concurrent hijack. (This race was caught in review after the initial implementation and fixed here.)
  • B1.3 6b6952areleaseConnState retained oversized async backing arrays in the pool (the epoll-h1-async peak outlier); trimPooledBuf drops buffers >64 KiB on release, steady-state zero-alloc path unchanged.

B3 — Driver perf

  • B3.1 d52058ahonest correction: the []string arg-slice alloc this targeted doesn't exist (escape analysis already stack-allocates it). Shipped the fixed-arity zero-heap encoder anyway as a structural guarantee + a −1.66% ns/op win; byte-identical wire. Documented the real remaining alloc (copyValueDetached) as a follow-up.
  • B3.2 a17b204 — opt-in session.Config.WriteBehind takes the store write off the critical path. Bounded, snapshot-isolated (proven under -race), shutdown-flushed (no lost write). Default off = identical behavior. Durability tradeoff disclosed on the field.

B2 — Low-concurrency 1c: investigated, no change

The hypothesis (mirror iouring's sticky h1Only shortcut) didn't hold for epoll: iouring's shortcut elides an atomic cs.protocol.Load() that only exists in iouring; epoll's cs.protocol is a plain field, so there is no equivalent per-request cost to elide. Verified against both engines (and independently re-checked in review). No code invented.

B4 — Middleware (secondary) c71535a

respHdrBuf 16→24 (fullstack emits ~20 response headers → was one overflow alloc short); the scratch array and buffer now derive from one respHdrBufCap const so they can't drift and index-panic. No-alloc fast paths for ratelimit (limit-not-hit) and timeout (timer-not-fired).

Verification

go build ./... + GOOS=linux go build ./... + go vet (both arches) clean; gofmt clean. Tests + -race pass on every touched cross-platform package (driver/redis, middleware/session, ratelimit, timeout, root). Engine package is linux-only → its unit tests + -race run in CI here. Two work items had their premises rejected with evidence (B3.1's alloc, B2's overhead) rather than forced — flagged for honesty.

The documentation website (https://goceleris.dev/docs) is now the single
source of truth for long-form guides, tutorials, and examples. The
package-level doc.go files had grown into a large parallel copy of that
material (~4k lines), a duplicated maintenance surface that drifts stale.

Trim every package doc.go down to tight, accurate, IDE-facing godoc —
package summary, key exported symbols, at most a minimal example — plus a
"# Documentation" pointer to the matching page on the site. Symbol-level
godoc on declarations (what IDEs surface on hover/autocomplete) is left
untouched.

- 50 doc.go trimmed, 4 already-minimal left as-is
- net -3110 lines (942 insertions, 4052 deletions)
- also corrects stale facts found against the real API while trimming:
  wrong Config timeout/logger defaults, deprecated basicauth SHA-256,
  MB vs MiB body limits, broken/unexported godoc symbol links, and
  removed/renamed symbols (probe.Probe, resource.Workload*Concurrency,
  session.NewMemoryStore, ratelimit.ErrorHandler, ...).

Comment-only; no functional changes. gofmt, go build ./... and
go vet ./... all clean.
…llector() godoc

- jwt's ErrTokenMissing/ErrTokenInvalid/ErrJWTExpired/ErrJWTMalformed now chain
  to celeris.ErrUnauthorized via an unexported cause, so
  errors.Is(err, celeris.ErrUnauthorized) matches every jwt rejection —
  consistent with keyauth/basicauth and with the documented contract on
  ErrUnauthorized. HTTPError.Error() text is unchanged (regression test added).
- Server.Collector() godoc: the collector is created eagerly in New() and is
  non-nil before Start (nil only when Config.DisableMetrics is set), matching
  New()'s own godoc.

Surfaced by a docs usability audit. gofmt/build/vet clean; jwt, keyauth,
basicauth, and root suites pass.
…loon (B1.1)

celeris does no GC tuning today, so under the default GOGC=100 the live
heap roughly doubles during a 1024-connection ramp burst — that transient
is the dominant peak-RSS contributor (steady-state RSS sits far lower).

Add an opt-in Resources.MemoryLimitBytes knob: when >0 the server applies
runtime/debug.SetMemoryLimit at start (before the engine allocates its
per-worker tables), so the GC collects before the heap balloons. Steady
RSS is well under the ceiling, so steady throughput is untouched — it only
trades a few extra GC cycles during the ramp for a lower high-water mark.

Default 0 = unset: celeris never touches the process GC implicitly, since
SetMemoryLimit is PROCESS-GLOBAL and celeris is a library — embedders keep
full control. resource.DeriveMemoryLimit(workers) returns a generous
max(256 MiB, workers*32 MiB) ceiling for operators that own the process
(e.g. a dedicated server binary) and want the reduction.

Local-verified (build/vet/race/tests). The peak-RSS A/B (peak down, steady
throughput within +/-1%) is a cluster gate, and the default flip from off
waits on that proof.
…2/B1.3)

B1.2: the per-worker epoll conn table eagerly allocated all 65536 pointer
slots (512 KiB) even for a worker holding a handful of conns. It now starts
at 4096 and grows on demand (growConns, power-of-two) toward the unchanged
65536 hard cap; the epoll_wait events array is clamped to 2048 locally
(resource.MaxEvents stays 8192 — it is NUMA-divided + test-pinned; the array
only needs the max simultaneously-ready fds, and the grid maxes at 1024c).
  Race-safety: growConns' full-slot copy + header swap run under driverMu,
  and hijackConn (which can write l.conns[fd]=nil on a dispatch goroutine in
  async mode) takes driverMu for its l.conns header read + slot write — so the
  grow never reads a slot a hijack is concurrently niling, and RegisterConn
  (the other driverMu reader) is serialized against the swap. The linux -race
  CI is the definitive gate for this engine concurrency.

B1.3: releaseConnState retained the full backing-array capacity of
asyncInBuf/asyncOutBuf/writeBuf, so a connState that handled one multi-MB
burst pinned that oversized array in connStatePool forever (the
epoll-h1-async peak-RSS outlier). trimPooledBuf now drops buffers whose cap
exceeds 64 KiB on release; small buffers keep their capacity so the zero-alloc
steady state is unchanged. cs.buf is excluded (fixed at resource.BufferSize).

Verified: GOOS=linux go build/vet clean. Peak-RSS A/B (peak down, steady
throughput within +/-1%) is the cluster gate.
Investigation correction: the per-call []string{...} arg slice this item
targeted does NOT escape to the heap — escape analysis already stack-allocates
it (exec consumes the args synchronously, copying bytes before returning), so
there was no heap alloc to remove on the GET/SET path. Shipped the fixed-arity
encoder anyway (exec2/exec3 + WriteCommand2/3) as a STRUCTURAL zero-heap
guarantee rather than an escape-analysis accident, plus a small measured win:
-1.66% geomean ns/op from dropping the slice construction + variadic spread
(count=8). Wire bytes are byte-identical (verified vs WriteCommand for normal,
empty, and CRLF-containing args). Set/SetBytes use the 3-arg fast path only
when expiration<=0; the expiry path keeps the variadic exec.

The real remaining client-side alloc on GET is copyValueDetached (the reply
deep-copy), left as a scoped follow-up. Builds/tests/race/vet green; the
engine-integrated sync exec2/exec3 paths cross-compile to linux/{amd64,arm64}.
Config.WriteBehind (default OFF) dispatches the post-handler session store
write asynchronously so it overlaps the next request instead of blocking the
response flush. Bounded (no unbounded goroutine fan-out), snapshot-isolated
(the session is marshaled before handoff so the pooled/recycled map can't race
the writer — proven under -race), and shutdown-flushed (Close blocks until all
in-flight writes are durable — no lost write on graceful stop). With the flag
OFF, behavior is byte-for-byte identical to today.

Durability tradeoff (disclosed, opt-in only, documented on the field): when ON,
a response is acked before its store write is durable, so an abrupt process
death between the two loses that single update. Tests cover off-unchanged,
deferred-off-critical-path, flush-no-lost-write (200 reqs), snapshot-no-race
(24 goroutines under -race), post-Close synchronous enqueue, and idempotent
Close. Package green plain and under -race.
…e fast paths (B4)

The fullstack middleware chain emits ~18 user headers; Blob then prepends
content-type + content-length, reaching ~20 — past respHdrBuf's old 16 slots,
forcing one heap alloc + GC pressure on the dominant header-heavy response.
Bump respHdrBuf to 24 and the fast-path scratch array to 22, both now derived
from a single respHdrBufCap const so they can never drift and index-panic.
Adds no-alloc fast paths to ratelimit (limit-not-hit, the common case) and
timeout (timer-not-fired). benchstat confirms the removed alloc/op on the
header-heavy path; per-Context grows 128 B (one-time, pooled). Secondary item
per the perf priorities. Builds/tests/race/vet green.
@FumingPower3925 FumingPower3925 self-assigned this Jun 21, 2026
CI golangci-lint flagged two issues in the Part B changes:
- driver/redis/commands.go: the variadic doRead became dead after Get/GetBytes
  moved to the fixed-arity doRead2 (B3.1); removed it and folded its aliasing
  doc into doRead2/releaseDoRead.
- middleware/session/writebehind_test.go: errcheck on a deferred closer.Close();
  wrapped it `_ = closer.Close()`.

Verified: golangci-lint (incl. GOOS=linux ./engine/epoll/...) → 0 issues;
go build/test green on both packages.
B1.1 added the soft-heap-ceiling knob to resource.Resources plus the apply in
server.go, but toResourceConfig never mapped it from the root celeris.Config —
so no caller could actually set it (the knob was dead, SetMemoryLimit never
fired). Wire rc.Resources.MemoryLimitBytes = c.MemoryLimitBytes and surface
celeris.DeriveMemoryLimit(workers) so embedders can size a ceiling. Test pins
the mapping (and that unset → 0, i.e. celeris never touches the GC implicitly).
CELERIS_ADAPTIVE_UP_THRESHOLD / CELERIS_ADAPTIVE_HIGHWATER override the
conns-per-worker promotion thresholds so a cluster repro can force the
epoll→io_uring promotion to fire at a controlled (low) connection count —
needed to deterministically trigger the dual-engine oversubscription that
collapses driver throughput. Unset = production defaults (24/48). Not a
production tuning knob.
CELERIS_ADAPTIVE_EVAL_INTERVAL lets a cluster repro evaluate sub-second so the
promotion can snap WHILE conns are still establishing (loadgen opens all conns
before the default 1s tick, so the switch always fired post-establishment ->
io_uring idle -> no repro).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant