Summary
On the arm64 cluster node (msr1), running celeris under sustained HTTP load reliably makes the whole host go unreachable (hard network/host death, requires a power-cycle) within ~10-30 min. The same workload on amd64 runs for hours, and non-celeris servers on the same arm64 node survive the identical load. So this is a celeris-on-arm64 interaction, not node hardware.
Controlled experiments (probatorium cluster, msr1 = arm64 bench node)
| Load | Duration before host death |
|---|---|
| Full bench grid, celeris columns | crashed 3×: during epoll-h1-sync (run #1, ~34 min) and during iouring cells (run #3 / short run) |
| Non-celeris (actix-web): get-json + churn-close + post-4k + get-json-64k | survived 34 min, failed=0 |
| celeris-only (epoll-h1-sync): same 4 scenarios + ws/sse | crashed at ~22 min |
| Same celeris workload on amd64 (msa2-server) | runs for hours, no host issue |
| Pure CPU burn (12 cores, 2 min) | peak 59°C, stable |
The celeris-only crash happened during post-4k / get-json-64k (3rd-4th scenario), before any ws/sse scenario — so it is plain sustained HTTP load, not a streaming edge.
What it is NOT
- Not node hardware: msr1 survives 34 min of non-celeris load and a full CPU burn.
- Not thermal: temp at crash ~60°C; 59°C under a full 12-core burn.
- Not overload: load ~6 of 12 cores at crash time.
- Not present on amd64: the identical celeris workload is stable there.
No trace
msr1 has no pstore/ramoops, so the previous-boot kernel log is empty after the hard reset — no panic/OOM/oops captured. Both epoll and iouring engines have been implicated across runs.
Candidates to investigate
- arm64 NEON SIMD assembly paths (H1 parser SIMD) — a misaligned/UB access that corrupts state or faults badly.
- A syscall / socket / NIC-driver pattern celeris drives that destabilizes the arm64 kernel (the 20G NIC under celeris's specific accept/close/send cadence).
- To capture a trace: enable netconsole or a serial console on the arm64 node, or wire pstore-ram, then reproduce.
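For the netconsole option, the receiving side can be as small as a UDP listener on a second machine that timestamps whatever msr1 sends before it dies. The following is a minimal sketch, not a tested setup for this cluster: it assumes netconsole on msr1 is pointed at the receiver on UDP port 6666 (netconsole's default target port), and the log file name is a placeholder. The sender-side configuration is not shown.

```python
import socket
import time

# Assumption: matches the target port configured for netconsole on the sender.
NETCONSOLE_PORT = 6666


def format_record(payload: bytes, sender: str, now: float) -> str:
    """Prefix one received datagram with a UTC receive time and the sender address."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender} {text}\n"


def receive(sock: socket.socket, log_path: str, max_datagrams=None) -> int:
    """Append each datagram to log_path, flushing after every record.

    Runs until max_datagrams records are written (forever if None) and
    returns the number of records written.
    """
    count = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_datagrams is None or count < max_datagrams:
            payload, (host, _port) = sock.recvfrom(65535)
            log.write(format_record(payload, host, time.time()))
            log.flush()
            count += 1
    return count


# Usage, on a second machine (not msr1); the file name is a placeholder:
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("0.0.0.0", NETCONSOLE_PORT))
#   receive(sock, "msr1-netconsole.log")
```

Flushing per record keeps the last lines before the host death on disk even if the receiver is interrupted while waiting for more.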
Repro
Run celeris-epoll-h1-sync under sustained get-json/post-4k/get-json-64k load at 256 conns on the arm64 node; the host dies after ~20-30 min. Env: kernel 7.0.0-14-generic, arm64.
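Because the host dies without leaving a log, it helps to timestamp the death from outside so it can be lined up with the scenario that was running. The sketch below is one way to do that, run from a different machine: it probes a TCP port on the node and reports the last successful probe and the first failure of the final failing streak. The probed port, the interval, and the failure threshold are assumptions for illustration, not part of the bench setup.

```python
import socket
import time


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port, interval=5.0, max_failures=3,
          probe=is_reachable, clock=time.time, sleep=time.sleep):
    """Probe until max_failures consecutive probes fail.

    Returns (last_ok, first_fail): the clock value of the last successful
    probe (None if there was none) and of the first probe in the final
    failing streak. The host went down between those two instants.
    """
    last_ok = None
    first_fail = None
    failures = 0
    while True:
        now = clock()
        if probe(host, port):
            # A success resets the streak, so brief blips are not reported.
            last_ok, first_fail, failures = now, None, 0
        else:
            if failures == 0:
                first_fail = now
            failures += 1
            if failures >= max_failures:
                return last_ok, first_fail
        sleep(interval)


# Usage (hypothetical: assumes msr1 resolves and has SSH listening on port 22):
#   last_ok, first_fail = watch("msr1", 22)
```

Requiring several consecutive failures avoids mistaking a single dropped probe under load for the hard host death.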
Pre-existing (present in v1.4.12); independent of #310 (which fixes #309 + #311). Flagging as an arm64 reliability issue to triage before relying on arm64 under sustained load.