arm64: celeris takes the host down under sustained load (~20-30 min); not reproduced on amd64 (#312)

## Summary
On the arm64 cluster node (msr1), running celeris under sustained HTTP load reliably makes the **whole host go unreachable** (hard network/host death that requires a power-cycle) within ~20-30 min. The same workload on amd64 runs for hours, and **non-celeris** servers on the same arm64 node survive the identical load. This points to a celeris-on-arm64 interaction rather than faulty node hardware.

## Controlled experiments (probatorium cluster, msr1 = arm64 bench node)
| Workload | Outcome |
|---|---|
| Full bench grid, **celeris** columns | crashed 3× — during epoll-h1-sync (run #1, ~34min) and during iouring cells (run #3 / short run) |
| **Non-celeris** (actix-web): get-json + churn-close + post-4k + get-json-64k | **survived 34 min**, failed=0 |
| **celeris-only** (epoll-h1-sync): same 4 scenarios + ws/sse | **crashed at ~22 min** |
| Same celeris workload on **amd64** (msa2-server) | runs for hours, no host issue |
| Pure CPU burn (12 cores, 2 min) | peak 59°C, stable |

The celeris-only crash happened during `post-4k` / `get-json-64k` (3rd-4th scenario), **before** any ws/sse scenario — so it is plain sustained HTTP load, not a streaming edge.

## What it is NOT
- **Not node hardware**: msr1 survives 34 min of non-celeris load and a full CPU burn.
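- **Not thermal**: temperature at crash was ~60°C, essentially the same as the 59°C peak the node held stably under a full 12-core burn.
- **Not overload**: load was ~6 against 12 available cores at crash time.
- **Not a platform-independent celeris bug**: the identical celeris workload is stable on amd64.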
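The thermal and load figures above are point readings. Since the host dies without leaving a log, a small sampler that fsyncs each reading to disk would keep the last values before the hard reset. The following is a minimal sketch, not the tooling used for the numbers above; it assumes msr1 exposes temperatures under `/sys/class/thermal/thermal_zone*/temp` (millidegrees Celsius), and the log file name and interval are placeholders.

```python
import glob
import os
import time


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from the contents of /proc/loadavg."""
    return float(text.split()[0])


def parse_millidegrees(text: str) -> float:
    """Convert a sysfs thermal reading (millidegrees C) to degrees C."""
    return int(text.strip()) / 1000.0


def sample() -> str:
    """Build one log line: epoch seconds, 1-min load, hottest thermal zone."""
    with open("/proc/loadavg") as f:
        load = parse_loadavg(f.read())
    temps = []
    # Assumption: thermal zones are exposed at this sysfs path on the node.
    for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
        try:
            with open(path) as f:
                temps.append(parse_millidegrees(f.read()))
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or parsed
    hottest = max(temps) if temps else float("nan")
    return f"{time.time():.0f} load1={load:.2f} temp_max_c={hottest:.1f}"


def run(log_path: str = "host-samples.log", interval_s: float = 5.0) -> None:
    """Append a sample every interval_s seconds, fsyncing after each write."""
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        while True:
            os.write(fd, (sample() + "\n").encode())
            os.fsync(fd)  # push the line to stable storage before sleeping
            time.sleep(interval_s)
    finally:
        os.close(fd)
```

Calling `run()` in a separate process alongside the benchmark, then reading the tail of the log after the power-cycle, would show whether temperature or load moved in the final seconds.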

## No trace
msr1 has no pstore/ramoops, so the previous-boot kernel log is empty after the hard reset: no panic, OOM, or oops was captured. Both the epoll and iouring engines have been implicated across runs.

## Candidates to investigate
- arm64 **NEON SIMD** assembly paths (H1 parser SIMD): a misaligned or undefined-behavior memory access that corrupts state or triggers a fault.
- A syscall / socket / NIC-driver pattern that celeris drives and that destabilizes the arm64 kernel (for example, the 20G NIC under celeris's specific accept/close/send cadence).
- To capture a trace: enable netconsole or a serial console on the arm64 node, or wire pstore-ram, then reproduce.
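If netconsole is the chosen capture route, something on another machine has to receive the UDP datagrams, because msr1 itself will be gone. The following is a minimal receiver sketch. It assumes netconsole on msr1 is pointed at this machine on UDP port 6666 (netconsole's default target port); the log file name is a placeholder.

```python
import socket
import time


def format_record(payload: bytes, sender: str, now: float) -> str:
    """Turn one netconsole datagram into a timestamped log line."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{stamp} {sender} {text}"


def receive(bind_addr: str = "0.0.0.0", port: int = 6666,
            log_path: str = "msr1-netconsole.log") -> None:
    """Append every received datagram to log_path, flushing after each one."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(log_path, "a", encoding="utf-8") as log:
        sock.bind((bind_addr, port))
        while True:
            payload, (host, _src_port) = sock.recvfrom(65535)
            log.write(format_record(payload, host, time.time()) + "\n")
            log.flush()  # keep the file current in case the receiver is killed
```

Running `receive()` on a stable host (not msr1) before starting the repro means any panic or oops text the kernel manages to emit is timestamped and kept, even though the node needs a power-cycle afterwards.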

## Repro
Run `celeris-epoll-h1-sync` under sustained get-json/post-4k/get-json-64k load at 256 conns on the arm64 node; the host dies after ~20-30 min. Env: kernel 7.0.0-14-generic, arm64.
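For anyone without access to the bench grid, the shape of this load can be approximated with a small keep-alive client. This is a rough sketch only and is not the bench harness used above. The routes `/json`, `/upload`, and `/json-64k` are hypothetical stand-ins for the get-json, post-4k, and get-json-64k scenarios, and it interleaves them per connection rather than running them one scenario at a time.

```python
import http.client
import threading
import time

# Hypothetical routes standing in for get-json, post-4k and get-json-64k.
SCENARIOS = [
    ("GET", "/json", None),
    ("POST", "/upload", b"x" * 4096),
    ("GET", "/json-64k", None),
]


def worker(host, port, deadline, counts, lock):
    """Drive one keep-alive connection until the deadline passes."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    ok = failed = i = 0
    while time.monotonic() < deadline:
        method, path, body = SCENARIOS[i % len(SCENARIOS)]
        i += 1
        try:
            conn.request(method, path, body=body)
            resp = conn.getresponse()
            resp.read()  # drain the body so the connection can be reused
            if resp.status < 500:
                ok += 1
            else:
                failed += 1
        except (OSError, http.client.HTTPException):
            failed += 1
            conn.close()  # the next request() reconnects automatically
    conn.close()
    with lock:
        counts["ok"] += ok
        counts["failed"] += failed


def run_load(host, port, conns=256, duration_s=1800.0):
    """Run `conns` concurrent connections for duration_s seconds."""
    counts = {"ok": 0, "failed": 0}
    lock = threading.Lock()
    deadline = time.monotonic() + duration_s
    threads = [
        threading.Thread(target=worker, args=(host, port, deadline, counts, lock))
        for _ in range(conns)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

A call such as `run_load("msr1", 8080)` (host name and port are placeholders) from a separate client machine would hold 256 connections for 30 minutes. A Python client is unlikely to match the request rate of the real harness, so a clean run with this sketch does not rule the bug out.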

Pre-existing (present in v1.4.12); independent of #310 (which fixes #309 + #311). Flagging as an arm64 reliability issue to triage before relying on arm64 under sustained load.
