arm64: celeris takes the host down under sustained load (~20-30min) — not reproduced on amd64 #312

@FumingPower3925

Summary

On the arm64 cluster node (msr1), running celeris under sustained HTTP load reliably makes the whole host go unreachable (hard network/host death, requires a power-cycle) within ~10-30 min. The same workload on amd64 runs for hours, and non-celeris servers on the same arm64 node survive the identical load. So this is a celeris-on-arm64 interaction, not node hardware.

Controlled experiments (probatorium cluster, msr1 = arm64 bench node)

| Load | Duration before host death |
|---|---|
| Full bench grid, celeris columns | crashed 3×: during epoll-h1-sync (run #1, ~34 min) and during iouring cells (run #3 / short run) |
| Non-celeris (actix-web): get-json + churn-close + post-4k + get-json-64k | survived 34 min, failed=0 |
| celeris-only (epoll-h1-sync): same 4 scenarios + ws/sse | crashed at ~22 min |
| Same celeris workload on amd64 (msa2-server) | runs for hours, no host issue |
| Pure CPU burn (12 cores, 2 min) | peak 59°C, stable |

The celeris-only crash happened during post-4k / get-json-64k (3rd-4th scenario), before any ws/sse scenario — so it is plain sustained HTTP load, not a streaming edge.

What it is NOT

  • Not node hardware: msr1 survives 34 min of non-celeris load and a full CPU burn.
  • Not thermal: temp at crash ~60°C; 59°C under a full 12-core burn.
  • Not overload: load ~6 of 12 cores at crash time.
  • Not reproduced on amd64: the identical celeris workload is stable there.

No trace

msr1 has no pstore/ramoops, so the previous-boot kernel log is empty after the hard reset — no panic/OOM/oops captured. Both epoll and iouring engines have been implicated across runs.

Candidates to investigate

  • arm64 NEON SIMD assembly paths (H1 parser SIMD) — a misaligned/UB access that corrupts state or faults badly.
  • A syscall / socket / NIC-driver pattern celeris drives that destabilizes the arm64 kernel (the 20G NIC under celeris's specific accept/close/send cadence).
  • To capture a trace: enable netconsole or a serial console on the arm64 node, or wire pstore-ram, then reproduce.

Repro

celeris-epoll-h1-sync under sustained get-json/post-4k/get-json-64k at 256 conns on the arm64 node; the host dies after ~20-30 min. Env: kernel 7.0.0-14-generic, arm64.

Pre-existing (present in v1.4.12); independent of #310 (which fixes #309 + #311). Flagging as an arm64 reliability issue to triage before relying on arm64 under sustained load.
