Add cephadm + radosgw E2E coverage for quincy → tentacle by taodd · Pull Request #106 · taodd/cephtrace

taodd · 2026-05-21T14:21:55Z

Summary

New end-to-end test that exercises both osdtrace and radostrace against a cephadm-deployed cluster across the full quincy / reef / squid / tentacle line (tests/functional-test-cephadm-rgw.sh, reusable helpers in tests/lib/cephadm-setup.sh, LXD-VM dev wrapper in tests/local-cephadm-test.sh).
New build-ubuntu-cephadm matrix job in .github/workflows/pr-build.yaml: {ubuntu-22.04, ubuntu-24.04} × {quincy, reef, squid, tentacle}, fail-fast: false, 35-minute timeout per cell.
Added quincy 17.2.8 and 17.2.9 embedded DWARF JSONs (centos-stream) for both osdtrace and radostrace. 17.2.8 matches the quay.io/ceph/ceph:v17.2.8 image used by the cephadm test; 17.2.9 covers RPM-deployed production clusters at the latest quincy point release (no v17.2.9 container was ever published).
Tool tweak: osdtrace / radostrace now skip the deleted-binary / deleted-library guard when -j (JSON export) is set, so the export path can parse the on-disk (possibly newly-upgraded) binary regardless of what's still mmap'd by the live process.

Test plan

Local validation via tests/local-cephadm-test.sh on Ubuntu 24.04 host: all four releases PASS
- squid: osdtrace 1243 rows, radostrace 2410 rows
- reef: osdtrace 901, radostrace 1941
- tentacle: osdtrace 1631, radostrace 3076
- quincy: osdtrace 824, radostrace 1782
CI matrix (build-ubuntu-cephadm) green across all 8 cells (2 OS × 4 releases)
Existing CI jobs (build-ubuntu, build-ubuntu-arm64, build-centos, build-rocky, clang-tidy) still pass

Adds an end-to-end test that exercises both osdtrace and radostrace against a cephadm-deployed cluster, validating the embedded DWARF data across the full quincy/reef/squid/tentacle line. Locally this runs inside a disposable LXD VM; in CI it runs directly on the GHA runner.

Pieces
------
* tests/lib/cephadm-setup.sh -- reusable helpers: image-tag map per release, distro/apt or quay.io-fallback cephadm install, LVM-wrapped loopback OSDs (cephadm's inventory filters raw /dev/loopX), single-host bootstrap, healthy-cluster wait, RGW deploy, RGW user creation.
* tests/functional-test-cephadm-rgw.sh -- per-release E2E: 1 MON+MGR + 3 OSDs + 1 radosgw, S3 PUT/GET workload, parallel osdtrace + radostrace trace, row-count + range + latency assertions.
* tests/local-cephadm-test.sh -- developer wrapper that runs the test in a fresh ubuntu:24.04 LXD VM (one per release). Keeps the VM on failure for post-mortem; forwards KEEP_CLUSTER for in-VM debugging.

CI integration
--------------
* New build-ubuntu-cephadm matrix job in .github/workflows/pr-build.yaml: {ubuntu-22.04, ubuntu-24.04} × {quincy, reef, squid, tentacle}, with fail-fast disabled and a 35-minute timeout per cell.

Embedded DWARF data
-------------------
* files/centos-stream/{osdtrace,radostrace}/{osd,rados}-2:17.2.8-0.el9_dwarf.json
* files/centos-stream/{osdtrace,radostrace}/{osd,rados}-2:17.2.9-0.el9_dwarf.json

17.2.8 covers the quay.io/ceph/ceph:v17.2.8 image used by the cephadm test (no v17.2.9 container was ever published). 17.2.9 covers RPM-deployed production clusters at the latest quincy point release.

Tool tweak
----------
* osdtrace / radostrace: skip the deleted-binary / deleted-library guard when -j (JSON export) is set. The export path intentionally wants to parse the *on-disk* (possibly newly-upgraded) binary so the recorded DWARF metadata matches the package version stamped into the JSON, rather than whatever stale image happens to still be mmap'd by the running process.

The flag is only present in cephadm releases newer than what Ubuntu 22.04 ships in its apt archive, causing 'unrecognized arguments' across all four release cells of the build-ubuntu-cephadm matrix on 22.04. It was only useful for inspecting partial bootstraps during local debugging; CI doesn't need it, and the cephadm default behaviour (auto-cleanup on failure) is the right one there.

The functional test's cleanup() trap unlinks the osdtrace / radostrace output logs unconditionally on exit, so by the time a failing CI job returns to the workflow runner there is nothing left on disk besides the tail -50 dump the trap echoed into stdout. When a rare event trips a verifier check (e.g. an underflowed op_lat value beyond the TRACE_MAX_LATENCY_US bound), the offending row is almost never within those 50 lines.

Two changes:

* tests/functional-test-cephadm-rgw.sh: capture $? at the top of cleanup() and skip the rm -f when the script is exiting non-zero (or when KEEP_CLUSTER=1, the existing local-debug switch).
* .github/workflows/pr-build.yaml: in build-ubuntu-cephadm, stage the surviving /tmp/*trace-cephadm-*.log files into the workspace on failure and upload them as a per-cell artifact (retained 14 days, named trace-logs-<os>-<release>). Lets us pull the full trace from the run page with `gh run download` for later analysis.

Rewrite _osdtrace_rows and _verify_osdtrace_output_impl in tests/lib/verify-trace-output.sh to capture every field per op type and bound each sub-latency to the row's total op_lat.

Parser
------
* Three typed schemas keyed off $5 (op_r / subop_w / op_w), each emitted as a discriminator-prefixed pipe-separated line. Field positions mirror the three printf-format strings in src/osdtrace.cc:print_op_*.
* Strict row acceptance: each op type matched by exact NF *and* by literal field-name landmarks (`$24 == "op_lat"` for op_r, `$22 == "peers" && $39 == "op_lat"` for op_w, ...). This drops truncated/mid-printf rows that the old loose `$NF + 0` parser silently misread. In particular it filters out the rare row that was killed by `timeout`'s SIGKILL between the `peers` and `bluestore_lat` tokens, leaving the underflowed peer-latency token (`(-1, 18446743169577026)]`) as the final field; the old parser mistook that for the genuine op_lat and reported a spurious >100 s latency.

Per-row invariants
------------------
* Existing: osd_id within [0, max_osd_id]; total op_lat bounded by TRACE_MAX_LATENCY_US; pg seed within pg_num for test_pool rows.
* New: every named sub-latency (throttle/recv/dispatch/queue/osd/bluestore/prepare/aio_wait/seq_wait/kv_commit/peer*) must be <= the row's total op_lat. Underflowed BPF timestamps would otherwise present as huge µs values that quietly poison downstream analysis without triggering the (much looser) absolute upper bound.
* op_w peer slots with id == -1 are sentinel-padded; both their id and latency checks are skipped (the peer_lat in that slot is uninitialised garbage from the BPF prog).

Aggregates
----------
* Per-op-type row counts (op_r / subop_w / op_w) reported alongside the existing total-rows summary, both globally and for test_pool. Makes "we lost reads" / "we lost replica writes" regressions easy to spot from the test summary alone.

Existing call sites in functional-test-microceph.sh, functional-test-embedded-dwarf.sh and the RGW wrapper continue to use the same `verify_osdtrace_output` signature; the stricter checks now apply transparently.
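The two ideas in this change, strict row acceptance and the sub-latency bound, can be sketched as follows. This is an illustrative Python model rather than the awk in tests/lib/verify-trace-output.sh: the function names, the toy four-field row, and the dict-based row layout are assumptions made for the example, and the real field positions are the ones quoted in the commit message.

```python
SENTINEL_PEER = -1  # peer id that marks a padded, unused op_w peer slot


def accept_row(fields, expected_nf, landmarks):
    """Accept a row only if its field count and literal landmarks both match.

    `landmarks` maps 1-based positions (awk-style $N) to the token expected there.
    """
    if len(fields) != expected_nf:
        return False
    return all(fields[pos - 1] == token for pos, token in landmarks.items())


def sublatency_violations(op_lat, sub_lats, peers=()):
    """Return the names of sub-latencies that exceed the row's total op_lat."""
    bad = [name for name, value in sub_lats.items() if value > op_lat]
    for peer_id, peer_lat in peers:
        if peer_id == SENTINEL_PEER:
            continue  # padded slot: its latency is uninitialised, so skip it
        if peer_lat > op_lat:
            bad.append(f"peer{peer_id}")
    return bad
```

With this shape, a truncated row fails `accept_row` before any number is read, and an underflowed value such as 18446743169577026 is flagged because it cannot be smaller than a realistic total latency, even though it might pass a loose absolute bound check on a different field.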

SIGKILL of the writer at the trace-window deadline can leave the log file's tail in a byte-truncated state: libc was mid-flush and the final stdio write() syscall was interrupted between two of the chunks the buffer was split across. The byte-truncated row is always the last record the writer produced; everything before it landed via previously-completed atomic write()s.

The osdtrace parser's strict NF + landmark checks catch most of these truncations on their own, but byte-level truncation can occasionally leave a row whose NF happens to match an op type and whose tokens coincidentally pass each landmark check. The radostrace parser's NF >= 10 predicate is much looser and routinely admits truncated tails (e.g. an object name cut to "rbd" instead of "rbd_data.<hex>.<seq>"), which then trips the verifier's `^rbd_` object-prefix check and fails the entire test for what is structurally a non-bug.

Buffer the latest emit in `prev` and rely on awk's END-without-flush to drop it. Cost: one good row per trace, against thousands captured.
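The hold-one-row-back technique is easy to see in isolation. The sketch below is a Python generator that mirrors the `prev` buffering described above; the real parsers are awk, and the name `drop_last` is hypothetical.

```python
def drop_last(rows):
    """Yield every row except the final one, holding one row back in `prev`."""
    prev = None
    have_prev = False
    for row in rows:
        if have_prev:
            yield prev  # the previous row is now known not to be the last
        prev = row
        have_prev = True
    # Deliberately no final yield: the last buffered row may be byte-truncated.
```

Given the rows `rbd_data.a.1`, `rbd_data.a.2` and a truncated `rbd`, only the first two are emitted, so the object-prefix check never sees the damaged tail.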

…racking

A client retry of an op the OSD is still processing arrives via a fresh MOSDOp with the same `(client_id, tid)` as the original. Ceph handles the duplicate via `PrimaryLogPG::already_complete()` inside `do_op`, so it never goes through the full transaction path.

The problem is on the BPF side: `uprobe_enqueue_op` unconditionally overwrites the in-flight original's record via `bpf_map_update_elem(..., BPF_ANY)`. The local `value` struct is memset-zeroed first, so the overwrite wipes:

* dequeue_stamp (back to 0)
* peer0/peer1 slots (back to the -1 sentinel)
* every per-stage latency (osd_lat, bluestore_lat, prepare, aio_wait, seq_wait, kv_commit)

When the original op later completes, its `uprobe_log_op_stats` looks up the key, finds the retry's freshly-clobbered record, sets `reply_stamp = bpf_ktime_get_boot_ns()`, and emits a ringbuf event. Userspace then computes `queue_lat = 0 - retry_enqueue_stamp` which unsigned-underflows to `~UINT64_MAX − T_enq_retry` -- the failure mode reported in cephtrace#107.

Detect the case at the top of `uprobe_enqueue_op`: if an entry for the key already exists *and* its `enqueue_stamp` is recent (within 5 s), the new event is a client retry of an in-flight op -- return early without overwriting. Older entries (> 5 s) are orphans left by some completion uprobe that didn't fire for an earlier op; fall through to overwrite them so the map doesn't slowly leak.

bpf_printk diagnostics on both branches make it easy to tell, when debugging, whether a given run is hitting retries, orphans, or neither.
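Both the underflow and the guard can be reproduced outside the kernel. The following is a Python model under stated assumptions, not the BPF program: a plain dict stands in for the BPF map, the record holds only two of the real fields, and the names `u64_sub`, `enqueue` and `RETRY_WINDOW_NS` are invented for the example. Treating exactly 5 s as still "recent" is also an assumption.

```python
U64_MASK = (1 << 64) - 1
RETRY_WINDOW_NS = 5 * 1_000_000_000  # the 5 s threshold from the description


def u64_sub(a, b):
    """Subtract as a C uint64_t would, wrapping around on underflow."""
    return (a - b) & U64_MASK


def enqueue(inflight, key, now_ns):
    """Model of the guard; returns 'retry', 'orphan' or 'new'.

    `inflight` maps (client_id, tid) -> record dict, standing in for the BPF map.
    """
    existing = inflight.get(key)
    if existing is not None and u64_sub(now_ns, existing["enqueue_stamp"]) <= RETRY_WINDOW_NS:
        return "retry"  # in-flight original: leave its record untouched
    outcome = "orphan" if existing is not None else "new"
    # Fresh op or stale orphan: (re)initialise the record, as the uprobe does.
    inflight[key] = {"enqueue_stamp": now_ns, "dequeue_stamp": 0}
    return outcome
```

`u64_sub(0, 1_000)` evaluates to `2**64 - 1_000`, which is the shape of the bogus queue_lat: a zeroed dequeue_stamp minus a real enqueue timestamp. With the guard in place, a second enqueue for the same key one second later returns `"retry"` and the original's dequeue_stamp survives.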

The readiness loop accepted any cluster status string of HEALTH_OK or HEALTH_WARN, but a freshly bootstrapped MicroCeph reports HEALTH_WARN immediately because of TOO_FEW_OSDS / mon_warn_on_insecure_global_id_reclaim warnings -- with zero OSDs up. microceph_setup_single_node therefore returned "ready (0s)" before any OSD had finished booting; the calling test then ran `pgrep ceph-osd`, got a real PID (the osd daemon process was launching), proceeded to create a pool, and the still-booting OSD either crashed or exited because the cluster had no quorum-of-OSDs to peer with. By the time osdtrace was started a few seconds later the PID was gone and the tool reported "Process ID NNNN does not exist".

Tighten the readiness check: parse `ceph status --format=json` and require quorum >= 1 AND num_up_osds >= osd_count AND num_in_osds >= osd_count, in addition to the health string being OK or WARN. All three callers (functional-test-microceph.sh, functional-test-embedded-dwarf.sh) already request 3 OSDs, so this rejects partial-bring-up scenarios that previously slipped through.

Timeout default unchanged (120 s); on timeout, dump the last 20 lines of `ceph status` for post-mortem.
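The tightened predicate might look like the sketch below. It is a Python illustration, not the shell helper: the function name `cluster_ready` is hypothetical, and the JSON paths used (`health.status`, a `quorum` list, `osdmap.num_up_osds`, `osdmap.num_in_osds`) are an assumption about the layout of `ceph status --format=json` that should be checked against the release under test.

```python
import json


def cluster_ready(status_json: str, osd_count: int) -> bool:
    """Return True only when health, monitor quorum and OSD counts all look usable."""
    status = json.loads(status_json)
    health = status.get("health", {}).get("status")
    osdmap = status.get("osdmap", {})  # assumed flat layout, see lead-in
    return (
        health in ("HEALTH_OK", "HEALTH_WARN")
        and len(status.get("quorum", [])) >= 1
        and osdmap.get("num_up_osds", 0) >= osd_count
        and osdmap.get("num_in_osds", 0) >= osd_count
    )
```

A status document with HEALTH_WARN and zero OSDs up, the freshly bootstrapped case described above, is rejected, whereas the old health-string-only check would have accepted it.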

taodd added 2 commits May 22, 2026 09:48

taodd force-pushed the cephadm-rgw-ci-tests branch from feda816 to aaa6900 Compare May 22, 2026 00:48

taodd added 3 commits May 22, 2026 16:34

taodd mentioned this pull request May 22, 2026

osdtrace: queue_lat underflow with value 18446743547040292 #107

Open

taodd added 2 commits May 22, 2026 23:22

taodd changed the title ~~tests: add cephadm + radosgw E2E coverage for quincy → tentacle~~ Add cephadm + radosgw E2E coverage for quincy → tentacle May 23, 2026

taodd merged commit ba31351 into main May 23, 2026
34 checks passed

taodd merged 7 commits into main from cephadm-rgw-ci-tests
