Add cephadm + radosgw E2E coverage for quincy → tentacle #106
Merged
Adds an end-to-end test that exercises both osdtrace and radostrace
against a cephadm-deployed cluster, validating the embedded DWARF data
across the full quincy/reef/squid/tentacle line. Locally this runs
inside a disposable LXD VM; in CI it runs directly on the GHA runner.
Pieces
------
* tests/lib/cephadm-setup.sh -- reusable helpers: image-tag map per
release, distro/apt or quay.io-fallback cephadm install, LVM-wrapped
loopback OSDs (cephadm's inventory filters raw /dev/loopX), single-host
bootstrap, healthy-cluster wait, RGW deploy, RGW user creation.
* tests/functional-test-cephadm-rgw.sh -- per-release E2E: 1 MON+MGR +
3 OSDs + 1 radosgw, S3 PUT/GET workload, parallel osdtrace + radostrace
trace, row-count + range + latency assertions.
* tests/local-cephadm-test.sh -- developer wrapper that runs the test
in a fresh ubuntu:24.04 LXD VM (one per release). Keeps the VM on
failure for post-mortem; forwards KEEP_CLUSTER for in-VM debugging.
CI integration
--------------
* New build-ubuntu-cephadm matrix job in .github/workflows/pr-build.yaml:
ubuntu-22.04 × ubuntu-24.04 × {quincy, reef, squid, tentacle}, with
fail-fast disabled and a 35-minute timeout per cell.
Embedded DWARF data
-------------------
* files/centos-stream/{osdtrace,radostrace}/{osd,rados}-2:17.2.8-0.el9_dwarf.json
* files/centos-stream/{osdtrace,radostrace}/{osd,rados}-2:17.2.9-0.el9_dwarf.json
17.2.8 covers the quay.io/ceph/ceph:v17.2.8 image used by the cephadm
test (no v17.2.9 container was ever published). 17.2.9 covers
RPM-deployed production clusters at the latest quincy point release.
Tool tweak
----------
* osdtrace / radostrace: skip the deleted-binary / deleted-library guard
when -j (JSON export) is set. The export path intentionally wants to
parse the *on-disk* (possibly newly-upgraded) binary so the recorded
DWARF metadata matches the package version stamped into the JSON,
rather than whatever stale image happens to still be mmap'd by the
running process.
The flag is only present in cephadm releases newer than what Ubuntu 22.04 ships in its apt archive, causing 'unrecognized arguments' across all four release cells of the build-ubuntu-cephadm matrix on 22.04. It was only useful for inspecting partial bootstraps during local debugging; CI doesn't need it, and the cephadm default behaviour (auto-cleanup on failure) is the right one there.
The functional test's cleanup() trap unlinks the osdtrace / radostrace
output logs unconditionally on exit, so by the time a failing CI job
returns to the workflow runner there is nothing left on disk besides
the tail -50 dump the trap echoed into stdout. When a rare event trips
a verifier check (e.g. an underflowed op_lat value beyond the
TRACE_MAX_LATENCY_US bound), the offending row is almost never within
those 50 lines.

Two changes:
* tests/functional-test-cephadm-rgw.sh: capture $? at the top of
cleanup() and skip the rm -f when the script is exiting non-zero (or
when KEEP_CLUSTER=1, the existing local-debug switch).
* .github/workflows/pr-build.yaml: in build-ubuntu-cephadm, stage the
surviving /tmp/*trace-cephadm-*.log files into the workspace on failure
and upload them as a per-cell artifact (retained 14 days, named
trace-logs-<os>-<release>). This lets us pull the full trace from the
run page with `gh run download` for later analysis.
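The first change can be sketched roughly as follows. This is a minimal illustration, not the code in tests/functional-test-cephadm-rgw.sh: the `TRACE_LOGS` array and the message text are hypothetical, and only `KEEP_CLUSTER` comes from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: TRACE_LOGS is a hypothetical array holding the trace log paths.
TRACE_LOGS=()

cleanup() {
    local rc=$?   # must be the first statement: any later command resets $?
    if [ "$rc" -ne 0 ] || [ "${KEEP_CLUSTER:-0}" = "1" ]; then
        # Failing run (or local debugging): leave the logs for post-mortem.
        echo "keeping trace logs (rc=$rc): ${TRACE_LOGS[*]}" >&2
    else
        rm -f -- "${TRACE_LOGS[@]}"
    fi
    return "$rc"
}
trap cleanup EXIT
```

Capturing `$?` on the first line matters because every subsequent command, including the `[` tests, overwrites it.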
Rewrite _osdtrace_rows and _verify_osdtrace_output_impl in
tests/lib/verify-trace-output.sh to capture every field per op type
and bound each sub-latency to the row's total op_lat.

Parser
------
* Three typed schemas keyed off $5 (op_r / subop_w / op_w), each
emitted as a discriminator-prefixed pipe-separated line. Field
positions mirror the three printf-format strings in
src/osdtrace.cc:print_op_*.
* Strict row acceptance: each op type matched by exact NF *and* by
literal field-name landmarks (`$24 == "op_lat"` for op_r,
`$22 == "peers" && $39 == "op_lat"` for op_w, ...). This drops
truncated/mid-printf rows that the old loose `$NF + 0` parser silently
misread. In particular it filters out the rare row that was killed by
`timeout`'s SIGKILL between the `peers` and `bluestore_lat` tokens,
leaving the underflowed peer-latency token
(`(-1, 18446743169577026)]`) as the final field; the old parser mistook
that for the genuine op_lat and reported a spurious >100 s latency.

Per-row invariants
------------------
* Existing: osd_id within [0, max_osd_id]; total op_lat bounded by
TRACE_MAX_LATENCY_US; pg seed within pg_num for test_pool rows.
* New: every named sub-latency (throttle/recv/dispatch/queue/osd/
bluestore/prepare/aio_wait/seq_wait/kv_commit/peer*) must be <= the
row's total op_lat. Underflowed BPF timestamps would otherwise present
as huge µs values that quietly poison downstream analysis without
triggering the (much looser) absolute upper bound.
* op_w peer slots with id == -1 are sentinel-padded; both their id and
latency checks are skipped (the peer_lat in that slot is uninitialised
garbage from the BPF prog).

Aggregates
----------
* Per-op-type row counts (op_r / subop_w / op_w) reported alongside the
existing total-rows summary, both globally and for test_pool. Makes
"we lost reads" / "we lost replica writes" regressions easy to spot
from the test summary alone.

Existing call sites in functional-test-microceph.sh, functional-test-embedded-dwarf.sh and the RGW wrapper continue to use the same `verify_osdtrace_output` signature; the stricter checks now apply transparently.
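The new sub-latency invariant can be illustrated with a small awk check. This is a sketch under a simplifying assumption: rows here are flat `name value` pairs, whereas the real parser matches fixed field positions from the osdtrace printf formats. The `check_sublat` helper and the sample row layout are hypothetical.

```shell
# check_sublat: read rows of "name value" pairs on stdin, report every
# *_lat field larger than the row's op_lat, exit non-zero on any violation.
# Hypothetical simplified row layout, not the real osdtrace output format.
check_sublat() {
    awk '
    {
        total = -1
        for (i = 1; i < NF; i++)
            if ($i == "op_lat") total = $(i + 1) + 0
        if (total < 0) next          # no op_lat landmark: not a complete row
        for (i = 1; i < NF; i++)
            if ($i ~ /_lat$/ && $i != "op_lat" && $(i + 1) + 0 > total) {
                printf "row %d: %s=%s exceeds op_lat=%d\n", NR, $i, $(i + 1), total
                bad = 1
            }
    }
    END { exit (bad ? 1 : 0) }
    '
}
```

An underflowed value such as 18446743169577026 fails this relative check even when the row's op_lat itself looks plausible, which is the case the absolute TRACE_MAX_LATENCY_US bound on op_lat cannot catch.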
SIGKILL of the writer at the trace-window deadline can leave the log
file's tail in a byte-truncated state: libc was mid-flush and the final
stdio write() syscall was interrupted between two of the chunks the
buffer was split across. The byte-truncated row is always the last
record the writer produced; everything before it landed via
previously-completed atomic write()s.

The osdtrace parser's strict NF + landmark checks catch most of these
truncations on their own, but byte-level truncation can occasionally
leave a row whose NF happens to match an op type and whose tokens
coincidentally pass each landmark check. The radostrace parser's
NF >= 10 predicate is much looser and routinely admits truncated tails
(e.g. an object name cut to "rbd" instead of "rbd_data.<hex>.<seq>"),
which then trips the verifier's `^rbd_` object-prefix check and fails
the entire test for what is structurally a non-bug.

Buffer the latest emit in `prev` and rely on awk's END-without-flush to
drop it. Cost: one good row per trace, against thousands captured.
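The buffering trick looks roughly like this. It is a minimal sketch: the `drop_last_row` wrapper is a hypothetical name, and the `NF >= 10` predicate stands in for whatever acceptance test the parser applies.

```shell
# drop_last_row: pass through every accepted row except the final one.
drop_last_row() {
    awk '
    NF >= 10 {
        if (have_prev) print prev   # emit the row buffered on the previous match
        prev = $0                   # hold the current row back
        have_prev = 1
    }
    # Deliberately no END block: the last buffered row is never printed.
    '
}
```

Because a row is printed only once a later accepted row arrives, the last accepted record, which is the only one that can be byte-truncated, is discarded without any explicit truncation detection.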
…racking
A client retry of an op the OSD is still processing arrives via a
fresh MOSDOp with the same `(client_id, tid)` as the original. Ceph
handles the duplicate via `PrimaryLogPG::already_complete()` inside
`do_op`, so it never goes through the full transaction path. The
problem is on the BPF side: `uprobe_enqueue_op` unconditionally
overwrites the in-flight original's record via
`bpf_map_update_elem(..., BPF_ANY)`. The local `value` struct is
memset-zeroed first, so the overwrite wipes:
* dequeue_stamp (back to 0)
* peer0/peer1 slots (back to the -1 sentinel)
* every per-stage latency
(osd_lat, bluestore_lat, prepare, aio_wait, seq_wait, kv_commit)
When the original op later completes, its `uprobe_log_op_stats` looks
up the key, finds the retry's freshly-clobbered record, sets
`reply_stamp = bpf_ktime_get_boot_ns()`, and emits a ringbuf event.
Userspace then computes `queue_lat = 0 - retry_enqueue_stamp` which
unsigned-underflows to `~UINT64_MAX − T_enq_retry` -- the failure
mode reported in cephtrace#107.
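The underflow is easy to reproduce outside the tool. The sketch below models the unsigned 64-bit subtraction in bash; `u64_sub` is a hypothetical helper and the stamp value is made up for illustration.

```shell
#!/usr/bin/env bash
# u64_sub A B: print (A - B) interpreted as an unsigned 64-bit value.
# bash arithmetic is signed 64-bit; printf %u reinterprets the result
# as unsigned, mimicking a uint64_t subtraction in C.
u64_sub() {
    printf '%u\n' "$(( $1 - $2 ))"
}

# dequeue_stamp wiped to 0, retry enqueue_stamp = 1000 (made-up value)
u64_sub 0 1000    # prints 18446744073709550616, i.e. 2^64 - 1000
```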
Detect the case at the top of `uprobe_enqueue_op`: if an entry for
the key already exists *and* its `enqueue_stamp` is recent (within
5 s), the new event is a client retry of an in-flight op -- return
early without overwriting. Older entries (> 5 s) are orphans left
by some completion uprobe that didn't fire for an earlier op; fall
through to overwrite them so the map doesn't slowly leak.
bpf_printk diagnostics on both branches make it easy to tell, when
debugging, whether a given run is hitting retries, orphans, or
neither.
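The decision made at the top of `uprobe_enqueue_op` can be modelled in a few lines of shell. This is only a model of the logic described above, not the BPF code: `classify_enqueue`, its outputs and `RETRY_WINDOW_NS` are hypothetical names.

```shell
#!/usr/bin/env bash
RETRY_WINDOW_NS=5000000000   # 5 s in nanoseconds, per the description above

# classify_enqueue EXISTING_STAMP NOW
# EXISTING_STAMP is the enqueue_stamp already in the map for this key,
# or an empty string when the key is absent. Both stamps are in ns.
classify_enqueue() {
    local existing=$1 now=$2
    if [ -z "$existing" ]; then
        echo insert               # no entry yet: record the op as usual
    elif [ $(( now - existing )) -le "$RETRY_WINDOW_NS" ]; then
        echo skip-retry           # recent entry: in-flight op, do not overwrite
    else
        echo overwrite-orphan     # stale entry: reclaim the slot
    fi
}
```

For example, `classify_enqueue 1000000000 3000000000` prints `skip-retry` (2 s old), while `classify_enqueue 1000000000 9000000000` prints `overwrite-orphan` (8 s old).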
The readiness loop accepted any cluster status string of HEALTH_OK or
HEALTH_WARN, but a freshly bootstrapped MicroCeph reports HEALTH_WARN
immediately because of TOO_FEW_OSDS /
mon_warn_on_insecure_global_id_reclaim warnings -- with zero OSDs up.
microceph_setup_single_node therefore returned "ready (0s)" before any
OSD had finished booting; the calling test then ran `pgrep ceph-osd`,
got a real PID (the osd daemon process was launching), proceeded to
create a pool, and the still-booting OSD either crashed or exited
because the cluster had no quorum-of-OSDs to peer with. By the time
osdtrace was started a few seconds later the PID was gone and the tool
reported "Process ID NNNN does not exist".

Tighten the readiness check: parse `ceph status --format=json` and
require quorum >= 1 AND num_up_osds >= osd_count AND
num_in_osds >= osd_count, in addition to the health string being OK or
WARN. The callers (functional-test-microceph.sh,
functional-test-embedded-dwarf.sh) already request 3 OSDs, so this
rejects partial-bring-up scenarios that previously slipped through.

Timeout default unchanged (120 s); on timeout, dump the last 20 lines
of `ceph status` for post-mortem.
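A tightened predicate along these lines could be written with jq. Treat it as a sketch: `cluster_ready` is a hypothetical helper, and the JSON paths (`.health.status`, `.quorum`, `.osdmap.num_up_osds`, `.osdmap.num_in_osds`) are assumptions about the `ceph status --format=json` layout that should be checked against the Ceph release in use.

```shell
# cluster_ready OSD_COUNT   (reads `ceph status --format=json` on stdin)
# Succeeds only when health is OK/WARN, at least one mon is in quorum,
# and at least OSD_COUNT OSDs are both up and in.
# The JSON paths below are assumptions, not verified against every release.
cluster_ready() {
    local want=$1
    jq -e --argjson want "$want" '
        (.health.status == "HEALTH_OK" or .health.status == "HEALTH_WARN")
        and ((.quorum // []) | length) >= 1
        and (.osdmap.num_up_osds // 0) >= $want
        and (.osdmap.num_in_osds // 0) >= $want
    ' >/dev/null
}
```

A polling loop would then call something like `ceph status --format=json | cluster_ready 3` until it succeeds or the timeout expires; a status with HEALTH_WARN and zero OSDs up no longer counts as ready.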
Summary
-------
* E2E test of `osdtrace` and `radostrace` against a cephadm-deployed cluster across the full quincy / reef / squid / tentacle line (`tests/functional-test-cephadm-rgw.sh`, reusable helpers in `tests/lib/cephadm-setup.sh`, LXD-VM dev wrapper in `tests/local-cephadm-test.sh`).
* `build-ubuntu-cephadm` matrix job in `.github/workflows/pr-build.yaml`: `ubuntu-22.04` × `ubuntu-24.04` × `{quincy, reef, squid, tentacle}`, `fail-fast: false`, 35-minute timeout per cell.
* Embedded DWARF data for `osdtrace` and `radostrace`. 17.2.8 matches the `quay.io/ceph/ceph:v17.2.8` image used by the cephadm test; 17.2.9 covers RPM-deployed production clusters at the latest quincy point release (no v17.2.9 container was ever published).
* `osdtrace` / `radostrace` now skip the deleted-binary / deleted-library guard when `-j` (JSON export) is set, so the export path can parse the on-disk (possibly newly-upgraded) binary regardless of what's still mmap'd by the live process.
Test plan
---------
* `tests/local-cephadm-test.sh` on Ubuntu 24.04 host: all four releases PASS
* `build-ubuntu-cephadm`: green across all 8 cells (2 OS × 4 releases)
* `build-ubuntu`, `build-ubuntu-arm64`, `build-centos`, `build-rocky`, `clang-tidy`: still pass