
Add cephadm + radosgw E2E coverage for quincy → tentacle #106

Merged
taodd merged 7 commits into main from cephadm-rgw-ci-tests on May 23, 2026


Owner

@taodd taodd commented May 21, 2026

Summary

  • New end-to-end test that exercises both osdtrace and radostrace against a cephadm-deployed cluster across the full quincy / reef / squid / tentacle line (tests/functional-test-cephadm-rgw.sh, reusable helpers in tests/lib/cephadm-setup.sh, LXD-VM dev wrapper in tests/local-cephadm-test.sh).
  • New build-ubuntu-cephadm matrix job in .github/workflows/pr-build.yaml: {ubuntu-22.04, ubuntu-24.04} × {quincy, reef, squid, tentacle} (8 cells), fail-fast: false, 35-minute timeout per cell.
  • Added quincy 17.2.8 and 17.2.9 embedded DWARF JSONs (centos-stream) for both osdtrace and radostrace. 17.2.8 matches the quay.io/ceph/ceph:v17.2.8 image used by the cephadm test; 17.2.9 covers RPM-deployed production clusters at the latest quincy point release (no v17.2.9 container was ever published).
  • Tool tweak: osdtrace / radostrace now skip the deleted-binary / deleted-library guard when -j (JSON export) is set, so the export path can parse the on-disk (possibly newly-upgraded) binary regardless of what's still mmap'd by the live process.

Test plan

  • Local validation via tests/local-cephadm-test.sh on Ubuntu 24.04 host: all four releases PASS
    • squid: osdtrace 1243 rows, radostrace 2410 rows
    • reef: osdtrace 901 rows, radostrace 1941 rows
    • tentacle: osdtrace 1631 rows, radostrace 3076 rows
    • quincy: osdtrace 824 rows, radostrace 1782 rows
  • CI matrix (build-ubuntu-cephadm) green across all 8 cells (2 OS × 4 releases)
  • Existing CI jobs (build-ubuntu, build-ubuntu-arm64, build-centos, build-rocky, clang-tidy) still pass

taodd added 2 commits May 22, 2026 09:48
Adds an end-to-end test that exercises both osdtrace and radostrace
against a cephadm-deployed cluster, validating the embedded DWARF data
across the full quincy/reef/squid/tentacle line.  Locally this runs
inside a disposable LXD VM; in CI it runs directly on the GHA runner.

Pieces
------
* tests/lib/cephadm-setup.sh -- reusable helpers: image-tag map per
  release, distro/apt or quay.io-fallback cephadm install, LVM-wrapped
  loopback OSDs (cephadm's inventory filters raw /dev/loopX), single-host
  bootstrap, healthy-cluster wait, RGW deploy, RGW user creation.
* tests/functional-test-cephadm-rgw.sh -- per-release E2E: 1 MON+MGR +
  3 OSDs + 1 radosgw, S3 PUT/GET workload, parallel osdtrace + radostrace
  trace, row-count + range + latency assertions.
* tests/local-cephadm-test.sh -- developer wrapper that runs the test
  in a fresh ubuntu:24.04 LXD VM (one per release).  Keeps the VM on
  failure for post-mortem; forwards KEEP_CLUSTER for in-VM debugging.

CI integration
--------------
* New build-ubuntu-cephadm matrix job in .github/workflows/pr-build.yaml:
  {ubuntu-22.04, ubuntu-24.04} × {quincy, reef, squid, tentacle}
  (8 cells), with fail-fast disabled and a 35-minute timeout per cell.
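
As a rough illustration of how the matrix expands, the sketch below enumerates the cells as the cross product of the two axes. The function and constant names are hypothetical and are not taken from the workflow file.

```python
from itertools import product

# Axis values as listed in the commit message.
OSES = ["ubuntu-22.04", "ubuntu-24.04"]
RELEASES = ["quincy", "reef", "squid", "tentacle"]


def matrix_cells(oses=OSES, releases=RELEASES):
    """Return one (os, release) tuple per CI cell (hypothetical helper)."""
    return list(product(oses, releases))


if __name__ == "__main__":
    for os_name, release in matrix_cells():
        print(f"{os_name} / {release}")
```

With fail-fast disabled, each of these cells runs to completion independently of the others.
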

Embedded DWARF data
-------------------
* files/centos-stream/{osdtrace,radostrace}/{osd,rados}-2:17.2.8-0.el9_dwarf.json
* files/centos-stream/{osdtrace,radostrace}/{osd,rados}-2:17.2.9-0.el9_dwarf.json
  17.2.8 covers the quay.io/ceph/ceph:v17.2.8 image used by the cephadm
  test (no v17.2.9 container was ever published).  17.2.9 covers
  RPM-deployed production clusters at the latest quincy point release.

Tool tweak
----------
* osdtrace / radostrace: skip the deleted-binary / deleted-library guard
  when -j (JSON export) is set.  The export path intentionally wants to
  parse the *on-disk* (possibly newly-upgraded) binary so the recorded
  DWARF metadata matches the package version stamped into the JSON,
  rather than whatever stale image happens to still be mmap'd by the
  running process.
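
The guard decision described under "Tool tweak" can be modelled with a short Python sketch. It assumes the guard looks for the " (deleted)" suffix that Linux appends to unlinked paths in /proc/<pid>/exe and /proc/<pid>/maps; the commit does not say how the tools actually detect this, and the function names are hypothetical.

```python
def is_deleted_mapping(path: str) -> bool:
    """True if a /proc path entry refers to a file that was unlinked.

    Assumption: detection relies on the kernel's ' (deleted)' suffix.
    """
    return path.endswith(" (deleted)")


def should_refuse(mapped_paths, json_export: bool) -> bool:
    """Return True when the tool should stop because of a deleted mapping.

    With JSON export requested the guard is skipped, because the export
    path reads the on-disk binary rather than the mapped image.
    """
    if json_export:
        return False
    return any(is_deleted_mapping(p) for p in mapped_paths)


if __name__ == "__main__":
    paths = ["/usr/bin/ceph-osd (deleted)", "/usr/lib64/libc.so.6"]
    print(should_refuse(paths, json_export=False))
    print(should_refuse(paths, json_export=True))
```
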
The flag is only present in cephadm releases newer than what Ubuntu
22.04 ships in its apt archive, causing 'unrecognized arguments' across
all four release cells of the build-ubuntu-cephadm matrix on 22.04.
It was only useful for inspecting partial bootstraps during local
debugging; CI doesn't need it, and the cephadm default behaviour
(auto-cleanup on failure) is the right one there.
@taodd taodd force-pushed the cephadm-rgw-ci-tests branch from feda816 to aaa6900 Compare May 22, 2026 00:48
taodd added 3 commits May 22, 2026 16:34
The functional test's cleanup() trap unlinks the osdtrace / radostrace
output logs unconditionally on exit, so by the time a failing CI job
returns to the workflow runner there is nothing left on disk besides
the tail -50 dump the trap echoed into stdout.  When a rare event
trips a verifier check (e.g. an underflowed op_lat value beyond the
TRACE_MAX_LATENCY_US bound), the offending row is almost never within
those 50 lines.

Two changes:

* tests/functional-test-cephadm-rgw.sh: capture $? at the top of
  cleanup() and skip the rm -f when the script is exiting non-zero (or
  when KEEP_CLUSTER=1, the existing local-debug switch).
* .github/workflows/pr-build.yaml: in build-ubuntu-cephadm, stage the
  surviving /tmp/*trace-cephadm-*.log files into the workspace on
  failure and upload them as a per-cell artifact (retained 14 days,
  named trace-logs-<os>-<release>).  Lets us pull the full trace from
  the run page with `gh run download` for later analysis.
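
The cleanup rule can be summarised in a small Python model. The real logic lives in the shell trap; the function name below is hypothetical.

```python
def logs_to_remove(exit_status: int, keep_cluster: bool, log_paths):
    """Return the trace logs that cleanup may delete.

    Logs are preserved when the script exits non-zero or when
    KEEP_CLUSTER=1 is set; otherwise all of them are removed.
    """
    if exit_status != 0 or keep_cluster:
        return []
    return list(log_paths)


if __name__ == "__main__":
    logs = ["/tmp/osdtrace-cephadm-1.log", "/tmp/radostrace-cephadm-1.log"]
    print(logs_to_remove(0, False, logs))  # clean exit: both logs removed
    print(logs_to_remove(1, False, logs))  # failing exit: nothing removed
```

The exit status has to be captured before anything else runs in the trap, since any later command would overwrite it.
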
Rewrite _osdtrace_rows and _verify_osdtrace_output_impl in
tests/lib/verify-trace-output.sh to capture every field per op type
and bound each sub-latency to the row's total op_lat.

Parser
------
* Three typed schemas keyed off $5 (op_r / subop_w / op_w), each
  emitted as a discriminator-prefixed pipe-separated line.  Field
  positions mirror the three printf-format strings in
  src/osdtrace.cc:print_op_*.
* Strict row acceptance: each op type matched by exact NF *and* by
  literal field-name landmarks (`$24 == "op_lat"` for op_r,
  `$22 == "peers" && $39 == "op_lat"` for op_w, ...).  This drops
  truncated/mid-printf rows that the old loose `$NF + 0` parser
  silently misread.  In particular it filters out the rare row that
  was killed by `timeout`'s SIGKILL between the `peers` and
  `bluestore_lat` tokens, leaving the underflowed peer-latency token
  (`(-1, 18446743169577026)]`) as the final field; the old parser
  mistook that for the genuine op_lat and reported a spurious
  >100 s latency.

Per-row invariants
------------------
* Existing: osd_id within [0, max_osd_id]; total op_lat bounded by
  TRACE_MAX_LATENCY_US; pg seed within pg_num for test_pool rows.
* New: every named sub-latency (throttle/recv/dispatch/queue/osd/
  bluestore/prepare/aio_wait/seq_wait/kv_commit/peer*) must be
  <= the row's total op_lat.  Underflowed BPF timestamps would
  otherwise present as huge µs values that quietly poison downstream
  analysis without triggering the (much looser) absolute upper bound.
* op_w peer slots with id == -1 are sentinel-padded; both their id
  and latency checks are skipped (the peer_lat in that slot is
  uninitialised garbage from the BPF prog).

Aggregates
----------
* Per-op-type row counts (op_r / subop_w / op_w) reported alongside
  the existing total-rows summary, both globally and for test_pool.
  Makes "we lost reads" / "we lost replica writes" regressions easy
  to spot from the test summary alone.

Existing call sites in functional-test-microceph.sh,
functional-test-embedded-dwarf.sh and the RGW wrapper continue to use
the same `verify_osdtrace_output` signature; the stricter checks now
apply transparently.
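
The per-row invariants can be sketched in Python as follows. This is a simplified model, not the awk verifier itself: the function name, the argument layout and the example bound passed as max_latency_us are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def check_row(op_lat: int,
              sub_lats: Dict[str, int],
              peers: List[Tuple[int, int]],
              max_osd_id: int,
              max_latency_us: int) -> List[str]:
    """Return a list of invariant violations for one parsed trace row."""
    errors = []
    if op_lat > max_latency_us:
        errors.append(f"op_lat={op_lat} exceeds bound {max_latency_us}")
    # Every named sub-latency must fit inside the row's total op_lat.
    for name, lat in sub_lats.items():
        if lat > op_lat:
            errors.append(f"{name}={lat} exceeds op_lat={op_lat}")
    for peer_id, peer_lat in peers:
        if peer_id == -1:
            continue  # sentinel-padded slot: id and latency are not checked
        if not 0 <= peer_id <= max_osd_id:
            errors.append(f"peer id {peer_id} out of range")
        if peer_lat > op_lat:
            errors.append(f"peer {peer_id} lat={peer_lat} exceeds op_lat={op_lat}")
    return errors


if __name__ == "__main__":
    # The garbage latency in the sentinel slot is ignored.
    print(check_row(1000, {"queue_lat": 10}, [(1, 200), (-1, 18446743169577026)],
                    max_osd_id=2, max_latency_us=10**8))
```

Bounding each sub-latency by op_lat catches underflowed values relative to the row itself, which the absolute upper bound alone cannot do when op_lat is small.
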
SIGKILL of the writer at the trace-window deadline can leave the log
file's tail in a byte-truncated state: libc was mid-flush and the
final stdio write() syscall was interrupted between two of the chunks
the buffer was split across.  The byte-truncated row is always the
last record the writer produced; everything before it landed via
previously-completed atomic write()s.

The osdtrace parser's strict NF + landmark checks catch most of these
truncations on their own, but byte-level truncation can occasionally
leave a row whose NF happens to match an op type and whose tokens
coincidentally pass each landmark check.  The radostrace parser's
NF >= 10 predicate is much looser and routinely admits truncated
tails (e.g. an object name cut to "rbd" instead of "rbd_data.<hex>.<seq>"),
which then trips the verifier's `^rbd_` object-prefix check and fails
the entire test for what is structurally a non-bug.

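Buffer the most recent row in `prev` and print it only when the next
row arrives; the END block deliberately never flushes `prev`, so the
final (possibly truncated) row is dropped.  Cost: one good row per
trace, against thousands captured.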
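
The buffering trick is easy to see in a language-neutral form. The sketch below is a Python analogue of the awk pattern, with a hypothetical function name: each row is held back until its successor arrives, and the last one is never emitted.

```python
def drop_last(rows):
    """Yield every row except the final one, using a one-row buffer."""
    prev = None
    have_prev = False
    for row in rows:
        if have_prev:
            yield prev  # the previous row is known to be complete
        prev, have_prev = row, True
    # No final yield: the last buffered row may be truncated and is dropped.


if __name__ == "__main__":
    rows = ["rbd_data.1 ok", "rbd_data.2 ok", "rbd"]
    print(list(drop_last(rows)))
```
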
taodd added 2 commits May 22, 2026 23:22
…racking

A client retry of an op the OSD is still processing arrives via a
fresh MOSDOp with the same `(client_id, tid)` as the original.  Ceph
handles the duplicate via `PrimaryLogPG::already_complete()` inside
`do_op`, so it never goes through the full transaction path.  The
problem is on the BPF side: `uprobe_enqueue_op` unconditionally
overwrites the in-flight original's record via
`bpf_map_update_elem(..., BPF_ANY)`.  The local `value` struct is
memset-zeroed first, so the overwrite wipes:

  * dequeue_stamp      (back to 0)
  * peer0/peer1 slots  (back to the -1 sentinel)
  * every per-stage latency
    (osd_lat, bluestore_lat, prepare, aio_wait, seq_wait, kv_commit)

When the original op later completes, its `uprobe_log_op_stats` looks
up the key, finds the retry's freshly-clobbered record, sets
`reply_stamp = bpf_ktime_get_boot_ns()`, and emits a ringbuf event.
Userspace then computes `queue_lat = 0 - retry_enqueue_stamp`, which
wraps around as an unsigned 64-bit value to roughly
`UINT64_MAX - T_enq_retry` -- the failure mode reported in
cephtrace#107.

Detect the case at the top of `uprobe_enqueue_op`: if an entry for
the key already exists *and* its `enqueue_stamp` is recent (within
5 s), the new event is a client retry of an in-flight op -- return
early without overwriting.  Older entries (> 5 s) are orphans left
by some completion uprobe that didn't fire for an earlier op; fall
through to overwrite them so the map doesn't slowly leak.

bpf_printk diagnostics on both branches make it easy to tell, when
debugging, whether a given run is hitting retries, orphans, or
neither.
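
Both the underflow and the new guard can be modelled in a few lines of Python. This is a sketch of the decision logic only, not the BPF program: the map is a plain dict, the record keeps just two fields, and all names apart from the 5 s window are hypothetical.

```python
U64 = 1 << 64
RETRY_WINDOW_NS = 5 * 1_000_000_000  # "recent" means within 5 s


def u64_sub(a: int, b: int) -> int:
    """Subtract with unsigned 64-bit wrap-around, as C does for uint64_t."""
    return (a - b) % U64


def on_enqueue(inflight: dict, key, now_ns: int) -> str:
    """Model of the enqueue-side guard; returns 'new', 'retry' or 'orphan'."""
    entry = inflight.get(key)
    if entry is not None and u64_sub(now_ns, entry["enqueue_stamp"]) <= RETRY_WINDOW_NS:
        return "retry"  # keep the in-flight original untouched
    outcome = "new" if entry is None else "orphan"
    # Fresh or stale key: (re)initialise the record.
    inflight[key] = {"enqueue_stamp": now_ns, "dequeue_stamp": 0}
    return outcome


if __name__ == "__main__":
    # The old failure mode: dequeue_stamp reset to 0 by a retry.
    print(u64_sub(0, 123_456_789))
    table = {}
    print(on_enqueue(table, (4242, 7), 1_000))
    print(on_enqueue(table, (4242, 7), 2_000))
```

The age cutoff is what keeps the guard from turning into a leak: a key that has sat in the map longer than the window is treated as abandoned and overwritten.
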
The readiness loop accepted any cluster status string of HEALTH_OK or
HEALTH_WARN, but a freshly bootstrapped MicroCeph reports HEALTH_WARN
immediately because of TOO_FEW_OSDS / mon_warn_on_insecure_global_id_reclaim
warnings -- with zero OSDs up.  microceph_setup_single_node therefore
returned "ready (0s)" before any OSD had finished booting; the calling
test then ran `pgrep ceph-osd`, got a real PID (the osd daemon process
was launching), proceeded to create a pool, and the still-booting OSD
either crashed or exited because the cluster had no quorum-of-OSDs to
peer with.  By the time osdtrace was started a few seconds later the
PID was gone and the tool reported "Process ID NNNN does not exist".

Tighten the readiness check: parse `ceph status --format=json` and
require quorum >= 1 AND num_up_osds >= osd_count AND num_in_osds >=
osd_count, in addition to the health string being OK or WARN.  All
three callers (including functional-test-microceph.sh and
functional-test-embedded-dwarf.sh) already request 3 OSDs, so this
rejects partial-bring-up scenarios that previously slipped through.
Timeout default unchanged (120 s);
on timeout, dump the last 20 lines of `ceph status` for post-mortem.
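
A minimal Python sketch of the tightened predicate follows. The JSON field layout (health.status, a quorum list, and osdmap.num_up_osds / osdmap.num_in_osds, optionally nested one level deeper) is an assumption about `ceph status --format=json` output and may differ between releases; the function name is hypothetical.

```python
import json


def cluster_ready(status_json: str, osd_count: int) -> bool:
    """Return True only when the cluster is healthy enough to trace."""
    status = json.loads(status_json)
    health = status.get("health", {}).get("status")
    quorum = status.get("quorum", [])
    osdmap = status.get("osdmap", {})
    osdmap = osdmap.get("osdmap", osdmap)  # assumed: some releases nest this
    return (
        health in ("HEALTH_OK", "HEALTH_WARN")
        and len(quorum) >= 1
        and osdmap.get("num_up_osds", 0) >= osd_count
        and osdmap.get("num_in_osds", 0) >= osd_count
    )


if __name__ == "__main__":
    fresh = json.dumps({
        "health": {"status": "HEALTH_WARN"},
        "quorum": [0],
        "osdmap": {"num_up_osds": 0, "num_in_osds": 0},
    })
    print(cluster_ready(fresh, osd_count=3))
```

A freshly bootstrapped cluster with HEALTH_WARN and zero OSDs up now fails the check, which is the case the old loop let through.
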
@taodd taodd changed the title from "tests: add cephadm + radosgw E2E coverage for quincy → tentacle" to "Add cephadm + radosgw E2E coverage for quincy → tentacle" May 23, 2026
@taodd taodd merged commit ba31351 into main May 23, 2026
34 checks passed