Skip to content

Add shared-ring realtime dispatch routing support#4712

Open
cketcham2333 wants to merge 9 commits into
NVIDIA:mainfrom
cketcham2333:realtime_dispatch_from_kernel
Open

Add shared-ring realtime dispatch routing support#4712
cketcham2333 wants to merge 9 commits into
NVIDIA:mainfrom
cketcham2333:realtime_dispatch_from_kernel

Conversation

@cketcham2333

Copy link
Copy Markdown
Collaborator

Adds the realtime dispatcher features needed for mixed HOST_LOOP + DEVICE_LOOP QEC workloads that share one RX/TX ring:

  • Add shared_ring_mode so peer dispatchers skip, rather than drop, slots owned by another dispatch path.
  • Add routing_key sub-routing for GRAPH_LAUNCH workers that share the same function_id.
  • Add a device-side rx_flags polling fence to avoid stale reads from host-pinned mapped memory.
  • Export the shared-ring mode setter and regular dispatch-kernel launch helper for downstream runtime consumers.

Enable HOST_LOOP (CPU thread) and DEVICE_LOOP (persistent GPU kernel)
dispatchers to cooperate on a single shared ring buffer. Today both
dispatchers DROP slots whose function_id is not in their own table
(clearing rx_flags), which would race with the cooperator's processing.
With shared_ring_mode=1, unknown function_ids are SKIPPED instead --
the local cursor advances but rx_flags stays set so the matching
dispatcher can claim the slot.
This is a prerequisite for QEC trio dispatch (enqueue_syndromes via
HOST_LOOP-launched per-round CUDA graph; get_corrections and
reset_decoder via DEVICE_LOOP __device__ functions) sharing one ring.
Changes:
* cudaq_realtime.h: add shared_ring_mode field on
  cudaq_dispatcher_config_t; declare new C API
  cudaq_dispatch_kernel_set_shared_ring_mode().
* host_dispatcher.cu: parse_slot_with_function_table distinguishes
  drop (bad magic, clear rx_flags) from skip (unknown fid under shared
  mode, leave rx_flags set).
* dispatch_kernel.cu: add __constant__ g_dispatch_shared_ring_mode and
  cudaMemcpyToSymbol setter. Mirror the skip-vs-drop logic in all
  three kernel paths (cooperative dispatch_kernel_device_call_only,
  regular dispatch_kernel_device_call_only, dispatch_kernel_with_graph).
* Opportunistic ring scan: when rx_value == 0 at the local cursor
  under shared_ring_mode, scan forward for any non-zero rx_flag and
  jump cursor to the match. Without this, dispatchers livelock at
  slots the peer just cleared. Added to both host and device paths.
* cudaq_realtime_api.cpp: comment documenting that the consumer (not
  the .so) is responsible for calling
  cudaq_dispatch_kernel_set_shared_ring_mode() before
  cudaq_dispatcher_start(). Rationale: dispatch_kernel.cu lives in
  the hidden-visibility static lib cudaq-realtime-dispatch.a which is
  --exclude-libs=ALL'd from libcudaq-realtime.so; the .so cannot
  reach the symbol, but consumers that link the .a directly can.
Tests:
* test_shared_ring_dispatchers.cu (new): brings up HOST_LOOP +
  DEVICE_LOOP on one pinned-mapped ring buffer with a 2-entry shared
  function table (GRAPH_LAUNCH at fid_A handled by HOST_LOOP,
  DEVICE_CALL at fid_B handled by DEVICE_LOOP). Interleaves 4 RPCs
  across the table; asserts each completes with the correct
  dispatcher's transformation and that each dispatcher's slot-count
  stats are exactly 2.
* test_dispatch_kernel, test_host_dispatcher: continue to pass
  unchanged (shared_ring_mode defaults to 0, preserving today's
  drop-on-unknown behavior).

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>


Local cherry-pick of upstream commit 67404ce to restore the noise
model from the execution context, which is needed by CUDA-QX MSM
generation. Required to bring our locally-built SDK in line with the
public 'NVIDIA/cuda-quantum' commit that the cudaqx .cudaq_version
pin points at.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Extends cudaq_function_entry_t and cudaq_host_dispatch_worker_t with a
uint64_t routing_key, and teaches the host monitor loop to use
(function_id, routing_key) when acquiring an idle GRAPH_LAUNCH worker.
The runtime routing key is sourced from the request payload's first
8 bytes (arg0) when arg_len >= 8.
This unblocks the QEC realtime decoder suite, where N decoders each
capture their own graph but all share the same `enqueue_syndromes`
function_id; routing_key disambiguates by decoder_id (arg0 per
proposals/decoder_server_runtime.md).
Backward compatible: workloads that don't sub-route leave the
routing_key field zero on both worker registration and (implicitly)
in the wire payload, recovering the historical function_id-only match.
Device-path (dispatch_kernel.cu) is intentionally untouched -- a
follow-on MR will mirror this for kernel-driven dispatch when a
workload needs it.
Spec reference:
  proposals/cudaq_realtime_host_api.bs#host-path-graph-routing-key

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
…eads

The persistent dispatch kernel polls volatile rx_flags in host-mapped
pinned memory; `volatile` suppresses compiler caching but does not
invalidate the GPU's L2, so a producer-published flag could remain
invisible on the device for many polling iterations.  Observed as a
~7%-rate ACK timeout in the cuda-qx 1000-shot surface_code-1 stress
test (a published get_corrections RPC sat unprocessed for the full 1s
producer timeout while the kernel was hot-looping at ~150 kHz on a
stale-zero read).  Fix by issuing __threadfence_system() before each
of the three rx_flags polling reads in this TU.  30/30 stress runs
post-fix, 12/12 ctest passes, no ABI change.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
cuda-qx's libcudaq-qec-realtime-decoding.so resolves
cudaq_launch_dispatch_kernel_regular and cudaq_dispatch_kernel_set_shared_-
ring_mode at runtime via dlsym(RTLD_DEFAULT, ...) so it can pick them up
from whatever exe absorbed libcudaq-realtime-dispatch.a, without any
explicit setter or constructor-shim plumbing on the consumer side.  The
archive is built CXX_VISIBILITY_PRESET=hidden + -Wl,--exclude-libs=ALL,
which strips every symbol from the binary's dynamic table when absorbed --
including the two dlsym targets -- so the lookup currently fails.  cuda-qx
works around this today by compiling a per-test-exe TU that takes the
addresses by direct reference and hands them to the .so via a setter.
Mark just those two entry points with default visibility (new
CUDAQ_REALTIME_DISPATCH_API macro) so they survive the archive's hidden-
visibility build flags and reach the binary's --export-dynamic table; the
remaining HSB/internal symbols stay hidden (no third-party-leak
regression).  Verified with readelf -s on the rebuilt archive: the two
tagged symbols are now GLOBAL DEFAULT while neighbours like
cudaq_dispatch_kernel_cooperative_query_occupancy and
cudaq_launch_dispatch_kernel_cooperative remain GLOBAL HIDDEN as before.
test_dispatch_kernel / test_host_dispatcher / test_shared_ring_dispatchers
all pass against the rebuilt archive (the third one exercises both
modified symbols end-to-end).

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
…rom_kernel

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
…rom_kernel

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Move the shared HOST_LOOP + DEVICE_LOOP ring interleaving coverage into the
existing test_host_dispatcher.cu CUDA test instead of carrying a new standalone
.cu file. The test still validates shared_ring_mode skip behavior across the
host and device dispatchers, but now reuses the existing host dispatcher test
target and CMake wiring.

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
@github-actions

Copy link
Copy Markdown

CI Summary (push) — ✅ passed

Run #27295382846 · ✅ 6 · ⏩ 7 · ❌ 0 · ⛔ 0

Top-level jobs (13)
Job Result
binaries ⏩ skipped
build_and_test ✅ success
config_devdeps ✅ success
config_source_build ⏩ skipped
config_wheeldeps ✅ success
devdeps ✅ success
docker_image ⏩ skipped
gen_code_coverage ⏩ skipped
metadata ✅ success
python_metapackages ⏩ skipped
python_wheels ⏩ skipped
source_build ⏩ skipped
wheeldeps ✅ success
⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch
Job
binaries
config_source_build
docker_image
gen_code_coverage
python_metapackages
python_wheels
source_build
All sub-jobs (42) — every matrix leg, with links
Job Status Link
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Python) ✅ success view
CI Summary ❔ in_progress view
Configure build (devdeps) ✅ success view
Configure build (source_build) ⏩ skipped view
Configure build (wheeldeps) ✅ success view
Create CUDA Quantum installer ⏩ skipped view
Create Docker images ⏩ skipped view
Create Python metapackages ⏩ skipped view
Create Python wheels ⏩ skipped view
Gen code coverage ⏩ skipped view
Load dependencies (amd64, gcc12) / Caching ✅ success view
Load dependencies (amd64, gcc12) / Finalize ✅ success view
Load dependencies (amd64, gcc12) / Metadata ✅ success view
Load dependencies (amd64, llvm) / Caching ✅ success view
Load dependencies (amd64, llvm) / Finalize ✅ success view
Load dependencies (amd64, llvm) / Metadata ✅ success view
Load dependencies (arm64, gcc12) / Caching ✅ success view
Load dependencies (arm64, gcc12) / Finalize ✅ success view
Load dependencies (arm64, gcc12) / Metadata ✅ success view
Load dependencies (arm64, llvm) / Caching ✅ success view
Load dependencies (arm64, llvm) / Finalize ✅ success view
Load dependencies (arm64, llvm) / Metadata ✅ success view
Load source build cache ⏩ skipped view
Load wheel dependencies (amd64, 12.6) / Caching ✅ success view
Load wheel dependencies (amd64, 12.6) / Finalize ✅ success view
Load wheel dependencies (amd64, 12.6) / Metadata ✅ success view
Load wheel dependencies (amd64, 13.0) / Caching ✅ success view
Load wheel dependencies (amd64, 13.0) / Finalize ✅ success view
Load wheel dependencies (amd64, 13.0) / Metadata ✅ success view
Load wheel dependencies (arm64, 12.6) / Caching ✅ success view
Load wheel dependencies (arm64, 12.6) / Finalize ✅ success view
Load wheel dependencies (arm64, 12.6) / Metadata ✅ success view
Load wheel dependencies (arm64, 13.0) / Caching ✅ success view
Load wheel dependencies (arm64, 13.0) / Finalize ✅ success view
Load wheel dependencies (arm64, 13.0) / Metadata ✅ success view
Prepare cache clean-up ❔ in_progress view
Retrieve PR info ✅ success view
✅ Required checks (6/6) — declared in .github/required-checks.yml for push
Required check Status Link
Build and test (amd64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug) ✅ success view
Build and test (arm64, llvm, openmpi) / Dev environment (Python) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) ✅ success view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python) ✅ success view

@cketcham2333 cketcham2333 marked this pull request as ready for review June 10, 2026 18:41
/// function_id and routing_key match. Set to 0 when sub-routing isn't
/// needed (the historical function_id-only match).
/// See proposals/cudaq_realtime_host_api.bs#host-path-graph-routing-key.
uint64_t routing_key;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: do we need to initialize this routing_key in cudaq_host_dispatcher_start_thread similar to other fields?

Comment on lines +18 to +19
// dlsym(RTLD_DEFAULT,
// ...) at runtime. libcudaq-realtime-dispatch.a is built with hidden

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: line break (clang-format fallout, perhaps)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants