Add shared-ring realtime dispatch routing support#4712
Open
cketcham2333 wants to merge 9 commits into
Open
Conversation
Enable HOST_LOOP (CPU thread) and DEVICE_LOOP (persistent GPU kernel) dispatchers to cooperate on a single shared ring buffer. Today both dispatchers DROP slots whose function_id is not in their own table (clearing rx_flags), which would race with the cooperator's processing. With shared_ring_mode=1, unknown function_ids are SKIPPED instead -- the local cursor advances but rx_flags stays set so the matching dispatcher can claim the slot. This is a prerequisite for QEC trio dispatch (enqueue_syndromes via HOST_LOOP-launched per-round CUDA graph; get_corrections and reset_decoder via DEVICE_LOOP __device__ functions) sharing one ring. Changes: * cudaq_realtime.h: add shared_ring_mode field on cudaq_dispatcher_config_t; declare new C API cudaq_dispatch_kernel_set_shared_ring_mode(). * host_dispatcher.cu: parse_slot_with_function_table distinguishes drop (bad magic, clear rx_flags) from skip (unknown fid under shared mode, leave rx_flags set). * dispatch_kernel.cu: add __constant__ g_dispatch_shared_ring_mode and cudaMemcpyToSymbol setter. Mirror the skip-vs-drop logic in all three kernel paths (cooperative dispatch_kernel_device_call_only, regular dispatch_kernel_device_call_only, dispatch_kernel_with_graph). * Opportunistic ring scan: when rx_value == 0 at the local cursor under shared_ring_mode, scan forward for any non-zero rx_flag and jump cursor to the match. Without this, dispatchers livelock at slots the peer just cleared. Added to both host and device paths. * cudaq_realtime_api.cpp: comment documenting that the consumer (not the .so) is responsible for calling cudaq_dispatch_kernel_set_shared_ring_mode() before cudaq_dispatcher_start(). Rationale: dispatch_kernel.cu lives in the hidden-visibility static lib cudaq-realtime-dispatch.a which is --exclude-libs=ALL'd from libcudaq-realtime.so; the .so cannot reach the symbol, but consumers that link the .a directly can. Tests: * test_shared_ring_dispatchers.cu (new): brings up HOST_LOOP + DEVICE_LOOP on one pinned-mapped ring buffer with a 2-entry shared function table (GRAPH_LAUNCH at fid_A handled by HOST_LOOP, DEVICE_CALL at fid_B handled by DEVICE_LOOP). Interleaves 4 RPCs across the table; asserts each completes with the correct dispatcher's transformation and that each dispatcher's slot-count stats are exactly 2. * test_dispatch_kernel, test_host_dispatcher: continue to pass unchanged (shared_ring_mode defaults to 0, preserving today's drop-on-unknown behavior). Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Local cherry-pick of upstream commit 67404ce to restore the noise model from the execution context, which is needed by CUDA-QX MSM generation. Required to bring our locally-built SDK in line with the public 'NVIDIA/cuda-quantum' commit that the cudaqx .cudaq_version pin points at. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Extends cudaq_function_entry_t and cudaq_host_dispatch_worker_t with a uint64_t routing_key, and teaches the host monitor loop to use (function_id, routing_key) when acquiring an idle GRAPH_LAUNCH worker. The runtime routing key is sourced from the request payload's first 8 bytes (arg0) when arg_len >= 8. This unblocks the QEC realtime decoder suite, where N decoders each capture their own graph but all share the same `enqueue_syndromes` function_id; routing_key disambiguates by decoder_id (arg0 per proposals/decoder_server_runtime.md). Backward compatible: workloads that don't sub-route leave the routing_key field zero on both worker registration and (implicitly) in the wire payload, recovering the historical function_id-only match. Device-path (dispatch_kernel.cu) is intentionally untouched -- a follow-on MR will mirror this for kernel-driven dispatch when a workload needs it. Spec reference: proposals/cudaq_realtime_host_api.bs#host-path-graph-routing-key Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
…eads The persistent dispatch kernel polls volatile rx_flags in host-mapped pinned memory; `volatile` suppresses compiler caching but does not invalidate the GPU's L2, so a producer-published flag could remain invisible on the device for many polling iterations. Observed as a ~7%-rate ACK timeout in the cuda-qx 1000-shot surface_code-1 stress test (a published get_corrections RPC sat unprocessed for the full 1s producer timeout while the kernel was hot-looping at ~150 kHz on a stale-zero read). Fix by issuing __threadfence_system() before each of the three rx_flags polling reads in this TU. 30/30 stress runs post-fix, 12/12 ctest passes, no ABI change. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
cuda-qx's libcudaq-qec-realtime-decoding.so resolves cudaq_launch_dispatch_kernel_regular and cudaq_dispatch_kernel_set_shared_- ring_mode at runtime via dlsym(RTLD_DEFAULT, ...) so it can pick them up from whatever exe absorbed libcudaq-realtime-dispatch.a, without any explicit setter or constructor-shim plumbing on the consumer side. The archive is built CXX_VISIBILITY_PRESET=hidden + -Wl,--exclude-libs=ALL, which strips every symbol from the binary's dynamic table when absorbed -- including the two dlsym targets -- so the lookup currently fails. cuda-qx works around this today by compiling a per-test-exe TU that takes the addresses by direct reference and hands them to the .so via a setter. Mark just those two entry points with default visibility (new CUDAQ_REALTIME_DISPATCH_API macro) so they survive the archive's hidden- visibility build flags and reach the binary's --export-dynamic table; the remaining HSB/internal symbols stay hidden (no third-party-leak regression). Verified with readelf -s on the rebuilt archive: the two tagged symbols are now GLOBAL DEFAULT while neighbours like cudaq_dispatch_kernel_cooperative_query_occupancy and cudaq_launch_dispatch_kernel_cooperative remain GLOBAL HIDDEN as before. test_dispatch_kernel / test_host_dispatcher / test_shared_ring_dispatchers all pass against the rebuilt archive (the third one exercises both modified symbols end-to-end). Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
…rom_kernel Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
…rom_kernel Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Move the shared HOST_LOOP + DEVICE_LOOP ring interleaving coverage into the existing test_host_dispatcher.cu CUDA test instead of carrying a new standalone .cu file. The test still validates shared_ring_mode skip behavior across the host and device dispatchers, but now reuses the existing host dispatcher test target and CMake wiring. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>
CI Summary (
|
| Job | Result |
|---|---|
binaries |
⏩ skipped |
build_and_test |
✅ success |
config_devdeps |
✅ success |
config_source_build |
⏩ skipped |
config_wheeldeps |
✅ success |
devdeps |
✅ success |
docker_image |
⏩ skipped |
gen_code_coverage |
⏩ skipped |
metadata |
✅ success |
python_metapackages |
⏩ skipped |
python_wheels |
⏩ skipped |
source_build |
⏩ skipped |
wheeldeps |
✅ success |
⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch
| Job |
|---|
binaries |
config_source_build |
docker_image |
gen_code_coverage |
python_metapackages |
python_wheels |
source_build |
All sub-jobs (42) — every matrix leg, with links
| Job | Status | Link |
|---|---|---|
| Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| CI Summary | ❔ in_progress | view |
| Configure build (devdeps) | ✅ success | view |
| Configure build (source_build) | ⏩ skipped | view |
| Configure build (wheeldeps) | ✅ success | view |
| Create CUDA Quantum installer | ⏩ skipped | view |
| Create Docker images | ⏩ skipped | view |
| Create Python metapackages | ⏩ skipped | view |
| Create Python wheels | ⏩ skipped | view |
| Gen code coverage | ⏩ skipped | view |
| Load dependencies (amd64, gcc12) / Caching | ✅ success | view |
| Load dependencies (amd64, gcc12) / Finalize | ✅ success | view |
| Load dependencies (amd64, gcc12) / Metadata | ✅ success | view |
| Load dependencies (amd64, llvm) / Caching | ✅ success | view |
| Load dependencies (amd64, llvm) / Finalize | ✅ success | view |
| Load dependencies (amd64, llvm) / Metadata | ✅ success | view |
| Load dependencies (arm64, gcc12) / Caching | ✅ success | view |
| Load dependencies (arm64, gcc12) / Finalize | ✅ success | view |
| Load dependencies (arm64, gcc12) / Metadata | ✅ success | view |
| Load dependencies (arm64, llvm) / Caching | ✅ success | view |
| Load dependencies (arm64, llvm) / Finalize | ✅ success | view |
| Load dependencies (arm64, llvm) / Metadata | ✅ success | view |
| Load source build cache | ⏩ skipped | view |
| Load wheel dependencies (amd64, 12.6) / Caching | ✅ success | view |
| Load wheel dependencies (amd64, 12.6) / Finalize | ✅ success | view |
| Load wheel dependencies (amd64, 12.6) / Metadata | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Caching | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Finalize | ✅ success | view |
| Load wheel dependencies (amd64, 13.0) / Metadata | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Caching | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Finalize | ✅ success | view |
| Load wheel dependencies (arm64, 12.6) / Metadata | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Caching | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Finalize | ✅ success | view |
| Load wheel dependencies (arm64, 13.0) / Metadata | ✅ success | view |
| Prepare cache clean-up | ❔ in_progress | view |
| Retrieve PR info | ✅ success | view |
✅ Required checks (6/6) — declared in .github/required-checks.yml for push
| Required check | Status | Link |
|---|---|---|
| Build and test (amd64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (arm64, llvm, openmpi) / Dev environment (Python) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Debug) | ✅ success | view |
| Build and test (amd64, gcc12, openmpi) / Dev environment (Python) | ✅ success | view |
1tnguyen
reviewed
Jun 10, 2026
| /// function_id and routing_key match. Set to 0 when sub-routing isn't | ||
| /// needed (the historical function_id-only match). | ||
| /// See proposals/cudaq_realtime_host_api.bs#host-path-graph-routing-key. | ||
| uint64_t routing_key; |
Collaborator
There was a problem hiding this comment.
Q: do we need to initialize this routing_key in cudaq_host_dispatcher_start_thread similar to other fields?
1tnguyen
reviewed
Jun 10, 2026
Comment on lines
+18
to
+19
| // dlsym(RTLD_DEFAULT, | ||
| // ...) at runtime. libcudaq-realtime-dispatch.a is built with hidden |
Collaborator
There was a problem hiding this comment.
nits: line break (clang-format fallout, perhaps)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds the realtime dispatcher features needed for mixed HOST_LOOP + DEVICE_LOOP QEC workloads that share one RX/TX ring:
shared_ring_modeso peer dispatchers skip, rather than drop, slots owned by another dispatch path.routing_keysub-routing forGRAPH_LAUNCHworkers that share the samefunction_id.