Add shared-ring realtime dispatch routing support by cketcham2333 · Pull Request #4712 · NVIDIA/cuda-quantum

cketcham2333 · 2026-06-10T17:42:57Z

Adds the realtime dispatcher features needed for mixed HOST_LOOP + DEVICE_LOOP QEC workloads that share one RX/TX ring:

Add shared_ring_mode so peer dispatchers skip, rather than drop, slots owned by another dispatch path.
Add routing_key sub-routing for GRAPH_LAUNCH workers that share the same function_id.
Add a device-side rx_flags polling fence to avoid stale reads from host-pinned mapped memory.
Export the shared-ring mode setter and regular dispatch-kernel launch helper for downstream runtime consumers.

Enable HOST_LOOP (CPU thread) and DEVICE_LOOP (persistent GPU kernel) dispatchers to cooperate on a single shared ring buffer. Today both dispatchers DROP slots whose function_id is not in their own table (clearing rx_flags), which would race with the cooperator's processing. With shared_ring_mode=1, unknown function_ids are SKIPPED instead -- the local cursor advances but rx_flags stays set so the matching dispatcher can claim the slot. This is a prerequisite for QEC trio dispatch (enqueue_syndromes via HOST_LOOP-launched per-round CUDA graph; get_corrections and reset_decoder via DEVICE_LOOP __device__ functions) sharing one ring. Changes: * cudaq_realtime.h: add shared_ring_mode field on cudaq_dispatcher_config_t; declare new C API cudaq_dispatch_kernel_set_shared_ring_mode(). * host_dispatcher.cu: parse_slot_with_function_table distinguishes drop (bad magic, clear rx_flags) from skip (unknown fid under shared mode, leave rx_flags set). * dispatch_kernel.cu: add __constant__ g_dispatch_shared_ring_mode and cudaMemcpyToSymbol setter. Mirror the skip-vs-drop logic in all three kernel paths (cooperative dispatch_kernel_device_call_only, regular dispatch_kernel_device_call_only, dispatch_kernel_with_graph). * Opportunistic ring scan: when rx_value == 0 at the local cursor under shared_ring_mode, scan forward for any non-zero rx_flag and jump cursor to the match. Without this, dispatchers livelock at slots the peer just cleared. Added to both host and device paths. * cudaq_realtime_api.cpp: comment documenting that the consumer (not the .so) is responsible for calling cudaq_dispatch_kernel_set_shared_ring_mode() before cudaq_dispatcher_start(). Rationale: dispatch_kernel.cu lives in the hidden-visibility static lib cudaq-realtime-dispatch.a which is --exclude-libs=ALL'd from libcudaq-realtime.so; the .so cannot reach the symbol, but consumers that link the .a directly can. Tests: * test_shared_ring_dispatchers.cu (new): brings up HOST_LOOP + DEVICE_LOOP on one pinned-mapped ring buffer with a 2-entry shared function table (GRAPH_LAUNCH at fid_A handled by HOST_LOOP, DEVICE_CALL at fid_B handled by DEVICE_LOOP). Interleaves 4 RPCs across the table; asserts each completes with the correct dispatcher's transformation and that each dispatcher's slot-count stats are exactly 2. * test_dispatch_kernel, test_host_dispatcher: continue to pass unchanged (shared_ring_mode defaults to 0, preserving today's drop-on-unknown behavior). Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

Local cherry-pick of upstream commit 67404ce to restore the noise model from the execution context, which is needed by CUDA-QX MSM generation. Required to bring our locally-built SDK in line with the public 'NVIDIA/cuda-quantum' commit that the cudaqx .cudaq_version pin points at. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

Extends cudaq_function_entry_t and cudaq_host_dispatch_worker_t with a uint64_t routing_key, and teaches the host monitor loop to use (function_id, routing_key) when acquiring an idle GRAPH_LAUNCH worker. The runtime routing key is sourced from the request payload's first 8 bytes (arg0) when arg_len >= 8. This unblocks the QEC realtime decoder suite, where N decoders each capture their own graph but all share the same `enqueue_syndromes` function_id; routing_key disambiguates by decoder_id (arg0 per proposals/decoder_server_runtime.md). Backward compatible: workloads that don't sub-route leave the routing_key field zero on both worker registration and (implicitly) in the wire payload, recovering the historical function_id-only match. Device-path (dispatch_kernel.cu) is intentionally untouched -- a follow-on MR will mirror this for kernel-driven dispatch when a workload needs it. Spec reference: proposals/cudaq_realtime_host_api.bs#host-path-graph-routing-key Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

…eads The persistent dispatch kernel polls volatile rx_flags in host-mapped pinned memory; `volatile` suppresses compiler caching but does not invalidate the GPU's L2, so a producer-published flag could remain invisible on the device for many polling iterations. Observed as a ~7%-rate ACK timeout in the cuda-qx 1000-shot surface_code-1 stress test (a published get_corrections RPC sat unprocessed for the full 1s producer timeout while the kernel was hot-looping at ~150 kHz on a stale-zero read). Fix by issuing __threadfence_system() before each of the three rx_flags polling reads in this TU. 30/30 stress runs post-fix, 12/12 ctest passes, no ABI change. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

cuda-qx's libcudaq-qec-realtime-decoding.so resolves cudaq_launch_dispatch_kernel_regular and cudaq_dispatch_kernel_set_shared_- ring_mode at runtime via dlsym(RTLD_DEFAULT, ...) so it can pick them up from whatever exe absorbed libcudaq-realtime-dispatch.a, without any explicit setter or constructor-shim plumbing on the consumer side. The archive is built CXX_VISIBILITY_PRESET=hidden + -Wl,--exclude-libs=ALL, which strips every symbol from the binary's dynamic table when absorbed -- including the two dlsym targets -- so the lookup currently fails. cuda-qx works around this today by compiling a per-test-exe TU that takes the addresses by direct reference and hands them to the .so via a setter. Mark just those two entry points with default visibility (new CUDAQ_REALTIME_DISPATCH_API macro) so they survive the archive's hidden- visibility build flags and reach the binary's --export-dynamic table; the remaining HSB/internal symbols stay hidden (no third-party-leak regression). Verified with readelf -s on the rebuilt archive: the two tagged symbols are now GLOBAL DEFAULT while neighbours like cudaq_dispatch_kernel_cooperative_query_occupancy and cudaq_launch_dispatch_kernel_cooperative remain GLOBAL HIDDEN as before. test_dispatch_kernel / test_host_dispatcher / test_shared_ring_dispatchers all pass against the rebuilt archive (the third one exercises both modified symbols end-to-end). Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

…rom_kernel Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

Move the shared HOST_LOOP + DEVICE_LOOP ring interleaving coverage into the existing test_host_dispatcher.cu CUDA test instead of carrying a new standalone .cu file. The test still validates shared_ring_mode skip behavior across the host and device dispatchers, but now reuses the existing host dispatcher test target and CMake wiring. Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

github-actions · 2026-06-10T18:36:43Z

CI Summary (`push`) — ✅ passed

Run #27295382846 · ✅ 6 · ⏩ 7 · ❌ 0 · ⛔ 0

Top-level jobs (13)

Job	Result
`binaries`	⏩ skipped
`build_and_test`	✅ success
`config_devdeps`	✅ success
`config_source_build`	⏩ skipped
`config_wheeldeps`	✅ success
`devdeps`	✅ success
`docker_image`	⏩ skipped
`gen_code_coverage`	⏩ skipped
`metadata`	✅ success
`python_metapackages`	⏩ skipped
`python_wheels`	⏩ skipped
`source_build`	⏩ skipped
`wheeldeps`	✅ success

⏩ Skipped jobs (7) — intentionally skipped on PR builds; run on merge_group / workflow_dispatch

Job
`binaries`
`config_source_build`
`docker_image`
`gen_code_coverage`
`python_metapackages`
`python_wheels`
`source_build`

All sub-jobs (42) — every matrix leg, with links

Job	Status	Link
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python)	✅ success	view
Build and test (amd64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, llvm, openmpi) / Dev environment (Python)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Python)	✅ success	view
CI Summary	❔ in_progress	view
Configure build (devdeps)	✅ success	view
Configure build (source_build)	⏩ skipped	view
Configure build (wheeldeps)	✅ success	view
Create CUDA Quantum installer	⏩ skipped	view
Create Docker images	⏩ skipped	view
Create Python metapackages	⏩ skipped	view
Create Python wheels	⏩ skipped	view
Gen code coverage	⏩ skipped	view
Load dependencies (amd64, gcc12) / Caching	✅ success	view
Load dependencies (amd64, gcc12) / Finalize	✅ success	view
Load dependencies (amd64, gcc12) / Metadata	✅ success	view
Load dependencies (amd64, llvm) / Caching	✅ success	view
Load dependencies (amd64, llvm) / Finalize	✅ success	view
Load dependencies (amd64, llvm) / Metadata	✅ success	view
Load dependencies (arm64, gcc12) / Caching	✅ success	view
Load dependencies (arm64, gcc12) / Finalize	✅ success	view
Load dependencies (arm64, gcc12) / Metadata	✅ success	view
Load dependencies (arm64, llvm) / Caching	✅ success	view
Load dependencies (arm64, llvm) / Finalize	✅ success	view
Load dependencies (arm64, llvm) / Metadata	✅ success	view
Load source build cache	⏩ skipped	view
Load wheel dependencies (amd64, 12.6) / Caching	✅ success	view
Load wheel dependencies (amd64, 12.6) / Finalize	✅ success	view
Load wheel dependencies (amd64, 12.6) / Metadata	✅ success	view
Load wheel dependencies (amd64, 13.0) / Caching	✅ success	view
Load wheel dependencies (amd64, 13.0) / Finalize	✅ success	view
Load wheel dependencies (amd64, 13.0) / Metadata	✅ success	view
Load wheel dependencies (arm64, 12.6) / Caching	✅ success	view
Load wheel dependencies (arm64, 12.6) / Finalize	✅ success	view
Load wheel dependencies (arm64, 12.6) / Metadata	✅ success	view
Load wheel dependencies (arm64, 13.0) / Caching	✅ success	view
Load wheel dependencies (arm64, 13.0) / Finalize	✅ success	view
Load wheel dependencies (arm64, 13.0) / Metadata	✅ success	view
Prepare cache clean-up	❔ in_progress	view
Retrieve PR info	✅ success	view

✅ Required checks (6/6) — declared in .github/required-checks.yml for push

Required check	Status	Link
Build and test (amd64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, llvm, openmpi) / Dev environment (Python)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (arm64, llvm, openmpi) / Dev environment (Python)	✅ success	view
Build and test (amd64, gcc12, openmpi) / Dev environment (Debug)	✅ success	view
Build and test (amd64, gcc12, openmpi) / Dev environment (Python)	✅ success	view

1tnguyen · 2026-06-10T23:04:21Z

+  /// function_id and routing_key match.  Set to 0 when sub-routing isn't
+  /// needed (the historical function_id-only match).
+  /// See proposals/cudaq_realtime_host_api.bs#host-path-graph-routing-key.
+  uint64_t routing_key;


Q: do we need to initialize this routing_key in cudaq_host_dispatcher_start_thread similar to other fields?

1tnguyen · 2026-06-10T23:11:01Z

+// dlsym(RTLD_DEFAULT,
+// ...) at runtime.  libcudaq-realtime-dispatch.a is built with hidden


nits: line break (clang-format fallout, perhaps)

cketcham2333 added 9 commits May 17, 2026 22:00

Merge remote-tracking branch 'upstream/main' into realtime_dispatch_f…

f74ff54

…rom_kernel Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

Merge remote-tracking branch 'upstream/main' into realtime_dispatch_f…

d31272d

…rom_kernel Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

clang format

aacd480

Signed-off-by: Chuck Ketcham <cketcham@nvidia.com>

cketcham2333 marked this pull request as ready for review June 10, 2026 18:41

cketcham2333 requested review from 1tnguyen, bettinaheim and bmhowe23 June 10, 2026 18:42

1tnguyen reviewed Jun 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add shared-ring realtime dispatch routing support#4712

Add shared-ring realtime dispatch routing support#4712
cketcham2333 wants to merge 9 commits into
NVIDIA:mainfrom
cketcham2333:realtime_dispatch_from_kernel

cketcham2333 commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

Uh oh!

1tnguyen Jun 10, 2026

Uh oh!

1tnguyen Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// dlsym(RTLD_DEFAULT,
		// ...) at runtime. libcudaq-realtime-dispatch.a is built with hidden

Conversation

cketcham2333 commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026

CI Summary (push) — ✅ passed

Uh oh!

1tnguyen Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

1tnguyen Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CI Summary (`push`) — ✅ passed