Build system:
- CMakeLists.txt: add Windows platform detection, MSVC flags, DLL
export target, conditional linking (no ibverbs/mlx5 on Windows),
Ninja generator support; add unit-test and demo targets
- build.bat: Windows one-click CMake configure + Ninja build script
- src/misc/gen_nccl_h.py: generate nccl.h on Windows (no sed)
- src/nccl.def: DLL export map for MSVC linker
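A module-definition file for the MSVC linker follows the standard .def format; a hypothetical excerpt is shown below (the real src/nccl.def lists the full public API):

```text
LIBRARY nccl
EXPORTS
    ncclGetVersion
    ncclGetUniqueId
    ncclCommInitRank
    ncclCommDestroy
    ncclAllReduce
```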
Platform stubs and OS abstraction:
- src/os/windows.cc / windows_stubs.cc: implement POSIX-like APIs
(pthread, semaphore, mmap, clock_gettime, sysconf, popen, etc.)
- src/include/dlfcn_win.h: dlopen/dlsym shim via LoadLibrary
- src/include/nccl_win_pch.h: Windows precompiled-header guard
- src/include/compiler/msvc.h: MSVC attribute / intrinsic compat
- src/os/linux.cc: guard Linux-only includes
MSVC / CCCL compatibility throughout the codebase:
- Replace Linux-only headers (unistd.h, sys/mman.h, etc.) with
Windows equivalents or guards in: alloc.h, bitops.h, collectives.h,
core.h, debug.h, gdrwrap.h, graph.h, ipcsocket.h, nccl_common.h,
ras.h, socket.h, os.h, nccl.h.in, and all plugin/* wrappers
- Fix __attribute__ → __declspec and packed-struct attributes in
gin_proxy_device_host_common.h, gin_gdaki*.h, nccl_device/utility.h
- Guard ibverbs / mlx5dv / RAS / GDR symbols with NCCL_OS_LINUX
- Guard nvtx3 C++ API with Windows-safe include order (nvtxImpl.h)
- Fix CCCL/Thrust headers for MSVC in device/prims_ll*.h,
reduce_kernel.h, device/generate.py, device/symmetric/generate.py
Runtime portability:
- src/allocator.cc, init.cc, proxy.cc, mem_manager.cc, bootstrap.cc,
transport.cc, dev_runtime.cc, debug.cc, graph/topo.cc,
graph/tuning.cc, graph/xml.{cc,h}: guard Linux-only ioctls,
fork/exec, NUMA, proc-filesystem reads, and ibverbs calls
- src/graph/topo.cc: tolerate missing "arch" attribute in topology XML
(Windows topology generator does not emit it); default to x86_64
- src/include/device.h: set ncclCollUnroll=4 for SM_120 (workstation
Blackwell has only 53 KB shared memory; unroll=8 overflows)
- src/misc/shmutils.cc (calloc NULL-check), param.cc, socket.cc,
utils.cc, misc/*wrap.cc: Windows-safe implementations
- src/transport/net.cc, net_socket.cc: guard Linux-only socket opts
- src/rma/rma_proxy.cc, scheduler/allgatherv_sched.cc,
gin/gin_host_proxy.{cc,h}, ras/ras.cc: guard Linux-only paths
Tests and samples:
- tests/unit/nccl_tests.cu: Windows-compatible test harness
- tests/demo/nccl_demo.cpp, samples/nccl_demo.cu: portable demo apps
Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
On Windows WDDM, GPU SM stores land in GPU L2 and are not automatically flushed to DRAM, while CPU reads from mapped memory see only DRAM. The standard NCCL SHM transport assumes UVA coherence that does not hold on WDDM. Seven changes restore correctness:

1. Zero-copy devFifo (src/transport/shm.cc): allocate connFifo/devFifo with cudaHostAlloc(Mapped|Portable) so GPU writes go through the PCIe BAR directly to host DRAM, bypassing GPU L2. The CPU proxy can then read the fifo size/step without explicit flushes.
2. ceRecvMemDev VRAM shadow (src/transport/shm.cc): for the CE-based recv proxy, maintain a VRAM mirror of the step counter updated via cudaMemcpyAsync (H2D copy engine). CE writes bypass the GPU SM L2 and land directly in VRAM, making them visible to GPU ld.cv reads.
3. ld.cv.global for RoleWaitRecv (src/device/prims_simple.h): use the cache-invalidating load (bypasses L1 and L2) so the recv-side thread sees the CE-written step value in VRAM immediately.
4. atomicExch_system for connFifo/step writes from the GPU (src/device/prims_simple.h): RolePostSend/RolePostRecv must write through L2 to DRAM for the CPU proxy; atomicExch_system issues a system-scope atomic that bypasses GPU L2.
5. LL protocol disabled for SHM (src/transport/shm.cc): force buffs[NCCL_PROTO_LL] = buffs[NCCL_PROTO_LL128] = nullptr so the channel allocator always selects the SIMPLE protocol for SHM. The LL busy-poll loop has the same DRAM-visibility issue, and fixing it would require additional CE shadow buffers; SIMPLE is sufficient for PCIe bandwidth.
6. Two scalar loads in loadWorkBatchToShmem (src/device/common.h): MSVC/CUDA on WDDM generates ld.v2.u64 (16-byte vector loads) whose .y lane returns 0 for every access. Use two sequential volatile scalar loads (ld.volatile.global.u64 × 2) to work around the bug.
7. shmutils.cc NULL check: handle calloc returning NULL (defensive).

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
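The two device-side visibility fixes (the cache-invalidating consumer load and the system-scope producer store) can be sketched as below. This is a sketch, not the actual NCCL code; the function names are illustrative, and only the load/store idioms match what the commit describes.

```cuda
// Sketch of the device-side WDDM visibility idioms; names are hypothetical.
#include <cuda_runtime.h>
#include <cstdint>

// Consumer side (cf. RoleWaitRecv): ld.cv bypasses L1 and L2 so the
// CE-written step counter in VRAM is observed immediately.
__device__ uint64_t loadStepCv(const uint64_t* step) {
  uint64_t v;
  asm volatile("ld.cv.global.u64 %0, [%1];" : "=l"(v) : "l"(step) : "memory");
  return v;
}

// Producer side (cf. RolePostSend/RolePostRecv): a system-scope atomic
// exchange writes through GPU L2 so the CPU proxy polling host memory sees it.
__device__ void storeStepSys(uint64_t* step, uint64_t v) {
  atomicExch_system(reinterpret_cast<unsigned long long*>(step),
                    static_cast<unsigned long long>(v));
}
```

The pairing matters: a plain store could sit in GPU L2 indefinitely under WDDM, and a plain load could return a stale cached value; the system-scope atomic and the .cv cache operator each force the traffic all the way to DRAM.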
P2P sendrecv operations (gather, scatter, all_to_all) hit CUDA error 719 (unspecified launch failure) on WDDM when the data payload is below the LL threshold (nChannels × 16384 bytes, typically 32 KB with 2 channels).

Root cause: addP2pToPlan selects the LL protocol for small transfers because LL has lower latency on Linux. LL relies on 64-bit flag+data atoms being coherent across process boundaries via NVLink or PCIe P2P. On WDDM, cross-process VRAM mappings (IPC handles) do not provide the cache coherence required by LL's busy-poll pattern; the receiving rank sees a stale flag value and the GPU kernel hangs, triggering TDR.

Fix: guard the protoLL initialization in addP2pToPlan with NCCL_OS_WINDOWS and force both LL and LL128 off, so all P2P sendrecv always uses the SIMPLE protocol on WDDM. SIMPLE is correct and performant for PCIe P2P.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
@peterkisfaludi Thank you for the contribution! Just FYI: we already have a lot of Windows changes on the v2.30 branch (https://github.com/NVIDIA/nccl/tree/v2.30), with minimum tests passing there. I'll let @nv-udeodhar review and comment on whether we can easily leverage some of your work!
Description
Tested on 8× A40 GPUs connected over PCIe.
Test results:
Note: the SHM and P2P fixes are workarounds for WDDM-specific driver/hardware behavior (not upstream NCCL bugs). These are minimal OS-specific guards needed for WDDM coherence.
Related Issues
#1995
NCCL Windows Platform Support: Brings multi-GPU communication to Windows environments.
Changes & Impact
N/A
Performance Impact
Not benchmarked yet; this work focused on correctness rather than performance.