
Windows port of NCCL #2068

Open
peterkisfaludi wants to merge 3 commits into NVIDIA:master from peterkisfaludi:windows-port

Conversation

@peterkisfaludi

@peterkisfaludi peterkisfaludi commented Mar 24, 2026

Description

  • Enable compiling NCCL on Windows OS over PCIe
  • Enable Point to Point (send, recv) and Collective operations (AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AlltoAll, Gather, Scatter)
  • Add unit tests for p2p and collective operations

Tested on 8xA40 GPUs with PCIe connection
Test results:

  NCCL unit tests — 8 GPU(s) detected

  PASS [AllReduce]
  PASS [AllGather]
  PASS [Broadcast]
  PASS [Reduce]
  PASS [ReduceScatter]
  PASS [Scatter]
  PASS [Gather]
  PASS [AlltoAll]
  PASS [SendRecv]
  PASS [MultiOpStream]

Note: the SHM and P2P fixes are workarounds for WDDM-specific driver/hardware behavior (not upstream NCCL bugs). These are minimal OS-specific guards needed for WDDM coherence.

Related Issues

Mentions #1995 (NCCL Windows Platform Support): brings multi-GPU communication to Windows environments.

Changes & Impact

N/A

Performance Impact

Not benchmarked yet; this port has so far focused on correctness rather than performance.

Peter Kisfaludi and others added 3 commits March 24, 2026 15:30
Build system:
- CMakeLists.txt: add Windows platform detection, MSVC flags, DLL
  export target, conditional linking (no ibverbs/mlx5 on Windows),
  Ninja generator support; add unit-test and demo targets
- build.bat: Windows one-click CMake configure + Ninja build script
- src/misc/gen_nccl_h.py: generate nccl.h on Windows (no sed)
- src/nccl.def: DLL export map for MSVC linker
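The sed-free header generation mentioned for src/misc/gen_nccl_h.py can be sketched as a small substitution pass over nccl.h.in. This is a minimal illustration only: the `${nccl:...}` placeholder names and the version-code encoding are assumptions based on NCCL's Makefile-based generation, not the script's actual contents.

```python
# Hypothetical sketch of a sed-free nccl.h generator (placeholder names
# like ${nccl:Major} are assumptions, not the real script's identifiers).
import re

def generate_nccl_h(template: str, major: int, minor: int,
                    patch: int, suffix: str = "") -> str:
    """Expand version placeholders in an nccl.h.in-style template
    without shelling out to sed (which is unavailable on Windows)."""
    version = major * 10000 + minor * 100 + patch  # assumed version encoding
    subs = {
        "${nccl:Major}": str(major),
        "${nccl:Minor}": str(minor),
        "${nccl:Patch}": str(patch),
        "${nccl:Suffix}": suffix,
        "${nccl:Version}": str(version),
    }
    # One combined regex so every placeholder is replaced in a single pass.
    pattern = re.compile("|".join(re.escape(k) for k in subs))
    return pattern.sub(lambda m: subs[m.group(0)], template)

if __name__ == "__main__":
    template = ("#define NCCL_MAJOR ${nccl:Major}\n"
                "#define NCCL_VERSION_CODE ${nccl:Version}\n")
    print(generate_nccl_h(template, 2, 30, 0), end="")
```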

Platform stubs and OS abstraction:
- src/os/windows.cc / windows_stubs.cc: implement POSIX-like APIs
  (pthread, semaphore, mmap, clock_gettime, sysconf, popen, etc.)
- src/include/dlfcn_win.h: dlopen/dlsym shim via LoadLibrary
- src/include/nccl_win_pch.h: Windows precompiled-header guard
- src/include/compiler/msvc.h: MSVC attribute / intrinsic compat
- src/os/linux.cc: guard Linux-only includes

MSVC / CCCL compatibility throughout the codebase:
- Replace Linux-only headers (unistd.h, sys/mman.h, etc.) with
  Windows equivalents or guards in: alloc.h, bitops.h, collectives.h,
  core.h, debug.h, gdrwrap.h, graph.h, ipcsocket.h, nccl_common.h,
  ras.h, socket.h, os.h, nccl.h.in, and all plugin/* wrappers
- Fix __attribute__ → __declspec and packed-struct attributes in
  gin_proxy_device_host_common.h, gin_gdaki*.h, nccl_device/utility.h
- Guard ibverbs / mlx5dv / RAS / GDR symbols with NCCL_OS_LINUX
- Guard nvtx3 C++ API with Windows-safe include order (nvtxImpl.h)
- Fix CCCL/Thrust headers for MSVC in device/prims_ll*.h,
  reduce_kernel.h, device/generate.py, device/symmetric/generate.py

Runtime portability:
- src/allocator.cc, init.cc, proxy.cc, mem_manager.cc, bootstrap.cc,
  transport.cc, dev_runtime.cc, debug.cc, graph/topo.cc,
  graph/tuning.cc, graph/xml.{cc,h}: guard Linux-only ioctls,
  fork/exec, NUMA, proc-filesystem reads, and ibverbs calls
- src/graph/topo.cc: tolerate missing "arch" attribute in topology XML
  (Windows topology generator does not emit it); default to x86_64
- src/include/device.h: set ncclCollUnroll=4 for SM_120 (workstation
  Blackwell has only 53 KB shared memory; unroll=8 overflows)
- src/misc/shmutils.cc (calloc NULL-check), param.cc, socket.cc,
  utils.cc, misc/*wrap.cc: Windows-safe implementations
- src/transport/net.cc, net_socket.cc: guard Linux-only socket opts
- src/rma/rma_proxy.cc, scheduler/allgatherv_sched.cc,
  gin/gin_host_proxy.{cc,h}, ras/ras.cc: guard Linux-only paths

Tests and samples:
- tests/unit/nccl_tests.cu: Windows-compatible test harness
- tests/demo/nccl_demo.cpp, samples/nccl_demo.cu: portable demo apps

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
On Windows WDDM, GPU SM stores go to GPU L2 and are not automatically
flushed to DRAM; CPU reads from mapped memory see only DRAM.  Standard
NCCL SHM transport assumes UVA coherence that doesn't hold on WDDM.

Seven changes to restore correctness:

1. Zero-copy devFifo (src/transport/shm.cc): allocate connFifo/devFifo
   with cudaHostAlloc(Mapped|Portable) so the GPU writes go through
   the PCIe BAR directly to host DRAM, bypassing GPU L2.  The CPU proxy
   reads the fifo size/step without requiring explicit flushes.

2. ceRecvMemDev VRAM shadow (src/transport/shm.cc): for the CE-based
   recv proxy, allocate a VRAM mirror of the step counter via
   cudaMemcpyAsync (H2D CE engine).  CE writes bypass GPU SM L2 and
   land directly in VRAM DRAM, making them visible to GPU ld.cv reads.

3. ld.cv.global for RoleWaitRecv (src/device/prims_simple.h): use the
   cache-invalidating load (bypasses L1+L2) so the recv-side thread
   sees the CE-written step value in VRAM DRAM immediately.

4. atomicExch_system for connFifo/step writes from GPU
   (src/device/prims_simple.h): GPU RolePostSend/RolePostRecv must
   write through L2 to DRAM for the CPU proxy; atomicExch_system
   issues a system-scope atomic that bypasses GPU L2.

5. LL protocol disabled for SHM (src/transport/shm.cc): force
   buffs[NCCL_PROTO_LL] = buffs[NCCL_PROTO_LL128] = nullptr so the
   channel allocator always selects SIMPLE protocol for SHM.  The
   LL busy-poll loop has the same DRAM-visibility issue and fixing it
   would require additional CE shadow buffers; SIMPLE is sufficient
   for PCIe bandwidth.

6. Two-scalar loads in loadWorkBatchToShmem (src/device/common.h):
   MSVC/CUDA on WDDM generates ld.v2.u64 (16-byte vector loads) whose
   .y lane returns 0 for every access.  Use two sequential volatile
   scalar loads (ld.volatile.global u64 × 2) to work around the bug.

7. shmutils.cc NULL check: handle calloc returning NULL (defensive).

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
P2P sendrecv operations (gather, scatter, all_to_all) hit CUDA error 719
(unspecified launch failure) on WDDM when the data payload is below the
LL threshold (nChannels × 16384 bytes, typically 32 KB with 2 channels).

Root cause: addP2pToPlan selects LL protocol for small transfers because
LL has lower latency on Linux.  LL relies on 64-bit flag+data atoms
being coherent across process boundaries via NVLink or PCIe P2P.  On
WDDM, cross-process VRAM mappings (IPC handles) do not provide the
required cache coherence for LL's busy-poll pattern; the receiving rank
sees a stale flag value and the GPU kernel hangs, triggering TDR.

Fix: guard the protoLL initialization in addP2pToPlan with
NCCL_OS_WINDOWS and force both LL and LL128 off, so all P2P sendrecv
always uses SIMPLE protocol on WDDM.  SIMPLE is correct and performant
for PCIe P2P.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
@peterkisfaludi
Author

@nv-udeodhar
Are these changes in line with what you had in mind?
If there is a gap, I can modify this MR to align with the design goals.
Thank you!

@xiaofanl-nvidia
Collaborator

@peterkisfaludi Thank you for the contribution! Just FYI: we already have a lot of Windows changes on the v2.30 branch (https://github.com/NVIDIA/nccl/tree/v2.30), with minimum tests passing there. I'll let @nv-udeodhar review and comment on whether we can easily leverage some of your work!

