
Windows port of NCCL #2068

Open
peterkisfaludi wants to merge 3 commits into NVIDIA:master from peterkisfaludi:windows-port

Conversation

@peterkisfaludi

@peterkisfaludi peterkisfaludi commented Mar 24, 2026

Description

  • Enable compiling NCCL on Windows OS over PCIe
  • Enable Point to Point (send, recv) and Collective operations (AllReduce, Broadcast, Reduce, AllGather, ReduceScatter, AlltoAll, Gather, Scatter)
  • Add unit tests for p2p and collective operations

Tested on 8xA40 GPUs with PCIe connection
Test results:

  NCCL unit tests — 8 GPU(s) detected

  PASS [AllReduce]
  PASS [AllGather]
  PASS [Broadcast]
  PASS [Reduce]
  PASS [ReduceScatter]
  PASS [Scatter]
  PASS [Gather]
  PASS [AlltoAll]
  PASS [SendRecv]
  PASS [MultiOpStream]

Note: the SHM and P2P fixes are workarounds for WDDM-specific driver/hardware behavior (not upstream NCCL bugs). These are minimal OS-specific guards needed for WDDM coherence.

Related Issues

Mentions #1995 (NCCL Windows Platform Support): brings multi-GPU communication to Windows environments.

Changes & Impact

N/A

Performance Impact

Not benchmarked yet; this port has so far focused on correctness rather than performance.

Peter Kisfaludi and others added 3 commits March 24, 2026 15:30
Build system:
- CMakeLists.txt: add Windows platform detection, MSVC flags, DLL
  export target, conditional linking (no ibverbs/mlx5 on Windows),
  Ninja generator support; add unit-test and demo targets
- build.bat: Windows one-click CMake configure + Ninja build script
- src/misc/gen_nccl_h.py: generate nccl.h on Windows (no sed)
- src/nccl.def: DLL export map for MSVC linker
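The sed-free header generation mentioned for src/misc/gen_nccl_h.py can be sketched as a small substitution pass over nccl.h.in. This is a minimal illustration only: the `${nccl:...}` placeholder names and the version-code encoding are assumptions based on NCCL's Makefile-based generation, not the script's actual contents.

```python
# Hypothetical sketch of a sed-free nccl.h generator (placeholder names
# like ${nccl:Major} are assumptions, not the real script's identifiers).
import re

def generate_nccl_h(template: str, major: int, minor: int,
                    patch: int, suffix: str = "") -> str:
    """Expand version placeholders in an nccl.h.in-style template
    without shelling out to sed (which is unavailable on Windows)."""
    version = major * 10000 + minor * 100 + patch  # assumed version encoding
    subs = {
        "${nccl:Major}": str(major),
        "${nccl:Minor}": str(minor),
        "${nccl:Patch}": str(patch),
        "${nccl:Suffix}": suffix,
        "${nccl:Version}": str(version),
    }
    # One combined regex so every placeholder is replaced in a single pass.
    pattern = re.compile("|".join(re.escape(k) for k in subs))
    return pattern.sub(lambda m: subs[m.group(0)], template)

if __name__ == "__main__":
    template = ("#define NCCL_MAJOR ${nccl:Major}\n"
                "#define NCCL_VERSION_CODE ${nccl:Version}\n")
    print(generate_nccl_h(template, 2, 30, 0), end="")
```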

Platform stubs and OS abstraction:
- src/os/windows.cc / windows_stubs.cc: implement POSIX-like APIs
  (pthread, semaphore, mmap, clock_gettime, sysconf, popen, etc.)
- src/include/dlfcn_win.h: dlopen/dlsym shim via LoadLibrary
- src/include/nccl_win_pch.h: Windows precompiled-header guard
- src/include/compiler/msvc.h: MSVC attribute / intrinsic compat
- src/os/linux.cc: guard Linux-only includes

MSVC / CCCL compatibility throughout the codebase:
- Replace Linux-only headers (unistd.h, sys/mman.h, etc.) with
  Windows equivalents or guards in: alloc.h, bitops.h, collectives.h,
  core.h, debug.h, gdrwrap.h, graph.h, ipcsocket.h, nccl_common.h,
  ras.h, socket.h, os.h, nccl.h.in, and all plugin/* wrappers
- Fix __attribute__ → __declspec and packed-struct attributes in
  gin_proxy_device_host_common.h, gin_gdaki*.h, nccl_device/utility.h
- Guard ibverbs / mlx5dv / RAS / GDR symbols with NCCL_OS_LINUX
- Guard nvtx3 C++ API with Windows-safe include order (nvtxImpl.h)
- Fix CCCL/Thrust headers for MSVC in device/prims_ll*.h,
  reduce_kernel.h, device/generate.py, device/symmetric/generate.py

Runtime portability:
- src/allocator.cc, init.cc, proxy.cc, mem_manager.cc, bootstrap.cc,
  transport.cc, dev_runtime.cc, debug.cc, graph/topo.cc,
  graph/tuning.cc, graph/xml.{cc,h}: guard Linux-only ioctls,
  fork/exec, NUMA, proc-filesystem reads, and ibverbs calls
- src/graph/topo.cc: tolerate missing "arch" attribute in topology XML
  (Windows topology generator does not emit it); default to x86_64
- src/include/device.h: set ncclCollUnroll=4 for SM_120 (workstation
  Blackwell has only 53 KB shared memory; unroll=8 overflows)
- src/misc/shmutils.cc (calloc NULL-check), param.cc, socket.cc,
  utils.cc, misc/*wrap.cc: Windows-safe implementations
- src/transport/net.cc, net_socket.cc: guard Linux-only socket opts
- src/rma/rma_proxy.cc, scheduler/allgatherv_sched.cc,
  gin/gin_host_proxy.{cc,h}, ras/ras.cc: guard Linux-only paths

Tests and samples:
- tests/unit/nccl_tests.cu: Windows-compatible test harness
- tests/demo/nccl_demo.cpp, samples/nccl_demo.cu: portable demo apps

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
On Windows WDDM, GPU SM stores go to GPU L2 and are not automatically
flushed to DRAM; CPU reads from mapped memory see only DRAM.  Standard
NCCL SHM transport assumes UVA coherence that doesn't hold on WDDM.

Seven changes to restore correctness:

1. Zero-copy devFifo (src/transport/shm.cc): allocate connFifo/devFifo
   with cudaHostAlloc(Mapped|Portable) so the GPU writes go through
   the PCIe BAR directly to host DRAM, bypassing GPU L2.  The CPU proxy
   reads the fifo size/step without requiring explicit flushes.

2. ceRecvMemDev VRAM shadow (src/transport/shm.cc): for the CE-based
   recv proxy, allocate a VRAM mirror of the step counter via
   cudaMemcpyAsync (H2D CE engine).  CE writes bypass GPU SM L2 and
   land directly in VRAM DRAM, making them visible to GPU ld.cv reads.

3. ld.cv.global for RoleWaitRecv (src/device/prims_simple.h): use the
   cache-invalidating load (bypasses L1+L2) so the recv-side thread
   sees the CE-written step value in VRAM DRAM immediately.

4. atomicExch_system for connFifo/step writes from GPU
   (src/device/prims_simple.h): GPU RolePostSend/RolePostRecv must
   write through L2 to DRAM for the CPU proxy; atomicExch_system
   issues a system-scope atomic that bypasses GPU L2.

5. LL protocol disabled for SHM (src/transport/shm.cc): force
   buffs[NCCL_PROTO_LL] = buffs[NCCL_PROTO_LL128] = nullptr so the
   channel allocator always selects SIMPLE protocol for SHM.  The
   LL busy-poll loop has the same DRAM-visibility issue and fixing it
   would require additional CE shadow buffers; SIMPLE is sufficient
   for PCIe bandwidth.

6. Two-scalar loads in loadWorkBatchToShmem (src/device/common.h):
   MSVC/CUDA on WDDM generates ld.v2.u64 (16-byte vector loads) whose
   .y lane returns 0 for every access.  Use two sequential volatile
   scalar loads (ld.volatile.global u64 × 2) to work around the bug.

7. shmutils.cc NULL check: handle calloc returning NULL (defensive).

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
P2P sendrecv operations (gather, scatter, all_to_all) hit CUDA error 719
(unspecified launch failure) on WDDM when the data payload is below the
LL threshold (nChannels × 16384 bytes, typically 32 KB with 2 channels).

Root cause: addP2pToPlan selects LL protocol for small transfers because
LL has lower latency on Linux.  LL relies on 64-bit flag+data atoms
being coherent across process boundaries via NVLink or PCIe P2P.  On
WDDM, cross-process VRAM mappings (IPC handles) do not provide the
required cache coherence for LL's busy-poll pattern; the receiving rank
sees a stale flag value and the GPU kernel hangs, triggering TDR.

Fix: guard the protoLL initialization in addP2pToPlan with
NCCL_OS_WINDOWS and force both LL and LL128 off, so all P2P sendrecv
always uses SIMPLE protocol on WDDM.  SIMPLE is correct and performant
for PCIe P2P.

Co-Authored-By: Claude Code <noreply@anthropic.com>
Signed-off-by: Peter Kisfaludi <pkisfaludi@nvidia.com>
@peterkisfaludi
Author

@nv-udeodhar
Are these changes in line with what you had in mind?
If there is a gap, I can modify this MR to align with the design goals.
Thank you!

@xiaofanl-nvidia
Collaborator

@peterkisfaludi Thank you for the contribution! Just FYI: we already have a lot of Windows changes on the v2.30 branch (https://github.com/NVIDIA/nccl/tree/v2.30), with minimum tests passing there. I'll let @nv-udeodhar review and comment on whether we can easily leverage some of your work!

