[ROCm] Fix duplicate hipMemMap call in map_block() causing SIGSEGV #3265

Open
srinivamd wants to merge 1 commit into ROCm:release/2.12 from srinivamd:fix-rocm-25272-double-hipMemMap

@srinivamd srinivamd commented Jun 1, 2026

Summary

Fixes SIGSEGV (exit -11) in IntraNodeComm::rendezvous() during test_intra_node_comm_all_reduce_custom_group_name_{True,False} on MI300X/MI308X gfx942 (ROCM-25272).

Root Cause

The ROCm code path in map_block() (CUDASymmetricMemoryUtils.cpp) calls hipMemMap() twice with identical arguments, which is a copy-paste bug. The CUDA path correctly calls cuMemMap_() only once.

Before (buggy):

#elif defined(USE_ROCM)
  C10_CUDA_CHECK(hipMemAddressReserve(ptr, size, 0ULL, 0, 0ULL));
  C10_CUDA_CHECK(hipMemMap(*ptr, size, 0, ...handle..., 0ULL));
  C10_CUDA_CHECK(hipMemMap(*ptr, size, 0, ...handle..., 0ULL));  // DUPLICATE

After (fixed):

#elif defined(USE_ROCM)
  C10_CUDA_CHECK(hipMemAddressReserve(ptr, size, 0ULL, 0, 0ULL));
  C10_CUDA_CHECK(hipMemMap(*ptr, size, 0, ...handle..., 0ULL));

Mapping over an already-mapped VA range causes undefined behavior in the HIP VMM driver, leading to SIGSEGV when IntraNodeComm::rendezvous() is called during the first allreduce.
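
To make the failure mode concrete, here is a minimal toy model of an address space that tracks mapped ranges and rejects a second map of the same range. It is only a sketch: ToyAddressSpace and map_same_range_twice are hypothetical names, the addresses are arbitrary, and nothing here calls or reproduces the HIP VMM driver, which per the description above does not fail cleanly but ends in a SIGSEGV.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>

// Toy stand-in for a virtual address space (hypothetical, NOT the HIP driver).
// It records mapped ranges and refuses to map a range that overlaps one.
class ToyAddressSpace {
 public:
  // Returns true if [base, base + size) was free and is now recorded as mapped.
  // Returns false if size is zero or the range overlaps an existing mapping.
  bool map(std::uintptr_t base, std::size_t size) {
    if (size == 0) return false;
    const std::uintptr_t end = base + size;
    // First recorded mapping that starts at or after `base`.
    auto next = mapped_.lower_bound(base);
    if (next != mapped_.end() && next->first < end) return false;
    // Mapping that starts before `base` may still extend into the new range.
    if (next != mapped_.begin()) {
      auto prev = std::prev(next);
      if (prev->second > base) return false;
    }
    mapped_[base] = end;
    return true;
  }

 private:
  std::map<std::uintptr_t, std::uintptr_t> mapped_;  // start -> exclusive end
};

// Mirrors the buggy call sequence: the same range is mapped twice in a row.
// Returns {result of first map, result of second map}.
std::pair<bool, bool> map_same_range_twice() {
  ToyAddressSpace vas;
  const std::uintptr_t ptr = 0x10000;  // arbitrary illustrative address
  const std::size_t size = 4096;       // arbitrary illustrative size
  const bool first = vas.map(ptr, size);
  const bool second = vas.map(ptr, size);  // the duplicate call
  return {first, second};
}
```

In this model the first call succeeds and the duplicate is rejected, which is the kind of invariant the fixed code path restores by calling hipMemMap() only once per reserved range.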

Why This Bug Was Not Caught Earlier

  • Upstream pytorch/pytorch (main and release/2.12): test_intra_node_comm_all_reduce is decorated with @skipIfRocm, so the test never runs on ROCm and the bug stays dormant.
  • ROCm/pytorch release/2.12: @skipIfRocm was replaced with @runOnRocmArch(MI300_ARCH) to enable IntraNodeComm on MI300X, which exposed this crash.
  • The same bug exists on upstream main, but it has no visible impact there because the test is skipped.

Crash Stack (from logs)

#0  <unknown> + 0x38535e5 in libtorch_hip.so
#1  c10d::intra_node_comm::IntraNodeComm::rendezvous() + 0x18aa
#2  c10d::ProcessGroupNCCL::initIntraNodeComm() + 0x196
#3  c10d::ProcessGroupNCCL::allreduce() + 0x8a9

Both ranks crash at the exact same instruction within ~2 seconds of init_process_group completing. Hard SIGSEGV, not a hang.

Test Plan

  • Run test_intra_node_comm_all_reduce_custom_group_name_True on MI300X with ENABLE_INTRA_NODE_COMM=1
  • Run test_intra_node_comm_all_reduce_custom_group_name_False on MI300X with ENABLE_INTRA_NODE_COMM=1
  • Verify no regression in distributed.test_c10d_nccl suite

References

…ROCM-25272)

The ROCm code path in map_block() calls hipMemMap() twice with
identical arguments — a copy-paste bug from the CUDA path which
only calls cuMemMap_() once. Mapping over an already-mapped VA range
causes undefined behavior in the HIP VMM driver, leading to SIGSEGV
in IntraNodeComm::rendezvous() during the first allreduce call.

This bug is dormant on upstream pytorch/pytorch because
test_intra_node_comm_all_reduce has @skipIfRocm. On ROCm/pytorch
release/2.12, the skip was replaced with @runOnRocmArch(MI300_ARCH),
exposing the crash on MI300X/gfx942.

Fixes: ROCM-25272

rocm-repo-management-api Bot commented Jun 1, 2026

Jenkins build for cce3bf49a8e0a92ce7e76e25e2b51ce55571264c commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

@srinivamd srinivamd requested review from jeffdaily and pragupta June 1, 2026 15:11