[ROCm] Fix duplicate hipMemMap call in map_block() causing SIGSEGV by srinivamd · Pull Request #3265 · ROCm/pytorch

srinivamd · 2026-06-01T08:25:46Z

Summary

Fixes SIGSEGV (exit -11) in IntraNodeComm::rendezvous() during test_intra_node_comm_all_reduce_custom_group_name_{True,False} on MI300X/MI308X gfx942 (ROCM-25272).

Root Cause

The ROCm code path in map_block() (CUDASymmetricMemoryUtils.cpp) calls hipMemMap() twice with identical arguments — a copy-paste bug. The CUDA path correctly calls cuMemMap_() only once.

Before (buggy):

#elif defined(USE_ROCM)
  C10_CUDA_CHECK(hipMemAddressReserve(ptr, size, 0ULL, 0, 0ULL));
  C10_CUDA_CHECK(hipMemMap(*ptr, size, 0, ...handle..., 0ULL));
  C10_CUDA_CHECK(hipMemMap(*ptr, size, 0, ...handle..., 0ULL));  // DUPLICATE

After (fixed):

#elif defined(USE_ROCM)
  C10_CUDA_CHECK(hipMemAddressReserve(ptr, size, 0ULL, 0, 0ULL));
  C10_CUDA_CHECK(hipMemMap(*ptr, size, 0, ...handle..., 0ULL));

Mapping over an already-mapped VA range causes undefined behavior in the HIP VMM driver, leading to SIGSEGV when IntraNodeComm::rendezvous() is called during the first allreduce.
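
To make the invariant concrete, here is a minimal host-only sketch. It is a toy model, not the HIP VMM API: `ToyVmm`, `MapStatus`, and `map_block_once` are hypothetical names introduced only for illustration. The real driver does not necessarily report an error on a second map (the behavior described above is undefined); the model simply rejects it so the "reserve once, map once" rule is visible without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

// Toy model only: these types are hypothetical and do not exist in HIP or PyTorch.
enum class MapStatus { Ok, NotReserved, AlreadyMapped };

class ToyVmm {
 public:
  // Record a reserved virtual address range [base, base + size).
  void reserve(std::uintptr_t base, std::size_t size) { reserved_[base] = size; }

  // Back a reserved range with memory. A second map of the same range is
  // rejected here; the real driver gives no such guarantee.
  MapStatus map(std::uintptr_t base, std::size_t size) {
    auto it = reserved_.find(base);
    if (it == reserved_.end() || it->second != size) {
      return MapStatus::NotReserved;
    }
    if (!mapped_.insert(base).second) {
      return MapStatus::AlreadyMapped;
    }
    return MapStatus::Ok;
  }

 private:
  std::map<std::uintptr_t, std::size_t> reserved_;  // base -> size
  std::set<std::uintptr_t> mapped_;                 // bases already mapped
};

// Mirrors the shape of the fixed ROCm path: one reserve, then exactly one map.
inline MapStatus map_block_once(ToyVmm& vmm, std::uintptr_t base, std::size_t size) {
  assert(size > 0);
  vmm.reserve(base, size);
  return vmm.map(base, size);
}
```

In this model, `map_block_once` returns `MapStatus::Ok`, and a further `map()` on the same range returns `MapStatus::AlreadyMapped`, which corresponds to the duplicated call removed by this PR.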

Why This Bug Was Not Caught Earlier

Upstream pytorch/pytorch (main and release/2.12): test_intra_node_comm_all_reduce is decorated with @skipIfRocm, so the test never runs on ROCm and the bug stays dormant.
ROCm/pytorch release/2.12: @skipIfRocm was replaced with @runOnRocmArch(MI300_ARCH) to enable IntraNodeComm on MI300X, which exposed this crash.
The same bug exists on upstream main, but it has no visible impact there because the test is skipped.

Crash Stack (from logs)

#0  <unknown> + 0x38535e5 in libtorch_hip.so
#1  c10d::intra_node_comm::IntraNodeComm::rendezvous() + 0x18aa
#2  c10d::ProcessGroupNCCL::initIntraNodeComm() + 0x196
#3  c10d::ProcessGroupNCCL::allreduce() + 0x8a9

Both ranks crash at the exact same instruction within ~2 seconds of init_process_group completing. Hard SIGSEGV, not a hang.

Test Plan

Run test_intra_node_comm_all_reduce_custom_group_name_True on MI300X with ENABLE_INTRA_NODE_COMM=1
Run test_intra_node_comm_all_reduce_custom_group_name_False on MI300X with ENABLE_INTRA_NODE_COMM=1
Verify no regression in distributed.test_c10d_nccl suite

References

Jira: ROCM-25272
Upstream issue (test disabled on ROCm since Dec 2023): pytorch/pytorch#115859
groupName fix (already on release/2.12): pytorch/pytorch#180809
Test parametrization (already on release/2.12): pytorch/pytorch#181331

…ROCM-25272)

The ROCm code path in map_block() calls hipMemMap() twice with identical arguments (a copy-paste bug; the CUDA path calls cuMemMap_() only once). Mapping over an already-mapped VA range causes undefined behavior in the HIP VMM driver, leading to SIGSEGV in IntraNodeComm::rendezvous() during the first allreduce call.

This bug is dormant on upstream pytorch/pytorch because test_intra_node_comm_all_reduce has @skipIfRocm. On ROCm/pytorch release/2.12, the skip was replaced with @runOnRocmArch(MI300_ARCH), exposing the crash on MI300X/gfx942.

Fixes: ROCM-25272

rocm-repo-management-api · 2026-06-01T08:36:52Z

Jenkins build for cce3bf49a8e0a92ce7e76e25e2b51ce55571264c commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

srinivamd requested review from jeffdaily and pragupta June 1, 2026 15:11

srinivamd wants to merge 1 commit into ROCm:release/2.12 from srinivamd:fix-rocm-25272-double-hipMemMap
