[ROCm] Fix duplicate hipMemMap call in map_block() causing SIGSEGV #3265
Open
srinivamd wants to merge 1 commit into
…ROCM-25272)

The ROCm code path in map_block() calls hipMemMap() twice with identical arguments. This is a copy-paste bug relative to the CUDA path, which calls cuMemMap_() only once. Mapping over an already-mapped VA range causes undefined behavior in the HIP VMM driver, leading to SIGSEGV in IntraNodeComm::rendezvous() during the first allreduce call.

This bug is dormant on upstream pytorch/pytorch because test_intra_node_comm_all_reduce has @skipIfRocm. On ROCm/pytorch release/2.12, the skip was replaced with @runOnRocmArch(MI300_ARCH), exposing the crash on MI300X/gfx942.

Fixes: ROCM-25272
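The shape of the bug can be illustrated with a small toy model. This is only a sketch under stated assumptions: ToyVaSpace, map_block_buggy, and map_block_fixed are hypothetical names invented for illustration, nothing here calls the HIP or CUDA driver APIs, and it is not the code from CUDASymmetricMemoryUtils.cpp. The toy mapper rejects a second mapping of the same range with a clean error, whereas the real driver, per the description above, exhibits undefined behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>

// Toy stand-in for a virtual-address mapper. NOT the HIP API: the class and
// its methods are hypothetical and only model "a range may be mapped once".
class ToyVaSpace {
 public:
  // Returns true on success, false if [addr, addr + size) overlaps an
  // existing mapping or if size is zero.
  bool map(std::uintptr_t addr, std::size_t size) {
    if (size == 0) return false;
    // First existing mapping that starts at or after addr.
    auto next = ranges_.lower_bound(addr);
    if (next != ranges_.end() && next->first < addr + size) return false;
    // Mapping that starts before addr may still extend into the new range.
    if (next != ranges_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second > addr) return false;
    }
    ranges_[addr] = size;
    return true;
  }

  std::size_t mapping_count() const { return ranges_.size(); }

 private:
  std::map<std::uintptr_t, std::size_t> ranges_;  // start address -> length
};

// Buggy shape: the same map call is issued twice with identical arguments.
bool map_block_buggy(ToyVaSpace& va, std::uintptr_t addr, std::size_t size) {
  bool first = va.map(addr, size);
  bool second = va.map(addr, size);  // duplicate call; the toy rejects it
  return first && second;
}

// Fixed shape: the range is mapped exactly once.
bool map_block_fixed(ToyVaSpace& va, std::uintptr_t addr, std::size_t size) {
  return va.map(addr, size);
}
```

In this model, map_block_buggy fails on a fresh ToyVaSpace because its second call targets an already-mapped range, while map_block_fixed succeeds. The toy turns the duplicate into a detectable error on purpose; the real failure mode reported here is a crash rather than an error return.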
Jenkins build for cce3bf49a8e0a92ce7e76e25e2b51ce55571264c commit finished as FAILURE
Summary

Fixes SIGSEGV (exit -11) in IntraNodeComm::rendezvous() during test_intra_node_comm_all_reduce_custom_group_name_{True,False} on MI300X/MI308X gfx942 (ROCM-25272).

Root Cause

The ROCm code path in map_block() (CUDASymmetricMemoryUtils.cpp) calls hipMemMap() twice with identical arguments, a copy-paste bug. The CUDA path correctly calls cuMemMap_() only once.

Before (buggy):

After (fixed):

Mapping over an already-mapped VA range causes undefined behavior in the HIP VMM driver, leading to SIGSEGV when IntraNodeComm::rendezvous() is called during the first allreduce.

Why This Bug Was Not Caught Earlier

- Upstream pytorch/pytorch (main and release/2.12): test_intra_node_comm_all_reduce has @skipIfRocm, so the test never runs on ROCm and the bug is dormant.
- ROCm/pytorch release/2.12: @skipIfRocm was replaced with @runOnRocmArch(MI300_ARCH) to enable IntraNodeComm on MI300X, exposing this crash.
- … main, but no impact since the test is skipped.

Crash Stack (from logs)

Both ranks crash at the exact same instruction within ~2 seconds of init_process_group completing. Hard SIGSEGV, not a hang.

Test Plan

- test_intra_node_comm_all_reduce_custom_group_name_True on MI300X with ENABLE_INTRA_NODE_COMM=1
- test_intra_node_comm_all_reduce_custom_group_name_False on MI300X with ENABLE_INTRA_NODE_COMM=1
- distributed.test_c10d_nccl suite

References