Fix /dev/shm/nccl-K error. #2059

Open
lunixbochs wants to merge 1 commit into NVIDIA:master from lunixbochs:shm-tag-confusion

@lunixbochs

Description

With memoryless NUMA domains (e.g. if you don't populate all of the memory channels on an EPYC CPU), NCCL IPC has a type confusion that results in this error:

[rank1]: Error while attaching to shared memory segment /dev/shm/nccl-Ќ (size 0), error: No such file or directory (2)

On such systems, the received cuMem SHM descriptor union is interpreted as a /dev/shm file path without checking which union member is actually populated, so the descriptor's raw bytes end up in the path shown above. This PR checks the union tag before use and turns the failure into a clear warning.
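
To illustrate the kind of check described above, here is a minimal sketch of a tagged union whose accessor refuses to hand out the file-path member when the tag says the descriptor holds a binary handle. All type, field, and function names (`shm_desc_t`, `desc_kind_t`, `desc_get_path`, and so on) are hypothetical and chosen for illustration; they are not NCCL's actual definitions, and the real patch may be structured differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names throughout; these are NOT NCCL's real types. */
typedef enum {
    DESC_KIND_FILE_PATH,     /* u.path holds a NUL-terminated path */
    DESC_KIND_CUMEM_HANDLE   /* u.handle holds opaque binary bytes */
} desc_kind_t;

typedef struct {
    desc_kind_t kind;              /* tag: which union member is valid */
    union {
        char path[64];             /* e.g. a /dev/shm file name */
        unsigned char handle[64];  /* opaque handle, not text */
    } u;
} shm_desc_t;

/* Build a file-path descriptor; the path is truncated if too long. */
shm_desc_t desc_from_path(const char *path) {
    shm_desc_t d;
    assert(path != NULL);
    memset(&d, 0, sizeof d);
    d.kind = DESC_KIND_FILE_PATH;
    strncpy(d.u.path, path, sizeof d.u.path - 1);
    return d;
}

/* Build a handle descriptor from raw bytes (truncated to the buffer size). */
shm_desc_t desc_from_handle(const unsigned char *bytes, size_t len) {
    shm_desc_t d;
    assert(bytes != NULL);
    memset(&d, 0, sizeof d);
    d.kind = DESC_KIND_CUMEM_HANDLE;
    if (len > sizeof d.u.handle) len = sizeof d.u.handle;
    memcpy(d.u.handle, bytes, len);
    return d;
}

/* Check the tag before reading the path member.
 * Returns 0 and sets *path_out on success; returns -1 and leaves
 * *path_out untouched if the descriptor is not a file path. */
int desc_get_path(const shm_desc_t *d, const char **path_out) {
    assert(d != NULL && path_out != NULL);
    if (d->kind != DESC_KIND_FILE_PATH) {
        fprintf(stderr,
                "WARN: expected a file-path descriptor but got kind %d; "
                "not treating handle bytes as a path\n", (int)d->kind);
        return -1;
    }
    *path_out = d->u.path;
    return 0;
}
```

Without the tag check in `desc_get_path`, a caller would read `u.path` even when the sender filled in `u.handle`, and whatever bytes the handle contains would be passed along as a file name. With the check, the mismatch is reported at the point where it is detected.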

Related Issues

I have a more detailed writeup here: pytorch/pytorch#152302 (comment)

Changes & Impact

N/A

Performance Impact

N/A

This would happen with memoryless NUMA domains, which would cause the
received cuMem SHM descriptor to be interpreted as a file path.
@sjeaugey
Member

Thanks for the report. I have the feeling that we could have fixed the issue though, rather than failing.
@xiaofanl-nvidia I think we should investigate further.
