Fix dev_runtime memory leaks #2087 (Open)

PatriosTheGreat wants to merge 1 commit into NVIDIA:master from PatriosTheGreat:master
Conversation

@PatriosTheGreat

Description

Fix CPU memory leak in dev_runtime finalize method.

Without this change, we observe heap-checker failures when the following code is run on an H100 machine:

```cpp
constexpr int kNumLocalDevices = 2;
ncclComm_t communicators[kNumLocalDevices];
ncclCommInitAll(communicators, kNumLocalDevices,
                /*devlist=*/nullptr);

for (int i = 0; i < kNumLocalDevices; ++i) {
  ncclCommDestroy(communicators[i]);
}
```

The heap checker reports the following stack trace for the allocation of the unreleased memory:

```
ncclDevrInitOnce()
ncclTeamLsa
initTransportsRank()
ncclCommInitRankFunc()
ncclAsyncJobMain()
```

Considering that lsaRankList is allocated before bigSize is initialized, I assume it should also be deallocated even if bigSize is zero.

Related Issues

No related issue has been opened, as far as I can tell.

Changes & Impact

No impact

Performance Impact

No impact

@PatriosTheGreat
Author

PatriosTheGreat commented Apr 2, 2026

I found another leak with the following sample:

```cpp
constexpr int kNumLocalDevices = 2;
constexpr int kRootDevice = 0;
constexpr int kDataSize = 8;

std::vector<double> host_data(kDataSize);
for (int i = 0; i < kDataSize; i++) host_data[i] = 1;

devices_data_.resize(kNumLocalDevices);

for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaMalloc(&devices_data_[device], kDataSize * sizeof(double));
  cudaMemcpy(devices_data_[device], host_data.data(),
             kDataSize * sizeof(double), cudaMemcpyHostToDevice);
}

ncclCommInitAll(communicators_.data(), kNumLocalDevices,
                /*devlist=*/nullptr);
streams_.resize(kNumLocalDevices);
for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaStreamCreate(&streams_[device]);
}

ncclGroupStart();
for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  ncclReduce(devices_data_[device], devices_data_[device], kDataSize,
             ncclDouble, ncclSum, kRootDevice, communicators_[device],
             streams_[device]);
}
ncclGroupEnd();

for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaStreamSynchronize(streams_[device]);
}

for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaStreamDestroy(streams_[device]);
}
for (int device = 0; device < kNumLocalDevices; device++) {
  ncclCommDestroy(communicators_[device]);
}
for (int device = 0; device < kNumLocalDevices; device++) {
  cudaFree(devices_data_[device]);
}
```

Apparently ncclDevrInitOnce is getting called twice per device in this example:

1. ncclCommInitRankFunc -> initTransportsRank -> ncclTeamLsa -> ncclDevrInitOnce

2. groupLaunch -> ncclPrepareTasksAndCollPreconnect -> ncclPrepareTasks -> ncclMakeSymmetricTaskList -> ncclDevrInitOnce

But the second object's finalize function is never called, which leaks the memory allocated for lsaRankList.

I guess this happens because bigSize is zeroed at init.cc:1482.

Not sure yet how to fix this.

Signed-off-by: Levon Ter-Grigorian <patrios@google.com>
PatriosTheGreat changed the title from "Fix dev_runtime memory leak" to "Fix dev_runtime memory leaks" on Apr 2, 2026.
@xiaofanl-nvidia
Collaborator

++ @kiskra-nvidia @kgioioso to take a look. Feel free to mirror the PR if it looks good.
