Fix dev_runtime memory leaks #2087 (Open)

PatriosTheGreat wants to merge 1 commit into NVIDIA:master from PatriosTheGreat:master
Conversation

@PatriosTheGreat

Description

Fix CPU memory leak in dev_runtime finalize method.

Without this change, we observe heap-checker failures when the following code is run on an H100 machine:

```cpp
constexpr int kNumLocalDevices = 2;
ncclComm_t communicators[kNumLocalDevices];
ncclCommInitAll(communicators, kNumLocalDevices,
                /*devlist=*/nullptr);

for (int i = 0; i < kNumLocalDevices; ++i) {
  ncclCommDestroy(communicators[i]);
}
```

The heap checker reports the following stack trace for the allocation of the unreleased memory:

```
ncclDevrInitOnce()
ncclTeamLsa
initTransportsRank()
ncclCommInitRankFunc()
ncclAsyncJobMain()
```

Considering that lsaRankList is allocated before bigSize is initialized, I assume it should also be deallocated even if bigSize is zero.

Related Issues

No related issue has been opened, as far as I can tell.

Changes & Impact

No impact

Performance Impact

No impact

@PatriosTheGreat
Author

PatriosTheGreat commented Apr 2, 2026

I found another leak with the following sample:

```cpp
constexpr int kNumLocalDevices = 2;
constexpr int kRootDevice = 0;
constexpr int kDataSize = 8;

std::vector<double> host_data(kDataSize);
for (int i = 0; i < kDataSize; i++) host_data[i] = 1;

devices_data_.resize(kNumLocalDevices);

for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaMalloc(&devices_data_[device], kDataSize * sizeof(double));
  cudaMemcpy(devices_data_[device], host_data.data(),
             kDataSize * sizeof(double), cudaMemcpyHostToDevice);
}

ncclCommInitAll(communicators_.data(), kNumLocalDevices,
                /*devlist=*/nullptr);
streams_.resize(kNumLocalDevices);
for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaStreamCreate(&streams_[device]);
}

ncclGroupStart();
for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  ncclReduce(devices_data_[device], devices_data_[device], kDataSize,
             ncclDouble, ncclSum, kRootDevice, communicators_[device],
             streams_[device]);
}
ncclGroupEnd();

for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaStreamSynchronize(streams_[device]);
}

for (int device = 0; device < kNumLocalDevices; device++) {
  cudaSetDevice(device);
  cudaStreamDestroy(streams_[device]);
}
for (int device = 0; device < kNumLocalDevices; device++) {
  ncclCommDestroy(communicators_[device]);
}
for (int device = 0; device < kNumLocalDevices; device++) {
  cudaFree(devices_data_[device]);
}
```

Apparently ncclDevrInitOnce is getting called twice per device in this example:

1. ncclCommInitRankFunc -> initTransportsRank -> ncclTeamLsa -> ncclDevrInitOnce

2. groupLaunch -> ncclPrepareTasksAndCollPreconnect -> ncclPrepareTasks -> ncclMakeSymmetricTaskList -> ncclDevrInitOnce

But the second object's finalize function is never called, which leaks the memory allocated for lsaRankList.

I guess this happens because bigSize is zeroed at init.cc:1482.

Not sure yet how to fix this.

Signed-off-by: Levon Ter-Grigorian <patrios@google.com>
PatriosTheGreat changed the title from "Fix dev_runtime memory leak" to "Fix dev_runtime memory leaks" on Apr 2, 2026.
@xiaofanl-nvidia
Collaborator

++ @kiskra-nvidia @kgioioso to take a look. Feel free to mirror the PR if it looks good.
