Summary
In PyTorch, we can replicate an alltoallv with a mix of non-zero and zero-sized buffers (similar to the use case described in #174) by passing torch.empty(0) tensors in the lists of send and receive buffers. This type of collective breaks on Intel GPU on Aurora with the oneCCL backend when CCL_ALLTOALLV=topo and CCL_ALLTOALLV_MONOLITHIC_KERNEL=0. It does work with CCL_ALLTOALLV_MONOLITHIC_KERNEL=1, although that is very slow, and it also works with other alltoallv algorithms (such as direct and naive). The error with the topology-aware algorithm is shown below.
|CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
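The buffer setup that triggers this can be sketched as below. This is a minimal illustration, not the full reproducer linked under Reproducer; the helper name and the per-rank counts are made up for the example, and the actual collective call is guarded since it requires an initialized process group (e.g. the ccl backend on Aurora).

```python
import torch
import torch.distributed as dist

def build_alltoallv_buffers(send_counts, recv_counts,
                            dtype=torch.float32, device="cpu"):
    """Build per-rank send/recv tensor lists, using zero-sized
    placeholders (torch.empty(0)) for ranks with nothing to exchange."""
    send_list = [
        torch.randn(n, dtype=dtype, device=device) if n > 0
        else torch.empty(0, dtype=dtype, device=device)
        for n in send_counts
    ]
    recv_list = [torch.empty(n, dtype=dtype, device=device)
                 for n in recv_counts]
    return send_list, recv_list

# Hypothetical counts for a 3-rank job: this rank sends nothing to
# rank 1 and receives nothing from rank 0.
send_list, recv_list = build_alltoallv_buffers([4, 0, 2], [0, 3, 1])

# The collective itself needs an initialized process group; on Aurora
# this would be dist.init_process_group(backend="ccl", ...).
if dist.is_initialized():
    dist.all_to_all(recv_list, send_list)
```

With the topo algorithm and the monolithic kernel disabled, the zero-sized entries in these lists are what appear to trip the "unknown memory type" check in ze_handle_manager.cpp.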
Version and environment
Sunspot system at ALCF.
- oneCCL release: 2021.17
- MPI version: aurora_test branch @ 3c70a61
- Compiler type and version: Intel compiler 2024.3.2
- PyTorch: 2.10.0
- OS name and version:
- GPU driver information:
- Hardware configuration: Aurora/Sunspot @ ALCF configuration
Reproducer
The reproducer (which also includes a correctness check) is found here:
https://github.com/argonne-lcf/nekRS-ML/blob/alcf4/3rd_party/dist-gnn/all2all_bench.py
Run instructions for Aurora are found here:
https://github.com/argonne-lcf/nekRS-ML/blob/alcf4/3rd_party/dist-gnn/run_all2all_bench.sh
Logs
Running the reproducer produces the error shown under Observed behavior.
Expected behavior
The alltoallv collective is expected to work with this setup using the topo algorithm and CCL_ALLTOALLV_MONOLITHIC_KERNEL=0.
Observed behavior
The observed behavior is the following error:
|CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
Existing workarounds
Using other alltoallv algorithms (e.g. direct or naive), or setting CCL_ALLTOALLV_MONOLITHIC_KERNEL=1. The latter is especially undesirable due to the performance penalty.
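The workarounds amount to the following environment settings (the variable names come from the report above; the choice of naive as the alternative algorithm is just one of the working options mentioned):

```shell
# Workaround 1: pick a non-topo alltoallv algorithm (direct also works)
export CCL_ALLTOALLV=naive

# Workaround 2 (slow): keep topo but force the monolithic kernel
# export CCL_ALLTOALLV=topo
# export CCL_ALLTOALLV_MONOLITHIC_KERNEL=1

echo "CCL_ALLTOALLV=$CCL_ALLTOALLV"
```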
Affected projects
This bug affects scaling of a graph neural network on Aurora, which is a critical workload for modeling mesh-based PDE systems.