Summary
In PyTorch, we can replicate an alltoallv with a mix of non-zero and zero-sized buffers (similar to the use case described in #174) by passing torch.empty(0) tensors in the lists of send and receive buffers. This type of collective breaks on Intel GPU on Aurora with the oneCCL backend when CCL_ALLTOALLV=topo and CCL_ALLTOALLV_MONOLITHIC_KERNEL=0. It does work with CCL_ALLTOALLV_MONOLITHIC_KERNEL=1, although that is very slow, and it also works with other alltoallv algorithms (such as direct and naive). The error with the topology-aware algorithm is shown below.
|CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
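The buffer setup that triggers this can be sketched as below. This is a minimal illustration, not the full reproducer linked under Reproducer; the helper name and the per-rank counts are made up for the example, and the actual collective call is guarded since it requires an initialized process group (e.g. the ccl backend on Aurora).

```python
import torch
import torch.distributed as dist

def build_alltoallv_buffers(send_counts, recv_counts,
                            dtype=torch.float32, device="cpu"):
    """Build per-rank send/recv tensor lists, using zero-sized
    placeholders (torch.empty(0)) for ranks with nothing to exchange."""
    send_list = [
        torch.randn(n, dtype=dtype, device=device) if n > 0
        else torch.empty(0, dtype=dtype, device=device)
        for n in send_counts
    ]
    recv_list = [torch.empty(n, dtype=dtype, device=device)
                 for n in recv_counts]
    return send_list, recv_list

# Hypothetical counts for a 3-rank job: this rank sends nothing to
# rank 1 and receives nothing from rank 0.
send_list, recv_list = build_alltoallv_buffers([4, 0, 2], [0, 3, 1])

# The collective itself needs an initialized process group; on Aurora
# this would be dist.init_process_group(backend="ccl", ...).
if dist.is_initialized():
    dist.all_to_all(recv_list, send_list)
```

With the topo algorithm and the monolithic kernel disabled, the zero-sized entries in these lists are what appear to trip the "unknown memory type" check in ze_handle_manager.cpp.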
Version and environment
Sunspot system at ALCF.
- oneCCL release: 2021.17
- MPI version: aurora_test branch @ 3c70a61
- Compiler type and version: Intel compiler 2024.3.2
- PyTorch: 2.10.0
- OS name and version:
- GPU driver information:
- Hardware configuration: Aurora/Sunspot @ ALCF configuration
Reproducer
The reproducer (which also includes a correctness check) is found here:
https://github.com/argonne-lcf/nekRS-ML/blob/alcf4/3rd_party/dist-gnn/all2all_bench.py
Run instructions for Aurora are found here:
https://github.com/argonne-lcf/nekRS-ML/blob/alcf4/3rd_party/dist-gnn/run_all2all_bench.sh
Logs
Running the reproducer produces the error shown under Observed behavior.
Expected behavior
The alltoallv collective is expected to work with this setup using the topo algorithm and CCL_ALLTOALLV_MONOLITHIC_KERNEL=0.
Observed behavior
The observed behavior is the following error:
|CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
terminate called after throwing an instance of 'ccl::v1::exception'
what(): oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
Existing workarounds
Using other alltoallv algorithms (e.g. direct or naive), or setting CCL_ALLTOALLV_MONOLITHIC_KERNEL=1. The latter is especially undesirable due to the performance penalty.
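The workarounds amount to the following environment settings (the variable names come from the report above; the choice of naive as the alternative algorithm is just one of the working options mentioned):

```shell
# Workaround 1: pick a non-topo alltoallv algorithm (direct also works)
export CCL_ALLTOALLV=naive

# Workaround 2 (slow): keep topo but force the monolithic kernel
# export CCL_ALLTOALLV=topo
# export CCL_ALLTOALLV_MONOLITHIC_KERNEL=1

echo "CCL_ALLTOALLV=$CCL_ALLTOALLV"
```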
Affected projects
This bug affects scaling of a graph neural network on Aurora, which is a critical workload for modeling mesh-based PDE systems.