Unknown memory type error for alltoallv with torch.empty tensors #190

@rickybalin

Summary

In PyTorch, an alltoallv with a mix of non-zero and zero-sized buffers (similar to the use case described in #174) can be expressed by passing torch.empty(0) tensors in the lists of send and receive buffers. This collective fails on Intel GPUs on Aurora with the oneCCL backend when CCL_ALLTOALLV=topo and CCL_ALLTOALLV_MONOLITHIC_KERNEL=0. It works with CCL_ALLTOALLV_MONOLITHIC_KERNEL=1, although it is very slow, and it also works with other alltoallv algorithms (such as direct and naive). The error with the topology-aware algorithm is shown below.

|CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
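
For readers who have not opened the reproducer, the buffer layout described in the summary can be sketched roughly as follows. This is not the linked script: the ring-neighbor pattern and the helper names neighbor_counts and sparse_all_to_all are hypothetical, chosen only to illustrate send and receive lists that mix non-empty tensors with torch.empty(0). It assumes a process group has already been initialized with a backend that supports all_to_all.

```python
def neighbor_counts(rank, world_size, halo, count):
    """Per-peer element counts: `count` for ring neighbors within `halo`, else 0.

    Hypothetical pattern for illustration; the real workload's pattern differs.
    """
    counts = []
    for peer in range(world_size):
        ring_dist = min((peer - rank) % world_size, (rank - peer) % world_size)
        counts.append(count if 0 < ring_dist <= halo else 0)
    return counts


def sparse_all_to_all(counts, device):
    """Exchange counts[p] floats with each peer p; zero counts use torch.empty(0).

    Assumes torch.distributed is already initialized. The pattern above is
    symmetric, so the same counts describe both send and receive buffers.
    """
    # Imported here so the size helper above can be used without PyTorch.
    import torch
    import torch.distributed as dist

    rank = dist.get_rank()
    send = [
        torch.full((n,), float(rank), device=device) if n
        else torch.empty(0, device=device)
        for n in counts
    ]
    recv = [torch.empty(n, device=device) for n in counts]
    dist.all_to_all(recv, send)  # output list first, input list second
    return recv
```

With 4 ranks and halo=1, rank 0 would exchange data only with ranks 1 and 3 and pass empty tensors for ranks 0 and 2, which is the kind of mixed list that triggers the error above.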

Version and environment

Sunspot system at ALCF.

  • oneCCL release: 2021.17
  • MPI version: aurora_test branch @ 3c70a61
  • Compiler type and version: Intel compiler 2024.3.2
  • PyTorch: 2.10.0
  • OS name and version:
  • GPU driver information:
  • Hardware configuration: Aurora/Sunspot @ ALCF configuration

Reproducer

The reproducer is found here (includes a correctness check too):
https://github.com/argonne-lcf/nekRS-ML/blob/alcf4/3rd_party/dist-gnn/all2all_bench.py

And run instructions for Aurora are found here:
https://github.com/argonne-lcf/nekRS-ML/blob/alcf4/3rd_party/dist-gnn/run_all2all_bench.sh

Logs

Will produce this.

Expected behavior

The expected behavior is for the alltoallv collective to work with this setup with the topo algorithm and CCL_ALLTOALLV_MONOLITHIC_KERNEL=0.

Observed behavior

The observed behavior is the following error:

|CCL_ERROR| worker.cpp:338 ccl_worker_func: worker 0 caught internal exception: oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type
terminate called after throwing an instance of 'ccl::v1::exception'
  what():  oneCCL: ze_handle_manager.cpp:226 get_ptr: EXCEPTION: unknown memory type

Existing workarounds

Use another alltoallv algorithm (such as direct or naive), or keep the topo algorithm and set CCL_ALLTOALLV_MONOLITHIC_KERNEL=1. The latter is especially undesirable due to the performance penalty.
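
As a rough sketch, the workarounds can be applied from Python by setting the environment variables before the process group is created. The helper name alltoallv_workaround_env is hypothetical, and the assumption that oneCCL reads these variables at initialization (so they must be set beforehand) should be checked against the oneCCL documentation; exporting them in the job script is an equivalent option.

```python
import os


def alltoallv_workaround_env(algorithm="direct", monolithic=None):
    """Build the oneCCL settings for one of the workarounds in this issue.

    algorithm:  value for CCL_ALLTOALLV ("direct", "naive", or "topo").
    monolithic: if not None, also set CCL_ALLTOALLV_MONOLITHIC_KERNEL to 1 or 0.
    """
    env = {"CCL_ALLTOALLV": algorithm}
    if monolithic is not None:
        env["CCL_ALLTOALLV_MONOLITHIC_KERNEL"] = "1" if monolithic else "0"
    return env


# Workaround 1: switch to a different algorithm.
os.environ.update(alltoallv_workaround_env("direct"))

# Workaround 2 (slow): keep topo but enable the monolithic kernel.
# os.environ.update(alltoallv_workaround_env("topo", monolithic=True))
```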

Affected projects

This bug affects scaling of a graph neural network on Aurora, which is a critical workload for modeling mesh-based PDE systems.