
possible bug in python/fedml/core/distributed/communication/trpc/utils.py #2002

Description

@bene-ges

Hi,

I was trying to launch the federate/cross_silo/cuda_rpc_fedavg_mnist_lr_example example, mapping all processes (1 server and 2 clients) to a single GPU.

It ended with the following error:

File "/home/myhome/.local/lib/python3.10/site-packages/torch/distributed/rpc/backend_registry.py", line 235, in _validate_device_maps
    raise ValueError(
ValueError: Node worker0 has target devices with invalid indices in its device map for worker2
device map = {device(type='cuda', index=0): device(type='cuda', index=2)}
device count = 1

I suspect there is a bug in python/fedml/core/distributed/communication/trpc/utils.py:

# Generate Device Map for Cuda RPC
def set_device_map(options, worker_idx, device_list):
    local_device = device_list[worker_idx]
    for index, remote_device in enumerate(device_list):
        logging.warn(f"Setting device map for client {index} as {remote_device}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})

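Here device_list is a dict, {0: 0, 1: 0, 2: 0}, so enumerate iterates over its keys, and the key (0, 1, 2) rather than the mapped GPU index is used as remote_device. That is why the device map for worker2 points to cuda:2 on a machine with a single GPU.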
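The behaviour can be reproduced without torch or FedML. This is a minimal sketch, assuming device_list is exactly the dict quoted above (worker index as key, GPU index as value):

```python
# Worker index -> GPU index, as reported: three processes, all on GPU 0.
device_list = {0: 0, 1: 0, 2: 0}

# enumerate() over a dict walks its keys, so the second element of each
# pair is the worker index (the key), not the GPU index (the value).
pairs = list(enumerate(device_list))
print(pairs)  # [(0, 0), (1, 1), (2, 2)]

# Looking the key up in the dict yields the GPU index that was intended.
remote_devices = [device_list[key] for _, key in pairs]
print(remote_devices)  # [0, 0, 0]
```

With a plain list such as [0, 0, 0] the same loop would yield the GPU indices directly, which may be what the function was originally written for.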

I tried to correct this as follows:

    for index, remote_device in enumerate(device_list):
        logging.warn(f"Setting device map for client {index} as {device_list[remote_device]}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: device_list[remote_device]})

With this change, the example ran without errors.
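A slightly more defensive variant might accept either a list or a dict. The sketch below is only an illustration, not the FedML implementation: RecordingOptions is a hypothetical stand-in that records calls instead of configuring an RPC backend, and the "worker{}" value of WORKER_NAME is an assumption inferred from the names worker0 and worker2 in the error message.

```python
import logging

WORKER_NAME = "worker{}"  # assumed format, inferred from the error message


class RecordingOptions:
    """Hypothetical stand-in for the RPC backend options; it only records calls."""

    def __init__(self):
        self.device_maps = {}

    def set_device_map(self, to, device_map):
        self.device_maps[to] = device_map


def set_device_map(options, worker_idx, device_list):
    # A dict is read as {worker index: device}; a list as position = worker index.
    if isinstance(device_list, dict):
        pairs = device_list.items()
    else:
        pairs = enumerate(device_list)
    local_device = device_list[worker_idx]
    for index, remote_device in pairs:
        logging.warning(f"Setting device map for client {index} as {remote_device}")
        if index != worker_idx:
            options.set_device_map(WORKER_NAME.format(index), {local_device: remote_device})


options = RecordingOptions()
set_device_map(options, 0, {0: 0, 1: 0, 2: 0})
print(options.device_maps)  # {'worker1': {0: 0}, 'worker2': {0: 0}}
```

Iterating over items() avoids relying on the dict keys being 0, 1, 2 in insertion order, which the enumerate-based fix implicitly assumes.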
