torch.distributed.batch_isend_irecv is not recorded properly in ET #134

@shengfukevin

Description

We are developing comm_replay and found a problem with torch.distributed.batch_isend_irecv, which is used in one of our test traces.
The p2p comm sequence of real training between rank0 and rank8 is:
rank 0: batch -> send -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> recv
rank 8: batch -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send -> recv -> batch -> send
The p2p comm sequence in the Execution Trace (ET) used for replay between rank0 and rank8 is:
rank 0: send -> send -> recv -> send -> recv -> send -> recv -> recv
rank 8: recv -> send -> recv -> send -> recv -> send -> recv -> send
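
To make the loss of information concrete, here is a minimal sketch in plain Python (no torch required). It treats the sequences above as lists of strings; the helper names `flatten` and `split_batches` are made up for illustration and are not part of any tool. Dropping the `batch` markers from the real sequence yields exactly the sequence recorded in the ET, so the batch boundaries cannot be recovered from the trace alone.

```python
# Sequences copied from the issue text (rank 0 only).
rank0_real = ("batch send batch send recv batch send recv "
              "batch send recv batch recv").split()
rank0_et = "send send recv send recv send recv recv".split()


def flatten(seq):
    """Drop the batch markers, which is effectively what the recorded ET shows."""
    return [op for op in seq if op != "batch"]


def split_batches(seq):
    """Group ops by the batch marker that precedes them."""
    groups = []
    for op in seq:
        if op == "batch" or not groups:
            groups.append([])  # a marker (or the first op) opens a new group
        if op != "batch":
            groups[-1].append(op)
    return groups


# The ET sequence is the real sequence with the boundaries removed.
same_ops = flatten(rank0_real) == rank0_et
# The grouping that replay would need, but that the ET does not encode.
real_groups = split_batches(rank0_real)
```

Here `real_groups` holds five groups of one or two ops each, while `rank0_et` is a flat list of eight ops with no boundaries.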

The issue can be reproduced with the ET collected for https://github.com/pytorch/pytorch/blob/main/test/distributed/test_c10d_nccl.py#L3846

The two attached files (batched-send-recv-0.json and batched-send-recv-1.json) are a simpler version of that unit test, with only one batch_isend_irecv call. In them you can find the nccl::coalesced node, which marks the end of the coalescing buffer. I think the trace is missing a node that marks the start of the coalescing buffer. Once that node is added, all send/recv nodes between the start and the end of coalescing should be treated as one coalesced group during replay.
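
A rough sketch of how replay could group nodes if such a start marker existed. This is only an illustration under stated assumptions: the start-marker name `nccl::coalesce_start` is hypothetical (no such node exists in the trace today), the send/recv node names are placeholders, and the trace is reduced to a flat list of node names rather than the real ET JSON schema.

```python
COALESCE_START = "nccl::coalesce_start"  # HYPOTHETICAL: the node proposed in this issue
COALESCE_END = "nccl::coalesced"         # end marker mentioned in the issue
SEND, RECV = "nccl:send", "nccl:recv"    # assumed placeholder names for p2p nodes


def group_coalesced(names):
    """Return a list of groups; each group is replayed as one unit.

    Ops between a start and an end marker form one group; ops outside
    any coalescing window become single-element groups.
    """
    groups = []
    current = None  # list being filled while inside a coalescing window
    for name in names:
        if name == COALESCE_START:
            current = []
        elif name == COALESCE_END:
            if current is not None:
                groups.append(current)
            current = None
        elif current is not None:
            current.append(name)
        else:
            groups.append([name])
    if current is not None:
        raise ValueError("coalescing window opened but never closed")
    return groups


trace = [COALESCE_START, SEND, RECV, COALESCE_END, SEND]
groups = group_coalesced(trace)
```

With only the end marker present, as in the attached traces, a replayer cannot tell where the window opened, which is why the start node is needed.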
