NCCL:Broadcast collectives are missing from the converted trace but present in the trace_link #161

@alexseceks

Describe the Bug

After running a ResNet50 or TinyLlama2 workload on 4 ranks, I see at least one nccl:broadcast collective in the Kineto trace. The same collective appears in the trace_link file, but it is no longer present in the converted trace. Is this expected behavior, or is it an issue on the Chakra converter side?

I looked through the converter implementation but found nothing indicating that broadcast collectives should be dropped. Is there something I missed?

Steps to Reproduce

Use the Chakra version from 6 Sept, after the merge of commit #140.

Expected Behavior

See the nccl:broadcast collective in the converted trace.

Screenshots

This is the trace_link file; the broadcast collective is present.
Screenshot 2024-10-16 at 14 15 15
This is the converted trace in JSON format; no broadcast collective can be found (the search result is at the bottom of the picture).
Screenshot 2024-10-16 at 14 17 39
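
The manual search shown in the screenshots could also be scripted. The following is only a minimal sketch, not part of Chakra: it treats both files as plain text and assumes collective names appear literally as `nccl:<name>` strings in each of them. The helper names `count_collectives` and `missing_collectives` and the command-line file arguments are hypothetical and chosen only for illustration.

```python
import re
import sys
from collections import Counter
from pathlib import Path

# Assumption: collectives show up in both files as literal "nccl:<name>" text.
NCCL_NAME = re.compile(r"nccl:[a-z_]+", re.IGNORECASE)


def count_collectives(text: str) -> Counter:
    """Count every nccl:<name> string found in the text (case-insensitive)."""
    return Counter(m.group(0).lower() for m in NCCL_NAME.finditer(text))


def missing_collectives(source_text: str, converted_text: str) -> list:
    """Return collective names found in the source text but absent from the converted text."""
    src = count_collectives(source_text)
    dst = count_collectives(converted_text)
    return sorted(name for name in src if name not in dst)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python check_collectives.py <trace_link file> <converted JSON file>
    linked = Path(sys.argv[1]).read_text()
    converted = Path(sys.argv[2]).read_text()
    print("in trace_link:", dict(count_collectives(linked)))
    print("in converted: ", dict(count_collectives(converted)))
    print("missing after conversion:", missing_collectives(linked, converted))
```

Because this is a text search rather than a parse of either file format, it only reports whether a name disappears entirely; differing counts between the two files are not necessarily a bug.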
