Skip to content

removing secondary reordering#314

Open
AtlantaPepsi wants to merge 1 commit into
ROCm:candidatefrom
AtlantaPepsi:hotfix
Open

removing secondary reordering#314
AtlantaPepsi wants to merge 1 commit into
ROCm:candidatefrom
AtlantaPepsi:hotfix

Conversation

@AtlantaPepsi
Copy link
Copy Markdown
Contributor

Motivation

Quick fix for missed grouping change

Technical Details

Grouping should be based directly on StrideGenerate output, previous vestige of a second layer of reordering is in poda2a and caused wrong grouping results.

Test Plan

Example:
With 3 nodes, each with 4 devices, NUM_GROUPS=4, GROUP_STRIDE=4 should result in
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] => [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11] => [0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]

Test Result

A2A group 0: R0:G0, R1:G0, R2:G0
A2A group 1: R0:G1, R1:G1, R2:G1
A2A group 2: R0:G2, R1:G2, R2:G2
A2A group 3: R0:G3, R1:G3, R2:G3

Submission Checklist

Copilot AI review requested due to automatic review settings May 26, 2026 04:18
@AtlantaPepsi AtlantaPepsi requested a review from a team as a code owner May 26, 2026 04:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes incorrect device grouping in the PodAllToAll preset by removing an unintended second reordering step and basing group construction directly on the Utils::StrideGenerate() permutation.

Changes:

  • Builds the devices list in the same order as the StrideGenerate output (instead of reindexing into a separately permuted layout).
  • Ensures group membership corresponds to contiguous chunks of the strided device order (fixing wrong grouping results in poda2a).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants