Skip to content

speed up nvte_multi_padding / nvte_multi_unpadding#592

Open
matthiasdiener wants to merge 9 commits into
devfrom
mdiener/speedup-pad-unpad
Open

speed up nvte_multi_padding / nvte_multi_unpadding#592
matthiasdiener wants to merge 9 commits into
devfrom
mdiener/speedup-pad-unpad

Conversation

@matthiasdiener
Copy link
Copy Markdown
Contributor

@matthiasdiener matthiasdiener commented May 20, 2026

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes https://github.com/ROCm/frameworks-internal/issues/16530

See https://github.com/ROCm/frameworks-internal/issues/16530#issuecomment-4502138388 for performance.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this May 20, 2026
@matthiasdiener matthiasdiener added the ci-level 1 CI test level 1 label May 20, 2026
@matthiasdiener matthiasdiener marked this pull request as ready for review May 20, 2026 20:01
Comment thread transformer_engine/common/util/padding.cu Outdated
Comment thread transformer_engine/common/util/padding.cu
Comment thread transformer_engine/common/util/padding.cu Outdated
@matthiasdiener matthiasdiener requested a review from aris134 June 1, 2026 19:59
Copy link
Copy Markdown
Contributor

@aris134 aris134 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

// falls back to element-wise for partial/unaligned cases.
// Note: NT loads were also benchmarked but hurt performance.
template <uint32_t nvec, typename Type>
__device__ __forceinline__ void nt_store_to_elts(const Vec<Type, nvec>& v,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the case where we hit non-aligned vectors? Isn't FP8/MXFP8 always padded to a multiple of 16 by default? Ideally we would template out the NT vs elementwise stores


#ifdef __HIP_PLATFORM_AMD__
// Process subtiles with vectorized loads/stores
#pragma unroll
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try #pragma unroll 2 here? If we have the registers available that might help performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-level 1 CI test level 1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants