enable blockwise FP8 quantization on rocm by asdfvg123 · Pull Request #609 · ROCm/TransformerEngine

asdfvg123 · 2026-06-03T17:53:36Z

Description

Please include a brief summary of the changes, relevant motivation and context.

Enable blockwise FP8 quantization on ROCm.

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Remove the HIP guard in quantization.py.
Guard the quantization kernels that use TMA.
Add a branch to handle the different number of threads per wave on ROCm.
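
The last item concerns wave width: NVIDIA warps have 32 threads, while AMD wavefronts can have 64. The PR's actual branch is not shown in this thread, so the following is only an illustrative Python simulation of a butterfly (XOR-shuffle) max-reduction of the kind an amax computation might use, parameterized on the wave size. The name `wave_amax` and the reduction pattern are assumptions, not code from this PR.

```python
def wave_amax(values, wave_size):
    """Simulate an XOR-shuffle (butterfly) max-reduction across one wave.

    `values` holds one number per lane; `wave_size` must be a power of two
    (for example 32 or 64). Hypothetical helper, for illustration only.
    """
    if wave_size < 1 or wave_size & (wave_size - 1) or len(values) != wave_size:
        raise ValueError("need exactly wave_size lanes, wave_size a power of two")
    lanes = [abs(v) for v in values]
    offset = wave_size // 2
    steps = 0
    while offset > 0:
        # Every lane reads its partner's value (lane index XOR offset)
        # and keeps the larger of the two.
        lanes = [max(lanes[i], lanes[i ^ offset]) for i in range(wave_size)]
        offset //= 2
        steps += 1
    return lanes, steps
```

With 64 lanes the loop runs one more step than with 32, so code that hard-codes a 32-wide reduction would need a separate path on a 64-wide wave.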

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

matthiasdiener · 2026-06-03T18:28:00Z

 # TODO replace with call to fp8.py when recipe added.
-recipe_available = not IS_HIP_EXTENSION and (get_device_compute_capability() >= (9, 0) and float(torch.version.cuda) >= 12.8)
+if IS_HIP_EXTENSION:
+    recipe_available = get_device_compute_capability() >= (9, 0)


Wouldn't this always be True on ROCm TE?
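
To make the question concrete, here is a minimal sketch of the availability check written as a pure function. `blockwise_recipe_available` and `parse_version` are hypothetical names, and the inputs are passed in explicitly instead of being read from `torch` or `get_device_compute_capability()`. Whether the ROCm branch is always true depends on what `get_device_compute_capability()` returns on ROCm devices, which this thread does not state.

```python
def parse_version(text):
    """Turn '12.8' into (12, 8) so that '12.10' compares above '12.8'."""
    return tuple(int(part) for part in text.split(".")[:2])


def blockwise_recipe_available(is_hip, compute_capability, cuda_version=None):
    """Hypothetical stand-in for the `recipe_available` expression."""
    if is_hip:
        # Mirrors the added branch: only the capability is checked on ROCm.
        return compute_capability >= (9, 0)
    # Mirrors the original CUDA condition, with a tuple-based version compare.
    return (
        compute_capability >= (9, 0)
        and cuda_version is not None
        and parse_version(cuda_version) >= (12, 8)
    )
```

The tuple comparison is a design choice of this sketch: converting a version string with `float()` would read a string such as "12.10" as 12.1, which compares below 12.8.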

matthiasdiener · 2026-06-03T18:29:11Z

@@ -1 +1 @@
 /*************************************************************************


Needs AMD copyright

matthiasdiener · 2026-06-03T18:30:03Z

+#ifndef __HIP_PLATFORM_AMD__
 #include <cudaTypedefs.h>
+#endif
 #include <cuda_bf16.h>
 #include <cuda_runtime.h>

 #include <cfloat>
+#ifndef __HIP_PLATFORM_AMD__
 #include <cuda/barrier>
+#endif

 #include "common/common.h"
 #include "common/recipe/recipe_common.cuh"
 #include "common/util/cuda_runtime.h"
+#ifndef __HIP_PLATFORM_AMD__
 #include "common/util/ptx.cuh"
+#endif


These #includes should already be disabled via hipify, so the #ifndefs are probably unnecessary here.

matthiasdiener · 2026-06-03T18:43:26Z

+  static constexpr float max = 448.0f;
+  static constexpr float max_inverse = 1.0 / max;


Is this change necessary? fp8e4m3 max depends on the device type on AMD.
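
As background for this concern, two common 8-bit e4m3 encodings have different maximum finite values: the OCP-style e4m3fn format reaches 448, while the FNUZ variant reaches 240. The sketch below derives both from the bit layout. Which encoding a given AMD device uses is not stated in this thread, and `fp8_max` is a hypothetical helper written for this illustration.

```python
def fp8_max(exp_bits, man_bits, bias, top_mantissa_reserved):
    """Largest finite value of a small float format without infinities.

    `top_mantissa_reserved` is True when the all-ones exponent with an
    all-ones mantissa encodes NaN (as in e4m3fn), so the largest usable
    mantissa is one step lower.
    """
    max_exp_field = (1 << exp_bits) - 1
    max_man = (1 << man_bits) - 1
    if top_mantissa_reserved:
        max_man -= 1
    return 2.0 ** (max_exp_field - bias) * (1 + max_man / (1 << man_bits))


# e4m3fn: bias 7, top mantissa code reserved for NaN.
E4M3FN_MAX = fp8_max(4, 3, bias=7, top_mantissa_reserved=True)
# e4m3fnuz: bias 8, NaN is encoded elsewhere, so all mantissa codes are usable.
E4M3FNUZ_MAX = fp8_max(4, 3, bias=8, top_mantissa_reserved=False)
```

Under these assumptions, a hard-coded 448.0f would exceed the representable range of a device that uses the FNUZ encoding.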

alextmagro · 2026-06-03T18:48:39Z

Could you give a description of what you want to achieve with this PR? My understanding is that block fp8 quantization relies on some upstream kernels that will need to be adapted for AMD.

If you're just trying to enable the interface, I would argue that we should do this last, after we have a working quantization and GEMM path (and enabled and passing C++/Python tests).

asdfvg123 · 2026-06-03T18:57:07Z

@alextmagro
This PR enables only the quantization on AMD GPUs, not the GEMM. Upstream has two quantization kernel variants: one uses TMA and the other does not. I guarded the kernels that use TMA and used the non-TMA kernels to quantize on AMD.

I tested with tests/pytorch/test_float8blockwisetensor.py, and it passes [175 passed / 32 xpassed / 5 warnings].
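
For readers unfamiliar with the technique, blockwise quantization computes one scale per block of values instead of one per tensor. The following is a simplified sketch of that idea on a 1D row in plain Python. It is not the TransformerEngine kernel: it skips the actual FP8 cast and rounding, uses 1D blocks only, and `blockwise_scale`, `block_len`, and `fp8_max` are hypothetical names chosen for this example.

```python
def blockwise_scale(row, block_len, fp8_max=448.0):
    """Scale each block of `row` so its largest magnitude maps to `fp8_max`.

    Returns the scaled values and one inverse scale per block, which is
    what a consumer would multiply by to dequantize.
    """
    scaled, scale_inv = [], []
    for start in range(0, len(row), block_len):
        block = row[start:start + block_len]
        amax = max(abs(v) for v in block)
        # An all-zero block keeps a neutral scale to avoid dividing by zero.
        scale = fp8_max / amax if amax > 0 else 1.0
        # Clamp to the representable range after scaling.
        scaled.extend(max(-fp8_max, min(fp8_max, v * scale)) for v in block)
        scale_inv.append(1.0 / scale)
    return scaled, scale_inv
```

Because each block gets its own scale, a large outlier in one block does not reduce the precision available to the others.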

alextmagro · 2026-06-03T19:10:43Z

OK, in that case we need to add the cpp blockwise tests to the CMake file, and the pytorch test file to ci/pytorch.sh.

enable blockwise FP8 quantization on rocm

8335488

asdfvg123 requested review from alextmagro, wangye805 and wenchenvincent June 3, 2026 17:53

asdfvg123 requested a review from ipanfilo as a code owner June 3, 2026 17:53

matthiasdiener reviewed Jun 3, 2026

asdfvg123 wants to merge 1 commit into dev from yeonsoo/blockwise_fp8

3 participants
