
Cute Dsl kernel for Wgrad for Fused MOE Layer #2869

Merged
vthumbe1503 merged 10 commits into NVIDIA:main from vthumbe1503:users/vthumbe/wgrad_cute_dsl
Apr 13, 2026

Conversation

@vthumbe1503
Collaborator

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

vthumbe1503 and others added 3 commits April 13, 2026 02:46
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503 vthumbe1503 changed the title Users/vthumbe/wgrad cute dsl Cute Dsl for Wgrad for Fused MOE Layer Apr 13, 2026
@vthumbe1503 vthumbe1503 changed the title Cute Dsl for Wgrad for Fused MOE Layer Cute Dsl kernel for Wgrad for Fused MOE Layer Apr 13, 2026
@greptile-apps
Contributor

greptile-apps bot commented Apr 13, 2026

Greptile Summary

This PR routes the weight-gradient (wgrad) GEMM for the fused MXFP8 MOE backward pass through a new CuTe DSL kernel (grouped_gemm_wgrad_wrapper_sm100) on SM100+ hardware, with an automatic cuBLAS fallback when the kernel is unavailable. The core additions are _cudnn_compute_wgrad (handles dense / discrete per-expert modes) and a refactored _compute_grad_params that selects between the CuTe kernel and the legacy general_grouped_gemm_for_grouped_tensor path.

  • P1 — unhandled ImportError on wgrad kernel import: grouped_gemm_wgrad_kernel() contains no try/except, so if grouped_gemm_wgrad_wrapper_sm100 is absent despite the >= 1.23.0 version check, the backward pass crashes instead of falling back to cuBLAS (see inline comment).

Confidence Score: 4/5

Safe to merge after addressing the missing ImportError guard in grouped_gemm_wgrad_kernel().

One P1 finding: the wgrad kernel method has no try/except, so a symbol-not-found condition would crash the backward pass with no cuBLAS fallback. The fix is a two-line try/except wrapper. All other findings are P2 style concerns.

transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py — grouped_gemm_wgrad_kernel() classmethod needs ImportError handling.
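A minimal sketch of the suggested guard, under the review's assumptions. The module path and surrounding names here are illustrative, not the actual TransformerEngine API; only `grouped_gemm_wgrad_wrapper_sm100` comes from the PR:

```python
# Hedged sketch: guard the CuTe DSL wgrad kernel import so a missing
# symbol returns None (cuBLAS fallback) instead of crashing the
# backward pass. "cute_dsl_kernels" is a placeholder module path.
def grouped_gemm_wgrad_kernel():
    """Return the CuTe DSL wgrad kernel, or None to trigger the cuBLAS path."""
    try:
        # The symbol may be absent even when the >= 1.23.0 version check passes.
        from cute_dsl_kernels import grouped_gemm_wgrad_wrapper_sm100
    except ImportError:
        return None  # caller dispatches to cuBLAS when the kernel is None
    return grouped_gemm_wgrad_wrapper_sm100
```

With this wrapper, a symbol-not-found condition degrades to the existing cuBLAS path rather than raising during backward.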

Important Files Changed

  • transformer_engine/pytorch/ops/_common.py — Adds _nvidia_cudnn_frontend_supports_wgrad() version-gate for the new wgrad kernel; its body is identical to the existing _scaled_clamped_qgeglu check (same >= 1.23.0 threshold), which is potentially fragile if the wgrad symbol was added at a different version.
  • transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py — Introduces _cudnn_compute_wgrad and refactors _compute_grad_params to dispatch wgrad to the CuTe DSL kernel or fall back to cuBLAS. grouped_gemm_wgrad_kernel() lacks try/except around the import, so a symbol-not-found ImportError propagates to the backward pass with no fallback.
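The fragility noted for the version gate can be sketched as follows. This is an illustration only: the threshold `(1, 23, 0)` is the one cited in the review, and the function names are stand-ins, not the real `_common.py` helpers:

```python
# Hedged sketch of a per-feature version gate. Naming the threshold
# per feature (instead of reusing another check's body) makes it
# explicit if the wgrad symbol landed in a different release.
MIN_WGRAD_VERSION = (1, 23, 0)  # illustrative threshold from the review

def parse_version(version: str) -> tuple:
    """Parse 'major.minor.patch' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])

def supports_wgrad(installed: str) -> bool:
    """True when the installed frontend version meets the wgrad threshold."""
    return parse_version(installed) >= MIN_WGRAD_VERSION
```

If the wgrad symbol turns out to require a later release, only `MIN_WGRAD_VERSION` needs to change, without touching the unrelated `_scaled_clamped_qgeglu` gate.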

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[fuser_backward] --> B[_compute_grad_params FC2 wgrad]
    A --> C[_compute_grad_params FC1 wgrad]
    B --> D{cudnn_wgrad_kernel_fn not None?}
    C --> D2{cudnn_wgrad_kernel_fn not None?}
    D -->|Yes| E[functools.partial _cudnn_compute_wgrad]
    D -->|No| F[functools.partial cuBLAS fallback]
    D2 -->|Yes| E2[functools.partial _cudnn_compute_wgrad]
    D2 -->|No| F2[functools.partial cuBLAS fallback]
    E --> G{delay_wgrad?}
    G -->|Yes| H[wgrad_store.put deferred]
    G -->|No| I[_cudnn_compute_wgrad]
    I --> J{single_grouped_weight?}
    J -->|Yes| K[output_mode=dense]
    J -->|No| L[output_mode=discrete]
    subgraph version_gate[grouped_gemm_wgrad_kernel]
        M{supports_wgrad?} -->|False| N[return None]
        M -->|True| O[import grouped_gemm_wgrad_wrapper_sm100]
    end
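The dispatch in the flowchart can be sketched as below. `cute_dsl_wgrad` and `cublas_wgrad` are hypothetical stand-ins for `_cudnn_compute_wgrad` and the cuBLAS path; only the shape of the `functools.partial` dispatch and the dense/discrete `output_mode` choice follow the diagram:

```python
import functools

def cute_dsl_wgrad(x, dy, *, kernel_fn, output_mode):
    """Stand-in for _cudnn_compute_wgrad: run the CuTe DSL kernel."""
    return kernel_fn(x, dy, output_mode)

def cublas_wgrad(x, dy):
    """Stand-in for the legacy cuBLAS grouped-GEMM fallback."""
    return "cublas"

def make_wgrad_fn(cudnn_wgrad_kernel_fn, single_grouped_weight: bool):
    """Mirror the flowchart: pick kernel vs. fallback, then bind output_mode."""
    if cudnn_wgrad_kernel_fn is not None:
        # dense when the grouped weight is a single tensor, else discrete
        mode = "dense" if single_grouped_weight else "discrete"
        return functools.partial(
            cute_dsl_wgrad, kernel_fn=cudnn_wgrad_kernel_fn, output_mode=mode
        )
    return cublas_wgrad
```

The bound partial can then either run immediately or be handed to `wgrad_store.put` when `delay_wgrad` is set, matching the deferred branch in the diagram.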


Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
…/TransformerEngine into users/vthumbe/wgrad_cute_dsl
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
timmoon10
timmoon10 previously approved these changes Apr 13, 2026
Collaborator

@timmoon10 timmoon10 left a comment


LGTM

Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
@vthumbe1503
Collaborator Author

/te-ci pytorch

@vthumbe1503 vthumbe1503 merged commit 72328b3 into NVIDIA:main Apr 13, 2026
21 of 24 checks passed
