🐛 [Bug] Torch-TRT does not translate softmax quantizer generated by modelopt fp8 mha quantization #4200

@nvyihengz

Bug Description

ModelOpt inserts Q/DQ pairs around BMM1, softmax, and BMM2 to match TensorRT's FP8 MHA pattern. However, Torch-TensorRT appears to omit the softmax quantizer, and as a result TensorRT does not pick up the fused MHA kernel.
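
To make the pattern concrete, here is a minimal sketch in plain Python. It assumes a toy graph representation: the `Node` type, the op labels, and the node names are all hypothetical and are not Torch-TensorRT, ModelOpt, or TensorRT APIs. It only illustrates the check being described, namely whether each softmax output passes through a quantize/dequantize pair before reaching BMM2.

```python
from typing import Dict, List, NamedTuple


class Node(NamedTuple):
    name: str
    op: str             # hypothetical op label, e.g. "quantize", "softmax"
    inputs: List[str]   # names of producer nodes


def softmax_outputs_missing_qdq(nodes: List[Node]) -> List[str]:
    """Return names of softmax nodes whose output is not consumed by a Q -> DQ pair."""
    # Build a producer-name -> consumer-nodes map.
    consumers: Dict[str, List[Node]] = {}
    for node in nodes:
        for src in node.inputs:
            consumers.setdefault(src, []).append(node)

    missing = []
    for node in nodes:
        if node.op != "softmax":
            continue
        users = consumers.get(node.name, [])
        # Every consumer must be a quantize node that itself feeds a dequantize node.
        ok = bool(users) and all(
            u.op == "quantize"
            and any(c.op == "dequantize" for c in consumers.get(u.name, []))
            for u in users
        )
        if not ok:
            missing.append(node.name)
    return missing


# Toy version of the expected pattern: softmax -> Q -> DQ -> BMM2.
expected = [
    Node("bmm1", "bmm", ["q_dq", "k_dq"]),
    Node("softmax", "softmax", ["bmm1"]),
    Node("p_q", "quantize", ["softmax"]),
    Node("p_dq", "dequantize", ["p_q"]),
    Node("bmm2", "bmm", ["p_dq", "v_dq"]),
]

# Toy version of what this report describes: the softmax quantizer is dropped.
observed = [
    Node("bmm1", "bmm", ["q_dq", "k_dq"]),
    Node("softmax", "softmax", ["bmm1"]),
    Node("bmm2", "bmm", ["softmax", "v_dq"]),
]

print(softmax_outputs_missing_qdq(expected))  # []
print(softmax_outputs_missing_qdq(observed))  # ['softmax']
```

The same idea could be applied to a real exported graph by mapping its nodes onto this toy structure, but which op names to match there is not something this sketch settles.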

To Reproduce

Steps to reproduce the behavior:

  1. Launch the NVIDIA PyTorch container: nvcr.io/nvidia/pytorch:26.03-py3
  2. Install transformers: pip install transformers
  3. Install modelopt nightly: pip install --upgrade "git+https://github.com/NVIDIA/Model-Optimizer.git@main"
  4. Run the attached scripts

vit_fp8_mha_qdq_inspect.log
vit_fp8_mha_qdq_inspect.py

Expected behavior

The softmax quantizers are converted to TensorRT Q/DQ nodes along with the other BMM quantizers.

Environment

See steps to reproduce

Additional context

@narendasan filed per discussion
