Feature/gemmbf16xfp32 by shaochangxu · Pull Request #47 · Tencent/hpc-ops

shaochangxu · 2026-05-28T11:23:04Z

This PR addresses the performance bottleneck of BF16xFP32 matrix multiplication (GEMM) in LLM inference by adding a high-precision FP32 GEMM operator based on double-BF16 residual decomposition. A naive FP32 GEMM runs on CUDA Cores, which gives high accuracy but low throughput, while reduced-precision schemes such as BF16 or TF32 can use Tensor Core acceleration but suffer from severe accuracy loss. The proposed method decomposes the FP32 weight matrix W into a high-part BF16 tensor W_high and a low-residual BF16 tensor W_low with scaling factor scale = 1/256, which transforms the original GEMM into a linear combination of two BF16 GEMMs.
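To make the decomposition concrete, here is a minimal pure-Python sketch of the idea, not the kernel from this PR. It assumes the split has the form W ≈ W_high + scale · W_low, with W_high = bf16(W) and W_low = bf16((W - W_high) / scale); the description does not spell out the exact formula, so treat that as an assumption. All helper names (`to_bf16`, `split_weight`, `dot_split`) are hypothetical, BF16 rounding is emulated with `struct`, and a single dot product stands in for each GEMM.

```python
import struct

SCALE = 1.0 / 256.0  # scaling factor stated in the PR description


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x):
    """Round to BF16 with round-to-nearest-even (finite inputs only)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 keeps the top 16 bits of the FP32 pattern; add a rounding bias first.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_weight(w):
    """Assumed split: w ~= w_high + SCALE * w_low, both parts held as BF16."""
    w_high = to_bf16(w)
    w_low = to_bf16((w - w_high) / SCALE)  # residual, rescaled before rounding
    return w_high, w_low


def dot(xs, ws):
    """Plain dot product; Python floats play the role of the accumulator."""
    return sum(x * w for x, w in zip(xs, ws))


def dot_split(xs, ws):
    """Two BF16 dot products combined linearly, standing in for two BF16 GEMMs."""
    parts = [split_weight(w) for w in ws]
    highs = [h for h, _ in parts]
    lows = [l for _, l in parts]
    return dot(xs, highs) + SCALE * dot(xs, lows)


# BF16 activations and FP32 weights (illustrative values only).
x = [to_bf16(v) for v in (0.5, 1.25, -2.0, 0.3)]
w = [to_fp32(v) for v in (0.1, -0.7, 1.2345678, 3.1415927)]

reference = dot(x, w)                          # FP32-weight result
bf16_only = dot(x, [to_bf16(v) for v in w])    # weights truncated to one BF16
split = dot_split(x, w)                        # high + scaled residual

print("BF16-only error:", abs(bf16_only - reference))
print("split error:    ", abs(split - reference))
```

Under this assumed split, each reconstructed weight differs from the FP32 original by at most about 2^-16 in relative terms, compared with about 2^-9 for a single BF16 value, which is why two BF16 GEMMs can approach FP32 accuracy while still using Tensor Cores.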

chaseshao added 2 commits May 28, 2026 19:00

support gemmbf16xfp32

3d5ee22

support gemmbf16xfp32

885c3c1

shaochangxu wants to merge 2 commits into Tencent:main from shaochangxu:feature/gemmbf16xfp32
