Feature/gemmbf16xfp32 #47

Open
shaochangxu wants to merge 2 commits into Tencent:main from shaochangxu:feature/gemmbf16xfp32

@shaochangxu
Contributor

This MR addresses the performance bottleneck of BF16xFP32 matrix multiplication (GEMM) in LLM inference by proposing a high-precision FP32 GEMM operator based on double-BF16 residual decomposition. A naive FP32 GEMM runs on CUDA Cores, which gives high accuracy but low throughput; reduced-precision schemes such as BF16 or TF32 can use Tensor Core acceleration but suffer severe accuracy loss. The proposed method decomposes the FP32 weight matrix W into a high-part BF16 tensor W_high and a low-residual BF16 tensor W_low, with scaling factor scale = 1/256, and rewrites the original GEMM as an equivalent linear combination of two BF16 GEMMs.
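For readers who want to see the arithmetic, here is a minimal pure-Python sketch of one way the decomposition could work. It is not the kernel from this MR. It assumes the residual is stored as W_low = bf16((W - W_high) / scale) and that the result is recombined as X·W_high + scale · (X·W_low); the description does not spell out either detail. The helper names (`to_bf16`, `split_weight`, `split_gemm`) are hypothetical, BF16 is emulated by rounding FP32 bit patterns, and Python floats stand in for the FP32 accumulation a Tensor Core GEMM would perform.

```python
import struct

SCALE = 1.0 / 256.0  # scaling factor stated in the MR description


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 rounding (nearest, ties to even). Finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round on the 16 bits that BF16 drops
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split_weight(w: float):
    """Split one FP32 weight into (w_high, w_low), both BF16 values.

    Assumption: the residual is divided by SCALE before rounding to BF16.
    """
    w = to_fp32(w)
    w_high = to_bf16(w)
    w_low = to_bf16((w - w_high) / SCALE)
    return w_high, w_low


def matmul(a, b):
    """Plain list-of-lists matrix product, standing in for one BF16 GEMM."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_gemm(x, w):
    """Approximate x @ w as x @ W_high + SCALE * (x @ W_low)."""
    parts = [[split_weight(v) for v in row] for row in w]
    w_high = [[p[0] for p in row] for row in parts]
    w_low = [[p[1] for p in row] for row in parts]
    y_high = matmul(x, w_high)
    y_low = matmul(x, w_low)
    return [[h + SCALE * l for h, l in zip(rh, rl)]
            for rh, rl in zip(y_high, y_low)]


if __name__ == "__main__":
    # Activations are chosen to be exactly representable in BF16.
    x = [[1.0, 0.5], [-2.0, 0.25]]
    w = [[to_fp32(v) for v in row]
         for row in [[0.1234567, -1.7654321], [3.1415927, 0.0001234]]]
    print("reference :", matmul(x, w))
    print("split GEMM:", split_gemm(x, w))
    print("BF16 only :", matmul(x, [[to_bf16(v) for v in row] for row in w]))
```

Under these assumptions W_high keeps 8 significant bits of each weight and W_low recovers roughly 8 more, so the reconstructed weight W_high + scale · W_low has a relative error of about 2^-16, compared with about 2^-8 for a single BF16 cast.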
