Add experimental ROCmFP4 quantization support #24184
(cherry picked from commit 22c4f745008abcd26d4c3235aee90288448171a2)
|
Hi @charlie12345, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
|
This is not even close to following contribution guidelines. |
|
Thanks for the guidance. I split this into a smaller CPU/reference-format PR that avoids backend changes: #24185.

I am closing this larger backend-inclusive PR in favor of the smaller first step. If the CPU/reference tensor format is acceptable, I can submit ROCm/HIP and Vulkan acceleration as focused follow-up PRs. |
PR title
Add experimental ROCmFP4 quantization support
Summary
This PR adds two experimental GGUF tensor types for AMD-oriented 4-bit floating-point quantization:
- `Q4_0_ROCMFP4`: dual-scale ROCmFP4 layout, 18 bytes per 32-value block, 4.50 bpw.
- `Q4_0_ROCMFP4_FAST`: single-scale ROCmFP4 layout, 17 bytes per 32-value block, 4.25 bpw.

The implementation is additive. Existing GGUF tensor types, file types, quantization modes, and backend dispatch paths are unchanged unless a tensor is explicitly quantized as `Q4_0_ROCMFP4` or `Q4_0_ROCMFP4_FAST`.
The goal is to make ROCmFP4 usable from the normal llama.cpp workflow while preserving the existing llama.cpp stack:
- `llama-quantize` can create ROCmFP4 GGUFs from normal source models.
- `--imatrix` works through the existing importance-matrix quantization path.

Why
AMD Strix Halo class systems benefit from compact low-bit tensor formats that still map cleanly onto the existing HIP and Vulkan backend structure. ROCmFP4 provides a compact FP4-style representation with finite E4M3 scales and a small signed codebook, while keeping the implementation inside llama.cpp's existing quantization and backend abstractions.
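To make the description above concrete, here is a minimal sketch of decoding one single-scale block. Only the scale decoding follows a published format (the OCP FP8 E4M3 encoding: bias 7, no infinities, one NaN pattern per sign). Everything else is an assumption made for illustration, because the PR text does not spell out the codebook values, the nibble order, or where the scale byte sits: the sketch assumes the E2M1 value set as the signed codebook, low-nibble-first packing, and a single trailing scale byte. The names `e4m3_to_float`, `CODEBOOK`, and `dequantize_block_fast` are hypothetical and do not come from the PR.

```python
import math


def e4m3_to_float(byte: int) -> float:
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        # The only non-finite encoding in E4M3; a "finite" scale must avoid it.
        return math.nan
    if exp == 0:
        # Subnormal: (man / 8) * 2**(1 - 7)
        return sign * man * 2.0 ** -9
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# ASSUMPTION: the signed 16-entry codebook is the E2M1 value set.
# The PR only says "a small signed codebook"; the real table may differ.
_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
CODEBOOK = _MAGNITUDES + tuple(-m for m in _MAGNITUDES)


def dequantize_block_fast(block: bytes) -> list[float]:
    """Decode a hypothetical 17-byte single-scale block into 32 floats.

    ASSUMED layout: 16 bytes of packed 4-bit codes (low nibble first),
    followed by one E4M3 scale byte.
    """
    if len(block) != 17:
        raise ValueError("expected 16 code bytes plus 1 scale byte")
    scale = e4m3_to_float(block[16])
    if not math.isfinite(scale):
        raise ValueError("scale byte is not a finite E4M3 value")
    values = []
    for packed in block[:16]:
        values.append(CODEBOOK[packed & 0x0F] * scale)  # low nibble
        values.append(CODEBOOK[packed >> 4] * scale)    # high nibble
    return values
```

A dual-scale block would presumably carry a second scale byte and apply each scale to part of the block, but how the 32 values are split between the two scales is not stated in the PR, so it is left out of the sketch.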
This is not presented as a native rocWMMA FP4 tensor-core path. The current speed and memory benefits come from the compact GGUF layout plus dedicated HIP/Vulkan decode and matmul paths. Future backend work can build on the same tensor types without changing the file format.
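As a rough illustration of the memory effect of the compact layout, the block sizes quoted in the Summary (18 and 17 bytes per 32 values) can be turned into bits per weight and tensor sizes. The helper names and the 4096 x 4096 example tensor are hypothetical; the F16 comparison relies only on F16 storing 2 bytes per value.

```python
BLOCK_VALUES = 32  # values per block, as stated in the PR summary

# Bytes per block, as stated in the PR summary.
BLOCK_BYTES = {
    "Q4_0_ROCMFP4": 18,       # dual-scale layout
    "Q4_0_ROCMFP4_FAST": 17,  # single-scale layout
}


def bits_per_weight(type_name: str) -> float:
    """Average storage cost of one weight, in bits."""
    return BLOCK_BYTES[type_name] * 8 / BLOCK_VALUES


def tensor_bytes(type_name: str, n_values: int) -> int:
    """Storage for a tensor whose element count is a multiple of the block size."""
    if n_values % BLOCK_VALUES != 0:
        raise ValueError("element count must be a multiple of 32")
    return n_values // BLOCK_VALUES * BLOCK_BYTES[type_name]


def f16_bytes(n_values: int) -> int:
    """Storage for the same tensor in F16 (2 bytes per value)."""
    return n_values * 2


# Hypothetical example: one 4096 x 4096 weight matrix.
n = 4096 * 4096
for name in BLOCK_BYTES:
    ratio = tensor_bytes(name, n) / f16_bytes(n)
    print(f"{name}: {bits_per_weight(name):.2f} bpw, {ratio:.3f} of F16 size")
```

The figures cover tensor storage only; they say nothing about decode or matmul speed, which depends on the backend paths described above.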
User-facing usage
Quantize:
Quantize with an importance matrix:
Run as usual with the HIP or Vulkan backend enabled:
Advanced mixed recipes should use the existing `--tensor-type` and `--tensor-type-file` options rather than adding more public file-type presets.

Implementation notes
- [...] `ggml/rocmfp4/`.
- [...] `Q4_0_ROCMFP4` and `Q4_0_ROCMFP4_FAST`.

Compatibility and risk containment
This PR avoids changing existing quant behavior:
One review point: the new ggml/llama type IDs are currently assigned as `100` and `101`. This preserves compatibility with existing experimental ROCmFP4 GGUF files, but they can be renumbered to the next contiguous upstream values if maintainers prefer that before merge.

Local validation
Validated locally on the upstream-ready branch:
The local machine does not have `nvcc`, so CUDA-enabled compilation should be covered by CI or by a maintainer with CUDA hardware. The HIP backend shares source files with the CUDA backend, so this is the main compatibility check I would want before merging.

Suggested reviewer focus
- [...] `100`/`101` or be renumbered before merge.
- [...] `ggml-cuda` sources.
- [...] `ggml/rocmfp4/`.