Add experimental ROCmFP4 quantization support #24184

Closed
charlie12345 wants to merge 3 commits into ggml-org:master from charlie12345:rocmfp4-upstream-ready

@charlie12345

PR title

Add experimental ROCmFP4 quantization support

Summary

This PR adds two experimental GGUF tensor types for AMD-oriented 4-bit floating-point quantization:

  • Q4_0_ROCMFP4: dual-scale ROCmFP4 layout, 18 bytes per 32-value block, 4.50 bpw.
  • Q4_0_ROCMFP4_FAST: single-scale ROCmFP4 layout, 17 bytes per 32-value block, 4.25 bpw.
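The block sizes and bits-per-weight figures above follow from simple arithmetic. The sketch below assumes that each block stores 32 four-bit codes (16 bytes) and that each scale occupies one byte, which is consistent with the stated 18-byte and 17-byte totals but is not confirmed by the PR text. The helper names are hypothetical and are not part of ggml.

```python
# Minimal sketch of the block-size arithmetic, under the assumptions stated above.
VALUES_PER_BLOCK = 32  # from the PR summary: 32 values per block
CODE_BITS = 4          # 4-bit codes (assumed from the "FP4" naming)


def block_bytes(num_scale_bytes):
    """Bytes per block: packed 4-bit codes plus one byte per scale (assumption)."""
    code_bytes = VALUES_PER_BLOCK * CODE_BITS // 8  # 32 * 4 / 8 = 16
    return code_bytes + num_scale_bytes


def bits_per_weight(total_bytes):
    """Average storage cost per value, in bits."""
    return total_bytes * 8 / VALUES_PER_BLOCK


for name, scales in (("Q4_0_ROCMFP4", 2), ("Q4_0_ROCMFP4_FAST", 1)):
    size = block_bytes(scales)
    print(f"{name}: {size} bytes/block, {bits_per_weight(size):.2f} bpw")
```

Under these assumptions the second scale byte accounts for the entire 0.25 bpw difference between the two layouts.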

The implementation is additive. Existing GGUF tensor types, file types, quantization modes, and backend dispatch paths are unchanged unless a tensor is explicitly quantized as Q4_0_ROCMFP4 or Q4_0_ROCMFP4_FAST.

The goal is to make ROCmFP4 usable from the normal llama.cpp workflow while preserving the existing llama.cpp stack:

  • llama-quantize can create ROCmFP4 GGUFs from normal source models.
  • --imatrix works through the existing importance-matrix quantization path.
  • CPU reference quantization/dequantization is included.
  • ROCm/HIP backend support covers conversion, copy, get-rows, dequantization, MMVQ/MMQ, and FlashAttention vector paths.
  • Vulkan backend support covers conversion, copy, set-rows, get-rows, dequantization, matvec/MMQ, and scalar FlashAttention decode paths.

Why

AMD Strix Halo-class systems benefit from compact low-bit tensor formats that still map cleanly onto the existing HIP and Vulkan backend structure. ROCmFP4 provides a compact FP4-style representation with finite E4M3 scales and a small signed codebook, while keeping the implementation inside llama.cpp's existing quantization and backend abstractions.

This is not presented as a native rocWMMA FP4 tensor-core path. The current speed and memory benefits come from the compact GGUF layout plus dedicated HIP/Vulkan decode and matmul paths. Future backend work can build on the same tensor types without changing the file format.

User-facing usage

Quantize:

./llama-quantize model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST

Quantize with an importance matrix:

./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST

Run as usual with the HIP or Vulkan backend enabled:

./llama-cli -m model-rocmfp4.gguf -p "Hello"
./llama-cli -m model-rocmfp4-fast.gguf -p "Hello"

Advanced mixed recipes should use the existing --tensor-type and --tensor-type-file options rather than adding more public file-type presets.

Implementation notes

  • The ROCmFP4 format-specific CPU/HIP helpers live under ggml/rocmfp4/.
  • The core tensor types are registered in ggml alongside existing quantized tensor types.
  • CPU reference quantization uses exhaustive finite E4M3 scale search and rejects invalid scale bytes during row validation.
  • HIP integration reuses the existing CUDA/HIP-shared backend structure and adds ROCmFP4-specific decode, conversion, copy, get-rows, matmul, and FlashAttention vector support.
  • Vulkan integration adds ROCmFP4 decode/conversion shaders and matmul/FlashAttention support while keeping existing non-ROCmFP4 shader behavior unchanged.
  • The public quantization surface intentionally exposes only the two tensor layouts: Q4_0_ROCMFP4 and Q4_0_ROCMFP4_FAST.
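To illustrate what an exhaustive finite E4M3 scale search and scale-byte validation could look like, here is a minimal sketch. It is not the code in ggml/rocmfp4/. It assumes the OCP-style FP8 E4M3 encoding (bias 7, no infinities, with the all-ones exponent and mantissa pattern reserved for NaN), treats only NaN bytes as invalid, and uses a hypothetical E2M1-style codebook because the PR text does not list the actual ROCmFP4 codebook values. All function names are illustrative.

```python
# Sketch only: E4M3 variant, codebook, and validity rule are assumptions.

def e4m3_to_float(byte):
    """Decode an FP8 E4M3 byte (assumed bias 7, no infinities); None for NaN."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return None  # NaN encoding in the assumed variant
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def is_valid_scale_byte(byte):
    """Assumed validation rule: a scale byte must decode to a finite value."""
    return e4m3_to_float(byte) is not None


# Hypothetical signed codebook (E2M1-style magnitudes); not taken from the PR.
MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
CODEBOOK = sorted({s * m for m in MAGNITUDES for s in (1.0, -1.0)})


def best_scale_byte(block):
    """Try every positive finite scale byte and keep the lowest squared error."""
    best_byte, best_err = None, None
    for byte in range(1, 0x7F):  # skips zero (0x00) and NaN (0x7F)
        scale = e4m3_to_float(byte)
        err = sum(min((x - scale * c) ** 2 for c in CODEBOOK) for x in block)
        if best_err is None or err < best_err:
            best_byte, best_err = byte, err
    return best_byte, best_err


byte, err = best_scale_byte([0.5, -1.0, 1.5, 2.0, -3.0, 4.0, 6.0, 0.0])
print(hex(byte), e4m3_to_float(byte), err)
```

An exhaustive search is affordable here because the assumed encoding has only 126 positive finite scale values, so each block needs at most 126 candidate evaluations.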

Compatibility and risk containment

This PR avoids changing existing quant behavior:

  • Existing quantization modes still map to their existing tensor types.
  • Existing MXFP4/NVFP4 codebook behavior is unchanged.
  • Existing backend support checks only route to ROCmFP4 paths when the tensor type is ROCmFP4.
  • ROCmFP4 row validation rejects malformed scale bytes early.
  • Extra local tuning presets were intentionally not exposed as public ftypes; mixed layouts remain possible through existing tensor override flags.

One review point: the new ggml/llama type IDs are currently assigned as 100 and 101. This preserves compatibility with existing experimental ROCmFP4 GGUF files, but they can be renumbered to the next contiguous upstream values if maintainers prefer that before merge.

Local validation

Validated locally on the upstream-ready branch:

CPU configure/build: passed
Vulkan configure/build: passed
HIP/ROCm configure/build: passed
test-quantize-fns: passed, including q4_0_rocmfp4 and q4_0_rocmfp4_fast
git diff --check origin/master..HEAD: passed
added-line secret/path scan: clean

The local machine does not have nvcc, so CUDA-enabled compilation should be covered by CI or by a maintainer with CUDA hardware. The HIP backend shares source files with the CUDA backend, so this is the main compatibility check I would want before merging.

Suggested reviewer focus

  • Confirm whether the new tensor/file type IDs should remain 100/101 or be renumbered before merge.
  • Confirm CUDA CI still compiles the shared ggml-cuda sources.
  • Review the HIP/Vulkan backend support checks to make sure ROCmFP4 paths are only selected for ROCmFP4 tensors.
  • Review the ROCmFP4 block format and row validation rules in ggml/rocmfp4/.

caf and others added 3 commits June 5, 2026 09:05
@github-actions github-actions Bot added the testing, Nvidia GPU, Vulkan, examples, python, and ggml labels Jun 5, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 5, 2026

Hi @charlie12345, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple backend changes in one PR: When adding support for a new model or feature, focus on CPU support only in the initial PR. Add support for other backends like CUDA in follow-up PRs. If you have a good reason to modify multiple backends in one PR, please explain it.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@charlie12345 charlie12345 marked this pull request as ready for review June 5, 2026 13:35
@charlie12345 charlie12345 requested review from a team, IMbackK and ggerganov as code owners June 5, 2026 13:35
@0cc4m
Contributor

0cc4m commented Jun 5, 2026

This is not even close to following contribution guidelines.

@charlie12345
Author

Thanks for the guidance. I split this into a smaller CPU/reference-format PR that avoids backend changes: #24185.

I am closing this larger backend-inclusive PR in favor of the smaller first step. If the CPU/reference tensor format is acceptable, I can submit ROCm/HIP and Vulkan acceleration as focused follow-up PRs.
