Add ROCmFP4 CPU quantization support #24185
Conversation
You're still not following the PR description template. In general, the threshold for adding quantization types is very high, because it leads to a lot of work in all other backends to make them work and run well, and can't be removed without breaking support of existing models. You didn't post any performance/quality benchmarks that would even allow comparing what you came up with with existing types. IMO step one of proposing a quantization scheme would be an issue or discussion thread where you clearly explain why it is a good addition with benchmarks comparing with existing quants, and present this kind of prototype code to validate your claims.
Thanks, that is fair. I ran a fresh limited comparison so the rationale is concrete.
Current PR scope: CPU/reference format only. The speed numbers below come from the prototype backend branch and are intended to justify why the tensor format may be worth reviewing before backend PRs.
Hardware: Framework AMD Strix Halo class system, 128 GB unified memory
Models:
ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
Speed command shape: llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3
Decode speed:
Limited quality check: llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32
This is a limited 32-chunk WikiText-2 pass, not a full benchmark, but the result is effectively tied within uncertainty. I am not claiming universal AMD speedup or full quality parity from this data. I also do not have a same-model Q4 baseline locally yet, so I am not claiming this is smaller/better than Q4_K_M. The narrower claim is: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied it on the limited PPL pass.
I added two more same-hardware comparisons to the local benchmark notes. These are secondary evidence; the Qwen3.6 27B result above is still the cleanest quality+speed datapoint. All runs use the same Strix Halo class system and the same speed command shape: llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3
Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL
Supported claim for this model: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.
Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL
Supported claim for this model: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.
I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.
Overview
This PR adds CPU/reference support for two experimental ROCmFP4 GGUF tensor types:
Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
The PR is intentionally limited to the CPU/reference portion:
llama-quantize integration.
Existing quantization modes and existing MXFP4/NVFP4 behavior are unchanged. No non-CPU backend files are modified in this PR. ROCm/HIP and Vulkan acceleration are intended as follow-up PRs only if the tensor formats and reference behavior are considered acceptable.
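The bits-per-weight figures quoted for the two layouts follow directly from the block sizes. The sketch below reproduces that arithmetic; the helper names are illustrative, and the split into 4-bit codes plus scale bytes is an assumption inferred from the "FP4" name and the dual-scale/single-scale wording, not a description of the actual struct layout in the PR.

```python
# Sanity check of the block-size arithmetic from the Overview.
# Helper names are hypothetical; only the byte counts and the
# 32-value block size come from the PR description.

BLOCK_VALUES = 32  # values per block, as stated in the PR description


def bits_per_weight(block_bytes: int, block_values: int = BLOCK_VALUES) -> float:
    """Average storage cost per value, in bits."""
    return block_bytes * 8 / block_values


def scale_overhead_bytes(block_bytes: int,
                         block_values: int = BLOCK_VALUES,
                         code_bits: int = 4) -> int:
    """Bytes left over after the packed codes.

    ASSUMPTION: every value is stored as a 4-bit code, so the
    remainder of the block is scale/metadata bytes.
    """
    return block_bytes - (block_values * code_bits) // 8


layouts = {"Q4_0_ROCMFP4": 18, "Q4_0_ROCMFP4_FAST": 17}

for name, nbytes in layouts.items():
    print(f"{name}: {bits_per_weight(nbytes):.2f} bpw, "
          f"{scale_overhead_bytes(nbytes)} byte(s) beyond the packed codes")
```

Under that assumption, the dual-scale layout spends 2 bytes per block on scales and the single-scale layout spends 1, which accounts for the 0.25 bpw gap between the two types.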
User-facing usage:
With an importance matrix:
Additional information
This is a smaller split from #24184 in response to the automated PR checker. The accelerated prototype branch is separate; this PR only adds the portable/reference tensor format and quantizer behavior.
Prototype benchmark rationale, measured on a Framework AMD Strix Halo class system with 128 GB unified memory:
Models compared:
Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.
ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
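For reviewers who want to re-derive the size comparison, here is a minimal sketch of the arithmetic. The byte counts are the ones quoted above; the function and variable names are illustrative only.

```python
# Re-derive the file-size comparison from the quoted byte counts.
# Function and variable names are hypothetical.

GIB = 1024 ** 3  # bytes per GiB

rocmfp4_bytes = 14_817_252_512  # ROCmFP4 GGUF, from the PR description
q5_bytes = 20_350_682_240       # UD-Q5_K_XL GGUF, from the PR description


def size_reduction(baseline: int, candidate: int) -> tuple[int, float]:
    """Return (bytes saved, percent saved relative to the baseline)."""
    saved = baseline - candidate
    return saved, 100.0 * saved / baseline


saved, pct = size_reduction(q5_bytes, rocmfp4_bytes)

print(f"{rocmfp4_bytes / GIB:.2f} GiB vs {q5_bytes / GIB:.2f} GiB")
# prints: 13.80 GiB vs 18.95 GiB
print(f"saved {saved:,} bytes ({pct:.2f}% smaller)")
# prints: saved 5,533,429,728 bytes (27.19% smaller)
```

The percentage is taken relative to the Q5 baseline, which is the convention the "27.19% smaller" figure uses.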
Speed command shape: llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3
Decode speed from this prototype backend branch:
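Limited quality check: llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32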
This is a limited 32-chunk WikiText-2 pass, not a full quality benchmark, but the result is effectively tied within uncertainty.
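One simple way to make "tied within uncertainty" concrete is to compare the gap between two perplexity estimates against their combined error. The sketch below assumes each run reports an estimate together with a +/- error term. The function name, the `k` threshold, and the example numbers are all hypothetical placeholders, not results from this PR.

```python
import math


def tied_within_uncertainty(ppl_a: float, err_a: float,
                            ppl_b: float, err_b: float,
                            k: float = 1.0) -> bool:
    """True if the two estimates differ by no more than k combined errors.

    The errors are combined in quadrature, which treats the two runs as
    independent. This is one common convention, not the only possible
    criterion, and it is not necessarily the one used for the PR numbers.
    """
    combined = math.hypot(err_a, err_b)
    return abs(ppl_a - ppl_b) <= k * combined


# Placeholder values for illustration only (NOT measurements from this PR):
print(tied_within_uncertainty(7.00, 0.10, 7.05, 0.10))  # gap 0.05 vs ~0.14 combined
print(tied_within_uncertainty(7.00, 0.05, 7.30, 0.05))  # gap 0.30 vs ~0.07 combined
```

With only 32 chunks the error terms are relatively wide, so a tie under this kind of check is weaker evidence than a tie on the full test set would be.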
Narrow claim supported by this data: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied it on the limited PPL pass. I am not claiming universal AMD speedup or full quality parity from this limited data, and I do not have a same-model Q4 baseline locally yet.
Additional prototype datapoints, same hardware and same speed command shape:
Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:
Supported claim: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.
Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:
Supported claim: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.
I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.
Local validation for this CPU/reference PR:
New IDs use the next contiguous upstream values:
GGML_TYPE_Q4_0_ROCMFP4 = 42
GGML_TYPE_Q4_0_ROCMFP4_FAST = 43
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 41
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 42
Follow-up plan if the CPU/reference format is considered acceptable:
Requirements