Add ROCmFP4 CPU quantization support #24185

Open: charlie12345 wants to merge 1 commit into ggml-org:master from charlie12345:rocmfp4-cpu-only

charlie12345 commented Jun 5, 2026

Overview

This PR adds CPU/reference support for two experimental ROCmFP4 GGUF tensor types:

  • Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
  • Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
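The bits-per-weight figures above follow directly from the block sizes. A minimal sketch of that arithmetic, assuming each of the 32 values is stored as one 4-bit code so that the remaining bytes in a block hold the scale data (the helper names are illustrative and not taken from the PR):

```python
BLOCK_VALUES = 32    # values per block, as stated in the PR description
BITS_PER_VALUE = 4   # assumption: one 4-bit code per value


def bits_per_weight(block_bytes, block_values=BLOCK_VALUES):
    """Average storage cost per value, in bits."""
    return block_bytes * 8 / block_values


def scale_bytes(block_bytes, block_values=BLOCK_VALUES):
    """Bytes left over for scales once the packed 4-bit codes are accounted for."""
    payload_bytes = block_values * BITS_PER_VALUE // 8
    return block_bytes - payload_bytes


for name, size in (("Q4_0_ROCMFP4", 18), ("Q4_0_ROCMFP4_FAST", 17)):
    print(name, bits_per_weight(size), "bpw,", scale_bytes(size), "scale byte(s)")
```

Under that assumption the dual-scale layout leaves 2 bytes for scales and the single-scale layout leaves 1, which matches the naming of the two types.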

The PR is intentionally limited to the CPU/reference portion:

  • GGUF tensor type registration.
  • CPU reference quantization/dequantization.
  • row validation for ROCmFP4 scale bytes.
  • llama-quantize integration.
  • imatrix support through the existing quantization path.
  • concise format documentation.

Existing quantization modes and existing MXFP4/NVFP4 behavior are unchanged. No non-CPU backend files are modified in this PR. ROCm/HIP and Vulkan acceleration are intended as follow-up PRs only if the tensor formats and reference behavior are considered acceptable.

User-facing usage:

./llama-quantize model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST

With an importance matrix:

./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4

Additional information

This is a smaller split from #24184 in response to the automated PR checker. The accelerated prototype branch is separate; this PR only adds the portable/reference tensor format and quantizer behavior.

Prototype benchmark rationale, measured on a Framework AMD Strix Halo class system with 128 GB unified memory:

Models compared:

  • ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
  • Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
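The size comparison can be reproduced from the two byte counts. A small sketch, using only the file sizes quoted above (function names are illustrative):

```python
GIB = 1024 ** 3

rocmfp4_bytes = 14_817_252_512   # ROCmFP4 STRIX_LEAN file size from the PR
baseline_bytes = 20_350_682_240  # UD-Q5_K_XL file size from the PR


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


def reduction_pct(small, large):
    """How much smaller `small` is than `large`, as a percentage of `large`."""
    return (large - small) / large * 100


print(f"{to_gib(rocmfp4_bytes):.2f} GiB vs {to_gib(baseline_bytes):.2f} GiB")
print(f"{baseline_bytes - rocmfp4_bytes:,} bytes smaller "
      f"({reduction_pct(rocmfp4_bytes, baseline_bytes):.2f}%)")
```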

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

| Model | Backend | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|---|
| ROCmFP4 STRIX_LEAN | Vulkan | 255.19 +/- 2.88 | 14.16 +/- 0.09 |
| UD-Q5_K_XL | Vulkan | 333.41 +/- 0.64 | 10.77 +/- 0.01 |
| ROCmFP4 STRIX_LEAN | ROCm | 362.72 +/- 1.89 | 13.83 +/- 0.01 |
| UD-Q5_K_XL | ROCm | 335.14 +/- 5.25 | 10.23 +/- 0.02 |

Decode speed from this prototype backend branch:

  • Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
  • ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.
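The percentages are plain throughput ratios. A short sketch that recomputes them from the mean tok/s values in the table, ignoring the +/- spread (the dictionary layout and function name are illustrative):

```python
def pct_change(candidate, baseline):
    """Relative throughput change of candidate vs baseline, in percent."""
    return (candidate / baseline - 1) * 100


# Mean tok/s from the llama-bench table: (ROCmFP4 STRIX_LEAN, UD-Q5_K_XL)
results = {
    ("Vulkan", "pp512"): (255.19, 333.41),
    ("Vulkan", "tg128"): (14.16, 10.77),
    ("ROCm", "pp512"): (362.72, 335.14),
    ("ROCm", "tg128"): (13.83, 10.23),
}

for (backend, test), (rocmfp4, baseline) in results.items():
    print(f"{backend} {test}: {pct_change(rocmfp4, baseline):+.1f}%")
```

Applied to the pp512 column, the same calculation gives roughly -23.5% on Vulkan and +8.2% on ROCm for this model, so the decode gain does not carry over to Vulkan prompt processing in this run.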

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

| Model | Backend | Chunks | PPL |
|---|---|---|---|
| ROCmFP4 STRIX_LEAN | Vulkan | 32 | 6.6538 +/- 0.09455 |
| UD-Q5_K_XL | Vulkan | 32 | 6.6554 +/- 0.09661 |

This is a limited 32-chunk WikiText-2 pass, not a full quality benchmark, but the result is effectively tied within uncertainty.
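One rough way to read "tied within uncertainty" is to compare the PPL gap with the combined reported error. The sketch below treats the two +/- values as independent standard errors, which is an assumption: both runs use the same text, so the errors are likely correlated, and this is only a sanity check rather than a proper significance test.

```python
import math


def gap_in_sigmas(ppl_a, err_a, ppl_b, err_b):
    """PPL difference divided by the combined error (independence assumed)."""
    return abs(ppl_a - ppl_b) / math.hypot(err_a, err_b)


# Values from the 32-chunk WikiText-2 table
z = gap_in_sigmas(6.6538, 0.09455, 6.6554, 0.09661)
print(f"gap = {z:.3f} combined sigma")
```

The gap here is a small fraction of one combined sigma, which is consistent with calling the two results indistinguishable on this limited pass.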

Narrow claim supported by this data: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass. I am not claiming a universal AMD speedup or full quality parity from this limited data, and I do not have a same-model Q4 baseline locally yet.

Additional prototype datapoints, same hardware and same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:

| Metric | ROCmFP4 STRIX_LEAN | UD-Q5_K_XL |
|---|---|---|
| Size | 17.74 GiB | 25.29 GiB |
| Vulkan pp512 | 1092.13 +/- 11.48 tok/s | 1027.47 +/- 14.50 tok/s |
| Vulkan tg128 | 76.95 +/- 0.09 tok/s | 57.31 +/- 0.04 tok/s |
| ROCm pp512 | 1128.36 +/- 20.65 tok/s | 941.70 +/- 14.89 tok/s |
| ROCm tg128 | 67.53 +/- 0.16 tok/s | 48.09 +/- 0.08 tok/s |
| Limited WikiText-2 PPL, 32 chunks | 6.0609 +/- 0.08084 | 5.8748 +/- 0.07794 |

Supported claim: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:

| Metric | ROCmFP4 STRIX_LEAN | UD-Q4_K_XL |
|---|---|---|
| Size | 12.65 GiB | 15.84 GiB |
| Vulkan pp512 | 1175.12 +/- 15.11 tok/s | 1228.07 +/- 15.89 tok/s |
| Vulkan tg128 | 62.68 +/- 0.03 tok/s | 52.73 +/- 0.13 tok/s |
| ROCm pp512 | 1305.38 +/- 28.52 tok/s | 1191.64 +/- 17.98 tok/s |
| ROCm tg128 | 61.51 +/- 0.14 tok/s | 45.63 +/- 0.06 tok/s |

Supported claim: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.

Local validation for this CPU/reference PR:

cmake configure, CPU-only: passed
llama-quantize build: passed
test-quantize-fns build: passed
test-quantize-perf build: passed
test-quantize-fns: passed, including q4_0_rocmfp4 and q4_0_rocmfp4_fast
git diff --check: passed
added-line secret/path scan: clean

New IDs use the next contiguous upstream values:

  • GGML_TYPE_Q4_0_ROCMFP4 = 42
  • GGML_TYPE_Q4_0_ROCMFP4_FAST = 43
  • LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 41
  • LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 42

Follow-up plan if the CPU/reference format is considered acceptable:

  1. ROCm/HIP backend support.
  2. Vulkan backend support.
  3. Backend tests and benchmark notes for AMD Strix Halo class systems.

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI assistance was used to help split the change into a CPU/reference-only PR, run local validation/benchmark commands, and draft the benchmark summary. The submitter reviewed the resulting code and PR text.

charlie12345 requested a review from ggerganov as a code owner June 5, 2026 13:42
github-actions bot added the examples and ggml (changes relating to the ggml tensor library for machine learning) labels Jun 5, 2026
0cc4m (Contributor) commented Jun 5, 2026

You're still not following the PR description template.

In general, the threshold for adding quantization types is very high, because it leads to a lot of work in all other backends to make them work and run well, and can't be removed without breaking support of existing models.

You didn't post any performance/quality benchmarks that would even allow comparing what you came up with with existing types. IMO step one of proposing a quantization scheme would be an issue or discussion thread where you clearly explain why it is a good addition with benchmarks comparing with existing quants, and present this kind of prototype code to validate your claims.

charlie12345 (Author) commented:

Thanks, that is fair. I ran a fresh limited comparison so the rationale is concrete.

Current PR scope: CPU/reference format only. The speed numbers below come from the prototype backend branch and are intended to justify why the tensor format may be worth reviewing before backend PRs.

Hardware: Framework AMD Strix Halo class system, 128 GB unified memory

Models:

  • ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB
  • Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

| Model | Backend | pp512 (tok/s) | tg128 (tok/s) |
|---|---|---|---|
| ROCmFP4 STRIX_LEAN | Vulkan | 255.19 +/- 2.88 | 14.16 +/- 0.09 |
| UD-Q5_K_XL | Vulkan | 333.41 +/- 0.64 | 10.77 +/- 0.01 |
| ROCmFP4 STRIX_LEAN | ROCm | 362.72 +/- 1.89 | 13.83 +/- 0.01 |
| UD-Q5_K_XL | ROCm | 335.14 +/- 5.25 | 10.23 +/- 0.02 |

Decode speed:

  • Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
  • ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

| Model | Backend | Chunks | PPL |
|---|---|---|---|
| ROCmFP4 STRIX_LEAN | Vulkan | 32 | 6.6538 +/- 0.09455 |
| UD-Q5_K_XL | Vulkan | 32 | 6.6554 +/- 0.09661 |

This is a limited 32-chunk WikiText-2 pass, not a full benchmark, but the result is effectively tied within uncertainty.

I am not claiming a universal AMD speedup or full quality parity from this data. I also do not have a same-model Q4 baseline locally yet, so I am not claiming this is smaller or better than Q4_K_M. The narrower claim is: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass.

charlie12345 (Author) commented:

I added two more same-hardware comparisons to the local benchmark notes. These are secondary evidence; the Qwen3.6 27B result above is still the cleanest quality+speed datapoint.

All runs use the same Strix Halo class system and the same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL

| Metric | ROCmFP4 STRIX_LEAN | UD-Q5_K_XL |
|---|---|---|
| Size | 17.74 GiB | 25.29 GiB |
| Vulkan pp512 | 1092.13 +/- 11.48 tok/s | 1027.47 +/- 14.50 tok/s |
| Vulkan tg128 | 76.95 +/- 0.09 tok/s | 57.31 +/- 0.04 tok/s |
| ROCm pp512 | 1128.36 +/- 20.65 tok/s | 941.70 +/- 14.89 tok/s |
| ROCm tg128 | 67.53 +/- 0.16 tok/s | 48.09 +/- 0.08 tok/s |
| Limited WikiText-2 PPL, 32 chunks | 6.0609 +/- 0.08084 | 5.8748 +/- 0.07794 |

Supported claim for this model: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL

| Metric | ROCmFP4 STRIX_LEAN | UD-Q4_K_XL |
|---|---|---|
| Size | 12.65 GiB | 15.84 GiB |
| Vulkan pp512 | 1175.12 +/- 15.11 tok/s | 1228.07 +/- 15.89 tok/s |
| Vulkan tg128 | 62.68 +/- 0.03 tok/s | 52.73 +/- 0.13 tok/s |
| ROCm pp512 | 1305.38 +/- 28.52 tok/s | 1191.64 +/- 17.98 tok/s |
| ROCm tg128 | 61.51 +/- 0.14 tok/s | 45.63 +/- 0.06 tok/s |

Supported claim for this model: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.
