Add ROCmFP4 CPU quantization support by charlie12345 · Pull Request #24185 · ggml-org/llama.cpp

charlie12345 · 2026-06-05T13:42:53Z

Overview

This PR adds CPU/reference support for two experimental ROCmFP4 GGUF tensor types:

Q4_0_ROCMFP4: dual-scale layout, 18 bytes per 32-value block, 4.50 bpw.
Q4_0_ROCMFP4_FAST: single-scale layout, 17 bytes per 32-value block, 4.25 bpw.
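
The bpw figures above follow directly from the block sizes. A minimal sketch of that arithmetic, assuming each block packs its 32 values as 4-bit codes and spends the remaining bytes on scales (the split is an inference from the "dual-scale" / "single-scale" wording, and the helper names are hypothetical, not taken from the PR code):

```python
# Bits per weight (bpw) implied by a fixed-size quantization block.
# Assumption: 32 values per block stored as 4-bit codes (16 bytes total);
# whatever is left in the block is treated as scale bytes.

VALUES_PER_BLOCK = 32
CODE_BITS = 4


def bits_per_weight(block_bytes: int, values_per_block: int = VALUES_PER_BLOCK) -> float:
    """Average storage cost per value, in bits."""
    return block_bytes * 8 / values_per_block


def scale_bytes(block_bytes: int, values_per_block: int = VALUES_PER_BLOCK) -> int:
    """Bytes left over after the packed 4-bit codes (hypothetical split)."""
    return block_bytes - values_per_block * CODE_BITS // 8


for name, size in [("Q4_0_ROCMFP4", 18), ("Q4_0_ROCMFP4_FAST", 17)]:
    print(f"{name}: {bits_per_weight(size):.2f} bpw, {scale_bytes(size)} scale byte(s)")
```

Under these assumptions the extra 0.25 bpw of the dual-scale layout is exactly one additional scale byte per 32 values.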

The PR is intentionally limited to the CPU/reference portion:

GGUF tensor type registration.
CPU reference quantization/dequantization.
Row validation for ROCmFP4 scale bytes.
llama-quantize integration.
imatrix support through the existing quantization path.
Concise format documentation.

Existing quantization modes and existing MXFP4/NVFP4 behavior are unchanged. No non-CPU backend files are modified in this PR. ROCm/HIP and Vulkan acceleration are intended as follow-up PRs only if the tensor formats and reference behavior are considered acceptable.

User-facing usage:

./llama-quantize model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4
./llama-quantize model-f16.gguf model-rocmfp4-fast.gguf Q4_0_ROCMFP4_FAST

With an importance matrix:

./llama-quantize --imatrix imatrix.dat model-f16.gguf model-rocmfp4.gguf Q4_0_ROCMFP4

Additional information

This is a smaller split from #24184 in response to the automated PR checker. The accelerated prototype branch is separate; this PR only adds the portable/reference tensor format and quantizer behavior.

Prototype benchmark rationale, measured on a Framework AMD Strix Halo class system with 128 GB unified memory:

Models compared:

ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB.
Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB.

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.
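
For anyone rechecking the size figures, the arithmetic can be sketched as below. The byte counts are copied from the two model lines above; the variable names are illustrative only.

```python
# Recompute the size comparison from the reported file sizes.
GIB = 1024 ** 3  # bytes per GiB

rocmfp4_bytes = 14_817_252_512   # ROCmFP4 STRIX_LEAN file
baseline_bytes = 20_350_682_240  # UD-Q5_K_XL file

saved_bytes = baseline_bytes - rocmfp4_bytes
saved_pct = 100 * saved_bytes / baseline_bytes

print(f"ROCmFP4:  {rocmfp4_bytes / GIB:.2f} GiB")
print(f"Baseline: {baseline_bytes / GIB:.2f} GiB")
print(f"Saved:    {saved_bytes:,} bytes ({saved_pct:.2f}%)")
```

The percentage is taken relative to the baseline file, which is why it reads as "smaller than Q5" rather than "Q5 is larger by".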

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Model	Backend	pp512 tok/s	tg128 tok/s
ROCmFP4 STRIX_LEAN	Vulkan	255.19 +/- 2.88	14.16 +/- 0.09
UD-Q5_K_XL	Vulkan	333.41 +/- 0.64	10.77 +/- 0.01
ROCmFP4 STRIX_LEAN	ROCm	362.72 +/- 1.89	13.83 +/- 0.01
UD-Q5_K_XL	ROCm	335.14 +/- 5.25	10.23 +/- 0.02

Decode speed from this prototype backend branch:

Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.
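
The percentages above are relative changes of the mean throughput. A small sketch that applies the same calculation to every cell of the table, including the prompt-processing column, is shown here; it uses only the reported means and ignores the +/- terms, and `pct_change` is a hypothetical helper name.

```python
# Signed relative change of ROCmFP4 throughput against the Q5_K_XL baseline.
# Values are the mean tok/s figures from the Qwen3.6 27B table above.

def pct_change(candidate: float, baseline: float) -> float:
    """Percent change of candidate relative to baseline (positive = faster)."""
    return 100 * (candidate / baseline - 1)


# (backend, test) -> (ROCmFP4 tok/s, UD-Q5_K_XL tok/s)
results = {
    ("Vulkan", "pp512"): (255.19, 333.41),
    ("Vulkan", "tg128"): (14.16, 10.77),
    ("ROCm", "pp512"): (362.72, 335.14),
    ("ROCm", "tg128"): (13.83, 10.23),
}

for (backend, test), (candidate, baseline) in results.items():
    print(f"{backend} {test}: {pct_change(candidate, baseline):+.1f}%")
```

Run this way, the decode rows reproduce the figures above, while the Vulkan pp512 row comes out negative: on this run, Vulkan prompt processing was slower with ROCmFP4 than with the Q5 baseline.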

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

Model	Backend	Chunks	PPL
ROCmFP4 STRIX_LEAN	Vulkan	32	6.6538 +/- 0.09455
UD-Q5_K_XL	Vulkan	32	6.6554 +/- 0.09661

This is a limited 32-chunk WikiText-2 pass, not a full quality benchmark, but the result is effectively tied within uncertainty.
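
One rough way to read "tied within uncertainty" is to express the PPL gap in units of the combined error. The sketch below assumes the two +/- values can be treated as independent standard errors, which is a simplification: both runs score the same text, so the errors are likely correlated. The function name is hypothetical.

```python
import math


def ppl_gap_in_sigma(ppl_a: float, err_a: float, ppl_b: float, err_b: float) -> float:
    """Absolute PPL gap divided by the combined uncertainty.

    Assumption: err_a and err_b are independent standard errors, so they
    combine in quadrature. This is a rough check, not a formal test.
    """
    return abs(ppl_a - ppl_b) / math.hypot(err_a, err_b)


# ROCmFP4 STRIX_LEAN vs UD-Q5_K_XL, Vulkan, 32 chunks (values from the table above)
gap = ppl_gap_in_sigma(6.6538, 0.09455, 6.6554, 0.09661)
print(f"PPL gap: {gap:.3f} combined sigma")
```

A gap far below one combined sigma is consistent with the "tied" reading; it does not by itself show the two quantizations are equivalent in quality.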

Narrow claim supported by this data: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass. I am not claiming universal AMD speedup or full quality parity from this limited data, and I do not have a same-model Q4 baseline locally yet.

Additional prototype datapoints, same hardware and same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL:

Metric	ROCmFP4 STRIX_LEAN	UD-Q5_K_XL
Size	17.74 GiB	25.29 GiB
Vulkan pp512	1092.13 +/- 11.48 tok/s	1027.47 +/- 14.50 tok/s
Vulkan tg128	76.95 +/- 0.09 tok/s	57.31 +/- 0.04 tok/s
ROCm pp512	1128.36 +/- 20.65 tok/s	941.70 +/- 14.89 tok/s
ROCm tg128	67.53 +/- 0.16 tok/s	48.09 +/- 0.08 tok/s
Limited WikiText-2 PPL, 32 chunks	6.0609 +/- 0.08084	5.8748 +/- 0.07794

Supported claim: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL:

Metric	ROCmFP4 STRIX_LEAN	UD-Q4_K_XL
Size	12.65 GiB	15.84 GiB
Vulkan pp512	1175.12 +/- 15.11 tok/s	1228.07 +/- 15.89 tok/s
Vulkan tg128	62.68 +/- 0.03 tok/s	52.73 +/- 0.13 tok/s
ROCm pp512	1305.38 +/- 28.52 tok/s	1191.64 +/- 17.98 tok/s
ROCm tg128	61.51 +/- 0.14 tok/s	45.63 +/- 0.06 tok/s

Supported claim: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.

Local validation for this CPU/reference PR:

cmake configure, CPU-only: passed
llama-quantize build: passed
test-quantize-fns build: passed
test-quantize-perf build: passed
test-quantize-fns: passed, including q4_0_rocmfp4 and q4_0_rocmfp4_fast
git diff --check: passed
added-line secret/path scan: clean

New IDs use the next contiguous upstream values:

GGML_TYPE_Q4_0_ROCMFP4 = 42
GGML_TYPE_Q4_0_ROCMFP4_FAST = 43
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4 = 41
LLAMA_FTYPE_MOSTLY_Q4_0_ROCMFP4_FAST = 42

Follow-up plan if the CPU/reference format is considered acceptable:

ROCm/HIP backend support.
Vulkan backend support.
Backend tests and benchmark notes for AMD Strix Halo class systems.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES - AI assistance was used to help split the change into a CPU/reference-only PR, run local validation/benchmark commands, and draft the benchmark summary. The submitter reviewed the resulting code and PR text.

0cc4m · 2026-06-05T14:01:36Z

You're still not following the PR description template.

In general, the threshold for adding quantization types is very high, because each new type creates a lot of work in all other backends to make it work and run well, and it can't be removed later without breaking support for existing models.

You didn't post any performance/quality benchmarks that would even allow comparing what you came up with against existing types. IMO step one of proposing a quantization scheme would be an issue or discussion thread where you clearly explain why it is a good addition, with benchmarks comparing it to existing quants, and present this kind of prototype code to validate your claims.

charlie12345 · 2026-06-05T15:35:04Z

Thanks, that is fair. I ran a fresh limited comparison so the rationale is concrete.

Current PR scope: CPU/reference format only. The speed numbers below come from the prototype backend branch and are intended to justify why the tensor format may be worth reviewing before backend PRs.

Hardware: Framework AMD Strix Halo class system, 128 GB unified memory

Models:

ROCmFP4: Qwen3.6-27B-MTP-BF16-to-ROCmFP4-STRIX_LEAN.gguf, 14,817,252,512 bytes / 13.80 GiB
Baseline: Qwen3.6-27B-UD-Q5_K_XL.gguf, 20,350,682,240 bytes / 18.95 GiB

ROCmFP4 is 5,533,429,728 bytes smaller, about 27.19% smaller than this Q5 baseline.

Speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Model	Backend	pp512 tok/s	tg128 tok/s
ROCmFP4 STRIX_LEAN	Vulkan	255.19 +/- 2.88	14.16 +/- 0.09
UD-Q5_K_XL	Vulkan	333.41 +/- 0.64	10.77 +/- 0.01
ROCmFP4 STRIX_LEAN	ROCm	362.72 +/- 1.89	13.83 +/- 0.01
UD-Q5_K_XL	ROCm	335.14 +/- 5.25	10.23 +/- 0.02

Decode speed:

Vulkan: 14.16 vs 10.77 tok/s, about 31.5% faster.
ROCm: 13.83 vs 10.23 tok/s, about 35.2% faster.

Limited quality check:

llama-perplexity -f wiki.test.raw -c 2048 -b 512 -ub 512 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 --chunks 32

Model	Backend	Chunks	PPL
ROCmFP4 STRIX_LEAN	Vulkan	32	6.6538 +/- 0.09455
UD-Q5_K_XL	Vulkan	32	6.6554 +/- 0.09661

This is a limited 32-chunk WikiText-2 pass, not a full benchmark, but the result is effectively tied within uncertainty.

I am not claiming universal AMD speedup or full quality parity from this data. I also do not have a same-model Q4 baseline locally yet, so I am not claiming this is smaller/better than Q4_K_M. The narrower claim is: on this Strix Halo test, the ROCmFP4 prototype is about 27% smaller than the tested Q5_K_XL baseline, 31-35% faster at decode, and tied with it on the limited PPL pass.

charlie12345 · 2026-06-05T16:02:23Z

I added two more same-hardware comparisons to the local benchmark notes. These are secondary evidence; the Qwen3.6 27B result above is still the cleanest quality+speed datapoint.

All runs use the same Strix Halo class system and the same speed command shape:

llama-bench -p 512 -n 128 -ngl 99 -fa on -ctk q4_0 -ctv q4_0 -r 3

Qwen3.6 35B A3B, ROCmFP4 vs UD-Q5_K_XL

Metric	ROCmFP4 STRIX_LEAN	UD-Q5_K_XL
Size	17.74 GiB	25.29 GiB
Vulkan pp512	1092.13 +/- 11.48 tok/s	1027.47 +/- 14.50 tok/s
Vulkan tg128	76.95 +/- 0.09 tok/s	57.31 +/- 0.04 tok/s
ROCm pp512	1128.36 +/- 20.65 tok/s	941.70 +/- 14.89 tok/s
ROCm tg128	67.53 +/- 0.16 tok/s	48.09 +/- 0.08 tok/s
Limited WikiText-2 PPL, 32 chunks	6.0609 +/- 0.08084	5.8748 +/- 0.07794

Supported claim for this model: ROCmFP4 is about 29.87% smaller and about 34.3% faster on Vulkan decode / 40.4% faster on ROCm decode. Q5 has better limited PPL here, so this is a size/speed tradeoff rather than a quality-parity claim.

Gemma 4 26B A4B, ROCmFP4 vs UD-Q4_K_XL

Metric	ROCmFP4 STRIX_LEAN	UD-Q4_K_XL
Size	12.65 GiB	15.84 GiB
Vulkan pp512	1175.12 +/- 15.11 tok/s	1228.07 +/- 15.89 tok/s
Vulkan tg128	62.68 +/- 0.03 tok/s	52.73 +/- 0.13 tok/s
ROCm pp512	1305.38 +/- 28.52 tok/s	1191.64 +/- 17.98 tok/s
ROCm tg128	61.51 +/- 0.14 tok/s	45.63 +/- 0.06 tok/s

Supported claim for this model: ROCmFP4 is about 20.14% smaller than the tested Q4_K_XL baseline, about 18.9% faster on Vulkan decode, and about 34.8% faster on ROCm decode. Vulkan prompt processing was about 4.3% slower; ROCm prompt processing was about 9.5% faster.

I also attempted the same limited WikiText-2 PPL check for Gemma 4 26B A4B, but both instruct/tool quants produced extremely high PPL values, so I do not think that run is meaningful quality evidence for this model setup and I am not using it as a quality claim.

Add ROCmFP4 CPU quantization support

19eb89d

charlie12345 requested a review from ggerganov as a code owner June 5, 2026 13:42

github-actions bot added the examples and ggml labels Jun 5, 2026

charlie12345 mentioned this pull request Jun 5, 2026

Add experimental ROCmFP4 quantization support #24184

Closed

charlie12345 wants to merge 1 commit into ggml-org:master from charlie12345:rocmfp4-cpu-only
