TP: round up granularity to 128 by JohannesGaessler · Pull Request #24180 · ggml-org/llama.cpp

JohannesGaessler · 2026-06-05T12:05:41Z

On master, with -sm tensor, the tensors are split at the minimum possible granularity. For performance, however, it seems preferable to round the granularity up to a larger power of 2; 128 seems to be a good value. This should only make a difference when the number of GPUs or the tensor dimensions are not a power of 2, and when FP16/BF16/FP32 or a legacy quant is used.
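
To illustrate the idea, here is a minimal Python sketch of how a larger granularity changes the per-GPU split. It is not the code from this PR: the function names `effective_granularity` and `split_even` are hypothetical, the rounding is modeled as a simple `max` (which assumes both values are powers of 2), and for simplicity the split only accepts dimensions that are a multiple of the granularity. The minimum granularities used in the example (1 for F16, 32 for Q4_0, 256 for k-quants) are the respective block sizes.

```python
def effective_granularity(min_granularity: int, target: int = 128) -> int:
    """Round the minimum granularity up to at least `target`.

    Assumption: both values are powers of 2, so the larger one is always
    a multiple of the smaller one and max() is a valid round-up.
    """
    return max(min_granularity, target)


def split_even(n: int, n_devices: int, granularity: int) -> list[int]:
    """Split n elements across devices in whole chunks of `granularity`.

    Hypothetical helper: earlier devices receive the extra chunks.
    """
    if n % granularity != 0:
        raise ValueError("this sketch only handles n divisible by granularity")
    n_chunks = n // granularity
    base, extra = divmod(n_chunks, n_devices)
    return [(base + (1 if i < extra else 0)) * granularity for i in range(n_devices)]


# F16 has a minimum granularity of 1, Q4_0 (legacy quant) of 32, k-quants of 256.
print(effective_granularity(1))    # 128
print(effective_granularity(32))   # 128
print(effective_granularity(256))  # 256, already at least 128

# 3 GPUs: the split changes when the granularity is raised.
print(split_even(4096, 3, 1))      # [1366, 1365, 1365]
print(split_even(4096, 3, 128))    # [1408, 1408, 1280]

# 2 GPUs with a power-of-2 dimension: no difference.
print(split_even(4096, 2, 1))      # [2048, 2048]
print(split_even(4096, 2, 128))    # [2048, 2048]
```

With 3 GPUs the coarser granularity gives every device a slice size that is a multiple of 128, at the cost of a slightly less balanced split. With 2 GPUs and a power-of-2 dimension both granularities produce the same split, which matches the statement that the change should only matter for non-power-of-2 setups.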

Performance

GPU	Model	Microbatch size	Test	t/s master	t/s PR	Speedup
3x NVIDIA A16	llama 8B F16	1	pp512	30.66	34.04	1.11
3x NVIDIA A16	llama 8B F16	2	pp512	45.91	70.78	1.54
3x NVIDIA A16	llama 8B F16	4	pp512	86.50	134.02	1.55
3x NVIDIA A16	llama 8B F16	8	pp512	154.52	244.43	1.58
3x NVIDIA A16	llama 8B F16	16	pp512	275.71	414.48	1.50
3x NVIDIA A16	llama 8B F16	32	pp512	480.39	689.08	1.43
3x NVIDIA A16	llama 8B F16	64	pp512	731.28	1120.62	1.53
3x NVIDIA A16	llama 8B F16	128	pp512	899.91	1262.54	1.40
3x NVIDIA A16	llama 8B F16	256	pp512	1126.47	1414.11	1.26
3x NVIDIA A16	llama 8B F16	512	pp512	1070.45	1507.99	1.41
3x NVIDIA A16	llama 8B Q4_0	1	pp512	91.29	91.18	1.00
3x NVIDIA A16	llama 8B Q4_0	2	pp512	156.44	156.00	1.00
3x NVIDIA A16	llama 8B Q4_0	4	pp512	206.57	206.05	1.00
3x NVIDIA A16	llama 8B Q4_0	8	pp512	223.84	223.31	1.00
3x NVIDIA A16	llama 8B Q4_0	16	pp512	543.03	559.92	1.03
3x NVIDIA A16	llama 8B Q4_0	32	pp512	865.31	885.06	1.02
3x NVIDIA A16	llama 8B Q4_0	64	pp512	1217.23	1215.99	1.00
3x NVIDIA A16	llama 8B Q4_0	128	pp512	1409.54	1426.35	1.01
3x NVIDIA A16	llama 8B Q4_0	256	pp512	1478.06	1494.74	1.01
3x NVIDIA A16	llama 8B Q4_0	512	pp512	1585.59	1615.00	1.02
3x RTX 4090	llama 8B F16	1	pp512	137.75	140.87	1.02
3x RTX 4090	llama 8B F16	2	pp512	212.39	258.29	1.22
3x RTX 4090	llama 8B F16	4	pp512	379.53	470.82	1.24
3x RTX 4090	llama 8B F16	8	pp512	629.61	797.85	1.27
3x RTX 4090	llama 8B F16	16	pp512	1139.93	1261.59	1.11
3x RTX 4090	llama 8B F16	32	pp512	2206.53	2363.42	1.07
3x RTX 4090	llama 8B F16	64	pp512	3171.78	3622.67	1.14
3x RTX 4090	llama 8B F16	128	pp512	3778.82	4616.36	1.22
3x RTX 4090	llama 8B F16	256	pp512	5021.24	5963.07	1.19
3x RTX 4090	llama 8B F16	512	pp512	6574.54	6913.55	1.05
3x RTX 4090	llama 8B Q4_0	1	pp512	269.30	271.25	1.01
3x RTX 4090	llama 8B Q4_0	2	pp512	459.52	483.37	1.05
3x RTX 4090	llama 8B Q4_0	4	pp512	717.73	802.99	1.12
3x RTX 4090	llama 8B Q4_0	8	pp512	931.62	1104.94	1.19
3x RTX 4090	llama 8B Q4_0	16	pp512	1571.74	1674.04	1.07
3x RTX 4090	llama 8B Q4_0	32	pp512	2760.85	2882.11	1.04
3x RTX 4090	llama 8B Q4_0	64	pp512	3679.10	4101.06	1.11
3x RTX 4090	llama 8B Q4_0	128	pp512	4203.68	5111.99	1.22
3x RTX 4090	llama 8B Q4_0	256	pp512	5317.01	6235.11	1.17
3x RTX 4090	llama 8B Q4_0	512	pp512	7008.85	8615.56	1.23

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: No

CISC · 2026-06-05T12:23:53Z

https://github.com/ggml-org/llama.cpp/actions/runs/27013843018/job/79723743341#step:3:2663

JohannesGaessler · 2026-06-05T13:13:20Z

I removed the assert. I had only added it to make sure that the implementation on master does what I want, but logically speaking the granularity does not have to cleanly divide the tensor dimension (as long as an uneven allocation is OK).
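
As a rough illustration of why the assert is not strictly needed, the following Python sketch splits a dimension that is not a multiple of the granularity. The function name `split_uneven` is hypothetical, and giving the leftover elements to the last device is an assumption made for this example, not necessarily what the implementation does.

```python
def split_uneven(n: int, n_devices: int, granularity: int) -> list[int]:
    """Split n elements across devices in chunks of `granularity`.

    n does not have to be a multiple of granularity. Assumption for this
    sketch: whole chunks are distributed first, then the leftover elements
    (fewer than one chunk) are added to the last device.
    """
    n_chunks, leftover = divmod(n, granularity)
    base, extra = divmod(n_chunks, n_devices)
    sizes = [(base + (1 if i < extra else 0)) * granularity for i in range(n_devices)]
    sizes[-1] += leftover
    return sizes


# 1000 is not a multiple of 128: 7 whole chunks plus 104 leftover elements.
sizes = split_uneven(1000, 3, 128)
print(sizes)       # [384, 256, 360]
print(sum(sizes))  # 1000
```

The allocation is uneven, but every element is still assigned to exactly one device, so the split remains valid without requiring divisibility.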

TP: round up granularity to 128

9509c1e

JohannesGaessler requested a review from CISC as a code owner June 5, 2026 12:05

CISC approved these changes Jun 5, 2026

ggerganov approved these changes Jun 5, 2026

remove assert

092a78b

CISC approved these changes Jun 5, 2026

JohannesGaessler merged commit 6effcec into ggml-org:master Jun 5, 2026
25 checks passed

JohannesGaessler merged 2 commits into ggml-org:master from JohannesGaessler:tp-min-granularity
