TP: round up granularity to 128 #24180

Merged
JohannesGaessler merged 2 commits into ggml-org:master from JohannesGaessler:tp-min-granularity
Jun 5, 2026

@JohannesGaessler
Contributor

On master, -sm tensor splits tensors at the minimum possible granularity. For performance, however, it seems preferable to round the granularity up to a larger power of 2; 128 seems to be a good value. This should only make a difference when

  1. the number of GPUs or the tensor dimensions are not a power of 2, and
  2. FP16/BF16/FP32 or a legacy quant is used.
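
To make the rounding concrete, here is a minimal sketch. It assumes the granularity is rounded up to the next multiple of 128; the helper name `round_up_granularity` and its default argument are hypothetical and not taken from this PR's diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): round a split granularity up to the
// next multiple of `min_granularity`. Values that are already a multiple of
// `min_granularity` are returned unchanged.
static int64_t round_up_granularity(int64_t granularity, int64_t min_granularity = 128) {
    assert(granularity > 0 && min_granularity > 0);
    return ((granularity + min_granularity - 1) / min_granularity) * min_granularity;
}
```

Under this assumption a granularity of 1 (as for FP16/BF16/FP32) or 32 (the Q4_0 block size) becomes 128, while a granularity of 256 is left as it is. That would be consistent with the change only mattering for types whose minimum granularity is below 128.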
Performance

| GPU | Model | Microbatch size | Test | t/s master | t/s PR | Speedup |
| --- | --- | ---: | --- | ---: | ---: | ---: |
| 3x NVIDIA A16 | llama 8B F16 | 1 | pp512 | 30.66 | 34.04 | 1.11 |
| 3x NVIDIA A16 | llama 8B F16 | 2 | pp512 | 45.91 | 70.78 | 1.54 |
| 3x NVIDIA A16 | llama 8B F16 | 4 | pp512 | 86.50 | 134.02 | 1.55 |
| 3x NVIDIA A16 | llama 8B F16 | 8 | pp512 | 154.52 | 244.43 | 1.58 |
| 3x NVIDIA A16 | llama 8B F16 | 16 | pp512 | 275.71 | 414.48 | 1.50 |
| 3x NVIDIA A16 | llama 8B F16 | 32 | pp512 | 480.39 | 689.08 | 1.43 |
| 3x NVIDIA A16 | llama 8B F16 | 64 | pp512 | 731.28 | 1120.62 | 1.53 |
| 3x NVIDIA A16 | llama 8B F16 | 128 | pp512 | 899.91 | 1262.54 | 1.40 |
| 3x NVIDIA A16 | llama 8B F16 | 256 | pp512 | 1126.47 | 1414.11 | 1.26 |
| 3x NVIDIA A16 | llama 8B F16 | 512 | pp512 | 1070.45 | 1507.99 | 1.41 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 1 | pp512 | 91.29 | 91.18 | 1.00 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 2 | pp512 | 156.44 | 156.00 | 1.00 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 4 | pp512 | 206.57 | 206.05 | 1.00 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 8 | pp512 | 223.84 | 223.31 | 1.00 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 16 | pp512 | 543.03 | 559.92 | 1.03 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 32 | pp512 | 865.31 | 885.06 | 1.02 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 64 | pp512 | 1217.23 | 1215.99 | 1.00 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 128 | pp512 | 1409.54 | 1426.35 | 1.01 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 256 | pp512 | 1478.06 | 1494.74 | 1.01 |
| 3x NVIDIA A16 | llama 8B Q4_0 | 512 | pp512 | 1585.59 | 1615.00 | 1.02 |
| 3x RTX 4090 | llama 8B F16 | 1 | pp512 | 137.75 | 140.87 | 1.02 |
| 3x RTX 4090 | llama 8B F16 | 2 | pp512 | 212.39 | 258.29 | 1.22 |
| 3x RTX 4090 | llama 8B F16 | 4 | pp512 | 379.53 | 470.82 | 1.24 |
| 3x RTX 4090 | llama 8B F16 | 8 | pp512 | 629.61 | 797.85 | 1.27 |
| 3x RTX 4090 | llama 8B F16 | 16 | pp512 | 1139.93 | 1261.59 | 1.11 |
| 3x RTX 4090 | llama 8B F16 | 32 | pp512 | 2206.53 | 2363.42 | 1.07 |
| 3x RTX 4090 | llama 8B F16 | 64 | pp512 | 3171.78 | 3622.67 | 1.14 |
| 3x RTX 4090 | llama 8B F16 | 128 | pp512 | 3778.82 | 4616.36 | 1.22 |
| 3x RTX 4090 | llama 8B F16 | 256 | pp512 | 5021.24 | 5963.07 | 1.19 |
| 3x RTX 4090 | llama 8B F16 | 512 | pp512 | 6574.54 | 6913.55 | 1.05 |
| 3x RTX 4090 | llama 8B Q4_0 | 1 | pp512 | 269.30 | 271.25 | 1.01 |
| 3x RTX 4090 | llama 8B Q4_0 | 2 | pp512 | 459.52 | 483.37 | 1.05 |
| 3x RTX 4090 | llama 8B Q4_0 | 4 | pp512 | 717.73 | 802.99 | 1.12 |
| 3x RTX 4090 | llama 8B Q4_0 | 8 | pp512 | 931.62 | 1104.94 | 1.19 |
| 3x RTX 4090 | llama 8B Q4_0 | 16 | pp512 | 1571.74 | 1674.04 | 1.07 |
| 3x RTX 4090 | llama 8B Q4_0 | 32 | pp512 | 2760.85 | 2882.11 | 1.04 |
| 3x RTX 4090 | llama 8B Q4_0 | 64 | pp512 | 3679.10 | 4101.06 | 1.11 |
| 3x RTX 4090 | llama 8B Q4_0 | 128 | pp512 | 4203.68 | 5111.99 | 1.22 |
| 3x RTX 4090 | llama 8B Q4_0 | 256 | pp512 | 5317.01 | 6235.11 | 1.17 |
| 3x RTX 4090 | llama 8B Q4_0 | 512 | pp512 | 7008.85 | 8615.56 | 1.23 |

Requirements

JohannesGaessler requested a review from CISC as a code owner on June 5, 2026, 12:05
@CISC
Member

CISC commented Jun 5, 2026

@JohannesGaessler
Contributor Author

I removed the assert. I had only added it to make sure that the implementation on master does what I want, but logically speaking the granularity does not have to divide the tensor dimension cleanly (as long as an uneven allocation is OK).
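
To illustrate what an uneven allocation can look like, here is a rough sketch of splitting one tensor dimension across devices in whole units of the granularity. This is not the code from this PR: `split_dim` is a hypothetical helper, and letting the last non-empty device absorb the partial unit is an assumption made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): split `ne` elements across `n_dev`
// devices in units of `granularity`. If `granularity` does not divide `ne`,
// the last device that receives any data gets the partial unit.
static std::vector<int64_t> split_dim(int64_t ne, int n_dev, int64_t granularity) {
    assert(ne >= 0 && n_dev > 0 && granularity > 0);
    // Number of granularity-sized units, counting a trailing partial unit.
    const int64_t n_units = (ne + granularity - 1) / granularity;
    std::vector<int64_t> sizes(n_dev);
    int64_t remaining = ne;
    for (int i = 0; i < n_dev; ++i) {
        // Distribute units as evenly as possible; the first devices get the extras.
        const int64_t units = n_units / n_dev + (i < n_units % n_dev ? 1 : 0);
        // Clamp so the sizes never add up to more than `ne`.
        sizes[i] = std::min(units * granularity, remaining);
        remaining -= sizes[i];
    }
    return sizes;
}
```

With this sketch, `split_dim(4096, 3, 1)` gives 1366/1365/1365, while `split_dim(4096, 3, 128)` gives 1408/1408/1280, so every slice is a multiple of 128. For a dimension that 128 does not divide, such as `split_dim(1000, 3, 128)`, the result is 384/384/232: still a valid split, only uneven.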

@JohannesGaessler JohannesGaessler merged commit 6effcec into ggml-org:master Jun 5, 2026
25 checks passed
