opencl: improve get_rows, cpy, concat and q6_k flat gemv by lhez · Pull Request #24160 · ggml-org/llama.cpp

lhez · 2026-06-05T07:04:03Z

Overview

The current implementations of get_rows, cpy and concat perform poorly with Qwen3.5. All three assign one workgroup per row, so when there is only one large row or many very small rows, the GPU is underutilized. This PR improves that.
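To make the underutilization concrete, here is a rough model of how many workgroups get launched under a one-workgroup-per-row mapping versus a mapping that splits large rows and packs small ones. This is only an illustrative sketch: the function names, the workgroup size of 64 and the elements-per-workgroup values are assumptions, not taken from the kernels in this PR.

```python
def workgroups_one_per_row(n_rows):
    """One workgroup per row: the launch size depends only on the row count."""
    return n_rows


def active_fraction_one_per_row(row_len, wg_size):
    """Fraction of a workgroup's threads that have an element to process
    when a single row is handed to a single workgroup."""
    return min(row_len, wg_size) / wg_size


def workgroups_flat(n_rows, row_len, elems_per_wg):
    """Hypothetical flat mapping: treat the tensor as one stream of elements,
    so large rows are split and small rows are packed together."""
    total = n_rows * row_len
    return (total + elems_per_wg - 1) // elems_per_wg  # integer ceiling


# Case 1: a single large row. One workgroup does all the work.
single_row = workgroups_one_per_row(1)
single_row_flat = workgroups_flat(1, 1 << 20, 1024)

# Case 2: many tiny rows. Many workgroups, each mostly idle.
tiny_rows = workgroups_one_per_row(4096)
tiny_rows_active = active_fraction_one_per_row(4, 64)
tiny_rows_flat = workgroups_flat(4096, 4, 64)
```

In this model the single large row goes from one workgroup to many, while the tiny-row case launches fewer workgroups that are each fully occupied.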

This PR also tweaks how threads are mapped to data to improve coalescing in the Q6_K flat gemv kernel. This helps models with Q6_K output weights.
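As a generic illustration of why thread-to-data mapping matters for coalescing, the sketch below counts how many memory segments a group of threads touches in one step under two mappings. It does not model the actual Q6_K block layout or the remapping done in this PR; the 64-byte segment size, the 16 bytes per thread and all names are assumptions chosen for the example.

```python
def segments_touched(addresses, segment_bytes=64):
    """Number of distinct aligned memory segments covered by the byte addresses
    that a group of threads reads in the same step."""
    return len({addr // segment_bytes for addr in addresses})


def blocked_addresses(n_threads, bytes_per_thread, step):
    """Each thread owns a contiguous block, so in any given step
    neighbouring threads are bytes_per_thread apart."""
    return [t * bytes_per_thread + step for t in range(n_threads)]


def interleaved_addresses(n_threads, step):
    """Neighbouring threads read neighbouring bytes in every step."""
    return [step * n_threads + t for t in range(n_threads)]


n_threads = 64
blocked = segments_touched(blocked_addresses(n_threads, 16, step=0))
interleaved = segments_touched(interleaved_addresses(n_threads, step=0))
```

With these assumed sizes the blocked mapping spreads one step's reads over many segments, while the interleaved mapping keeps them in a single segment, which is the kind of access pattern a coalescing-friendly remap aims for.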

Additional information

Details

X2-90 (before, after)

Qwen3.5 9B

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen35 9B Q4_K - Medium	5.74 GiB	9.20 B	OpenCL	99	0	0	pp512	200.75 ± 0.59
qwen35 9B Q4_K - Medium	5.74 GiB	9.20 B	OpenCL	99	0	0	tg128	10.36 ± 0.08

build: 94a220c (9496)

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen35 9B Q4_K - Medium	5.74 GiB	9.20 B	OpenCL	99	0	0	pp512	200.11 ± 2.65
qwen35 9B Q4_K - Medium	5.74 GiB	9.20 B	OpenCL	99	0	0	tg128	14.44 ± 0.08

build: 0fb3d35 (9500)

Qwen3.5 4B

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen35 4B Q4_K - Medium	2.80 GiB	4.33 B	OpenCL	99	0	0	pp512	349.43 ± 1.50
qwen35 4B Q4_K - Medium	2.80 GiB	4.33 B	OpenCL	99	0	0	tg128	17.25 ± 0.09

build: 94a220c (9496)

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen35 4B Q4_K - Medium	2.80 GiB	4.33 B	OpenCL	99	0	0	pp512	349.07 ± 7.49
qwen35 4B Q4_K - Medium	2.80 GiB	4.33 B	OpenCL	99	0	0	tg128	23.35 ± 0.26

build: 0fb3d35 (9500)

Qwen3 8B

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3 8B Q4_K - Medium	4.68 GiB	8.19 B	OpenCL	99	0	0	pp512	202.66 ± 3.06
qwen3 8B Q4_K - Medium	4.68 GiB	8.19 B	OpenCL	99	0	0	tg128	13.20 ± 0.23

build: 94a220c (9496)

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3 8B Q4_K - Medium	4.68 GiB	8.19 B	OpenCL	99	0	0	pp512	216.79 ± 0.39
qwen3 8B Q4_K - Medium	4.68 GiB	8.19 B	OpenCL	99	0	0	tg128	15.99 ± 0.04

build: 0fb3d35 (9500)

Qwen3 4B

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3 4B Q4_K - Medium	2.32 GiB	4.02 B	OpenCL	99	0	0	pp512	372.21 ± 6.42
qwen3 4B Q4_K - Medium	2.32 GiB	4.02 B	OpenCL	99	0	0	tg128	22.12 ± 0.09

build: 94a220c (9496)

model	size	params	backend	ngl	fa	mmap	test	t/s
qwen3 4B Q4_K - Medium	2.32 GiB	4.02 B	OpenCL	99	0	0	pp512	376.55 ± 2.43
qwen3 4B Q4_K - Medium	2.32 GiB	4.02 B	OpenCL	99	0	0	tg128	26.03 ± 0.09

build: 0fb3d35 (9500)

llama3.2 3B

model	size	params	backend	ngl	fa	mmap	test	t/s
llama 3B Q4_K - Medium	1.87 GiB	3.21 B	OpenCL	99	0	0	pp512	491.38 ± 18.21
llama 3B Q4_K - Medium	1.87 GiB	3.21 B	OpenCL	99	0	0	tg128	25.52 ± 0.61

build: 94a220c (9496)

model	size	params	backend	ngl	fa	mmap	test	t/s
llama 3B Q4_K - Medium	1.87 GiB	3.21 B	OpenCL	99	0	0	pp512	504.74 ± 3.63
llama 3B Q4_K - Medium	1.87 GiB	3.21 B	OpenCL	99	0	0	tg128	33.12 ± 0.13

build: 0fb3d35 (9500)
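For a quick read of the tables above, the tg128 speedup per model can be computed from the posted means. The numbers are copied from the before (94a220c) and after (0fb3d35) runs; the dictionary and variable names are just for this sketch, and the ± spread is ignored.

```python
# tg128 throughput in t/s: (before at 94a220c, after at 0fb3d35)
tg128 = {
    "Qwen3.5 9B": (10.36, 14.44),
    "Qwen3.5 4B": (17.25, 23.35),
    "Qwen3 8B": (13.20, 15.99),
    "Qwen3 4B": (22.12, 26.03),
    "llama3.2 3B": (25.52, 33.12),
}

# Speedup is simply after / before for each model.
speedups = {model: after / before for model, (before, after) in tg128.items()}

for model, ratio in speedups.items():
    print(f"{model}: {ratio:.2f}x")
```

On these numbers the tg128 gain ranges from about 1.18x (Qwen3 4B) to about 1.39x (Qwen3.5 9B).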

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, used Claude to profile Qwen3.5 models to identify the issues.

max-krasnyansky

Nice bump in tg!

lhez added 4 commits June 3, 2026 22:26

opencl: allow multiple workgroups for large rows

bfedc32

opencl: improve small cpy

6da0c8e

opencl: packed concat for small input

3774105

opencl: tweak flat q6_K gemv, increase N_DST and remap threads

0fb3d35

github-actions bot added the ggml and OpenCL labels Jun 5, 2026

lhez marked this pull request as ready for review June 5, 2026 14:45

lhez requested a review from a team as a code owner June 5, 2026 14:45

max-krasnyansky approved these changes Jun 5, 2026

lhez wants to merge 4 commits into ggml-org:master from qualcomm:lh/get-rows-cpy-concat-q6_k-flat-gemv
