opencl: improve get_rows, cpy, concat and q6_k flat gemv #24160

Open: lhez wants to merge 4 commits into ggml-org:master from qualcomm:lh/get-rows-cpy-concat-q6_k-flat-gemv
@lhez lhez commented Jun 5, 2026

Overview

The current implementations of get_rows, cpy and concat perform poorly with Qwen3.5. All three assign one workgroup per row, so the GPU is underutilized when there is only a single large row or when there are many very small rows. This PR improves that.
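To make the underutilization concrete, here is a small host-side model of the two extreme shapes mentioned above. It is only an illustrative sketch, not the kernels in this PR: the workgroup size of 64, the "flattened" alternative mapping, and the function names are all assumptions chosen for the example.

```python
import math

WG_SIZE = 64  # hypothetical workgroup size, chosen only for illustration


def row_per_workgroup(n_rows: int, row_len: int) -> tuple[int, float]:
    """One workgroup per row: (workgroups launched, fraction of lanes with work)."""
    lanes_busy = min(row_len, WG_SIZE) / WG_SIZE
    return n_rows, lanes_busy


def flattened(n_rows: int, row_len: int) -> tuple[int, float]:
    """Assumed alternative: split the flat element range evenly across workgroups."""
    total = n_rows * row_len
    n_wg = math.ceil(total / WG_SIZE)
    lanes_busy = total / (n_wg * WG_SIZE)
    return n_wg, lanes_busy


# Same element count in both shapes: one large row vs. many very small rows.
for shape in [(1, 65536), (16384, 4)]:
    print(shape, "row-per-workgroup:", row_per_workgroup(*shape),
          "flattened:", flattened(*shape))
```

In this model the single large row launches just one workgroup, so only one compute unit has work, while the many small rows launch 16384 workgroups in which only 4 of 64 lanes are active. A mapping based on the flat element range gives 1024 fully occupied workgroups in both cases.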

This PR also changes how threads are mapped to data in the Q6_K flat gemv kernel to improve memory coalescing. This helps models with Q6_K output weights.
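As a generic illustration of why the thread-to-data mapping matters for coalescing, the sketch below counts how many distinct memory lines a group of threads touches at each step under two mappings. It does not model the Q6_K block layout or the actual kernel in this PR; the thread count, step count, line size, and both index functions are assumptions made for the example.

```python
N_THREADS = 16   # hypothetical number of threads reading together
N_STEPS = 16     # hypothetical number of elements each thread reads
LINE_ELEMS = 16  # hypothetical number of elements per memory line


def blocked(t: int, s: int) -> int:
    """Each thread walks its own contiguous chunk (neighbours are far apart)."""
    return t * N_STEPS + s


def interleaved(t: int, s: int) -> int:
    """Neighbouring threads read neighbouring elements at every step."""
    return s * N_THREADS + t


def lines_touched_per_step(index_fn) -> list[int]:
    """For each step, count the distinct memory lines the threads touch together."""
    return [
        len({index_fn(t, s) // LINE_ELEMS for t in range(N_THREADS)})
        for s in range(N_STEPS)
    ]


print("blocked:    ", lines_touched_per_step(blocked))
print("interleaved:", lines_touched_per_step(interleaved))
```

With these assumed sizes, the blocked mapping touches 16 different lines at every step, while the interleaved mapping touches a single line per step, which is the access pattern GPUs can typically serve with one memory transaction.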

Additional information

X2-90 (for each model, the first table is before this PR, build 94a220c, and the second is after, build 0fb3d35)

Qwen3.5 9B

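| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q4_K - Medium | 5.74 GiB | 9.20 B | OpenCL | 99 | 0 | 0 | pp512 | 200.75 ± 0.59 |
| qwen35 9B Q4_K - Medium | 5.74 GiB | 9.20 B | OpenCL | 99 | 0 | 0 | tg128 | 10.36 ± 0.08 |

build: 94a220c (9496)

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 9B Q4_K - Medium | 5.74 GiB | 9.20 B | OpenCL | 99 | 0 | 0 | pp512 | 200.11 ± 2.65 |
| qwen35 9B Q4_K - Medium | 5.74 GiB | 9.20 B | OpenCL | 99 | 0 | 0 | tg128 | 14.44 ± 0.08 |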

build: 0fb3d35 (9500)

Qwen3.5 4B

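| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 4B Q4_K - Medium | 2.80 GiB | 4.33 B | OpenCL | 99 | 0 | 0 | pp512 | 349.43 ± 1.50 |
| qwen35 4B Q4_K - Medium | 2.80 GiB | 4.33 B | OpenCL | 99 | 0 | 0 | tg128 | 17.25 ± 0.09 |

build: 94a220c (9496)

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35 4B Q4_K - Medium | 2.80 GiB | 4.33 B | OpenCL | 99 | 0 | 0 | pp512 | 349.07 ± 7.49 |
| qwen35 4B Q4_K - Medium | 2.80 GiB | 4.33 B | OpenCL | 99 | 0 | 0 | tg128 | 23.35 ± 0.26 |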

build: 0fb3d35 (9500)

Qwen3 8B

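| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 8B Q4_K - Medium | 4.68 GiB | 8.19 B | OpenCL | 99 | 0 | 0 | pp512 | 202.66 ± 3.06 |
| qwen3 8B Q4_K - Medium | 4.68 GiB | 8.19 B | OpenCL | 99 | 0 | 0 | tg128 | 13.20 ± 0.23 |

build: 94a220c (9496)

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 8B Q4_K - Medium | 4.68 GiB | 8.19 B | OpenCL | 99 | 0 | 0 | pp512 | 216.79 ± 0.39 |
| qwen3 8B Q4_K - Medium | 4.68 GiB | 8.19 B | OpenCL | 99 | 0 | 0 | tg128 | 15.99 ± 0.04 |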

build: 0fb3d35 (9500)

Qwen3 4B

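| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | OpenCL | 99 | 0 | 0 | pp512 | 372.21 ± 6.42 |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | OpenCL | 99 | 0 | 0 | tg128 | 22.12 ± 0.09 |

build: 94a220c (9496)

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | OpenCL | 99 | 0 | 0 | pp512 | 376.55 ± 2.43 |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | OpenCL | 99 | 0 | 0 | tg128 | 26.03 ± 0.09 |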

build: 0fb3d35 (9500)

llama3.2 3B

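| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | OpenCL | 99 | 0 | 0 | pp512 | 491.38 ± 18.21 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | OpenCL | 99 | 0 | 0 | tg128 | 25.52 ± 0.61 |

build: 94a220c (9496)

| model | size | params | backend | ngl | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | OpenCL | 99 | 0 | 0 | pp512 | 504.74 ± 3.63 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | OpenCL | 99 | 0 | 0 | tg128 | 33.12 ± 0.13 |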

build: 0fb3d35 (9500)
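For a quick summary of the token-generation results, the tg128 means reported above can be turned into relative speedups. The snippet below only does that arithmetic on the numbers copied from the tables (the ± spread is ignored); the dictionary and function names are made up for the example.

```python
# tg128 throughput in tokens/s, copied from the tables above: (before, after)
tg128 = {
    "Qwen3.5 9B": (10.36, 14.44),
    "Qwen3.5 4B": (17.25, 23.35),
    "Qwen3 8B": (13.20, 15.99),
    "Qwen3 4B": (22.12, 26.03),
    "llama3.2 3B": (25.52, 33.12),
}


def speedup(before: float, after: float) -> float:
    """Ratio of the mean throughput after the PR to the mean before it."""
    return after / before


for model, (before, after) in tg128.items():
    print(f"{model}: {speedup(before, after):.2f}x")
```

On these means, tg128 improves by roughly 1.18x to 1.39x, with the largest gains on the two Qwen3.5 models.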

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, used Claude to profile Qwen3.5 models to identify the issues.

github-actions bot added the labels ggml (changes relating to the ggml tensor library for machine learning) and OpenCL (Issues specific to the OpenCL backend) on Jun 5, 2026
lhez marked this pull request as ready for review on June 5, 2026 at 14:45
lhez requested a review from a team as a code owner on June 5, 2026 at 14:45

@max-krasnyansky max-krasnyansky left a comment

Nice bump in tg!
