Sycl --split-mode tensor by Spruill-1 · Pull Request #24152 · ggml-org/llama.cpp

Spruill-1 · 2026-06-05T00:46:56Z

Overview

Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce, mirroring the pattern used by ggml-cuda.cu.

For N=2 devices (dual GPU), this implements a degenerate ring allreduce with two size-branched paths, mirroring the CUDA NCCL allreduce. Below, nelem is the number of elements in the tensor:

Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel chained via depends_on(memcpy_event). 4 SYCL submissions/call.
Large (nelem >= 32768): BF16-compressed. Each device compresses FP32 -> BF16 locally, memcpys the result across to the peer, then decompresses + adds into the local FP32 partial. 6 SYCL submissions/call, but PCIe bytes are halved. Wins overall for larger dense models.
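The two paths can be sketched on the host as follows. This is a minimal illustration, not the PR's SYCL code: std::vector copies stand in for the cross-device memcpys, plain loops stand in for the kernels, and the names kBf16Threshold, fp32_to_bf16, bf16_to_fp32 and allreduce_2dev are hypothetical. The round-to-nearest-even conversion is an assumption, since the description does not state the rounding mode.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Element-count threshold from the PR description (hypothetical constant name).
constexpr std::size_t kBf16Threshold = 32768;

// FP32 -> BF16: keep the upper 16 bits, rounding to nearest even.
// Assumption: rounding mode is not stated in the PR; NaN handling is omitted.
inline std::uint16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// BF16 -> FP32: place the 16 bits in the upper half of a 32-bit word.
inline float bf16_to_fp32(std::uint16_t h) {
    std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Host stand-in for the two-device all-reduce: dev0 and dev1 hold each
// device's FP32 partial and are updated in place.
void allreduce_2dev(std::vector<float>& dev0, std::vector<float>& dev1) {
    assert(dev0.size() == dev1.size());
    const std::size_t nelem = dev0.size();
    if (nelem < kBf16Threshold) {
        // Small path: copy the peer's FP32 partial, then add it locally.
        const std::vector<float> in0 = dev1;  // stands in for memcpy dev1 -> dev0
        const std::vector<float> in1 = dev0;  // stands in for memcpy dev0 -> dev1
        for (std::size_t i = 0; i < nelem; ++i) {
            dev0[i] += in0[i];
            dev1[i] += in1[i];
        }
    } else {
        // Large path: compress locally, exchange BF16, decompress + add.
        std::vector<std::uint16_t> out0(nelem), out1(nelem);
        for (std::size_t i = 0; i < nelem; ++i) {
            out0[i] = fp32_to_bf16(dev0[i]);
            out1[i] = fp32_to_bf16(dev1[i]);
        }
        const std::vector<std::uint16_t> in0 = out1;  // half the bytes of FP32
        const std::vector<std::uint16_t> in1 = out0;
        for (std::size_t i = 0; i < nelem; ++i) {
            dev0[i] += bf16_to_fp32(in0[i]);
            dev1[i] += bf16_to_fp32(in1[i]);
        }
    }
}
```

In this sketch the large path adds a BF16-rounded peer value to an unrounded local value, so the two devices can end up with sums that differ in the low bits.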

Storage: A single persistent uint8_t buffer per device, 4 * nelem bytes (matches both path layouts: FP32 = nelem floats; BF16 outbox + inbox = 2 * nelem uint16_t in total). A single alloc+free per device keeps the SYCL pool's strict-LIFO invariant.
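The sizing arithmetic can be checked with a small sketch. The function names are hypothetical, and placing the outbox first with the inbox directly after it is an assumption; the description only gives the total size.

```cpp
#include <cstddef>
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit float");

// Size of the per-device scratch buffer as stated in the description.
constexpr std::size_t scratch_bytes(std::size_t nelem) { return 4 * nelem; }

// FP32 path: one staging area of nelem floats.
constexpr std::size_t fp32_path_bytes(std::size_t nelem) {
    return nelem * sizeof(float);
}

// BF16 path: outbox + inbox, nelem uint16_t values each.
constexpr std::size_t bf16_path_bytes(std::size_t nelem) {
    return 2 * nelem * sizeof(std::uint16_t);
}

// Assumed layout: outbox at byte 0, inbox immediately after it.
constexpr std::size_t bf16_inbox_offset(std::size_t nelem) {
    return nelem * sizeof(std::uint16_t);
}

// Both layouts fill the same 4 * nelem bytes.
static_assert(fp32_path_bytes(32768) == scratch_bytes(32768), "FP32 layout fits");
static_assert(bf16_path_bytes(32768) == scratch_bytes(32768), "BF16 layout fits");
```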

Initial impl handles N=2 FP32 contiguous tensors. Other cases return false, causing the meta-backend to use its generic butterfly fallback. This should be easy to extend to N>2, but I don't have the hardware.
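The eligibility rule can be read as a single predicate. The sketch below uses hypothetical stand-in types (ElemType, TensorDesc) rather than the real ggml structures, and the function name is an assumption; it only restates the three conditions listed above.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real ggml tensor metadata.
enum class ElemType { F32, F16, BF16 };

struct TensorDesc {
    ElemType    type;
    bool        contiguous;
    std::size_t nelem;
};

// Returns true only for the case the initial implementation handles:
// exactly two devices, FP32 elements, contiguous layout. A false result
// corresponds to the meta-backend using its generic butterfly fallback.
bool can_use_backend_allreduce(std::size_t n_devices, const TensorDesc& t) {
    return n_devices == 2 && t.type == ElemType::F32 && t.contiguous;
}
```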

Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 + DPC++ nightly) with a Threadripper 2970WX, 128GB system running Ubuntu Server 26 LTS.

Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%)
tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%)

Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%)
tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%)

Qwen3-4B Q4_K_M:
pp64 = 984.51 t/s, tg16 = 49.29 t/s

Requirements

Build/CMake: no changes. No new dependencies. ~200 lines added across ggml-sycl.h and ggml-sycl.cpp.

I have read and agree with the contributing guidelines
AI usage disclosure: I used GitHub Copilot to help set up an automated benchmarking setup to save time. After some cleanup, I may try to publish the benchmarking setup Dockerfile if anyone is interested.

* SYCL: tensor parallelism (--split-mode tensor) for dual-GPU

Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce, mirroring the pattern used by ggml-cuda.cu.

For N=2 (the common dual-GPU case) implements a degenerate ring all-reduce with two size-branched paths:

* Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel chained via depends_on(memcpy_event). 4 SYCL submissions/call.
* Large (nelem >= 32768): BF16-compressed. Each device compresses FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's inbox (HALF the PCIe bytes), then decompresses + adds into the local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved -- wins for any tensor where PCIe dominates kernel time.

Threshold and BF16 path pattern mirror the CUDA NCCL allreduce.

Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes (matches both path layouts: FP32 nelem floats; BF16 outbox+inbox = 2 * nelem uint16_t each). Single alloc+free per device keeps the SYCL pool's strict-LIFO invariant trivial.

Initial impl handles N=2 FP32 contiguous tensors. Other cases return false, causing the meta-backend to use its generic butterfly fallback.

Per-call sync is intentionally omitted. SYCL in-order queue semantics ensure that the meta-backend's next compute on the same per-device queue waits for our final ADD, and the next allreduce's first op on the same persistent buffer waits via the same queue. Only comm_free does an explicit final wait.

OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process in communicator_impl.hpp:47 (condition devices.size() == 1), which is incompatible with llama.cpp's single-process multi-GPU model.

Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 + DPC++ nightly):

Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%)
tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%)

Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%)
tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%)

Qwen3-4B Q4_K_M:
pp64 = 984.51 t/s, tg16 = 49.29 t/s

Llama-3.3-70B in SYCL TP now comfortably beats production layer mode on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on both: the BF16 path is what unlocks the many-medium-allreduces prefill pattern.

Build/CMake: no changes. No new dependencies. ~210 lines added across ggml-sycl.h and ggml-sycl.cpp.

* Fix comments

ggml-gh-bot · 2026-06-05T00:51:12Z

Hi @Spruill-1, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

Spruill-1 requested review from a team and ggerganov as code owners June 5, 2026 00:46

github-actions Bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels Jun 5, 2026

Spruill-1 changed the title from "Sycl" to "Sycl --split-mode tensor" Jun 5, 2026

Spruill-1 mentioned this pull request Jun 5, 2026

SYCL: -sm row (tensor-parallel) segfaults in ggml_backend_sycl_split_buffer_type on 3x Arc Pro B60 (b422) #23301

Open

Spruill-1 wants to merge 1 commit into ggml-org:master from Spruill-1:master
