Sycl --split-mode tensor#24152

Open
Spruill-1 wants to merge 1 commit into ggml-org:master from Spruill-1:master

@Spruill-1

Overview

Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend queries via get_proc_address to enable backend-specific all-reduce, mirroring the pattern used by ggml-cuda.cu.

For N=2 (dual GPU), this implements a degenerate ring all-reduce with two size-branched paths, mirroring the CUDA NCCL all-reduce. Below, nelem is the number of elements in the tensor:

  • Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel chained via depends_on(memcpy_event). 4 SYCL submissions/call.

  • Large (nelem >= 32768): BF16-compressed. Each device compresses FP32 -> BF16 into a local outbox, memcpys it across devices into the peer's inbox, then decompresses + adds into the local FP32 partial. 6 SYCL submissions/call, but PCIe bytes are halved. Wins overall for larger dense models.
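
The two paths can be modelled on the host without any SYCL runtime. The following is a minimal sketch, not the code in this PR: `kBf16Threshold`, `fp32_to_bf16`, `bf16_to_fp32` and `allreduce2_model` are hypothetical names, two `std::vector<float>` stand in for the per-device partials, and round-to-nearest-even is an assumed rounding mode for the BF16 conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical name; the value 32768 is the threshold stated above.
constexpr std::size_t kBf16Threshold = 32768;

// FP32 -> BF16 with round-to-nearest-even (the rounding mode is an assumption;
// NaN inputs are not special-cased in this sketch).
inline std::uint16_t fp32_to_bf16(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// BF16 -> FP32 is exact: the 16 bits become the high half of the float.
inline float bf16_to_fp32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Host-side model: dev0 and dev1 stand in for each device's FP32 partial.
inline void allreduce2_model(std::vector<float>& dev0, std::vector<float>& dev1) {
    assert(dev0.size() == dev1.size());
    const std::size_t nelem = dev0.size();
    if (nelem < kBf16Threshold) {
        // Small path: two FP32 copies (one per direction), then two adds.
        const std::vector<float> from0 = dev0;  // models memcpy dev0 -> dev1
        const std::vector<float> from1 = dev1;  // models memcpy dev1 -> dev0
        for (std::size_t i = 0; i < nelem; ++i) {
            dev0[i] += from1[i];
            dev1[i] += from0[i];
        }
    } else {
        // Large path: two compressions, two half-size copies, two decompress+adds.
        std::vector<std::uint16_t> outbox0(nelem), outbox1(nelem);
        for (std::size_t i = 0; i < nelem; ++i) {
            outbox0[i] = fp32_to_bf16(dev0[i]);
            outbox1[i] = fp32_to_bf16(dev1[i]);
        }
        // Reading the peer's outbox directly models the copy into the local inbox.
        for (std::size_t i = 0; i < nelem; ++i) {
            dev0[i] += bf16_to_fp32(outbox1[i]);
            dev1[i] += bf16_to_fp32(outbox0[i]);
        }
    }
}
```

In this model each device keeps its own partial at full FP32 precision and only the peer's contribution passes through BF16, so on the large path the two results can differ from each other by BF16 rounding error.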

Storage: A single persistent uint8_t buffer per device, 4 * nelem bytes (matches both path layouts: FP32 = nelem floats; BF16 outbox + inbox = nelem uint16_t, i.e. 2 * nelem bytes, each). A single alloc+free per device keeps the SYCL pool's strict-LIFO invariant.
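
The size arithmetic can be checked at compile time. This is a sketch with hypothetical helper names; placing the outbox first and the inbox immediately after it is an assumption, since the description only gives the sizes.

```cpp
#include <cstddef>
#include <cstdint>

// All names here are hypothetical; only the sizes come from the description.
constexpr std::size_t staging_bytes(std::size_t nelem) { return 4 * nelem; }

// FP32 path: the buffer holds nelem floats received from the peer.
constexpr std::size_t fp32_bytes(std::size_t nelem) { return nelem * sizeof(float); }

// BF16 path: the outbox and the inbox each hold nelem uint16_t values.
constexpr std::size_t bf16_box_bytes(std::size_t nelem) {
    return nelem * sizeof(std::uint16_t);
}

// Assumed ordering: outbox at offset 0, inbox right after it.
constexpr std::size_t bf16_inbox_offset(std::size_t nelem) {
    return bf16_box_bytes(nelem);
}

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
static_assert(fp32_bytes(32768) == staging_bytes(32768),
              "FP32 layout fills the whole buffer");
static_assert(2 * bf16_box_bytes(32768) == staging_bytes(32768),
              "BF16 outbox + inbox fill the whole buffer");
```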

The initial implementation handles N=2 devices and contiguous FP32 tensors. Other cases return false, causing the meta-backend to use its generic butterfly fallback. Extending this to N>2 should be easy, but I don't have the hardware.
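
The fallback contract amounts to a guard at the top of the all-reduce entry point. A minimal sketch, assuming a stand-in `TensorDesc` type rather than the real ggml tensor struct; `ElemType`, `TensorDesc` and `can_use_fast_allreduce` are hypothetical names.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the tensor metadata the real backend would inspect.
enum class ElemType { F32, F16, BF16 };

struct TensorDesc {
    ElemType type;
    bool     contiguous;
};

// Returns true only for the case described above: two devices and a
// contiguous FP32 tensor. A false result means the caller should fall back
// to its generic path.
constexpr bool can_use_fast_allreduce(std::size_t n_devices, const TensorDesc& t) {
    return n_devices == 2 && t.type == ElemType::F32 && t.contiguous;
}
```

Returning false rather than failing keeps unsupported shapes and device counts working, just without the backend-specific path.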

Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 + DPC++ nightly) with a Threadripper 2970WX, 128GB system running Ubuntu Server 26 LTS.

Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%)
tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%)

Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%)
tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%)

Qwen3-4B Q4_K_M:
pp64 = 984.51 t/s, tg16 = 49.29 t/s

Requirements

Build/CMake: no changes. No new dependencies. ~200 lines added across ggml-sycl.h and ggml-sycl.cpp.

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: I used GitHub Copilot to help set up an automated benchmarking setup to save time. After some cleanup I may publish the benchmarking Dockerfile if anyone is interested.

* SYCL: tensor parallelism (--split-mode tensor) for dual-GPU

Adds the comm_init/comm_free/comm_allreduce_tensor trio that the
meta-backend queries via get_proc_address to enable backend-specific
all-reduce, mirroring the pattern used by ggml-cuda.cu.

For N=2 (the common dual-GPU case) implements a degenerate ring
all-reduce with two size-branched paths:

  * Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel
    chained via depends_on(memcpy_event). 4 SYCL submissions/call.

  * Large (nelem >= 32768): BF16-compressed. Each device compresses
    FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's
    inbox (HALF the PCIe bytes), then decompresses + adds into the
    local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved
    -- wins for any tensor where PCIe dominates kernel time.

Threshold and BF16 path pattern mirror the CUDA NCCL allreduce.

Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes
(matches both path layouts: FP32 nelem floats; BF16 outbox+inbox =
nelem uint16_t, i.e. 2 * nelem bytes, each). Single alloc+free per device keeps the
SYCL pool's strict-LIFO invariant trivial.

Initial impl handles N=2 FP32 contiguous tensors. Other cases return
false, causing the meta-backend to use its generic butterfly fallback.

Per-call sync is intentionally omitted. SYCL in-order queue semantics
ensure that the meta-backend's next compute on the same per-device
queue waits for our final ADD, and the next allreduce's first op on
the same persistent buffer waits via the same queue. Only comm_free
does an explicit final wait.

OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process
in communicator_impl.hpp:47 (condition devices.size() == 1), which is
incompatible with llama.cpp's single-process multi-GPU model.

Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 +
DPC++ nightly):

  Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
    pp512 = 377.08 t/s  (vs 313.65 layer mode = +20.2%)
    tg128 = 17.40 t/s   (vs   9.74 layer mode = +78.6%)

  Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
    pp512 = 216.56 t/s  (vs 156.58 meta-backend butterfly = +38.3%)
    tg128 = 17.60 t/s   (vs  14.31 meta-backend butterfly = +23.0%)

  Qwen3-4B Q4_K_M:
    pp64  = 984.51 t/s, tg16 = 49.29 t/s

Llama-3.3-70B in SYCL TP now comfortably beats production layer mode
on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on
both — the BF16 path is what unlocks the many-medium-allreduces
prefill pattern.

Build/CMake: no changes. No new dependencies. ~210 lines added across
ggml-sycl.h and ggml-sycl.cpp.

* Fix comments
@Spruill-1 Spruill-1 requested review from a team and ggerganov as code owners June 5, 2026 00:46
@github-actions github-actions Bot added the ggml and SYCL labels Jun 5, 2026
@Spruill-1 Spruill-1 changed the title from "Sycl" to "Sycl --split-mode tensor" Jun 5, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 5, 2026

Hi @Spruill-1, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
