Sycl --split-mode tensor#24152
Open
Spruill-1 wants to merge 1 commit into
Open
Conversation
* SYCL: tensor parallelism (--split-mode tensor) for dual-GPU
Adds the comm_init/comm_free/comm_allreduce_tensor trio that the
meta-backend queries via get_proc_address to enable backend-specific
all-reduce, mirroring the pattern used by ggml-cuda.cu.
For N=2 (the common dual-GPU case) implements a degenerate ring
all-reduce with two size-branched paths:
* Small (nelem < 32768): FP32 direct memcpy + per-device ADD kernel
chained via depends_on(memcpy_event). 4 SYCL submissions/call.
* Large (nelem >= 32768): BF16-compressed. Each device compresses
FP32 -> BF16 in a local outbox, cross-device memcpys to the peer's
inbox (HALF the PCIe bytes), then decompresses + adds into the
local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved
-- wins for any tensor where PCIe dominates kernel time.
Threshold and BF16 path pattern mirror the CUDA NCCL allreduce.
Storage: ONE persistent uint8_t buffer per device, 4 * nelem bytes
(matches both path layouts: FP32 nelem floats; BF16 outbox+inbox =
2 * nelem uint16_t each). Single alloc+free per device keeps the
SYCL pool's strict-LIFO invariant trivial.
Initial impl handles N=2 FP32 contiguous tensors. Other cases return
false, causing the meta-backend to use its generic butterfly fallback.
Per-call sync is intentionally omitted. SYCL in-order queue semantics
ensure that the meta-backend's next compute on the same per-device
queue waits for our final ADD, and the next allreduce's first op on
the same persistent buffer waits via the same queue. Only comm_free
does an explicit final wait.
OneCCL is NOT used: OneCCL 2021.17 hardcodes single-device-per-process
in communicator_impl.hpp:47 (condition devices.size() == 1), which is
incompatible with llama.cpp's single-process multi-GPU model.
Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 +
DPC++ nightly):
Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%)
tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%)
Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%)
tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%)
Qwen3-4B Q4_K_M:
pp64 = 984.51 t/s, tg16 = 49.29 t/s
Llama-3.3-70B in SYCL TP now comfortably beats production layer mode
on both prefill and decode. Coder-Next-80B-A3B (MoE) also wins on
both — the BF16 path is what unlocks the many-medium-allreduces
prefill pattern.
Build/CMake: no changes. No new dependencies. ~210 lines added across
ggml-sycl.h and ggml-sycl.cpp.
* Fix comments
|
Hi @Spruill-1, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Adds the comm_init/comm_free/comm_allreduce_tensor trio that the meta-backend wants via get_proc_address to enable backend-specific all-reduce, mirroring the pattern used by ggml-cuda.cu.
For N=2 (dual-GPUs) implements a degenerate ring allreduce with two size-branched paths to mirror CUDA NCCL allreduce:
Small (N< 32k): FP32 direct memcpy + per-device ADD kernel chained via depends_on(memcpy_event). 4 submissions/call.
Large (N>= 32k): BF16-compressed. Each device compresses FP32 -> BF16 locally, cross-device memcpys to the peer, then decompresses + adds into the local FP32 partial. 6 SYCL submissions/call but PCIe bytes halved. Wins overall for larger dense models.
Storage: A single persistent uint8_t buffer per device, 4 * N bytes (matches both path layouts: FP32 N floats; BF16 outbox+inbox = 2 * N uint16_t each). Single alloc+free per device keeps the SYCL pool's strict-LIFO invariant.
Initial impl handles N=2 FP32 contiguous tensors. Other cases return false, causing the meta-backend to use its generic butterfly fallback. This should be easy to extend to N>2, but I don't have the hardware.
Measured on dual Intel Arc Pro B70 (NEO 26.05.x, oneAPI 2025.3 + DPC++ nightly) with a Threadripper 2970WX, 128GB system running Ubuntu Server 26 LTS.
Llama-3.3-70B Q4_K_M, -sm tensor -fa 1 -ctk f16 -ctv f16:
pp512 = 377.08 t/s (vs 313.65 layer mode = +20.2%)
tg128 = 17.40 t/s (vs 9.74 layer mode = +78.6%)
Qwen3-Coder-Next-80B-A3B Q3_K_M (MoE):
pp512 = 216.56 t/s (vs 156.58 meta-backend butterfly = +38.3%)
tg128 = 17.60 t/s (vs 14.31 meta-backend butterfly = +23.0%)
Qwen3-4B Q4_K_M:
pp64 = 984.51 t/s, tg16 = 49.29 t/s
Requirements
Build/CMake: no changes. No new dependencies. ~200 lines added across ggml-sycl.h and ggml-sycl.cpp.