What is the issue?
llama-server with -sm row (tensor-parallel) on 3× Intel Arc Pro B60 24 GB segfaults during model load, inside ggml_backend_sycl_split_buffer_type, before any compute kernel runs. Pipeline-parallel (-sm layer) on the same 3 GPUs works correctly.
Reproduces deterministically with and without -fa on, and with 60+ GB of free VRAM across the GPUs (model is ~60 GB MXFP4).
What did you do?
source /opt/intel/oneapi/setvars.sh
export ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2
# Tensor-parallel attempt (CRASHES)
./llama-server \
-m gpt-oss-120b-mxfp4-00001-of-00003.gguf \
--host 127.0.0.1 --port 9097 \
-ngl 999 -sm row --tensor-split 1,1,1 --main-gpu 0 \
-c 4096 --jinja
# Pipeline-parallel on same GPUs (WORKS, ~19-20 tok/s)
./llama-server -ngl 999 -sm layer ...
What happened?
0.00.440.341 I log_info: verbosity = 3
0.00.440.343 I device_info:
0.00.440.702 I - SYCL0 : Intel(R) Arc(TM) Pro B60 Graphics (20343 MiB, 20343 MiB free)
0.00.441.001 I - SYCL1 : Intel(R) Arc(TM) Pro B60 Graphics (23256 MiB, 23256 MiB free)
0.00.441.289 I - SYCL2 : Intel(R) Arc(TM) Pro B60 Graphics (20343 MiB, 20343 MiB free)
0.00.441.293 I - CPU : AMD Ryzen 9 5900XT (128778 MiB, 128778 MiB free)
0.00.442.589 I srv main: loading model
0.00.442.668 I common_init_result: fitting params to device memory ...
…then the process is killed by SIGSEGV. From dmesg:
llama-server[2367925]: segfault at 1 ip 00007b8259137540 sp 00007ffd5726d860
error 4 in libggml-sycl.so.0.12.0[137540,7b825912f000+a4a000]
addr2line resolves the faulting instruction to:
$ addr2line -Cfie libggml-sycl.so.0.12.0 0x137540
ggml_backend_sycl_split_buffer_type
error 4 = user-mode read of a non-present page, with faulting address 0x1. That looks like a dereference of a pointer whose value is literally 1, most likely an uninitialized or garbage pointer returned from a helper.
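For anyone double-checking that reading of the error code, here is a minimal bash sketch that decodes the low bits of an x86 page-fault error code as dmesg prints it (in hex). The function name decode_pf_error is made up for illustration and is not part of any existing tool; the bit meanings are the standard x86 ones (bit 0 present/protection, bit 1 write, bit 2 user mode, bit 4 instruction fetch).

```bash
#!/usr/bin/env bash
# Decode the low bits of an x86 page-fault error code (the "error N" field in dmesg).
# decode_pf_error is a hypothetical helper name used only for this illustration.
decode_pf_error() {
    local code=$(( $1 ))   # accepts decimal or 0x-prefixed hex
    local out=()
    # bit 0: 0 = page not present, 1 = protection violation
    if (( code & 1 )); then out+=("protection-violation"); else out+=("not-present"); fi
    # bit 1: 0 = read access, 1 = write access
    if (( code & 2 )); then out+=("write"); else out+=("read"); fi
    # bit 2: 0 = kernel mode, 1 = user mode
    if (( code & 4 )); then out+=("user"); else out+=("kernel"); fi
    # bit 4: set when the fault was an instruction fetch
    if (( code & 16 )); then out+=("instruction-fetch"); fi
    echo "${out[*]}"
}

decode_pf_error 4   # prints: not-present read user
```

Since dmesg prints the code in hex, multi-digit values should be passed with a 0x prefix (for example, "error 14" would be passed as 0x14).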
I reproduced this 3 times in a row from cold (process restart, no other GPU consumers, full VRAM free on all three devices). It is not memory pressure.
Tried
- With and without -fa on: same crash, same address.
- After completely stopping the previous llama-server and waiting for VRAM to free: same crash.
- Without -DGGML_SYCL_DEVICE_ARCH=bmg-g21 (JIT-only): same crash.
Version
$ ./llama-server --version
version: 422 (9a532ae4b)
built with IntelLLVM 2025.3.3 for Linux x86_64
(Master at the time of writing, includes PR #21597 — that fix did eliminate an earlier device-enumeration crash on this hardware, but the row-split crash is separate.)
Hardware
- 3× Intel Arc Pro B60 24 GB (Battlemage, bmg-g21)
- Intel Level Zero v1.x via intel-level-zero-gpu / intel-opencl-icd 26.05.37020.3
- AMD Ryzen 9 5900XT, 128 GB DDR4
- Ubuntu 24.04, kernel 6.17.0-23-generic, xe driver
Build flags
cmake -B build-sycl \
-DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
-DGGML_SYCL=ON -DGGML_SYCL_TARGET=INTEL \
-DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=OFF
Model
gpt-oss-120b MXFP4, 3-shard GGUF (~60 GB on disk).
Happy to provide
- Full --verbose log (it stops at "fitting params to device memory" before crashing, but I can attach it).
- A gdb backtrace if needed. The binary symbols are stripped by default; I will rebuild with -DCMAKE_BUILD_TYPE=RelWithDebInfo if it would help triage.
- Test results with alternative --tensor-split ratios or different --main-gpu values.
Why this matters
Pipeline-parallel (-sm layer) works but does not speed up a single stream (~20 tok/s for gpt-oss-120b on 3× B60). Tensor-parallel (-sm row) would give the expected single-stream speedup, but ggml_backend_sycl_split_buffer_type crashes before any kernel launches. This blocks effective use of multi-GPU Intel Arc systems for single-user inference of larger models.