SYCL: -sm row (tensor-parallel) segfaults in ggml_backend_sycl_split_buffer_type on 3x Arc Pro B60 (b422) #23301

@jpmicrosoft

What is the issue?

llama-server with -sm row (tensor-parallel) on 3× Intel Arc Pro B60 24 GB segfaults during model load, inside ggml_backend_sycl_split_buffer_type, before any compute kernel runs. Pipeline-parallel (-sm layer) on the same 3 GPUs works correctly.

Reproduces deterministically with and without -fa on, and with 60+ GB of free VRAM across the GPUs (model is ~60 GB MXFP4).

What did you do?

source /opt/intel/oneapi/setvars.sh
export ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2

# Tensor-parallel attempt (CRASHES)
./llama-server \
  -m gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 127.0.0.1 --port 9097 \
  -ngl 999 -sm row --tensor-split 1,1,1 --main-gpu 0 \
  -c 4096 --jinja

# Pipeline-parallel on same GPUs (WORKS, ~19-20 tok/s)
./llama-server -ngl 999 -sm layer ...

What happened?

0.00.440.341 I log_info: verbosity = 3
0.00.440.343 I device_info:
0.00.440.702 I   - SYCL0   : Intel(R) Arc(TM) Pro B60 Graphics (20343 MiB, 20343 MiB free)
0.00.441.001 I   - SYCL1   : Intel(R) Arc(TM) Pro B60 Graphics (23256 MiB, 23256 MiB free)
0.00.441.289 I   - SYCL2   : Intel(R) Arc(TM) Pro B60 Graphics (20343 MiB, 20343 MiB free)
0.00.441.293 I   - CPU     : AMD Ryzen 9 5900XT (128778 MiB, 128778 MiB free)
0.00.442.589 I srv          main: loading model
0.00.442.668 I common_init_result: fitting params to device memory ...

…then the process is killed by SIGSEGV. From dmesg:

llama-server[2367925]: segfault at 1 ip 00007b8259137540 sp 00007ffd5726d860
  error 4 in libggml-sycl.so.0.12.0[137540,7b825912f000+a4a000]

addr2line resolves the faulting instruction to:

$ addr2line -Cfie libggml-sycl.so.0.12.0 0x137540
ggml_backend_sycl_split_buffer_type
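As a cross-check on the offset passed to addr2line, the fields of the dmesg line can be recomputed. This is a minimal sketch, not part of the original report: `parse_segfault` is a hypothetical helper, and it assumes the bracketed part has the form `[offset,map_base+map_size]`, which is how the line above appears to be laid out.

```python
import re

# The dmesg line from this report, joined onto one logical line.
DMESG = (
    "llama-server[2367925]: segfault at 1 ip 00007b8259137540 "
    "sp 00007ffd5726d860 error 4 in "
    "libggml-sycl.so.0.12.0[137540,7b825912f000+a4a000]"
)

def parse_segfault(line):
    """Extract the hex fields of a kernel segfault line (hypothetical helper)."""
    m = re.search(
        r"segfault at (\w+) ip (\w+) sp (\w+) error (\w+) "
        r"in (\S+)\[(\w+),(\w+)\+(\w+)\]",
        line,
    )
    if m is None:
        raise ValueError("unrecognized segfault line")
    addr, ip, sp, err, lib, off, base, size = m.groups()
    return {
        "addr": int(addr, 16),
        "ip": int(ip, 16),
        "sp": int(sp, 16),
        "error": int(err, 16),
        "lib": lib,
        "offset": int(off, 16),   # assumed: offset reported by the kernel
        "base": int(base, 16),    # start of the mapping containing ip
        "size": int(size, 16),    # length of that mapping
    }

info = parse_segfault(DMESG)

# Distance of the faulting instruction from the start of its mapping.
ip_in_mapping = info["ip"] - info["base"]

# The reported offset is larger than that distance; the difference would be
# the file offset at which this mapping starts (an assumption about the format).
mapping_file_offset = info["offset"] - ip_in_mapping

print(hex(ip_in_mapping), hex(mapping_file_offset), hex(info["offset"]))
```

The instruction pointer falls inside the reported mapping, and the bracketed offset is the value that was handed to addr2line. Whether that offset equals the virtual address addr2line expects depends on the library's segment layout, so the resolved symbol is best confirmed with a symbolized build.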

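Error code 4 indicates a user-mode read of a page that is not mapped, and the faulting address is 0x1. This looks like a dereference of a pointer whose value is the literal 1 (most likely an uninitialized or invalid pointer returned from a helper).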
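The reading of the error code can be reproduced from the x86 page-fault error-code bits. The sketch below is illustrative only; `decode_pf_error` is a hypothetical helper name, and only the five lowest bits are decoded.

```python
# x86 page-fault error-code bits (low five bits only).
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access, 1: write access
PF_USER = 1 << 2   # 0: kernel mode, 1: user mode
PF_RSVD = 1 << 3   # reserved bit set in a page-table entry
PF_INSTR = 1 << 4  # fault happened on an instruction fetch

def decode_pf_error(code):
    """Translate a page-fault error code into readable fields."""
    return {
        "cause": "protection violation" if code & PF_PROT else "page not present",
        "access": "write" if code & PF_WRITE else "read",
        "mode": "user" if code & PF_USER else "kernel",
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }

decoded = decode_pf_error(4)
print(decoded)
```

For code 4 only the user-mode bit is set, so the fault is a data read from user space on a page that is not present, which is consistent with reading through a pointer that holds the value 1.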

I reproduced this 3 times in a row from cold (process restart, no other GPU consumers, full VRAM free on all three devices). It is not memory pressure.
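For reference, the free-memory figures in the device_info log above can be totalled as follows. This is only a sanity check on the "60+ GB free" statement; the values are copied from the log, and nothing here models how the row split actually distributes tensors.

```python
# Free VRAM per device in MiB, as printed in the device_info log.
free_mib = {"SYCL0": 20343, "SYCL1": 23256, "SYCL2": 20343}

total_mib = sum(free_mib.values())
total_gib = total_mib / 1024           # binary gigabytes
total_gb = total_mib * 2**20 / 1e9     # decimal gigabytes

print(f"{total_mib} MiB = {total_gib:.1f} GiB = {total_gb:.1f} GB")
```

The total exceeds 60 in either unit, so the aggregate free memory is larger than the ~60 GB model regardless of which convention the model size uses.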

Tried

  • With and without -fa on — same crash, same address.
  • After completely stopping the previous llama-server and waiting for VRAM to free — same crash.
  • Without -DGGML_SYCL_DEVICE_ARCH=bmg-g21 (JIT-only) — same crash.

Version

$ ./llama-server --version
version: 422 (9a532ae4b)
built with IntelLLVM 2025.3.3 for Linux x86_64

(Master at the time of writing; it includes PR #21597. That fix eliminated an earlier device-enumeration crash on this hardware, but the row-split crash is separate.)

Hardware

  • 3× Intel Arc Pro B60 24 GB (Battlemage, bmg-g21)
  • Intel Level Zero v1.x via intel-level-zero-gpu / intel-opencl-icd 26.05.37020.3
  • AMD Ryzen 9 5900XT, 128 GB DDR4
  • Ubuntu 24.04, kernel 6.17.0-23-generic, xe driver

Build flags

cmake -B build-sycl \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_SYCL=ON -DGGML_SYCL_TARGET=INTEL \
  -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF

Model

gpt-oss-120b MXFP4, 3-shard GGUF (~60 GB on disk).

Happy to provide

  • Full --verbose log (it stops at "fitting params to device memory" before crashing, but I can attach it).
  • A gdb backtrace if needed. The binary is stripped by default, so I will rebuild with -DCMAKE_BUILD_TYPE=RelWithDebInfo if it would help triage.
  • Tests with alternative --tensor-split ratios or different --main-gpu values.

Why this matters

Pipeline-parallel (-sm layer) works but does not speed up a single stream (~20 tok/s for gpt-oss-120b on 3× B60). Tensor-parallel (-sm row) should provide that single-stream speedup, but ggml_backend_sycl_split_buffer_type crashes before any kernel launches. This blocks effective use of multi-GPU Intel Arc systems for single-user inference of larger models.
