SYCL: -sm row (tensor-parallel) segfaults in ggml_backend_sycl_split_buffer_type on 3x Arc Pro B60 (b422)

### What is the issue?

`llama-server` with `-sm row` (tensor-parallel) on 3× Intel Arc Pro B60 24 GB segfaults during model load, inside `ggml_backend_sycl_split_buffer_type`, before any compute kernel runs. Pipeline-parallel (`-sm layer`) on the same 3 GPUs works correctly.

Reproduces deterministically with and without `-fa on`, and with 60+ GB of free VRAM across the GPUs (model is ~60 GB MXFP4).

### What did you do?

```bash
source /opt/intel/oneapi/setvars.sh
export ONEAPI_DEVICE_SELECTOR=level_zero:0,1,2

# Tensor-parallel attempt (CRASHES)
./llama-server \
  -m gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  --host 127.0.0.1 --port 9097 \
  -ngl 999 -sm row --tensor-split 1,1,1 --main-gpu 0 \
  -c 4096 --jinja

# Pipeline-parallel on same GPUs (WORKS, ~19-20 tok/s)
./llama-server -ngl 999 -sm layer ...
```

### What happened?

```
0.00.440.341 I log_info: verbosity = 3
0.00.440.343 I device_info:
0.00.440.702 I   - SYCL0   : Intel(R) Arc(TM) Pro B60 Graphics (20343 MiB, 20343 MiB free)
0.00.441.001 I   - SYCL1   : Intel(R) Arc(TM) Pro B60 Graphics (23256 MiB, 23256 MiB free)
0.00.441.289 I   - SYCL2   : Intel(R) Arc(TM) Pro B60 Graphics (20343 MiB, 20343 MiB free)
0.00.441.293 I   - CPU     : AMD Ryzen 9 5900XT (128778 MiB, 128778 MiB free)
0.00.442.589 I srv          main: loading model
0.00.442.668 I common_init_result: fitting params to device memory ...
```
…then the process is killed by SIGSEGV. From `dmesg`:

```
llama-server[2367925]: segfault at 1 ip 00007b8259137540 sp 00007ffd5726d860
  error 4 in libggml-sycl.so.0.12.0[137540,7b825912f000+a4a000]
```

`addr2line` resolves the faulting instruction to:

```
$ addr2line -Cfie libggml-sycl.so.0.12.0 0x137540
ggml_backend_sycl_split_buffer_type
```

`error 4` = userspace read fault, faulting address `0x1` — looks like a deref of a `1` literal (uninitialized pointer return from a helper, most likely).

I reproduced this 3 times in a row from cold (process restart, no other GPU consumers, full VRAM free on all three devices). It is not memory pressure.

### Tried

- With and without `-fa on` — same crash, same address.
- After completely stopping the previous `llama-server` and waiting for VRAM to free — same crash.
- Without `-DGGML_SYCL_DEVICE_ARCH=bmg-g21` (JIT-only) — same crash.

### Version

```
$ ./llama-server --version
version: 422 (9a532ae4b)
built with IntelLLVM 2025.3.3 for Linux x86_64
```

(Master at the time of writing, includes PR #21597 — that fix did eliminate an earlier device-enumeration crash on this hardware, but the row-split crash is separate.)

### Hardware

- 3× Intel Arc Pro B60 24 GB (Battlemage, `bmg-g21`)
- Intel Level Zero v1.x via `intel-level-zero-gpu` / `intel-opencl-icd 26.05.37020.3`
- AMD Ryzen 9 5900XT, 128 GB DDR4
- Ubuntu 24.04, kernel 6.17.0-23-generic, `xe` driver

### Build flags

```
cmake -B build-sycl \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DGGML_SYCL=ON -DGGML_SYCL_TARGET=INTEL \
  -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release \
  -DLLAMA_CURL=OFF
```

### Model

`gpt-oss-120b` MXFP4, 3-shard GGUF (~60 GB on disk).

### Happy to provide

- Full `--verbose` log (it stops at "fitting params to device memory" before crashing, but I can attach it).
- A gdb backtrace if needed — the binary symbols are stripped by default; will rebuild `-DCMAKE_BUILD_TYPE=RelWithDebInfo` if it would help triage.
- Test alternative `--tensor-split` ratios or different `--main-gpu` values.

### Why this matters

PP works but gives single-stream perf (~20 tok/s for gpt-oss-120b on 3× B60). TP would give the expected single-stream speedup, but `ggml_backend_sycl_split_buffer_type` crashes before any kernel launches. This blocks effective use of multi-GPU Intel Arc systems for single-user inference of larger models.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SYCL: -sm row (tensor-parallel) segfaults in ggml_backend_sycl_split_buffer_type on 3x Arc Pro B60 (b422) #23301

What is the issue?

What did you do?

What happened?

Tried

Version

Hardware

Build flags

Model

Happy to provide

Why this matters

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

SYCL: -sm row (tensor-parallel) segfaults in ggml_backend_sycl_split_buffer_type on 3x Arc Pro B60 (b422) #23301

Description

What is the issue?

What did you do?

What happened?

Tried

Version

Hardware

Build flags

Model

Happy to provide

Why this matters

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions