Eval bug: [SYCL] empty/gibberish output on hybrid models + ggml_sycl_op_mul_mat crash (qwen3next/qwen35 arch) on Intel Arc Pro B60; regression between *pinpointed b9128-b9159* (2026-03-23) and server-intel latest (2026-06-03) #24168

### Name and Version

Working:  build 8477 (ec2b787eb) with IntelLLVM 2025.2.1 — image built 2026-03-23
Broken:   server-intel latest — image built 2026-06-03 (~b9479)

### Operating systems

Linux

### GGML backends

SYCL

### Hardware

2× Intel Arc Pro B60 (24GB each, 46GB combined, PCIe discrete, no XeLink)
AMD Ryzen 7 7700 (host CPU)
64GB DDR5
Level Zero driver 1.15.38308+1
Both GPUs passed via /dev/dri to Docker container

### Models

Qwen3-Coder-Next UD-Q3_K_XL (33.79 GiB, arch=qwen3next) — CRASH on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

Qwen3-Coder-Next Q4_K_M (48.18 GiB, arch=qwen3next) — CRASH on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

Qwen3.6-27B Q4_K_M (15.65 GiB, arch=qwen35) — empty/gibberish content on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

Qwen3.6-35B-A3B UD-Q4_K_M (arch=qwen35) — empty/gibberish content on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Qwen3-Coder-30B-A3B Q4_K_M (arch=qwen3moe, pure attention) — UNAFFECTED on both builds

### Problem description & steps to reproduce

Newer Qwen hybrid models (3.5 and later) were outputting mostly gibberish or crashing. I had tested most of these models previously and they worked. At first I thought I had broken something myself, but I now suspect an image update is the cause.

Two related symptoms affect all hybrid architecture models (identifiable by 
ssm_d_conv / ssm_d_state / ssm_dt_rank fields in print_info) on Intel Arc Pro B60 
using server-intel latest (built 2026-06-03). Both are fixed by pinning to b8477.
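
To check quickly whether a given model falls into the affected group, the load log can be scanned for those fields. This is only a sketch: `is_hybrid_log` is a hypothetical helper name, and it assumes the log contains lines of the form `print_info: ssm_d_conv = 4`. A value of 0 is treated as "no SSM layers" in case a build prints the fields for every architecture.

```shell
# Returns 0 if the log reports a non-zero SSM field (hybrid model), 1 otherwise.
# Assumption: print_info lines look like "print_info: ssm_d_conv = 4".
is_hybrid_log() {
    grep -Eq 'ssm_(d_conv|d_state|dt_rank)[[:space:]]*=[[:space:]]*[1-9]' "$1"
}

# Usage (server.log is a placeholder for a captured llama-server log):
#   if is_hybrid_log server.log; then echo "hybrid: likely affected"; fi
```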

SYMPTOM 1 - CRASH (qwen3next arch, e.g. Qwen3-Coder-Next):
Model loads fully into VRAM, then crashes during the warmup mul_mat on SSM/GDN weights.
The crash is reported at ggml_sycl_op_mul_mat, line:3093.

SYMPTOM 2 - EMPTY/GIBBERISH OUTPUT (qwen35 arch, e.g. Qwen3.6-27B/35B):
Model loads and generates tokens but "content" field in API response is empty or 
contains number spam (e.g. 03129816571033482510583...). All output goes to 
"reasoning_content" as corrupted tokens. On b8477 the same models produce clean 
coherent output in "content".
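
Symptom 2 can be flagged mechanically when testing several builds. The sketch below is a crude grep-based check, not a JSON parser: `classify_response` is a hypothetical helper, and the 20-digit threshold for "number spam" is an arbitrary choice of mine. It reads one chat completion response body on stdin.

```shell
# Prints "empty" if the "content" field is an empty string,
# "digit-spam" if the body contains a run of 20 or more digits, else "ok".
classify_response() {
    body=$(cat)
    if printf '%s' "$body" | grep -Eq '"content"[[:space:]]*:[[:space:]]*""'; then
        echo empty
    elif printf '%s' "$body" | grep -Eq '[0-9]{20,}'; then
        echo digit-spam
    else
        echo ok
    fi
}

# Usage against the server from the repro steps (port 8010):
#   curl -s http://localhost:8010/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"messages":[{"role":"user","content":"Say hello"}],"max_tokens":64}' \
#     | classify_response
```

The pattern requires a quote directly before `content`, so it does not match the `"reasoning_content"` key.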

IMPORTANT CONTEXT - why this is hard to diagnose:
server-intel is a rolling latest tag, so the same image tag can behave completely differently depending on pull date. I don't know exactly when the image broke, but the most recent build (2026-06-03) is broken. About a week ago I tested Coder-Next and, although slow, it worked. Now I can't get any model 3.5 or above to work without gibberish output. My guess is that Coder-Next silently broke when the image updated; I just can't pinpoint when, unless this is already known.

Pure attention MoE models (qwen3moe arch, no SSM layers) are completely unaffected 
on both builds and work correctly throughout.

Steps to reproduce (using Docker):

1. Pull latest server-intel:
   docker pull ghcr.io/ggml-org/llama.cpp:server-intel

2. Run with dual Arc Pro B60:

   docker run --rm \
     --device /dev/dri:/dev/dri \
     --group-add 993 --group-add 44 \
     -e "ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1" \
     -e ZES_ENABLE_SYSMAN=1 \
     -e UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 \
     -e GGML_SYCL_ENABLE_LEVEL_ZERO=1 \
     -e GGML_SYCL_ENABLE_VMM=1 \
     -e GGML_SYCL_DISABLE_GRAPH=1 \
     -p 8010:8010 \
     ghcr.io/ggml-org/llama.cpp:server-intel \
     --model /models/Qwen3-Coder-Next-UD-Q3_K_XL.gguf \
     --ctx-size 8192 \
     --host 0.0.0.0 \
     --port 8010 \
     -ngl 999 \
     --split-mode layer \
     --tensor-split 1,1 \
     --flash-attn off \
     --threads 1

3. For qwen3next: crash during warmup at ggml_sycl_op_mul_mat line:3093
   For qwen35: server starts, responds, but content field is empty or number spam

4. Replace image with server-intel-b8477 — identical command — both symptoms fixed

---

Side note for Ollama SYCL users:

This same regression affects users of https://github.com/eleiton/ollama-intel-arc
— an excellent community-maintained project providing native SYCL Ollama builds 
for Intel Arc GPUs. It is not a bug in that project. Ollama 0.30.2 pins 
LLAMA_CPP_VERSION=b9479 (confirmed via 
https://raw.githubusercontent.com/ollama/ollama/v0.30.2/LLAMA_CPP_VERSION) which 
pulls the broken llama.cpp commit via FetchContent at build time.

Workaround: add this to ollama-sycl/Dockerfile after WORKDIR /ollama, then rebuild:

    RUN echo "ec2b787eb" > LLAMA_CPP_VERSION

Once this regression is fixed in llama.cpp master, Ollama SYCL builds will 
pick up the fix as soon as the pinned LLAMA_CPP_VERSION points at a commit that includes it.

### First Bad Commit

Working: server-intel-b8477 (ec2b787eb), image built 2026-03-23
Broken:  server-intel latest (~b9479), image built 2026-06-03

Regression window: somewhere between 2026-03-23 and 2026-06-03.
llama.cpp ships multiple builds per day so this window contains 
hundreds of commits. Have not bisected further.
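
For a rough sense of scale, the build numbers can be subtracted and turned into an approximate bisect step count. This assumes the bNNNN build numbers track the commit count and takes b9479 at face value, even though it is only approximate in this report. `bisect_steps` is a hypothetical helper that computes ceil(log2(n)).

```shell
# Smallest number of halvings needed to narrow n candidates down to one.
bisect_steps() {
    n=$1
    steps=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        steps=$((steps + 1))
    done
    echo "$steps"
}

window=$((9479 - 8477))   # broken (approximate) minus working build number
echo "builds in window: $window"
echo "approx bisect steps: $(bisect_steps "$window")"
```

With these numbers the window is 1002 builds, which works out to about 10 bisect steps.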

A git bisect focusing on changes to ggml/src/ggml-sycl/ggml-sycl.cpp 
around the ggml_sycl_op_mul_mat function (line ~3093 in broken build) 
would likely narrow it down quickly.
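
If someone wants to automate that bisect, `git bisect run` needs a script whose exit code says good (0), bad (1) or skip (125). The sketch below only covers the crash symptom and judges a captured server log. `verdict_from_log` and the wrapper script named in the comments are hypothetical, and the matched strings are the ones from the logs in this report.

```shell
# Exit-code convention for `git bisect run`: 0 = good, 1 = bad, 125 = skip.
verdict_from_log() {
    log=$1
    if grep -q 'in function ggml_sycl_op_mul_mat' "$log"; then
        return 1      # crash reproduced: mark this commit bad
    elif grep -q 'server is listening' "$log"; then
        return 0      # server came up: mark this commit good
    else
        return 125    # build failure or unclear run: skip this commit
    fi
}

# Usage, limiting the bisect to the SYCL backend directory
# (BAD is whichever broken tag or commit you start from):
#   git bisect start "$BAD" b8477 -- ggml/src/ggml-sycl
#   git bisect run ./build-and-test.sh   # hypothetical wrapper ending in verdict_from_log
```

The gibberish symptom does not crash, so catching it would need a separate check on the API response.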

Related: #21474 (same crash class on iGPU/Meteor Lake; that reporter 
also identified b8477 as working). I tried this image on that basis to see if it would work, and it did!

### Relevant log output

<details>
<summary>Logs</summary>

```console
--- CRASH on latest (qwen3next arch) ---

load_tensors: offloaded 49/49 layers to GPU
load_tensors:        SYCL0 model buffer size = 17595.44 MiB
load_tensors:        SYCL1 model buffer size = 16685.41 MiB
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: graph nodes  = 5108
common_init_from_params: warming up the model with an empty run...
Exception caught at ggml/src/ggml-sycl/ggml-sycl.cpp, line:3093
SYCL error: CHECK_TRY_ERROR(op(ctx, src0, src1, dst, ...stream))
in function ggml_sycl_op_mul_mat
signal: aborted (core dumped)

--- EMPTY CONTENT on latest (qwen35 arch) ---

{"choices":[{"finish_reason":"length","index":0,
"message":{"role":"assistant","content":"",
"reasoning_content":"1. = 4. = 3. = 4. = 3. = 4..."}}]}

content is empty, reasoning_content contains number spam

--- BOTH FIXED on server-intel-b8477 (identical commands) ---

qwen3next: server is listening, no crash, ~10 t/s clean output
qwen35:    server is listening, clean text in content field,
           full HTML page generated successfully
```

</details>
