Eval bug: [SYCL] empty/gibberish output on hybrid models + ggml_sycl_op_mul_mat crash (qwen3next/qwen35 arch) on Intel Arc Pro B60 — regression between *pinpointed b9128- b9159* (2026-03-23) and server-intel latest (2026-06-03) #24168

@LVLUP-Studio

Name and Version

Working: build 8477 (ec2b787) with IntelLLVM 2025.2.1 — image built 2026-03-23
Broken: server-intel latest — image built 2026-06-03 (~b9479)

Operating systems

Linux

GGML backends

SYCL

Hardware

2× Intel Arc Pro B60 (24GB each, 46GB combined, PCIe discrete, no XeLink)
AMD Ryzen 7 7700 (host CPU)
64GB DDR5
Level Zero driver 1.15.38308+1
Both GPUs passed via /dev/dri to Docker container

Models

Qwen3-Coder-Next UD-Q3_K_XL (33.79 GiB, arch=qwen3next) — CRASH on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

Qwen3-Coder-Next Q4_K_M (48.18 GiB, arch=qwen3next) — CRASH on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF

Qwen3.6-27B Q4_K_M (15.65 GiB, arch=qwen35) — empty/gibberish content on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

Qwen3.6-35B-A3B UD-Q4_K_M (arch=qwen35) — empty/gibberish content on latest, FIXED on b8477
https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

Qwen3-Coder-30B-A3B Q4_K_M (arch=qwen3moe, pure attention) — UNAFFECTED on both builds

Problem description & steps to reproduce

Newer Qwen hybrid models (3.5 and later) were outputting mostly gibberish or crashing. I had tested most of these models previously and they worked. At first I thought I had misconfigured something, but the cause appears to be an image update.

Two related symptoms affect all hybrid architecture models (identifiable by
ssm_d_conv / ssm_d_state / ssm_dt_rank fields in print_info) on Intel Arc Pro B60
using server-intel latest (built 2026-06-03). Both are fixed by pinning to b8477.
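
To check quickly whether a given model falls into the affected group, the startup log can be scanned for the SSM fields named above. This is a minimal sketch: the line shape "print_info: ssm_d_conv = 4" is an assumption about the log format, the helper names are hypothetical, and treating a value of 0 as "not hybrid" is a guess for builds that print the fields unconditionally.

```python
import re

# Field names taken from the report; anything else starting with ssm_ is ignored.
SSM_FIELDS = ("ssm_d_conv", "ssm_d_state", "ssm_dt_rank")

# Assumed log line shape: "print_info: ssm_d_conv = 4"
_LINE = re.compile(r"print_info:\s*(ssm_\w+)\s*=\s*(\d+)")


def ssm_fields(log_text: str) -> dict[str, int]:
    """Collect the SSM-related print_info fields found in a server log."""
    found = {}
    for name, value in _LINE.findall(log_text):
        if name in SSM_FIELDS:
            found[name] = int(value)
    return found


def looks_hybrid(log_text: str) -> bool:
    """True if any SSM field is present with a non-zero value."""
    return any(value > 0 for value in ssm_fields(log_text).values())
```

Feeding it the captured output of a model load should separate the qwen3next/qwen35 models from qwen3moe without reading the log by hand.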

SYMPTOM 1 - CRASH (qwen3next arch, e.g. Qwen3-Coder-Next):
Model loads fully into VRAM then crashes during warmup mul_mat on SSM/GDN weights.
crash at ggml_sycl_op_mul_mat line:3093.

SYMPTOM 2 - EMPTY/GIBBERISH OUTPUT (qwen35 arch, e.g. Qwen3.6-27B/35B):
Model loads and generates tokens but "content" field in API response is empty or
contains number spam (e.g. 03129816571033482510583...). All output goes to
"reasoning_content" as corrupted tokens. On b8477 the same models produce clean
coherent output in "content".
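
Symptom 2 can be detected automatically from the API response instead of by eye. The sketch below assumes the chat completion JSON shape shown in the log section of this report; the function names and the 0.8 threshold are hypothetical choices, not part of llama.cpp.

```python
import json


def looks_like_number_spam(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: most non-space characters are digits or simple punctuation."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 20:  # too short to judge
        return False
    noise = sum(c.isdigit() or c in ".=,-" for c in chars)
    return noise / len(chars) >= threshold


def classify_response(body: str) -> str:
    """Return 'empty_content', 'number_spam', or 'ok' for one response body."""
    message = json.loads(body)["choices"][0]["message"]
    content = (message.get("content") or "").strip()
    if not content:
        return "empty_content"
    if looks_like_number_spam(content):
        return "number_spam"
    return "ok"
```

Running the same prompt against both images and comparing the returned labels gives a simple pass/fail signal for this symptom.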

IMPORTANT CONTEXT - why this is hard to diagnose:
server-intel is a rolling latest tag. I am not sure when the image broke, but pulling the most recent build (2026-06-03) gives the broken behavior. Same
image tag, completely different behavior depending on pull date. About a week ago I tested Coder-Next and, although slow, it worked. Now no model from the 3.5 generation or later runs without gibberish output. Coder-Next had been working in earlier sessions, then
silently broke, presumably when the image updated. I can't pinpoint when, unless this is already known.

Pure attention MoE models (qwen3moe arch, no SSM layers) are completely unaffected
on both builds and work correctly throughout.

Steps to reproduce (using Docker):

  1. Pull latest server-intel:
    docker pull ghcr.io/ggml-org/llama.cpp:server-intel

  2. Run with dual Arc Pro B60:

    # ONEAPI_DEVICE_SELECTOR is quoted so the shell does not treat ';' as a command separator.
    docker run --rm \
      --device /dev/dri:/dev/dri \
      --group-add 993 --group-add 44 \
      -e "ONEAPI_DEVICE_SELECTOR=level_zero:0;level_zero:1" \
      -e ZES_ENABLE_SYSMAN=1 \
      -e UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1 \
      -e GGML_SYCL_ENABLE_LEVEL_ZERO=1 \
      -e GGML_SYCL_ENABLE_VMM=1 \
      -e GGML_SYCL_DISABLE_GRAPH=1 \
      -p 8010:8010 \
      ghcr.io/ggml-org/llama.cpp:server-intel \
      --model /models/Qwen3-Coder-Next-UD-Q3_K_XL.gguf \
      --ctx-size 8192 \
      --host 0.0.0.0 \
      --port 8010 \
      -ngl 999 \
      --split-mode layer \
      --tensor-split 1,1 \
      --flash-attn off \
      --threads 1

  3. For qwen3next: crash during warmup at ggml_sycl_op_mul_mat line:3093
    For qwen35: server starts, responds, but content field is empty or number spam

  4. Replace image with server-intel-b8477 — identical command — both symptoms fixed
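
To make the "identical command, different image" comparison in the steps above less error-prone, the command can be generated from one function with only the tag as a parameter. This is a sketch: docker_argv is a hypothetical helper, and like the command in this report it does not mount a model directory, so a volume flag would need to be added for your setup. Passing an argument list to subprocess also avoids shell quoting problems with the ';' in ONEAPI_DEVICE_SELECTOR.

```python
def docker_argv(image_tag: str, model_path: str) -> list[str]:
    """Build the docker run argument list from the report for a given image tag."""
    env = {
        "ONEAPI_DEVICE_SELECTOR": "level_zero:0;level_zero:1",
        "ZES_ENABLE_SYSMAN": "1",
        "UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS": "1",
        "GGML_SYCL_ENABLE_LEVEL_ZERO": "1",
        "GGML_SYCL_ENABLE_VMM": "1",
        "GGML_SYCL_DISABLE_GRAPH": "1",
    }
    argv = ["docker", "run", "--rm",
            "--device", "/dev/dri:/dev/dri",
            "--group-add", "993", "--group-add", "44"]
    for key, value in env.items():
        argv += ["-e", f"{key}={value}"]
    argv += ["-p", "8010:8010",
             f"ghcr.io/ggml-org/llama.cpp:{image_tag}",
             "--model", model_path,
             "--ctx-size", "8192",
             "--host", "0.0.0.0", "--port", "8010",
             "-ngl", "999",
             "--split-mode", "layer",
             "--tensor-split", "1,1",
             "--flash-attn", "off",
             "--threads", "1"]
    return argv


if __name__ == "__main__":
    # Print both variants; run either with subprocess.run(argv, check=True).
    for tag in ("server-intel", "server-intel-b8477"):
        print(" ".join(docker_argv(tag, "/models/Qwen3-Coder-Next-UD-Q3_K_XL.gguf")))
```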


Side note for Ollama SYCL users:

This same regression affects users of https://github.com/eleiton/ollama-intel-arc
— an excellent community-maintained project providing native SYCL Ollama builds
for Intel Arc GPUs. It is not a bug in that project. Ollama 0.30.2 pins
LLAMA_CPP_VERSION=b9479 (confirmed via
https://raw.githubusercontent.com/ollama/ollama/v0.30.2/LLAMA_CPP_VERSION) which
pulls the broken llama.cpp commit via FetchContent at build time.

Workaround: add this to ollama-sycl/Dockerfile after WORKDIR /ollama, then rebuild:

RUN echo "ec2b787eb" > LLAMA_CPP_VERSION

Once this regression is fixed in llama.cpp master, Ollama SYCL builds will
automatically pick up the fix.

First Bad Commit

Working: server-intel-b8477 (ec2b787), image built 2026-03-23
Broken: server-intel latest (~b9479), image built 2026-06-03

Regression window: somewhere between 2026-03-23 and 2026-06-03.
llama.cpp ships multiple builds per day so this window contains
hundreds of commits. Have not bisected further.

A git bisect focusing on changes to ggml/src/ggml-sycl/ggml-sycl.cpp
around the ggml_sycl_op_mul_mat function (line ~3093 in broken build)
would likely narrow it down quickly.
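
For anyone without a source checkout, the same narrowing could be done over published image tags rather than commits. The sketch below is a plain binary search over build numbers. It assumes the regression is monotonic (every build before the first bad one works, every build after it fails) and that a tag such as server-intel-bNNNN exists for the builds tested, which is not guaranteed for every number. The is_good callback is hypothetical: it would pull the image and run the reproduction.

```python
from typing import Callable


def first_bad_build(good: int, bad: int, is_good: Callable[[int], bool]) -> int:
    """Return the smallest build number in (good, bad] that fails.

    `good` must be a known-working build and `bad` a known-broken one.
    """
    if good >= bad:
        raise ValueError("good build must be lower than bad build")
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_good(mid):
            good = mid   # regression is after mid
        else:
            bad = mid    # regression is at or before mid
    return bad
```

With the endpoints from this report (8477 working, roughly 9479 broken), the search needs at most 10 test runs.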

Related: #21474 (same crash class on iGPU/Meteor Lake; that reporter
also identified b8477 as working). I tried that image to see if it would work here, and it did.

Relevant log output

--- CRASH on latest (qwen3next arch) ---

load_tensors: offloaded 49/49 layers to GPU
load_tensors: SYCL0 model buffer size = 17595.44 MiB
load_tensors: SYCL1 model buffer size = 16685.41 MiB
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: graph nodes = 5108
common_init_from_params: warming up the model with an empty run...
Exception caught at ggml/src/ggml-sycl/ggml-sycl.cpp, line:3093
SYCL error: CHECK_TRY_ERROR(op(ctx, src0, src1, dst, ...stream))
in function ggml_sycl_op_mul_mat
signal: aborted (core dumped)

--- EMPTY CONTENT on latest (qwen35 arch) ---

{"choices":[{"finish_reason":"length","index":0,
"message":{"role":"assistant","content":"",
"reasoning_content":"1. = 4. = 3. = 4. = 3. = 4..."}}]}

content is empty, reasoning_content contains number spam

--- BOTH FIXED on server-intel-b8477 (identical commands) ---

qwen3next: server is listening, no crash, ~10 t/s clean output
qwen35: server is listening, clean text in content field,
full HTML page generated successfully
