Eval bug: llama-bench CUDA crash with Gemma 4 MXFP4_MOE and -ub 1024 #24174

### Name and Version

$ ~/llama.cpp/build/bin/llama-cli --version
version: 9197 (fcae601e4)
built with GNU 16.1.1 for Linux x86_64

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

Ryzen 5 7500F + RTX 5070

### Models

unsloth/gemma-4-26B-A4B-it-GGUF:MXFP4_MOE

### Problem description & steps to reproduce

`llama-bench` crashes with a CUDA error when benchmarking Gemma 4 with a large ubatch (`-ub 1024`) and multiple thread values (`-t 8,10`).

Repro command:
```
./llama-bench \
  -hf unsloth/gemma-4-26B-A4B-it-GGUF:MXFP4_MOE \
  -ngl 60 \
  -ncmoe 16 \
  -ub 1024 \
  -t 8,10 \
  -p 2048
```

This crashes with:
```
ggml/src/ggml-cuda/ggml-cuda.cu:102: CUDA error
```

The same command with -ub 768 does not crash:
```
./llama-bench \
  -hf unsloth/gemma-4-26B-A4B-it-GGUF:MXFP4_MOE \
  -ngl 60 \
  -ncmoe 16 \
  -ub 768 \
  -t 8,10 \
  -p 2048
```

I did not use fit.
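
To narrow down where between 768 and 1024 the crash starts, a sweep along the following lines might help. This is only a sketch: the `BENCH` variable, the `run_ub` and `sweep` helper names, and the `ub_<N>.log` file names are made up for illustration, and the intermediate values in the usage note are arbitrary choices, not values I have tested.

```shell
#!/usr/bin/env bash
# Hypothetical helper: path to the llama-bench binary (override via the environment).
BENCH="${BENCH:-./llama-bench}"

# Run the repro command once with the given -ub value.
# All output goes to ub_<value>.log; the return code is llama-bench's exit status.
run_ub() {
  "$BENCH" \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:MXFP4_MOE \
    -ngl 60 \
    -ncmoe 16 \
    -ub "$1" \
    -t 8,10 \
    -p 2048 >"ub_$1.log" 2>&1
}

# Try each -ub value in turn and report whether the run exited cleanly.
sweep() {
  local ub
  for ub in "$@"; do
    if run_ub "$ub"; then
      echo "ub=$ub ok"
    else
      echo "ub=$ub FAILED (see ub_$ub.log)"
    fi
  done
}
```

For example, `sweep 768 832 896 960 1024` would run five separate benchmarks, so a crash in one does not abort the rest.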

### First Bad Commit

The crash reproduces all the way back to 9f5f0e689, the commit that added Gemma 4 26B A4B support.

### Relevant log output

<details>
<summary>Logs</summary>


```console
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11771 MiB):
  Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes, VRAM: 11771 MiB
| model                          |       size |     params | backend    | ngl |  n_cpu_moe | threads | n_ubatch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ------: | -------: | --------------: | -------------------: |
/home/progamers/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:102: CUDA error
[New LWP 95134]
[New LWP 95133]
[New LWP 95132]
[New LWP 95130]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fd97dea0a52 in ?? () from /usr/lib/libc.so.6
#0  0x00007fd97dea0a52 in ?? () from /usr/lib/libc.so.6
#1  0x00007fd97de94abc in ?? () from /usr/lib/libc.so.6
#2  0x00007fd97de94b04 in ?? () from /usr/lib/libc.so.6
#3  0x00007fd97df05c6f in wait4 () from /usr/lib/libc.so.6
#4  0x00007fd987f4f3bb in ggml_print_backtrace () from /home/progamers/llama.cpp/build/bin/libggml-base.so.0
#5  0x00007fd987f4f54e in ggml_abort () from /home/progamers/llama.cpp/build/bin/libggml-base.so.0
#6  0x00007fd984025481 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#7  0x00007fd9840355ea in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#8  0x00007fd98403bb22 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#9  0x00007fd98403c4b2 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#10 0x00007fd987f6c57b in ggml_backend_sched_graph_compute_async () from /home/progamers/llama.cpp/build/bin/libggml-base.so.0
#11 0x00007fd987cc9be0 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#12 0x00007fd987ccdc8e in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#13 0x00007fd987cd1eb8 in llama_context::decode(llama_batch const&) () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#14 0x00007fd987cd413e in llama_decode () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#15 0x00005642f96b88eb in test_prompt(llama_context*, int, int, int) ()
#16 0x00005642f96b5280 in main ()
[Inferior 1 (process 95129) detached]
```
</details>
