ggml_cuda_init: found 1 CUDA devices (Total VRAM: 11771 MiB):
Device 0: NVIDIA GeForce RTX 5070, compute capability 12.0, VMM: yes, VRAM: 11771 MiB
| model | size | params | backend | ngl | n_cpu_moe | threads | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ------: | -------: | --------------: | -------------------: |
/home/progamers/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:102: CUDA error
[New LWP 95134]
[New LWP 95133]
[New LWP 95132]
[New LWP 95130]
This GDB supports auto-downloading debuginfo from the following URLs:
<https://debuginfod.archlinux.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007fd97dea0a52 in ?? () from /usr/lib/libc.so.6
#0 0x00007fd97dea0a52 in ?? () from /usr/lib/libc.so.6
#1 0x00007fd97de94abc in ?? () from /usr/lib/libc.so.6
#2 0x00007fd97de94b04 in ?? () from /usr/lib/libc.so.6
#3 0x00007fd97df05c6f in wait4 () from /usr/lib/libc.so.6
#4 0x00007fd987f4f3bb in ggml_print_backtrace () from /home/progamers/llama.cpp/build/bin/libggml-base.so.0
#5 0x00007fd987f4f54e in ggml_abort () from /home/progamers/llama.cpp/build/bin/libggml-base.so.0
#6 0x00007fd984025481 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#7 0x00007fd9840355ea in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#8 0x00007fd98403bb22 in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#9 0x00007fd98403c4b2 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/progamers/llama.cpp/build/bin/libggml-cuda.so.0
#10 0x00007fd987f6c57b in ggml_backend_sched_graph_compute_async () from /home/progamers/llama.cpp/build/bin/libggml-base.so.0
#11 0x00007fd987cc9be0 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#12 0x00007fd987ccdc8e in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#13 0x00007fd987cd1eb8 in llama_context::decode(llama_batch const&) () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#14 0x00007fd987cd413e in llama_decode () from /home/progamers/llama.cpp/build/bin/libllama.so.0
#15 0x00005642f96b88eb in test_prompt(llama_context*, int, int, int) ()
#16 0x00005642f96b5280 in main ()
[Inferior 1 (process 95129) detached]
Name and Version
$ ~/llama.cpp/build/bin/llama-cli --version
version: 9197 (fcae601)
built with GNU 16.1.1 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 5 7500F + RTX 5070
Models
unsloth/gemma-4-26B-A4B-it-GGUF:MXFP4_MOE
Problem description & steps to reproduce
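llama-bench crashes with a CUDA error when benchmarking Gemma 4 with a large ubatch and multiple thread values. The backtrace shows the abort is raised from ggml_cuda_mul_mat during the prompt-processing test (test_prompt).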
Repro command:
This crashes with:
The same command with -ub 768 does not crash:
I did not use fit.
First Bad Commit
The crash reproduces all the way back to 9f5f0e6, the commit where Gemma 4 26B A4B support was added.
Relevant log output
Logs