When -b 16
llama_context: CUDA_Host output buffer size = 2.00 MiB
3.32.601.226 I llama_kv_cache: CUDA0 KV buffer size = 189.00 MiB
3.32.607.581 I llama_kv_cache: size = 189.00 MiB ( 16128 cells, 12 layers, 4/1 seqs), K (f16): 94.50 MiB, V (f16): 94.50 MiB
3.32.607.582 I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
3.32.607.582 I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
3.32.608.680 I llama_memory_recurrent: CUDA0 RS buffer size = 1576.50 MiB
3.32.608.684 I llama_memory_recurrent: size = 1576.50 MiB ( 4 cells, 108 layers, 4 seqs 0 rs_seq), R (f32): 40.50 MiB, S (f32): 1536.00 MiB
3.32.608.692 I sched_reserve: reserving ...
3.32.610.585 I sched_reserve: Flash Attention was auto, set to enabled
3.32.610.586 I sched_reserve: resolving fused Gated Delta Net support:
3.32.611.270 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
3.32.611.944 I sched_reserve: fused Gated Delta Net (chunked) enabled
3.32.617.239 I sched_reserve: CUDA0 compute buffer size = 42.06 MiB
3.32.617.240 I sched_reserve: CUDA_Host compute buffer size = 11.26 MiB
When -b >16
llama_context: CUDA_Host output buffer size = 2.00 MiB
3.33.548.526 I llama_kv_cache: CUDA0 KV buffer size = 189.00 MiB
3.33.559.237 I llama_kv_cache: size = 189.00 MiB ( 16128 cells, 12 layers, 4/1 seqs), K (f16): 94.50 MiB, V (f16): 94.50 MiB
3.33.559.239 I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
3.33.559.239 I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
3.33.560.612 I llama_memory_recurrent: CUDA0 RS buffer size = 1576.50 MiB
3.33.560.618 I llama_memory_recurrent: size = 1576.50 MiB ( 4 cells, 108 layers, 4 seqs 0 rs_seq), R (f32): 40.50 MiB, S (f32): 1536.00 MiB
3.33.560.626 I sched_reserve: reserving ...
3.33.562.568 I sched_reserve: Flash Attention was auto, set to enabled
3.33.562.569 I sched_reserve: resolving fused Gated Delta Net support:
3.33.563.282 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
3.33.563.983 I sched_reserve: fused Gated Delta Net (chunked) enabled
3.33.565.482 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6936.84 MiB on device 0: cudaMalloc failed: out of memory
3.33.565.486 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 7273808000
3.33.565.486 E graph_reserve: failed to allocate compute buffers
3.33.567.433 E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
Name and Version
version: 9525 (ad1b88c)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
Ryzen 9 9950X3D (CPU) and RTX 5090 (GPU)
Models
Nemotron 3 ultra 550B
Problem description & steps to reproduce
When running the server with a batch size of 16 (-b 16), the CUDA compute buffer size is about 42 MiB. With any batch size above 16, however, the CUDA compute (pp) buffer jumps to roughly 7 to 7.5 GB, regardless of the exact value set: it stays in that range at -b 32, -b 1024 and even -b 2048. At that size the allocation fails with a cudaMalloc out-of-memory error and the context cannot be initialized.
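As a quick sanity check of the figures reported here, the sketch below converts the failed allocation from bytes to MiB and compares it with the compute buffer that was reserved successfully at -b 16. All numbers are copied from the log output in this report; the variable names are illustrative only, and the totals cover just the KV, recurrent-state and compute buffers (model weights are not included).

```python
MIB = 1024 * 1024  # bytes per MiB, the unit used in the llama.cpp log lines

# Figures copied from the logs in this report.
failed_alloc_bytes = 7273808000  # ggml_gallocr_reserve_n_impl, -b > 16
ok_compute_mib = 42.06           # sched_reserve CUDA0 compute buffer, -b 16
kv_mib = 189.00                  # llama_kv_cache CUDA0 KV buffer
rs_mib = 1576.50                 # llama_memory_recurrent CUDA0 RS buffer

# Size of the failed compute buffer in MiB, and how much larger it is
# than the buffer that was reserved at -b 16.
failed_mib = failed_alloc_bytes / MIB
growth = failed_mib / ok_compute_mib

# Combined size of the three CUDA0 buffers in each run (weights excluded).
total_ok = kv_mib + rs_mib + ok_compute_mib
total_bad = kv_mib + rs_mib + failed_mib

print(f"failed compute buffer: {failed_mib:.2f} MiB")
print(f"growth vs -b 16:       {growth:.0f}x")
print(f"buffers at -b 16:      {total_ok:.2f} MiB")
print(f"buffers at -b > 16:    {total_bad:.2f} MiB")
```

The KV and recurrent-state buffers are identical in both runs, so the difference between the two totals comes entirely from the compute buffer.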
First Bad Commit
No response
Relevant log output
Logs
When -b 16
When -b >16