Eval bug: Nemotron Ultra compute buffer size explodes above batch-size 16

### Name and Version

version: 9525 (ad1b88ca0)
built with GNU 13.3.0 for Linux x86_64

### Operating systems

Linux

### GGML backends

CUDA

### Hardware

Ryzen 9 9950X3D (CPU) and RTX 5090 (GPU)

### Models

Nemotron 3 Ultra 550B

### Problem description & steps to reproduce

When running the server with a batch size of 16 (`-b 16`), the CUDA compute buffer size is about 42 MiB. With any batch size above 16, however, the CUDA compute (pp) buffer grows to between 7 and 7.5 GB, regardless of the exact value set. It stays in that range at `-b 32`, `-b 1024`, and even `-b 2048`, and the allocation then fails with an out-of-memory error.
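To put the two numbers in the same units, here is a minimal sketch that converts the failed allocation size from the log (7273808000 bytes) and compares it with the 42.06 MiB buffer reported at `-b 16`. The variable names are only illustrative; the two input values are copied from the log output in this report.

```python
# Unit constants: binary (MiB, GiB) and decimal (GB).
MIB = 1024 ** 2
GIB = 1024 ** 3
GB = 1000 ** 3

# Values copied from the logs in this report.
failed_alloc_bytes = 7273808000  # "failed to allocate CUDA0 buffer of size ..."
small_buffer_mib = 42.06         # CUDA0 compute buffer size at -b 16

failed_mib = failed_alloc_bytes / MIB
failed_gib = failed_alloc_bytes / GIB
failed_gb = failed_alloc_bytes / GB
ratio = failed_mib / small_buffer_mib

print(f"{failed_mib:.2f} MiB = {failed_gib:.2f} GiB = {failed_gb:.2f} GB")
print(f"growth relative to the -b 16 buffer: about {ratio:.0f}x")
```

The MiB figure matches the 6936.84 MiB that `ggml_backend_cuda_buffer_type_alloc_buffer` reports, which is roughly 165 times the buffer size at `-b 16`.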

### First Bad Commit

_No response_

### Relevant log output

<details>
<summary>Logs</summary>


With `-b 16`:
```console
llama_context:  CUDA_Host  output buffer size =     2.00 MiB
3.32.601.226 I llama_kv_cache:      CUDA0 KV buffer size =   189.00 MiB
3.32.607.581 I llama_kv_cache: size =  189.00 MiB ( 16128 cells,  12 layers,  4/1 seqs), K (f16):   94.50 MiB, V (f16):   94.50 MiB
3.32.607.582 I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
3.32.607.582 I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
3.32.608.680 I llama_memory_recurrent:      CUDA0 RS buffer size =  1576.50 MiB
3.32.608.684 I llama_memory_recurrent: size = 1576.50 MiB (     4 cells, 108 layers,  4 seqs  0 rs_seq), R (f32):   40.50 MiB, S (f32): 1536.00 MiB
3.32.608.692 I sched_reserve: reserving ...
3.32.610.585 I sched_reserve: Flash Attention was auto, set to enabled
3.32.610.586 I sched_reserve: resolving fused Gated Delta Net support:
3.32.611.270 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
3.32.611.944 I sched_reserve: fused Gated Delta Net (chunked) enabled
3.32.617.239 I sched_reserve:      CUDA0 compute buffer size =    42.06 MiB
3.32.617.240 I sched_reserve:  CUDA_Host compute buffer size =    11.26 MiB

```

With `-b` above 16:

```console
llama_context:  CUDA_Host  output buffer size =     2.00 MiB
3.33.548.526 I llama_kv_cache:      CUDA0 KV buffer size =   189.00 MiB
3.33.559.237 I llama_kv_cache: size =  189.00 MiB ( 16128 cells,  12 layers,  4/1 seqs), K (f16):   94.50 MiB, V (f16):   94.50 MiB
3.33.559.239 I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
3.33.559.239 I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
3.33.560.612 I llama_memory_recurrent:      CUDA0 RS buffer size =  1576.50 MiB
3.33.560.618 I llama_memory_recurrent: size = 1576.50 MiB (     4 cells, 108 layers,  4 seqs  0 rs_seq), R (f32):   40.50 MiB, S (f32): 1536.00 MiB
3.33.560.626 I sched_reserve: reserving ...
3.33.562.568 I sched_reserve: Flash Attention was auto, set to enabled
3.33.562.569 I sched_reserve: resolving fused Gated Delta Net support:
3.33.563.282 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
3.33.563.983 I sched_reserve: fused Gated Delta Net (chunked) enabled
3.33.565.482 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6936.84 MiB on device 0: cudaMalloc failed: out of memory
3.33.565.486 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 7273808000
3.33.565.486 E graph_reserve: failed to allocate compute buffers
3.33.567.433 E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers


```


</details>
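For comparing more batch sizes, the relevant sizes can be pulled out of the logs automatically. The following is a rough sketch that assumes the log line formats shown above; `extract_buffer_sizes` is a hypothetical helper written for this report, not part of llama.cpp.

```python
import re

# Matches e.g. "sched_reserve:      CUDA0 compute buffer size =    42.06 MiB"
COMPUTE_RE = re.compile(
    r"sched_reserve:\s+(\S+)\s+compute buffer size\s*=\s*([\d.]+) MiB"
)
# Matches e.g. "failed to allocate CUDA0 buffer of size 7273808000"
FAILED_RE = re.compile(r"failed to allocate (\S+) buffer of size (\d+)")


def extract_buffer_sizes(log_text):
    """Return (reserved, failed): dicts mapping device name to size in MiB."""
    reserved = {
        m.group(1): float(m.group(2)) for m in COMPUTE_RE.finditer(log_text)
    }
    failed = {
        m.group(1): int(m.group(2)) / 1024 ** 2
        for m in FAILED_RE.finditer(log_text)
    }
    return reserved, failed


# Sample lines copied from the two logs in this report.
sample = """\
3.32.617.239 I sched_reserve:      CUDA0 compute buffer size =    42.06 MiB
3.32.617.240 I sched_reserve:  CUDA_Host compute buffer size =    11.26 MiB
3.33.565.486 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 7273808000
"""

reserved, failed = extract_buffer_sizes(sample)
print("reserved (MiB):", reserved)
print("failed (MiB):", failed)
```

Running this over the server output for each `-b` value would show at which batch size the CUDA0 compute buffer jumps.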
