Eval bug: Nemotron Ultra compute buffer size explodes above batch-size 16 #24175

@timkhronos

Name and Version

version: 9525 (ad1b88c)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

CUDA

Hardware

Ryzen 9 9950X3D (CPU) and RTX 5090 (GPU)

Models

Nemotron 3 ultra 550B

Problem description & steps to reproduce

When running the server with a batch size of 16 (-b 16), the CUDA compute buffer size is about 42 MiB. With any batch size above 16, however, the CUDA compute (pp) buffer explodes to roughly 7 to 7.5 GB, regardless of the exact value set: it stays in that range at -b 32, -b 1024, and even -b 2048. On this machine the larger allocation fails with an out-of-memory error, so the context cannot be initialized.
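To put the jump in numbers, here is a small Python sketch using the two figures from the log output in this report: the 42.06 MiB compute buffer at -b 16 and the failed 7273808000-byte reservation at larger batch sizes. The variable and function names are illustrative only and are not part of llama.cpp.

```python
# Unit check on the two buffer sizes reported in the logs.
MIB = 1024 * 1024  # bytes per MiB

def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to MiB."""
    return n_bytes / MIB

small_mib = 42.06            # CUDA0 compute buffer at -b 16 (from the log)
large_bytes = 7_273_808_000  # failed reservation at -b > 16 (from the log)

large_mib = bytes_to_mib(large_bytes)
large_gb = large_bytes / 1e9          # decimal gigabytes
growth = large_mib / small_mib        # how many times larger than at -b 16

print(f"{large_mib:.2f} MiB = {large_gb:.2f} GB, ~{growth:.0f}x the -b 16 buffer")
```

The byte count matches the 6936.84 MiB reported by the allocator, which is about 7.27 GB in decimal units, i.e. roughly 165 times the buffer reserved at -b 16.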

First Bad Commit

No response

Relevant log output

When -b 16

llama_context:  CUDA_Host  output buffer size =     2.00 MiB
3.32.601.226 I llama_kv_cache:      CUDA0 KV buffer size =   189.00 MiB
3.32.607.581 I llama_kv_cache: size =  189.00 MiB ( 16128 cells,  12 layers,  4/1 seqs), K (f16):   94.50 MiB, V (f16):   94.50 MiB
3.32.607.582 I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
3.32.607.582 I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
3.32.608.680 I llama_memory_recurrent:      CUDA0 RS buffer size =  1576.50 MiB
3.32.608.684 I llama_memory_recurrent: size = 1576.50 MiB (     4 cells, 108 layers,  4 seqs  0 rs_seq), R (f32):   40.50 MiB, S (f32): 1536.00 MiB
3.32.608.692 I sched_reserve: reserving ...
3.32.610.585 I sched_reserve: Flash Attention was auto, set to enabled
3.32.610.586 I sched_reserve: resolving fused Gated Delta Net support:
3.32.611.270 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
3.32.611.944 I sched_reserve: fused Gated Delta Net (chunked) enabled
3.32.617.239 I sched_reserve:      CUDA0 compute buffer size =    42.06 MiB
3.32.617.240 I sched_reserve:  CUDA_Host compute buffer size =    11.26 MiB

When -b >16

llama_context:  CUDA_Host  output buffer size =     2.00 MiB
3.33.548.526 I llama_kv_cache:      CUDA0 KV buffer size =   189.00 MiB
3.33.559.237 I llama_kv_cache: size =  189.00 MiB ( 16128 cells,  12 layers,  4/1 seqs), K (f16):   94.50 MiB, V (f16):   94.50 MiB
3.33.559.239 I llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 128
3.33.559.239 I llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
3.33.560.612 I llama_memory_recurrent:      CUDA0 RS buffer size =  1576.50 MiB
3.33.560.618 I llama_memory_recurrent: size = 1576.50 MiB (     4 cells, 108 layers,  4 seqs  0 rs_seq), R (f32):   40.50 MiB, S (f32): 1536.00 MiB
3.33.560.626 I sched_reserve: reserving ...
3.33.562.568 I sched_reserve: Flash Attention was auto, set to enabled
3.33.562.569 I sched_reserve: resolving fused Gated Delta Net support:
3.33.563.282 I sched_reserve: fused Gated Delta Net (autoregressive) enabled
3.33.563.983 I sched_reserve: fused Gated Delta Net (chunked) enabled
3.33.565.482 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6936.84 MiB on device 0: cudaMalloc failed: out of memory
3.33.565.486 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 7273808000
3.33.565.486 E graph_reserve: failed to allocate compute buffers
3.33.567.433 E llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
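For comparing runs at several -b values, a rough Python sketch like the one below can pull the relevant numbers out of a captured log. The regular expressions are written only against the line formats shown above and are an assumption about what other builds print; `summarize` is a hypothetical helper, not a llama.cpp tool.

```python
import re

# Matches e.g. "sched_reserve:      CUDA0 compute buffer size =    42.06 MiB"
COMPUTE_RE = re.compile(
    r"sched_reserve:\s+(\S+)\s+compute buffer size =\s+([\d.]+) MiB"
)
# Matches e.g. "allocating 6936.84 MiB on device 0: cudaMalloc failed"
ALLOC_FAIL_RE = re.compile(
    r"allocating ([\d.]+) MiB on device (\d+): cudaMalloc failed"
)

def summarize(log_text: str):
    """Return ({backend: MiB reserved}, [(device, MiB that failed to allocate)])."""
    reserved = {m.group(1): float(m.group(2)) for m in COMPUTE_RE.finditer(log_text)}
    failed = [(int(m.group(2)), float(m.group(1)))
              for m in ALLOC_FAIL_RE.finditer(log_text)]
    return reserved, failed

# Lines copied from the two log excerpts in this report.
LOG_B16 = """\
3.32.617.239 I sched_reserve:      CUDA0 compute buffer size =    42.06 MiB
3.32.617.240 I sched_reserve:  CUDA_Host compute buffer size =    11.26 MiB
"""
LOG_B32 = """\
3.33.565.482 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 6936.84 MiB on device 0: cudaMalloc failed: out of memory
3.33.565.486 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 7273808000
"""

ok_reserved, ok_failed = summarize(LOG_B16)
bad_reserved, bad_failed = summarize(LOG_B32)
print("-b 16 :", ok_reserved, ok_failed)
print("-b >16:", bad_reserved, bad_failed)
```

In the failing run no "compute buffer size" line is printed at all, so the size has to be read from the allocator error instead.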
