Name and Version
version: 9512 (0dbfa66)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-cli
Command line
llama-vulkan/llama-cli.exe -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q5_K_XL -ngl all --fit on -lv 4 --no-mmap --fa on --cache-type-k x --cache-type-v x --spec-type draft-mtp --spec-draft-n-max 2 --spec-draft-type-k x --spec-draft-type-v x
Problem description & steps to reproduce
When iterating over combinations of KV cache and draft KV cache quantization types (substituted for the `x` placeholders in the command above), VRAM usage is not consistent across backends.
HIP:
kv cache | draft kv | context
--- | --- | ---
f16/f16 | ntp | 49408
q8_0/q8_0 | ntp | 82944
f16/f16 | f16/f16 | 13312
f16/f16 | q8_0/q8_0 | 4608
q8_0/q8_0 | f16/f16 | 22784
q8_0/q8_0 | q8_0/q8_0 | 8192
Vulkan:
kv cache | draft kv | context
--- | --- | ---
f16/f16 | ntp | 39936
q8_0/q8_0 | ntp | 74496
f16/f16 | f16/f16 | 8960
f16/f16 | q8_0/q8_0 | 16640
q8_0/q8_0 | f16/f16 | 17408
q8_0/q8_0 | q8_0/q8_0 | 31488
According to #24102 (comment) the HIP numbers are correct, but the Vulkan numbers are not.
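To make the discrepancy easier to see, here is a rough sketch that compares, per backend, how the context value changes when only the draft KV type goes from f16/f16 to q8_0/q8_0. It uses only the numbers from the HIP and Vulkan tables; the helper name `draft_quant_effect` is made up for illustration and is not part of llama.cpp.

```python
# Context values copied from the HIP and Vulkan tables in this report,
# keyed by (kv cache type, draft kv type).
hip = {
    ("f16/f16", "ntp"): 49408,
    ("q8_0/q8_0", "ntp"): 82944,
    ("f16/f16", "f16/f16"): 13312,
    ("f16/f16", "q8_0/q8_0"): 4608,
    ("q8_0/q8_0", "f16/f16"): 22784,
    ("q8_0/q8_0", "q8_0/q8_0"): 8192,
}
vulkan = {
    ("f16/f16", "ntp"): 39936,
    ("q8_0/q8_0", "ntp"): 74496,
    ("f16/f16", "f16/f16"): 8960,
    ("f16/f16", "q8_0/q8_0"): 16640,
    ("q8_0/q8_0", "f16/f16"): 17408,
    ("q8_0/q8_0", "q8_0/q8_0"): 31488,
}


def draft_quant_effect(table):
    """Hypothetical helper: for each main KV type, return
    context(draft q8_0/q8_0) / context(draft f16/f16)."""
    return {
        kv: table[(kv, "q8_0/q8_0")] / table[(kv, "f16/f16")]
        for kv in ("f16/f16", "q8_0/q8_0")
    }


for name, table in (("HIP", hip), ("Vulkan", vulkan)):
    for kv, ratio in draft_quant_effect(table).items():
        print(f"{name:6s} kv={kv:10s} draft q8_0 vs f16 ratio: {ratio:.2f}")
```

With the reported data the ratio is below 1 on HIP and above 1 on Vulkan for both main KV types, so quantizing the draft KV cache moves the context value in opposite directions on the two backends.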
First Bad Commit
N/A
Relevant log output
N/A