Add ctx-per-slot argument for unified KV cache #24124

Open
bartowski1182 wants to merge 4 commits into ggml-org:master from bartowski1182:auto_parallel

Contributor

bartowski1182 commented Jun 4, 2026

Overview

Adds two new arguments for llama-server context provisioning.

ctx-per-slot

First argument, --ctx-per-slot

This caps how much context each slot may use in unified KV cache (kvu) mode. For example, if a user wants N slots with 4096 context each, they can specify --ctx-per-slot 4096. With the 4 slots shown in the log below, this provisions 16384 total context:

./build/bin/llama-server -m Qwen3.5-4B-Q4_K_M.gguf --ctx-per-slot 4096
...
0.02.635.267 W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
...
0.03.025.317 I srv    load_model: capping per-slot context (16384) to --ctx-per-slot (4096)
...
0.03.084.366 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.03.084.370 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.03.084.370 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.03.084.370 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
...
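
The sizing in the log above can be checked with a small sketch. This is only the arithmetic as described in this PR text, not the server code; `provisioned_ctx` and its parameter names are hypothetical, and it assumes -c is unset.

```python
def provisioned_ctx(n_parallel: int, ctx_per_slot: int) -> int:
    """Total KV pool size when -c is unset (assumed: one full cap per slot)."""
    return n_parallel * ctx_per_slot


# 4 slots, each capped at 4096, as in the log above
total = provisioned_ctx(n_parallel=4, ctx_per_slot=4096)
print(total)
```

With 4 slots and a 4096 cap this gives the 16384 reported as n_ctx_seq.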

If specified together with -c, the total context is limited to the -c value, while each slot is still capped at --ctx-per-slot:

./build-cpu/bin/llama-server -m Qwen3.5-4B-Q4_K_M.gguf -c 8192 --ctx-per-slot 4096
...
0.02.650.929 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
...
0.02.921.467 I srv    load_model: capping per-slot context (8192) to --ctx-per-slot (4096)
...
0.02.973.263 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
...

If you set -c to a value lower than --ctx-per-slot, the server warns and each slot is capped at the -c value instead:

load_model: --ctx-per-slot (4096) exceeds the per-slot pool capacity (2048) - cap has no effect, slots are limited to 2048 (raise the KV pool with -c, or unset -c to size it to n_parallel*ctx_per_slot)
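
The interaction between -c and --ctx-per-slot can be sketched as a simple minimum. This is a model of the behaviour described above, not the actual implementation; `slot_ctx` is a hypothetical helper, and it assumes that in unified mode the per-slot pool capacity equals the total pool set by -c.

```python
from typing import Tuple


def slot_ctx(pool_ctx: int, ctx_per_slot: int) -> Tuple[int, bool]:
    """Return (effective per-slot context, whether a warning would be logged).

    Assumption: the warning fires only when the requested cap exceeds the pool.
    """
    if ctx_per_slot > pool_ctx:
        # The cap has no effect; slots are limited by the pool itself.
        return pool_ctx, True
    return ctx_per_slot, False


print(slot_ctx(pool_ctx=8192, ctx_per_slot=4096))  # cap applies
print(slot_ctx(pool_ctx=2048, ctx_per_slot=4096))  # pool is the limit, warning
```

The second call mirrors the warning above, where slots end up limited to 2048.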

ctx-pool-slots

Also adds another new flag, --ctx-pool-slots, which specifies how many slots' worth of context should be allocated. This is similar to setting -c to an explicit value, but the user doesn't have to do the math ahead of time:

./build-cpu/bin/llama-server -m Qwen3.5-4B-Q4_K_M.gguf --ctx-per-slot 4096 --ctx-pool-slots 2
...
0.02.650.929 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
...
0.02.921.467 I srv    load_model: capping per-slot context (8192) to --ctx-per-slot (4096)
...
0.02.973.263 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
...

If you specify more pool slots than -np (n_parallel), the server warns and clamps the value to n_parallel:

srv llama_server: --ctx-pool-slots (6) exceeds n_parallel (4), clamping
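
A rough sketch of how --ctx-pool-slots maps to a total pool size, including the clamp. The function name `pool_ctx_from_slots` is hypothetical, and the assumption that the pool is simply the clamped slot count times --ctx-per-slot is inferred from the examples in this description, not taken from the code.

```python
def pool_ctx_from_slots(n_parallel: int, ctx_per_slot: int, ctx_pool_slots: int) -> int:
    """Total KV pool implied by --ctx-pool-slots (assumed arithmetic)."""
    # Requests above n_parallel are clamped, matching the warning above.
    pool_slots = min(ctx_pool_slots, n_parallel)
    return pool_slots * ctx_per_slot


print(pool_ctx_from_slots(4, 4096, 2))  # two slots' worth of context
print(pool_ctx_from_slots(4, 4096, 6))  # clamped to 4 slots
```

Under these assumptions, --ctx-per-slot 4096 --ctx-pool-slots 2 gives the same 8192 pool as -c 8192.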

Since this is purely a helper argument, I wouldn't mind dropping it from the PR. I think it would be nice to have, but I've been told the server arguments are already getting a bit bloated...

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, used for brainstorming and code changes/testing

bartowski1182 changed the title from "Add ctx-per-slot argument for unifid KV cache" to "Add ctx-per-slot argument for unified KV cache" Jun 4, 2026
bartowski1182 marked this pull request as ready for review June 4, 2026 13:56
bartowski1182 requested review from a team as code owners June 4, 2026 13:56