Add ctx-per-slot argument for unified KV cache by bartowski1182 · Pull Request #24124 · ggml-org/llama.cpp

bartowski1182 · 2026-06-04T12:49:39Z

Overview

Adds two new llama-server arguments for context provisioning.

ctx-per-slot

First argument, --ctx-per-slot

This specifies how much context each slot is limited to in unified KV cache (kvu) mode. For example, if a user wants N slots with 4096 context each, they can specify --ctx-per-slot 4096. With the 4 slots shown in the log below, this provisions 4 × 4096 = 16384 total context:

./build/bin/llama-server -m Qwen3.5-4B-Q4_K_M.gguf --ctx-per-slot 4096
...
0.02.635.267 W llama_context: n_ctx_seq (16384) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
...
0.03.025.317 I srv    load_model: capping per-slot context (16384) to --ctx-per-slot (4096)
...
0.03.084.366 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.03.084.370 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.03.084.370 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.03.084.370 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
...
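
The sizing in the run above is plain multiplication. A minimal sketch of that arithmetic, not the PR's actual code: `total_ctx_for_slots` is a hypothetical helper name, and the slot count of 4 is read off the four "new slot" lines in the log.

```python
def total_ctx_for_slots(n_parallel: int, ctx_per_slot: int) -> int:
    """Total KV context provisioned when -c is left unset (hypothetical helper)."""
    if n_parallel < 1 or ctx_per_slot < 1:
        raise ValueError("n_parallel and ctx_per_slot must be positive")
    # Every slot gets its full per-slot budget, so the pool is the product.
    return n_parallel * ctx_per_slot


# 4 slots at 4096 tokens each, matching n_ctx_seq in the log above.
print(total_ctx_for_slots(4, 4096))  # prints 16384
```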

If specified together with -c, the -c value limits the total context, and each slot is still capped at --ctx-per-slot:

./build-cpu/bin/llama-server -m Qwen3.5-4B-Q4_K_M.gguf -c 8192 --ctx-per-slot 4096
...
0.02.650.929 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
...
0.02.921.467 I srv    load_model: capping per-slot context (8192) to --ctx-per-slot (4096)
...
0.02.973.263 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
...

If you set -c to a value lower than --ctx-per-slot, the server warns and caps each slot to -c's value:

load_model: --ctx-per-slot (4096) exceeds the per-slot pool capacity (2048) - cap has no effect, slots are limited to 2048 (raise the KV pool with -c, or unset -c to size it to n_parallel*ctx_per_slot)
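
Both -c cases reduce to taking a minimum. A minimal sketch, assuming that in unified mode each slot can address up to the whole pool (which is what the "capping per-slot context (8192)" log line suggests). `effective_slot_ctx` is a hypothetical name, and the 2048 input is inferred from the capacity printed in the warning; the exact -c value used for that run is not stated here.

```python
def effective_slot_ctx(pool_ctx: int, ctx_per_slot: int) -> tuple[int, bool]:
    """Return (per-slot limit, warn) for a unified KV pool (hypothetical helper).

    warn is True when --ctx-per-slot exceeds the pool, i.e. the cap has no effect.
    """
    warn = ctx_per_slot > pool_ctx
    # A slot can never use more than the pool, nor more than its own cap.
    return min(pool_ctx, ctx_per_slot), warn


print(effective_slot_ctx(8192, 4096))  # -c 8192: cap applies, no warning
print(effective_slot_ctx(2048, 4096))  # pool of 2048: cap has no effect, warning
```

The second call corresponds to the warning above: the slots end up limited to 2048 rather than 4096.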

ctx-pool-slots

Also adds another new flag, --ctx-pool-slots, which specifies how many slots' worth of context should be allocated. This is similar to setting -c to an explicit value, but doesn't require the user to do the math manually ahead of time:

./build-cpu/bin/llama-server -m Qwen3.5-4B-Q4_K_M.gguf --ctx-per-slot 4096 --ctx-pool-slots 2
...
0.02.650.929 W llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
...
0.02.921.467 I srv    load_model: capping per-slot context (8192) to --ctx-per-slot (4096)
...
0.02.973.263 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  1 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  2 | task -1 | new slot, n_ctx = 4096
0.02.973.267 I slot   load_model: id  3 | task -1 | new slot, n_ctx = 4096
...

If you specify more pool slots than -np (n_parallel), the server warns and clamps to n_parallel:

srv llama_server: --ctx-pool-slots (6) exceeds n_parallel (4), clamping
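
The math that --ctx-pool-slots saves the user can be sketched as follows. This is an illustration under the assumption that the pool is simply the clamped slot count times --ctx-per-slot; `pool_ctx_from_slots` is a hypothetical name, not a function from the PR.

```python
def pool_ctx_from_slots(ctx_per_slot: int, pool_slots: int, n_parallel: int) -> int:
    """Total KV pool size implied by --ctx-pool-slots (hypothetical helper)."""
    if min(ctx_per_slot, pool_slots, n_parallel) < 1:
        raise ValueError("all arguments must be positive")
    # Clamp to n_parallel, mirroring the "exceeds n_parallel, clamping" warning.
    slots = min(pool_slots, n_parallel)
    return slots * ctx_per_slot


print(pool_ctx_from_slots(4096, 2, 4))  # prints 8192, same as -c 8192
print(pool_ctx_from_slots(4096, 6, 4))  # prints 16384, clamped to 4 slots
```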

Since this is purely a helper argument, I wouldn't mind dropping it from the PR. I think it would be nice to have, but I've been told the server arguments are already getting a bit bloated...

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, used for brainstorming and code changes/testing

bartowski1182 added 2 commits June 4, 2026 02:06

Add ctx-per-slot argument for unifid KV cache

9be1d52

Merge branch 'master' into auto_parallel

0c5d5d7

github-actions Bot added examples server labels Jun 4, 2026

Swap out ctx fractions for ctx pool slots

b2d7dc6

bartowski1182 changed the title ~~Add ctx-per-slot argument for unifid KV cache~~ Add ctx-per-slot argument for unified KV cache Jun 4, 2026

Formatting cleanup

6386441

bartowski1182 marked this pull request as ready for review June 4, 2026 13:56

bartowski1182 requested review from a team as code owners June 4, 2026 13:56

bartowski1182 wants to merge 4 commits into ggml-org:master from bartowski1182:auto_parallel
