perf(chat): KV reuse, flash-attn, modern samplers, history window, token coalescing #73

Merged: cryptopoly merged 6 commits into staging from feature/chatt-llm-improvements on Jun 4, 2026

Chat-LLM performance & quality improvements

Outcome of a review of the chat text-generation path (MLX + llama.cpp + orchestration). Both engines now reuse KV across turns, modern samplers (DRY, XTC) are reachable end-to-end, long chats are bounded by a token-budgeted history window, and the SSE stream emits roughly 4–5× fewer frames.

Commits

| Commit | Change |
|---|---|
| 154d6ac | llama.cpp cross-turn KV reuse (`--cache-reuse 256` + `cache_prompt`) + flash-attn now emitted (was stored, never used) + DRY/XTC/top-n-sigma sampler keys; MLX `repeat_penalty` fix (was silently dropped) + /v1 sampler parity |
| aadb024 | Token-budgeted history window + stop replaying prior `<think>` reasoning every turn |
| 4cbb300 | MLX persistent per-session prompt cache: single-slot port of mlx-lm's server reuse; trim divergent tail, prefill only the suffix. Live-validated 5.6× less prompt processing on turn 2, no corruption |
| f21a233 | DRY + XTC samplers exposed end-to-end (request model + mapping + SamplerPanel UI) |
| 739db71 | Token coalescing (standard stream path): batch token frames, flush on size/time/event/end. Live ~4–5× fewer SSE frames, text coherent |
| 8011203 | Token coalescing on the agent/tool path + drop the 4-char final-answer dribble |

Intentionally not done

/v1 history windowing — silently dropping an OpenAI-API client's messages violates API expectations; the client owns its context, so windowing belongs in our UI (covered by the history window).

Validation

  • E2E suite: 8/8 phases pass (phase 1 chat = MLX + GGUF + cache + spec-dec all green).
  • Python: green except one pre-existing failure (DFlashCompatibilityTests — orphaned dflash-mlx pin, FU-057; proven to fail identically on the base commit).
  • tsc clean · vitest 455/455 · i18n 100% all locales.
  • Live smokes: MLX prompt-cache reuse (turn-2 promptTokens 90→16); coalescing (9 frames vs ~40 for a 40-token reply).

Safety

Every perf path has a hard fallback to prior behaviour: MLX cache resets on any model change / no-prefix / partial-trim / exception; coalescing disabled when logprobs requested; flash-attn honours the user's fused_attention flag; samplers are forward-only (old binaries ignore unknown keys).

🤖 Generated with Claude Code

…plers

Tier 1+2 of the chat-LLM perf/quality review.

Performance (llama.cpp):
- --cache-reuse 256 + cache_prompt:true on the chat payload so a growing
  conversation reuses the slot KV and re-prefills only the new suffix
  instead of the whole history (turn-2+ TTFT drops sharply; cumulative
  prefill over an n-turn chat was O(n^2)).
- Emit --flash-attn on when the user's fused_attention flag is set. It was
  plumbed into load_model + stored on LoadedModelInfo but never turned into
  a flag; threaded fused_attention into _build_command. Large Metal
  decode/KV-memory win; required for quantized KV cache types.
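
A minimal sketch of how the two llama.cpp changes above could fit together. `build_server_args` and `build_chat_payload` are hypothetical stand-ins (the real `_build_command` is not shown in this PR text); only the `--cache-reuse`, `--flash-attn on` and `cache_prompt` names come from the commit message.

```python
from typing import Any


def build_server_args(model_path: str, fused_attention: bool,
                      cache_reuse: int = 256) -> list:
    """Hypothetical stand-in for _build_command: returns the llama-server argv."""
    args = ["llama-server", "--model", model_path,
            "--cache-reuse", str(cache_reuse)]
    # Only emit the flag when the user's fused_attention setting is on.
    if fused_attention:
        args += ["--flash-attn", "on"]
    return args


def build_chat_payload(messages: list, max_tokens: int) -> dict:
    """Chat payload with cache_prompt set so the slot KV is reused across turns."""
    payload: dict = {
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": True,
        "cache_prompt": True,
    }
    return payload
```

With `fused_attention=False` the argv carries no flash-attn flag at all, which matches the "honours the user's flag" fallback described under Safety.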

Quality (samplers):
- llama.cpp: add DRY (dry_multiplier/base/allowed_length), XTC
  (xtc_probability/threshold), top_n_sigma to _LLAMA_SAMPLER_KEYS
  (forward-only; old binaries ignore unknown fields).
- MLX: wire XTC into the sampler + add repeat_penalty via a new
  logits_processors builder. repeat_penalty was shown in the UI but
  silently dropped because mlx-lm applies it through logits_processors,
  not make_sampler.
- /v1 parity: forward min_p / repeat_penalty / mirostat(_tau/_eta) which
  the OpenAI-compat path dropped; added the request-model fields.
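
The MLX fix above comes down to routing each override to the right builder. The sketch below illustrates that split under stated assumptions: the key sets and `split_mlx_overrides` are hypothetical, and the exact keyword names accepted by mlx-lm's sampler and logits-processor builders should be checked against the installed version.

```python
# Assumed: keys that a sampler builder would accept.
SAMPLER_KEYS = {"temp", "top_p", "top_k", "min_p",
                "xtc_probability", "xtc_threshold"}

# Assumed: UI/request key -> logits-processor keyword.
LOGITS_KEY_MAP = {"repeat_penalty": "repetition_penalty"}


def split_mlx_overrides(overrides: dict) -> tuple:
    """Split overrides into (sampler kwargs, logits-processor kwargs).

    Keys that belong to neither group (for example DRY settings, which the
    MLX path ignores) are dropped rather than raising.
    """
    sampler_kwargs = {
        key: value for key, value in overrides.items()
        if key in SAMPLER_KEYS and value is not None
    }
    logits_kwargs = {
        LOGITS_KEY_MAP[key]: value for key, value in overrides.items()
        if key in LOGITS_KEY_MAP and value is not None
    }
    return sampler_kwargs, logits_kwargs
```

Passing `repeat_penalty` only to the sampler side is the failure mode the commit describes: the value is accepted by the UI but never reaches generation.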

Tests: new sampler/logits/parity cases; touched-area suites green. The one
failing test (dflash runtime-bundle) is pre-existing/unrelated (orphaned
dflash-mlx pin, FU-057).
…oning

Tier 3 of the chat-LLM review.

- Stop re-feeding prior <think> reasoning into history every turn. The live
  chat path now passes preserve_reasoning=False; upstream Qwen3 /
  DeepSeek-R1 templates strip prior reasoning, and replaying it inflated
  the prompt each turn. _build_history_with_reasoning keeps the capability
  for callers that still want it (sessions side already passed False).
- Token-budgeted sliding window (optional token_budget arg). Keeps system
  messages + the newest turns that fit, drops the oldest — bounding prompt
  growth so a long chat can't silently truncate on llama.cpp or overflow
  context on MLX. Budget reserves room for system prompt + current prompt +
  max_tokens + template overhead, floors at 512 so the latest turn is
  always kept. Conservative ~3 chars/token estimate (no tokenizer at this
  layer) errs toward under-filling to avoid overflow.
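
The windowing rule above can be sketched as follows. This is an illustration, not the PR's code: the function names, the `template_overhead` default and the exact rounding are assumptions; the ~3 chars/token estimate, the 512 floor and the "keep system messages plus newest turns" policy come from the commit message.

```python
import math

CHARS_PER_TOKEN = 3        # conservative estimate; no tokenizer at this layer
MIN_HISTORY_BUDGET = 512   # floor so the latest turn always fits


def estimate_tokens(text: str) -> int:
    """Rough token count; rounds up so the estimate over-counts."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def history_budget(context_size: int, system_prompt: str, prompt: str,
                   max_tokens: int, template_overhead: int = 64) -> int:
    """Tokens left for history after reserving room for everything else."""
    reserved = (estimate_tokens(system_prompt) + estimate_tokens(prompt)
                + max_tokens + template_overhead)
    return max(MIN_HISTORY_BUDGET, context_size - reserved)


def window_history(messages: list, token_budget: int) -> list:
    """Keep all system messages plus the newest turns that fit the budget."""
    keep = set()
    used = 0
    for index in range(len(messages) - 1, -1, -1):
        message = messages[index]
        if message["role"] == "system":
            continue
        cost = estimate_tokens(message["content"])
        # The newest turn is kept even if it alone exceeds the budget.
        if keep and used + cost > token_budget:
            break
        keep.add(index)
        used += cost
    return [m for i, m in enumerate(messages)
            if m["role"] == "system" or i in keep]
```

Walking from newest to oldest and stopping at the first turn that does not fit keeps the retained history contiguous, so the model never sees a conversation with a hole in the middle.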

Tests: +7 windowing/budget cases; generate-path + services suites green.
The native MLX chat path rebuilt a fresh cache every turn and re-prefilled
the whole conversation. This keeps one persistent mlx-lm prompt cache on the
worker and reuses the longest matching token prefix across turns: trim the
divergent tail, prefill only the new suffix, re-commit keyed by
prompt+generated tokens. A single-slot port of mlx-lm server's
LRUPromptCache.fetch_nearest_cache.

- New backend_service/mlx_worker_prompt_cache.py: acquire / commit /
  invalidate. Gated to the native strategy (compression caches keep their
  path); guarded by can_trim_prompt_cache (SSM/Mamba/rotating-full reset,
  mlx-lm #980); resets on model change / no common prefix / partial trim /
  any exception -> fresh full prefill (identical output, no speedup).
- WorkerState gains _persist_cache / _persist_tokens /
  _persist_cache_model_ref; invalidated on every load / unload / profile
  change. generate_standard + stream_generate collect generated token ids
  (GenerationResponse.token) so the persisted token list always equals the
  cache's positional contents (exact next-turn trim).
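
The acquire / commit / invalidate cycle described above can be sketched on token ids alone. This is a simplified illustration: `SingleSlotPromptCache` is a hypothetical class that tracks only the token list, whereas the real module also trims an actual mlx-lm KV cache and checks that the cache is trimmable.

```python
from dataclasses import dataclass, field
from typing import Optional


def common_prefix_len(a: list, b: list) -> int:
    """Length of the longest shared prefix of two token lists."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


@dataclass
class SingleSlotPromptCache:
    tokens: list = field(default_factory=list)
    model_ref: Optional[str] = None

    def acquire(self, model_ref: str, prompt_tokens: list) -> list:
        """Return the suffix of prompt_tokens that still needs prefill."""
        if model_ref != self.model_ref:
            self.invalidate()
            self.model_ref = model_ref
        shared = common_prefix_len(self.tokens, prompt_tokens)
        # Always leave at least one token to prefill so logits are produced.
        shared = min(shared, len(prompt_tokens) - 1)
        if shared <= 0:
            self.tokens = []  # no usable prefix: fresh full prefill
            return list(prompt_tokens)
        self.tokens = self.tokens[:shared]  # trim the divergent tail
        return list(prompt_tokens[shared:])

    def commit(self, prompt_tokens: list, generated_tokens: list) -> None:
        """Record exactly what the cache now holds, position by position."""
        self.tokens = list(prompt_tokens) + list(generated_tokens)

    def invalidate(self) -> None:
        self.tokens = []
        self.model_ref = None
```

Committing prompt plus generated tokens is what makes the next turn cheap: the following prompt normally begins with the previous prompt and reply, so only the new user message falls outside the shared prefix.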

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit, same session):
turn 1 promptTokens=34, turn 2 promptTokens=16 (vs ~90 without reuse) with a
coherent, context-aware turn-2 answer -- ~5.6x less prompt processing, no
corruption.

Tests: +12 reuse-logic cases (fake cache, trim accounting, all reset paths);
MLX worker suite green (the one fail, dflash runtime-bundle, is the
pre-existing orphaned-pin issue).
Tier 2 added DRY/XTC/top-n-sigma at the engine layer but nothing populated
them from a request, so they were unreachable. Complete the chain:

- GenerateRequest gains xtcProbability / xtcThreshold / dryMultiplier /
  dryBase / dryAllowedLength; _build_sampler_overrides maps them to the
  engine-side snake_case keys (llama-server forwards all via
  _LLAMA_SAMPLER_KEYS; mlx-lm applies XTC via make_sampler, ignores DRY).
- SamplerPanel gains xtc_probability / xtc_threshold / dry_multiplier rows;
  SamplerOverrides type + samplerOverrides storage/serialize carry the new
  fields.
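
A sketch of the request-to-engine mapping described above. The field names are taken from the commit message; the dictionary-based request and the function body are assumptions for illustration, not the actual `_build_sampler_overrides`.

```python
# camelCase request fields -> snake_case engine keys (names from the commit).
REQUEST_TO_ENGINE_KEYS = {
    "xtcProbability": "xtc_probability",
    "xtcThreshold": "xtc_threshold",
    "dryMultiplier": "dry_multiplier",
    "dryBase": "dry_base",
    "dryAllowedLength": "dry_allowed_length",
}


def build_sampler_overrides(request: dict) -> dict:
    """Map set request fields to engine keys; unset fields stay at defaults."""
    return {
        engine_key: request[request_key]
        for request_key, engine_key in REQUEST_TO_ENGINE_KEYS.items()
        if request.get(request_key) is not None
    }
```

Omitting unset fields, rather than sending explicit zeros, leaves both samplers off by default and keeps the payload unchanged for clients that never touch them.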

XTC adds creative variety on both engines; DRY suppresses verbatim
repetition loops more effectively than repeat_penalty, but only on llama.cpp
(mlx-lm ignores it). Both default to off.

Tests: backend mapping + frontend projection/round-trip; tsc + vitest green.
…ites)

The chat SSE stream emitted one json.dumps + frame per token. Batch visible
token text in the standard (non-tool) stream path and flush on a 24-char /
50ms window, before any non-token event (reasoning / reasoningDone / panic /
thermal / error), and at stream end. Disabled when per-token logprobs are
requested (they must stay 1:1 aligned); the agent/tool path is unchanged.
full_text accumulation + the per-token runaway / stall / loop guards are
untouched, so persisted output + abort behaviour are identical; the frontend
reassembles the same text from larger frames.
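
The batching rule above can be sketched as a small buffer with size and time thresholds. `TokenCoalescer` is a hypothetical name and the injectable clock is an assumption added for testability; the 24-char / 50 ms defaults come from the commit message. In this sketch the time threshold is only checked when a token arrives, and the caller is expected to call `flush()` before any non-token event and at stream end.

```python
import time


class TokenCoalescer:
    """Buffers visible token text and releases it in larger frames."""

    def __init__(self, max_chars=24, max_delay_s=0.05, clock=time.monotonic):
        self.max_chars = max_chars
        self.max_delay_s = max_delay_s
        self._clock = clock
        self._parts = []
        self._size = 0
        self._started = None

    def push(self, text):
        """Add token text; return a batch if a threshold is hit, else None."""
        if not self._parts:
            self._started = self._clock()  # window opens on first token
        self._parts.append(text)
        self._size += len(text)
        too_big = self._size >= self.max_chars
        too_old = self._clock() - self._started >= self.max_delay_s
        if too_big or too_old:
            return self.flush()
        return None

    def flush(self):
        """Return buffered text (or None if empty) and reset the window."""
        if not self._parts:
            return None
        batch = "".join(self._parts)
        self._parts.clear()
        self._size = 0
        self._started = None
        return batch
```

Because batches are plain concatenations released in arrival order, joining the emitted frames reproduces the per-token text exactly.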

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit): a ~40-token reply
streamed as 9 token frames (avg ~20 chars) instead of ~40, text fully
coherent + correctly ordered (phase -> tokens -> done). ~4-5x fewer frames.
…dribble

The token coalescer landed on the standard stream path but the agent/tool
path still emitted one frame per token, and the agent loop fake-streamed the
already-computed final answer 4 chars at a time. Now the agent token
forwarding uses the same coalescer (flush before toolCallStart /
toolCallResult), and the agent loop emits the final answer in 48-char chunks
instead of 4 -- less fake latency + fewer yields. Tool-call detection +
execution flow is unchanged.
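
For the final-answer change, a minimal sketch of fixed-size chunking. `chunk_text` is a hypothetical helper; only the 48-character chunk size comes from the commit message.

```python
def chunk_text(text, size=48):
    """Yield consecutive slices of text, each at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

A 100-character answer yields three chunks at size 48, compared with 25 chunks at the previous size of 4.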

Tests: test_agent + test_backend_service green; identical coalescer mechanics
to the live-validated standard path.
@cryptopoly cryptopoly merged commit 766a610 into staging Jun 4, 2026