perf(chat): KV reuse, flash-attn, modern samplers, history window, token coalescing by cryptopoly · Pull Request #73 · cryptopoly/ChaosEngineAI

cryptopoly · 2026-06-04T11:22:48Z

Chat-LLM performance & quality improvements

Outcome of a review of the chat text-generation path (MLX + llama.cpp + orchestration). Both engines now reuse KV across turns, modern samplers are reachable, long chats are bounded, and the SSE stream is far cheaper.

Commits

| Commit | Change |
| --- | --- |
| `154d6ac` | llama.cpp cross-turn KV reuse (`--cache-reuse 256` + `cache_prompt`) + flash-attn now emitted (was stored, never used) + DRY/XTC/top-n-sigma sampler keys; MLX `repeat_penalty` fix (was silently dropped) + `/v1` sampler parity |
| `aadb024` | Token-budgeted history window + stop replaying prior `<think>` reasoning every turn |
| `4cbb300` | MLX persistent per-session prompt cache: single-slot port of mlx-lm's server reuse; trim divergent tail, prefill only the suffix. Live-validated 5.6× less prompt processing on turn 2, no corruption |
| `f21a233` | DRY + XTC samplers exposed end-to-end (request model + mapping + SamplerPanel UI) |
| `739db71` | Token coalescing (standard stream path): batch token frames, flush on size/time/event/end. Live ~4–5× fewer SSE frames, text coherent |
| `8011203` | Token coalescing on the agent/tool path + drop the 4-char final-answer dribble |

Intentionally not done

/v1 history windowing — silently dropping an OpenAI-API client's messages violates API expectations; the client owns its context, so windowing belongs in our UI (covered by the history window).

Validation

E2E suite: 8/8 phases pass (phase 1 chat = MLX + GGUF + cache + spec-dec all green).
Python: green except one pre-existing failure (DFlashCompatibilityTests — orphaned dflash-mlx pin, FU-057; proven to fail identically on the base commit).
tsc clean · vitest 455/455 · i18n 100% all locales.
Live smokes: MLX prompt-cache reuse (turn-2 promptTokens 90→16); coalescing (9 frames vs ~40 for a 40-token reply).

Safety

Every perf path has a hard fallback to prior behaviour: MLX cache resets on any model change / no-prefix / partial-trim / exception; coalescing disabled when logprobs requested; flash-attn honours the user's fused_attention flag; samplers are forward-only (old binaries ignore unknown keys).

🤖 Generated with Claude Code

…plers

Tier 1+2 of the chat-LLM perf/quality review.

Performance (llama.cpp):
- --cache-reuse 256 + cache_prompt:true on the chat payload so a growing conversation reuses the slot KV and re-prefills only the new suffix instead of the whole history (turn-2+ TTFT drops sharply; was O(n^2)).
- Emit --flash-attn on when the user's fused_attention flag is set. It was plumbed into load_model + stored on LoadedModelInfo but never turned into a flag; threaded fused_attention into _build_command. Large Metal decode/KV-memory win; required for quantized KV cache types.

Quality (samplers):
- llama.cpp: add DRY (dry_multiplier/base/allowed_length), XTC (xtc_probability/threshold), top_n_sigma to _LLAMA_SAMPLER_KEYS (forward-only; old binaries ignore unknown fields).
- MLX: wire XTC into the sampler + add repeat_penalty via a new logits_processors builder. repeat_penalty was shown in the UI but silently dropped because mlx-lm applies it through logits_processors, not make_sampler.
- /v1 parity: forward min_p / repeat_penalty / mirostat(_tau/_eta) which the OpenAI-compat path dropped; added the request-model fields.

Tests: new sampler/logits/parity cases; touched-area suites green. The one failing test (dflash runtime-bundle) is pre-existing/unrelated (orphaned dflash-mlx pin, FU-057).
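A minimal sketch of the llama.cpp side of this commit, to make the moving parts concrete. The function names, the allow-list contents beyond the keys named above, and the payload shape are illustrative assumptions, not the repository's actual `_build_command` or `_LLAMA_SAMPLER_KEYS`. The `--flash-attn on` form follows the commit message; which values the flag accepts depends on the llama-server build.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the allow-list the commit calls _LLAMA_SAMPLER_KEYS.
LLAMA_SAMPLER_KEYS = frozenset({
    "temperature", "top_p", "top_k", "min_p", "repeat_penalty",
    "dry_multiplier", "dry_base", "dry_allowed_length",
    "xtc_probability", "xtc_threshold", "top_n_sigma",
})


def build_command(binary: str, model_path: str, fused_attention: bool) -> List[str]:
    """Assemble a llama-server argv (illustrative, not the real _build_command)."""
    # --cache-reuse lets the server reuse matching KV chunks across requests.
    cmd = [binary, "--model", model_path, "--cache-reuse", "256"]
    if fused_attention:
        # Only emitted when the user opted in, mirroring the fused_attention flag.
        cmd += ["--flash-attn", "on"]
    return cmd


def build_chat_payload(messages: List[Dict[str, str]],
                       sampler_overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Build the chat request body with prompt caching and allow-listed samplers."""
    payload: Dict[str, Any] = {
        "messages": messages,
        "stream": True,
        "cache_prompt": True,  # reuse the slot's KV for the shared prefix
    }
    for key, value in sampler_overrides.items():
        # Unknown or unset keys are dropped rather than forwarded.
        if key in LLAMA_SAMPLER_KEYS and value is not None:
            payload[key] = value
    return payload
```

Keeping the sampler keys in one allow-list means a new sampler only needs a single entry there to reach the server, which matches the "forward-only" framing in the commit.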

…oning

Tier 3 of the chat-LLM review.

- Stop re-feeding prior <think> reasoning into history every turn. The live chat path now passes preserve_reasoning=False; upstream Qwen3 / DeepSeek-R1 templates strip prior reasoning, and replaying it inflated the prompt each turn. _build_history_with_reasoning keeps the capability for callers that still want it (sessions side already passed False).
- Token-budgeted sliding window (optional token_budget arg). Keeps system messages + the newest turns that fit and drops the oldest, bounding prompt growth so a long chat can't silently truncate on llama.cpp or overflow context on MLX. Budget reserves room for system prompt + current prompt + max_tokens + template overhead, floors at 512 so the latest turn is always kept. Conservative ~3 chars/token estimate (no tokenizer at this layer) errs toward under-filling to avoid overflow.

Tests: +7 windowing/budget cases; generate-path + services suites green.
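The windowing rule above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function name, the message shape (`role`/`content` dicts), and the reading of "floors at 512" as a minimum history budget combined with always keeping the newest turn are all assumptions.

```python
import math
from typing import Dict, List

CHARS_PER_TOKEN = 3       # conservative estimate quoted in the commit message
MIN_HISTORY_BUDGET = 512  # floor quoted in the commit message


def estimate_tokens(text: str) -> int:
    """Rough token count without a tokenizer; over-estimates to avoid overflow."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def window_history(history: List[Dict[str, str]],
                   token_budget: int,
                   reserved_tokens: int = 0) -> List[Dict[str, str]]:
    """Keep system messages plus the newest turns that fit the budget."""
    # reserved_tokens stands for system prompt + current prompt + max_tokens
    # + template overhead (assumed to be computed by the caller).
    budget = max(token_budget - reserved_tokens, MIN_HISTORY_BUDGET)
    keep = {i for i, m in enumerate(history) if m["role"] == "system"}
    used = 0
    kept_any = False
    # Walk newest to oldest so the oldest turns are the ones dropped.
    for i in range(len(history) - 1, -1, -1):
        if i in keep:
            continue
        cost = estimate_tokens(history[i]["content"])
        # The newest turn is always kept, even if it alone exceeds the budget.
        if kept_any and used + cost > budget:
            break
        keep.add(i)
        used += cost
        kept_any = True
    # Re-emit in the original order.
    return [m for i, m in enumerate(history) if i in keep]
```

Stopping at the first turn that does not fit, rather than skipping it and continuing, keeps the window contiguous, so the model never sees a conversation with a hole in the middle.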

The native MLX chat path rebuilt a fresh cache every turn and re-prefilled the whole conversation. This keeps one persistent mlx-lm prompt cache on the worker and reuses the longest matching token prefix across turns: trim the divergent tail, prefill only the new suffix, re-commit keyed by prompt+generated tokens. A single-slot port of mlx-lm server's LRUPromptCache.fetch_nearest_cache.

- New backend_service/mlx_worker_prompt_cache.py: acquire / commit / invalidate. Gated to the native strategy (compression caches keep their path); guarded by can_trim_prompt_cache (SSM/Mamba/rotating-full reset, mlx-lm #980); resets on model change / no common prefix / partial trim / any exception -> fresh full prefill (identical output, no speedup).
- WorkerState gains _persist_cache / _persist_tokens / _persist_cache_model_ref; invalidated on every load / unload / profile change. generate_standard + stream_generate collect generated token ids (GenerationResponse.token) so the persisted token list always equals the cache's positional contents (exact next-turn trim).

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit, same session): turn 1 promptTokens=34, turn 2 promptTokens=16 (vs ~90 without reuse) with a coherent, context-aware turn-2 answer -- ~5.6x less prompt processing, no corruption.

Tests: +12 reuse-logic cases (fake cache, trim accounting, all reset paths); MLX worker suite green (the one fail, dflash runtime-bundle, is the pre-existing orphaned-pin issue).
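The prefix-reuse decision can be illustrated without mlx-lm at all, since it is pure token-list bookkeeping. The sketch below is an assumption-laden model of that logic, not the contents of `mlx_worker_prompt_cache.py`: `ReusePlan` and `plan_reuse` are hypothetical names, and leaving at least one prompt token to prefill (so the model still produces logits) is an assumption about how a full-prefix hit is handled.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ReusePlan:
    reused: int        # tokens served from the persisted cache
    trim: int          # tokens to drop from the cache tail
    suffix: List[int]  # tokens that still need prefill


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def plan_reuse(cached: Sequence[int], prompt: Sequence[int],
               can_trim: bool = True) -> ReusePlan:
    """Decide how much of the persisted cache the new prompt can reuse."""
    # Fallback: discard the whole cache and prefill the full prompt.
    full = ReusePlan(reused=0, trim=len(cached), suffix=list(prompt))
    shared = common_prefix_len(cached, prompt)
    # Assumption: always leave at least one token to prefill.
    shared = min(shared, len(prompt) - 1)
    if shared <= 0:
        return full  # no common prefix
    trim = len(cached) - shared
    if trim > 0 and not can_trim:
        return full  # cache type cannot drop its tail, so reset
    return ReusePlan(reused=shared, trim=trim, suffix=list(prompt[shared:]))
```

For example, with cached tokens `[1, 2, 3, 4, 5]` and a new prompt `[1, 2, 3, 9, 9, 9]`, the plan reuses 3 tokens, trims 2 from the cache tail, and prefills only `[9, 9, 9]`.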

Tier 2 added DRY/XTC/top-n-sigma at the engine layer but nothing populated them from a request, so they were unreachable. Complete the chain:

- GenerateRequest gains xtcProbability / xtcThreshold / dryMultiplier / dryBase / dryAllowedLength; _build_sampler_overrides maps them to the engine-side snake_case keys (llama-server forwards all via _LLAMA_SAMPLER_KEYS; mlx-lm applies XTC via make_sampler, ignores DRY).
- SamplerPanel gains xtc_probability / xtc_threshold / dry_multiplier rows; SamplerOverrides type + samplerOverrides storage/serialize carry the new fields.

XTC adds creative variety (both engines); DRY kills verbatim repetition loops better than repeat_penalty (llama.cpp). Both default off.

Tests: backend mapping + frontend projection/round-trip; tsc + vitest green.
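The request-to-engine mapping described above might look roughly like this. The dataclass is a hypothetical stand-in for the real `GenerateRequest` (whose base class and other fields are not shown in this PR), and `build_sampler_overrides` only mirrors the described behaviour of `_build_sampler_overrides` for the five new fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class GenerateRequest:
    """Hypothetical stand-in carrying only the new sampler fields."""
    xtcProbability: Optional[float] = None
    xtcThreshold: Optional[float] = None
    dryMultiplier: Optional[float] = None
    dryBase: Optional[float] = None
    dryAllowedLength: Optional[int] = None


# camelCase request field -> engine-side snake_case key
_FIELD_TO_KEY = {
    "xtcProbability": "xtc_probability",
    "xtcThreshold": "xtc_threshold",
    "dryMultiplier": "dry_multiplier",
    "dryBase": "dry_base",
    "dryAllowedLength": "dry_allowed_length",
}


def build_sampler_overrides(request: GenerateRequest) -> Dict[str, Any]:
    """Map set request fields to engine keys; unset fields stay off."""
    overrides: Dict[str, Any] = {}
    for field_name, key in _FIELD_TO_KEY.items():
        value = getattr(request, field_name)
        if value is not None:
            overrides[key] = value
    return overrides
```

Omitting unset fields, instead of sending explicit zeros, leaves each engine's own defaults in place, which is consistent with both samplers defaulting to off.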

…ites)

The chat SSE stream emitted one json.dumps + frame per token. Batch visible token text in the standard (non-tool) stream path and flush on a 24-char / 50ms window, before any non-token event (reasoning / reasoningDone / panic / thermal / error), and at stream end. Disabled when per-token logprobs are requested (they must stay 1:1 aligned); the agent/tool path is unchanged.

full_text accumulation + the per-token runaway / stall / loop guards are untouched, so persisted output + abort behaviour are identical; the frontend reassembles the same text from larger frames.

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit): a ~40-token reply streamed as 9 token frames (avg ~20 chars) instead of ~40, text fully coherent + correctly ordered (phase -> tokens -> done). ~4-5x fewer frames.
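A minimal sketch of a size/time coalescer with the 24-char / 50ms thresholds quoted above. The class name and API are hypothetical, and this simplified version checks the time window only when the next token arrives (no background timer), which may differ from the actual implementation.

```python
import time
from typing import Callable, List, Optional


class TokenCoalescer:
    """Buffer token text and release it in larger frames (illustrative only)."""

    def __init__(self, max_chars: int = 24, max_delay_s: float = 0.050,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_chars = max_chars
        self._max_delay_s = max_delay_s
        self._clock = clock
        self._buf: List[str] = []
        self._size = 0
        self._started = 0.0

    def push(self, text: str) -> Optional[str]:
        """Add one token; return a frame if the size or time window is hit."""
        if not self._buf:
            self._started = self._clock()  # window starts at first buffered token
        self._buf.append(text)
        self._size += len(text)
        too_big = self._size >= self._max_chars
        too_old = self._clock() - self._started >= self._max_delay_s
        return self.flush() if (too_big or too_old) else None

    def flush(self) -> Optional[str]:
        """Release whatever is buffered; call before non-token events and at end."""
        if not self._buf:
            return None
        frame = "".join(self._buf)
        self._buf.clear()
        self._size = 0
        return frame
```

A caller would emit a frame whenever `push` returns text, call `flush` before forwarding any non-token event and once more at stream end, and bypass the coalescer entirely when logprobs are requested.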

…dribble

The token coalescer landed on the standard stream path but the agent/tool path still emitted one frame per token, and the agent loop fake-streamed the already-computed final answer 4 chars at a time. Now the agent token forwarding uses the same coalescer (flush before toolCallStart / toolCallResult), and the agent loop emits the final answer in 48-char chunks instead of 4 -- less fake latency + fewer yields. Tool-call detection + execution flow is unchanged.

Tests: test_agent + test_backend_service green; identical coalescer mechanics to the live-validated standard path.
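To put the chunk-size change in numbers, here is a small sketch of fixed-size chunking of an already-computed answer. The helper name is hypothetical and the agent loop's real emit code is not shown in this PR.

```python
from typing import Iterator


def chunk_text(text: str, size: int = 48) -> Iterator[str]:
    """Yield consecutive slices of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(text), size):
        yield text[start:start + size]


answer = "x" * 100
frames_48 = list(chunk_text(answer, 48))  # 3 frames: 48 + 48 + 4 chars
frames_4 = list(chunk_text(answer, 4))    # 25 frames of 4 chars each
```

For a 100-character answer that is 3 yields instead of 25, and the concatenated text is unchanged.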

cryptopoly added 6 commits June 2, 2026 12:02

cryptopoly merged commit 766a610 into staging Jun 4, 2026

cryptopoly merged 6 commits into staging from feature/chatt-llm-improvements
