perf(chat): KV reuse, flash-attn, modern samplers, history window, token coalescing #73
Merged
…plers

Tier 1+2 of the chat-LLM perf/quality review.

Performance (llama.cpp):
- --cache-reuse 256 + cache_prompt:true on the chat payload, so a growing conversation reuses the slot KV and re-prefills only the new suffix instead of the whole history (turn-2+ TTFT drops sharply; was O(n^2)).
- Emit --flash-attn on when the user's fused_attention flag is set. It was plumbed into load_model and stored on LoadedModelInfo but never turned into a flag; fused_attention is now threaded into _build_command. Large Metal decode/KV-memory win; required for quantized KV cache types.

Quality (samplers):
- llama.cpp: add DRY (dry_multiplier/base/allowed_length), XTC (xtc_probability/threshold), and top_n_sigma to _LLAMA_SAMPLER_KEYS (forward-only; old binaries ignore unknown fields).
- MLX: wire XTC into the sampler and add repeat_penalty via a new logits_processors builder. repeat_penalty was shown in the UI but silently dropped, because mlx-lm applies it through logits_processors, not make_sampler.
- /v1 parity: forward min_p / repeat_penalty / mirostat(_tau/_eta), which the OpenAI-compat path dropped; added the request-model fields.

Tests: new sampler/logits/parity cases; touched-area suites green. The one failing test (dflash runtime-bundle) is pre-existing and unrelated (orphaned dflash-mlx pin, FU-057).
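As a rough illustration of the llama.cpp changes above, the following sketch builds a server command and a chat payload. The function names, the constant name, and the payload shape are hypothetical stand-ins, not the project's real `_build_command` or `_LLAMA_SAMPLER_KEYS`; only the flags and sampler key names are taken from the commit message.

```python
# Hypothetical sketch: names and structure are assumptions, not the project's code.
# Sampler keys listed in the commit message; forwarded only when present.
LLAMA_SAMPLER_KEYS = frozenset({
    "min_p", "repeat_penalty",
    "dry_multiplier", "dry_base", "dry_allowed_length",
    "xtc_probability", "xtc_threshold", "top_n_sigma",
})


def build_command(binary, model_path, fused_attention):
    """Build a llama-server argv with KV reuse and optional flash attention."""
    cmd = [binary, "-m", model_path, "--cache-reuse", "256"]
    if fused_attention:
        # Only emitted when the user's fused_attention flag is set.
        cmd += ["--flash-attn", "on"]
    return cmd


def build_chat_payload(messages, overrides=None):
    """Build a chat payload that asks the server to reuse the slot's KV cache."""
    payload = {"messages": list(messages), "stream": True, "cache_prompt": True}
    for key, value in (overrides or {}).items():
        # Forward-only: unknown keys and unset values are never sent.
        if key in LLAMA_SAMPLER_KEYS and value is not None:
            payload[key] = value
    return payload
```

Filtering against a fixed key set keeps arbitrary UI fields from leaking into the request while still letting newer sampler options through.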
…oning

Tier 3 of the chat-LLM review.

- Stop re-feeding prior <think> reasoning into history every turn. The live chat path now passes preserve_reasoning=False; upstream Qwen3 / DeepSeek-R1 templates strip prior reasoning, and replaying it inflated the prompt each turn. _build_history_with_reasoning keeps the capability for callers that still want it (the sessions side already passed False).
- Token-budgeted sliding window (optional token_budget arg). Keeps system messages plus the newest turns that fit and drops the oldest, bounding prompt growth so a long chat can't silently truncate on llama.cpp or overflow context on MLX. The budget reserves room for system prompt + current prompt + max_tokens + template overhead, and floors at 512 so the latest turn is always kept. A conservative ~3 chars/token estimate (no tokenizer at this layer) errs toward under-filling to avoid overflow.

Tests: +7 windowing/budget cases; generate-path + services suites green.
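The windowing rule described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function names, the `template_overhead` default of 64, and the choice to emit system messages first are all assumptions; only the ~3 chars/token estimate and the 512 floor come from the commit message.

```python
# Hypothetical sketch of a token-budgeted sliding window over chat history.
CHARS_PER_TOKEN = 3   # conservative estimate (no tokenizer at this layer)
MIN_BUDGET = 512      # floor from the commit message


def estimate_tokens(text):
    """Over-estimate token count so the window under-fills rather than overflows."""
    return len(text) // CHARS_PER_TOKEN + 1


def history_budget(context_len, system_prompt, prompt, max_tokens,
                   template_overhead=64):  # overhead value is an assumption
    """Tokens left for history after reserving everything else in the context."""
    reserved = (estimate_tokens(system_prompt) + estimate_tokens(prompt)
                + max_tokens + template_overhead)
    return max(MIN_BUDGET, context_len - reserved)


def window_history(messages, token_budget):
    """Keep system messages plus the newest turns that fit in token_budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    kept, used = [], 0
    for msg in reversed(turns):            # walk newest to oldest
        cost = estimate_tokens(msg["content"])
        if kept and used + cost > token_budget:
            break                          # older turns are dropped
        kept.append(msg)                   # the newest turn is always kept
        used += cost
    kept.reverse()                         # restore chronological order
    return system + kept
```

Walking from the newest turn backwards means the cut always falls on the oldest material, which is the part the model is least likely to need.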
The native MLX chat path rebuilt a fresh cache every turn and re-prefilled the whole conversation. This keeps one persistent mlx-lm prompt cache on the worker and reuses the longest matching token prefix across turns: trim the divergent tail, prefill only the new suffix, re-commit keyed by prompt + generated tokens. A single-slot port of mlx-lm server's LRUPromptCache.fetch_nearest_cache.

- New backend_service/mlx_worker_prompt_cache.py: acquire / commit / invalidate. Gated to the native strategy (compression caches keep their path); guarded by can_trim_prompt_cache (SSM/Mamba/rotating-full reset, mlx-lm #980); resets on model change / no common prefix / partial trim / any exception, falling back to a fresh full prefill (identical output, no speedup).
- WorkerState gains _persist_cache / _persist_tokens / _persist_cache_model_ref, invalidated on every load / unload / profile change. generate_standard + stream_generate collect generated token ids (GenerationResponse.token) so the persisted token list always equals the cache's positional contents (exact next-turn trim).

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit, same session): turn 1 promptTokens=34, turn 2 promptTokens=16 (vs ~90 without reuse) with a coherent, context-aware turn-2 answer. That is ~5.6x less prompt processing, with no corruption.

Tests: +12 reuse-logic cases (fake cache, trim accounting, all reset paths); MLX worker suite green (the one failure, dflash runtime-bundle, is the pre-existing orphaned-pin issue).
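The acquire / commit / invalidate cycle can be sketched with a fake cache that only tracks how many positions it holds. Everything here is a hypothetical stand-in: `FakeCache`, `SingleSlotPromptCache`, and their methods are not the module's real API, and the rule of always leaving at least one prompt token to prefill is an assumption of this sketch.

```python
# Hypothetical single-slot prefix-reuse sketch; not the real mlx_worker_prompt_cache.
def common_prefix_len(a, b):
    """Length of the longest shared prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class FakeCache:
    """Stand-in for a trimmable KV cache; tracks only its position count."""

    def __init__(self):
        self.offset = 0

    def extend(self, n):
        self.offset += n

    def trim(self, n):
        n = min(n, self.offset)
        self.offset -= n
        return n  # number of positions actually removed


class SingleSlotPromptCache:
    def __init__(self, make_cache=FakeCache):
        self._make_cache = make_cache
        self.cache = None
        self.tokens = []

    def invalidate(self):
        self.cache = None
        self.tokens = []

    def acquire(self, prompt_tokens):
        """Return (cache, suffix): the cache to use and the tokens left to prefill."""
        if self.cache is not None:
            shared = common_prefix_len(self.tokens, prompt_tokens)
            # Assumption: always leave one token to prefill so logits are produced.
            shared = min(shared, len(prompt_tokens) - 1)
            if shared > 0:
                excess = len(self.tokens) - shared
                if self.cache.trim(excess) == excess:   # partial trim -> reset
                    self.tokens = self.tokens[:shared]
                    return self.cache, list(prompt_tokens[shared:])
        # No usable prefix: fall back to a fresh full prefill.
        self.invalidate()
        self.cache = self._make_cache()
        return self.cache, list(prompt_tokens)

    def commit(self, prompt_tokens, generated_tokens):
        """Record the token list that now matches the cache's contents."""
        self.tokens = list(prompt_tokens) + list(generated_tokens)
```

Keeping the token list in lockstep with the cache contents is what makes the next turn's trim exact; any mismatch would have to be treated as a reset.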
Tier 2 added DRY/XTC/top-n-sigma at the engine layer, but nothing populated them from a request, so they were unreachable. Complete the chain:

- GenerateRequest gains xtcProbability / xtcThreshold / dryMultiplier / dryBase / dryAllowedLength; _build_sampler_overrides maps them to the engine-side snake_case keys (llama-server forwards all via _LLAMA_SAMPLER_KEYS; mlx-lm applies XTC via make_sampler and ignores DRY).
- SamplerPanel gains xtc_probability / xtc_threshold / dry_multiplier rows; the SamplerOverrides type + samplerOverrides storage/serialize carry the new fields.

XTC adds creative variety (both engines); DRY kills verbatim repetition loops better than repeat_penalty (llama.cpp). Both default off.

Tests: backend mapping + frontend projection/round-trip; tsc + vitest green.
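The request-to-engine mapping described above might look roughly like this. The dataclass and function below are illustrative stand-ins for `GenerateRequest` and `_build_sampler_overrides`, whose real definitions are not shown here; only the field and key names come from the commit message.

```python
# Hypothetical sketch of mapping camelCase request fields to engine-side keys.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GenerateRequestSketch:
    """Stand-in for the request model; None means 'not set by the client'."""
    xtcProbability: Optional[float] = None
    xtcThreshold: Optional[float] = None
    dryMultiplier: Optional[float] = None
    dryBase: Optional[float] = None
    dryAllowedLength: Optional[int] = None


_FIELD_TO_ENGINE_KEY = {
    "xtcProbability": "xtc_probability",
    "xtcThreshold": "xtc_threshold",
    "dryMultiplier": "dry_multiplier",
    "dryBase": "dry_base",
    "dryAllowedLength": "dry_allowed_length",
}


def build_sampler_overrides(request):
    """Return only the overrides the client actually set, in snake_case."""
    overrides = {}
    for field_name, engine_key in _FIELD_TO_ENGINE_KEY.items():
        value = getattr(request, field_name, None)
        if value is not None:
            overrides[engine_key] = value
    return overrides
```

Leaving unset fields out of the result, rather than sending zeros, is one way to keep both samplers off by default.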
…ites)

The chat SSE stream emitted one json.dumps + frame per token. Batch visible token text in the standard (non-tool) stream path and flush on a 24-char / 50ms window, before any non-token event (reasoning / reasoningDone / panic / thermal / error), and at stream end. Disabled when per-token logprobs are requested (they must stay 1:1 aligned); the agent/tool path is unchanged.

full_text accumulation + the per-token runaway / stall / loop guards are untouched, so persisted output + abort behaviour are identical; the frontend reassembles the same text from larger frames.

Live-validated (mlx-community/Qwen2.5-0.5B-Instruct-4bit): a ~40-token reply streamed as 9 token frames (avg ~20 chars) instead of ~40, text fully coherent + correctly ordered (phase -> tokens -> done). ~4-5x fewer frames.
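A coalescer with the 24-char / 50ms window described above could be sketched like this. The class name, method names, and injectable clock are assumptions made for illustration; note that in this sketch the time limit is only checked when a new token arrives, so the caller still has to call `flush()` before non-token events and at stream end.

```python
# Hypothetical token coalescer sketch; not the project's actual class.
import time


class TokenCoalescer:
    def __init__(self, max_chars=24, max_delay_s=0.050, clock=time.monotonic):
        self._max_chars = max_chars
        self._max_delay_s = max_delay_s
        self._clock = clock        # injectable so tests can control time
        self._parts = []
        self._size = 0
        self._started = None

    def push(self, text):
        """Buffer token text; return the combined text when a flush is due."""
        if not text:
            return None
        if not self._parts:
            self._started = self._clock()   # window starts at first buffered token
        self._parts.append(text)
        self._size += len(text)
        too_big = self._size >= self._max_chars
        too_old = self._clock() - self._started >= self._max_delay_s
        return self.flush() if (too_big or too_old) else None

    def flush(self):
        """Return and clear any buffered text, or None if the buffer is empty."""
        if not self._parts:
            return None
        out = "".join(self._parts)
        self._parts.clear()
        self._size = 0
        self._started = None
        return out
```

Because the buffer only ever concatenates in arrival order, the client reassembles exactly the same text from fewer, larger frames.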
…dribble

The token coalescer landed on the standard stream path, but the agent/tool path still emitted one frame per token, and the agent loop fake-streamed the already-computed final answer 4 chars at a time. Now the agent token forwarding uses the same coalescer (flush before toolCallStart / toolCallResult), and the agent loop emits the final answer in 48-char chunks instead of 4: less fake latency + fewer yields. Tool-call detection + execution flow is unchanged.

Tests: test_agent + test_backend_service green; identical coalescer mechanics to the live-validated standard path.
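The two agent-path behaviours above, flushing buffered text before tool events and emitting the final answer in 48-char chunks, can be sketched as below. Both functions and the `(kind, data)` event shape are hypothetical, and this simplified version batches by size only, leaving out the 50ms time window.

```python
# Hypothetical sketch of agent-path forwarding; event shape is an assumption.
def forward_agent_events(events, max_chars=24):
    """Coalesce token events and flush before any non-token event."""
    buf = ""
    for kind, data in events:
        if kind == "token":
            buf += data
            if len(buf) >= max_chars:
                yield ("token", buf)
                buf = ""
            continue
        if buf:
            # Flush so text always precedes toolCallStart / toolCallResult.
            yield ("token", buf)
            buf = ""
        yield (kind, data)
    if buf:
        yield ("token", buf)       # flush at stream end


def chunk_text(text, size=48):
    """Split an already-computed answer into fixed-size chunks."""
    for start in range(0, len(text), size):
        yield text[start:start + size]
```

Flushing ahead of every tool event preserves the ordering the frontend already relies on, so only the frame count changes.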
Chat-LLM performance & quality improvements
Outcome of a review of the chat text-generation path (MLX + llama.cpp + orchestration). Both engines now reuse KV across turns, modern samplers are reachable, long chats are bounded, and the SSE stream emits roughly 4-5x fewer frames.
Commits
- 154d6ac: …--cache-reuse 256 + cache_prompt) + flash-attn now emitted (was stored, never used) + DRY/XTC/top-n-sigma sampler keys; MLX repeat_penalty fix (was silently dropped) + /v1 sampler parity
- aadb024: …<think> reasoning every turn
- 4cbb300
- f21a233
- 739db71
- 8011203

Intentionally not done
- /v1 history windowing: silently dropping an OpenAI-API client's messages violates API expectations; the client owns its context, so windowing belongs in our UI (covered by the history window).

Validation
- …DFlashCompatibilityTests: orphaned dflash-mlx pin, FU-057; proven to fail identically on the base commit).
- tsc clean · vitest 455/455 · i18n 100% all locales.

Safety
Every perf path has a hard fallback to prior behaviour: MLX cache resets on any model change / no-prefix / partial-trim / exception; coalescing is disabled when logprobs are requested; flash-attn honours the user's fused_attention flag; samplers are forward-only (old binaries ignore unknown keys).

🤖 Generated with Claude Code