Releases: Anbeeld/beellama.cpp
v0.3.1
Changelog
- Merged latest upstream llama.cpp master. This pulls in Gemma 4 12B and Gemma 4 unified multimodal support fixes, including non-causal vision, unified audio/vision projector handling, and FPE fixes; Qwen3.5 post-norm hidden-state behavior for MTP; CUDA KV-cache quantization preallocation and PDL race fixes; WebGPU FlashAttention refactoring with standardized quantization support; CPU backend improvements for RVV/SVE; lower-latency Metal command-buffer status polling; Mermaid diagram rendering and preview support in tools/ui; updated BoringSSL, SYCL documentation, save/load-state tests, Docker docs, and small CI/release maintenance.
- Repaired CUDA fused TurboQuant FlashAttention for same-type turbo2, turbo3, and turbo4 K/V caches. The fused MMA path now loads each supported format correctly, while mixed TurboQuant and TCQ pairs stay on the established non-fused paths; TurboQuant/TCQ partial KV offload now fails early instead of falling back to an incompatible CPU cache and reaching a scheduler crash. Added GGML_TURBO_FA_DEBUG=1 path diagnostics and regression coverage for the supported dispatch matrix.
- Updated release packaging and documentation. HIP/ROCm builds now include all quantized FlashAttention combinations, and the prebuilt binary and Docker image lists reflect the current release outputs.
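The dispatch rule in the TurboQuant FlashAttention item can be pictured as a small decision function. This is a minimal Python sketch, not the CUDA implementation: the function name `choose_fa_path` and the returned labels are hypothetical, and only the cache type names come from these notes. The notes do not say how same-type TCQ pairs are routed; the sketch assumes they are non-fused.

```python
# Hypothetical sketch of the K/V dispatch rule described in the changelog.
# Only the cache type names are taken from the release notes.
TURBO = {"turbo2", "turbo3", "turbo4"}
TCQ = {"turbo2_tcq", "turbo3_tcq"}
TURBO_OR_TCQ = TURBO | TCQ


def choose_fa_path(type_k: str, type_v: str) -> str:
    """Return which FlashAttention path a K/V cache type pair would take."""
    if type_k == type_v and type_k in TURBO:
        return "fused"      # same-type TurboQuant pair: fused MMA path
    if type_k in TURBO_OR_TCQ or type_v in TURBO_OR_TCQ:
        # Mixed TurboQuant/TCQ pairs stay non-fused; same-type TCQ pairs
        # landing here is an assumption of this sketch.
        return "non-fused"
    return "standard"       # other cache types are outside this rule
```

For example, `choose_fa_path("turbo3", "turbo3")` selects the fused path, while `choose_fa_path("turbo3", "turbo4")` does not.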
macOS:
Linux:
- Ubuntu x64 CPU
- Ubuntu arm64 CPU
- Ubuntu x64 CUDA 12.4
- Ubuntu x64 CUDA 13.1
- Ubuntu x64 Vulkan
- Ubuntu x64 ROCm 7.2
- Ubuntu x64 SYCL
Windows:
- Windows x64 CPU
- Windows x64 SYCL
- Windows x64 CUDA 12.4 - DLLs
- Windows x64 CUDA 13.1 - DLLs
- Windows x64 HIP
Docker:
- CPU: docker pull ghcr.io/anbeeld/beellama.cpp:server-cpu-v0.3.1
- CUDA: docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.1
- CUDA 12: docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda12-v0.3.1
- CUDA 13: docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda13-v0.3.1
- ROCm: docker pull ghcr.io/anbeeld/beellama.cpp:server-rocm-v0.3.1
- Vulkan: docker pull ghcr.io/anbeeld/beellama.cpp:server-vulkan-v0.3.1
- SYCL: docker pull ghcr.io/anbeeld/beellama.cpp:server-sycl-v0.3.1
v0.3.0
Changelog
- Updated to a much newer llama.cpp base. The upstream refresh brings native MTP speculative decoding, parallel drafting and backend sampling work, the unified llama app, newer server/API behavior, the tools/ui Web UI restructure, model and converter additions, multimodal improvements, and backend gains across CUDA/HIP, Metal, Vulkan, SYCL, OpenCL, WebGPU, Hexagon, ZenDNN, and SpacemiT.
- Added usable MTP serving on the Bee tree. draft-mtp/mtp now has its own speculative path instead of being mixed with DFlash state, uses draft KV cache types, cleans up draft resources on sleep, works with target pre-norm hidden-state capture, and handles text requests when an mmproj is loaded.
- Cleaned up DFlash command-line behavior around canonical --spec-* arguments. DFlash can still be selected with --spec-type dflash or auto-detected from compatible draft GGUF metadata, but stale aliases such as --spec-dflash-default, --draft*, --draft-topk, --draft-model, --tree-budget, --dflash-max-slots, --spec-draft-replace, and --spec-replace were removed.
- Made DFlash defaults match the new upstream speculative surface without changing the practical DFlash defaults. Raw --spec-draft-n-max stays at upstream 3, DFlash raises the effective omitted draft max to 16 only after explicit or auto-detected DFlash, omitted --spec-draft-ctx-size still becomes 256, and DFlash no longer lowers target -b 2048 / -ub 512 unless the user asks for smaller batches.
- Made DFlash-only arguments safe around non-DFlash modes. DFlash-only controls warn and no-op for MTP or other speculative types, while still surviving long enough for server-side DFlash draft-model auto-detection when no explicit --spec-type dflash was passed.
- Expanded multi-slot DFlash serving. DFlash slots now default to server parallelism, --spec-dflash-max-slots caps them below -np when needed, uneven -np target batches are split safely, mixed speculative/non-speculative target batches are avoided, and flat multi-slot DFlash can use shared drafter batching with GGML_DFLASH_SHARED_DRAFT_BATCH=0 as the fallback switch.
- Reworked adaptive DFlash draft depth for live serving. The default profit controller now seeds and periodically remeasures a no-spec baseline, probes shallow/mid/full positive depths, backs off failed wake probes, avoids premature demotion, resets per-request state correctly, preserves compatible continuation state, and gates timing logs behind DFlash profiling.
- Added default-on device-aware DFlash GPU capture/tape/replay for split CUDA/ROCm target placement. Hidden capture, prefill capture, recurrent tape, conv replay, direct GDN replay, rollback copies, and synchronization now follow each layer's backend device, with CPU/eval-callback fallback and GGML_DFLASH_MULTI_GPU_TAPE=0 / GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0 kill switches.
- Hardened DFlash device placement. Explicit single draft-device placement pins the target output tensor before target load, auto-detected DFlash drafters stay single-device by default unless draft devices are explicit, iGPU backends are accepted for GPU paths, tensor-split/meta placement is guarded, CUDA helper calls preserve the caller device, and peer D2D copies are used where available.
- Hardened DFlash hidden-ring and prompt-cache state. The server guards GPU hidden-ring spans, discards stale DFlash ring checkpoints, skips DFlash checkpoints for uncached prompts, restores shared prefill capture state, and avoids requesting raw prompt logits from DFlash paths that should not read them.
- Reduced DFlash accept and verification overhead while failing closed on bad reduced-logit state. The accept path defers single-slot rollback sync and drafter KV maintenance, keeps the fused GDN 4D state fast path, adds CPU f16 out_prod fallback, and disables DFlash drafting after repeated invalid reduced-logits drafts instead of looping or corrupting state.
- Improved DFlash tool-call and reasoning behavior. DFlash can keep drafting before a lazy tool-call grammar actually constrains output, stale drafter KV state is cleared before long tool-call continuations, stable partial tool-call headers stream earlier, streaming reasoning deltas stay isolated, and streamed title generation suppresses leading thinking syntax.
- Kept flat DFlash available with multimodal serving while making unsupported combinations explicit. With --mmproj, Bee keeps flat DFlash usable, forces DFlash tree branch budget to 0, disables non-DFlash speculative modes, disables unsupported context-shift/cache-reuse paths, and fixes MTP text-only requests when an mmproj is present.
- Updated Qwen 3.5/3.6 speculative paths. Qwen gets per-layer KV heads, final-layer output gathering, stable MTP draft context behavior, and Qwen DDTree conv/GDN paths for tree verification.
- Removed legacy DDTree total-node semantics from the public CLI. Tree DFlash now uses the branch-only --spec-branch-budget model consistently, with --spec-draft-top-k controlling candidates per draft position and flat DFlash forcing top-k back to 1.
- Added new cache and quantization surface beyond v0.2.0. q6_0 is available as a KV/cache type, and Tom's TQ3_1S / TQ4_1S model weight formats are exposed through llama-quantize with non-conflicting serialized GGML type IDs; existing turbo2, turbo3, turbo4, turbo2_tcq, and turbo3_tcq cache types remain available on the newer base.
- Hardened backend behavior used by Bee features. HIP TCQ attention stays on the native vector path, ROCm can probe fused GDN support, D512 Flash Attention selection is available again, unsupported CPU BF16 scale and CPU Flash Attention types are rejected instead of misrunning, CUDA fatal-warning cases were fixed, and the build now requires C++17 for the common base.
- Improved server and API behavior inherited from upstream and Bee integration. The server reports prompt token counts in /slots, supports SSE ping intervals and HTTP ETags, exposes real-time reasoning interruption, handles router/model metadata fixes, adds built-in tools such as datetime, and keeps malformed tool-looking text out of final responses.
- Updated packaging and release workflows. Docker image links and labels were aligned for BeeLlama, SYCL package/release images were added and documented, release builds were split and cached more reliably, stale package builds are cancelled, and obsolete CUDA architecture options are rejected at configure time.
- Updated user documentation for the new release state, including DFlash args/defaults, removed aliases, adaptive Draft-Max, multi-GPU DFlash behavior, Turbo/TQ cache and weight formats, Docker images, and the upstream multi-GPU guide.
- Expanded regression coverage around DFlash default normalization, removed arg aliases, DFlash-only no-op behavior, DFlash auto-detection, multi-GPU policy helpers, per-layer capture/tape allocation, device-aware replay, CUDA device restoration, adaptive Draft-Max, reduced-logit failure handling, and tool-call/speculative boundaries.
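The DFlash default normalization described in the v0.3.0 changelog can be summarized as a small resolution step. This is a minimal sketch under stated assumptions: the function and constant names are hypothetical, the numeric values (3, 16, 256) come from the notes, and the non-DFlash draft context is left unresolved because the notes do not state it.

```python
from typing import Optional, Tuple

UPSTREAM_DRAFT_N_MAX = 3   # raw upstream default, per the notes
DFLASH_DRAFT_N_MAX = 16    # effective default once DFlash is active
DFLASH_DRAFT_CTX = 256     # draft context used when none is passed


def effective_draft_settings(
    dflash_active: bool,
    draft_n_max: Optional[int] = None,
    draft_ctx_size: Optional[int] = None,
) -> Tuple[int, Optional[int]]:
    """Resolve omitted values; explicit user values always win.

    dflash_active means DFlash was selected explicitly or auto-detected.
    """
    if draft_n_max is None:
        draft_n_max = DFLASH_DRAFT_N_MAX if dflash_active else UPSTREAM_DRAFT_N_MAX
    if draft_ctx_size is None and dflash_active:
        draft_ctx_size = DFLASH_DRAFT_CTX
    # For non-DFlash modes an omitted context size stays None here
    # (assumption: it is left to upstream handling).
    return draft_n_max, draft_ctx_size
```

The point of resolving defaults late is that the raw argument value stays at the upstream default until the server actually knows DFlash is in use.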
macOS:
Linux:
Windows:
- Windows x64 CPU
- Windows x64 SYCL
- Windows x64 CUDA 12.4 - DLLs
- Windows x64 CUDA 13.1 - DLLs
- Windows x64 HIP
Docker:
- CPU: docker pull ghcr.io/anbeeld/beellama.cpp:server-cpu-v0.3.0
- CUDA: docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda-v0.3.0
- CUDA 12: docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda12-v0.3.0
- CUDA 13: docker pull ghcr.io/anbeeld/beellama.cpp:server-cuda13-v0.3.0
- ROCm: docker pull ghcr.io/anbeeld/beellama.cpp:server-rocm-v0.3.0
- Vulkan: docker pull ghcr.io/anbeeld/beellama.cpp:server-vulkan-v0.3.0
- SYCL: docker pull ghcr.io/anbeeld/beellama.cpp:server-sycl-v0.3.0
v0.2.0
Changelog
- Added compatibility with upstream DFlash PR drafter GGUFs that use general.architecture = dflash. Bee now keeps this separate from the older dflash-draft schema, understands upstream metadata keys such as dflash.block_size and dflash.target_layer_ids, reads upstream tensor names, and keeps existing Bee/buun draft GGUF naming intact.
- Tightened DFlash draft model discovery and converter behavior. Bee now prefers exact sibling DFlash draft directories, supports nested dflash_config metadata, scopes Gemma4 tokenizer handling correctly, and logs clearer DFlash metadata warnings and summaries during conversion.
- Hardened recurrent memory, prompt-cache restore, and unified-KV scheduling. Recurrent resize now repairs its metadata after shrink/expand, the server shrinks recurrent state before prompt-cache save/load when it is safe, backup-sequence cleanup is tracked correctly, and non-parent tasks defer unified-KV admission so large pending prompts do not over-commit shared cells.
- Added richer DFlash diagnostics, profiling, and validation. GGML_DFLASH_PROFILE now exposes categorized summary/replay/copy/prefill/verify/trace logging, routine decode timing is hidden behind debug logging instead of always printing, the profit controller now logs when it disables speculative depth, drafter/target contract and input validation are stricter, and Bee also exposes targeted debug envs such as GGML_DFLASH_DEBUG, GGML_DFLASH_INPUT_DEBUG, GGML_DFLASH_CUDA_DEBUG, GGML_DFLASH_FORCE_CPU_CROSS, GGML_DFLASH_VERBOSE_CONTRACT, and GGML_DFLASH_CRASH_TRACE.
- Improved DFlash CUDA ordering and split-buffer correctness. Hidden capture, recurrent replay, backup copies, K/V projection-cache updates, and DFlash stream waits now use explicit ordering helpers and safer backend ownership checks instead of broader synchronization or wrong-buffer access.
- Added DFlash drafter K/V projection caching for the cross-attention window. Bee now keeps ring-backed drafter K/V state for recent target hidden-state windows, supports chronological D2D append/interleave on CUDA, excludes the unsafe parts from graph capture when needed, and falls back more safely on placements that cannot use the fast GPU path.
- Reworked DFlash prefill capture and flush handling. Prefill capture now uses per-slot and per-view plans, GPU staging buffers, source-aware CPU/GPU ring validity, suffix-span tracking across internal ubatches, graph-reuse keys for source/destination offsets, callback suppression for irrelevant ubatches, and fail-closed behavior for partial or mismatched captures.
- Hardened target hidden-state capture across Qwen3.5, Qwen3.5-MoE, Gemma4-ISWA, hidden-only contexts, GPU tape, and multi-slot GPU cross data. Capture layer assignment, token-count derivation, callback routing, and GPU multi-slot cross collection now have explicit correctness checks.
- Reduced greedy DFlash verification overhead and made verifier control stricter. Eligible verify batches can use reduced top-k logits without raw-logit readback, Bee keeps seed-row alignment correct, the flat verify horizon is capped, server-side depth control is authoritative, and the reduced path falls back when grammar, sampler, or reasoning state requires full logits.
- Hardened DFlash reasoning, draft, and suffix handling. Reasoning-end forcing now goes through the normal full-logits path when needed, invalid reduced-logits drafts are rejected instead of crashing or looping, empty drafts fall back safely, accepted-prefix full-KV commits respect the drafter window, explicit --spec-draft-ctx-size overrides are tracked correctly, Bee keeps the DFlash auto --cd 256 default path when no draft ctx is passed, and the drafter stays aligned with the live accepted suffix.
- Improved Gemma 4 support substantially. Bee added Gemma4-ISWA DFlash target plumbing and profiling callbacks, ported the cleaner upstream Gemma4 graph and loader path back onto Bee hooks, restored Bee precision behavior where needed, synced SWA max-position authority and 512-dim FlashAttention selection with upstream, and fixed Gemma multimodal image decode and dynamic resize bounds.
- Extended CUDA kernel coverage and backend hardening. Bee now keeps 512-wide quantized FlashAttention instances for standard and TurboQuant/TCQ KV combinations, syncs upstream Hadamard rotation plumbing, propagates CUDA driver links correctly, and hardens op-table / Gated DeltaNet integration alongside long-context GPU ring stability fixes.
- Reduced peak memory in the perplexity tool and fixed streaming perplexity / KLD cache handling. Streaming perplexity now writes bounded chunks, checks stream errors, avoids retaining unbounded logits for long-context KL runs, and keeps the logits-cache format versioning compatible with the legacy magic.
- Completed the malformed tool-call guard path for non-stream responses. Final OpenAI-compatible responses now quarantine malformed raw tool-looking text the same way streamed tool-parsing responses already did.
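The reduced-logits verification item in this changelog relies on a property of greedy decoding: only the target's top-1 token per row is needed to decide acceptance. The following is a generic sketch of greedy speculative verification, not Bee's verifier; the function name and argument layout are assumptions made for illustration.

```python
from typing import List, Tuple


def greedy_accept(draft: List[int], target_argmax: List[int]) -> Tuple[List[int], int]:
    """Greedy speculative verification.

    draft[i] is the drafter's proposal for position i.
    target_argmax[i] is the target's top-1 token for position i, computed
    with draft[:i] as context, so it must have len(draft) + 1 entries.
    Returns the accepted draft prefix and the target's own next token.
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have len(draft) + 1 entries")
    n = 0
    # Accept draft tokens while they match what the target would have picked.
    while n < len(draft) and draft[n] == target_argmax[n]:
        n += 1
    # The target's token at the first mismatch (or after a full match)
    # is emitted in addition to the accepted prefix.
    return draft[:n], target_argmax[n]
```

Because nothing beyond the arg-max is consulted, a top-k slice of the logits is enough here; grammar, sampler, or reasoning state that needs the full distribution is what forces the fallback mentioned in the notes.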
Windows:
v0.1.2
Changelog
- Fixed the adaptive profit controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
- Fixed profit-controller reset handling across context-bucket and configuration changes so cleared baseline telemetry cannot leave the controller in a stale active or off state.
- Added low-frequency profit-controller baseline reprobes with --spec-dm-profit-baseline-interval / LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL so runs can refresh target-only timing as context grows. The default interval is 1024 active speculative cycles; reprobes resume the previous active draft depth and avoid off-probe counter starvation.
- Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
- Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
- Added clearer diagnostics and regression coverage for multi-GPU DFlash fallback decisions, CUDA graph buffer visibility, wrong-device async tensor access, active-reasoning reduced-sampling rejection, adaptive DM defaults, and profit-controller baseline behavior.
- Fixed ROCm 7 build: added cudaPointerAttributes / cudaMemoryType shim aliases to hip.h, extended CUDART_VERSION >= 10000 guards with || defined(GGML_USE_HIP) so the .type field path is taken on HIP, and removed the WIN32 guard around TurboQuant flash-attention instance compilation so Linux ROCm builds include the turbo KV-cache kernels (acerspyro#11).
- Known limitation: the current multi-GPU DFlash path is a correctness fallback, not a performant split-GPU implementation. On split target placement it can be slower than non-speculative decoding because recurrent replay and hidden capture avoid unsafe single-backend GPU fast paths. A performant implementation still needs per-device replay graphs or a scheduler that follows ggml's split-buffer ownership model.
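The baseline-seeding and reprobe behavior of the profit controller described in this changelog can be illustrated with a toy state machine. This is a sketch only: the class, method, and mode names are hypothetical, throughput is reduced to a single tokens-per-second number, and wake probes from the off state are not modeled. Only the default interval of 1024 comes from the notes.

```python
from typing import Optional


class ProfitSketch:
    """Toy model of a controller that compares speculative throughput
    against a no-spec baseline (hypothetical names throughout)."""

    def __init__(self, baseline_interval: int = 1024) -> None:
        self.baseline_interval = baseline_interval
        self.cycles_since_baseline = 0
        self.baseline_tps: Optional[float] = None  # tokens/s without speculation
        self.spec_tps: Optional[float] = None      # tokens/s with speculation

    def next_mode(self) -> str:
        """Decide what the next decode cycle should do."""
        if self.baseline_tps is None:
            return "baseline"   # seed the baseline before any positive depth
        if self.cycles_since_baseline >= self.baseline_interval:
            return "baseline"   # periodic reprobe as context grows
        if self.spec_tps is not None and self.spec_tps <= self.baseline_tps:
            return "off"        # measured baseline wins
        return "spec"

    def record(self, mode: str, tokens_per_s: float) -> None:
        """Store the measurement taken while running in the given mode."""
        if mode == "baseline":
            self.baseline_tps = tokens_per_s
            self.cycles_since_baseline = 0
        elif mode == "spec":
            self.spec_tps = tokens_per_s
            self.cycles_since_baseline += 1
```

The key property the sketch preserves is that no speculative decision is made before a baseline sample exists, and that the baseline is refreshed after a fixed number of active speculative cycles.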
Windows:
v0.1.1
Changelog
- Improved agentic tool-call reliability with lazy grammars. DFlash now remains enabled before a lazy grammar trigger, but stops speculating once grammar-constrained output or reasoning-budget forcing requires normal token-by-token sampling.
- Fixed DFlash accept bookkeeping at grammar and tool-call boundaries. The server now distinguishes accepted draft tokens from bonus-token-shaped results, updates DFlash hidden-state rows with the root plus accepted draft tokens, and uses the same keep count for rollback.
- Added a DFlash suppression guard for raw tool-call markers. When a tool marker appears while lazy grammar is enabled, the server suppresses DFlash for the rest of that response without steering sampler state; fenced code and embedded marker-like strings are excluded from the guard.
- Made partial OpenAI-compatible tool-call streaming safer. The server can stream a stable tool name/id early so clients can show a pending tool call, while withholding partial arguments until the parser sees a complete call.
- Quarantined malformed raw tool-call text in tool-parsing streams. Unfinished or malformed tool-looking text no longer leaks into visible assistant content or hidden reasoning deltas before the parser can classify it.
- Accepted direct tag-style function starts for Qwen-style tool calls. Lazy grammar triggers now include structural function markers such as <function=, and the tag parser can parse valid direct function calls without the outer <tool_call> wrapper.
- Added regression coverage for Kimi and Qwen tool-call streaming, malformed raw marker quarantine, fenced-code false positives, direct Qwen function calls, lazy grammar triggers, and DFlash speculative boundary plumbing.
- Fixed small build issues found after 0.1.0: the DFlash callback setup now uses an explicit callback type for GCC 15, and tests/server code include the required standard headers for INT_MAX and FLT_MAX.
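The raw tool-call marker guard described in this changelog excludes fenced code and embedded marker-like strings. A minimal sketch of such a check follows; it is not the server's detector. The function name is hypothetical, the two marker strings are the ones named in the notes, and treating a marker as "raw" only when it starts a line is an assumption of this sketch.

```python
FENCE = "`" * 3  # Markdown code fence delimiter
MARKERS = ("<tool_call>", "<function=")  # marker strings named in the notes


def has_raw_tool_marker(text: str) -> bool:
    """True if a marker starts a line outside any fenced code block."""
    in_fence = False
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
            continue
        # Markers inside fences or in the middle of a sentence are ignored.
        if not in_fence and stripped.startswith(MARKERS):
            return True
    return False
```

A bare line starting with `<function=` trips the guard, while the same marker quoted inside a fenced example or mentioned mid-sentence does not.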
Windows:
v0.1.0
Changelog
- DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer, the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
- TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning from 4x to 7.5x compression, with higher-bit options being practically lossless in many cases. Set independently with --cache-type-k and --cache-type-v.
- Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth. Use --no-spec-dm-adaptive for a static horizon.
- Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation. The model can be fully offloaded to CPU with no problems to reduce VRAM pressure.
- Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. Default mode is force-close with --reasoning-loop-window and --reasoning-loop-max-period tuning available.
- Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.
- DDTree branch verification: optional --spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!
- Request-level speculative overrides: Draft-max and branch budget can be overridden per-request through JSON fields without restarting the server.
- CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model. Results must be benchmarked per workload.
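The idea behind model-free suffix matching, as in the CopySpec item, can be shown in a few lines. This sketch is not the CopySpec implementation: it indexes k-grams in a dictionary of tuples rather than using a rolling hash, and the function name and the parameters `k` and `max_draft` are hypothetical.

```python
from typing import Dict, List, Tuple


def copy_draft(tokens: List[int], k: int = 3, max_draft: int = 8) -> List[int]:
    """Propose a draft by copying what followed the most recent earlier
    occurrence of the last k tokens. Returns [] when there is no match."""
    if len(tokens) <= k:
        return []
    last_seen: Dict[Tuple[int, ...], int] = {}
    # Index every k-gram except the final one (the suffix itself);
    # later occurrences overwrite earlier ones.
    for start in range(len(tokens) - k):
        last_seen[tuple(tokens[start:start + k])] = start
    match = last_seen.get(tuple(tokens[-k:]))
    if match is None:
        return []
    follow = match + k
    return tokens[follow:follow + max_draft]
```

For instance, with history `[1, 2, 3, 4, 5, 1, 2, 3]` and `k=3`, the suffix `1, 2, 3` was seen at the start, so the proposed draft begins with `4, 5`. How often such copies are accepted depends heavily on how repetitive the workload is, which is why the notes ask for per-workload benchmarking.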
Windows: