feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree by easel · Pull Request #285 · Luce-Org/lucebox-hub

easel · 2026-05-27T17:59:21Z

This PR turns Lucebox into a one-command local inference deployment and ships the two tools that operate it: lucebox (the host CLI that runs and tunes the server) and luce-bench (the benchmark + grading framework that measures it). All three ship together so a fresh box goes from nothing to a tuned, benchmarked server with a single install.

The three pieces, what each is, and how to use it:

1. Docker — the server image

A CUDA 12.8 image (ghcr.io/luce-org/lucebox-hub:cuda12) that builds the dflash server and bundles server/, lucebox/, harness/, and luce-bench/. The entrypoint dispatches serve (default), benchmark, any lucebox subcommand, or shell. An in-container autotune fallback picks VRAM-tiered defaults and resolves the draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6).

Use it directly:

docker run --rm --gpus all -p 8080:8080 \
  -v ~/.local/share/lucebox/models:/opt/lucebox-hub/server/models \
  ghcr.io/luce-org/lucebox-hub:cuda12
# OpenAI + Anthropic-compatible API on :8080
curl -s http://localhost:8080/v1/models

Image tags: :cuda12, :vX.Y.Z-cuda12, :X.Y-cuda12, :sha-<short>-cuda12. Built and pushed by .github/workflows/docker.yml; docker-bake.hcl has a cuda13 slot ready.

2. `lucebox` — the host CLI

lucebox.sh is the host-side wrapper (deps: docker + nvidia-smi only). It probes the host, writes a tuned config.toml, runs the container as a user-systemd service, and delegates provisioning/workloads to the in-container Python CLI (models, autotune, profile, smoke, config, the client drivers).

Stand a server up:

lucebox check            # driver / docker / NVIDIA Container Toolkit / VRAM / systemd / WSL2 probe
lucebox pull             # docker pull the cuda12 image
lucebox models download  # pull target + DFlash draft GGUFs  (verbs: list, download)
lucebox autotune         # VRAM-tiered DFLASH_* defaults → ~/.lucebox/config.toml  (autotune --sweep picks a winner empirically)
lucebox install          # install the user-systemd unit
lucebox start            # bring it up   (enable = start at every login)
lucebox status           # unit state + the server's startup banner
lucebox logs             # follow the journal
lucebox smoke            # props/tools/http/1-token health check

Tune it to the GPU:

lucebox profile          # level1/2/3 sweep over DFLASH_MAX_CTX × DFLASH_BUDGET ×
                         # {KV type, pFlash mode, lazy-draft, prefix-cache slots},
                         # gated on capability + ds4-eval/agentic validation before
                         # the winner merges into config.toml

The running config is observable at GET /props (schema 4), which now reports a host block — kernel, OS, WSL vs native, driver, CPU, RAM, GPU — so a server self-describes its real config and host.

3. `luce-bench` — the benchmark + grading framework

In-tree workspace member (luce-bench/, 0.2.7.dev0) that scores any OpenAI/Anthropic-compatible endpoint and writes versioned, comparable result files. Areas: smoke, ds4-eval (92 reasoning items), gsm8k, truthfulqa-mc1, hellaswag, code, longctx, agent, agent_recorded, forge. Every result stamps a per-area grader_version and a host block (from /props.host, or a clearly-marked client-side fallback for servers without /props).

Run it:

uvx --from 'git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench' \
  luce-bench --base-url http://localhost:8080 --model dflash --areas all --no-think

Thinking control is portable. Each request carries three control shapes (chat_template_kwargs.enable_thinking, Anthropic thinking:{type}, reasoning_effort). For servers that ignore the API flags (e.g. OpenRouter), --prompt-thinking-control {auto,on,off} (default auto) injects the model family's in-band token (/no_think, /think); auto fires only when /props shows no server-side enforcement. A post-run verifier records thinking_control_honored so a nothink run that secretly reasoned is flagged, not silently mislabeled.

Comparing results: runs from one grader version are comparable as written. For older snapshots graded by a different version, luce-bench regrade <dirs> re-scores stored outputs at the current pinned grader and refuses to place mismatched-version (or mismatched-host) runs in the same row. report / snapshot / submit-baseline round out the reporting surface.

Also in this PR

harness/ — drives real clients (claude_code, codex, opencode, hermes, pi, openclaw) against a running server; lucebox profile delegates bench runs here.
Model-card sidecars — share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json + _schema.json, so the server resolves sampler defaults, thinking budgets, and the force-close hint per model.
Workspace — pyproject.toml declares all members (server, lucebox, luce-bench, harness, optimizations/{megakernel,pflash}); [tool.uv.sources] luce-bench = { workspace = true } replaces the prior git-tag pin. release-luce-bench.yml publishes to PyPI on luce-bench-v* tags.
Docs — README quick start + hardware/env reference; server/docs/ benchmark-snapshot spec and experiment write-ups.
Removes the obsolete server/scripts/bench_*.py (their work now lives in luce-bench).

Out of scope / follow-ups

Gemma 4 31B backend wiring beyond what its model card ships (validated empirically @ 24 GB, AR-only).
gemma4 MoE expert split.
Multi-Token Prediction (upstream, draft).

Validation

uv sync clean on the workspace; luce-bench test suite passes.
Full --areas all sweeps run end-to-end against bragi (RTX 5090 Laptop), sindri (RTX 3090 Ti), vidar (M2 Ultra / MLX), and OpenRouter, think and nothink, all on one grader version.
/props.host confirmed populated on lucebox servers (bragi + sindri report WSL2); OpenRouter nothink confirmed honored via client-side /no_think injection.

easel · 2026-05-27T19:53:47Z

Some commands to test this... copied from the readme.

Install the lucebox wrapper:

curl -fsSL https://raw.githubusercontent.com/easel/lucebox-hub/feat/lucebox-docker/lucebox.sh \
       -o ~/.local/bin/lucebox.sh && chmod +x ~/.local/bin/lucebox.sh

Run lucebox using the docker image

# Override the container image to the temporary build:
export LUCEBOX_IMAGE=ghcr.io/easel/lucebox-hub

# Check your machine for lucebox compatibility
lucebox check

# Start the lucebox server
lucebox serve

Run benchmarks against a local server:

uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --url http://localhost:1236

Run benchmarks against open router

uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --base-url https://openrouter.ai/api --model qwen/qwen3.6-27b --auth-env OPENROUTER_API_KEY

…g#285 CI CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. Ruff --fix handled 6 (import sorting, unused imports); 5 needed hand-edits: luce-bench/src/lucebench/report.py:172 E741 rename `for l in` → `for lineup in` lucebox/tests/test_check.py:39, 95 E731 lambda → def stub() for the two HostFacts stubs lucebox/tests/test_cli.py:95 E501 wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv lucebox/tests/test_sweep.py:174, 177 E501 wrap two CellResult constructors 22 lucebox tests touched still pass; ruff is clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge PR Luce-Org#285 after it changed from draft to open during the cron run. Resolve refreshed Docker/lucebox/luce-bench conflicts by taking the PR head for feature files while preserving the server include required by the existing integration stack.\n\nValidation:\n- git diff --check\n- python3 -m compileall -q lucebox/src lucebox/tests luce-bench/src luce-bench/tests harness/src\n- uv run --with pytest python -m pytest lucebox/tests luce-bench/tests/test_report.py luce-bench/tests/test_smoke_area.py luce-bench/tests/test_runner.py -q

Keep the primary checkout clean after integrating PR Luce-Org#285 by ignoring the generated .docker-build/ CMake scratch directory. Update the auto-integration manifest with the final PR Luce-Org#285 merge and validation details.

cubic-dev-ai

17 issues found across 188 files

_{Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.

On a pro plan you can use ultrareview for larger PRs.

Re-trigger cubic}

Fix the initial PR Luce-Org#285 CI failures by sorting the ruff-reported import block and installing libcurl development headers before the native dflash server CMake build. Refresh the auto-integration manifest with the CI diagnosis.

Integrate the refreshed lucebox Docker stack PR head (09dc0be), which switches Hugging Face metadata lookup to model_info in lucebox download handling.

Record the clean Luce-Org#285 follow-up merge at 09dc0be and the validation performed after updating the stack.

The job-level `permissions` block replaces the workflow-level default entirely, so `actions/checkout` was running without `contents: read` and would fail on protected refs. Add `contents: read` back alongside the existing `id-token: write`. Addresses cubic #1 on PR Luce-Org#285.

- Dockerfile: keep --frozen on the uv sync fallback so the layer can't silently resolve outside the lockfile. - harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty (per-area card defaults govern; --no-think only when explicitly set) and default LUCEBENCH_AREA to the level1 capability gate (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad for routine harness runs. Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.

…appers - .github/workflows/{ci,docker,release-luce-bench}.yml: pin actions/checkout, docker/{setup-buildx,login,metadata,bake}-action, and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments so the supply chain is reproducible (Luce-Org#4). - harness/src/harness/clients/_common.py: replace the external `timeout` shell-out with `subprocess.run(..., timeout=N)`, return 124 on TimeoutExpired to match GNU timeout's exit code (Luce-Org#5). - scripts/build_image.sh: normalize REGISTRY to end in `/` instead of silently producing `ghcr.io/luce-orglucebox-hub` when the trailing slash is missing (Luce-Org#6). - harness/src/harness/clients/pi.py: non-interactive launch now mirrors run_pi.sh's validated invocation (--provider, --print, --mode json, --tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR / PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7). - docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags, since `+` is not a valid Docker tag character (Luce-Org#8). - harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of run_hermes.sh's env wiring and call `chat --provider --model --accept-hooks --yolo --max-turns --source --query` instead of a bare positional prompt (Luce-Org#9, Luce-Org#10). - harness/src/harness/clients/openclaw.py: apply the OpenClaw config patch via `openclaw config patch --file` before the run, and call `agent --local --json --model lucebox/<model> --session-id --timeout --message` instead of a bare positional prompt (Luce-Org#11). - pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server, tool_memory}.py ruff include pins (those paths were renamed during the dflash→server rename and then deleted upstream) (Luce-Org#12). - lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to `**/*.sh` so scripts under nested dirs (harness/clients/*.sh, scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13). Addresses cubic Luce-Org#4–Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in the previous commit alongside the LUCEBENCH_THINK default fix.

- lucebox/README.md: fix the relative link to `cli.py`; resolves to `src/lucebox/cli.py` (the actual location), not the nonexistent `lucebox/cli.py` (Luce-Org#15). - luce-bench/NOTICE: the bundled forge_eval LICENSE says "Copyright (c) 2025-2026 Antoine Zambelli", not 2024 — sync NOTICE with the actual upstream LICENSE (Luce-Org#16). - luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing agent / agent_recorded / forge / longctx / smoke. Add the imports + list entries so `from lucebench.areas import *` matches the actual area surface (Luce-Org#17). Addresses cubic Luce-Org#15–Luce-Org#17 (P3) on PR Luce-Org#285.

Merge the advanced Luce-Org#285 docker/harness follow-up head into the integration stack and record the refreshed PR classification, conflict probes, retained worktrees, and validation results.

…nch in-tree Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single commit on top of origin/main (8782d07). Net: 189 files changed. Workstreams folded in: * Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile with reproducible `uv sync --frozen`, docker-bake.hcl with VERSION sanitization for Docker tag charset, .github/workflows/docker.yml with SHA-pinned external actions and GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + HOST_INFO. * Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard against systemd self-defeat, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default), all shellcheck-clean. * Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so `lucebox update` keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL. * In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → ~150 lines). _load_or_build now respects env > toml > default precedence. * luce-bench: snapshot subcommand + canonical HostInfo schema v2 (multi-GPU lineup, WSL detection, source/collector trust metadata) + levels (level0/1/2/3) + report subcommand (host column + cross-host confounder warnings) + submit-baseline (level3-gated) + regrade. * Server (C++): /props.host block + props_schema=4 + host_info loader, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars. Deleted server/scripts/bench_{agent,he,llm}.py — bench machinery moved into luce-bench. * Harness: client implementations for claude/codex/opencode/hermes/pi pointed at the running lucebox server, matched against the validated run_*.sh shell wrappers. Cubic AI code review (17 findings) addressed in full: P0: contents: read on luce-bench release job permissions. P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty so per-area defaults apply. P2: 6 external actions pinned to immutable SHAs; non-interactive timeout via subprocess.run; REGISTRY trailing-slash normalize; VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw mirrored against run_*.sh wrappers; ruff scan paths corrected to server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default level1. P3: lucebox/README.md cli.py link fixed; NOTICE copyright year 2025-2026; areas/__init__.py __all__ exposes all 10 areas. CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12 prebuild, cubic reviewer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.

Merge the advanced Luce-Org#285 head, carrying Qwen3.6/Laguna started_in_thinking propagation through render, parsing, and SSE emission. Resolve the overlap with the existing Qwen3 Jinja closed-think fallback and refresh the auto-integration manifest.

Document the post-push discovery of PR Luce-Org#285's 9a6db60 head, its clean merge into auto-integration, targeted luce-bench validation, and retained worktree paths.

cubic-dev-ai

3 issues found across 12 files (changes from recent commits).

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="luce-bench/src/lucebench/cli.py">

<violation number="1" location="luce-bench/src/lucebench/cli.py:692">
P2: `--prompt-thinking-control=on` can be silently disabled by the new card-capability gate, so explicit user override no longer works on unresolved/non-capable cards.</violation>
</file>

<file name="luce-bench/src/lucebench/runner.py">

<violation number="1" location="luce-bench/src/lucebench/runner.py:682">
P2: Continuation request can unintentionally re-enable thinking because `extra_body` overwrites forced disable flags.</violation>
</file>

_{Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic}

cubic-dev-ai · 2026-05-29T21:10:08Z

+    # tokens for a thinking-capable card; otherwise force the flag off so
+    # neither the card nor the family-map fallback injects.
+    effective_thinking_control = (
+        prompt_thinking_control if card_is_thinking_capable(model_card) else "off"


P2: --prompt-thinking-control=on can be silently disabled by the new card-capability gate, so explicit user override no longer works on unresolved/non-capable cards.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/cli.py, line 692: <comment>`--prompt-thinking-control=on` can be silently disabled by the new card-capability gate, so explicit user override no longer works on unresolved/non-capable cards.</comment> <file context> @@ -665,6 +685,13 @@ def _run_standard_area_to_dir( + # tokens for a thinking-capable card; otherwise force the flag off so + # neither the card nor the family-map fallback injects. + effective_thinking_control = ( + prompt_thinking_control if card_is_thinking_capable(model_card) else "off" + ) + </file context>

cubic-dev-ai · 2026-05-29T21:10:08Z

+    if top_k is not None and top_k > 0:
+        body["top_k"] = int(top_k)
+    if extra_body:
+        body.update(extra_body)


P2: Continuation request can unintentionally re-enable thinking because extra_body overwrites forced disable flags.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/runner.py, line 682: <comment>Continuation request can unintentionally re-enable thinking because `extra_body` overwrites forced disable flags.</comment> <file context> @@ -384,7 +618,88 @@ def run_case( + if top_k is not None and top_k > 0: + body["top_k"] = int(top_k) + if extra_body: + body.update(extra_body) + + headers = {"Content-Type": "application/json"} </file context>

Suggested change

body.update(extra_body)

body.update(

{

k: v

for k, v in extra_body.items()

if k

not in {

"messages",

"max_tokens",

"stream",

"chat_template_kwargs",

"thinking",

"reasoning_effort",

}

}

)

Replaces the single-shot recorded harness with a multi-turn replay, adds the grading/ subpackage with the llm_judge, multi_turn_cases fixture, agent_recorded test + extract-agentic-fixture test, and touches normalize/regrade/cli to support new turn-level metrics. Also carries the two qwen3.6 and two gemma4 coding-agent-loop sweep writeups that consume this area. Depends on: lucebench-harness (edits files introduced there). Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

Containerization stack for lucebox-hub: - Dockerfile and docker-bake.hcl define the lucebox-hub container image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars that /props schema-3/4 consumes. - GitHub Actions: .github/workflows/docker.yml builds and publishes the image; ci.yml runs the standard checks; release-luce-bench.yml handles luce-bench release tagging. - Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README.md) live here because the Dockerfile uv-syncs the workspace at build time. - Catch-all dev/CI scripts ship with this stack: card-bundle drift check, lucebox-wrapper sandbox check, pflash session bench, ds4 2-case sweep, agentic fixture extractor, lucebox.sh smoke test, and the dflash eval_quality_compare helper. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Generated with [Claude Code](https://claude.com/claude-code)

…shell wrapper Lucebox hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) + lucebox.sh wrapper + install.sh + the harness/ adapter package (claude_code/codex/hermes/openclaw/opencode/pi adapters used by autotune sweeps). Carries autotune profile-sweep protocol + bragi tuning summary docs. Depends on: - docker-stack PR (CLI launches the lucebox-hub image and reads IMAGE_INFO) - lucebench-harness PR (lucebox bench delegates to lucebench) - server-layer-split PR (autotune presumes layer-split build flags) Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

…gime router User-facing pflash feature that lands four tightly-related pieces: - ee7 early-exit drafter with score-range gate and tail-capture guard (server/src/qwen3/qwen3_drafter.{h,cpp}, server/src/common/score_range.h) - Anchor-transitive cascade default-on, fixing the 64K NIAH cliff (server/src/qwen3/anchor_scan.{h,cpp}) - Regime router scaffolding for swapping pflash composition by regime (server/src/common/regime_router.h) - Adaptive composition via per-request fa_window (server/src/server/adaptive_keep_ratio.h) Ships five accompanying C++ unit tests covering the early-exit score-range gate, tail-capture guard, warm-path regression, anchor-transitive cascade, and regime router. Includes the pflash composition / anchor / drafter design docs and the qwen3.6 pflash A/B and prefix-cache regression writeups from bragi 2026-05-31. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)

Backend feature implementing soft-close thinking termination via a logit-ratio peek, split probe/inject ids, a min_tokens floor, and max_tokens treated as a response-only budget while thinking is active. - qwen35 originates the mechanism in qwen35_backend.{cpp,h}. - gemma4 ports the same mechanism in gemma4_backend.cpp. - gemma4_internal.h carries the probe/inject hooks and is owned here. - server_main CLI flags ship with the separate thinking-control API PR. Backend-only, plus the soft-close design plan doc under docs/experiments/soft-close-thinking-termination-plan.md. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Generated with Claude Code

Parses Gemma's plain-text call:<verb>{} emissions (also accepts the ``_call:`` tokenizer artifact prefix) and renders them as Anthropic tool_use + tool_result blocks. Isolated to tool_parser.{cpp,h}; the streaming detection in sse_emitter is owned by the thinking-control PR. Includes the call-verb tool parser plan and the Gemma4-26B call-verb parser fix writeup from bragi (2026-05-31). Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

…l + /props schema-4 - chat_template prefills closed <think> when thinking off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema from 2 to 4, adding build/model.target/ model.draft/host blocks so clients can introspect the runtime. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for richer card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream. - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, thinking-budget spec, and experiments capturing the thinking-control protocol/mechanism work. - test_server_unit gets the matching coverage spanning chat_template prefill, /props schema 4, and the reasoning_content routing. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)

…shell wrapper Lucebox hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) + lucebox.sh wrapper + install.sh + the harness/ adapter package (claude_code/codex/hermes/openclaw/opencode/pi adapters used by autotune sweeps). Carries autotune profile-sweep protocol + bragi tuning summary docs. Depends on: - docker-stack PR (CLI launches the lucebox-hub image and reads IMAGE_INFO) - lucebench-harness PR (lucebox bench delegates to lucebench) - server-layer-split PR (autotune presumes layer-split build flags) Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

Backend feature implementing soft-close thinking termination via a logit-ratio peek, split probe/inject ids, a min_tokens floor, and max_tokens treated as a response-only budget while thinking is active. - qwen35 originates the mechanism in qwen35_backend.{cpp,h}. - gemma4 ports the same mechanism in gemma4_backend.cpp. - gemma4_internal.h carries the probe/inject hooks and is owned here. - server_main CLI flags ship with the separate thinking-control API PR. Backend-only, plus the soft-close design plan doc under docs/experiments/soft-close-thinking-termination-plan.md. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Generated with Claude Code

Parses Gemma's plain-text call:<verb>{} emissions (also accepts the ``_call:`` tokenizer artifact prefix) and renders them as Anthropic tool_use + tool_result blocks. Isolated to tool_parser.{cpp,h}; the streaming detection in sse_emitter is owned by the thinking-control PR. Includes the call-verb tool parser plan and the Gemma4-26B call-verb parser fix writeup from bragi (2026-05-31). Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)

…gime router User-facing pflash feature that lands four tightly-related pieces: - ee7 early-exit drafter with score-range gate and tail-capture guard (server/src/qwen3/qwen3_drafter.{h,cpp}, server/src/common/score_range.h) - Anchor-transitive cascade default-on, fixing the 64K NIAH cliff (server/src/qwen3/anchor_scan.{h,cpp}) - Regime router scaffolding for swapping pflash composition by regime (server/src/common/regime_router.h) - Adaptive composition via per-request fa_window (server/src/server/adaptive_keep_ratio.h) Ships five accompanying C++ unit tests covering the early-exit score-range gate, tail-capture guard, warm-path regression, anchor-transitive cascade, and regime router. Includes the pflash composition / anchor / drafter design docs and the qwen3.6 pflash A/B and prefix-cache regression writeups from bragi 2026-05-31. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)

…ate plumbing Cross-backend C++ refactor that extracts a shared layer_split_backend, a GGUF inspection helper, the c2_spec_decode_permitted predicate, dim-aware draft loading, and the PFLASH_COMPRESS_* rename. Touches all backends (qwen3, qwen35, gemma4), CMake, and bench scripts. c2_gate.h is a backend predicate (not a user switch) so it lives here. Independent base for the four feature PRs that follow. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)