feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree#285
feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree#285easel wants to merge 14 commits into
Conversation
b5d4cc5 to
3642703
Compare
f2ddfc4 to
2be3eef
Compare
|
Some commands to test this... copied from the readme. Install the lucebox wrapper: curl -fsSL https://raw.githubusercontent.com/easel/lucebox-hub/feat/lucebox-docker/lucebox.sh \
-o ~/.local/bin/lucebox.sh && chmod +x ~/.local/bin/lucebox.shRun lucebox using the docker image # Override the container image to the temporary build:
export LUCEBOX_IMAGE=ghcr.io/easel/lucebox-hub
# Check your machine for lucebox compatibility
lucebox check
# Start the lucebox server
lucebox serveRun benchmarks against a local server: uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --url http://localhost:1236Run benchmarks against open router uvx --refresh --from "git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench" lucebench --base-url https://openrouter.ai/api --model qwen/qwen3.6-27b --auth-env OPENROUTER_API_KEY |
244257c to
f4db35b
Compare
…g#285 CI CI's "Lint Python surfaces touched by lucebox tooling" job ran `ruff check .` and found 11 errors across surfaces this branch touches. Ruff --fix handled 6 (import sorting, unused imports); 5 needed hand-edits: luce-bench/src/lucebench/report.py:172 E741 rename `for l in` → `for lineup in` lucebox/tests/test_check.py:39, 95 E731 lambda → def stub() for the two HostFacts stubs lucebox/tests/test_cli.py:95 E501 wrap the LUCEBOX_HOST_GPU_LIST_CSV setenv lucebox/tests/test_sweep.py:174, 177 E501 wrap two CellResult constructors 22 lucebox tests touched still pass; ruff is clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merge PR Luce-Org#285 after it changed from draft to open during the cron run. Resolve refreshed Docker/lucebox/luce-bench conflicts by taking the PR head for feature files while preserving the server include required by the existing integration stack.\n\nValidation:\n- git diff --check\n- python3 -m compileall -q lucebox/src lucebox/tests luce-bench/src luce-bench/tests harness/src\n- uv run --with pytest python -m pytest lucebox/tests luce-bench/tests/test_report.py luce-bench/tests/test_smoke_area.py luce-bench/tests/test_runner.py -q
Keep the primary checkout clean after integrating PR Luce-Org#285 by ignoring the generated .docker-build/ CMake scratch directory. Update the auto-integration manifest with the final PR Luce-Org#285 merge and validation details.
There was a problem hiding this comment.
17 issues found across 188 files
Note: This PR contains a large number of files. cubic only reviews up to 100 files per PR, so some files may not have been reviewed. cubic prioritizes the most important files to review.
On a pro plan you can use ultrareview for larger PRs.
Re-trigger cubic
Fix the initial PR Luce-Org#285 CI failures by sorting the ruff-reported import block and installing libcurl development headers before the native dflash server CMake build. Refresh the auto-integration manifest with the CI diagnosis.
Integrate the refreshed lucebox Docker stack PR head (09dc0be), which switches Hugging Face metadata lookup to model_info in lucebox download handling.
Record the clean Luce-Org#285 follow-up merge at 09dc0be and the validation performed after updating the stack.
The job-level `permissions` block replaces the workflow-level default entirely, so `actions/checkout` was running without `contents: read` and would fail on protected refs. Add `contents: read` back alongside the existing `id-token: write`. Addresses cubic #1 on PR Luce-Org#285.
- Dockerfile: keep --frozen on the uv sync fallback so the layer can't silently resolve outside the lockfile. - harness/clients/run_lucebench.sh: default LUCEBENCH_THINK empty (per-area card defaults govern; --no-think only when explicitly set) and default LUCEBENCH_AREA to the level1 capability gate (smoke,code,gsm8k,agent,longctx) instead of `all`, which was too broad for routine harness runs. Addresses cubic #2, Luce-Org#3 (P1) and Luce-Org#14 (P2) on PR Luce-Org#285.
…appers
- .github/workflows/{ci,docker,release-luce-bench}.yml: pin
actions/checkout, docker/{setup-buildx,login,metadata,bake}-action,
and astral-sh/setup-uv to immutable commit SHAs with `# vN` comments
so the supply chain is reproducible (Luce-Org#4).
- harness/src/harness/clients/_common.py: replace the external `timeout`
shell-out with `subprocess.run(..., timeout=N)`, return 124 on
TimeoutExpired to match GNU timeout's exit code (Luce-Org#5).
- scripts/build_image.sh: normalize REGISTRY to end in `/` instead of
silently producing `ghcr.io/luce-orglucebox-hub` when the trailing
slash is missing (Luce-Org#6).
- harness/src/harness/clients/pi.py: non-interactive launch now mirrors
run_pi.sh's validated invocation (--provider, --print, --mode json,
--tools, --no-session, --offline) and sets PI_CODING_AGENT_DIR /
PI_CODING_AGENT_SESSION_DIR / PI_OFFLINE (Luce-Org#7).
- docker-bake.hcl: sanitize `+` → `-` in VERSION before composing tags,
since `+` is not a valid Docker tag character (Luce-Org#8).
- harness/src/harness/clients/hermes.py: set HERMES_HOME + the rest of
run_hermes.sh's env wiring and call `chat --provider --model
--accept-hooks --yolo --max-turns --source --query` instead of a bare
positional prompt (Luce-Org#9, Luce-Org#10).
- harness/src/harness/clients/openclaw.py: apply the OpenClaw config
patch via `openclaw config patch --file` before the run, and call
`agent --local --json --model lucebox/<model> --session-id --timeout
--message` instead of a bare positional prompt (Luce-Org#11).
- pyproject.toml: drop the dead dflash/scripts/{prefix_cache,test_server,
tool_memory}.py ruff include pins (those paths were renamed during
the dflash→server rename and then deleted upstream) (Luce-Org#12).
- lefthook.yml: widen the shellcheck/bash-parse glob from `*.sh` to
`**/*.sh` so scripts under nested dirs (harness/clients/*.sh,
scripts/*.sh, server/scripts/*.sh) are linted on commit (Luce-Org#13).
Addresses cubic Luce-Org#4–Luce-Org#13 (P2) on PR Luce-Org#285. Luce-Org#14 was already addressed in
the previous commit alongside the LUCEBENCH_THINK default fix.
- lucebox/README.md: fix the relative link to `cli.py`; resolves to `src/lucebox/cli.py` (the actual location), not the nonexistent `lucebox/cli.py` (Luce-Org#15). - luce-bench/NOTICE: the bundled forge_eval LICENSE says "Copyright (c) 2025-2026 Antoine Zambelli", not 2024 — sync NOTICE with the actual upstream LICENSE (Luce-Org#16). - luce-bench/src/lucebench/areas/__init__.py: `__all__` was missing agent / agent_recorded / forge / longctx / smoke. Add the imports + list entries so `from lucebench.areas import *` matches the actual area surface (Luce-Org#17). Addresses cubic Luce-Org#15–Luce-Org#17 (P3) on PR Luce-Org#285.
Merge the advanced Luce-Org#285 docker/harness follow-up head into the integration stack and record the refreshed PR classification, conflict probes, retained worktrees, and validation results.
…nch in-tree Squashes 8 commits from feat/lucebox-docker (PR Luce-Org#285) into a single commit on top of origin/main (8782d07). Net: 189 files changed. Workstreams folded in: * Docker prebuild stack: ghcr.io/easel/lucebox-hub:cuda12 image, multi-stage Dockerfile with reproducible `uv sync --frozen`, docker-bake.hcl with VERSION sanitization for Docker tag charset, .github/workflows/docker.yml with SHA-pinned external actions and GHA cache, build identity baked into /opt/lucebox-hub/IMAGE_INFO + HOST_INFO. * Host wrapper (lucebox.sh): probe_host, smart cmd_serve (INVOCATION_ID guard against systemd self-defeat, container-state preflight), cmd_systemctl_passthrough (already-active short-circuit, restart-loop detection), cmd_update (bootstrap-installer pattern), cmd_completion (bash/zsh/fish), config.toml reader (env > toml > default), all shellcheck-clean. * Bootstrap installer (install.sh): bakes LUCEBOX_INSTALLED_FROM into the installed copy so `lucebox update` keeps tracking the channel; refuses SHA-pinned URLs without LUCEBOX_INSTALL_CHANNEL. * In-container Python CLI (lucebox/): sparse config.toml persistence, config get/set/unset sub-app, models list/download sub-app (replaces download-models), autotune with --apply / --json / --sweep, profile collapsed onto luce-bench snapshot (1701 → ~150 lines). _load_or_build now respects env > toml > default precedence. * luce-bench: snapshot subcommand + canonical HostInfo schema v2 (multi-GPU lineup, WSL detection, source/collector trust metadata) + levels (level0/1/2/3) + report subcommand (host column + cross-host confounder warnings) + submit-baseline (level3-gated) + regrade. * Server (C++): /props.host block + props_schema=4 + host_info loader, /props.build identity, GGUF metadata + sha256 sidecars, model card sidecars. Deleted server/scripts/bench_{agent,he,llm}.py — bench machinery moved into luce-bench. * Harness: client implementations for claude/codex/opencode/hermes/pi pointed at the running lucebox server, matched against the validated run_*.sh shell wrappers. Cubic AI code review (17 findings) addressed in full: P0: contents: read on luce-bench release job permissions. P1: Dockerfile `--frozen` reinstated; LUCEBENCH_THINK default empty so per-area defaults apply. P2: 6 external actions pinned to immutable SHAs; non-interactive timeout via subprocess.run; REGISTRY trailing-slash normalize; VERSION + Docker tag charset sanitize; harness pi/hermes/openclaw mirrored against run_*.sh wrappers; ruff scan paths corrected to server/scripts/; lefthook glob `**/*.sh`; LUCEBENCH_AREA default level1. P3: lucebox/README.md cli.py link fixed; NOTICE copyright year 2025-2026; areas/__init__.py __all__ exposes all 10 areas. CI on PR Luce-Org#285: all 4 checks green (uv workspace, cmake build, cuda12 prebuild, cubic reviewer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ccef455 to
37b3fbd
Compare
…box-docker) Brings the Qwen3.6/Laguna think-mode reasoning fix (route reasoning into reasoning_content channel instead of content) into the lucebox-docker stack.
Merge the advanced Luce-Org#285 head, carrying Qwen3.6/Laguna started_in_thinking propagation through render, parsing, and SSE emission. Resolve the overlap with the existing Qwen3 Jinja closed-think fallback and refresh the auto-integration manifest.
Document the post-push discovery of PR Luce-Org#285's 9a6db60 head, its clean merge into auto-integration, targeted luce-bench validation, and retained worktree paths.
There was a problem hiding this comment.
3 issues found across 12 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="luce-bench/src/lucebench/cli.py">
<violation number="1" location="luce-bench/src/lucebench/cli.py:692">
P2: `--prompt-thinking-control=on` can be silently disabled by the new card-capability gate, so explicit user override no longer works on unresolved/non-capable cards.</violation>
</file>
<file name="luce-bench/src/lucebench/runner.py">
<violation number="1" location="luce-bench/src/lucebench/runner.py:682">
P2: Continuation request can unintentionally re-enable thinking because `extra_body` overwrites forced disable flags.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| # tokens for a thinking-capable card; otherwise force the flag off so | ||
| # neither the card nor the family-map fallback injects. | ||
| effective_thinking_control = ( | ||
| prompt_thinking_control if card_is_thinking_capable(model_card) else "off" |
There was a problem hiding this comment.
P2: --prompt-thinking-control=on can be silently disabled by the new card-capability gate, so explicit user override no longer works on unresolved/non-capable cards.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/cli.py, line 692:
<comment>`--prompt-thinking-control=on` can be silently disabled by the new card-capability gate, so explicit user override no longer works on unresolved/non-capable cards.</comment>
<file context>
@@ -665,6 +685,13 @@ def _run_standard_area_to_dir(
+ # tokens for a thinking-capable card; otherwise force the flag off so
+ # neither the card nor the family-map fallback injects.
+ effective_thinking_control = (
+ prompt_thinking_control if card_is_thinking_capable(model_card) else "off"
+ )
+
</file context>
| if top_k is not None and top_k > 0: | ||
| body["top_k"] = int(top_k) | ||
| if extra_body: | ||
| body.update(extra_body) |
There was a problem hiding this comment.
P2: Continuation request can unintentionally re-enable thinking because extra_body overwrites forced disable flags.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At luce-bench/src/lucebench/runner.py, line 682:
<comment>Continuation request can unintentionally re-enable thinking because `extra_body` overwrites forced disable flags.</comment>
<file context>
@@ -384,7 +618,88 @@ def run_case(
+ if top_k is not None and top_k > 0:
+ body["top_k"] = int(top_k)
+ if extra_body:
+ body.update(extra_body)
+
+ headers = {"Content-Type": "application/json"}
</file context>
| body.update(extra_body) | |
| body.update( | |
| { | |
| k: v | |
| for k, v in extra_body.items() | |
| if k | |
| not in { | |
| "messages", | |
| "max_tokens", | |
| "stream", | |
| "chat_template_kwargs", | |
| "thinking", | |
| "reasoning_effort", | |
| } | |
| } | |
| ) |
Replaces the single-shot recorded harness with a multi-turn replay, adds the grading/ subpackage with the llm_judge, multi_turn_cases fixture, agent_recorded test + extract-agentic-fixture test, and touches normalize/regrade/cli to support new turn-level metrics. Also carries the two qwen3.6 and two gemma4 coding-agent-loop sweep writeups that consume this area. Depends on: lucebench-harness (edits files introduced there). Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Containerization stack for lucebox-hub: - Dockerfile and docker-bake.hcl define the lucebox-hub container image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars that /props schema-3/4 consumes. - GitHub Actions: .github/workflows/docker.yml builds and publishes the image; ci.yml runs the standard checks; release-luce-bench.yml handles luce-bench release tagging. - Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README.md) live here because the Dockerfile uv-syncs the workspace at build time. - Catch-all dev/CI scripts ship with this stack: card-bundle drift check, lucebox-wrapper sandbox check, pflash session bench, ds4 2-case sweep, agentic fixture extractor, lucebox.sh smoke test, and the dflash eval_quality_compare helper. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Generated with [Claude Code](https://claude.com/claude-code)
…shell wrapper Lucebox hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) + lucebox.sh wrapper + install.sh + the harness/ adapter package (claude_code/codex/hermes/openclaw/opencode/pi adapters used by autotune sweeps). Carries autotune profile-sweep protocol + bragi tuning summary docs. Depends on: - docker-stack PR (CLI launches the lucebox-hub image and reads IMAGE_INFO) - lucebench-harness PR (lucebox bench delegates to lucebench) - server-layer-split PR (autotune presumes layer-split build flags) Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
…gime router
User-facing pflash feature that lands four tightly-related pieces:
- ee7 early-exit drafter with score-range gate and tail-capture guard
(server/src/qwen3/qwen3_drafter.{h,cpp}, server/src/common/score_range.h)
- Anchor-transitive cascade default-on, fixing the 64K NIAH cliff
(server/src/qwen3/anchor_scan.{h,cpp})
- Regime router scaffolding for swapping pflash composition by regime
(server/src/common/regime_router.h)
- Adaptive composition via per-request fa_window
(server/src/server/adaptive_keep_ratio.h)
Ships five accompanying C++ unit tests covering the early-exit score-range
gate, tail-capture guard, warm-path regression, anchor-transitive cascade,
and regime router. Includes the pflash composition / anchor / drafter
design docs and the qwen3.6 pflash A/B and prefix-cache regression
writeups from bragi 2026-05-31.
Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Backend feature implementing soft-close thinking termination via a
logit-ratio peek, split probe/inject ids, a min_tokens floor, and
max_tokens treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp.
- gemma4_internal.h carries the probe/inject hooks and is owned here.
- server_main CLI flags ship with the separate thinking-control API PR.
Backend-only, plus the soft-close design plan doc under
docs/experiments/soft-close-thinking-termination-plan.md.
Part of the PR Luce-Org#285 split (Luce-Org#285) into
tightly-scoped PRs.
Generated with Claude Code
Parses Gemma's plain-text call:<verb>{} emissions (also accepts the
``_call:`` tokenizer artifact prefix) and renders them as Anthropic
tool_use + tool_result blocks. Isolated to tool_parser.{cpp,h}; the
streaming detection in sse_emitter is owned by the thinking-control PR.
Includes the call-verb tool parser plan and the Gemma4-26B call-verb
parser fix writeup from bragi (2026-05-31).
Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
…l + /props schema-4 - chat_template prefills closed <think> when thinking off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema from 2 to 4, adding build/model.target/ model.draft/host blocks so clients can introspect the runtime. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for richer card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream. - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, thinking-budget spec, and experiments capturing the thinking-control protocol/mechanism work. - test_server_unit gets the matching coverage spanning chat_template prefill, /props schema 4, and the reasoning_content routing. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)
…shell wrapper Lucebox hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) + lucebox.sh wrapper + install.sh + the harness/ adapter package (claude_code/codex/hermes/openclaw/opencode/pi adapters used by autotune sweeps). Carries autotune profile-sweep protocol + bragi tuning summary docs. Depends on: - docker-stack PR (CLI launches the lucebox-hub image and reads IMAGE_INFO) - lucebench-harness PR (lucebox bench delegates to lucebench) - server-layer-split PR (autotune presumes layer-split build flags) Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Backend feature implementing soft-close thinking termination via a
logit-ratio peek, split probe/inject ids, a min_tokens floor, and
max_tokens treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp.
- gemma4_internal.h carries the probe/inject hooks and is owned here.
- server_main CLI flags ship with the separate thinking-control API PR.
Backend-only, plus the soft-close design plan doc under
docs/experiments/soft-close-thinking-termination-plan.md.
Part of the PR Luce-Org#285 split (Luce-Org#285) into
tightly-scoped PRs.
Generated with Claude Code
Parses Gemma's plain-text call:<verb>{} emissions (also accepts the
``_call:`` tokenizer artifact prefix) and renders them as Anthropic
tool_use + tool_result blocks. Isolated to tool_parser.{cpp,h}; the
streaming detection in sse_emitter is owned by the thinking-control PR.
Includes the call-verb tool parser plan and the Gemma4-26B call-verb
parser fix writeup from bragi (2026-05-31).
Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
…gime router
User-facing pflash feature that lands four tightly-related pieces:
- ee7 early-exit drafter with score-range gate and tail-capture guard
(server/src/qwen3/qwen3_drafter.{h,cpp}, server/src/common/score_range.h)
- Anchor-transitive cascade default-on, fixing the 64K NIAH cliff
(server/src/qwen3/anchor_scan.{h,cpp})
- Regime router scaffolding for swapping pflash composition by regime
(server/src/common/regime_router.h)
- Adaptive composition via per-request fa_window
(server/src/server/adaptive_keep_ratio.h)
Ships five accompanying C++ unit tests covering the early-exit score-range
gate, tail-capture guard, warm-path regression, anchor-transitive cascade,
and regime router. Includes the pflash composition / anchor / drafter
design docs and the qwen3.6 pflash A/B and prefix-cache regression
writeups from bragi 2026-05-31.
Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)
…ate plumbing Cross-backend C++ refactor that extracts a shared layer_split_backend, a GGUF inspection helper, the c2_spec_decode_permitted predicate, dim-aware draft loading, and the PFLASH_COMPRESS_* rename. Touches all backends (qwen3, qwen35, gemma4), CMake, and bench scripts. c2_gate.h is a backend predicate (not a user switch) so it lives here. Independent base for the four feature PRs that follow. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Backend feature implementing soft-close thinking termination via a
logit-ratio peek, split probe/inject ids, a min_tokens floor, and
max_tokens treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp.
- gemma4_internal.h carries the probe/inject hooks and is owned here.
- server_main CLI flags ship with the separate thinking-control API PR.
Backend-only, plus the soft-close design plan doc under
docs/experiments/soft-close-thinking-termination-plan.md.
Part of the PR Luce-Org#285 split (Luce-Org#285) into
tightly-scoped PRs.
Generated with Claude Code
…l + /props schema-4 - chat_template prefills closed <think> when thinking off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema from 2 to 4, adding build/model.target/ model.draft/host blocks so clients can introspect the runtime. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for richer card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream. - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, thinking-budget spec, and experiments capturing the thinking-control protocol/mechanism work. - test_server_unit gets the matching coverage spanning chat_template prefill, /props schema 4, and the reasoning_content routing. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)
…gime router
User-facing pflash feature that lands four tightly-related pieces:
- ee7 early-exit drafter with score-range gate and tail-capture guard
(server/src/qwen3/qwen3_drafter.{h,cpp}, server/src/common/score_range.h)
- Anchor-transitive cascade default-on, fixing the 64K NIAH cliff
(server/src/qwen3/anchor_scan.{h,cpp})
- Regime router scaffolding for swapping pflash composition by regime
(server/src/common/regime_router.h)
- Adaptive composition via per-request fa_window
(server/src/server/adaptive_keep_ratio.h)
Ships five accompanying C++ unit tests covering the early-exit score-range
gate, tail-capture guard, warm-path regression, anchor-transitive cascade,
and regime router. Includes the pflash composition / anchor / drafter
design docs and the qwen3.6 pflash A/B and prefix-cache regression
writeups from bragi 2026-05-31.
Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Backend feature implementing soft-close thinking termination via a
logit-ratio peek, split probe/inject ids, a min_tokens floor, and
max_tokens treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp.
- gemma4_internal.h carries the probe/inject hooks and is owned here.
- server_main CLI flags ship with the separate thinking-control API PR.
Backend-only, plus the soft-close design plan doc under
docs/experiments/soft-close-thinking-termination-plan.md.
Part of the PR Luce-Org#285 split (Luce-Org#285) into
tightly-scoped PRs.
Generated with Claude Code
…ate plumbing Cross-backend C++ refactor that extracts a shared layer_split_backend, a GGUF inspection helper, the c2_spec_decode_permitted predicate, dim-aware draft loading, and the PFLASH_COMPRESS_* rename. Touches all backends (qwen3, qwen35, gemma4), CMake, and bench scripts. c2_gate.h is a backend predicate (not a user switch) so it lives here. Independent base for the four feature PRs that follow. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Containerization stack for lucebox-hub: - Dockerfile and docker-bake.hcl define the lucebox-hub container image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars that /props schema-3/4 consumes. - GitHub Actions: .github/workflows/docker.yml builds and publishes the image; ci.yml runs the standard checks; release-luce-bench.yml handles luce-bench release tagging. - Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README.md) live here because the Dockerfile uv-syncs the workspace at build time. - Catch-all dev/CI scripts ship with this stack: card-bundle drift check, lucebox-wrapper sandbox check, pflash session bench, ds4 2-case sweep, agentic fixture extractor, lucebox.sh smoke test, and the dflash eval_quality_compare helper. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Generated with [Claude Code](https://claude.com/claude-code)
Parses Gemma's plain-text call:<verb>{} emissions (also accepts the
``_call:`` tokenizer artifact prefix) and renders them as Anthropic
tool_use + tool_result blocks. Isolated to tool_parser.{cpp,h}; the
streaming detection in sse_emitter is owned by the thinking-control PR.
Includes the call-verb tool parser plan and the Gemma4-26B call-verb
parser fix writeup from bragi (2026-05-31).
Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
…l + /props schema-4 - chat_template prefills closed <think> when thinking off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema from 2 to 4, adding build/model.target/ model.draft/host blocks so clients can introspect the runtime. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for richer card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream. - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, thinking-budget spec, and experiments capturing the thinking-control protocol/mechanism work. - test_server_unit gets the matching coverage spanning chat_template prefill, /props schema 4, and the reasoning_content routing. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> 🤖 Generated with [Claude Code](https://claude.com/claude-code)
Containerization stack for lucebox-hub: - Dockerfile and docker-bake.hcl define the lucebox-hub container image (build-env and runtime stages); scripts/build_image.sh drives local builds; server/scripts/entrypoint.sh emits IMAGE_INFO / HOST_INFO sidecars that /props schema-3/4 consumes. - GitHub Actions: .github/workflows/docker.yml builds and publishes the image; ci.yml runs the standard checks; release-luce-bench.yml handles luce-bench release tagging. - Workspace-root files (pyproject.toml, uv.lock, Makefile, lefthook.yml, .gitignore, README.md) live here because the Dockerfile uv-syncs the workspace at build time. - Catch-all dev/CI scripts ship with this stack: card-bundle drift check, lucebox-wrapper sandbox check, pflash session bench, ds4 2-case sweep, agentic fixture extractor, lucebox.sh smoke test, and the dflash eval_quality_compare helper. Part of the PR Luce-Org#285 split (Luce-Org#285) into tightly-scoped PRs. Generated with [Claude Code](https://claude.com/claude-code)
|
Superseded by the 8-PR split (#334, #335, #336, #337, #338, #339, #340, #341). All core feature content is captured across those 8 PRs; the umbrella diff additionally contained docs/experiments and dev-script baggage that was intentionally dropped during the polish pass. Closing in favor of reviewing the splits. 🤖 Generated with Claude Code |
This PR turns Lucebox into a one-command local inference deployment and ships the two tools that operate it:
lucebox(the host CLI that runs and tunes the server) andluce-bench(the benchmark + grading framework that measures it). All three ship together so a fresh box goes from nothing to a tuned, benchmarked server with a single install.The three pieces, what each is, and how to use it:
1. Docker — the server image
A CUDA 12.8 image (
ghcr.io/luce-org/lucebox-hub:cuda12) that builds the dflash server and bundlesserver/,lucebox/,harness/, andluce-bench/. The entrypoint dispatchesserve(default),benchmark, anyluceboxsubcommand, orshell. An in-container autotune fallback picks VRAM-tiered defaults and resolves the draft GGUF by target architecture (gemma4 → gemma drafter, qwen3.6 → dflash-draft-3.6).Use it directly:
Image tags:
:cuda12,:vX.Y.Z-cuda12,:X.Y-cuda12,:sha-<short>-cuda12. Built and pushed by.github/workflows/docker.yml;docker-bake.hclhas acuda13slot ready.2.
lucebox— the host CLIlucebox.shis the host-side wrapper (deps:docker+nvidia-smionly). It probes the host, writes a tunedconfig.toml, runs the container as a user-systemd service, and delegates provisioning/workloads to the in-container Python CLI (models,autotune,profile,smoke,config, the client drivers).Stand a server up:
Tune it to the GPU:
The running config is observable at
GET /props(schema 4), which now reports ahostblock — kernel, OS, WSL vs native, driver, CPU, RAM, GPU — so a server self-describes its real config and host.3.
luce-bench— the benchmark + grading frameworkIn-tree workspace member (
luce-bench/,0.2.7.dev0) that scores any OpenAI/Anthropic-compatible endpoint and writes versioned, comparable result files. Areas:smoke,ds4-eval(92 reasoning items),gsm8k,truthfulqa-mc1,hellaswag,code,longctx,agent,agent_recorded,forge. Every result stamps a per-areagrader_versionand ahostblock (from/props.host, or a clearly-marked client-side fallback for servers without/props).Run it:
uvx --from 'git+https://github.com/easel/lucebox-hub@feat/lucebox-docker#subdirectory=luce-bench' \ luce-bench --base-url http://localhost:8080 --model dflash --areas all --no-thinkThinking control is portable. Each request carries three control shapes (
chat_template_kwargs.enable_thinking, Anthropicthinking:{type},reasoning_effort). For servers that ignore the API flags (e.g. OpenRouter),--prompt-thinking-control {auto,on,off}(defaultauto) injects the model family's in-band token (/no_think,/think);autofires only when/propsshows no server-side enforcement. A post-run verifier recordsthinking_control_honoredso a nothink run that secretly reasoned is flagged, not silently mislabeled.Comparing results: runs from one grader version are comparable as written. For older snapshots graded by a different version,
luce-bench regrade <dirs>re-scores stored outputs at the current pinned grader and refuses to place mismatched-version (or mismatched-host) runs in the same row.report/snapshot/submit-baselineround out the reporting surface.Also in this PR
harness/— drives real clients (claude_code,codex,opencode,hermes,pi,openclaw) against a running server;lucebox profiledelegates bench runs here.share/model_cards/{qwen3.6-27b,gemma-4-26b-a4b-it,gemma-4-31b-it,laguna-xs.2}.json+_schema.json, so the server resolves sampler defaults, thinking budgets, and the force-close hint per model.pyproject.tomldeclares all members (server,lucebox,luce-bench,harness,optimizations/{megakernel,pflash});[tool.uv.sources] luce-bench = { workspace = true }replaces the prior git-tag pin.release-luce-bench.ymlpublishes to PyPI onluce-bench-v*tags.server/docs/benchmark-snapshot spec and experiment write-ups.server/scripts/bench_*.py(their work now lives inluce-bench).Out of scope / follow-ups
Validation
uv syncclean on the workspace; luce-bench test suite passes.--areas allsweeps run end-to-end against bragi (RTX 5090 Laptop), sindri (RTX 3090 Ti), vidar (M2 Ultra / MLX), and OpenRouter, think and nothink, all on one grader version./props.hostconfirmed populated on lucebox servers (bragi + sindri report WSL2); OpenRouter nothink confirmed honored via client-side/no_thinkinjection.