refactor(server): shared layer-split backend + GGUF inspection + c2-gate plumbing by easel · Pull Request #336 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:31:08Z

Summary

Cross-backend C++ refactor that extracts a shared layer_split_backend, a GGUF inspection helper (gguf_inspect), the c2_spec_decode_permitted predicate (c2_gate.h), dim-aware draft GGUF loading, and the PFLASH_COMPRESS_* rename. Touches all three backends (qwen3, qwen35, gemma4), CMake, and the integration test surface. Also removes seven legacy server/scripts/bench_*.py scripts that have been superseded upstream.

Files

server/src/common/: new gguf_inspect.{h,cpp}, updates to layer_split_backend.{h,cpp} and model_backend.h
server/src/qwen35/: new c2_gate.h; refactors across layer_split_forward.{h,cpp}, qwen35_layer_split_adapter.{h,cpp}, gguf_target_loader.cpp, qwen35_dflash_target.h
server/src/qwen3/: backend, graph, and loader updates (qwen3_backend.cpp, qwen3_graph.cpp, qwen3_loader.cpp)
server/src/gemma4/: graph + layer-split adapter updates
server/src/draft/: dim-aware draft GGUF loader
server/CMakeLists.txt, server/README.md
server/test/test_server_unit.cpp, server/scripts/test_server_integration.py
server/scripts/fixtures/agent_cases/cases.json (new)
Deletes: server/scripts/bench_agent.py, bench_agent_loop.py, bench_daemon.py, bench_he.py, bench_he_http.py, bench_llm.py, bench_server.py

Net: 30 files changed, ~1010 insertions / ~2580 deletions.

Dependencies

None — this PR is independent and forms the foundation for sibling server PRs.

Reverse dependencies (PRs that depend on this one):

feat(lucebox): hub CLI + autotune/sweep/profile + harness adapters + shell wrapper #335 (lucebox-cli) — consumes server build artifacts
feat(server): soft-close thinking termination (qwen35 + gemma4) #339 (server-soft-close) — depends on c2_gate.h introduced here
feat(server): card-driven thinking control + reasoning_content channel + /props schema-4 #341 (server-thinking-control) — depends on c2_gate.h (if still referenced after Phase 1 cleanup)

Test plan

server/test/test_server_unit.cpp passes on rolled-up branch
server/scripts/test_server_integration.py passes on rolled-up branch
qwen3, qwen35, and gemma4 backends all build and produce identical decode output vs. main
Draft GGUF loader handles dim-mismatched draft models correctly
PFLASH_COMPRESS_* env-var rename is consistent across server + bench callers

Note: this split PR is not independently buildable. server/CMakeLists.txt also references sources (e.g. server/src/qwen3/anchor_scan.cpp) that land in sibling split PRs, so standalone CMake configuration will fail on missing sources. Validation has to happen on the rolled-up branch (easel/feat/lucebox-docker), which is green.

Part of the PR #285 split

This PR is one of several tightly-scoped PRs carved out of the omnibus PR #285. See that PR for the full context and the other split-out PRs.

Generated with Claude Code (https://claude.com/claude-code)

… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

… plumbing ## What Cross-backend C++ refactor: - Extracts a shared layer_split_backend (common/) and a GGUF inspection helper (common/gguf_inspect.{h,cpp}). - Adds the c2_spec_decode_permitted predicate in qwen35/c2_gate.h (a backend gating predicate, not a user-visible switch). - Dim-aware draft loading in draft/draft_gguf_loader.cpp. - Renames PFLASH_COMPRESS_* config keys across all three backends (qwen3, qwen35, gemma4) plus their layer_split adapters/graphs. - Removes obsolete bench scripts (bench_agent, bench_he, bench_llm, bench_server, bench_daemon) and lands a unified test_server_integration.py plus C++ test_server_unit.cpp coverage. ## Why Base refactor that downstream feature PRs build on. Centralizing layer-split + GGUF inspection + c2 gating prevents drift between backends and makes the four follow-on feature PRs reviewable as small diffs against a clean base. ## Dependencies None - this PR is independent.

## What Backend mechanism for soft-close thinking termination via a logit-ratio peek with split probe/inject ids, a min_tokens floor, and max_tokens treated as a response-only budget while thinking is active. - qwen35 originates the mechanism in qwen35_backend.{cpp,h}. - gemma4 ports the same mechanism in gemma4_backend.cpp (plus the probe/inject hooks in gemma4_internal.h owned here). - common/model_backend.h carries the shared hook contract. - test_server_unit.cpp adds 533 lines of coverage for both backends. - docs/specs/thinking-budget.md updated; soft-close design plan under docs/experiments/. server_main CLI flags for this mechanism ship in the thinking-control API PR (Luce-Org#341). ## Why Lets a model finish its reasoning gracefully rather than being hard-cut when a thinking budget is hit, which preserves answer quality on long reasoning traces. ## Dependencies - Luce-Org#336 (server-layer-split): #includes qwen35/c2_gate.h and uses BudgetHook fa_window_override + c2_spec_decode_permitted from the shared backend

…ps schema-4 ## What User-facing thinking-control API across the HTTP server surface: - chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook). - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments. - test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing. ## Why Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports. ## Dependencies - Luce-Org#336 (server-layer-split): CMake/build references - Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts - Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340

cubic-dev-ai

10 issues found across 31 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/c2_gate.h">

<violation number="1" location="server/src/qwen35/c2_gate.h:28">
P2: The gate threshold uses `int` multiplication (`2 * fa_window_cfg`), which can overflow and miscompute spec-decode eligibility for large configs.</violation>
</file>

<file name="server/src/gemma4/gemma4_layer_split_adapter.cpp">

<violation number="1" location="server/src/gemma4/gemma4_layer_split_adapter.cpp:149">
P2: Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.</violation>
</file>

<file name="server/src/qwen3/qwen3_loader.cpp">

<violation number="1" location="server/src/qwen3/qwen3_loader.cpp:136">
P2: Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but `copy_tensor_from_file` (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary `wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16"` silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; `ggml_type_name(wtype)` is already used for error logging in the same function and should be used here instead.</violation>
</file>

<file name="server/src/common/gguf_inspect.cpp">

<violation number="1" location="server/src/common/gguf_inspect.cpp:178">
P2: The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.</violation>
</file>

<file name="server/scripts/test_server_integration.py">

<violation number="1" location="server/scripts/test_server_integration.py:1108">
P2: Natural-close test encodes an incorrect token-accounting invariant (`content_tokens == 0`), conflicting with the documented `finish_details` contract.</violation>
</file>

<file name="server/src/qwen35/qwen35_dflash_target.h">

<violation number="1" location="server/src/qwen35/qwen35_dflash_target.h:59">
P2: Sticky override state with no reset — `set_fa_window` permanently mutates `fa_window_` but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.</violation>
</file>

<file name="server/src/qwen35/qwen35_layer_split_adapter.cpp">

<violation number="1" location="server/src/qwen35/qwen35_layer_split_adapter.cpp:403">
P1: AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.</violation>

<violation number="2" location="server/src/qwen35/qwen35_layer_split_adapter.cpp:418">
P2: DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:371">
P2: New post-allocation validation returns can leak draft GPU/context resources on load failure.</violation>
</file>

<file name="server/src/common/layer_split_backend.cpp">

<violation number="1" location="server/src/common/layer_split_backend.cpp:60">
P1: The condition `req.sampler.temp > 0.0f` is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.</violation>
</file>

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

cubic-dev-ai · 2026-06-04T05:17:00Z

-                cfg_.run_dflash ? &feature_ring_ : nullptr,
-                /*argmax_out=*/nullptr,
-                sampler_.needs_logit_processing() ? &logits_buf : nullptr)) {
+                cfg_.run_dflash ? &feature_ring_ : nullptr)) {


P1: AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_layer_split_adapter.cpp, line 403: <comment>AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.</comment> <file context> @@ -419,24 +397,16 @@ bool Qwen35LayerSplitAdapter::decode_ar( - cfg_.run_dflash ? &feature_ring_ : nullptr, - /*argmax_out=*/nullptr, - sampler_.needs_logit_processing() ? &logits_buf : nullptr)) { + cfg_.run_dflash ? &feature_ring_ : nullptr)) { return false; } </file context>

cubic-dev-ai · 2026-06-04T05:17:00Z

    }
-    if (req.do_sample && req.sampler.needs_logit_processing() &&
-        !adapter_->supports_cpu_sampling()) {
+    if (req.do_sample && req.sampler.temp > 0.0f) {


P1: The condition req.sampler.temp > 0.0f is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/layer_split_backend.cpp, line 60: <comment>The condition `req.sampler.temp > 0.0f` is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.</comment> <file context> @@ -57,8 +57,7 @@ GenerateResult LayerSplitBackend::run_from_state(const GenerateRequest & req, } - if (req.do_sample && req.sampler.needs_logit_processing() && - !adapter_->supports_cpu_sampling()) { + if (req.do_sample && req.sampler.temp > 0.0f) { result.error = "sampling_unsupported"; return result; </file context>

Suggested change

if (req.do_sample && req.sampler.temp > 0.0f) {

if (req.do_sample) {

cubic-dev-ai · 2026-06-04T05:17:00Z

+                                     int kv_committed) {
+    (void)kv_committed;
+    return (fa_window_override == 0)
+        || (fa_window_override <= 2 * fa_window_cfg);


P2: The gate threshold uses int multiplication (2 * fa_window_cfg), which can overflow and miscompute spec-decode eligibility for large configs.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/c2_gate.h, line 28: <comment>The gate threshold uses `int` multiplication (`2 * fa_window_cfg`), which can overflow and miscompute spec-decode eligibility for large configs.</comment> <file context> @@ -0,0 +1,31 @@ + int kv_committed) { + (void)kv_committed; + return (fa_window_override == 0) + || (fa_window_override <= 2 * fa_window_cfg); +} + </file context>

cubic-dev-ai · 2026-06-04T05:17:00Z

-    if (req.do_sample && sampler_.seed != 0) {
-        sampler_rng_.seed(sampler_.seed);
-    }
+    (void)req;


P2: Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/gemma4/gemma4_layer_split_adapter.cpp, line 149: <comment>Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.</comment> <file context> @@ -146,25 +146,20 @@ bool Gemma4LayerSplitAdapter::init() { - if (req.do_sample && sampler_.seed != 0) { - sampler_rng_.seed(sampler_.seed); - } + (void)req; } </file context>

cubic-dev-ai · 2026-06-04T05:17:00Z

    out.head_dim   = (int)get_u32(gctx, "qwen3.attention.key_length", 128);
    out.rope_theta = get_f32(gctx, "qwen3.rope.freq_base", 1000000.0f);

+    // Detect weight quant type from blk.0.attn_q.weight; support BF16 and Q8_0.


P2: Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but copy_tensor_from_file (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16" silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; ggml_type_name(wtype) is already used for error logging in the same function and should be used here instead.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/qwen3_loader.cpp, line 136: <comment>Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but `copy_tensor_from_file` (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary `wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16"` silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; `ggml_type_name(wtype)` is already used for error logging in the same function and should be used here instead.</comment> <file context> @@ -133,6 +133,18 @@ bool load_qwen3_drafter_model(const std::string & path, out.head_dim = (int)get_u32(gctx, "qwen3.attention.key_length", 128); out.rope_theta = get_f32(gctx, "qwen3.rope.freq_base", 1000000.0f); + // Detect weight quant type from blk.0.attn_q.weight; support BF16 and Q8_0. + ggml_type wtype = GGML_TYPE_BF16; + { </file context>

cubic-dev-ai · 2026-06-04T05:17:00Z

+    while (f) {
+        f.read(reinterpret_cast<char*>(buf.data()), BUF);
+        std::streamsize got = f.gcount();
+        if (got > 0) sha256_update(c, buf.data(), size_t(got));
+    }
+    return sha256_final(c);


P2: The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/gguf_inspect.cpp, line 178: <comment>The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.</comment> <file context> @@ -36,4 +41,292 @@ GgufModelInfo inspect_gguf_model_info(const char * path) { + // blow the C++ thread stack. + constexpr size_t BUF = 4 * 1024 * 1024; + std::vector<uint8_t> buf(BUF); + while (f) { + f.read(reinterpret_cast<char*>(buf.data()), BUF); + std::streamsize got = f.gcount(); </file context>

Suggested change

while (f) {

f.read(reinterpret_cast<char*>(buf.data()), BUF);

std::streamsize got = f.gcount();

if (got > 0) sha256_update(c, buf.data(), size_t(got));

}

return sha256_final(c);

while (true) {

f.read(reinterpret_cast<char*>(buf.data()), BUF);

const std::streamsize got = f.gcount();

if (got > 0) sha256_update(c, buf.data(), size_t(got));

if (f.eof()) break;

if (!f) return {};

}

return sha256_final(c);

cubic-dev-ai · 2026-06-04T05:17:00Z

+        assert fd["content_tokens"] == 0
+        assert fd["thinking_tokens"] == fd["total_tokens"]


P2: Natural-close test encodes an incorrect token-accounting invariant (content_tokens == 0), conflicting with the documented finish_details contract.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/scripts/test_server_integration.py, line 1108: <comment>Natural-close test encodes an incorrect token-accounting invariant (`content_tokens == 0`), conflicting with the documented `finish_details` contract.</comment> <file context> @@ -871,3 +897,243 @@ def test_stop_no_match(self): + # Phase-2 did not fire — content_tokens stays 0 when the model + # self-closes (all generated tokens are reasoning + content interleaved + # via the emitter on the phase-1 stream). + assert fd["content_tokens"] == 0 + assert fd["thinking_tokens"] == fd["total_tokens"] + </file context>

Suggested change

assert fd["content_tokens"] == 0

assert fd["thinking_tokens"] == fd["total_tokens"]

assert fd["content_tokens"] > 0

assert fd["thinking_tokens"] + fd["content_tokens"] == fd["total_tokens"]

cubic-dev-ai · 2026-06-04T05:17:00Z

+    // Per-call override for the verify-time flash-attention window. Used by
+    // do_spec_decode to widen the window when pflash compression has shrunk
+    // the prompt — see GenerateRequest.fa_window_override.
+    void set_fa_window(int fa) { fa_window_ = fa; }


P2: Sticky override state with no reset — set_fa_window permanently mutates fa_window_ but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_dflash_target.h, line 59: <comment>Sticky override state with no reset — `set_fa_window` permanently mutates `fa_window_` but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.</comment> <file context> @@ -53,6 +53,11 @@ class Qwen35DFlashTarget : public DFlashTarget { + // Per-call override for the verify-time flash-attention window. Used by + // do_spec_decode to widen the window when pflash compression has shrunk + // the prompt — see GenerateRequest.fa_window_override. + void set_fa_window(int fa) { fa_window_ = fa; } + private: </file context>

cubic-dev-ai · 2026-06-04T05:17:00Z


 bool Qwen35LayerSplitAdapter::can_dflash_decode() const {
-    return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();
+    return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;


P2: DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_layer_split_adapter.cpp, line 418: <comment>DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.</comment> <file context> @@ -445,7 +415,7 @@ bool Qwen35LayerSplitAdapter::decode_ar( bool Qwen35LayerSplitAdapter::can_dflash_decode() const { - return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing(); + return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f; } </file context>

Suggested change

return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;

return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();

cubic-dev-ai · 2026-06-04T05:17:00Z

+                (long long)derived_q_dim,
+                out.n_head, out.head_dim, (long long)expected_q_dim);
+            set_last_error(buf);
+            return false;


P2: New post-allocation validation returns can leak draft GPU/context resources on load failure.

Prompt for AI agents

Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 371: <comment>New post-allocation validation returns can leak draft GPU/context resources on load failure.</comment> <file context> @@ -349,6 +349,63 @@ bool load_draft_gguf(const std::string & path, + (long long)derived_q_dim, + out.n_head, out.head_dim, (long long)expected_q_dim); + set_last_error(buf); + return false; + } + if (derived_kv_dim != expected_kv_dim) { </file context>

easel · 2026-06-04T17:15:03Z

Closing this PR — provenance audit revealed two distinct problems that make a clean rebase impossible:

1. Byte-identical theft from dusterbloom's PR #274 (7 files):

server/src/common/c2_gate.h
server/src/inference/drafters/qwen3_drafter.{cpp,h}
server/src/inference/drafters/qwen35_dflash_target.h
server/src/inference/loaders/gguf_target_loader.cpp
server/src/inference/loaders/draft_gguf_loader.cpp
the c2-gate test cases in server/tests/unit/test_server_unit.cpp

These are dusterbloom's structural-defense / drafter work and should land via #274 (or a successor PR opened by their owner), not under this branch's authorship.

2. Silent reverts of weicj's already-merged work from #273 / #295 / #297 (6 files):

This branch was cut before those PRs merged into main, so the merge here would inadvertently roll back supports_cpu_sampling, prefill_last_logits, and related plumbing. Those reverts must not land.

What survives the audit (Erik's own work):

server/src/common/gguf_inspect.cpp (~292 LOC additive)
server/src/common/gguf_inspect.h (~40 LOC additive)

That's a SHA-256 mini-impl + GgufMetadata reader + sidecar caching + llama_ftype_name decode, layered on top of Howard Su's prior gguf_inspect from PR #305. It's being re-opened as a clean, scoped PR: feat/server-gguf-inspect (link in follow-up comment once pushed).

No code from this PR is being silently merged; everything legitimate is being re-introduced via its proper canonical PR.

… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

easel · 2026-06-04T17:16:25Z

Follow-up: the clean, scoped replacement PR for the surviving gguf_inspect work is #344 — #344 (333 insertions, 0 deletions, on top of #305).

… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

## What Adds a richer GGUF identity reader on top of Howard Su's existing `gguf_inspect` (PR Luce-Org#305): - `GgufMetadata` struct — captures `general.*` + `<arch>.*` header fields (architecture, name, file_type, quantization_version, block_count, embedding_length, context_length, vocab_size) with -1 / "" sentinels distinguishing "not in GGUF" from legitimate zero. - `read_gguf_metadata(path, compute_sha256)` — best-effort header read; optional SHA-256 of the whole file. - Self-contained SHA-256 mini-impl (RFC 6234) — no OpenSSL dependency added for one hash. - `<path>.sha256` sidecar caching — first server start hashes the file (~30s for a 17 GB GGUF on NVMe), subsequent starts read the sidecar. Sidecar I/O failures are non-fatal. - `llama_ftype_name` decode — maps `general.file_type` ints to human-readable names ("Q4_K_M", "IQ4_XS", etc.) for /props. ## Why `/props` schema-4 wants a single authoritative "exactly what binary + GGUF + quant + sha256 is loaded" payload so benchmarking and provenance tooling can pin model identity across runs without re-parsing GGUF headers in every consumer. The sidecar makes the SHA-256 free after the first boot, which is what makes it usable as a default-on identity field. ## Dependencies None. This is purely additive on top of `gguf_inspect.{cpp,h}` as merged in PR Luce-Org#305 — zero deletions, 333 insertions total. No other server files or build rules change in this PR; consumers will be wired up separately. ## Scope note This PR is the extracted-and-cleaned remnant of the previously closed PR Luce-Org#336 after a provenance audit; everything else from that branch (c2_gate, qwen3 drafter changes, structural-defense loaders, and the inadvertent reverts of Luce-Org#273/Luce-Org#295/Luce-Org#297) is either landing through its canonical PR (Luce-Org#274) or being dropped entirely.

… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

## What Adds a richer GGUF identity reader on top of Howard Su's existing `gguf_inspect` (PR Luce-Org#305): - `GgufMetadata` struct — captures `general.*` + `<arch>.*` header fields (architecture, name, file_type, quantization_version, block_count, embedding_length, context_length, vocab_size) with -1 / "" sentinels distinguishing "not in GGUF" from legitimate zero. - `read_gguf_metadata(path, compute_sha256)` — best-effort header read; optional SHA-256 of the whole file. - Self-contained SHA-256 mini-impl (RFC 6234) — no OpenSSL dependency added for one hash. - `<path>.sha256` sidecar caching — first server start hashes the file (~30s for a 17 GB GGUF on NVMe), subsequent starts read the sidecar. Sidecar I/O failures are non-fatal. - `llama_ftype_name` decode — maps `general.file_type` ints to human-readable names ("Q4_K_M", "IQ4_XS", etc.) for /props. ## Why `/props` schema-4 wants a single authoritative "exactly what binary + GGUF + quant + sha256 is loaded" payload so benchmarking and provenance tooling can pin model identity across runs without re-parsing GGUF headers in every consumer. The sidecar makes the SHA-256 free after the first boot, which is what makes it usable as a default-on identity field. ## Dependencies None. This is purely additive on top of `gguf_inspect.{cpp,h}` as merged in PR Luce-Org#305 — zero deletions, 333 insertions total. No other server files or build rules change in this PR; consumers will be wired up separately. ## Scope note This PR is the extracted-and-cleaned remnant of the previously closed PR Luce-Org#336 after a provenance audit; everything else from that branch (c2_gate, qwen3 drafter changes, structural-defense loaders, and the inadvertent reverts of Luce-Org#273/Luce-Org#295/Luce-Org#297) is either landing through its canonical PR (Luce-Org#274) or being dropped entirely.

… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts

…ps schema-4 User-facing thinking-control API across the HTTP server surface: - chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook). - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments. - test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing. Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports. - Luce-Org#336 (server-layer-split): CMake/build references - Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts - Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340

easel force-pushed the feat/server-layer-split branch 2 times, most recently from 3acc1d9 to 62aac00 Compare June 4, 2026 03:37

easel force-pushed the feat/server-layer-split branch from 62aac00 to 2cf8c99 Compare June 4, 2026 05:02

easel marked this pull request as ready for review June 4, 2026 05:03

cubic-dev-ai Bot reviewed Jun 4, 2026

View reviewed changes

This was referenced Jun 4, 2026

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

feat(server): soft-close thinking termination via logit-ratio peek #326

Closed

easel closed this Jun 4, 2026

easel mentioned this pull request Jun 4, 2026

feat(server): GgufMetadata reader + SHA-256 sidecar for /props schema-4 #344

Open

4 tasks

easel deleted the feat/server-layer-split branch June 4, 2026 17:16

	if (req.do_sample && req.sampler.temp > 0.0f) {
	if (req.do_sample) {

		assert fd["content_tokens"] == 0
		assert fd["thinking_tokens"] == fd["total_tokens"]

	return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;
	return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();

Conversation

easel commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

Dependencies

Test plan

Part of the PR #285 split

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

cubic-dev-ai Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

easel commented Jun 4, 2026

Uh oh!

easel commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

easel commented Jun 3, 2026 •

edited

Loading