refactor(server): shared layer-split backend + GGUF inspection + c2-gate plumbing#336
refactor(server): shared layer-split backend + GGUF inspection + c2-gate plumbing#336easel wants to merge 1 commit into
Conversation
3acc1d9 to
62aac00
Compare
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
… plumbing
## What
Cross-backend C++ refactor:
- Extracts a shared layer_split_backend (common/) and a GGUF inspection
helper (common/gguf_inspect.{h,cpp}).
- Adds the c2_spec_decode_permitted predicate in qwen35/c2_gate.h
(a backend gating predicate, not a user-visible switch).
- Dim-aware draft loading in draft/draft_gguf_loader.cpp.
- Renames PFLASH_COMPRESS_* config keys across all three backends
(qwen3, qwen35, gemma4) plus their layer_split adapters/graphs.
- Removes obsolete bench scripts (bench_agent, bench_he, bench_llm,
bench_server, bench_daemon) and lands a unified
test_server_integration.py plus C++ test_server_unit.cpp coverage.
## Why
Base refactor that downstream feature PRs build on. Centralizing
layer-split + GGUF inspection + c2 gating prevents drift between
backends and makes the four follow-on feature PRs reviewable as small
diffs against a clean base.
## Dependencies
None - this PR is independent.
62aac00 to
2cf8c99
Compare
## What
Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
docs/experiments/.
server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).
## Why
Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
## Dependencies
- Luce-Org#336 (server-layer-split): #includes qwen35/c2_gate.h and uses BudgetHook
fa_window_override + c2_spec_decode_permitted from the shared backend
…ps schema-4 ## What User-facing thinking-control API across the HTTP server surface: - chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook). - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments. - test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing. ## Why Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports. ## Dependencies - Luce-Org#336 (server-layer-split): CMake/build references - Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts - Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340
There was a problem hiding this comment.
10 issues found across 31 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/qwen35/c2_gate.h">
<violation number="1" location="server/src/qwen35/c2_gate.h:28">
P2: The gate threshold uses `int` multiplication (`2 * fa_window_cfg`), which can overflow and miscompute spec-decode eligibility for large configs.</violation>
</file>
<file name="server/src/gemma4/gemma4_layer_split_adapter.cpp">
<violation number="1" location="server/src/gemma4/gemma4_layer_split_adapter.cpp:149">
P2: Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.</violation>
</file>
<file name="server/src/qwen3/qwen3_loader.cpp">
<violation number="1" location="server/src/qwen3/qwen3_loader.cpp:136">
P2: Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but `copy_tensor_from_file` (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary `wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16"` silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; `ggml_type_name(wtype)` is already used for error logging in the same function and should be used here instead.</violation>
</file>
<file name="server/src/common/gguf_inspect.cpp">
<violation number="1" location="server/src/common/gguf_inspect.cpp:178">
P2: The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.</violation>
</file>
<file name="server/scripts/test_server_integration.py">
<violation number="1" location="server/scripts/test_server_integration.py:1108">
P2: Natural-close test encodes an incorrect token-accounting invariant (`content_tokens == 0`), conflicting with the documented `finish_details` contract.</violation>
</file>
<file name="server/src/qwen35/qwen35_dflash_target.h">
<violation number="1" location="server/src/qwen35/qwen35_dflash_target.h:59">
P2: Sticky override state with no reset — `set_fa_window` permanently mutates `fa_window_` but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.</violation>
</file>
<file name="server/src/qwen35/qwen35_layer_split_adapter.cpp">
<violation number="1" location="server/src/qwen35/qwen35_layer_split_adapter.cpp:403">
P1: AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.</violation>
<violation number="2" location="server/src/qwen35/qwen35_layer_split_adapter.cpp:418">
P2: DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:371">
P2: New post-allocation validation returns can leak draft GPU/context resources on load failure.</violation>
</file>
<file name="server/src/common/layer_split_backend.cpp">
<violation number="1" location="server/src/common/layer_split_backend.cpp:60">
P1: The condition `req.sampler.temp > 0.0f` is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| cfg_.run_dflash ? &feature_ring_ : nullptr, | ||
| /*argmax_out=*/nullptr, | ||
| sampler_.needs_logit_processing() ? &logits_buf : nullptr)) { | ||
| cfg_.run_dflash ? &feature_ring_ : nullptr)) { |
There was a problem hiding this comment.
P1: AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_layer_split_adapter.cpp, line 403:
<comment>AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.</comment>
<file context>
@@ -419,24 +397,16 @@ bool Qwen35LayerSplitAdapter::decode_ar(
- cfg_.run_dflash ? &feature_ring_ : nullptr,
- /*argmax_out=*/nullptr,
- sampler_.needs_logit_processing() ? &logits_buf : nullptr)) {
+ cfg_.run_dflash ? &feature_ring_ : nullptr)) {
return false;
}
</file context>
| } | ||
| if (req.do_sample && req.sampler.needs_logit_processing() && | ||
| !adapter_->supports_cpu_sampling()) { | ||
| if (req.do_sample && req.sampler.temp > 0.0f) { |
There was a problem hiding this comment.
P1: The condition req.sampler.temp > 0.0f is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/layer_split_backend.cpp, line 60:
<comment>The condition `req.sampler.temp > 0.0f` is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.</comment>
<file context>
@@ -57,8 +57,7 @@ GenerateResult LayerSplitBackend::run_from_state(const GenerateRequest & req,
}
- if (req.do_sample && req.sampler.needs_logit_processing() &&
- !adapter_->supports_cpu_sampling()) {
+ if (req.do_sample && req.sampler.temp > 0.0f) {
result.error = "sampling_unsupported";
return result;
</file context>
| if (req.do_sample && req.sampler.temp > 0.0f) { | |
| if (req.do_sample) { |
| int kv_committed) { | ||
| (void)kv_committed; | ||
| return (fa_window_override == 0) | ||
| || (fa_window_override <= 2 * fa_window_cfg); |
There was a problem hiding this comment.
P2: The gate threshold uses int multiplication (2 * fa_window_cfg), which can overflow and miscompute spec-decode eligibility for large configs.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/c2_gate.h, line 28:
<comment>The gate threshold uses `int` multiplication (`2 * fa_window_cfg`), which can overflow and miscompute spec-decode eligibility for large configs.</comment>
<file context>
@@ -0,0 +1,31 @@
+ int kv_committed) {
+ (void)kv_committed;
+ return (fa_window_override == 0)
+ || (fa_window_override <= 2 * fa_window_cfg);
+}
+
</file context>
| if (req.do_sample && sampler_.seed != 0) { | ||
| sampler_rng_.seed(sampler_.seed); | ||
| } | ||
| (void)req; |
There was a problem hiding this comment.
P2: Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/gemma4/gemma4_layer_split_adapter.cpp, line 149:
<comment>Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.</comment>
<file context>
@@ -146,25 +146,20 @@ bool Gemma4LayerSplitAdapter::init() {
- if (req.do_sample && sampler_.seed != 0) {
- sampler_rng_.seed(sampler_.seed);
- }
+ (void)req;
}
</file context>
| out.head_dim = (int)get_u32(gctx, "qwen3.attention.key_length", 128); | ||
| out.rope_theta = get_f32(gctx, "qwen3.rope.freq_base", 1000000.0f); | ||
|
|
||
| // Detect weight quant type from blk.0.attn_q.weight; support BF16 and Q8_0. |
There was a problem hiding this comment.
P2: Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but copy_tensor_from_file (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16" silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; ggml_type_name(wtype) is already used for error logging in the same function and should be used here instead.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/qwen3_loader.cpp, line 136:
<comment>Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but `copy_tensor_from_file` (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary `wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16"` silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; `ggml_type_name(wtype)` is already used for error logging in the same function and should be used here instead.</comment>
<file context>
@@ -133,6 +133,18 @@ bool load_qwen3_drafter_model(const std::string & path,
out.head_dim = (int)get_u32(gctx, "qwen3.attention.key_length", 128);
out.rope_theta = get_f32(gctx, "qwen3.rope.freq_base", 1000000.0f);
+ // Detect weight quant type from blk.0.attn_q.weight; support BF16 and Q8_0.
+ ggml_type wtype = GGML_TYPE_BF16;
+ {
</file context>
| while (f) { | ||
| f.read(reinterpret_cast<char*>(buf.data()), BUF); | ||
| std::streamsize got = f.gcount(); | ||
| if (got > 0) sha256_update(c, buf.data(), size_t(got)); | ||
| } | ||
| return sha256_final(c); |
There was a problem hiding this comment.
P2: The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/gguf_inspect.cpp, line 178:
<comment>The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.</comment>
<file context>
@@ -36,4 +41,292 @@ GgufModelInfo inspect_gguf_model_info(const char * path) {
+ // blow the C++ thread stack.
+ constexpr size_t BUF = 4 * 1024 * 1024;
+ std::vector<uint8_t> buf(BUF);
+ while (f) {
+ f.read(reinterpret_cast<char*>(buf.data()), BUF);
+ std::streamsize got = f.gcount();
</file context>
| while (f) { | |
| f.read(reinterpret_cast<char*>(buf.data()), BUF); | |
| std::streamsize got = f.gcount(); | |
| if (got > 0) sha256_update(c, buf.data(), size_t(got)); | |
| } | |
| return sha256_final(c); | |
| while (true) { | |
| f.read(reinterpret_cast<char*>(buf.data()), BUF); | |
| const std::streamsize got = f.gcount(); | |
| if (got > 0) sha256_update(c, buf.data(), size_t(got)); | |
| if (f.eof()) break; | |
| if (!f) return {}; | |
| } | |
| return sha256_final(c); |
| assert fd["content_tokens"] == 0 | ||
| assert fd["thinking_tokens"] == fd["total_tokens"] |
There was a problem hiding this comment.
P2: Natural-close test encodes an incorrect token-accounting invariant (content_tokens == 0), conflicting with the documented finish_details contract.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/scripts/test_server_integration.py, line 1108:
<comment>Natural-close test encodes an incorrect token-accounting invariant (`content_tokens == 0`), conflicting with the documented `finish_details` contract.</comment>
<file context>
@@ -871,3 +897,243 @@ def test_stop_no_match(self):
+ # Phase-2 did not fire — content_tokens stays 0 when the model
+ # self-closes (all generated tokens are reasoning + content interleaved
+ # via the emitter on the phase-1 stream).
+ assert fd["content_tokens"] == 0
+ assert fd["thinking_tokens"] == fd["total_tokens"]
+
</file context>
| assert fd["content_tokens"] == 0 | |
| assert fd["thinking_tokens"] == fd["total_tokens"] | |
| assert fd["content_tokens"] > 0 | |
| assert fd["thinking_tokens"] + fd["content_tokens"] == fd["total_tokens"] |
| // Per-call override for the verify-time flash-attention window. Used by | ||
| // do_spec_decode to widen the window when pflash compression has shrunk | ||
| // the prompt — see GenerateRequest.fa_window_override. | ||
| void set_fa_window(int fa) { fa_window_ = fa; } |
There was a problem hiding this comment.
P2: Sticky override state with no reset — set_fa_window permanently mutates fa_window_ but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_dflash_target.h, line 59:
<comment>Sticky override state with no reset — `set_fa_window` permanently mutates `fa_window_` but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.</comment>
<file context>
@@ -53,6 +53,11 @@ class Qwen35DFlashTarget : public DFlashTarget {
+ // Per-call override for the verify-time flash-attention window. Used by
+ // do_spec_decode to widen the window when pflash compression has shrunk
+ // the prompt — see GenerateRequest.fa_window_override.
+ void set_fa_window(int fa) { fa_window_ = fa; }
+
private:
</file context>
|
|
||
| bool Qwen35LayerSplitAdapter::can_dflash_decode() const { | ||
| return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing(); | ||
| return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f; |
There was a problem hiding this comment.
P2: DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_layer_split_adapter.cpp, line 418:
<comment>DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.</comment>
<file context>
@@ -445,7 +415,7 @@ bool Qwen35LayerSplitAdapter::decode_ar(
bool Qwen35LayerSplitAdapter::can_dflash_decode() const {
- return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();
+ return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;
}
</file context>
| return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f; | |
| return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing(); |
| (long long)derived_q_dim, | ||
| out.n_head, out.head_dim, (long long)expected_q_dim); | ||
| set_last_error(buf); | ||
| return false; |
There was a problem hiding this comment.
P2: New post-allocation validation returns can leak draft GPU/context resources on load failure.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 371:
<comment>New post-allocation validation returns can leak draft GPU/context resources on load failure.</comment>
<file context>
@@ -349,6 +349,63 @@ bool load_draft_gguf(const std::string & path,
+ (long long)derived_q_dim,
+ out.n_head, out.head_dim, (long long)expected_q_dim);
+ set_last_error(buf);
+ return false;
+ }
+ if (derived_kv_dim != expected_kv_dim) {
</file context>
|
Closing this PR — provenance audit revealed two distinct problems that make a clean rebase impossible: 1. Byte-identical theft from dusterbloom's PR #274 (7 files):
These are dusterbloom's structural-defense / drafter work and should land via #274 (or a successor PR opened by their owner), not under this branch's authorship. 2. Silent reverts of weicj's already-merged work from #273 / #295 / #297 (6 files): This branch was cut before those PRs merged into What survives the audit (Erik's own work):
That's a SHA-256 mini-impl + No code from this PR is being silently merged; everything legitimate is being re-introduced via its proper canonical PR. |
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
## What Adds a richer GGUF identity reader on top of Howard Su's existing `gguf_inspect` (PR Luce-Org#305): - `GgufMetadata` struct — captures `general.*` + `<arch>.*` header fields (architecture, name, file_type, quantization_version, block_count, embedding_length, context_length, vocab_size) with -1 / "" sentinels distinguishing "not in GGUF" from legitimate zero. - `read_gguf_metadata(path, compute_sha256)` — best-effort header read; optional SHA-256 of the whole file. - Self-contained SHA-256 mini-impl (RFC 6234) — no OpenSSL dependency added for one hash. - `<path>.sha256` sidecar caching — first server start hashes the file (~30s for a 17 GB GGUF on NVMe), subsequent starts read the sidecar. Sidecar I/O failures are non-fatal. - `llama_ftype_name` decode — maps `general.file_type` ints to human-readable names ("Q4_K_M", "IQ4_XS", etc.) for /props. ## Why `/props` schema-4 wants a single authoritative "exactly what binary + GGUF + quant + sha256 is loaded" payload so benchmarking and provenance tooling can pin model identity across runs without re-parsing GGUF headers in every consumer. The sidecar makes the SHA-256 free after the first boot, which is what makes it usable as a default-on identity field. ## Dependencies None. This is purely additive on top of `gguf_inspect.{cpp,h}` as merged in PR Luce-Org#305 — zero deletions, 333 insertions total. No other server files or build rules change in this PR; consumers will be wired up separately. ## Scope note This PR is the extracted-and-cleaned remnant of the previously closed PR Luce-Org#336 after a provenance audit; everything else from that branch (c2_gate, qwen3 drafter changes, structural-defense loaders, and the inadvertent reverts of Luce-Org#273/Luce-Org#295/Luce-Org#297) is either landing through its canonical PR (Luce-Org#274) or being dropped entirely.
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
## What Adds a richer GGUF identity reader on top of Howard Su's existing `gguf_inspect` (PR Luce-Org#305): - `GgufMetadata` struct — captures `general.*` + `<arch>.*` header fields (architecture, name, file_type, quantization_version, block_count, embedding_length, context_length, vocab_size) with -1 / "" sentinels distinguishing "not in GGUF" from legitimate zero. - `read_gguf_metadata(path, compute_sha256)` — best-effort header read; optional SHA-256 of the whole file. - Self-contained SHA-256 mini-impl (RFC 6234) — no OpenSSL dependency added for one hash. - `<path>.sha256` sidecar caching — first server start hashes the file (~30s for a 17 GB GGUF on NVMe), subsequent starts read the sidecar. Sidecar I/O failures are non-fatal. - `llama_ftype_name` decode — maps `general.file_type` ints to human-readable names ("Q4_K_M", "IQ4_XS", etc.) for /props. ## Why `/props` schema-4 wants a single authoritative "exactly what binary + GGUF + quant + sha256 is loaded" payload so benchmarking and provenance tooling can pin model identity across runs without re-parsing GGUF headers in every consumer. The sidecar makes the SHA-256 free after the first boot, which is what makes it usable as a default-on identity field. ## Dependencies None. This is purely additive on top of `gguf_inspect.{cpp,h}` as merged in PR Luce-Org#305 — zero deletions, 333 insertions total. No other server files or build rules change in this PR; consumers will be wired up separately. ## Scope note This PR is the extracted-and-cleaned remnant of the previously closed PR Luce-Org#336 after a provenance audit; everything else from that branch (c2_gate, qwen3 drafter changes, structural-defense loaders, and the inadvertent reverts of Luce-Org#273/Luce-Org#295/Luce-Org#297) is either landing through its canonical PR (Luce-Org#274) or being dropped entirely.
## What Adds a richer GGUF identity reader on top of Howard Su's existing `gguf_inspect` (PR Luce-Org#305): - `GgufMetadata` struct — captures `general.*` + `<arch>.*` header fields (architecture, name, file_type, quantization_version, block_count, embedding_length, context_length, vocab_size) with -1 / "" sentinels distinguishing "not in GGUF" from legitimate zero. - `read_gguf_metadata(path, compute_sha256)` — best-effort header read; optional SHA-256 of the whole file. - Self-contained SHA-256 mini-impl (RFC 6234) — no OpenSSL dependency added for one hash. - `<path>.sha256` sidecar caching — first server start hashes the file (~30s for a 17 GB GGUF on NVMe), subsequent starts read the sidecar. Sidecar I/O failures are non-fatal. - `llama_ftype_name` decode — maps `general.file_type` ints to human-readable names ("Q4_K_M", "IQ4_XS", etc.) for /props. ## Why `/props` schema-4 wants a single authoritative "exactly what binary + GGUF + quant + sha256 is loaded" payload so benchmarking and provenance tooling can pin model identity across runs without re-parsing GGUF headers in every consumer. The sidecar makes the SHA-256 free after the first boot, which is what makes it usable as a default-on identity field. ## Dependencies None. This is purely additive on top of `gguf_inspect.{cpp,h}` as merged in PR Luce-Org#305 — zero deletions, 333 insertions total. No other server files or build rules change in this PR; consumers will be wired up separately. ## Scope note This PR is the extracted-and-cleaned remnant of the previously closed PR Luce-Org#336 after a provenance audit; everything else from that branch (c2_gate, qwen3 drafter changes, structural-defense loaders, and the inadvertent reverts of Luce-Org#273/Luce-Org#295/Luce-Org#297) is either landing through its canonical PR (Luce-Org#274) or being dropped entirely.
… adapters ## What New lucebox/ Python package exposing the hub CLI (autotune, sweep, profile, smoke, models, config, download, host-check, docker_run) plus the lucebox.sh launcher wrapper and install.sh. Adds the harness/ adapter package wrapping external coding agents (claude_code, codex, hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh for wrapper validation, full pytest coverage under lucebox/tests/, and the bragi autotune profile-sweep protocol docs. ## Why This is the user-facing surface of lucebox-hub: one CLI to launch the image, tune layer-split / pflash settings against a host, run sweeps, and dispatch bench runs. Splitting it out keeps Python-side review independent of the C++ server and Docker stack reviews. ## Dependencies - Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image - Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep) - Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
…ps schema-4 User-facing thinking-control API across the HTTP server surface: - chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn. - http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection. - server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot. - sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook). - Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments. - test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing. Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports. - Luce-Org#336 (server-layer-split): CMake/build references - Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts - Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340
Summary
Cross-backend C++ refactor that extracts a shared
layer_split_backend, a GGUF inspection helper (gguf_inspect), thec2_spec_decode_permittedpredicate (c2_gate.h), dim-aware draft GGUF loading, and thePFLASH_COMPRESS_*rename. Touches all three backends (qwen3, qwen35, gemma4), CMake, and the integration test surface. Also removes seven legacyserver/scripts/bench_*.pyscripts that have been superseded upstream.Files
server/src/common/: newgguf_inspect.{h,cpp}, updates tolayer_split_backend.{h,cpp}andmodel_backend.hserver/src/qwen35/: newc2_gate.h; refactors acrosslayer_split_forward.{h,cpp},qwen35_layer_split_adapter.{h,cpp},gguf_target_loader.cpp,qwen35_dflash_target.hserver/src/qwen3/: backend, graph, and loader updates (qwen3_backend.cpp,qwen3_graph.cpp,qwen3_loader.cpp)server/src/gemma4/: graph + layer-split adapter updatesserver/src/draft/: dim-aware draft GGUF loaderserver/CMakeLists.txt,server/README.mdserver/test/test_server_unit.cpp,server/scripts/test_server_integration.pyserver/scripts/fixtures/agent_cases/cases.json(new)server/scripts/bench_agent.py,bench_agent_loop.py,bench_daemon.py,bench_he.py,bench_he_http.py,bench_llm.py,bench_server.pyNet: 30 files changed, ~1010 insertions / ~2580 deletions.
Dependencies
None — this PR is independent and forms the foundation for sibling server PRs.
Reverse dependencies (PRs that depend on this one):
c2_gate.hintroduced herec2_gate.h(if still referenced after Phase 1 cleanup)Test plan
server/test/test_server_unit.cpppasses on rolled-up branchserver/scripts/test_server_integration.pypasses on rolled-up branchPFLASH_COMPRESS_*env-var rename is consistent across server + bench callersNote: this split PR is not independently buildable.
server/CMakeLists.txtalso references sources (e.g.server/src/qwen3/anchor_scan.cpp) that land in sibling split PRs, so standalone CMake configuration will fail on missing sources. Validation has to happen on the rolled-up branch (easel/feat/lucebox-docker), which is green.Part of the PR #285 split
This PR is one of several tightly-scoped PRs carved out of the omnibus PR #285. See that PR for the full context and the other split-out PRs.
Generated with Claude Code (https://claude.com/claude-code)