refactor(server): shared layer-split backend + GGUF inspection + c2-gate plumbing #336

Closed. easel wants to merge 1 commit into Luce-Org:main from easel:feat/server-layer-split

easel (Collaborator) commented Jun 3, 2026

Summary

Cross-backend C++ refactor that extracts a shared layer_split_backend, a GGUF inspection helper (gguf_inspect), the c2_spec_decode_permitted predicate (c2_gate.h), dim-aware draft GGUF loading, and the PFLASH_COMPRESS_* rename. Touches all three backends (qwen3, qwen35, gemma4), CMake, and the integration test surface. Also removes seven legacy server/scripts/bench_*.py scripts that have been superseded upstream.

Files

  • server/src/common/: new gguf_inspect.{h,cpp}, updates to layer_split_backend.{h,cpp} and model_backend.h
  • server/src/qwen35/: new c2_gate.h; refactors across layer_split_forward.{h,cpp}, qwen35_layer_split_adapter.{h,cpp}, gguf_target_loader.cpp, qwen35_dflash_target.h
  • server/src/qwen3/: backend, graph, and loader updates (qwen3_backend.cpp, qwen3_graph.cpp, qwen3_loader.cpp)
  • server/src/gemma4/: graph + layer-split adapter updates
  • server/src/draft/: dim-aware draft GGUF loader
  • server/CMakeLists.txt, server/README.md
  • server/test/test_server_unit.cpp, server/scripts/test_server_integration.py
  • server/scripts/fixtures/agent_cases/cases.json (new)
  • Deletes: server/scripts/bench_agent.py, bench_agent_loop.py, bench_daemon.py, bench_he.py, bench_he_http.py, bench_llm.py, bench_server.py

Net: 30 files changed, ~1010 insertions / ~2580 deletions.

Dependencies

None — this PR is independent and forms the foundation for sibling server PRs.

Reverse dependencies (PRs that depend on this one):

Test plan

  • server/test/test_server_unit.cpp passes on rolled-up branch
  • server/scripts/test_server_integration.py passes on rolled-up branch
  • qwen3, qwen35, and gemma4 backends all build and produce identical decode output vs. main
  • Draft GGUF loader handles dim-mismatched draft models correctly
  • PFLASH_COMPRESS_* env-var rename is consistent across server + bench callers

Note: this split PR is not independently buildable. server/CMakeLists.txt also references sources (e.g. server/src/qwen3/anchor_scan.cpp) that land in sibling split PRs, so standalone CMake configuration will fail on missing sources. Validation has to happen on the rolled-up branch (easel/feat/lucebox-docker), which is green.

Part of the PR #285 split

This PR is one of several tightly-scoped PRs carved out of the omnibus PR #285. See that PR for the full context and the other split-out PRs.

Generated with Claude Code (https://claude.com/claude-code)

easel force-pushed the feat/server-layer-split branch 2 times, most recently from 3acc1d9 to 62aac00, on June 4, 2026 03:37
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
… adapters

## What

New lucebox/ Python package exposing the hub CLI (autotune, sweep,
profile, smoke, models, config, download, host-check, docker_run) plus
the lucebox.sh launcher wrapper and install.sh. Adds the harness/
adapter package wrapping external coding agents (claude_code, codex,
hermes, openclaw, opencode, pi) that autotune sweeps drive. Ships
scripts/check_lucebox_wrapper_sandbox.sh and scripts/test_lucebox_sh.sh
for wrapper validation, full pytest coverage under lucebox/tests/, and
the bragi autotune profile-sweep protocol docs.

## Why

This is the user-facing surface of lucebox-hub: one CLI to launch the
image, tune layer-split / pflash settings against a host, run sweeps,
and dispatch bench runs. Splitting it out keeps Python-side review
independent of the C++ server and Docker stack reviews.

## Dependencies

- Luce-Org#334 (docker-stack): docker_run.py launches the lucebox-hub image
- Luce-Org#337 (lucebench-harness): lucebox bench delegates to luce-bench (workspace dep)
- Luce-Org#336 (server-layer-split): autotune presumes layer-split build artifacts
… plumbing

## What

Cross-backend C++ refactor:

- Extracts a shared layer_split_backend (common/) and a GGUF inspection
  helper (common/gguf_inspect.{h,cpp}).
- Adds the c2_spec_decode_permitted predicate in qwen35/c2_gate.h
  (a backend gating predicate, not a user-visible switch).
- Dim-aware draft loading in draft/draft_gguf_loader.cpp.
- Renames PFLASH_COMPRESS_* config keys across all three backends
  (qwen3, qwen35, gemma4) plus their layer_split adapters/graphs.
- Removes obsolete bench scripts (bench_agent, bench_he, bench_llm,
  bench_server, bench_daemon) and lands a unified
  test_server_integration.py plus C++ test_server_unit.cpp coverage.

## Why

Base refactor that downstream feature PRs build on. Centralizing
layer-split + GGUF inspection + c2 gating prevents drift between
backends and makes the four follow-on feature PRs reviewable as small
diffs against a clean base.

## Dependencies

None - this PR is independent.
easel force-pushed the feat/server-layer-split branch from 62aac00 to 2cf8c99 on June 4, 2026 05:02
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
  probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
  docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).

## Why

Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.

## Dependencies

- Luce-Org#336 (server-layer-split): #includes qwen35/c2_gate.h and uses BudgetHook
  fa_window_override + c2_spec_decode_permitted from the shared backend
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
…ps schema-4

## What

User-facing thinking-control API across the HTTP server surface:

- chat_template prefills a closed <think> block when thinking is off
  (Qwen3-gated) so the model skips the reasoning preamble without
  losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target /
  model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-*
  flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the
  reasoning_content channel so reasoning never leaks into the
  user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards,
  the /props OpenAPI doc, updated thinking-budget spec, and the
  thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill,
  /props schema-4, and reasoning_content routing.

## Why

Gives clients a single, card-driven API to control thinking budgets,
soft-close behavior, and reasoning visibility - and an introspectable
/props surface to discover what the server supports.

## Dependencies

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio +
  pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks
  rely on tool_parser changes from Luce-Org#340
easel marked this pull request as ready for review June 4, 2026 05:03

cubic-dev-ai (Bot, Contributor) left a comment

10 issues found across 31 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen35/c2_gate.h">

<violation number="1" location="server/src/qwen35/c2_gate.h:28">
P2: The gate threshold uses `int` multiplication (`2 * fa_window_cfg`), which can overflow and miscompute spec-decode eligibility for large configs.</violation>
</file>

<file name="server/src/gemma4/gemma4_layer_split_adapter.cpp">

<violation number="1" location="server/src/gemma4/gemma4_layer_split_adapter.cpp:149">
P2: Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.</violation>
</file>

<file name="server/src/qwen3/qwen3_loader.cpp">

<violation number="1" location="server/src/qwen3/qwen3_loader.cpp:136">
P2: Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but `copy_tensor_from_file` (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary `wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16"` silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; `ggml_type_name(wtype)` is already used for error logging in the same function and should be used here instead.</violation>
</file>

<file name="server/src/common/gguf_inspect.cpp">

<violation number="1" location="server/src/common/gguf_inspect.cpp:178">
P2: The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.</violation>
</file>

<file name="server/scripts/test_server_integration.py">

<violation number="1" location="server/scripts/test_server_integration.py:1108">
P2: Natural-close test encodes an incorrect token-accounting invariant (`content_tokens == 0`), conflicting with the documented `finish_details` contract.</violation>
</file>

<file name="server/src/qwen35/qwen35_dflash_target.h">

<violation number="1" location="server/src/qwen35/qwen35_dflash_target.h:59">
P2: Sticky override state with no reset — `set_fa_window` permanently mutates `fa_window_` but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.</violation>
</file>

<file name="server/src/qwen35/qwen35_layer_split_adapter.cpp">

<violation number="1" location="server/src/qwen35/qwen35_layer_split_adapter.cpp:403">
P1: AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.</violation>

<violation number="2" location="server/src/qwen35/qwen35_layer_split_adapter.cpp:418">
P2: DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:371">
P2: New post-allocation validation returns can leak draft GPU/context resources on load failure.</violation>
</file>

<file name="server/src/common/layer_split_backend.cpp">

<violation number="1" location="server/src/common/layer_split_backend.cpp:60">
P1: The condition `req.sampler.temp > 0.0f` is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.</violation>
</file>

-                cfg_.run_dflash ? &feature_ring_ : nullptr,
-                /*argmax_out=*/nullptr,
-                sampler_.needs_logit_processing() ? &logits_buf : nullptr)) {
+                cfg_.run_dflash ? &feature_ring_ : nullptr)) {

P1: AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_layer_split_adapter.cpp, line 403:

<comment>AR decode no longer applies sampler logit processing, so repetition/frequency/presence penalties are ignored.</comment>

<file context>
@@ -419,24 +397,16 @@ bool Qwen35LayerSplitAdapter::decode_ar(
-                cfg_.run_dflash ? &feature_ring_ : nullptr,
-                /*argmax_out=*/nullptr,
-                sampler_.needs_logit_processing() ? &logits_buf : nullptr)) {
+                cfg_.run_dflash ? &feature_ring_ : nullptr)) {
             return false;
         }
</file context>

     }
-    if (req.do_sample && req.sampler.needs_logit_processing() &&
-        !adapter_->supports_cpu_sampling()) {
+    if (req.do_sample && req.sampler.temp > 0.0f) {

P1: The condition req.sampler.temp > 0.0f is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/layer_split_backend.cpp, line 60:

<comment>The condition `req.sampler.temp > 0.0f` is an incomplete proxy for logit-processing needs, allowing penalty-only requests to silently bypass the sampling-unsupported guard.</comment>

<file context>
@@ -57,8 +57,7 @@ GenerateResult LayerSplitBackend::run_from_state(const GenerateRequest & req,
     }
-    if (req.do_sample && req.sampler.needs_logit_processing() &&
-        !adapter_->supports_cpu_sampling()) {
+    if (req.do_sample && req.sampler.temp > 0.0f) {
         result.error = "sampling_unsupported";
         return result;
</file context>
Suggested change
-    if (req.do_sample && req.sampler.temp > 0.0f) {
+    if (req.do_sample) {
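
To make the gap in this guard concrete, here is a minimal sketch of how a temperature-only check and a `needs_logit_processing()`-style predicate can disagree. The `SamplerSketch` struct, its field names, and its neutral values (1.0 for the repetition penalty, 0.0 for the additive penalties) are assumptions for illustration; the sampler type in this repository may be defined differently.

```cpp
// Hypothetical stand-in for the request sampler settings. Field names and
// neutral values are assumptions, not the repository's actual struct.
struct SamplerSketch {
    float temp               = 0.0f;  // 0 means greedy decode
    float repetition_penalty = 1.0f;  // 1 means "no penalty"
    float frequency_penalty  = 0.0f;  // 0 means "no penalty"
    float presence_penalty   = 0.0f;  // 0 means "no penalty"

    // True when any setting requires modifying logits before picking a token.
    bool needs_logit_processing() const {
        return temp > 0.0f
            || repetition_penalty != 1.0f
            || frequency_penalty != 0.0f
            || presence_penalty != 0.0f;
    }
};

// The narrower check flagged in the review: it only looks at temperature.
inline bool temp_only_check(const SamplerSketch & s) {
    return s.temp > 0.0f;
}
```

With `temp == 0` and `repetition_penalty == 1.1`, `temp_only_check` is false while `needs_logit_processing()` is true, which is the penalty-only request that the changed guard would accept.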

int kv_committed) {
(void)kv_committed;
return (fa_window_override == 0)
|| (fa_window_override <= 2 * fa_window_cfg);

P2: The gate threshold uses int multiplication (2 * fa_window_cfg), which can overflow and miscompute spec-decode eligibility for large configs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/c2_gate.h, line 28:

<comment>The gate threshold uses `int` multiplication (`2 * fa_window_cfg`), which can overflow and miscompute spec-decode eligibility for large configs.</comment>

<file context>
@@ -0,0 +1,31 @@
+                                     int kv_committed) {
+    (void)kv_committed;
+    return (fa_window_override == 0)
+        || (fa_window_override <= 2 * fa_window_cfg);
+}
+
</file context>
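
One way to avoid the overflow described above is to widen to 64 bits before doubling. The sketch below assumes the parameter order `(fa_window_override, fa_window_cfg, kv_committed)`; only the last parameter and the function body are visible in the quoted diff, so the first two positions and the `_sketch` name are illustrative.

```cpp
#include <cstdint>

// Sketch of the gate with the doubling done in 64-bit arithmetic.
// The order of the first two parameters is an assumption.
inline bool c2_spec_decode_permitted_sketch(int fa_window_override,
                                            int fa_window_cfg,
                                            int kv_committed) {
    (void)kv_committed;  // unused, as in the quoted code
    if (fa_window_override == 0) {
        return true;  // no override requested
    }
    // Widen before multiplying so 2 * fa_window_cfg cannot overflow int.
    const std::int64_t limit = 2 * static_cast<std::int64_t>(fa_window_cfg);
    return static_cast<std::int64_t>(fa_window_override) <= limit;
}
```

The behavior matches the quoted expression for ordinary window sizes and stays well defined when `fa_window_cfg` is above half of `INT_MAX`.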

-    if (req.do_sample && sampler_.seed != 0) {
-        sampler_rng_.seed(sampler_.seed);
-    }
+    (void)req;

P2: Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/gemma4/gemma4_layer_split_adapter.cpp, line 149:

<comment>Sampler penalty settings are silently ignored in Gemma4 layer-split after sampler state was removed, causing accepted requests to run as plain greedy decode.</comment>

<file context>
@@ -146,25 +146,20 @@ bool Gemma4LayerSplitAdapter::init() {
-    if (req.do_sample && sampler_.seed != 0) {
-        sampler_rng_.seed(sampler_.seed);
-    }
+    (void)req;
 }
 
</file context>

out.head_dim = (int)get_u32(gctx, "qwen3.attention.key_length", 128);
out.rope_theta = get_f32(gctx, "qwen3.rope.freq_base", 1000000.0f);

// Detect weight quant type from blk.0.attn_q.weight; support BF16 and Q8_0.

P2: Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but copy_tensor_from_file (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16" silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; ggml_type_name(wtype) is already used for error logging in the same function and should be used here instead.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/qwen3_loader.cpp, line 136:

<comment>Weight type detection for Q8_0 is logged but never plumbed into tensor loading. The comment claims Q8_0 support, but `copy_tensor_from_file` (same file) only handles same-type, BF16→F16, BF16→F32, and F16→F32 — no Q8_0 conversion path exists. Loading a Q8_0 GGUF would print "detected weight type: Q8_0" then fail every tensor copy with "unsupported tensor conversion". Additionally, the ternary `wtype == GGML_TYPE_Q8_0 ? "Q8_0" : "BF16"` silently mislabels non-Q8_0 types (F16, Q4_0, etc.) as BF16; `ggml_type_name(wtype)` is already used for error logging in the same function and should be used here instead.</comment>

<file context>
@@ -133,6 +133,18 @@ bool load_qwen3_drafter_model(const std::string & path,
     out.head_dim   = (int)get_u32(gctx, "qwen3.attention.key_length", 128);
     out.rope_theta = get_f32(gctx, "qwen3.rope.freq_base", 1000000.0f);
 
+    // Detect weight quant type from blk.0.attn_q.weight; support BF16 and Q8_0.
+    ggml_type wtype = GGML_TYPE_BF16;
+    {
</file context>
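
The two points in this comment (the supported-conversion list and the mislabelled types) can be sketched without ggml. `TensorTypeSketch`, `copy_is_supported`, and `type_label` are hypothetical names; real code would use `ggml_type` and `ggml_type_name` from ggml instead of this stand-in enum.

```cpp
// Stand-in for the few tensor types discussed in the comment.
enum class TensorTypeSketch { F32, F16, BF16, Q8_0, Q4_0 };

// Mirrors the conversion list quoted in the review comment: same-type copies
// plus BF16->F16, BF16->F32, and F16->F32. Everything else is unsupported.
inline bool copy_is_supported(TensorTypeSketch src, TensorTypeSketch dst) {
    if (src == dst) return true;
    if (src == TensorTypeSketch::BF16) {
        return dst == TensorTypeSketch::F16 || dst == TensorTypeSketch::F32;
    }
    return src == TensorTypeSketch::F16 && dst == TensorTypeSketch::F32;
}

// Names every type explicitly instead of falling back to "BF16".
inline const char * type_label(TensorTypeSketch t) {
    switch (t) {
        case TensorTypeSketch::F32:  return "F32";
        case TensorTypeSketch::F16:  return "F16";
        case TensorTypeSketch::BF16: return "BF16";
        case TensorTypeSketch::Q8_0: return "Q8_0";
        case TensorTypeSketch::Q4_0: return "Q4_0";
    }
    return "unknown";
}
```

Under this list, a Q8_0 source copied into an F16 destination is rejected, which is the failure the comment predicts for a Q8_0 GGUF.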

Comment on lines +178 to +183
while (f) {
f.read(reinterpret_cast<char*>(buf.data()), BUF);
std::streamsize got = f.gcount();
if (got > 0) sha256_update(c, buf.data(), size_t(got));
}
return sha256_final(c);

P2: The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/gguf_inspect.cpp, line 178:

<comment>The hashing loop does not distinguish EOF from read errors, so it can return/cache a SHA-256 for truncated data.</comment>

<file context>
@@ -36,4 +41,292 @@ GgufModelInfo inspect_gguf_model_info(const char * path) {
+    // blow the C++ thread stack.
+    constexpr size_t BUF = 4 * 1024 * 1024;
+    std::vector<uint8_t> buf(BUF);
+    while (f) {
+        f.read(reinterpret_cast<char*>(buf.data()), BUF);
+        std::streamsize got = f.gcount();
</file context>
Suggested change
-    while (f) {
-        f.read(reinterpret_cast<char*>(buf.data()), BUF);
-        std::streamsize got = f.gcount();
-        if (got > 0) sha256_update(c, buf.data(), size_t(got));
-    }
-    return sha256_final(c);
+    while (true) {
+        f.read(reinterpret_cast<char*>(buf.data()), BUF);
+        const std::streamsize got = f.gcount();
+        if (got > 0) sha256_update(c, buf.data(), size_t(got));
+        if (f.eof()) break;
+        if (!f) return {};
+    }
+    return sha256_final(c);

Comment on lines +1108 to +1109
assert fd["content_tokens"] == 0
assert fd["thinking_tokens"] == fd["total_tokens"]

P2: Natural-close test encodes an incorrect token-accounting invariant (content_tokens == 0), conflicting with the documented finish_details contract.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/scripts/test_server_integration.py, line 1108:

<comment>Natural-close test encodes an incorrect token-accounting invariant (`content_tokens == 0`), conflicting with the documented `finish_details` contract.</comment>

<file context>
@@ -871,3 +897,243 @@ def test_stop_no_match(self):
+        # Phase-2 did not fire — content_tokens stays 0 when the model
+        # self-closes (all generated tokens are reasoning + content interleaved
+        # via the emitter on the phase-1 stream).
+        assert fd["content_tokens"] == 0
+        assert fd["thinking_tokens"] == fd["total_tokens"]
+
</file context>
Suggested change
-        assert fd["content_tokens"] == 0
-        assert fd["thinking_tokens"] == fd["total_tokens"]
+        assert fd["content_tokens"] > 0
+        assert fd["thinking_tokens"] + fd["content_tokens"] == fd["total_tokens"]

// Per-call override for the verify-time flash-attention window. Used by
// do_spec_decode to widen the window when pflash compression has shrunk
// the prompt — see GenerateRequest.fa_window_override.
void set_fa_window(int fa) { fa_window_ = fa; }

P2: Sticky override state with no reset — set_fa_window permanently mutates fa_window_ but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_dflash_target.h, line 59:

<comment>Sticky override state with no reset — `set_fa_window` permanently mutates `fa_window_` but the constructor's original value is lost after any call, so subsequent verify cycles silently use a stale override when the caller omits a set call before a specific verify step.</comment>

<file context>
@@ -53,6 +53,11 @@ class Qwen35DFlashTarget : public DFlashTarget {
+    // Per-call override for the verify-time flash-attention window. Used by
+    // do_spec_decode to widen the window when pflash compression has shrunk
+    // the prompt — see GenerateRequest.fa_window_override.
+    void set_fa_window(int fa) { fa_window_ = fa; }
+
 private:
</file context>
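
One possible way to keep the override from sticking is a small RAII guard that restores the previous window when the verify call ends. This is a sketch under assumptions: `TargetSketch` and `ScopedFaWindow` are hypothetical names, only `fa_window_` and `set_fa_window()` come from the quoted diff, and treating 0 as "no override" is borrowed from the `fa_window_override == 0` check in c2_gate.h.

```cpp
// Hypothetical stand-in for the target class.
class TargetSketch {
public:
    explicit TargetSketch(int fa_window) : fa_window_(fa_window) {}
    int  fa_window() const { return fa_window_; }
    void set_fa_window(int fa) { fa_window_ = fa; }
private:
    int fa_window_;
};

// Applies an override for the lifetime of the guard and restores the saved
// value on destruction, so a later verify step cannot see a stale override.
class ScopedFaWindow {
public:
    ScopedFaWindow(TargetSketch & target, int fa_window_override)
        : target_(target), saved_(target.fa_window()) {
        if (fa_window_override != 0) {  // assumption: 0 means "no override"
            target_.set_fa_window(fa_window_override);
        }
    }
    ~ScopedFaWindow() { target_.set_fa_window(saved_); }

    ScopedFaWindow(const ScopedFaWindow &) = delete;
    ScopedFaWindow & operator=(const ScopedFaWindow &) = delete;

private:
    TargetSketch & target_;
    int            saved_;
};
```

A caller would construct the guard at the top of a verify step; when it goes out of scope the configured window is back in place regardless of how the step exits.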


 bool Qwen35LayerSplitAdapter::can_dflash_decode() const {
-    return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();
+    return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;

P2: DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_layer_split_adapter.cpp, line 418:

<comment>DFlash gating is too permissive: it allows spec decode when non-temperature logit penalties are active.</comment>

<file context>
@@ -445,7 +415,7 @@ bool Qwen35LayerSplitAdapter::decode_ar(
 
 bool Qwen35LayerSplitAdapter::can_dflash_decode() const {
-    return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();
+    return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;
 }
 
</file context>
Suggested change
-    return cfg_.run_dflash && cfg_.draft_path && sampler_.temp == 0.0f;
+    return cfg_.run_dflash && cfg_.draft_path && !sampler_.needs_logit_processing();

(long long)derived_q_dim,
out.n_head, out.head_dim, (long long)expected_q_dim);
set_last_error(buf);
return false;

P2: New post-allocation validation returns can leak draft GPU/context resources on load failure.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 371:

<comment>New post-allocation validation returns can leak draft GPU/context resources on load failure.</comment>

<file context>
@@ -349,6 +349,63 @@ bool load_draft_gguf(const std::string & path,
+                (long long)derived_q_dim,
+                out.n_head, out.head_dim, (long long)expected_q_dim);
+            set_last_error(buf);
+            return false;
+        }
+        if (derived_kv_dim != expected_kv_dim) {
</file context>
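
A common pattern for this kind of leak is a scope guard that frees the allocation on every early return and is dismissed only on success. The sketch below uses hypothetical stand-ins (`ScopeGuard`, `DraftSketch`, `free_draft_sketch`, `load_draft_sketch`); it shows the control flow only and makes no claim about how the real loader allocates or frees its resources.

```cpp
#include <functional>
#include <utility>

// Runs a cleanup callback on scope exit unless dismiss() was called.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> on_exit)
        : on_exit_(std::move(on_exit)) {}
    ~ScopeGuard() { if (on_exit_) on_exit_(); }
    void dismiss() { on_exit_ = nullptr; }

    ScopeGuard(const ScopeGuard &) = delete;
    ScopeGuard & operator=(const ScopeGuard &) = delete;

private:
    std::function<void()> on_exit_;
};

// Hypothetical stand-in for the draft weights; `allocated` models the
// GPU/context resources mentioned in the review comment.
struct DraftSketch {
    bool allocated  = false;
    int  free_calls = 0;
};

inline void free_draft_sketch(DraftSketch & d) {
    d.allocated = false;
    ++d.free_calls;
}

inline bool load_draft_sketch(DraftSketch & out,
                              long long derived_q_dim,
                              long long expected_q_dim) {
    out.allocated = true;  // stands in for the real allocation
    ScopeGuard cleanup([&out] { free_draft_sketch(out); });

    if (derived_q_dim != expected_q_dim) {
        return false;  // the guard releases the allocation here
    }
    cleanup.dismiss();  // success: the caller now owns the resources
    return true;
}
```

With this shape, adding another validation `return false` after allocation does not need its own cleanup call.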

easel (Collaborator, Author) commented Jun 4, 2026

Closing this PR — provenance audit revealed two distinct problems that make a clean rebase impossible:

1. Byte-identical theft from dusterbloom's PR #274 (7 files):

  • server/src/common/c2_gate.h
  • server/src/inference/drafters/qwen3_drafter.{cpp,h}
  • server/src/inference/drafters/qwen35_dflash_target.h
  • server/src/inference/loaders/gguf_target_loader.cpp
  • server/src/inference/loaders/draft_gguf_loader.cpp
  • the c2-gate test cases in server/tests/unit/test_server_unit.cpp

These are dusterbloom's structural-defense / drafter work and should land via #274 (or a successor PR opened by their owner), not under this branch's authorship.

2. Silent reverts of weicj's already-merged work from #273 / #295 / #297 (6 files):

This branch was cut before those PRs merged into main, so the merge here would inadvertently roll back supports_cpu_sampling, prefill_last_logits, and related plumbing. Those reverts must not land.

What survives the audit (Erik's own work):

  • server/src/common/gguf_inspect.cpp (~292 LOC additive)
  • server/src/common/gguf_inspect.h (~40 LOC additive)

That's a SHA-256 mini-impl + GgufMetadata reader + sidecar caching + llama_ftype_name decode, layered on top of Howard Su's prior gguf_inspect from PR #305. It's being re-opened as a clean, scoped PR: feat/server-gguf-inspect (link in follow-up comment once pushed).

No code from this PR is being silently merged; everything legitimate is being re-introduced via its proper canonical PR.

easel closed this Jun 4, 2026
easel (Collaborator, Author) commented Jun 4, 2026

Follow-up: the clean, scoped replacement PR for the surviving gguf_inspect work is #344 (333 insertions, 0 deletions, on top of #305).

easel deleted the feat/server-layer-split branch June 4, 2026 17:16
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
## What

Adds a richer GGUF identity reader on top of Howard Su's existing
`gguf_inspect` (PR Luce-Org#305):

- `GgufMetadata` struct — captures `general.*` + `<arch>.*` header fields
  (architecture, name, file_type, quantization_version, block_count,
  embedding_length, context_length, vocab_size) with -1 / "" sentinels
  distinguishing "not in GGUF" from legitimate zero.
- `read_gguf_metadata(path, compute_sha256)` — best-effort header read;
  optional SHA-256 of the whole file.
- Self-contained SHA-256 mini-impl (RFC 6234) — no OpenSSL dependency
  added for one hash.
- `<path>.sha256` sidecar caching — first server start hashes the file
  (~30s for a 17 GB GGUF on NVMe), subsequent starts read the sidecar.
  Sidecar I/O failures are non-fatal.
- `llama_ftype_name` decode — maps `general.file_type` ints to
  human-readable names ("Q4_K_M", "IQ4_XS", etc.) for /props.
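
The sidecar behaviour described above can be sketched roughly as follows. This is a Python illustration of the idea only, not the C++ code in `gguf_inspect`; the function names `file_sha256` and `cached_sha256` are hypothetical, and the assumption that the sidecar stores a bare hex digest is ours rather than something this PR states.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the whole file in fixed-size chunks so memory use stays flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_sha256(gguf_path: str) -> str:
    """Return the file's SHA-256, preferring a `<path>.sha256` sidecar.

    Assumption: the sidecar holds the bare 64-character hex digest.
    """
    path = Path(gguf_path)
    sidecar = Path(gguf_path + ".sha256")

    # Fast path: reuse a previously stored digest if it looks well-formed.
    try:
        cached = sidecar.read_text(encoding="ascii").strip().lower()
        if len(cached) == 64 and all(c in "0123456789abcdef" for c in cached):
            return cached
    except (OSError, UnicodeDecodeError):
        pass  # missing or unreadable sidecar: fall through and hash the file

    digest = file_sha256(path)

    # Sidecar write failures are non-fatal; the digest is still returned.
    try:
        sidecar.write_text(digest + "\n", encoding="ascii")
    except OSError:
        pass
    return digest
```

Note that this sketch trusts any well-formed sidecar and does not detect a GGUF that was replaced after the sidecar was written; whether the real implementation guards against that is not stated here.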

## Why

`/props` schema-4 wants a single authoritative "exactly what binary +
GGUF + quant + sha256 is loaded" payload so benchmarking and provenance
tooling can pin model identity across runs without re-parsing GGUF
headers in every consumer. The sidecar makes the SHA-256 free after the
first boot, which is what makes it usable as a default-on identity
field.
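
As a rough illustration of what "pinning model identity across runs" could look like on the consumer side, the sketch below compares two `/props` payloads. The `model.target` block is named elsewhere in this conversation, but the keys inside it (`architecture`, `file_type_name`, `sha256`) and both function names are hypothetical placeholders, not the actual schema-4 field names.

```python
from typing import Any, Mapping, Optional, Tuple

Identity = Tuple[Optional[str], Optional[str], Optional[str]]


def model_identity(props: Mapping[str, Any]) -> Identity:
    """Extract an identity tuple from a /props payload.

    The key names below are assumptions made for this sketch only.
    """
    target = props.get("model", {}).get("target", {})
    return (
        target.get("architecture"),
        target.get("file_type_name"),
        target.get("sha256"),
    )


def same_model(a: Mapping[str, Any], b: Mapping[str, Any]) -> bool:
    """True only when both payloads carry a complete, identical identity."""
    id_a, id_b = model_identity(a), model_identity(b)
    # A missing field means the identity is unknown, so refuse to call it a match.
    return None not in id_a and id_a == id_b
```

A benchmarking tool could store the tuple alongside each result and reject comparisons between runs whose tuples differ or are incomplete.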

## Dependencies

None. This is purely additive on top of `gguf_inspect.{cpp,h}` as merged
in PR Luce-Org#305 — zero deletions, 333 insertions total. No other server
files or build rules change in this PR; consumers will be wired up
separately.

## Scope note

This PR is the extracted-and-cleaned remnant of the previously closed
PR Luce-Org#336 after a provenance audit; everything else from that branch
(c2_gate, qwen3 drafter changes, structural-defense loaders, and the
inadvertent reverts of Luce-Org#273/Luce-Org#295/Luce-Org#297) is either landing through its
canonical PR (Luce-Org#274) or being dropped entirely.
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
…ps schema-4

User-facing thinking-control API across the HTTP server surface:

- chat_template prefills a closed <think> block when thinking is off
  (Qwen3-gated) so the model skips the reasoning preamble without
  losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target /
  model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-*
  flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the
  reasoning_content channel so reasoning never leaks into the
  user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards,
  the /props OpenAPI doc, updated thinking-budget spec, and the
  thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill,
  /props schema-4, and reasoning_content routing.
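
The reasoning_content routing mentioned above can be illustrated with a small streaming splitter. This is a Python sketch of the general technique, not the C++ `sse_emitter`; the `ThinkRouter` class and its methods are hypothetical, and it assumes reasoning is delimited by literal `<think>` / `</think>` tags that may be split across stream chunks.

```python
from typing import List, Tuple

OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


class ThinkRouter:
    """Route streamed text to 'reasoning_content' or 'content' (illustrative)."""

    def __init__(self, in_think: bool = False) -> None:
        # in_think=True models a template that already prefilled "<think>".
        self.in_think = in_think
        self.buf = ""

    def _channel(self) -> str:
        return "reasoning_content" if self.in_think else "content"

    def feed(self, chunk: str) -> List[Tuple[str, str]]:
        """Consume one chunk and return (channel, text) pieces safe to emit."""
        self.buf += chunk
        out: List[Tuple[str, str]] = []
        while self.buf:
            tag = CLOSE_TAG if self.in_think else OPEN_TAG
            idx = self.buf.find(tag)
            if idx >= 0:
                # Emit text before the tag, drop the tag, flip the channel.
                if idx:
                    out.append((self._channel(), self.buf[:idx]))
                self.buf = self.buf[idx + len(tag):]
                self.in_think = not self.in_think
                continue
            # No complete tag: hold back a tail that could be the tag's start.
            keep = 0
            for n in range(min(len(tag) - 1, len(self.buf)), 0, -1):
                if tag.startswith(self.buf[-n:]):
                    keep = n
                    break
            emit = self.buf[:len(self.buf) - keep]
            if emit:
                out.append((self._channel(), emit))
            self.buf = self.buf[len(self.buf) - keep:]
            break
        return out

    def flush(self) -> List[Tuple[str, str]]:
        """Emit any held-back tail at end of stream."""
        out = [(self._channel(), self.buf)] if self.buf else []
        self.buf = ""
        return out
```

Holding back a possible partial tag is what keeps fragments such as `</th` from leaking into the user-visible content stream when a tag straddles two chunks.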

Gives clients a single, card-driven API to control thinking budgets,
soft-close behavior, and reasoning visibility - and an introspectable
/props surface to discover what the server supports.

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio +
  pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks
  rely on tool_parser changes from Luce-Org#340