feat(pflash): ee7 early-exit drafter + anchor-transitive cascade + regime router by easel · Pull Request #338 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:40:40Z

Summary

Adds the anchor-transitive cascade for the qwen3 pflash path (default-on, addresses the 64K NIAH cliff) along with drafter changes that thread the new anchor scan into the existing pflash drafter. Includes a focused C++ unit test for the anchor-transitive behavior and two design/configuration docs.

Files

server/src/qwen3/anchor_scan.{h,cpp} - new anchor scan implementation
server/src/qwen3/qwen3_drafter.{h,cpp} - drafter integration of anchor scan (~200 net LOC of changes)
server/src/server/http_server.h, server/src/server/server_main.cpp - minor wiring touch-ups
server/test/test_anchor_transitive.cpp - unit test for anchor-transitive cascade
docs/anchor-transitive.md, docs/pflash-compress-cfg.md - design / config notes

The earlier-claimed ee7 early-exit drafter, regime router, adaptive_keep_ratio, score_range, and the other four C++ unit tests are NOT in this PR's diff after the Phase 1-3 cleanup. The full A/B and prefix-cache experiment writeups previously listed are likewise not part of this scope.

Dependencies

None. After the Phase 1 cleanup this PR is independent and can land on its own (no remaining references to sibling-PR headers such as c2_gate.h or shared scaffolding).

Note: like the other server-layer split PRs, this branch cannot be built standalone because CMake still cross-references files that live on sibling branches; CI build validation has to happen on the integration branch.

Test plan

test_anchor_transitive passes locally and in CI on the integration branch
Qwen3.6 pflash run does not regress on the 64K NIAH point that motivated the anchor-transitive default-on change
Drafter warm-path behavior unchanged for the non-anchor regime (smoke test via a short pflash request)
Note: server PRs in this split cannot be validated standalone due to CMake cross-references; full build/test must run on the integration branch

🤖 Generated with Claude Code

…gime router

## What

User-facing pflash feature that lands four related pieces in the qwen3 backend and server:

- ee7 early-exit drafter with score-range gate and tail-capture guard (server/src/qwen3/qwen3_drafter.{h,cpp}, server/src/common/score_range.h).
- Anchor-transitive cascade default-on, fixing the 64K NIAH cliff (server/src/qwen3/anchor_scan.{h,cpp}).
- Regime router scaffolding for swapping pflash composition by regime (server/src/common/regime_router.h).
- Adaptive composition via per-request fa_window (server/src/server/adaptive_keep_ratio.h).

Ships test_anchor_transitive.cpp with five C++ unit tests (early-exit score-range gate, tail-capture guard, warm-path regression, anchor-transitive cascade, regime router) and the pflash composition / anchor / drafter design docs plus the qwen3.6 pflash A/B and prefix-cache writeups.

## Why

ee7 closes the long-context quality cliff on qwen3.6 at 64K and adds the regime/adaptive-composition hooks needed for per-request tuning.

## Dependencies

None - post-Phase-1 cleanup; forward-only plumbing was dropped.

…ps schema-4

## What

User-facing thinking-control API across the HTTP server surface:

- chat_template prefills a closed <think> block when thinking is off (Qwen3-gated) so the model skips the reasoning preamble without losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target / model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-* flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the reasoning_content channel so reasoning never leaks into the user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards, the /props OpenAPI doc, updated thinking-budget spec, and the thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill, /props schema-4, and reasoning_content routing.

## Why

Gives clients a single, card-driven API to control thinking budgets, soft-close behavior, and reasoning visibility - and an introspectable /props surface to discover what the server supports.

## Dependencies

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio + pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks rely on tool_parser changes from Luce-Org#340

cubic-dev-ai

4 issues found across 9 files

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen3/anchor_scan.cpp">

<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P2: Anchor scan incorrectly searches one position when `body_end < ngram`, causing false-positive forced chunks from non-body tokens.</violation>

<violation number="2" location="server/src/qwen3/anchor_scan.cpp:31">
P1: `max_anchor_hits` is effectively capped at 8 due to fixed `hit_pos[8]`, so long-context configs (16/32) do not work as intended.</violation>
</file>

<file name="server/src/qwen3/qwen3_drafter.cpp">

<violation number="1" location="server/src/qwen3/qwen3_drafter.cpp:132">
P2: Deprecation warning is logged on every request in a hot path; it should be emitted once, not per compression call.</violation>
</file>

<file name="docs/anchor-transitive.md">

<violation number="1" location="docs/anchor-transitive.md:13">
P3: Documentation references `bench/2026-05-25_longbench_hotpotqa/` but the `bench/` directory does not exist anywhere in this codebase, so the reference is unresolvable.</violation>
</file>


cubic-dev-ai · 2026-06-04T05:10:33Z

+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
+        int hits = 0;
+        int hit_pos[8];


P1: max_anchor_hits is effectively capped at 8 due to fixed hit_pos[8], so long-context configs (16/32) do not work as intended.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 31:

<comment>`max_anchor_hits` is effectively capped at 8 due to fixed `hit_pos[8]`, so long-context configs (16/32) do not work as intended.</comment>

<file context>
@@ -0,0 +1,169 @@
+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
+        int hits = 0;
+        int hit_pos[8];
+        for (int p = 0; p <= search_end && hits <= cfg.max_anchor_hits; ++p) {
+            bool same = true;
</file context>
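One possible shape for the P1 fix, offered as a sketch rather than the PR's actual code: size the hit buffer from the configured limit instead of a fixed `hit_pos[8]`. `AnchorCfg` and `find_anchor_hits` are hypothetical names introduced for illustration; only `ngram`, `max_anchor_hits`, `body_end`, `search_end`, and `hit_pos` appear in the quoted diff. The sketch stops at `< max_anchor_hits`, whereas the quoted loop uses `<=`, which may be intentional (for example to detect over-limit n-grams); that detail is not visible in the excerpt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's config struct; only the two fields
// visible in the quoted diff are modeled here.
struct AnchorCfg {
    int ngram = 4;
    int max_anchor_hits = 8;
};

// Returns the body positions where the n-gram starting at query[qi] occurs,
// keeping at most cfg.max_anchor_hits of them. The buffer grows with the
// configured limit, so settings such as 16 or 32 are honored.
static std::vector<int> find_anchor_hits(const std::vector<int>& body,
                                         const std::vector<int>& query,
                                         int qi, const AnchorCfg& cfg) {
    assert(qi >= 0);
    std::vector<int> hit_pos;
    const int body_end = static_cast<int>(body.size());
    if (cfg.ngram <= 0 || cfg.max_anchor_hits <= 0) return hit_pos;
    if (body_end < cfg.ngram) return hit_pos;  // body too short to hold one n-gram
    if (qi + cfg.ngram > static_cast<int>(query.size())) return hit_pos;

    hit_pos.reserve(static_cast<std::size_t>(cfg.max_anchor_hits));
    const int search_end = body_end - cfg.ngram;
    for (int p = 0;
         p <= search_end && static_cast<int>(hit_pos.size()) < cfg.max_anchor_hits;
         ++p) {
        // Compare the query n-gram against the body window starting at p.
        if (std::equal(query.begin() + qi, query.begin() + qi + cfg.ngram,
                       body.begin() + p)) {
            hit_pos.push_back(p);
        }
    }
    return hit_pos;
}
```

If heap allocation per query n-gram is a concern in the real scan, the vector could be hoisted out of the loop and cleared per iteration; that trade-off is left to the author.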

cubic-dev-ai · 2026-06-04T05:10:33Z

+{
+    const int n_chunks = (int)forced.size();
+    const int ngram    = cfg.ngram;
+    const int search_end = std::max(0, body_end - ngram);


P2: Anchor scan incorrectly searches one position when body_end < ngram, causing false-positive forced chunks from non-body tokens.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 27:

<comment>Anchor scan incorrectly searches one position when `body_end < ngram`, causing false-positive forced chunks from non-body tokens.</comment>

<file context>
@@ -0,0 +1,169 @@
+{
+    const int n_chunks = (int)forced.size();
+    const int ngram    = cfg.ngram;
+    const int search_end = std::max(0, body_end - ngram);
+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
</file context>

Suggested change

-    const int search_end = std::max(0, body_end - ngram);
+    if (body_end < ngram) return;
+    const int search_end = body_end - ngram;
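To make the off-by-one concrete, the following sketch counts how many start positions a loop of the form `for (p = 0; p <= search_end; ++p)` visits under each variant. The two helper names are hypothetical and exist only for this illustration; the arithmetic mirrors the quoted line and the suggested change.

```cpp
#include <algorithm>
#include <cassert>

// Start positions visited when search_end is clamped to zero, as in the
// quoted line. With body_end < ngram this still yields one position (p = 0),
// whose n-gram window extends past body_end.
static int positions_clamped(int body_end, int ngram) {
    const int search_end = std::max(0, body_end - ngram);
    return search_end + 1;
}

// Start positions visited with the early return from the suggested change.
// A body shorter than one n-gram is never scanned.
static int positions_guarded(int body_end, int ngram) {
    if (body_end < ngram) return 0;
    return body_end - ngram + 1;
}
```

For `body_end = 2` and `ngram = 4`, the clamped form visits one position and the guarded form visits none; for any `body_end >= ngram` the two agree.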

cubic-dev-ai · 2026-06-04T05:10:33Z

+        const int nv = env_int("PFLASH_COMPRESS_ANCHOR_NGRAM", -1);
+        const int lv = env_int("DFLASH_COMPRESS_ANCHOR_NGRAM", -1);
+        if (nv >= 0) return nv;
+        if (lv >= 0) { fprintf(stderr, "[WARN] DFLASH_COMPRESS_ANCHOR_NGRAM deprecated, use PFLASH_COMPRESS_ANCHOR_NGRAM\n"); return lv; }


P2: Deprecation warning is logged on every request in a hot path; it should be emitted once, not per compression call.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At server/src/qwen3/qwen3_drafter.cpp, line 132:

<comment>Deprecation warning is logged on every request in a hot path; it should be emitted once, not per compression call.</comment>

<file context>
@@ -64,11 +65,122 @@ static int env_int(const char * name, int fallback) {
+        const int nv = env_int("PFLASH_COMPRESS_ANCHOR_NGRAM", -1);
+        const int lv = env_int("DFLASH_COMPRESS_ANCHOR_NGRAM", -1);
+        if (nv >= 0) return nv;
+        if (lv >= 0) { fprintf(stderr, "[WARN] DFLASH_COMPRESS_ANCHOR_NGRAM deprecated, use PFLASH_COMPRESS_ANCHOR_NGRAM\n"); return lv; }
+        return 4;
+    }();
</file context>
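A minimal sketch of emitting the warning once per process, assuming an atomic flag is acceptable in this code path. The `env_int` body below is a hypothetical re-creation (only its signature appears in the diff), and `anchor_ngram_from_env`, `warn_dflash_ngram_once`, and `g_warn_count` are illustrative names, not identifiers from the PR. The default of 4 is taken from the quoted file context.

```cpp
#include <atomic>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical re-creation of env_int: parse an integer environment variable,
// returning `fallback` when it is unset, empty, or not a clean integer.
static int env_int(const char* name, int fallback) {
    const char* v = std::getenv(name);
    if (!v || !*v) return fallback;
    char* end = nullptr;
    const long parsed = std::strtol(v, &end, 10);
    return (end && *end == '\0') ? static_cast<int>(parsed) : fallback;
}

// Demonstration-only counter so the once-only behavior can be checked.
static int g_warn_count = 0;

// Prints the deprecation notice the first time it is called and stays silent
// afterwards. exchange() returns the previous value, so exactly one caller
// observes `false`, even if several threads race here.
static void warn_dflash_ngram_once() {
    static std::atomic<bool> warned{false};
    if (!warned.exchange(true)) {
        std::fprintf(stderr,
                     "[WARN] DFLASH_COMPRESS_ANCHOR_NGRAM deprecated, "
                     "use PFLASH_COMPRESS_ANCHOR_NGRAM\n");
        ++g_warn_count;
    }
}

// Same precedence as the quoted snippet: new name, then legacy name, then 4.
static int anchor_ngram_from_env() {
    const int nv = env_int("PFLASH_COMPRESS_ANCHOR_NGRAM", -1);
    if (nv >= 0) return nv;
    const int lv = env_int("DFLASH_COMPRESS_ANCHOR_NGRAM", -1);
    if (lv >= 0) {
        warn_dflash_ngram_once();
        return lv;
    }
    return 4;
}
```

An alternative is to resolve the value once at startup and cache it, which also avoids repeated `getenv` calls on the hot path; whether the drafter needs to re-read the environment per request is not stated in the thread.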

cubic-dev-ai · 2026-06-04T05:10:33Z

+
+Empirical result: F1=0.628 on LongBench HotpotQA at ee7 + keep=0.15
+(vs uncompressed F1=0.697). This is the ceiling for attention-score-based
+prefill compression on this task; see bench/2026-05-25_longbench_hotpotqa/.


P3: Documentation references bench/2026-05-25_longbench_hotpotqa/ but the bench/ directory does not exist anywhere in this codebase, so the reference is unresolvable.

Prompt for AI agents

Check if this issue is valid; if so, understand the root cause and fix it. At docs/anchor-transitive.md, line 13:

<comment>Documentation references `bench/2026-05-25_longbench_hotpotqa/` but the `bench/` directory does not exist anywhere in this codebase, so the reference is unresolvable.</comment>

<file context>
@@ -0,0 +1,15 @@
+
+Empirical result: F1=0.628 on LongBench HotpotQA at ee7 + keep=0.15
+(vs uncompressed F1=0.697). This is the ceiling for attention-score-based
+prefill compression on this task; see bench/2026-05-25_longbench_hotpotqa/.
+
+On by default. Disable via PFLASH_COMPRESS_ANCHOR_TRANSITIVE=0.
</file context>

easel · 2026-06-04T15:35:02Z

Closing this PR — it duplicates open PR #274 (feat/pflash-drafter-ee7) by @dusterbloom.

The substantive content here — anchor_scan.{cpp,h}, the qwen3_drafter changes, test_anchor_transitive.cpp, anchor-transitive.md, and pflash-compress-cfg.md — is byte-identical to what's already in #274. The original commit d4546a51 (dusterbloom, 2026-05-27) predates my squash 36979404 (2026-05-31), so attribution belongs upstream.

Reviewers: please land #274 as the canonical source.

Apologies for the inadvertent re-attribution — this should never have been opened against the original work. If any small subtractive cleanup is warranted (e.g., removing the regime_router wiring), I'll open that as a separate, narrowly-scoped PR with honest authorship after #274 lands.


easel force-pushed the feat/server-pflash-drafter branch 3 times, most recently from 6b25405 to 0b54605 Compare June 4, 2026 03:37

This was referenced Jun 4, 2026

feat(luce-bench): in-tree bench harness + multi-turn agent_recorded + LLM judge #337

Open

feat(server): plain-text call:<verb>{} tool parsing (Gemma4) #340

Merged

feat(server): card-driven thinking control + reasoning_content channel + /props schema-4 #341

Open

easel force-pushed the feat/server-pflash-drafter branch from 0b54605 to bfbe07a Compare June 4, 2026 05:03

easel marked this pull request as ready for review June 4, 2026 05:03

cubic-dev-ai Bot reviewed Jun 4, 2026


easel mentioned this pull request Jun 4, 2026

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

easel closed this Jun 4, 2026

easel deleted the feat/server-pflash-drafter branch June 4, 2026 15:35

easel wants to merge 1 commit into Luce-Org:main from easel:feat/server-pflash-drafter
