feat(pflash): ee7 early-exit drafter + anchor-transitive cascade + regime router#338

Closed
easel wants to merge 1 commit into Luce-Org:main from easel:feat/server-pflash-drafter

@easel easel commented Jun 3, 2026

Summary

Adds the anchor-transitive cascade for the qwen3 pflash path (default-on, addresses the 64K NIAH cliff) along with drafter changes that thread the new anchor scan into the existing pflash drafter. Includes a focused C++ unit test for the anchor-transitive behavior and two design/configuration docs.

Files

  • server/src/qwen3/anchor_scan.{h,cpp} - new anchor scan implementation
  • server/src/qwen3/qwen3_drafter.{h,cpp} - drafter integration of anchor scan (~200 net LOC of changes)
  • server/src/server/http_server.h, server/src/server/server_main.cpp - minor wiring touch-ups
  • server/test/test_anchor_transitive.cpp - unit test for anchor-transitive cascade
  • docs/anchor-transitive.md, docs/pflash-compress-cfg.md - design / config notes

The earlier-claimed ee7 early-exit drafter, regime router, adaptive_keep_ratio, score_range, and the other four C++ unit tests are NOT in this PR's diff after the Phase 1-3 cleanup. The full A/B and prefix-cache experiment writeups previously listed are likewise not part of this scope.

Dependencies

None. After the Phase 1 cleanup this PR is independent and can land on its own (no remaining references to sibling-PR headers such as c2_gate.h or shared scaffolding).

Note: like the other server-layer split PRs, this branch cannot be built standalone because CMake still cross-references files that live on sibling branches; CI build validation has to happen on the integration branch.

Test plan

  • test_anchor_transitive passes locally and in CI on the integration branch
  • Qwen3.6 pflash run does not regress on the 64K NIAH point that motivated the anchor-transitive default-on change
  • Drafter warm-path behavior unchanged for the non-anchor regime (smoke test via a short pflash request)
  • Note: server PRs in this split cannot be validated standalone due to CMake cross-references; full build/test must run on the integration branch

🤖 Generated with Claude Code

…gime router

## What

User-facing pflash feature that lands four related pieces in the qwen3
backend and server:

- ee7 early-exit drafter with score-range gate and tail-capture guard
  (server/src/qwen3/qwen3_drafter.{h,cpp},
  server/src/common/score_range.h).
- Anchor-transitive cascade default-on, fixing the 64K NIAH cliff
  (server/src/qwen3/anchor_scan.{h,cpp}).
- Regime router scaffolding for swapping pflash composition by regime
  (server/src/common/regime_router.h).
- Adaptive composition via per-request fa_window
  (server/src/server/adaptive_keep_ratio.h).

Ships test_anchor_transitive.cpp with five C++ unit tests (early-exit
score-range gate, tail-capture guard, warm-path regression,
anchor-transitive cascade, regime router) and the pflash composition
/ anchor / drafter design docs plus the qwen3.6 pflash A/B and
prefix-cache writeups.

## Why

ee7 closes the long-context quality cliff on qwen3.6 at 64K and adds
the regime/adaptive-composition hooks needed for per-request tuning.

## Dependencies

None - post-Phase-1 cleanup; forward-only plumbing was dropped.
@easel easel force-pushed the feat/server-pflash-drafter branch from 0b54605 to bfbe07a June 4, 2026 05:03
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
…ps schema-4

## What

User-facing thinking-control API across the HTTP server surface:

- chat_template prefills a closed <think> block when thinking is off
  (Qwen3-gated) so the model skips the reasoning preamble without
  losing the assistant turn.
- http_server bumps /props schema 2 -> 4, adding build / model.target /
  model.draft / host blocks for client introspection.
- server_main adds --debug-thinking-logits and --think-soft-close-*
  flags plus image/host-info loaders for card-driven boot.
- sse_emitter routes Qwen3.6/Laguna think-mode output to the
  reasoning_content channel so reasoning never leaks into the
  user-visible content stream (Pattern-B call-verb streaming hook).
- Ships the model-card _schema.json, qwen3.6-27b and laguna-xs.2 cards,
  the /props OpenAPI doc, updated thinking-budget spec, and the
  thinking-control protocol/mechanism experiments.
- test_server_unit gets matching coverage (~1100 lines) for prefill,
  /props schema-4, and reasoning_content routing.

## Why

Gives clients a single, card-driven API to control thinking budgets,
soft-close behavior, and reasoning visibility - and an introspectable
/props surface to discover what the server supports.

## Dependencies

- Luce-Org#336 (server-layer-split): CMake/build references
- Luce-Org#338 (server-pflash-drafter): check_admission uses pflash_keep_ratio +
  pflash_on contracts
- Luce-Org#340 (server-call-verb): sse_emitter Pattern-B call-verb streaming hooks
  rely on tool_parser changes from Luce-Org#340
@easel easel marked this pull request as ready for review June 4, 2026 05:03

@cubic-dev-ai cubic-dev-ai Bot left a comment

4 issues found across 9 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/qwen3/anchor_scan.cpp">

<violation number="1" location="server/src/qwen3/anchor_scan.cpp:27">
P2: Anchor scan incorrectly searches one position when `body_end < ngram`, causing false-positive forced chunks from non-body tokens.</violation>

<violation number="2" location="server/src/qwen3/anchor_scan.cpp:31">
P1: `max_anchor_hits` is effectively capped at 8 due to fixed `hit_pos[8]`, so long-context configs (16/32) do not work as intended.</violation>
</file>

<file name="server/src/qwen3/qwen3_drafter.cpp">

<violation number="1" location="server/src/qwen3/qwen3_drafter.cpp:132">
P2: Deprecation warning is logged on every request in a hot path; it should be emitted once, not per compression call.</violation>
</file>

<file name="docs/anchor-transitive.md">

<violation number="1" location="docs/anchor-transitive.md:13">
P3: Documentation references `bench/2026-05-25_longbench_hotpotqa/` but the `bench/` directory does not exist anywhere in this codebase, so the reference is unresolvable.</violation>
</file>

    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
        int hits = 0;
        int hit_pos[8];

P1: max_anchor_hits is effectively capped at 8 due to fixed hit_pos[8], so long-context configs (16/32) do not work as intended.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 31:

<comment>`max_anchor_hits` is effectively capped at 8 due to fixed `hit_pos[8]`, so long-context configs (16/32) do not work as intended.</comment>

<file context>
@@ -0,0 +1,169 @@
+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
+        int hits = 0;
+        int hit_pos[8];
+        for (int p = 0; p <= search_end && hits <= cfg.max_anchor_hits; ++p) {
+            bool same = true;
</file context>
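
One possible shape for a fix to the P1 finding, as a sketch under assumptions: size the hit buffer from the configured limit rather than using a fixed `int hit_pos[8]`. The helper name `find_anchor_hits` and its signature are hypothetical and not taken from the PR; only `hit_pos`, `ngram`, `body_end`, and the `max_anchor_hits` limit come from the quoted context. The sketch stops at exactly `max_hits` entries and does not try to reproduce the `<=` condition of the quoted loop.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not from the PR): collect up to max_hits start
// positions p in body[0, body_end) where body[p, p + ngram) equals
// query[qi, qi + ngram). The buffer grows with max_hits, so limits
// such as 16 or 32 are honored instead of being capped at 8.
std::vector<int> find_anchor_hits(const std::vector<int>& body, int body_end,
                                  const std::vector<int>& query, int qi,
                                  int ngram, int max_hits) {
    std::vector<int> hit_pos;
    if (ngram <= 0 || max_hits <= 0) return hit_pos;
    if (qi < 0 || qi + ngram > (int)query.size()) return hit_pos;

    // Never read past the end of the token buffer.
    body_end = std::min(body_end, (int)body.size());
    // A body shorter than the n-gram cannot contain a match.
    if (body_end < ngram) return hit_pos;

    hit_pos.reserve(max_hits);
    const int search_end = body_end - ngram;
    for (int p = 0; p <= search_end && (int)hit_pos.size() < max_hits; ++p) {
        if (std::equal(query.begin() + qi, query.begin() + qi + ngram,
                       body.begin() + p)) {
            hit_pos.push_back(p);
        }
    }
    return hit_pos;
}
```

A `std::vector` costs one allocation per scanned n-gram; if that matters on this path, a buffer allocated once outside the loop and cleared per iteration would serve the same purpose.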

{
    const int n_chunks = (int)forced.size();
    const int ngram    = cfg.ngram;
    const int search_end = std::max(0, body_end - ngram);

P2: Anchor scan incorrectly searches one position when body_end < ngram, causing false-positive forced chunks from non-body tokens.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/anchor_scan.cpp, line 27:

<comment>Anchor scan incorrectly searches one position when `body_end < ngram`, causing false-positive forced chunks from non-body tokens.</comment>

<file context>
@@ -0,0 +1,169 @@
+{
+    const int n_chunks = (int)forced.size();
+    const int ngram    = cfg.ngram;
+    const int search_end = std::max(0, body_end - ngram);
+
+    for (int qi = 0; qi + ngram <= (int)query_pool.size(); ++qi) {
</file context>
Suggested change
-    const int search_end = std::max(0, body_end - ngram);
+    if (body_end < ngram) return;
+    const int search_end = body_end - ngram;
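
To make the off-by-one concrete, here is a small sketch that counts how many start positions a loop of the form `for (p = 0; p <= search_end; ++p)` visits under each variant. The two function names are hypothetical and exist only for this illustration; the arithmetic mirrors the original line and the suggested change.

```cpp
#include <algorithm>

// Positions visited when search_end is clamped with std::max,
// as in the original line. For body_end < ngram this still yields
// one position (p = 0), although no full n-gram fits in the body.
int positions_with_clamp(int body_end, int ngram) {
    const int search_end = std::max(0, body_end - ngram);
    return search_end + 1;
}

// Positions visited with the early return from the suggested change.
int positions_with_guard(int body_end, int ngram) {
    if (body_end < ngram) return 0;  // nothing to search
    const int search_end = body_end - ngram;
    return search_end + 1;
}
```

With `body_end = 2` and `ngram = 4`, the clamped variant visits one position and the guarded variant visits none; for `body_end >= ngram` the two agree.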

        const int nv = env_int("PFLASH_COMPRESS_ANCHOR_NGRAM", -1);
        const int lv = env_int("DFLASH_COMPRESS_ANCHOR_NGRAM", -1);
        if (nv >= 0) return nv;
        if (lv >= 0) { fprintf(stderr, "[WARN] DFLASH_COMPRESS_ANCHOR_NGRAM deprecated, use PFLASH_COMPRESS_ANCHOR_NGRAM\n"); return lv; }

P2: Deprecation warning is logged on every request in a hot path; it should be emitted once, not per compression call.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen3/qwen3_drafter.cpp, line 132:

<comment>Deprecation warning is logged on every request in a hot path; it should be emitted once, not per compression call.</comment>

<file context>
@@ -64,11 +65,122 @@ static int env_int(const char * name, int fallback) {
+        const int nv = env_int("PFLASH_COMPRESS_ANCHOR_NGRAM", -1);
+        const int lv = env_int("DFLASH_COMPRESS_ANCHOR_NGRAM", -1);
+        if (nv >= 0) return nv;
+        if (lv >= 0) { fprintf(stderr, "[WARN] DFLASH_COMPRESS_ANCHOR_NGRAM deprecated, use PFLASH_COMPRESS_ANCHOR_NGRAM\n"); return lv; }
+        return 4;
+    }();
</file context>
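
A minimal sketch of the warn-once behavior this comment asks for, assuming the lookup really is reached on every compression call. `env_int_sketch` is a hypothetical stand-in for the PR's `env_int` (whose body is not shown here), and `g_deprecation_warnings` is a counter added only so the behavior can be checked; the environment variable names and the default of 4 come from the quoted context.

```cpp
#include <atomic>
#include <cstdio>
#include <cstdlib>

// Illustration-only counter of how many warnings were actually printed.
static int g_deprecation_warnings = 0;

// Hypothetical stand-in for env_int: parse an integer from the
// environment, or return fallback when the variable is unset or empty.
static int env_int_sketch(const char* name, int fallback) {
    const char* v = std::getenv(name);
    if (!v || !*v) return fallback;
    return std::atoi(v);
}

int anchor_ngram_from_env() {
    const int nv = env_int_sketch("PFLASH_COMPRESS_ANCHOR_NGRAM", -1);
    const int lv = env_int_sketch("DFLASH_COMPRESS_ANCHOR_NGRAM", -1);
    if (nv >= 0) return nv;
    if (lv >= 0) {
        // exchange() returns the previous value, so only the first
        // caller sees false and prints the warning.
        static std::atomic<bool> warned{false};
        if (!warned.exchange(true)) {
            std::fprintf(stderr,
                "[WARN] DFLASH_COMPRESS_ANCHOR_NGRAM deprecated, "
                "use PFLASH_COMPRESS_ANCHOR_NGRAM\n");
            ++g_deprecation_warnings;
        }
        return lv;
    }
    return 4;  // default taken from the quoted context
}
```

An alternative with the same effect is to resolve the value once at startup and cache it, which also avoids repeated `getenv` calls on the request path.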

Comment thread docs/anchor-transitive.md

Empirical result: F1=0.628 on LongBench HotpotQA at ee7 + keep=0.15
(vs uncompressed F1=0.697). This is the ceiling for attention-score-based
prefill compression on this task; see bench/2026-05-25_longbench_hotpotqa/.

P3: Documentation references bench/2026-05-25_longbench_hotpotqa/ but the bench/ directory does not exist anywhere in this codebase, so the reference is unresolvable.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At docs/anchor-transitive.md, line 13:

<comment>Documentation references `bench/2026-05-25_longbench_hotpotqa/` but the `bench/` directory does not exist anywhere in this codebase, so the reference is unresolvable.</comment>

<file context>
@@ -0,0 +1,15 @@
+
+Empirical result: F1=0.628 on LongBench HotpotQA at ee7 + keep=0.15
+(vs uncompressed F1=0.697). This is the ceiling for attention-score-based
+prefill compression on this task; see bench/2026-05-25_longbench_hotpotqa/.
+
+On by default. Disable via PFLASH_COMPRESS_ANCHOR_TRANSITIVE=0.
</file context>
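
For readers unfamiliar with the toggle in the quoted doc, a default-on switch of this kind could be read roughly as follows. This is a sketch only: the function name is hypothetical, and treating the literal string `0` as the sole disabling value is an assumption, since the PR's actual parsing is not shown in this thread.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical reader for PFLASH_COMPRESS_ANCHOR_TRANSITIVE.
// Assumption: unset or empty means enabled, and only "0" disables.
bool anchor_transitive_enabled() {
    const char* v = std::getenv("PFLASH_COMPRESS_ANCHOR_TRANSITIVE");
    if (!v || !*v) return true;       // on by default
    return std::strcmp(v, "0") != 0;  // "0" turns the cascade off
}
```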

easel commented Jun 4, 2026

Closing this PR — it duplicates open PR #274 (feat/pflash-drafter-ee7) by @dusterbloom.

The substantive content here — anchor_scan.{cpp,h}, the qwen3_drafter changes, test_anchor_transitive.cpp, anchor-transitive.md, and pflash-compress-cfg.md — is byte-identical to what's already in #274. The original commit d4546a51 (dusterbloom, 2026-05-27) predates my squash 36979404 (2026-05-31), so attribution belongs upstream.

Reviewers: please land #274 as the canonical source.

Apologies for the inadvertent re-attribution — this should never have been opened against the original work. If any small subtractive cleanup is warranted (e.g., removing the regime_router wiring), I'll open that as a separate, narrowly-scoped PR with honest authorship after #274 lands.

@easel easel closed this Jun 4, 2026
@easel easel deleted the feat/server-pflash-drafter branch June 4, 2026 15:35
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
…ps schema-4
