feat(server): soft-close thinking termination (qwen35 + gemma4) #339
Open
easel wants to merge 2 commits into
3 issues found across 6 files
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
…probe, GPU-argmax logits)
Three P1 fixes from cubic AI review on PR Luce-Org#339:
1) qwen35 + gemma4: soft-close set close_inject_pos=1 after firing, which caused maybe_force_close's same-step continuation branch to overwrite tok=inject[0] with close_token_ids[1], skipping the first close token in multi-token close sequences. Set it to 0 so the continuation reads close_token_ids[0] (no-op rewrite) and advances to 1 for the next step.
2) gemma4: maybe_soft_close used close_token_ids.front() (the INJECT token) for the logit comparator instead of soft_close_probe_token() (the PROBE marker), and skipped the soft_close_min_tokens floor. Mirror qwen35: peek the probe, write the inject, gate on the floor.
3) qwen35 GPU-argmax path: maybe_soft_close was called with stale logits because the GPU-argmax branch only reads the 4-byte argmax token id and leaves logits_buf untouched. Fetch logits from GPU when soft-close (or its diagnostic) is enabled. Zero-cost when off.
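The first fix is easier to see in a toy model. The sketch below is not the backend code: `simulate_close` and its parameters are hypothetical, and it assumes only what the commit message states, namely that the firing step writes the first inject token and that a same-step continuation then rewrites the token from `close_token_ids[close_inject_pos]` and advances the position.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy model of the close-injection cursor (not the real backend).
// Returns the tokens that would be emitted once soft-close fires, given the
// value assigned to close_inject_pos right after firing.
std::vector<int> simulate_close(const std::vector<int>& close_token_ids,
                                std::size_t pos_after_fire) {
    std::vector<int> emitted;
    if (close_token_ids.empty()) return emitted;

    std::size_t close_inject_pos = pos_after_fire;

    // Firing step: soft-close writes the first inject token...
    int tok = close_token_ids[0];
    // ...then the same-step continuation overwrites tok from the cursor.
    if (close_inject_pos < close_token_ids.size()) {
        tok = close_token_ids[close_inject_pos++];
    }
    emitted.push_back(tok);

    // Later steps: keep draining the close sequence from the cursor.
    while (close_inject_pos < close_token_ids.size()) {
        emitted.push_back(close_token_ids[close_inject_pos++]);
    }
    return emitted;
}
```

Under these assumptions, a two-token close sequence started at position 1 emits only the second token, while starting at position 0 emits both in order. A single-token sequence is unaffected either way, which matches the commit's note that only multi-token close sequences were hit.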
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
…(qwen35 + gemma4)
Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.
- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
docs/experiments/.
server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).
Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
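The gating described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in qwen35_backend.cpp: `SoftCloseConfig`, `should_soft_close`, and the `>=` comparison are assumptions made for the example. It relies on the fact that a ratio of two softmax probabilities equals `exp(logit_a - logit_b)`, so no full softmax is needed for the peek.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for the soft-close fields on the hook.
struct SoftCloseConfig {
    int   probe_token;  // token whose probability is peeked (the probe marker)
    float min_ratio;    // 0.0 means soft-close is disabled
    int   min_tokens;   // floor on thinking tokens before soft-close may fire
};

// Returns true when prob[probe] / prob[chosen] clears min_ratio and the
// thinking block is at least min_tokens long. The softmax denominators
// cancel, so the ratio is exp(logit[probe] - logit[chosen]).
inline bool should_soft_close(const std::vector<float>& logits,
                              int chosen_token,
                              int thinking_tokens,
                              const SoftCloseConfig& cfg) {
    if (cfg.min_ratio <= 0.0f) return false;             // dial not engaged
    if (thinking_tokens < cfg.min_tokens) return false;  // floor not reached
    const std::size_t n = logits.size();
    if (cfg.probe_token < 0 || static_cast<std::size_t>(cfg.probe_token) >= n) return false;
    if (chosen_token < 0 || static_cast<std::size_t>(chosen_token) >= n) return false;

    const float ratio = std::exp(logits[cfg.probe_token] - logits[chosen_token]);
    return ratio >= cfg.min_ratio;  // assumed comparison; the real one may differ
}
```

When the function returns true, the caller would write the inject token rather than the probe token, which is the probe/inject split the commit describes. With `min_ratio` left at 0.0 the function returns before touching the logits.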
Summary
Adds soft-close thinking termination to the qwen35 and gemma4 backends: a logit-ratio peek (`prob[</think>] / prob[chosen]`) lets the AR loop voluntarily inject the close sequence before the hard budget cap fires. Extends `BudgetHook` with split probe-vs-inject token ids, `soft_close_min_ratio`, and a `soft_close_min_tokens` floor, and updates the thinking-budget spec to document the new `soft` taxonomy value and the operator/per-request dial. Backend-only; the matching CLI flags and request-envelope plumbing ship in #341.

Files
- `server/src/common/model_backend.h`: extended `BudgetHook` struct (probe ids, min_ratio, min_tokens, helpers, `soft_close::` namespace)
- `server/src/qwen35/qwen35_backend.{cpp,h}`: soft-close peek wired into the AR loop
- `server/src/gemma4/gemma4_backend.cpp`: gemma4 port of the same pattern
- `server/test/test_server_unit.cpp`: soft-close unit tests (+533 lines)
- `docs/specs/thinking-budget.md`: `soft` taxonomy value, `soft_close_min_ratio` dial, per-request override semantics

Dependencies
None. Soft-close is independent of the pflash/c2-gate stack. Prior references in `qwen35_backend.cpp` to `c2_gate.h`, `req.use_transitive`, and `req.fa_window_override` have been dropped (those land via PR #274's pflash compression path, not this branch).

Part of the PR #285 split
One of a series of tightly-scoped PRs splitting Luce-Org/lucebox-hub#285 into reviewable chunks.

Test plan
- `server/` builds standalone on this branch
- Run `test_server_unit` and confirm the new soft-close cases pass
- Run with `--think-soft-close-min-ratio` and a per-request override; verify the `stop_reason` taxonomy emits `soft` when the threshold clears before the hard cap
- Existing (`natural`/`hard`) behavior is unchanged when `soft_close_min_ratio = 0.0` (default)
- `soft_close_min_tokens` floor prevents premature firing on short thinking blocks

Risk
Low: soft-close is off by default (`min_ratio = 0.0`); existing hard-close and natural-close paths are unchanged when the dial is not engaged.