feat(server): soft-close thinking termination (qwen35 + gemma4) #339

Open: easel wants to merge 2 commits into Luce-Org:main from easel:feat/server-soft-close

easel (Collaborator) commented Jun 3, 2026

Summary

Adds soft-close thinking termination to qwen35 and gemma4 backends: a logit-ratio peek (prob[</think>] / prob[chosen]) lets the AR loop voluntarily inject the close sequence before the hard budget cap fires. Extends BudgetHook with split probe-vs-inject token ids, soft_close_min_ratio, and soft_close_min_tokens floor, and updates the thinking-budget spec to document the new soft taxonomy value and the operator/per-request dial. Backend-only; the matching CLI flags and request-envelope plumbing ship in #341.
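
The ratio in the summary does not need a full softmax: both probabilities share one normalizer, so prob[</think>] / prob[chosen] reduces to exp(logit[</think>] - logit[chosen]). A minimal sketch of that peek, assuming raw logits in a flat float vector; `soft_close_ratio` and its parameters are hypothetical names, not the functions in this PR:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's code: prob[probe] / prob[chosen] computed
// straight from raw logits. The softmax denominator is common to both
// probabilities, so it cancels and only the logit difference matters.
inline float soft_close_ratio(const std::vector<float>& logits,
                              int32_t probe_id, int32_t chosen_id) {
    const std::size_t n = logits.size();
    if (probe_id < 0 || chosen_id < 0 ||
        static_cast<std::size_t>(probe_id) >= n ||
        static_cast<std::size_t>(chosen_id) >= n) {
        return 0.0f;  // out-of-range ids never trigger a soft close
    }
    return std::exp(logits[probe_id] - logits[chosen_id]);
}
```

A caller would compare the result against soft_close_min_ratio; a ratio of 1.0 means the probe token is exactly as likely as the token the sampler picked.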

Files

  • server/src/common/model_backend.h — extended BudgetHook struct (probe ids, min_ratio, min_tokens, helpers, soft_close:: namespace)
  • server/src/qwen35/qwen35_backend.{cpp,h} — soft-close peek wired into the AR loop
  • server/src/gemma4/gemma4_backend.cpp — gemma4 port of the same pattern
  • server/test/test_server_unit.cpp — soft-close unit tests (+533 lines)
  • docs/specs/thinking-budget.md — soft taxonomy value, soft_close_min_ratio dial, per-request override semantics

Dependencies

None. Soft-close is independent of the pflash/c2-gate stack — prior references in qwen35_backend.cpp to c2_gate.h, req.use_transitive, and req.fa_window_override have been dropped (those land via PR #274's pflash compression path, not this branch).

Part of the PR #285 split

One of a series of tightly-scoped PRs splitting Luce-Org/lucebox-hub#285 into reviewable chunks.

Test plan

  • Confirm server/ builds standalone on this branch
  • Run test_server_unit and confirm the new soft-close cases pass
  • Stack with #341 (feat(server): card-driven thinking control + reasoning_content channel + /props schema-4) to exercise end-to-end with --think-soft-close-min-ratio and a per-request override; verify the stop_reason taxonomy emits soft when the threshold clears before the hard cap
  • Confirm legacy two-value (natural / hard) behavior is unchanged when soft_close_min_ratio = 0.0 (default)
  • Confirm soft_close_min_tokens floor prevents premature firing on short thinking blocks

Risk

Low — soft-close is off by default (min_ratio = 0.0); existing hard-close and natural-close paths are unchanged when the dial is not engaged.
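
The default-off behavior and the min_tokens floor can be pictured as a single gate in front of the peek. This is a sketch under assumptions: `SoftCloseDials` and `should_soft_close` are hypothetical names that only borrow the two dial names from this PR, and the real BudgetHook may order or combine the checks differently.

```cpp
#include <cstdint>

// Hypothetical mirror of the two dials named in this PR; the real BudgetHook
// in server/src/common/model_backend.h may be laid out differently.
struct SoftCloseDials {
    float   soft_close_min_ratio  = 0.0f;  // 0.0 leaves soft-close disabled
    int32_t soft_close_min_tokens = 0;     // thinking tokens required before firing
};

// Assumed gate: fire only when the dial is engaged, the thinking block has
// reached the floor, and the observed ratio clears the threshold.
inline bool should_soft_close(const SoftCloseDials& d, float ratio,
                              int32_t thinking_tokens) {
    if (d.soft_close_min_ratio <= 0.0f) return false;              // off by default
    if (thinking_tokens < d.soft_close_min_tokens) return false;   // floor not reached
    return ratio >= d.soft_close_min_ratio;
}
```

With the defaults the function always returns false, which is the property the last two test-plan items check from the outside.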

cubic-dev-ai (Bot, Contributor) left a comment

3 issues found across 6 files


Comment thread server/src/qwen35/qwen35_backend.cpp Outdated
Comment thread server/src/gemma4/gemma4_backend.cpp Outdated
Comment thread server/src/qwen35/qwen35_backend.cpp Outdated
easel added a commit to easel/lucebox-hub that referenced this pull request Jun 4, 2026
…probe, GPU-argmax logits)

Three P1 fixes from cubic AI review on PR Luce-Org#339:

1) qwen35 + gemma4: soft-close set close_inject_pos=1 after firing,
   which caused maybe_force_close's same-step continuation branch to
   overwrite tok=inject[0] with close_token_ids[1] — skipping the first
   close token in multi-token close sequences. Set it to 0 so the
   continuation reads close_token_ids[0] (no-op rewrite) and advances
   to 1 for the next step.

2) gemma4: maybe_soft_close used close_token_ids.front() (the INJECT
   token) for the logit comparator instead of soft_close_probe_token()
   (the PROBE marker), and skipped the soft_close_min_tokens floor.
   Mirror qwen35: peek the probe, write the inject, gate on the floor.

3) qwen35 GPU-argmax path: maybe_soft_close was called with stale
   logits because the GPU-argmax branch only reads the 4-byte argmax
   token id and leaves logits_buf untouched. Fetch logits from GPU
   when soft-close (or its diagnostic) is enabled. Zero-cost when off.
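
Fix 1 is easiest to see with a toy model of the inject cursor. The sketch below is an assumption-laden reconstruction from the commit message alone: `CloseState`, `fire_soft_close`, `continue_close`, and `emitted_close` are illustrative names, and the real maybe_force_close has more branches than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy model of the close-sequence cursor (hypothetical, not the backend code).
struct CloseState {
    std::vector<int32_t> close_token_ids;  // multi-token close sequence, non-empty
    bool        closing = false;
    std::size_t close_inject_pos = 0;
};

// Soft-close fires: this step's token becomes the first close token.
// start_pos is the cursor value left behind (1 = old behavior, 0 = fixed).
inline int32_t fire_soft_close(CloseState& s, std::size_t start_pos) {
    s.closing = true;
    s.close_inject_pos = start_pos;
    return s.close_token_ids[0];
}

// Continuation branch, run on every step including the one that fired:
// overwrite tok with the token at the cursor, then advance the cursor.
inline int32_t continue_close(CloseState& s, int32_t tok) {
    if (!s.closing) return tok;
    if (s.close_inject_pos < s.close_token_ids.size()) {
        tok = s.close_token_ids[s.close_inject_pos++];
    }
    if (s.close_inject_pos >= s.close_token_ids.size()) {
        s.closing = false;  // sequence fully emitted
    }
    return tok;
}

// Collect the tokens actually emitted for a given starting cursor.
inline std::vector<int32_t> emitted_close(std::vector<int32_t> ids,
                                          std::size_t start_pos) {
    CloseState s{std::move(ids)};
    std::vector<int32_t> out;
    int32_t tok = fire_soft_close(s, start_pos);
    out.push_back(continue_close(s, tok));  // same-step continuation
    while (s.closing) {
        out.push_back(continue_close(s, /*sampled token, ignored*/ -1));
    }
    return out;
}
```

In this model a three-token sequence {10, 11, 12} comes out as {11, 12} when the cursor starts at 1 and as {10, 11, 12} when it starts at 0, while single-token sequences are unaffected either way.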
easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026
easel added 2 commits June 5, 2026 16:02
Backend mechanism for soft-close thinking termination via a logit-ratio
peek with split probe/inject ids, a min_tokens floor, and max_tokens
treated as a response-only budget while thinking is active.
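
One possible reading of "response-only budget" is that thinking tokens are charged to their own cap and never to max_tokens. The sketch below illustrates that reading only; `TokenBudget` and its fields are hypothetical, and the backend's actual accounting may differ.

```cpp
#include <cstdint>

// Hypothetical accounting sketch; names are illustrative, not the backend's.
struct TokenBudget {
    int32_t max_tokens;       // caps response tokens only (assumed reading)
    int32_t thinking_budget;  // separate hard cap on thinking tokens
    int32_t response_used = 0;
    int32_t thinking_used = 0;

    // Charge one generated token to whichever budget applies.
    void charge(bool in_thinking) {
        if (in_thinking) {
            ++thinking_used;
        } else {
            ++response_used;
        }
    }
    bool response_exhausted() const { return response_used >= max_tokens; }
    bool thinking_exhausted() const { return thinking_used >= thinking_budget; }
};
```

Under this accounting a long reasoning trace can hit the thinking cap, or be soft-closed earlier, without shrinking the space left for the answer.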

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the
  probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under
  docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control
API PR (Luce-Org#341).

Lets a model finish its reasoning gracefully rather than being hard-cut
when a thinking budget is hit, which preserves answer quality on long
reasoning traces.
…probe, GPU-argmax logits)

@easel easel force-pushed the feat/server-soft-close branch from 53ede18 to 10c6e88 Compare June 5, 2026 20:03