feat(server): soft-close thinking termination (qwen35 + gemma4) by easel · Pull Request #339 · Luce-Org/lucebox-hub

easel · 2026-06-03T21:40:53Z

Summary

Adds soft-close thinking termination to qwen35 and gemma4 backends: a logit-ratio peek (prob[</think>] / prob[chosen]) lets the AR loop voluntarily inject the close sequence before the hard budget cap fires. Extends BudgetHook with split probe-vs-inject token ids, soft_close_min_ratio, and soft_close_min_tokens floor, and updates the thinking-budget spec to document the new soft taxonomy value and the operator/per-request dial. Backend-only; the matching CLI flags and request-envelope plumbing ship in #341.
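The ratio peek described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PR's code: `SoftCloseConfig`, `close_ratio`, and `should_soft_close` are hypothetical names standing in for the fields and helpers added to `BudgetHook`, and treating "chosen token is already the probe" as a natural close is an assumption. Because both probabilities share the same softmax normalizer, the ratio reduces to `exp(logit[probe] - logit[chosen])`, so no full softmax is needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the soft-close fields the PR adds to BudgetHook.
struct SoftCloseConfig {
    int   probe_token_id = -1;   // token whose probability is peeked (e.g. </think>)
    float min_ratio      = 0.0f; // 0.0 keeps soft-close disabled (the stated default)
    int   min_tokens     = 0;    // floor on thinking tokens before soft-close may fire
};

// prob[probe] / prob[chosen] under softmax; the normalizer cancels out.
inline float close_ratio(const std::vector<float>& logits, int probe, int chosen) {
    assert(probe >= 0 && static_cast<std::size_t>(probe) < logits.size());
    assert(chosen >= 0 && static_cast<std::size_t>(chosen) < logits.size());
    return std::exp(logits[probe] - logits[chosen]);
}

// Returns true when the loop should inject the close sequence voluntarily.
inline bool should_soft_close(const SoftCloseConfig& cfg,
                              const std::vector<float>& logits,
                              int chosen,
                              int thinking_tokens) {
    if (cfg.min_ratio <= 0.0f) return false;             // dial not engaged
    if (cfg.probe_token_id < 0) return false;            // no probe configured
    if (thinking_tokens < cfg.min_tokens) return false;  // floor not reached yet
    if (chosen == cfg.probe_token_id) return false;      // assumed: model closes naturally
    return close_ratio(logits, cfg.probe_token_id, chosen) >= cfg.min_ratio;
}
```

With logits `{0, 1, 2}`, probe id 1 and chosen id 2, the ratio is `exp(-1)`, roughly 0.37, so a `min_ratio` of 0.3 would fire once the token floor is met while 0.5 would not.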

Files

server/src/common/model_backend.h — extended BudgetHook struct (probe ids, min_ratio, min_tokens, helpers, soft_close:: namespace)
server/src/qwen35/qwen35_backend.{cpp,h} — soft-close peek wired into the AR loop
server/src/gemma4/gemma4_backend.cpp — gemma4 port of the same pattern
server/test/test_server_unit.cpp — soft-close unit tests (+533 lines)
docs/specs/thinking-budget.md — soft taxonomy value, soft_close_min_ratio dial, per-request override semantics

Dependencies

None. Soft-close is independent of the pflash/c2-gate stack — prior references in qwen35_backend.cpp to c2_gate.h, req.use_transitive, and req.fa_window_override have been dropped (those land via PR #274's pflash compression path, not this branch).

Part of the PR #285 split

One of a series of tightly-scoped PRs splitting Luce-Org/lucebox-hub#285 into reviewable chunks.

Test plan

Confirm server/ builds standalone on this branch
Run test_server_unit and confirm the new soft-close cases pass
Stack with #341 (feat(server): card-driven thinking control + reasoning_content channel + /props schema-4) to exercise end-to-end with --think-soft-close-min-ratio and a per-request override; verify the stop_reason taxonomy emits soft when the ratio threshold clears before the hard cap
Confirm legacy two-value (natural / hard) behavior is unchanged when soft_close_min_ratio = 0.0 (default)
Confirm soft_close_min_tokens floor prevents premature firing on short thinking blocks
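The taxonomy checks in the test plan can be pictured with a small classifier. This is only a sketch: `StopReason`, `ThinkingOutcome`, and `classify` are hypothetical names, and the only details taken from the PR text are the three values (natural, soft, hard) and the fact that soft can never be produced while the dial is off.

```cpp
#include <cassert>
#include <string>

// The three stop_reason values named in the PR description.
enum class StopReason { Natural, Soft, Hard };

inline std::string stop_reason_name(StopReason r) {
    switch (r) {
        case StopReason::Natural: return "natural";
        case StopReason::Soft:    return "soft";
        case StopReason::Hard:    return "hard";
    }
    return "unknown";
}

// Hypothetical summary of how a thinking block ended.
struct ThinkingOutcome {
    bool soft_close_fired; // ratio threshold cleared and the close was injected
    bool hard_cap_hit;     // budget cap forced the close
};

inline StopReason classify(const ThinkingOutcome& o) {
    // Assumed invariant: soft-close fires before the cap, so both cannot hold.
    assert(!(o.soft_close_fired && o.hard_cap_hit));
    if (o.soft_close_fired) return StopReason::Soft;
    if (o.hard_cap_hit)     return StopReason::Hard;
    return StopReason::Natural; // the model emitted the close itself
}
```

When `soft_close_min_ratio` is 0.0, `soft_close_fired` is never set, so only the legacy natural and hard values can appear.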

Risk

Low — soft-close is off by default (min_ratio = 0.0); existing hard-close and natural-close paths are unchanged when the dial is not engaged.

cubic-dev-ai

3 issues found across 6 files


…probe, GPU-argmax logits)

Three P1 fixes from cubic AI review on PR Luce-Org#339:

1) qwen35 + gemma4: soft-close set close_inject_pos=1 after firing, which caused maybe_force_close's same-step continuation branch to overwrite tok=inject[0] with close_token_ids[1], skipping the first close token in multi-token close sequences. Set it to 0 so the continuation reads close_token_ids[0] (no-op rewrite) and advances to 1 for the next step.

2) gemma4: maybe_soft_close used close_token_ids.front() (the INJECT token) for the logit comparator instead of soft_close_probe_token() (the PROBE marker), and skipped the soft_close_min_tokens floor. Mirror qwen35: peek the probe, write the inject, gate on the floor.

3) qwen35 GPU-argmax path: maybe_soft_close was called with stale logits because the GPU-argmax branch only reads the 4-byte argmax token id and leaves logits_buf untouched. Fetch logits from GPU when soft-close (or its diagnostic) is enabled. Zero-cost when off.
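The first fix is easy to see in a toy model of the injection cursor. The sketch below is not the backend code: `CloseState`, `continue_injection`, and `emit_after_soft_close` are hypothetical, and the continuation branch is assumed to overwrite the current token with `close_token_ids[close_inject_pos]` and then advance the cursor, as the commit message describes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy state: a multi-token close sequence plus the injection cursor.
struct CloseState {
    std::vector<int> close_token_ids;
    int close_inject_pos;
};

// Assumed continuation branch: overwrite tok with the token at the cursor,
// then advance. Does nothing once the cursor is past the end.
inline void continue_injection(CloseState& st, int& tok) {
    const int n = static_cast<int>(st.close_token_ids.size());
    if (st.close_inject_pos < 0 || st.close_inject_pos >= n) return;
    tok = st.close_token_ids[st.close_inject_pos];
    ++st.close_inject_pos;
}

// Simulates the tokens emitted after soft-close fires, given the cursor value
// that soft-close leaves behind (1 before the fix, 0 after it).
inline std::vector<int> emit_after_soft_close(std::vector<int> close_ids, int start_pos) {
    assert(!close_ids.empty());
    CloseState st{std::move(close_ids), start_pos};
    const int n = static_cast<int>(st.close_token_ids.size());
    std::vector<int> out;
    int tok = st.close_token_ids[0]; // soft-close writes inject[0] into tok
    while (true) {
        continue_injection(st, tok); // runs in the same step, then once per step
        out.push_back(tok);
        if (st.close_inject_pos >= n) break;
    }
    return out;
}
```

For a close sequence `{10, 11, 12}`, starting the cursor at 1 emits `11, 12` and drops the first token, while starting at 0 emits all three. A single-token sequence is emitted correctly either way in this model, which matches the bug being limited to multi-token close sequences.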

…(qwen35 + gemma4)

Backend mechanism for soft-close thinking termination via a logit-ratio peek with split probe/inject ids, a min_tokens floor, and max_tokens treated as a response-only budget while thinking is active.

- qwen35 originates the mechanism in qwen35_backend.{cpp,h}.
- gemma4 ports the same mechanism in gemma4_backend.cpp (plus the probe/inject hooks in gemma4_internal.h owned here).
- common/model_backend.h carries the shared hook contract.
- test_server_unit.cpp adds 533 lines of coverage for both backends.
- docs/specs/thinking-budget.md updated; soft-close design plan under docs/experiments/.

server_main CLI flags for this mechanism ship in the thinking-control API PR (Luce-Org#341).

Lets a model finish its reasoning gracefully rather than being hard-cut when a thinking budget is hit, which preserves answer quality on long reasoning traces.


easel force-pushed the feat/server-soft-close branch 4 times, most recently from ea7e365 to 1f2bb36 Compare June 4, 2026 03:37

easel force-pushed the feat/server-soft-close branch from 1f2bb36 to 5b7adc0 Compare June 4, 2026 05:03

This was referenced Jun 4, 2026

feat(lucebox): docker stack + CLI + bench/profile + harness + luce-bench in-tree #285

Closed

feat(server): soft-close thinking termination via logit-ratio peek #326

Closed

easel force-pushed the feat/server-soft-close branch 2 times, most recently from b9a3237 to 56bd222 Compare June 4, 2026 18:39

easel marked this pull request as ready for review June 4, 2026 19:18

cubic-dev-ai Bot reviewed Jun 4, 2026


Comment thread server/src/qwen35/qwen35_backend.cpp Outdated

Comment thread server/src/gemma4/gemma4_backend.cpp Outdated

Comment thread server/src/qwen35/qwen35_backend.cpp Outdated

easel pushed a commit to easel/lucebox-hub that referenced this pull request Jun 5, 2026

Merge PR Luce-Org#339: feat(server): soft-close thinking termination …

b4e16d7

…(qwen35 + gemma4)

easel added 2 commits June 5, 2026 16:02

easel force-pushed the feat/server-soft-close branch from 53ede18 to 10c6e88 Compare June 5, 2026 20:03

easel wants to merge 2 commits into Luce-Org:main from easel:feat/server-soft-close
