feat(server): reduce layer-split activation memory with backend precision policy #310
Open
weicj wants to merge 4 commits into
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 29, 2026
Record PR Luce-Org#310 integration, refreshed PR classification, retained probe worktrees, and validation outcomes for the unattended integration run.
4 issues found across 30 files
e73c2e3 to bf9f4b5
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 29, 2026
Apply the updated Luce-Org#310 backend activation precision policy changes on top of the existing auto-integration conflict resolution, preserving the Qwen35 MoE persistent logits graph resolution already carried in the stack.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 29, 2026
Record the 2026-05-29 14:03 cron pass, confirming Luce-Org#309/Luce-Org#310 and all other current included PR heads remain ancestors of easel/auto-integration. Re-probe the remaining old non-draft PRs from the current integration tip and record their conflict sets and retained worktrees.
Summary
This PR reduces long-context target layer-split OOM risk by storing layer-split activation staging buffers in a backend-appropriate dtype instead of always using F32. The graph still casts back to F32 at ggml RMSNorm / norm-weight boundaries, so the lower-precision storage does not change those operator contracts.
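A minimal sketch of the storage-versus-compute split described above, not the PR's ggml graph code: activations are staged in half precision and widened again before the norm runs. The helper names are hypothetical, Python's `struct` format `e` stands in for an F16 staging buffer, and Python floats (doubles) stand in for F32.

```python
import math
import struct

def stage_as_f16(values):
    # Pack activations into a 2-byte-per-element buffer (IEEE half precision).
    return struct.pack(f"<{len(values)}e", *values)

def load_for_norm(buf):
    # Widen the staged values back to full precision before the norm boundary.
    count = len(buf) // 2
    return list(struct.unpack(f"<{count}e", buf))

def rms_norm(x, weight, eps=1e-6):
    # Plain RMSNorm on full-precision inputs: x / sqrt(mean(x^2) + eps) * weight.
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

activations = [1.0, -2.0, 0.5, 4.0]
staged = stage_as_f16(activations)      # 8 bytes instead of 16 for F32
normed = rms_norm(load_for_norm(staged), [1.0] * len(activations))
```

The norm itself only ever sees widened values, which is the sense in which the operator contract is unchanged; only the staging buffer shrinks.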
The allocation reduced here is `hidden_size * context_tokens * bytes_per_element * staging_buffer_count`. Qwen35 and Laguna currently use two staging buffers; Gemma4 uses three because its per-layer input path also keeps an original embedding staging buffer.
At a 256K context cap, the staging allocation changes from:
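To make the formula concrete, the sketch below evaluates it for F32 and for a 2-byte dtype (F16 or BF16) with two and three staging buffers. The hidden size of 4096 is a placeholder rather than a figure from this PR, and 256K is assumed to mean 262144 tokens; substitute the real model dimensions to get the actual numbers.

```python
BYTES_PER_ELEMENT = {"f32": 4, "f16": 2, "bf16": 2}

def staging_bytes(hidden_size, context_tokens, dtype, staging_buffer_count):
    # Direct transcription of the formula in the PR description.
    return hidden_size * context_tokens * BYTES_PER_ELEMENT[dtype] * staging_buffer_count

HIDDEN_SIZE = 4096            # placeholder, not taken from the PR
CONTEXT_TOKENS = 256 * 1024   # assumes "256K" means 262144 tokens

for buffers in (2, 3):        # two for Qwen35/Laguna, three for Gemma4
    f32 = staging_bytes(HIDDEN_SIZE, CONTEXT_TOKENS, "f32", buffers)
    f16 = staging_bytes(HIDDEN_SIZE, CONTEXT_TOKENS, "f16", buffers)
    print(f"{buffers} buffers: F32 {f32 / 2**30:.1f} GiB, F16/BF16 {f16 / 2**30:.1f} GiB")
```

Because every other factor is fixed, moving from a 4-byte to a 2-byte element halves the staging allocation regardless of model size.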
Changes
- `BackendActivationPolicy` in `common/backend_precision` for target layer-split activation staging.
- `LUCEBOX_LAYER_SPLIT_ACT_TYPE=f32|f16|bf16` as an explicit local override.
- `common/ggml_graph_precision.h` for graph-side F32 casts.
Notes
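As a rough illustration of how the `LUCEBOX_LAYER_SPLIT_ACT_TYPE` override listed under Changes could take precedence over a per-backend default, consider the sketch below. The function name, the case-insensitive matching, and the fallback to the backend default on unset or unrecognized values are all assumptions; the PR description only states the variable name and its three accepted values.

```python
import os

VALID_ACT_TYPES = ("f32", "f16", "bf16")

def resolve_act_type(backend_default, env=None):
    # Hypothetical helper: an explicit override wins, otherwise use the backend default.
    env = os.environ if env is None else env
    raw = env.get("LUCEBOX_LAYER_SPLIT_ACT_TYPE", "").strip().lower()
    if raw in VALID_ACT_TYPES:
        return raw
    # Assumption: unset or unrecognized values fall back rather than raising.
    return backend_default
```

For example, `resolve_act_type("f16", {"LUCEBOX_LAYER_SPLIT_ACT_TYPE": "f32"})` would force F32 staging for a local comparison run.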