Gemma4 MTP #17
Open
am17an wants to merge 4 commits into
am17an
commented
May 19, 2026
// of streams (one per active draft seq); q->ne[2] is not divisible by the full
// n_stream and the view collapses tokens. Slice k/v down to exactly the streams
// referenced by this ubatch. Requires those streams to form a contiguous range.
if (k->ne[3] > 1 && (uint32_t) k->ne[3] != ubatch.n_seqs_unq) {
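The contiguity requirement in the comment above can be illustrated with a minimal, standalone sketch. This is not the PR's code: `contiguous_stream_range` and its parameters are hypothetical names, and the stream ids are modelled as plain integers rather than taken from a real ubatch. It only shows one way to decide whether a set of stream ids can be covered by a single slice `[first, first + count)`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of llama.cpp): checks whether the stream ids
// referenced by a batch form one contiguous range. On success, writes the
// range as [first, first + count) and returns true. Returns false for an
// empty list or when the ids have gaps, in which case a single slice of the
// stream dimension cannot cover them.
bool contiguous_stream_range(std::vector<uint32_t> ids, uint32_t & first, uint32_t & count) {
    if (ids.empty()) {
        return false;
    }
    // Sort and drop duplicates so that only the unique stream ids remain.
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());

    const uint32_t lo = ids.front();
    const uint32_t hi = ids.back();
    const uint32_t n  = static_cast<uint32_t>(ids.size());

    // With duplicates removed, the ids are contiguous iff hi - lo + 1 == n.
    if (hi - lo + 1 != n) {
        return false;
    }
    first = lo;
    count = n;
    return true;
}
```

With ids `{2, 3, 4}` the helper reports `first = 2, count = 3`; with `{0, 2}` it returns false, which corresponds to the case where a single slice would not be enough.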
ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents an assertion failure in ggml_backend_dev_name.
Assisted-by: llama.cpp:local pi
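A minimal sketch of the pattern this commit message describes, assuming a device list that may contain a nullptr sentinel. The `device` struct and `device_names` function are hypothetical stand-ins for illustration and do not use the real ggml types or calls.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a backend device handle; not the ggml type.
struct device {
    std::string name;
};

// Collects the names of all devices in a list that may contain nullptr
// entries (for example a trailing sentinel). Null entries are skipped
// instead of being passed on to a name lookup.
std::vector<std::string> device_names(const std::vector<const device *> & devices) {
    std::vector<std::string> names;
    for (const device * dev : devices) {
        if (dev == nullptr) {
            continue; // looking up the name of a null handle would assert or crash
        }
        names.push_back(dev->name);
    }
    return names;
}
```

Skipping inside the loop, rather than stopping at the first nullptr, also tolerates a null entry that is not the last element.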
@am17an Are these AIME results with default thinking, or did you set a reasoning budget?
Owner
Author
Just the default, no budget.
Works with both gemma-31B and gemma-26B, but the MoE model is slower. I see a good speedup (~2-2.5x) on my DGX Spark with the dense model. The main problem is sharing the memory ctx between the two llama_contexts, so the current approach is pretty hacky, and the ubatch splitting is not very clean either.
Replicated the AIME-26 results for Gemma-31B with -np 4