Gemma4 MTP #17

Open: am17an wants to merge 4 commits into master from gemma4-mtp

@am17an
Owner

am17an commented May 19, 2026

Works with both gemma-31B and gemma-26B, but the MoE model is slower. On the dense model I see a ~2-2.5x speedup on my DGX Spark. The main problem is sharing the memory context between the two llama_contexts, so the current approach is pretty hacky, and the ubatch splitting is not very clean either.

Replicated the AIME-26 results for Gemma-31B with -np 4

image

Comment thread src/llama-graph.cpp
// of streams (one per active draft seq); q->ne[2] is not divisible by the full
// n_stream and the view collapses tokens. Slice k/v down to exactly the streams
// referenced by this ubatch. Requires those streams to form a contiguous range.
if (k->ne[3] > 1 && (uint32_t) k->ne[3] != ubatch.n_seqs_unq) {
Owner Author


@ggerganov this part
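
The code comment in this thread says the slice only works when the streams referenced by the ubatch form a contiguous range. A minimal sketch of that check in isolation is below. It does not use ggml or llama.cpp types: `stream_range` and `contiguous_stream_range` are hypothetical names chosen for illustration, and the input is assumed to be a plain list of stream indices.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper type (not part of llama.cpp): a slice
// [first, first + count) along the stream dimension of a K/V tensor.
struct stream_range {
    uint32_t first;
    uint32_t count;
};

// Returns the contiguous range covering the given stream indices, or
// std::nullopt if the list is empty or the indices leave a gap.
std::optional<stream_range> contiguous_stream_range(std::vector<uint32_t> streams) {
    if (streams.empty()) {
        return std::nullopt;
    }
    // Sort and deduplicate so that each stream is counted once.
    std::sort(streams.begin(), streams.end());
    streams.erase(std::unique(streams.begin(), streams.end()), streams.end());

    const uint32_t first = streams.front();
    const uint32_t count = static_cast<uint32_t>(streams.size());

    // With unique sorted indices, the range is gap-free exactly when
    // its span equals the number of distinct streams.
    if (streams.back() - first + 1 != count) {
        return std::nullopt;
    }
    return stream_range{first, count};
}
```

Under these assumptions, streams {2, 3, 4} yield a slice starting at 2 with 3 streams, while {0, 2} is rejected because stream 1 is missing.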

am17an and others added 2 commits May 20, 2026 23:41
ggml_backend_dev_by_name always appends a nullptr sentinel to the devices
vector. Skipping nullptr entries prevents assertion failure in
ggml_backend_dev_name.

Assisted-by: llama.cpp:local pi
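
The guard described in this commit message can be sketched as follows. This is an illustration only, not the PR's code: `fake_device` and `device_names` are hypothetical stand-ins, since the real ggml device handle is opaque and its API is not reproduced here.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a backend device handle (assumption for this sketch).
struct fake_device {
    std::string name;
};

// Collects names from a device list that may contain nullptr entries
// (for example, a trailing sentinel), skipping them instead of dereferencing.
std::vector<std::string> device_names(const std::vector<const fake_device *> & devices) {
    std::vector<std::string> names;
    for (const fake_device * dev : devices) {
        if (dev == nullptr) {
            continue; // a name lookup on nullptr is what would trip the assertion
        }
        names.push_back(dev->name);
    }
    return names;
}
```

Skipping the entry, rather than stopping at the first nullptr, keeps the loop correct whether the sentinel sits at the end of the vector or elsewhere.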
@ggerganov

@am17an Are these AIME results with default thinking, or did you set a reasoning budget?

@am17an
Owner Author

am17an commented May 21, 2026

Just the default, no budget
