mtp: support for gemma-4 E2B and E4B assistants by max-krasnyansky · Pull Request #24282 · ggml-org/llama.cpp

max-krasnyansky · 2026-06-07T22:59:40Z

Overview

Just a few small updates to enable conversion and loading of the smaller E2B and E4B gemma-4 assistant models.
The main issue was that those models include two additional tensors that we currently do not support:
masked_embedding.centroids.weight and masked_embedding.token_ordering.
I added those to the converter and updated the loader to mark them as TENSOR_NOT_REQUIRED.

Additional information

How to convert

$ ./convert_hf_to_gguf.py ../gemma-4-E2B-it-qat-q4_0-unquantized-assistant --outfile gemma-4-E2B-it-assist-F16.gguf --outtype f16

$ ./llama-quantize gemma-4-E2B-it-assist-F16.gguf gemma-4-E2B-it-assist-Q4_0.gguf Q4_0

How to run

This is an example of running on a Galaxy S26+. The Hexagon backend runs the main model and the CPU runs the draft.

$ llama-server --no-mmap -m gemma-4-E4B_q4_0-it.gguf --model-draft gemma-4-E4B-it-assist-Q4_0.gguf \
     --spec-type draft-mtp --spec-draft-n-max 3 \
     -fa on -ngl 99 --ctx-size 8192 --host 192.168.1.150 --device HTP0 --spec-draft-device none \
     -t 6
...
0.47.432.458 I slot print_timing: id  3 | task 0 | prompt eval time =    7780.15 ms /  5310 tokens (    1.47 ms per token,   682.51 tokens per second)
0.47.432.460 I slot print_timing: id  3 | task 0 |        eval time =   27496.00 ms /   386 tokens (   71.23 ms per token,    14.04 tokens per second)
0.47.432.460 I slot print_timing: id  3 | task 0 |       total time =   35276.15 ms /  5696 tokens
0.47.432.461 I slot print_timing: id  3 | task 0 |    graphs reused =        259
0.47.432.462 I slot print_timing: id  3 | task 0 | draft acceptance = 0.47893 (  125 accepted /   261 generated)
0.47.432.475 I statistics        draft-mtp: #calls(b,g,a) =    1    261    261, #gen drafts =    261, #acc drafts =   125, #gen tokens =    261, #acc tokens =   125, dur(b,g,a) = 0.004, 1656.289, 0.155 ms

The acceptance rates I'm seeing aren't amazing. Will dig into that a bit later.
Otherwise it works well. I'm seeing about a 1-2 TPS bump on the S26+ for gemma-4-E2B with n-draft 3.
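The throughput and acceptance figures in the log can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of llama.cpp: the helper names are made up for illustration, and the input numbers are copied from the print_timing lines above.

```python
def tokens_per_second(n_tokens: int, elapsed_ms: float) -> float:
    """Throughput from a token count and a duration in milliseconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def acceptance_rate(n_accepted: int, n_generated: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    return n_accepted / n_generated


# Values copied from the print_timing lines of the run above.
prompt_tps = tokens_per_second(5310, 7780.15)
eval_tps = tokens_per_second(386, 27496.00)
accept = acceptance_rate(125, 261)

print(f"prompt: {prompt_tps:.2f} t/s")
print(f"eval:   {eval_tps:.2f} t/s")
print(f"draft acceptance: {accept:.5f}")
```

The results should agree with the values the server prints, which is a quick way to confirm which counters feed the "draft acceptance" line.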

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

max-krasnyansky · 2026-06-07T23:13:58Z

@am17an @ggerganov

btw, it seems that we'll need to update llama-speculative for these separate MTP drafters.
llama-server runs with no issues, but llama-speculative throws a few errors and then crashes (at least on my Mac).

./build-macos/bin/llama-speculative -m ../gguf/gemma-4-E2B_q4_0-it.gguf -md ../gguf/gemma-4-E2B-it-assist-Q4_0.gguf  -f ../sample_prompt_1024.txt --spec-type draft-mtp --spec-draft-n-max 2 -fa on -ngl 99 --ctx-size 8192

0.00.029.513 I common_init_result: fitting params to device memory ...
0.00.029.517 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.291.336 I common_params_fit_impl: projected to use 3281 MiB of host memory vs. 49152 MiB of total host memory
0.00.501.728 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.503.522 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.508.188 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.01.961.865 W llama_context: n_ctx_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.01.967.211 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.002.894 I common_init_result: fitting params to device memory ...
0.02.002.897 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.225.933 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.02.254.099 E common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
0.02.457.611 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.459.449 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.464.193 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.02.473.449 W model has unused tensor masked_embd_centroids.weight (size = 294912 bytes) -- ignoring
0.02.473.451 W model has unused tensor masked_embd_ordering (size = 1048576 bytes) -- ignoring
0.02.477.498 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.02.477.503 E common_init_result: failed to create context with model '../gguf/gemma-4-E2B-it-assist-Q4_0.gguf'
0.02.477.503 E common_init_from_params: failed to create context with model '../gguf/gemma-4-E2B-it-assist-Q4_0.gguf
...
Segfault

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xd0)
  * frame #0: 0x00000001003c7d88 libllama.0.dylib`llama_context::decode(llama_batch const&) + 68
    frame #1: 0x00000001003cd92c libllama.0.dylib`llama_decode + 20
    frame #2: 0x0000000100003320 llama-speculative`main + 2104
    frame #3: 0x0000000189a3be00 dyld`start + 6992

ggerganov

Yes, we should adapt the llama-speculative and llama-speculative-simple examples.

CISC · 2026-06-08T06:53:04Z

This is not enough, we also need centroid_intermediate_top_k and num_centroids metadata from the config.

max-krasnyansky · 2026-06-08T17:03:23Z

> This is not enough, we also need centroid_intermediate_top_k and num_centroids metadata from the config.

@CISC can you please clarify what you mean by "not enough", i.e. not enough for what?
The changes I added are enough to convert and run the smaller E2B and E4B drafters just like the larger 12B and 26B drafters. The larger drafters do not have those two extra tensors but are otherwise identical in terms of layers (the sizes differ, of course).

If you mean that it's not enough to properly enable the masked_embeddings functionality, then that's certainly the case. But we don't have any support for that, unless I'm missing something, so I'd think adding that metadata won't be enough either.

Georgi's comment above was with respect to updating llama-speculative to match llama-server behavior to support these gemma-4 MTP drafters.

Please let me know. I'd be happy to add additional meta but my thinking is that we should add that if/when we add support for masked_embeddings.

max-krasnyansky · 2026-06-08T17:16:58Z

> This is the right load-enabling fix; I have landed with Claude Code the same TENSOR_NOT_REQUIRED handling for masked_embedding.centroids + token_ordering in a downstream (llamafile-based) fork a day before this PR.
>
> A few data points that might help with the "acceptance rates aren't amazing / modest TPS" you flagged.
>
> 1. Those two tensors drive the ordered ("efficient") head: a draft-step speed win, not an acceptance win. The E-series drafter uses centroids + token_ordering to compute logits for only a small candidate set (top-k ≈ 32 via the centroids) instead of a full [n_embd × 262144] vocab projection. Running the dense head (as here) is correct, but for the tiny E2B/E4B targets that full-vocab projection is a large fraction of each draft step, likely part of your "modest TPS". It does not change acceptance (that's drafter quality, and it's genuinely modest for the E-series; see table); it just makes each draft step much cheaper, which is what tips net throughput positive.
[SNIP]
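To put the "full-vocab projection" into rough numbers, here is a back-of-the-envelope sketch. The shape {256, 262144} reported for token_embd.weight in the E4B assistant conversion log further down this thread suggests n_embd = 256 and a 262144-entry vocabulary for that drafter. The candidate-set size is a made-up placeholder, not a value from the model config, and the cost of the centroid lookup itself is ignored.

```python
def head_macs(n_embd: int, n_rows: int) -> int:
    """Multiply-accumulates to project one hidden state onto n_rows output rows."""
    return n_embd * n_rows


N_EMBD = 256      # from the token_embd.weight shape in the E4B assistant conversion log
N_VOCAB = 262144  # same source

# HYPOTHETICAL: placeholder candidate-set size, chosen only for illustration.
N_CANDIDATES = 4096

dense = head_macs(N_EMBD, N_VOCAB)
reduced = head_macs(N_EMBD, N_CANDIDATES)

print(f"dense head:   {dense:,} MACs per drafted token")
print(f"reduced head: {reduced:,} MACs per drafted token")
print(f"ratio:        {dense // reduced}x")
```

The actual saving depends on the real candidate count and on how the centroids are evaluated, neither of which is established in this thread.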

@mann1x
Very nice insight. Thank you.
Speeding up the drafter would definitely help the TPS :)

I was going to profile the drafters separately first but llama-bench is not able to load them. It's missing that ctx_other thing just like llama-speculative.

mann1x · 2026-06-08T18:32:50Z

@max-krasnyansky

Sorry I was in a training all day, had only a few minutes break, couldn't do better than posting an edited answer from Claude.

I'm trying to map back my patches to llamafile from the AtomicBot fork but it's not easy.
The E-Series are sharing KV across layers but I don't see that in your patch.
Is that already handled by the original PR to add MTP support?
I'll try to have a look at it.

CISC · 2026-06-08T19:37:16Z

> Please let me know. I'd be happy to add additional meta but my thinking is that we should add that if/when we add support for masked_embeddings.

We should not include tensors that are basically unusable, which they are without this metadata. IMO we either remove them from conversion, or we at least include the associated metadata as well.

The potential issue with the latter is that they could still end up being useless as we don't know if they require additional transformations to be useful (I suspect the I64 token_ordering will be unwieldy), so I am actually leaning towards the former.

max-krasnyansky · 2026-06-08T20:33:28Z

> > Please let me know. I'd be happy to add additional meta but my thinking is that we should add that if/when we add support for masked_embeddings.
>
> We should not include tensors that are basically unusable, which they are without this metadata. IMO we either remove them from conversion, or we at least include the associated metadata as well.
>
> The potential issue with the latter is that they could still end up being useless as we don't know if they require additional transformations to be useful (I suspect the I64 token_ordering will be unwieldy), so I am actually leaning towards the former.

Sounds good. I added them to the filter as well.

$ ./convert_hf_to_gguf.py gemma-4-E4B-it-qat-q4_0-unquantized-assistant --outfile gemma-4-E4B-it-assist-F16.gguf --outtype f16 --verbose
INFO:hf-to-gguf:Loading model: gemma-4-E4B-it-qat-q4_0-unquantized-assistant
INFO:hf-to-gguf:Model architecture: Gemma4AssistantForCausalLM
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.centroids.weight' in safetensors so that convert can end normally.
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.token_ordering' in safetensors so that convert can end normally.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,                torch.float32 --> F32, shape = {256}
INFO:hf-to-gguf:token_embd.weight,                torch.bfloat16 --> F16, shape = {256, 262144}
INFO:hf-to-gguf:blk.0.attn_norm.weight,           torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.0.layer_output_scale.weight,  torch.bfloat16 --> F32, shape = {1}
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.centroids.weight' in safetensors so that convert can end normally.
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.token_ordering' in safetensors so that convert can end normally.
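
The filtering visible in the log can be sketched roughly as follows. This is not the code from convert_hf_to_gguf.py: `SKIPPED_PREFIXES` and `keep_tensor` are hypothetical names, and only the two masked_embedding tensor names are taken from the log.

```python
# Hypothetical names; only the masked_embedding tensor names come from the log.
SKIPPED_PREFIXES = ("masked_embedding.",)


def keep_tensor(name: str) -> bool:
    """Return False for tensors that should be left out of the GGUF file."""
    return not name.startswith(SKIPPED_PREFIXES)


tensor_names = [
    "masked_embedding.centroids.weight",
    "masked_embedding.token_ordering",
    "model.embed_tokens.weight",  # illustrative HF-style name, not from the log
]

kept = [name for name in tensor_names if keep_tensor(name)]
print(kept)
```

Filtering by prefix keeps both tensors out with a single rule; an exact-name list would be the stricter alternative if other masked_embedding tensors ever need to be kept.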

max-krasnyansky · 2026-06-08T20:48:26Z

> @max-krasnyansky
>
> Sorry I was in a training all day, had only a few minutes break, couldn't do better than posting an edited answer from Claude.
>
> I'm trying to map back my patches to llamafile from the AtomicBot fork but it's not easy. The E-Series are sharing KV across layers but I don't see that in your patch. Is that already handled by the original PR to add MTP support? I'll try to have a look at it.

Yes. That was in the original PR. llama-server is setting that up.
It's missing in the llama-speculative tool, which is what Georgi and I mentioned earlier in this discussion.

hyperscientist · 2026-06-09T10:27:45Z

Is it expected that flash attention must be off for the draft model to work? I'm running Gemma 4 E4B QAT on a Jetson Orin Nano and get great results with this patch (19.7 -> 27.7 t/s on decode), but I can't fit the base E4B + assistant model + mmproj in memory. I suspect they would fit with FA, but I get a fatal error when I try the E4B + assistant + FA combo.

Relevant AI slop that tries to explain the reason below (sorry if it's unwanted):


With --flash-attn on (or auto), llama.cpp aborts during the draft model's first decode with a hard GGML_ABORT("fatal error") in ggml/src/ggml-cuda/fattn.cu, around line 110.

Root cause

The CUDA FlashAttention dispatcher in fattn.cu selects a kernel based on head dimension (DKQ) and GQA ratio. The relevant control flow is:

- gqa_ratio = Q->ne[2] / K->ne[2] (= n_head / n_head_kv)
- If use_gqa_opt && gqa_ratio > 4 -> multi-column kernel (ncols 8)
- Else if use_gqa_opt && gqa_ratio > 2 -> multi-column kernel (ncols 4)
- Else if constexpr (DKQ <= 256) -> ncols-1/2 fallback kernels
- Else -> GGML_ABORT("fatal error")

So the unhandled combination is DKQ > 256 AND gqa_ratio <= 2. There is no compiled kernel for a large head-dim when the GQA ratio is too low to route into the multi-column GQA-optimized path.

Why this specific model hits it

The Gemma 4 E4B MTP draft head (gemma4-assistant arch) has, on its global-attention layer:

- head_dim = 512 (its real global_head_dim; the sliding-window layers use 256)
- 4 attention heads / 2 KV heads -> gqa_ratio = 2

That lands exactly in the abort branch: DKQ=512, gqa_ratio=2.

Why the main E4B model does NOT hit it

The target E4B model also has head_dim = 512 on its global layers, but it has 8 heads / 2 KV heads -> gqa_ratio = 4, which routes into the gqa_ratio > 2 multi-column kernel that does support DKQ=512. That's why the production multimodal config runs --flash-attn on fine.
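
The branch order described above can be modelled in a few lines. This is only a sketch of the control flow as stated in this comment, not a transcription of fattn.cu: the function name and return labels are invented, and use_gqa_opt is assumed to be true for both models.

```python
def select_fa_kernel(dkq: int, gqa_ratio: int, use_gqa_opt: bool = True) -> str:
    """Follow the described branch order; raise where the text says GGML_ABORT."""
    if use_gqa_opt and gqa_ratio > 4:
        return "multi-column (ncols 8)"
    if use_gqa_opt and gqa_ratio > 2:
        return "multi-column (ncols 4)"
    if dkq <= 256:
        return "ncols-1/2 fallback"
    raise RuntimeError("fatal error")


def gqa_ratio(n_head: int, n_head_kv: int) -> int:
    """GQA ratio as described above: query heads per KV head."""
    return n_head // n_head_kv


# Target E4B global layer: head_dim 512, 8 heads / 2 KV heads.
print("target:", select_fa_kernel(512, gqa_ratio(8, 2)))

# Draft head global layer: head_dim 512, 4 heads / 2 KV heads.
try:
    print("draft:", select_fa_kernel(512, gqa_ratio(4, 2)))
except RuntimeError as err:
    print("draft: abort,", err)
```

Under these assumptions only the draft head's global layer falls through to the final branch, which matches the observation that the target model runs with FA enabled.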

Bug vs. missing feature

It's both, arguably, and you can frame it two ways for the issue:

1. Missing feature: there is no CUDA FA kernel covering DKQ=512, gqa_ratio <= 2. Adding one, or routing this case to a ncols-1 path that supports DKQ > 256, would be the feature.

2. Genuine bug in auto: this is the stronger argument. --flash-attn auto is supposed to fall back gracefully when FA isn't supported, but here it still selects FA and then hard-aborts at compute time instead of disabling FA for that model/layer. An auto mode that crashes is a bug regardless of whether the kernel ever gets added. At minimum, the support check should detect the unsupported DKQ/gqa_ratio combination up front and either disable FA or error cleanly, not GGML_ABORT mid-decode.
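
An up-front check of the kind suggested in point 2 might look like the sketch below. All names are hypothetical and the predicate encodes only the condition described in this comment (with use_gqa_opt assumed true); it is not llama.cpp's actual support check.

```python
def fa_supported(dkq: int, gqa_ratio: int) -> bool:
    """True unless the shape hits the described gap: DKQ > 256 with gqa_ratio <= 2."""
    return gqa_ratio > 2 or dkq <= 256


def resolve_flash_attn(mode: str, dkq: int, gqa_ratio: int) -> bool:
    """Decide before any decode whether FA is used for this layer shape."""
    if mode == "off":
        return False
    supported = fa_supported(dkq, gqa_ratio)
    if mode == "auto":
        # Fall back silently instead of aborting at compute time.
        return supported
    if mode == "on":
        if not supported:
            raise ValueError(
                f"flash attention unsupported for DKQ={dkq}, gqa_ratio={gqa_ratio}"
            )
        return True
    raise ValueError(f"unknown flash-attn mode: {mode!r}")


print(resolve_flash_attn("auto", 512, 2))  # draft head global layer
print(resolve_flash_attn("auto", 512, 4))  # target model global layer
```

The design choice here is that "auto" degrades quietly while "on" fails early with a readable message, so neither mode reaches an abort in the middle of a decode.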

Useful details to include in the issue

- Hardware/backend: CUDA, NVIDIA Jetson Orin Nano, compute capability 8.7 / sm_87.
- Build flag: GGML_CUDA_FA_ALL_QUANTS=OFF. KV is f16, so quant coverage probably is not the issue; this appears to be head-dim/GQA-related.
- Model: google/gemma-4-E4B-it-...-assistant MTP draft, converted via convert_hf_to_gguf.py.
- Architecture: gemma4-assistant.
- Run mode: used as --model-draft with --spec-type draft-mtp.
- Trigger: any decode with -fa on/auto.
- Workaround: -fa off works, so the issue is isolated to the CUDA FA path for this layer shape.
- Crash site: ggml_cuda_flash_attn_ext -> GGML_ABORT("fatal error") in fattn.cu.

max-krasnyansky · 2026-06-09T15:33:27Z

> Is it expected that flash attention must be off for draft model to work? I'm running Gemma 4 E4B QAT on Jetson Orin Nano, get great results with this patch (19.7->27.7t/s on decode), but I'm unable to fit base E4B + assistant model + mmproj in memory and I suspect I would with FA, but unfortunately I get fatal error when I try E4B+assistant+FA combo.

Yeah, it seems related to missing kernels or the kernel selection logic.
In the example I included (Hexagon backend), FA is enabled and runs without issues.

Pulls in ggml-org/llama.cpp#23398 (gemma4-assistant draft arch) and ggml-org/llama.cpp#24282 (E2B/E4B assistants). Without this, loading any mtp-gemma-4-*.gguf drafter fails with: unknown model architecture 'gemma4-assistant'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

max-krasnyansky added 3 commits June 7, 2026 14:10

models: update converter to support smaller assistants

f910212

models: add masked_embd tensors to gemma4-assist arch

0bc2536

gemma-4: remove temp debug for conversion

be07d14

max-krasnyansky requested a review from CISC as a code owner June 7, 2026 22:59

github-actions bot added the model (Model specific) and python (python script changes) labels Jun 7, 2026

max-krasnyansky mentioned this pull request Jun 7, 2026

llama : add Gemma4 MTP #23398

Merged

ggerganov approved these changes Jun 8, 2026


gemma-4-mtp: filter out masked_embedding tensors during conversion

a9af7cf

CISC approved these changes Jun 8, 2026


max-krasnyansky merged commit 7d2b45b into ggml-org:master Jun 8, 2026
26 of 27 checks passed

dboybaker mentioned this pull request Jun 9, 2026

Eval bug: Gemma4 E4B MTP drafter crashes at slot init with fatal error in fattn.cu:110 #24376

Open

Blueforcer mentioned this pull request Jun 12, 2026

Gemma 4 MTP draft models fail to load: unknown model architecture 'gemma4-assistant' xorbitsai/xllamacpp#157

Open

max-krasnyansky merged 4 commits into ggml-org:master from qualcomm:gemma-4-support-smaller-assistants

5 participants
