mtp: support for gemma-4 E2B and E4B assistants #24282

Merged
max-krasnyansky merged 4 commits into ggml-org:master from qualcomm:gemma-4-support-smaller-assistants
Jun 8, 2026

@max-krasnyansky max-krasnyansky commented Jun 7, 2026

Member

Overview

Just a few small updates to enable conversion and loading of the smaller E2B and E4B gemma-4 assistant models.
The main issue was that those models include two additional tensors that we currently do not support: masked_embedding.centroids.weight and masked_embedding.token_ordering.
I added those to the converter and updated the loader to mark those as TENSOR_NOT_REQUIRED.

Additional information

How to convert

$ ./convert_hf_to_gguf.py ../gemma-4-E2B-it-qat-q4_0-unquantized-assistant --outfile gemma-4-E2B-it-assist-F16.gguf --outtype f16

$ ./llama-quantize gemma-4-E2B-it-assist-F16.gguf gemma-4-E2B-it-assist-Q4_0.gguf Q4_0

How to run

This is an example for running on a Galaxy S26+. The Hexagon backend runs the main model and the CPU runs the draft.

$ llama-server --no-mmap -m gemma-4-E4B_q4_0-it.gguf --model-draft gemma-4-E4B-it-assist-Q4_0.gguf \
     --spec-type draft-mtp --spec-draft-n-max 3 \
     -fa on -ngl 99 --ctx-size 8192 --host 192.168.1.150 --device HTP0 --spec-draft-device none \
     -t 6
...
0.47.432.458 I slot print_timing: id  3 | task 0 | prompt eval time =    7780.15 ms /  5310 tokens (    1.47 ms per token,   682.51 tokens per second)
0.47.432.460 I slot print_timing: id  3 | task 0 |        eval time =   27496.00 ms /   386 tokens (   71.23 ms per token,    14.04 tokens per second)
0.47.432.460 I slot print_timing: id  3 | task 0 |       total time =   35276.15 ms /  5696 tokens
0.47.432.461 I slot print_timing: id  3 | task 0 |    graphs reused =        259
0.47.432.462 I slot print_timing: id  3 | task 0 | draft acceptance = 0.47893 (  125 accepted /   261 generated)
0.47.432.475 I statistics        draft-mtp: #calls(b,g,a) =    1    261    261, #gen drafts =    261, #acc drafts =   125, #gen tokens =    261, #acc tokens =   125, dur(b,g,a) = 0.004, 1656.289, 0.155 ms

The acceptance rates I'm seeing aren't amazing. Will dig into that a bit later.
Otherwise it works well. I'm seeing about a 1-2 TPS bump on the S26+ for gemma-4-E2B with n-draft 3.
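
As a quick sanity check on the log above, the reported rates can be recomputed from the raw counts. This is a minimal Python sketch: the helper names are illustrative and not part of llama.cpp, and reading 261 + 125 = 386 as "one target token per drafting round plus the accepted draft tokens" is an assumption based only on these numbers.

```python
# Figures copied from the llama-server log above (Galaxy S26+ run).
prompt_ms, prompt_tokens = 7780.15, 5310
eval_ms, eval_tokens = 27496.00, 386
drafted, accepted = 261, 125


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens/s from a token count and elapsed milliseconds."""
    return tokens / (elapsed_ms / 1000.0)


def acceptance_rate(accepted_tokens: int, drafted_tokens: int) -> float:
    """Fraction of drafted tokens that the target model accepted."""
    return accepted_tokens / drafted_tokens


prompt_tps = tokens_per_second(prompt_tokens, prompt_ms)
eval_tps = tokens_per_second(eval_tokens, eval_ms)
rate = acceptance_rate(accepted, drafted)

print(f"prompt: {prompt_tps:.2f} t/s")
print(f"decode: {eval_tps:.2f} t/s")
print(f"draft acceptance: {rate:.5f}")
# Assumption: each drafting round also yields one token from the target model,
# so tokens per round = eval tokens / number of drafts.
print(f"tokens per round: {eval_tokens / drafted:.2f}")
```

With an acceptance rate just under 0.5, each round produces roughly 1.5 tokens on average, so the net gain depends heavily on how cheap the draft step is.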

Requirements

@max-krasnyansky max-krasnyansky requested a review from CISC as a code owner June 7, 2026 22:59
@github-actions github-actions Bot added model Model specific python python script changes labels Jun 7, 2026
@max-krasnyansky

Member Author

@am17an @ggerganov

Btw, it seems that we'll need to update llama-speculative for these separate MTP drafters.
llama-server runs with no issues, but llama-speculative throws a few errors and then crashes (at least on my Mac).

./build-macos/bin/llama-speculative -m ../gguf/gemma-4-E2B_q4_0-it.gguf -md ../gguf/gemma-4-E2B-it-assist-Q4_0.gguf  -f ../sample_prompt_1024.txt --spec-type draft-mtp --spec-draft-n-max 2 -fa on -ngl 99 --ctx-size 8192

0.00.029.513 I common_init_result: fitting params to device memory ...
0.00.029.517 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.00.291.336 I common_params_fit_impl: projected to use 3281 MiB of host memory vs. 49152 MiB of total host memory
0.00.501.728 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.503.522 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.00.508.188 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.01.961.865 W llama_context: n_ctx_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.01.967.211 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.02.002.894 I common_init_result: fitting params to device memory ...
0.02.002.897 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.02.225.933 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.02.254.099 E common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
0.02.457.611 W load: control-looking token:     50 '<|tool_response>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.459.449 W load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
0.02.464.193 W load: special_eog_ids contains '<|tool_response>', removing '</s>' token from EOG list
0.02.473.449 W model has unused tensor masked_embd_centroids.weight (size = 294912 bytes) -- ignoring
0.02.473.451 W model has unused tensor masked_embd_ordering (size = 1048576 bytes) -- ignoring
0.02.477.498 E llama_init_from_model: failed to initialize the context: Gemma4Assistant requires ctx_other to be set (this is normal during memory fitting)
0.02.477.503 E common_init_result: failed to create context with model '../gguf/gemma-4-E2B-it-assist-Q4_0.gguf'
0.02.477.503 E common_init_from_params: failed to create context with model '../gguf/gemma-4-E2B-it-assist-Q4_0.gguf
...
Segfault

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xd0)
  * frame #0: 0x00000001003c7d88 libllama.0.dylib`llama_context::decode(llama_batch const&) + 68
    frame #1: 0x00000001003cd92c libllama.0.dylib`llama_decode + 20
    frame #2: 0x0000000100003320 llama-speculative`main + 2104
    frame #3: 0x0000000189a3be00 dyld`start + 6992

@ggerganov ggerganov left a comment

Member

Yes, we should adapt the llama-speculative and llama-speculative-simple examples.

@CISC

CISC commented Jun 8, 2026

Member

This is not enough, we also need centroid_intermediate_top_k and num_centroids metadata from the config.

@mann1x

This comment has been minimized.

@max-krasnyansky

Member Author

> This is not enough, we also need centroid_intermediate_top_k and num_centroids metadata from the config.

@CISC can you please clarify what you mean by "not enough"? Not enough for what?
The changes I added are enough to convert and run the smaller E2B and E4B drafters just like the larger 12B and 26B drafters. The larger drafters do not have those two extra tensors but are otherwise identical in terms of layers (the sizes are different, of course).

If you mean that it's not enough to properly enable the masked_embeddings functionality, then that's certainly the case. But we don't have any support for that, unless I'm missing something, so I'd think adding that metadata won't be enough either.

Georgi's comment above was with respect to updating llama-speculative to match llama-server behavior to support these gemma-4 MTP drafters.

Please let me know. I'd be happy to add additional meta but my thinking is that we should add that if/when we add support for masked_embeddings.

@max-krasnyansky

Member Author

> This is the right load-enabling fix; I have landed with Claude Code the same TENSOR_NOT_REQUIRED handling for masked_embedding.centroids + token_ordering in a downstream (llamafile-based) fork a day before this PR.
>
> A few data points that might help with the "acceptance rates aren't amazing / modest TPS" you flagged.
>
> 1. Those two tensors drive the ordered ("efficient") head: a draft-step speed win, not an acceptance win. The E-series drafter uses centroids + token_ordering to compute logits for only a small candidate set (top-k ≈ 32 via the centroids) instead of a full [n_embd × 262144] vocab projection. Running the dense head (as here) is correct, but for the tiny E2B/E4B targets that full-vocab projection is a large fraction of each draft step, likely part of your "modest TPS". It does not change acceptance (that's drafter quality, and it's genuinely modest for the E-series; see table); it just makes each draft step much cheaper, which is what tips net throughput positive.
> [SNIP]
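
To put rough numbers on the dense-versus-candidate head argument, here is a back-of-the-envelope Python sketch. The drafter width of 256 comes from the token_embd.weight shape {256, 262144} in the E4B conversion log in this thread; NUM_CENTROIDS and N_CANDIDATES are hypothetical placeholders, since the real values (num_centroids, centroid_intermediate_top_k) are not given here, and the two-stage cost model itself is an assumption about how such a head would work.

```python
N_EMBD = 256       # drafter width, from token_embd.weight shape {256, 262144}
N_VOCAB = 262144   # vocabulary size quoted in the comment above

# Hypothetical values, chosen only for illustration; the real ones would come
# from config fields such as num_centroids and centroid_intermediate_top_k.
NUM_CENTROIDS = 2048
N_CANDIDATES = 4096


def dense_head_macs(n_embd: int, n_vocab: int) -> int:
    """Multiply-accumulates for a full-vocabulary output projection."""
    return n_embd * n_vocab


def candidate_head_macs(n_embd: int, num_centroids: int, n_candidates: int) -> int:
    """Score every centroid, then project onto the candidate tokens only."""
    return n_embd * (num_centroids + n_candidates)


dense = dense_head_macs(N_EMBD, N_VOCAB)
sparse = candidate_head_macs(N_EMBD, NUM_CENTROIDS, N_CANDIDATES)
print(f"dense head:     {dense:,} MACs per drafted token")
print(f"candidate head: {sparse:,} MACs per drafted token")
print(f"ratio:          {dense / sparse:.1f}x")
```

The exact ratio depends entirely on the placeholder values; the only grounded figure is the dense cost of 256 × 262144 multiply-accumulates per drafted token.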

@mann1x
Very nice insight. Thank you.
Speeding up the drafter would definitely help the TPS :)

I was going to profile the drafters separately first but llama-bench is not able to load them. It's missing that ctx_other thing just like llama-speculative.

@mann1x

mann1x commented Jun 8, 2026


@max-krasnyansky

Sorry I was in a training all day, had only a few minutes break, couldn't do better than posting an edited answer from Claude.

I'm trying to map back my patches to llamafile from the AtomicBot fork but it's not easy.
The E-Series are sharing KV across layers but I don't see that in your patch.
Is that already handled by the original PR to add MTP support?
I'll try to have a look at it.

@CISC

CISC commented Jun 8, 2026

Member

> Please let me know. I'd be happy to add additional meta but my thinking is that we should add that if/when we add support for masked_embeddings.

We should not include tensors that are basically unusable, which they are without this metadata. IMO we either remove them from conversion, or we at least include the associated metadata as well.

The potential issue with the latter is that they could still end up being useless as we don't know if they require additional transformations to be useful (I suspect the I64 token_ordering will be unwieldy), so I am actually leaning towards the former.

@max-krasnyansky

Member Author

> > Please let me know. I'd be happy to add additional meta but my thinking is that we should add that if/when we add support for masked_embeddings.
>
> We should not include tensors that are basically unusable, which they are without this metadata. IMO we either remove them from conversion, or we at least include the associated metadata as well.
>
> The potential issue with the latter is that they could still end up being useless as we don't know if they require additional transformations to be useful (I suspect the I64 token_ordering will be unwieldy), so I am actually leaning towards the former.

Sounds good. I added them to the filter as well.

$ ./convert_hf_to_gguf.py gemma-4-E4B-it-qat-q4_0-unquantized-assistant --outfile gemma-4-E4B-it-assist-F16.gguf --outtype f16 --verbose
INFO:hf-to-gguf:Loading model: gemma-4-E4B-it-qat-q4_0-unquantized-assistant
INFO:hf-to-gguf:Model architecture: Gemma4AssistantForCausalLM
INFO:hf-to-gguf:gguf: indexing model part 'model.safetensors'
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.centroids.weight' in safetensors so that convert can end normally.
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.token_ordering' in safetensors so that convert can end normally.
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,                torch.float32 --> F32, shape = {256}
INFO:hf-to-gguf:token_embd.weight,                torch.bfloat16 --> F16, shape = {256, 262144}
INFO:hf-to-gguf:blk.0.attn_norm.weight,           torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.0.layer_output_scale.weight,  torch.bfloat16 --> F32, shape = {1}
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.centroids.weight' in safetensors so that convert can end normally.
DEBUG:hf-to-gguf:Skipping get tensor 'masked_embedding.token_ordering' in safetensors so that convert can end normally.
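
For readers unfamiliar with the converter, the skip shown in the log amounts to a name-based filter. The following is a minimal Python sketch of that idea only: the two tensor names come from this PR, while SKIPPED_TENSORS, filter_tensors, and the sample input are hypothetical and do not reflect the actual code in convert_hf_to_gguf.py.

```python
from typing import Any, Iterable, Iterator, Tuple

# Tensor names taken from the PR description; everything else is hypothetical.
SKIPPED_TENSORS = frozenset({
    "masked_embedding.centroids.weight",
    "masked_embedding.token_ordering",
})


def filter_tensors(tensors: Iterable[Tuple[str, Any]]) -> Iterator[Tuple[str, Any]]:
    """Yield (name, data) pairs, dropping any name listed in SKIPPED_TENSORS."""
    for name, data in tensors:
        if name in SKIPPED_TENSORS:
            continue  # not exported, so the GGUF never contains it
        yield name, data


# Illustrative input; the first name is a made-up stand-in for a regular tensor.
checkpoint = [
    ("model.embed_tokens.weight", "tensor-a"),
    ("masked_embedding.centroids.weight", "tensor-b"),
    ("masked_embedding.token_ordering", "tensor-c"),
]
kept = [name for name, _ in filter_tensors(checkpoint)]
print(kept)
```

Filtering at conversion time means the loader never sees the two tensors, which is the option CISC was leaning towards above.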

@max-krasnyansky

Member Author

> @max-krasnyansky
>
> Sorry I was in a training all day, had only a few minutes break, couldn't do better than posting an edited answer from Claude.
>
> I'm trying to map back my patches to llamafile from the AtomicBot fork but it's not easy. The E-Series are sharing KV across layers but I don't see that in your patch. Is that already handled by the original PR to add MTP support? I'll try to have a look at it.

Yes. That was in the original PR. llama-server sets that up.
It's missing in the llama-speculative tool, which is what Georgi and I mentioned earlier in this discussion.

@max-krasnyansky max-krasnyansky merged commit 7d2b45b into ggml-org:master Jun 8, 2026
26 of 27 checks passed
@hyperscientist

hyperscientist commented Jun 9, 2026


Is it expected that flash attention must be off for the draft model to work? I'm running Gemma 4 E4B QAT on a Jetson Orin Nano and get great results with this patch (19.7 -> 27.7 t/s on decode), but I can't fit the base E4B + assistant model + mmproj in memory. I suspect it would fit with FA, but I get a fatal error when I try the E4B + assistant + FA combo.

Relevant AI slop that tries to explain the reason below (sorry if it's unwanted):


With --flash-attn on (or auto), llama.cpp aborts during the draft model's first decode with a hard GGML_ABORT("fatal error") in ggml/src/ggml-cuda/fattn.cu, around line 110.

Root cause

The CUDA FlashAttention dispatcher in fattn.cu selects a kernel based on head dimension (DKQ) and GQA ratio. The relevant control flow is:

- gqa_ratio = Q->ne[2] / K->ne[2] (= n_head / n_head_kv)
- If use_gqa_opt && gqa_ratio > 4 -> multi-column kernel (ncols 8)
- Else if use_gqa_opt && gqa_ratio > 2 -> multi-column kernel (ncols 4)
- Else if constexpr (DKQ <= 256) -> ncols-1/2 fallback kernels
- Else -> GGML_ABORT("fatal error")

So the unhandled combination is DKQ > 256 AND gqa_ratio <= 2. There is no compiled kernel for a large head-dim when the GQA ratio is too low to route into the multi-column GQA-optimized path.

Why this specific model hits it

The Gemma 4 E4B MTP draft head (gemma4-assistant arch) has, on its global-attention layer:

- head_dim = 512 (its real global_head_dim; the sliding-window layers use 256)
- 4 attention heads / 2 KV heads -> gqa_ratio = 2

That lands exactly in the abort branch: DKQ=512, gqa_ratio=2.

Why the main E4B model does NOT hit it

The target E4B model also has head_dim = 512 on its global layers, but it has 8 heads / 2 KV heads -> gqa_ratio = 4, which routes into the gqa_ratio > 2 multi-column kernel that does support DKQ=512. That's why the production multimodal config runs --flash-attn on fine.
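
The control flow described above can be modeled in a few lines of Python. This is a sketch of the description in this comment only, not a transcription of fattn.cu, and it has not been checked against the CUDA source; select_fa_kernel and its return strings are hypothetical.

```python
def select_fa_kernel(dkq: int, n_head: int, n_head_kv: int,
                     use_gqa_opt: bool = True) -> str:
    """Return the kernel family the described dispatcher would pick."""
    gqa_ratio = n_head // n_head_kv
    if use_gqa_opt and gqa_ratio > 4:
        return "multi-column (ncols 8)"
    if use_gqa_opt and gqa_ratio > 2:
        return "multi-column (ncols 4)"
    if dkq <= 256:
        return "ncols-1/2 fallback"
    # Stands in for GGML_ABORT("fatal error") in the description above.
    raise RuntimeError(f"no kernel for DKQ={dkq}, gqa_ratio={gqa_ratio}")


# Target E4B global layer: head_dim 512, 8 heads / 2 KV heads.
print(select_fa_kernel(512, 8, 2))

# Draft head global layer: head_dim 512, 4 heads / 2 KV heads.
try:
    print(select_fa_kernel(512, 4, 2))
except RuntimeError as err:
    print(f"abort: {err}")
```

Under this model the target lands in the ncols 4 branch while the draft head falls through every branch, which matches the behavior reported here.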

Bug vs. missing feature

It's both, arguably, and you can frame it two ways for the issue:

1. Missing feature: there is no CUDA FA kernel covering DKQ=512, gqa_ratio <= 2. Adding one, or routing this case to a ncols-1 path that supports DKQ > 256, would be the feature.

2. Genuine bug in auto: this is the stronger argument. --flash-attn auto is supposed to fall back gracefully when FA isn't supported, but here it still selects FA and then hard-aborts at compute time instead of disabling FA for that model/layer. An auto mode that crashes is a bug regardless of whether the kernel ever gets added. At minimum, the support check should detect the unsupported DKQ/gqa_ratio combination up front and either disable FA or error cleanly, not GGML_ABORT mid-decode.
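
The "check up front" idea from point 2 could look roughly like the Python sketch below. It is an illustration of the policy, not llama.cpp code: fa_shape_supported and resolve_flash_attn are hypothetical names, the support rule is copied from the dispatch description above (assuming the GQA optimization is enabled), and the sliding-window layer ratios are assumed to match the global layers.

```python
from typing import Iterable, Tuple


def fa_shape_supported(dkq: int, gqa_ratio: int) -> bool:
    """Support rule implied by the dispatch description (use_gqa_opt assumed on)."""
    return gqa_ratio > 2 or dkq <= 256


def resolve_flash_attn(mode: str, layers: Iterable[Tuple[int, int]]) -> bool:
    """Decide whether to enable FA for a model.

    mode is "on", "off", or "auto"; layers holds (head_dim, gqa_ratio) pairs.
    """
    if mode not in ("on", "off", "auto"):
        raise ValueError(f"unknown flash-attn mode: {mode!r}")
    if mode == "off":
        return False
    unsupported = [shape for shape in layers if not fa_shape_supported(*shape)]
    if not unsupported:
        return True
    if mode == "auto":
        return False  # fall back quietly instead of aborting mid-decode
    raise ValueError(f"flash attention requested but unsupported for {unsupported}")


draft_layers = [(256, 2), (512, 2)]   # sliding-window and global layers of the draft
target_layers = [(256, 4), (512, 4)]  # assumed: same head dims, 8 heads / 2 KV heads
print(resolve_flash_attn("auto", draft_layers))
print(resolve_flash_attn("auto", target_layers))
```

With a check like this, "auto" disables FA for the draft model and an explicit "on" fails with a clean error at load time rather than during the first decode.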

Useful details to include in the issue

- Hardware/backend: CUDA, NVIDIA Jetson Orin Nano, compute capability 8.7 / sm_87.
- Build flag: GGML_CUDA_FA_ALL_QUANTS=OFF. KV is f16, so quant coverage probably is not the issue; this appears to be head-dim/GQA-related.
- Model: google/gemma-4-E4B-it-...-assistant MTP draft, converted via convert_hf_to_gguf.py.
- Architecture: gemma4-assistant.
- Run mode: used as --model-draft with --spec-type draft-mtp.
- Trigger: any decode with -fa on/auto.
- Workaround: -fa off works, so the issue is isolated to the CUDA FA path for this layer shape.
- Crash site: ggml_cuda_flash_attn_ext -> GGML_ABORT("fatal error") in fattn.cu.

@max-krasnyansky

Member Author

> Is it expected that flash attention must be off for the draft model to work? I'm running Gemma 4 E4B QAT on a Jetson Orin Nano and get great results with this patch (19.7 -> 27.7 t/s on decode), but I can't fit the base E4B + assistant model + mmproj in memory. I suspect it would fit with FA, but I get a fatal error when I try the E4B + assistant + FA combo.

Yeah, it seems to be related to missing kernels or the kernel selection logic.
In the example I included (Hexagon backend), FA is enabled without issues.

Blueforcer added a commit to aleph-garden/xllamacpp that referenced this pull request Jun 12, 2026
Pulls in ggml-org/llama.cpp#23398 (gemma4-assistant draft arch) and
ggml-org/llama.cpp#24282 (E2B/E4B assistants). Without this, loading any
mtp-gemma-4-*.gguf drafter fails with: unknown model architecture
'gemma4-assistant'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>