mtp: support for gemma-4 E2B and E4B assistants #24282
btw It seems that we'll need to update |
ggerganov left a comment:
Yes, we should adapt the llama-speculative and llama-speculative-simple examples.
This is not enough, we also need |
@CISC can you please clarify what you mean by "not enough", i.e. not enough for what? If you mean that it's not enough to properly enable [...] Georgi's comment above was with respect to updating [...] Please let me know. I'd be happy to add additional metadata, but my thinking is that we should add that if/when we add support for masked_embeddings. |
@mann1x I was going to profile the drafters separately first but |
Sorry I was in a training all day, had only a few minutes break, couldn't do better than posting an edited answer from Claude. I'm trying to map back my patches to llamafile from the AtomicBot fork but it's not easy. |
We should not include tensors that are basically unusable, which they are without this metadata. IMO we either remove them from conversion, or we at least include the associated metadata as well. The potential issue with the latter is that they could still end up being useless as we don't know if they require additional transformations to be useful (I suspect the |
Sounds good. I added them to the filter as well. |
Yes. That was in the original PR. |
Is it expected that flash attention must be off for the draft model to work? I'm running Gemma 4 E4B QAT on a Jetson Orin Nano and get great results with this patch (19.7 -> 27.7 t/s on decode), but I'm unable to fit base E4B + assistant model + mmproj in memory. I suspect it would fit with FA, but unfortunately I get a fatal error when I try the E4B + assistant + FA combo. Relevant AI slop that tries to explain the reason below (sorry if it's unwanted): |
Yeah, this seems related to missing kernels or to the kernel selection logic. |
Pulls in ggml-org/llama.cpp#23398 (gemma4-assistant draft arch) and ggml-org/llama.cpp#24282 (E2B/E4B assistants). Without this, loading any mtp-gemma-4-*.gguf drafter fails with: unknown model architecture 'gemma4-assistant'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Overview
Just a few small updates to enable conversion and loading of the smaller E2B and E4B gemma-4 assistant models.
The main issue was that those models include two additional tensors that we currently do not support.
masked_embedding.centroids.weight and masked_embedding.token_ordering.
I added those to the converter and updated the loader to mark those as TENSOR_NOT_REQUIRED.
Additional information
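For readers unfamiliar with the idea, the optional-tensor handling described in the overview can be sketched as a toy model. This is not llama.cpp code: resolve_tensors, TENSOR_REQUIRED, the WANTED list and the use of token_embd.weight as the required example are all assumptions made for illustration. Only the two masked_embedding tensor names and the TENSOR_NOT_REQUIRED flag name come from the PR text.

```python
# Toy model of "optional tensor" loading. Illustration only, not llama.cpp code.
TENSOR_REQUIRED = "required"          # hypothetical flag for this sketch
TENSOR_NOT_REQUIRED = "not_required"  # named after the flag mentioned in the PR


def resolve_tensors(available, wanted):
    """Return {name: tensor or None}; raise if a required tensor is missing."""
    resolved = {}
    for name, flag in wanted:
        if name in available:
            resolved[name] = available[name]
        elif flag == TENSOR_NOT_REQUIRED:
            resolved[name] = None  # absent but optional: loading continues
        else:
            raise KeyError(f"missing required tensor: {name}")
    return resolved


WANTED = [
    ("token_embd.weight", TENSOR_REQUIRED),  # illustrative required tensor
    ("masked_embedding.centroids.weight", TENSOR_NOT_REQUIRED),
    ("masked_embedding.token_ordering", TENSOR_NOT_REQUIRED),
]

if __name__ == "__main__":
    # A file without the masked_embedding tensors still resolves.
    print(resolve_tensors({"token_embd.weight": "<data>"}, WANTED))
```

The point of the flag in this sketch is that a file with or without the two extra tensors loads the same way, while a missing required tensor still fails loudly.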
How to convert
How to run
This is an example for running on a Galaxy S26+. The Hexagon backend runs the main model and the CPU runs the draft.
The acceptance rates I'm seeing aren't amazing. I'll dig into that a bit later.
Otherwise it works well. I'm seeing about a 1-2 TPS bump on the S26+ for gemma-4-E2B with n-draft 3.
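To put the acceptance-rate remark in context, here is a rough sketch of how acceptance rate and draft length interact. It uses the standard simplified model of speculative decoding, which assumes each drafted token is accepted independently with the same probability p. That assumption, the function name and the sample values of p are illustrative and are not measurements from this PR. The model also ignores the cost of running the draft model, so it gives an upper bound on the gain rather than a prediction.

```python
def expected_tokens_per_step(p: float, n_draft: int) -> float:
    """Expected tokens emitted per target-model pass.

    Simplified model: each of the n_draft drafted tokens is accepted
    independently with probability p, and the target model always
    contributes one token (a correction or a bonus token).
    Closed form: (1 - p**(n_draft + 1)) / (1 - p).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n_draft < 0:
        raise ValueError("n_draft must be non-negative")
    if p == 1.0:
        return float(n_draft + 1)  # limit of the closed form as p -> 1
    return (1.0 - p ** (n_draft + 1)) / (1.0 - p)


if __name__ == "__main__":
    # Hypothetical acceptance rates, with n-draft 3 as in the run above.
    for p in (0.3, 0.5, 0.7, 0.9):
        print(f"p={p:.1f}: {expected_tokens_per_step(p, 3):.3f} tokens/step")
```

Under this model a low acceptance rate caps the benefit quickly: with n-draft 3, the expected yield stays below 2 tokens per target pass until p exceeds roughly 0.54.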
Requirements