llama : add Gemma4 MTP #23398
|
Thank you. My tests on dual 3080 (20 GB) show a decrease in performance. Logs follow. Setup: Gemma4-31B-Q8_0 (the same file as on your HF repo).
With MTP enabled, performance is the same with draft 1, 2, 3, 4 (
The logs show 0 draft acceptance:
draft acceptance = 0.00000 (0 accepted / 1090 generated)
#gen tokens = 1090, #acc tokens = 0
So speculative decoding appears to be active, but all draft tokens are rejected, resulting in a significant performance decrease instead of acceleration.
Commands used:
./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --no-warmup
./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf --model-draft mtp-gemma-4-31B-it.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --no-warmup |
|
I did a few quick tests with my system. MTP was actually slightly slower for me. I assume it's because of my hardware setup. I used a 52-token prompt asking it to code an HTML animation for me.
Hardware:
Without MTP:
With MTP: |
|
Multi GPU is currently broken, I will push a fix in a bit. |
That explains it. I'll rerun my test when you push a fix. |
|
Thank you for your work! Here is my test result, I have to use Qwen3.6-35B-A3B to translate. Compared to the other two commenters, my test results were quite surprising. Environment:
1. Baseline Test (No Speculative Decoding)
Launch Command:
llama-server -m /mnt/disk_2t/Models/gemma-4-31B-it-Q8_0/gemma-4-31B-it-Q8_0.gguf --ctx-size 65536 --flash-attn on --no-mmap --cache-ram 32768 --fit on --temp 1 --samplers top_k;top_p;temperature --top-p 0.95 --top-k 64 --ctx-checkpoints 1 --split-mode tensor --batch-size 2048 --ubatch-size 512 --parallel 1 --threads -1 --seed -1 -dio
Log Output:
Metrics:
2. Draft-MTP Test (With Speculative Decoding)
Draft Model:
Launch Command:
llama-server -m /mnt/disk_2t/Models/gemma-4-31B-it-Q8_0/gemma-4-31B-it-Q8_0.gguf --ctx-size 65536 --spec-type draft-mtp --flash-attn on --spec-draft-n-max 4 --no-mmap --cache-ram 32768 --fit on --spec-draft-model /home/mark/MTP/mtp-gemma-4-31B-it.gguf --temp 1 --samplers top_k;top_p;temperature --top-p 0.95 --top-k 64 --ctx-checkpoints 1 --split-mode tensor --batch-size 2048 --ubatch-size 512 --parallel 1 --threads -1 --seed -1 -dio
Log Output:
Metrics:
3. Comparison Summary
|
|
@BootsSiR for me on 1x4090, 1x5090 on this test https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090 MTP: "wall_s_total": 18.23 You may need to specify |
|
@fabriciomalta I think you may have a wrong file; a 0% acceptance rate is highly unusual. I couldn't replicate it. |
I have the same issue when I use Q8 cache quantization with Vulkan. If you turn it off, it works properly. |
|
I can reproduce the 0% acceptance rate when the main model's KV cache is quantized to q8_0. With f16 KV cache, the acceptance rate seems normal. It seems quantizing the KV cache breaks it. |
|
Thanks, that's a real bug then. I will fix |
Tested with the latest code and that Python test.
Device Info
No MTP
llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -c 16384
MTP Enabled
llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -md ~/ai-models/mtp/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --device-draft CUDA1
👏 |
|
@am17an Update: it is working now. I deleted the directory and pulled again. The issue was the quantized KV cache. With Hardware:
Working command:
./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --flash-attn on --no-mmap --temp 1 --top-p 0.95 --top-k 64 --parallel 1 --batch-size 2048 --ubatch-size 512 -ngl 999 --device-draft CUDA1 --no-warmup
Result:
eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
#gen tokens = 792, #acc tokens = 472
So the previous 0% acceptance was caused by the Q8 KV cache. With the f16/default KV cache, Draft-MTP works correctly on my dual 3080 setup.
Additional confirmation: I re-tested with Q8 KV cache enabled again (
Command:
./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --flash-attn on --no-mmap --temp 1 --top-p 0.95 --top-k 64 --parallel 1 --batch-size 2048 --ubatch-size 512 -ngl 999 --device-draft CUDA1 -ctk q8_0 -ctv q8_0 --no-warmup
With Q8 KV cache enabled, performance dropped again:
n_decoded = 100, tg = 14.61 t/s
n_decoded = 145, tg = 14.63 t/s
n_decoded = 189, tg = 14.63 t/s
Without Q8 KV cache, the same setup reached:
eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
#gen tokens = 792, #acc tokens = 472
So this confirms the issue is related to Q8 KV cache. With default/f16 KV cache, Draft-MTP works correctly; with |
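For anyone comparing runs, the figures in these log lines can be recomputed directly from the raw counts. The following is a minimal sketch, assuming the log format shown in this comment; `summarize` and the regular expressions are illustrative helpers, not part of llama.cpp.

```python
import re

# Log lines copied from the f16 KV cache run quoted above.
LOG_F16 = """\
eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
"""

# Hypothetical patterns matching the two log lines above.
EVAL_RE = re.compile(r"eval time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")
ACC_RE = re.compile(
    r"draft acceptance\s*=\s*[\d.]+\s*\(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)


def summarize(log: str) -> dict:
    """Recompute throughput and draft acceptance from the raw counts."""
    eval_ms, n_tokens = EVAL_RE.search(log).groups()
    accepted, generated = ACC_RE.search(log).groups()
    eval_ms, n_tokens = float(eval_ms), int(n_tokens)
    accepted, generated = int(accepted), int(generated)
    return {
        "tokens_per_second": n_tokens / (eval_ms / 1000.0),
        "ms_per_token": eval_ms / n_tokens,
        # With zero drafted tokens the rate is undefined rather than 0.
        "acceptance": accepted / generated if generated else None,
    }


stats = summarize(LOG_F16)
Q8_TG = 14.63  # t/s reported above with -ctk q8_0 -ctv q8_0

print(f"{stats['tokens_per_second']:.2f} t/s, acceptance {stats['acceptance']:.5f}")
print(f"f16 vs q8_0 KV cache: {stats['tokens_per_second'] / Q8_TG:.2f}x")
```

The ratio printed at the end only compares the two runs reported in this comment; it says nothing about other hardware or prompts.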
|
Strix Halo: Best results for me: Q=4 with N=3 seems to be pretty fast. |
Thanks very much for this PR! The performance numbers with MTP on the dense model look great! Regarding the MoE model, I also tried a related experiment with an Eagle3 checkpoint on DGX Spark, and it appears to provide some speedup there. This may be a useful reference point for understanding why MTP does not show the same speedup on the MoE model. One possible explanation is that Eagle3 is more lightweight: it uses a single-layer transformer and incorporates d2t vocabulary mapping, which may reduce the draft-model overhead compared with MTP. A possible future direction could be to explore whether Eagle3 and MTP can be combined. MTP is generally strong across broad tasks because it is paired with target-model pretraining, while Eagle3 may be easier to adapt for domain-specific use cases, since users can train Eagle3 separately on their own customized datasets. For reference, here are the Eagle3 performance numbers with Gemma4-A4B-26B (BF16) on DGX Spark:
code_python pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
code_cpp pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
explain_concept pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
summarize pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
qa_factual pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
translation pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
creative_short pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
stepwise_math pred= 192 draft= 0 acc= 0 rate=n/a tok/s=28.3
long_code_review pred= 192 draft= 0 acc= 0 rate=n/a tok/s=27.6
Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 0,
"total_draft_accepted": 0,
"aggregate_accept_rate": null,
"wall_s_total": 64.01
}
code_python pred= 192 draft= 215 acc= 116 rate=0.539 tok/s=41.4
code_cpp pred= 192 draft= 181 acc= 102 rate=0.564 tok/s=38.6
explain_concept pred= 192 draft= 172 acc= 99 rate=0.576 tok/s=37.0
summarize pred= 192 draft= 211 acc= 119 rate=0.564 tok/s=42.3
qa_factual pred= 192 draft= 181 acc= 108 rate=0.597 tok/s=40.7
translation pred= 192 draft= 176 acc= 95 rate=0.540 tok/s=36.0
creative_short pred= 192 draft= 182 acc= 80 rate=0.440 tok/s=32.4
stepwise_math pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=44.2
long_code_review pred= 192 draft= 155 acc= 108 rate=0.697 tok/s=40.1
Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 1677,
"total_draft_accepted": 948,
"aggregate_accept_rate": 0.5653,
"wall_s_total": 46.12
}
Details can be found in the Eagle3 PR: #18039 (comment) |
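The aggregate figures can be cross-checked from the per-prompt rows. This is a small sketch using only the numbers posted above; treating the `draft= 0` run as the baseline is an assumption based on its zero draft counts, and the variable names are illustrative.

```python
# (prompt, draft tokens, accepted tokens) from the run with drafting enabled.
rows = [
    ("code_python", 215, 116),
    ("code_cpp", 181, 102),
    ("explain_concept", 172, 99),
    ("summarize", 211, 119),
    ("qa_factual", 181, 108),
    ("translation", 176, 95),
    ("creative_short", 182, 80),
    ("stepwise_math", 204, 121),
    ("long_code_review", 155, 108),
]

total_draft = sum(draft for _, draft, _ in rows)
total_accepted = sum(acc for _, _, acc in rows)
aggregate_accept_rate = total_accepted / total_draft

# Wall-clock totals from the two "Aggregate" blocks.
baseline_wall_s = 64.01  # run with draft = 0 (assumed to be the baseline)
draft_wall_s = 46.12     # run with drafting enabled
speedup = baseline_wall_s / draft_wall_s

print(total_draft, total_accepted, round(aggregate_accept_rate, 4))
print(f"wall-clock speedup: {speedup:.2f}x")
```

The totals match the reported `total_draft`, `total_draft_accepted`, and `aggregate_accept_rate` values.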
|
Why is the E4B/E2B not supported yet? Is it that different? |
|
Quantized KV cache should now work; it was missing the Hadamard rotation for Q. @Handyfff it will be added later. |
|
I'm having issues with a hard crash when trying to use multi-GPU across a GV100 and a 5070 Ti. It works on either card alone, but if I try to split the model residency between cards I get a hard crash when the model finishes loading.
[gemma-4-31B-UD-Q4_K_XL-MTP]
sm = layer
device = CUDA1
spec-draft-device = CUDA1
chat-template-file = .\models\gemma_31b_fixed.jinja
model = .\models\google_gemma-4-31B-it-Q4_K_L.gguf
md = .\models\mtp-gemma-4-31B-it.gguf
spec-type = draft-mtp
ngld = 99
spec-draft-n-max = 3
temp = 1.0
ctk = q8_0
ctv = q8_0
b = 4096
ub = 1024
top-k = 64
top-p = 0.95
ctx-size = 131072
ctx-checkpoints = 12
This works fine, and setting both to CUDA0 also works fine (with offloading to CPU). However, the ideal case, where the draft model sits on one GPU and the main model is split across both, doesn't work no matter what I try (built from 4b1d1ae this morning). On the GV100 I get
I didn't do comparative testing on the 5070 Ti, but with ngld 99 and ngl 27 I get about 15 tok/s, versus about 7 tok/s on pure CPU. Overall amazing work; it really brings the best local model within usable reach for the average gamer, really game changing. EDIT: also, prefill went from ~900 to ~700 tok/s vs baseline. |
|
@thot-experiment can you create a debug build and see where it crashes? |
|
@forforever73 Great, 0.80 acceptance! Do you actually beat your no-MTP tok/s with it, and does your CoT stay coherent or ever switch to Chinese like mine? Asking because my Q2_K_XL setup isn't memory-bandwidth representative, so the MTP win gets eaten and it's not worth it on my end (96GB VRAM) Also, which target + draft GGUFs are you running exactly (vendor/repo)? |
|
@ServeurpersoCom I use and the no-MTP result is Hmm, the H800 has way more than enough memory. Wait, let me run it on Spark real quick. |
|
Nice machine! Full BF16 makes the decode heavy enough that MTP is a real win for you. On the Spark you'll have to quantize to fit and probably hit my problem, where the MTP gain gets eaten. Main thing I'm curious about: is your model reasoning clean on the monster machine, or does the CoT ever switch to Chinese like mine? This allows the patch to be validated! |
|
@ServeurpersoCom I think you have to use |
Good catch! I'm running a test now: |
|
@ServeurpersoCom On Spark I use
no MTP
with MTP
with MTP and -ctk q8_0 -ctv q8_0
Currently --spec-draft-n-max 3 performs worse than 1. I'm working on supporting the Step3.5 3-layer MTP, but due to a conflict with Gemma 4, it will still take some time. |
|
As ggerganov said, use --spec-draft-n-max 1 because more than 1 has not been implemented for this model. I'm trying a combination of the two patches : |
|
All working. The combined patch is the cleanest, @ggerganov: keep your can_reuse guards, and use the mask's own buffer for set_input. Your base guard keyed off self_k_idxs_swa, which is allocated for a SWA-only draft head (StepFun's MTP head is SWA-only), so it still wrote the null base mask and crashed at load. Guarding each mask on its own buffer covers both cases, at all 4 sites. You can try this latest one, @forforever73; it should work. |
|
@ServeurpersoCom yes, it can work as well. |
|
I confirm that patch from #23398 (comment) fixes StepFun 3.7 MTP |
Thanks for testing. Just 2 more runners #24294 and it'll merge :) |
(cherry picked from commit 04eb4c4)
Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on this fork without disturbing TurboQuant+ or the custom kernels:
- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb` through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the refactor while keeping the fork's n_layer_kv_from_start; restore the swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl / nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum, talkie); keep non-MTP fork models on their own source + mechanical refactor.
Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip). Kernels (ggml-cuda / ggml-metal) untouched.
Add support for gemma4-assistant models as MTP (Multi-Token Prediction)
draft heads for speculative decoding with gemma4 target models.
## Key Features
### Automatic Assistant Detection
- Detect gemma4-assistant models via 'gemma4.assistant.type = mtp' metadata
- Automatically route to gemma4-assistant implementation even when GGUF declares
'general.architecture = gemma4'
- Read 'gemma4.assistant.backbone_hidden_size' to get target model's hidden size
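The routing rule in the bullets above can be illustrated with a toy model. This is a sketch only, assuming the metadata has already been read into a plain dict; `resolve_arch` and `backbone_hidden_size` are hypothetical helper names, not functions from llama.cpp, and the hidden size is a placeholder value.

```python
# Toy model of the metadata-based routing described above.
# A plain dict stands in for the GGUF key/value pairs.

ASSISTANT_ARCH = "gemma4-assistant"


def resolve_arch(metadata: dict) -> str:
    """Route a 'gemma4' GGUF to the assistant arch when it is tagged as MTP."""
    arch = metadata.get("general.architecture", "")
    if arch == "gemma4" and metadata.get("gemma4.assistant.type") == "mtp":
        return ASSISTANT_ARCH
    return arch


def backbone_hidden_size(metadata: dict):
    """Return the target model's hidden size, or None if the key is absent."""
    return metadata.get("gemma4.assistant.backbone_hidden_size")


target = {"general.architecture": "gemma4"}
assistant = {
    "general.architecture": "gemma4",
    "gemma4.assistant.type": "mtp",
    "gemma4.assistant.backbone_hidden_size": 1024,  # placeholder, not a real model value
}

print(resolve_arch(target))
print(resolve_arch(assistant))
print(backbone_hidden_size(assistant))
```

Keying the decision on the assistant metadata rather than on `general.architecture` alone is what lets a file that declares `gemma4` still load as a draft head.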
### Architecture Alignment with Upstream
- Rename LLM_ARCH_GEMMA4_MTP to LLM_ARCH_GEMMA4_ASSISTANT
- Rename gemma4_mtp.cpp to gemma4-assistant.cpp
- Add ctx_other integration for shared memory between target and assistant
- Align layer counting with upstream (n_layer_all vs n_layer)
### Tensor Support
- Add LLM_TENSOR_ASSISTANT_PRE_PROJ and LLM_TENSOR_ASSISTANT_POST_PROJ
- Map 'assistant.pre_projection' and 'assistant.post_projection' tensor names
- Make rope_freqs optional (assistant GGUFs don't include this tensor)
- Fix layer_output_scale tensor name (remove 'weight' suffix)
- Add optional MTP projection tensors to gemma4.cpp
### Layer Counting Alignment
- Use n_layer_all for iterating all layers in assistant models
- Use n_layer() for regular layers (n_layer_all - n_layer_nextn)
- Assistant models have n_layer() = 0 (all layers are nextn layers)
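The counting convention above amounts to a single subtraction. Below is a minimal Python sketch of it; the real code is C++, the `HParams` class here is an illustrative stand-in, and the layer counts are made-up example values.

```python
from dataclasses import dataclass


@dataclass
class HParams:
    """Illustrative stand-in for the layer-count fields named above."""
    n_layer_all: int    # every layer stored in the GGUF
    n_layer_nextn: int  # layers that belong to the MTP (nextn) head

    def n_layer(self) -> int:
        # Regular layers are whatever is left after removing the nextn layers.
        return self.n_layer_all - self.n_layer_nextn


# Hypothetical counts, chosen only to show the two cases.
target = HParams(n_layer_all=60, n_layer_nextn=0)
assistant = HParams(n_layer_all=4, n_layer_nextn=4)

print(target.n_layer())     # all layers are regular layers
print(assistant.n_layer())  # all layers are nextn layers

# Code that must visit every layer of an assistant iterates over n_layer_all.
for il in range(assistant.n_layer_all):
    pass  # per-layer work would go here
```

Because an assistant model reports `n_layer() == 0`, any loop written against `n_layer()` would skip all of its layers, which is why the bullets above call for `n_layer_all` there.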
### Stride and Dimension Fixes
- Use n_embd_out() for stride in output_reorder()
- Use target's n_embd_out for k==0 nextn fallback
- Add embeddings_pre_norm to allow_reuse() check
## Testing
Assistant model loads successfully:
- gemma-4-E2B-it-assistant-BF16.gguf: ✓ Loads (requires ctx_other for inference)
- Architecture detection: ✓ Automatically detects as gemma4-assistant
- Tensor loading: ✓ All 48 tensors found
Note: Full MTP speculative decoding requires a working target model. The
gemma4 target models in our test environment have separate tensor count
issues unrelated to this PR.
## Usage
```bash
./llama-server \
-m target-gemma4.gguf \
-md assistant-gemma4.gguf \
--spec-type mtp \
--draft-block-size 3 \
--draft-max 8
```
## Files Changed
- 147 files modified
- +659 insertions, -455 deletions
- New: src/models/gemma4-assistant.cpp
- Deleted: src/models/gemma4_mtp.cpp
## References
- Upstream PR: ggml-org#23398
- Model card: https://huggingface.co/google/gemma-4-E2B-it-assistant
- GGUF repo: https://huggingface.co/AtomicChat/gemma-4-E2B-it-assistant-GGUF
Assisted-by: opencode
Pulls in ggml-org/llama.cpp#23398 (gemma4-assistant draft arch) and ggml-org/llama.cpp#24282 (E2B/E4B assistants). Without this, loading any mtp-gemma-4-*.gguf drafter fails with: unknown model architecture 'gemma4-assistant'.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Backend & quantization table omitted two HT-specific speculative decoding features that have shipped to ht:
- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for partial-accept feature extraction): landed via PR #62 (b0daec5), integrates the z-lab DFlash block-diffusion drafter against Gemma4 31B targets.
- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp): vendored via PR #93 (4c09765) ahead of upstream PR ggml-org#23398 merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and flows through a normal master sync.
Found during a §7 documentation freshness sweep. The inventory exists to be authoritative ("consult it before assuming a behaviour is upstream stock" per AGENTS.md), so omissions defeat the purpose. Docs-only, no code touched.
Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>


Overview
This PR adds MTP support for Gemma 4 models. For the MoE model I don't observe a speed-up on my system, but the dense model sees a >2x speedup on average. Correctness-wise, I am able to replicate the AIME-26 (~87%) results as advertised by the Gemma team. This works for the 31B and 26B-4B variants, but not for the E4B and E2B variants for now.
Note
Multi-GPU works but you may need to specify `--spec-draft-device` with `-sm layer`
Additional information
Performance on mtp-bench on a DGX Spark 🧵
No MTP
--spec-draft-n-max 4
How to use
If you have lots of VRAM
llama-server -hf am17an/Gemma4-31B-it-GGUF --spec-type draft-mtp --spec-draft-n-max 4
Requirements