llama : add Gemma4 MTP by am17an · Pull Request #23398 · ggml-org/llama.cpp

am17an · 2026-05-20T08:56:15Z

Overview

This PR adds MTP support for Gemma 4 models. For the MoE model I don't observe a speed-up on my system, but the dense model sees a >2x speedup on average. Correctness-wise, I am able to replicate the AIME-26 (~87%) results advertised by the Gemma team. This works for the 31B and 26B-4B variants, but not for the E4B and E2B variants for now.

Note

Multi-GPU works but you may need to specify --spec-draft-device with -sm layer

Additional information

Performance on mtp-bench on a DGX Spark 🧵

No MTP

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.1
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.2
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.0
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=5.9
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=5.9
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.2
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.2
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.0
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 290.01
}

`--spec-draft-n-max 4`

  code_python        pred= 192 draft= 231 acc= 133 rate=0.576 tok/s=14.9
  code_cpp           pred= 192 draft= 197 acc= 141 rate=0.716 tok/s=18.0
  explain_concept    pred= 192 draft= 268 acc= 123 rate=0.459 tok/s=12.9
  summarize          pred= 192 draft= 208 acc= 138 rate=0.663 tok/s=16.2
  qa_factual         pred= 192 draft= 211 acc= 138 rate=0.654 tok/s=16.4
  translation        pred= 192 draft= 235 acc= 131 rate=0.557 tok/s=14.6
  creative_short     pred= 192 draft= 292 acc= 117 rate=0.401 tok/s=11.4
  stepwise_math      pred= 192 draft= 180 acc= 146 rate=0.811 tok/s=19.3
  long_code_review   pred= 192 draft= 222 acc= 135 rate=0.608 tok/s=14.9

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2044,
  "total_draft_accepted": 1202,
  "aggregate_accept_rate": 0.5881,
  "wall_s_total": 120.65
}
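For readers who want to sanity-check the two aggregates above, here is a minimal sketch that recomputes the acceptance rate and the wall-clock speedup. The dictionaries are copied by hand from the output above; the helper names `accept_rate` and `wall_speedup` are made up for illustration and are not part of mtp-bench. Note that `wall_s_total` presumably includes prompt processing, so this ratio is only a rough proxy for decode speedup.

```python
# Aggregates copied from the two DGX Spark runs above.
no_mtp = {"total_predicted": 1728, "total_draft": 0,
          "total_draft_accepted": 0, "wall_s_total": 290.01}
mtp = {"total_predicted": 1728, "total_draft": 2044,
       "total_draft_accepted": 1202, "wall_s_total": 120.65}


def accept_rate(agg):
    """Accepted draft tokens / generated draft tokens, or None without drafts."""
    if agg["total_draft"] == 0:
        return None
    return agg["total_draft_accepted"] / agg["total_draft"]


def wall_speedup(baseline, candidate):
    """Ratio of wall-clock times; a value above 1 means the candidate was faster."""
    return baseline["wall_s_total"] / candidate["wall_s_total"]


print(f"accept rate: {accept_rate(mtp):.4f}")
print(f"speedup:     {wall_speedup(no_mtp, mtp):.2f}x")
```

With these inputs the recomputed acceptance rate matches the reported `aggregate_accept_rate`, and the wall-clock ratio comes out at roughly 2.4x.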

How to use

If you have lots of VRAM

llama-server -hf am17an/Gemma4-31B-it-GGUF --spec-type draft-mtp --spec-draft-n-max 4

https://huggingface.co/am17an/Gemma4-31B-it-GGUF

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, mainly for adding code to share the kv-cache and for testing against the transformers implementation.

fabriciomalta · 2026-05-20T14:34:43Z

Thank you. Tests on dual 3080s (20 GB) show a decrease in performance. Logs follow:

Setup with Gemma4-31B-Q8_0 (same on your hf repo).
Full logs: https://pastebin.com/DRjGrZ9R
Without MTP:

~19.3 t/s

With MTP enabled, performance is the same for draft sizes 1, 2, 3 and 4 (--spec-type draft-mtp --spec-draft-n-max 2):

~9.3 t/s

The logs show 0 draft acceptance:

draft acceptance = 0.00000 (0 accepted / 1090 generated)
#gen tokens = 1090, #acc tokens = 0

So speculative decoding appears to be active, but all draft tokens are rejected, resulting in a significant performance decrease instead of acceleration.

Commands used:

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --no-warmup

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf --model-draft mtp-gemma-4-31B-it.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --no-warmup

BootsSiR · 2026-05-20T14:48:01Z

I did a few quick tests with my system. MTP was actually slightly slower for me. I assume it's because of my hardware setup.

52 token prompt to have it code an html animation for me.

Hardware:

0.00.237.123 I device_info:
0.00.303.668 I   - CUDA0   : NVIDIA GeForce RTX 5090 (32108 MiB, 29101 MiB free)
0.00.380.610 I   - CUDA1   : NVIDIA GeForce RTX 4090 (24082 MiB, 23671 MiB free)

Without MTP:

1.51.859.646 I slot print_timing: id  3 | task 0 | prompt eval time =      68.32 ms /    52 tokens (    1.31 ms per token,   761.16 tokens per second)
1.51.859.648 I slot print_timing: id  3 | task 0 |        eval time =   96783.23 ms /  3114 tokens (   31.08 ms per token,    32.17 tokens per second)
1.51.859.649 I slot print_timing: id  3 | task 0 |       total time =   96851.55 ms /  3166 tokens
1.51.859.653 I slot print_timing: id  3 | task 0 |    graphs reused =       3101
1.51.859.672 I slot      release: id  3 | task 0 | stop processing: n_tokens = 3165, truncated = 0

With MTP:

2.26.014.320 I slot print_timing: id  3 | task 0 | prompt eval time =     111.03 ms /    52 tokens (    2.14 ms per token,   468.34 tokens per second)
2.26.014.322 I slot print_timing: id  3 | task 0 |        eval time =  114817.54 ms /  3308 tokens (   34.71 ms per token,    28.81 tokens per second)
2.26.014.323 I slot print_timing: id  3 | task 0 |       total time =  114928.57 ms /  3360 tokens
2.26.014.326 I slot print_timing: id  3 | task 0 |    graphs reused =       1015
2.26.014.327 I slot print_timing: id  3 | task 0 | draft acceptance = 0.55447 ( 2280 accepted /  4112 generated)
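If you want to compare runs like the two above without reading the numbers by eye, a small parser can pull decode throughput and draft acceptance out of the server log. This is only a sketch: the regular expressions are written against the log lines shown in this thread, and the function name `summarize` is hypothetical. The log format may differ in other builds.

```python
import re

# Matches the decode line but not the "prompt eval time" line.
EVAL_RE = re.compile(r"(?<!prompt )eval time =\s*([\d.]+) ms /\s*(\d+) tokens")
ACC_RE = re.compile(
    r"draft acceptance = [\d.]+ \(\s*(\d+) accepted /\s*(\d+) generated\)")

# Two lines copied from the "With MTP" log above.
LOG = """\
2.26.014.322 I slot print_timing: id  3 | task 0 |        eval time =  114817.54 ms /  3308 tokens (   34.71 ms per token,    28.81 tokens per second)
2.26.014.327 I slot print_timing: id  3 | task 0 | draft acceptance = 0.55447 ( 2280 accepted /  4112 generated)
"""


def summarize(log_text):
    """Return decode tok/s and draft acceptance rate found in the log text."""
    out = {"decode_tok_s": None, "accept_rate": None}
    m = EVAL_RE.search(log_text)
    if m:
        ms, tokens = float(m.group(1)), int(m.group(2))
        out["decode_tok_s"] = tokens / (ms / 1000.0)
    m = ACC_RE.search(log_text)
    if m:
        accepted, generated = int(m.group(1)), int(m.group(2))
        out["accept_rate"] = accepted / generated if generated else None
    return out


print(summarize(LOG))
```

A run with drafts enabled but an acceptance rate of zero, as reported earlier in the thread, would show up here as `accept_rate == 0.0`.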

am17an · 2026-05-20T14:51:56Z

Multi GPU is currently broken, I will push a fix in a bit.

BootsSiR · 2026-05-20T15:00:43Z

> Multi GPU is currently broken, I will push a fix in a bit.

That explains it. I'll rerun my test when you push a fix.

IIIIIllllIIIIIlllll · 2026-05-20T15:01:25Z

Thank you for your work! Here are my test results. I had to use Qwen3.6-35B-A3B to translate them.

Compared to the other two commenters, my test results were quite surprising.

Environment:

Hardware: 2x NVIDIA GeForce RTX 3090 (Tensor Parallel)
Model: gemma-4-31B-it-Q8_0.gguf (Q8_0 Quantization)
Input: 32,767 tokens (Random noise)

1. Baseline Test (No Speculative Decoding)

Launch Command:

llama-server -m /mnt/disk_2t/Models/gemma-4-31B-it-Q8_0/gemma-4-31B-it-Q8_0.gguf --ctx-size 65536 --flash-attn on --no-mmap --cache-ram 32768 --fit on --temp 1 --samplers top_k;top_p;temperature --top-p 0.95 --top-k 64 --ctx-checkpoints 1 --split-mode tensor --batch-size 2048 --ubatch-size 512 --parallel 1 --threads -1 --seed -1 -dio

Log Output:

1.03.338.149 I slot print_timing: id  0 | task 1 | prompt eval time =   27857.24 ms / 32767 tokens (    0.85 ms per token,  1176.25 tokens per second)
1.03.338.152 I slot print_timing: id  0 | task 1 |        eval time =    7167.21 ms /   256 tokens (   28.00 ms per token,    35.72 tokens per second)

Metrics:

Prompt Eval: 1,176.25 tok/s
Decode (256 tokens): 35.72 tok/s

2. Draft-MTP Test (With Speculative Decoding)

Draft Model: /home/mark/MTP/mtp-gemma-4-31B-it.gguf

Launch Command:

llama-server -m /mnt/disk_2t/Models/gemma-4-31B-it-Q8_0/gemma-4-31B-it-Q8_0.gguf --ctx-size 65536 --spec-type draft-mtp --flash-attn on --spec-draft-n-max 4 --no-mmap --cache-ram 32768 --fit on --spec-draft-model /home/mark/MTP/mtp-gemma-4-31B-it.gguf --temp 1 --samplers top_k;top_p;temperature --top-p 0.95 --top-k 64 --ctx-checkpoints 1 --split-mode tensor --batch-size 2048 --ubatch-size 512 --parallel 1 --threads -1 --seed -1 -dio

Log Output:

3.44.872.979 I slot print_timing: id  0 | task 554 | prompt eval time =   28147.80 ms / 32767 tokens (    0.86 ms per token,  1164.11 tokens per second)
3.44.872.982 I slot print_timing: id  0 | task 554 |        eval time =    4106.41 ms /   256 tokens (   16.04 ms per token,    62.34 tokens per second)
3.44.872.984 I slot print_timing: id  0 | task 554 | draft acceptance = 0.43902 (  162 accepted /   369 generated)
3.44.872.994 I statistics        draft-mtp: #calls(b,g,a) =    4    637    637, #gen drafts =    637, #acc drafts =   413, #gen tokens =   2545, #acc tokens =  1111, dur(b,g,a) = 0.006,  9834.855, 0.569 ms

Metrics:

Prompt Eval: 1,164.11 tok/s
Decode (256 tokens): 62.34 tok/s
Draft Acceptance Rate: 43.9% (162 accepted / 369 generated)

3. Comparison Summary

Metric	Baseline (No MTP)	With Draft-MTP	Improvement
Prompt Throughput	1,176.25 tok/s	1,164.11 tok/s	~ -1% (Negligible)
Decode Throughput	35.72 tok/s	62.34 tok/s	+74.5% Speedup
Decode Latency	28.00 ms/tok	16.04 ms/tok	Significant Reduction
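One way to read the `statistics draft-mtp` line in the log above is in terms of tokens produced per verification step. The sketch below assumes the usual speculative-decoding accounting, where each verification pass of the target model yields the accepted draft tokens plus one token sampled by the target itself. That accounting is an assumption on my part, not something stated in the log, and the counters appear to be cumulative across tasks, so they do not match the per-slot 162/369 figure.

```python
# Counters copied from the "statistics draft-mtp" log line above.
gen_drafts = 637      # number of draft sequences generated
gen_tokens = 2545     # draft tokens generated in total
acc_tokens = 1111     # draft tokens accepted in total

drafted_per_step = gen_tokens / gen_drafts    # close to --spec-draft-n-max 4
accepted_per_step = acc_tokens / gen_drafts

# Assumption: each verification pass also emits one token from the target.
tokens_per_target_pass = accepted_per_step + 1.0

print(f"drafted per step:       {drafted_per_step:.2f}")
print(f"accepted per step:      {accepted_per_step:.2f}")
print(f"tokens per target pass: {tokens_per_target_pass:.2f}")
```

Under that assumption the target produces about 2.7 tokens per pass, which is an upper bound on the decode speedup. The measured speedup in the table is lower, which is plausible since drafting and verifying a longer batch both cost time.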

am17an · 2026-05-20T15:55:08Z

@BootsSiR for me on 1x4090, 1x5090 on this test https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090

MTP: "wall_s_total": 18.23
no-MTP: "wall_s_total": 47.13

You may need to specify --spec-device-draft

am17an · 2026-05-20T15:57:41Z

@fabriciomalta I think you may have a wrong file; a 0% acceptance rate is highly unusual. I couldn't replicate it.

theDTV2 · 2026-05-20T16:23:17Z

> Thank you. Tests on dual 3080s (20 GB) show a decrease in performance. Logs follow:
>
> Setup with Gemma4-31B-Q8_0 (same on your hf repo).
> Full logs: https://pastebin.com/DRjGrZ9R
> Without MTP:
> * ~19.3 t/s
> With MTP enabled, performance is the same for draft sizes 1, 2, 3 and 4 (--spec-type draft-mtp --spec-draft-n-max 2):
> * ~9.3 t/s
> The logs show 0 draft acceptance:
> draft acceptance = 0.00000 (0 accepted / 1090 generated)
> #gen tokens = 1090, #acc tokens = 0
> So speculative decoding appears to be active, but all draft tokens are rejected, resulting in a significant performance decrease instead of acceleration.
>
> Commands used:
> ./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --no-warmup
> ./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf --model-draft mtp-gemma-4-31B-it.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --no-warmup

I have the same issue when I use Q8 cache quantization with Vulkan. If you turn it off, it works properly.
@am17an

theo77186 · 2026-05-20T16:24:36Z

I can reproduce the 0% acceptance rate when the main model's KV cache is quantized to q8_0. With f16 KV cache, the acceptance rate seems normal. It seems quantizing the KV cache breaks it.

am17an · 2026-05-20T16:40:44Z

Thanks, that's a real bug then. I will fix it.

BootsSiR · 2026-05-20T16:44:15Z

> @BootsSiR for me on 1x4090, 1x5090 on this test https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
>
> MTP: "wall_s_total": 18.23
> no-MTP: "wall_s_total": 47.13
>
> You may need to specify --spec-device-draft

Tested with the latest code and that python test.

Device Info

CUDA0   : NVIDIA GeForce RTX 5090 (32108 MiB, 28786 MiB free)
CUDA1   : NVIDIA GeForce RTX 4090 (24082 MiB, 23671 MiB free)

No MTP

llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -c 16384

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.6
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.8
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.6
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.8
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 52.52
}

MTP Enabled

llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -md ~/ai-models/mtp/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --device-draft CUDA1

  code_python        pred= 192 draft= 207 acc= 139 rate=0.671 tok/s=94.7
  code_cpp           pred= 192 draft= 212 acc= 138 rate=0.651 tok/s=94.4
  explain_concept    pred= 192 draft= 255 acc= 127 rate=0.498 tok/s=78.0
  summarize          pred= 192 draft= 188 acc= 143 rate=0.761 tok/s=104.0
  qa_factual         pred= 192 draft= 221 acc= 135 rate=0.611 tok/s=89.5
  translation        pred= 192 draft= 226 acc= 133 rate=0.589 tok/s=86.9
  creative_short     pred= 192 draft= 272 acc= 122 rate=0.449 tok/s=72.6
  stepwise_math      pred= 192 draft= 201 acc= 140 rate=0.697 tok/s=97.6
  long_code_review   pred= 192 draft= 225 acc= 134 rate=0.596 tok/s=85.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2007,
  "total_draft_accepted": 1211,
  "aggregate_accept_rate": 0.6034,
  "wall_s_total": 21.05
}
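The per-prompt rows printed by the benchmark are easy to post-process if you want your own averages. The following is a rough sketch of a parser for that row format; the regular expression is inferred from the output pasted in this thread, and `parse_rows` is a hypothetical helper, not something shipped with the script.

```python
import re

ROW_RE = re.compile(
    r"^\s*(\w+)\s+pred=\s*(\d+)\s+draft=\s*(\d+)\s+acc=\s*(\d+)"
    r"\s+rate=(\S+)\s+tok/s=([\d.]+)")

# Three rows copied from the "MTP Enabled" run above.
SAMPLE = """\
  code_python        pred= 192 draft= 207 acc= 139 rate=0.671 tok/s=94.7
  summarize          pred= 192 draft= 188 acc= 143 rate=0.761 tok/s=104.0
  creative_short     pred= 192 draft= 272 acc= 122 rate=0.449 tok/s=72.6
"""


def parse_rows(text):
    """Parse benchmark rows into dicts; rate is None when printed as n/a."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue
        name, pred, draft, acc, rate, tps = m.groups()
        rows.append({
            "name": name,
            "pred": int(pred),
            "draft": int(draft),
            "acc": int(acc),
            "rate": None if rate == "n/a" else float(rate),
            "tok_s": float(tps),
        })
    return rows


rows = parse_rows(SAMPLE)
mean_tok_s = sum(r["tok_s"] for r in rows) / len(rows)
print(f"{len(rows)} rows, mean tok/s = {mean_tok_s:.1f}")
```

The printed `rate` column is simply `acc / draft` rounded to three decimals, which can be confirmed from the parsed values.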

👏

fabriciomalta · 2026-05-20T17:16:33Z

@am17an Update: it is working now.

I deleted the directory and pulled again. The issue was the quantized KV cache. With -ctk q8_0 -ctv q8_0, Draft-MTP initialized but had 0% acceptance. After rebuilding the latest PR code and removing Q8 KV cache, acceptance became normal.

Hardware:

2x RTX 3080 20GB
Gemma4-31B-Q8_0
Draft model: mtp-gemma-4-31B-it.gguf

Working command:

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --flash-attn on --no-mmap --temp 1 --top-p 0.95 --top-k 64 --parallel 1 --batch-size 2048 --ubatch-size 512 -ngl 999 --device-draft CUDA1 --no-warmup

Result:

eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
#gen tokens = 792, #acc tokens = 472

So the previous 0% acceptance was caused by Q8 KV cache. With f16/default KV cache, Draft-MTP works correctly on my dual 3080 setup.

Additional confirmation: I re-tested with Q8 KV cache enabled again (-ctk q8_0 -ctv q8_0) using the same working setup.

Command:

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --flash-attn on --no-mmap --temp 1 --top-p 0.95 --top-k 64 --parallel 1 --batch-size 2048 --ubatch-size 512 -ngl 999 --device-draft CUDA1 -ctk q8_0 -ctv q8_0 --no-warmup

With Q8 KV cache enabled, performance dropped again:

n_decoded = 100, tg = 14.61 t/s
n_decoded = 145, tg = 14.63 t/s
n_decoded = 189, tg = 14.63 t/s

Without Q8 KV cache, the same setup reached:

eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
#gen tokens = 792, #acc tokens = 472

So this confirms the issue is related to Q8 KV cache. With default/f16 KV cache, Draft-MTP works correctly; with -ctk q8_0 -ctv q8_0, it degrades heavily / previously reached 0% acceptance.
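For anyone wanting to repeat this A/B comparison, the two launch commands differ only in the KV cache type flags. Below is a minimal sketch that builds both argument lists so the rest of the configuration stays identical. It only uses flags that already appear in the commands above; the function name `server_args` and its parameters are hypothetical, and the sketch just prints the commands rather than launching anything.

```python
import shlex


def server_args(model, draft, n_max=4, kv_type=None):
    """Build a llama-server argument list; kv_type=None keeps the default KV cache."""
    args = ["llama-server", "-m", model, "-md", draft,
            "--spec-type", "draft-mtp", "--spec-draft-n-max", str(n_max)]
    if kv_type is not None:
        # Same type for K and V, as in the commands above.
        args += ["-ctk", kv_type, "-ctv", kv_type]
    return args


default_kv = server_args("Gemma4-31B-Q8_0.gguf", "mtp-gemma-4-31B-it.gguf")
q8_kv = server_args("Gemma4-31B-Q8_0.gguf", "mtp-gemma-4-31B-it.gguf",
                    kv_type="q8_0")

print(shlex.join(default_kv))
print(shlex.join(q8_kv))
```

Keeping every other flag fixed makes it easier to attribute a change in acceptance rate to the KV cache type alone.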

exander77 · 2026-05-20T23:46:16Z

Strix Halo:

$llama-server -m models/Gemma4-31B-Q8_0.gguf --port 18080

 python3 scripts/mtp-bench.py --url http://127.0.0.1:18080
  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.6
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.6
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.6
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  creative_short     pred=  36 draft=   0 acc=   0 rate=n/a tok/s=6.8
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1572,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 244.6
}

$llama-server -m models/Gemma4-31B-Q8_0.gguf -md models/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --port 18080

python3 scripts/mtp-bench.py --url http://127.0.0.1:18080
  code_python        pred= 192 draft= 209 acc= 138 rate=0.660 tok/s=17.7
  code_cpp           pred= 192 draft= 167 acc= 149 rate=0.892 tok/s=22.1
  explain_concept    pred= 192 draft= 188 acc= 144 rate=0.766 tok/s=19.9
  summarize          pred= 192 draft= 180 acc= 145 rate=0.806 tok/s=20.3
  qa_factual         pred= 192 draft= 169 acc= 148 rate=0.876 tok/s=21.7
  translation        pred= 192 draft= 327 acc= 107 rate=0.327 tok/s=11.2
  creative_short     pred=  36 draft=  68 acc=  19 rate=0.279 tok/s=10.4
  stepwise_math      pred= 192 draft= 192 acc= 142 rate=0.740 tok/s=19.1
  long_code_review   pred= 192 draft= 317 acc= 111 rate=0.350 tok/s=11.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1572,
  "total_draft": 1817,
  "total_draft_accepted": 1103,
  "aggregate_accept_rate": 0.607,
  "wall_s_total": 103.42
}

Best results for me:

$llama-server -m models/gemma-4-31B-it-Q4_K_M.gguf -md models/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 3 --port 18080

  code_python        pred= 192 draft= 177 acc= 132 rate=0.746 tok/s=26.5
  code_cpp           pred= 192 draft= 201 acc= 124 rate=0.617 tok/s=22.0
  explain_concept    pred= 192 draft= 222 acc= 116 rate=0.522 tok/s=19.7
  summarize          pred= 192 draft= 156 acc= 138 rate=0.885 tok/s=27.9
  qa_factual         pred= 192 draft= 188 acc= 128 rate=0.681 tok/s=23.5
  translation        pred= 192 draft= 180 acc= 130 rate=0.722 tok/s=24.3
  creative_short     pred=  36 draft=  54 acc=  18 rate=0.333 tok/s=15.5
  stepwise_math      pred= 192 draft= 151 acc= 140 rate=0.927 tok/s=28.9
  long_code_review   pred= 192 draft= 276 acc=  97 rate=0.351 tok/s=14.8

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1572,
  "total_draft": 1605,
  "total_draft_accepted": 1023,
  "aggregate_accept_rate": 0.6374,
  "wall_s_total": 78.44
}

Q4 with --spec-draft-n-max 3 seems to be pretty fast.

aldehir · 2026-05-21T09:37:00Z

My earlier results were scuffed. This should be more representative for this hardware. Looks good!

CUDA0   : NVIDIA RTX PRO 6000 Blackwell Server Edition (97249 MiB, 96691 MiB free)
CPU     : AMD EPYC 9355 32-Core Processor (1547705 MiB, 1547705 MiB free)

Detailed Results

No MTP (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=39.9
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.0
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.0
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.1
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.2
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.2
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.2
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.1
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=39.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 44.51
}

MTP --spec-draft-n-max 2 (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=72.8
  code_cpp           pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=74.5
  explain_concept    pred= 192 draft= 171 acc= 104 rate=0.608 tok/s=66.2
  summarize          pred= 192 draft= 149 acc= 115 rate=0.772 tok/s=76.0
  qa_factual         pred= 192 draft= 158 acc= 111 rate=0.703 tok/s=72.3
  translation        pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=76.1
  creative_short     pred= 192 draft= 184 acc=  99 rate=0.538 tok/s=62.9
  stepwise_math      pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=77.0
  long_code_review   pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=69.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1430,
  "total_draft_accepted": 997,
  "aggregate_accept_rate": 0.6972,
  "wall_s_total": 25.6
}

MTP --spec-draft-n-max 3 (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=83.1
  code_cpp           pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=87.8
  explain_concept    pred= 192 draft= 237 acc= 111 rate=0.468 tok/s=66.4
  summarize          pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=86.5
  qa_factual         pred= 192 draft= 191 acc= 127 rate=0.665 tok/s=82.8
  translation        pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=85.8
  creative_short     pred= 192 draft= 238 acc= 110 rate=0.462 tok/s=66.0
  stepwise_math      pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=90.2
  long_code_review   pred= 192 draft= 230 acc= 113 rate=0.491 tok/s=66.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1803,
  "total_draft_accepted": 1111,
  "aggregate_accept_rate": 0.6162,
  "wall_s_total": 23.58
}

MTP --spec-draft-n-max 4 (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 4

  code_python        pred= 192 draft= 220 acc= 135 rate=0.614 tok/s=84.5
  code_cpp           pred= 192 draft= 208 acc= 139 rate=0.668 tok/s=91.8
  explain_concept    pred= 192 draft= 285 acc= 118 rate=0.414 tok/s=65.7
  summarize          pred= 192 draft= 190 acc= 143 rate=0.753 tok/s=98.9
  qa_factual         pred= 192 draft= 211 acc= 138 rate=0.654 tok/s=89.8
  translation        pred= 192 draft= 230 acc= 132 rate=0.574 tok/s=81.4
  creative_short     pred= 192 draft= 287 acc= 119 rate=0.415 tok/s=66.7
  stepwise_math      pred= 192 draft= 207 acc= 139 rate=0.671 tok/s=91.7
  long_code_review   pred= 192 draft= 235 acc= 132 rate=0.562 tok/s=79.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2073,
  "total_draft_accepted": 1195,
  "aggregate_accept_rate": 0.5765,
  "wall_s_total": 22.62
}

No MTP (Q4)

./llama-server -m ../gemma4-31B-it-Q4_K_M.gguf -np 1

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=61.8
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=61.9
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.1
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.1
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.4
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.6
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.8
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.5
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=61.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 29.09
}

MTP --spec-draft-n-max 2 (Q4)

./llama-server -m ../gemma-4-31B-it-Q4_0-mtp.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 166 acc= 107 rate=0.645 tok/s=93.6
  code_cpp           pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=95.5
  explain_concept    pred= 192 draft= 158 acc= 111 rate=0.703 tok/s=99.3
  summarize          pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=101.5
  qa_factual         pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=102.8
  translation        pred= 192 draft= 169 acc= 105 rate=0.621 tok/s=92.4
  creative_short     pred= 192 draft= 184 acc=  99 rate=0.538 tok/s=86.3
  stepwise_math      pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=105.4
  long_code_review   pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=97.4

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1455,
  "total_draft_accepted": 986,
  "aggregate_accept_rate": 0.6777,
  "wall_s_total": 19.26
}

MTP --spec-draft-n-max 3 (Q4)

./llama-server -m ../gemma-4-31B-it-Q4_0-mtp.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 208 acc= 121 rate=0.582 tok/s=94.4
  code_cpp           pred= 192 draft= 190 acc= 127 rate=0.668 tok/s=104.6
  explain_concept    pred= 192 draft= 199 acc= 123 rate=0.618 tok/s=99.2
  summarize          pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=106.0
  qa_factual         pred= 192 draft= 184 acc= 129 rate=0.701 tok/s=107.9
  translation        pred= 192 draft= 183 acc= 128 rate=0.700 tok/s=106.7
  creative_short     pred= 192 draft= 238 acc= 111 rate=0.466 tok/s=83.8
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=111.3
  long_code_review   pred= 192 draft= 213 acc= 120 rate=0.563 tok/s=91.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1782,
  "total_draft_accepted": 1118,
  "aggregate_accept_rate": 0.6274,
  "wall_s_total": 18.72
}

MTP --spec-draft-n-max 4 (Q4)

./llama-server -m ../gemma-4-31B-it-Q4_0-mtp.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 4

  code_python        pred= 192 draft= 232 acc= 133 rate=0.573 tok/s=95.0
  code_cpp           pred= 192 draft= 220 acc= 136 rate=0.618 tok/s=102.4
  explain_concept    pred= 192 draft= 244 acc= 130 rate=0.533 tok/s=93.0
  summarize          pred= 192 draft= 215 acc= 137 rate=0.637 tok/s=104.8
  qa_factual         pred= 192 draft= 225 acc= 133 rate=0.591 tok/s=98.9
  translation        pred= 192 draft= 198 acc= 140 rate=0.707 tok/s=112.3
  creative_short     pred= 192 draft= 290 acc= 116 rate=0.400 tok/s=76.9
  stepwise_math      pred= 192 draft= 188 acc= 144 rate=0.766 tok/s=120.0
  long_code_review   pred= 192 draft= 281 acc= 119 rate=0.423 tok/s=78.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2093,
  "total_draft_accepted": 1188,
  "aggregate_accept_rate": 0.5676,
  "wall_s_total": 19.39
}

Summary

Configuration	Accept rate	Wall (s)	Mean tok/s	Min tok/s	Max tok/s	Speedup vs. baseline
Q8 no MTP	n/a	44.51	40.0	39.5	40.2	1.00x
Q8 MTP n=2	0.6972	25.60	71.9	62.9	77.0	1.74x
Q8 MTP n=3	0.6162	23.58	79.5	66.0	90.2	1.89x
Q8 MTP n=4	0.5765	22.62	83.3	65.7	98.9	1.97x
Q4 no MTP	n/a	29.09	62.1	61.0	62.8	1.00x
Q4 MTP n=2	0.6777	19.26	97.1	86.3	105.4	1.51x
Q4 MTP n=3	0.6274	18.72	100.6	83.8	111.3	1.55x
Q4 MTP n=4	0.5676	19.39	98.0	76.9	120.0	1.50x
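The speedup column in the table above is the baseline wall time divided by the wall time of each MTP run. A short sketch that recomputes it and picks the fastest draft length per quantization follows. The wall times are copied from the table; the variable names are illustrative only.

```python
# Wall times in seconds, copied from the summary table above.
runs = {
    "Q8": {"baseline": 44.51, "mtp": {2: 25.60, 3: 23.58, 4: 22.62}},
    "Q4": {"baseline": 29.09, "mtp": {2: 19.26, 3: 18.72, 4: 19.39}},
}

best = {}
for quant, data in runs.items():
    # Fastest run is the one with the smallest wall time.
    n_best = min(data["mtp"], key=data["mtp"].get)
    speedup = data["baseline"] / data["mtp"][n_best]
    best[quant] = (n_best, round(speedup, 2))
    print(f"{quant}: best n-max = {n_best}, speedup = {speedup:.2f}x")
```

On this hardware the Q8 model keeps improving up to n=4, while the Q4 model peaks at n=3, which matches the table. Other systems may well land on different values.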

ruixiang63 · 2026-05-21T13:10:23Z

> For the MoE model I don't observe a speed-up on my system, but the dense model has on average >2x speedup.

Thanks very much for this PR! The performance numbers with MTP on the dense model look great!

Regarding the MoE model, I also tried a related experiment with an Eagle3 checkpoint on DGX Spark, and it appears to provide some speedup there. This may be a useful reference point for understanding why MTP does not show the same speedup on the MoE model.

One possible explanation is that Eagle3 is more lightweight: it uses a single-layer transformer and incorporates d2t vocabulary mapping, which may reduce the draft-model overhead compared with MTP.
In addition, I found that tuning --spec-draft-p-min also helps improve the speedup.
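To illustrate why a probability threshold can help, here is a toy sketch of how a minimum-probability cutoff might shorten a draft. It assumes the drafter stops as soon as its confidence in the next token drops below the threshold. That is my reading of what --spec-draft-p-min does and should be treated as an assumption, not as a description of the actual implementation. The function name and the example probabilities are made up.

```python
def truncate_draft(draft, p_min):
    """Keep draft tokens until the drafter's probability falls below p_min.

    draft: list of (token, drafter_probability) pairs in generation order.
    """
    kept = []
    for token, prob in draft:
        if prob < p_min:
            break  # low-confidence token: stop drafting here
        kept.append(token)
    return kept


# Hypothetical draft of four tokens with made-up probabilities.
draft = [("def", 0.95), (" main", 0.81), ("(", 0.42), (")", 0.90)]
print(truncate_draft(draft, p_min=0.6))
```

A higher threshold means fewer drafted tokens per step but a higher share of them accepted, so less verification work is wasted on tokens that are likely to be rejected.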

A possible future direction could be to explore whether Eagle3 and MTP can be combined. MTP is generally strong across broad tasks because it is paired with target-model pretraining, while Eagle3 may be easier to adapt for domain-specific use cases, since users can train Eagle3 separately on their own customized datasets.

For reference, here are the Eagle3 performance numbers with Gemma4-A4B-26B (BF16) on DGX Spark:

Without Eagle3

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 64.01
}

With Eagle3

  code_python        pred= 192 draft= 215 acc= 116 rate=0.539 tok/s=41.4
  code_cpp           pred= 192 draft= 181 acc= 102 rate=0.564 tok/s=38.6
  explain_concept    pred= 192 draft= 172 acc=  99 rate=0.576 tok/s=37.0
  summarize          pred= 192 draft= 211 acc= 119 rate=0.564 tok/s=42.3
  qa_factual         pred= 192 draft= 181 acc= 108 rate=0.597 tok/s=40.7
  translation        pred= 192 draft= 176 acc=  95 rate=0.540 tok/s=36.0
  creative_short     pred= 192 draft= 182 acc=  80 rate=0.440 tok/s=32.4
  stepwise_math      pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=44.2
  long_code_review   pred= 192 draft= 155 acc= 108 rate=0.697 tok/s=40.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1677,
  "total_draft_accepted": 948,
  "aggregate_accept_rate": 0.5653,
  "wall_s_total": 46.12
}

Details can be found in Eagle3 PR: #18039 (comment)
Q4_K_M quantized models are a bit worse. #18039 (comment)

Handyfff · 2026-05-23T04:14:16Z

Why is the E4B/E2B not supported yet? Is it that different?

am17an · 2026-05-23T07:02:48Z

Quantized kv-cache should now work; it was missing the Hadamard rotation for Q.

@Handyfff it will be added later
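As background on the missing Hadamard rotation for Q mentioned above, here is a toy illustration of why leaving it out could break attention scores. It assumes the quantized cache stores K in a rotated basis (K multiplied by an orthonormal Hadamard matrix). That is an assumption for the sake of the example and not a description of the actual kernels. With an orthonormal rotation, rotating both Q and K preserves their dot product, while rotating only K does not.

```python
import math


def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


H = hadamard(4)
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

true_score = dot(q, k)                        # attention logit without rotation
both_rotated = dot(matvec(H, q), matvec(H, k))
only_k_rotated = dot(q, matvec(H, k))         # the "missing rotation for Q" case

print(true_score, both_rotated, only_k_rotated)
```

In this toy case the score with only K rotated is unrelated to the true score, so the draft head would attend to the wrong positions, which is at least consistent with the 0% acceptance reported with q8_0 KV cache.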

thot-experiment · 2026-05-23T22:52:48Z

I'm hitting a hard crash when trying to use multi-GPU across a GV100 and a 5070 Ti. It works on either card alone, but if I try to split the model residency between the cards I get a hard crash when the model finishes loading. -lv 99 reveals nothing worthwhile.

[gemma-4-31B-UD-Q4_K_XL-MTP]
sm = layer
device = CUDA1
spec-draft-device = CUDA1
chat-template-file = .\models\gemma_31b_fixed.jinja
model = .\models\google_gemma-4-31B-it-Q4_K_L.gguf
md = .\models\mtp-gemma-4-31B-it.gguf
spec-type = draft-mtp 
ngld = 99
spec-draft-n-max = 3
temp = 1.0
ctk = q8_0
ctv = q8_0
b = 4096
ub = 1024
top-k = 64
top-p = 0.95
ctx-size = 131072
ctx-checkpoints = 12

This works fine, and setting both to CUDA0 also works fine (with offloading to CPU). However, the ideal case where the draft model sits on one GPU and the main model is split across both doesn't work no matter what I try (built from 4b1d1ae this morning).

On the GV100 I get:

Configuration	Performance
Baseline q4 model	18 tok/s
MTP q8 (Drafter q4 model)	35 tok/s

I didn't do comparative testing on the 5070 Ti, but with ngld 99 and ngl 27 I get about 15 tok/s, and about 7 tok/s on pure CPU. Overall amazing work: it really brings the best local model within usable reach of the average gamer. Really game changing.

EDIT: also prefill went from ~900 to ~700tok/s vs baseline

am17an · 2026-05-24T12:07:33Z

@thot-experiment can you create a debug build and see where it crashes?

ServeurpersoCom · 2026-06-08T06:46:22Z

@forforever73 Great, 0.80 acceptance! Do you actually beat your no-MTP tok/s with it, and does your CoT stay coherent or ever switch to Chinese like mine? Asking because my Q2_K_XL setup isn't memory-bandwidth representative, so the MTP win gets eaten and it's not worth it on my end (96GB VRAM) Also, which target + draft GGUFs are you running exactly (vendor/repo)?

forforever73 · 2026-06-08T07:00:34Z

@ServeurpersoCom I use

-m step3.7-text-bf16.gguf \
-md Step3.7-flash-mtp-bf16.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 999 -ngld 999 \
-sm layer \
-fa on \
-ctk bf16 -ctv bf16 \
--no-mmap \
-b 4096 -ub 2048 \
-np 1 \
-t 32 \
--spec-type draft-mtp --spec-draft-n-max 1 \
--jinja

and the no-MTP result is

code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.2

Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 0,
"total_draft_accepted": 0,
"aggregate_accept_rate": null,
"wall_s_total": 27.79
}

Hmm, the H800 has way more than enough memory. Let me run it on the Spark real quick.

ServeurpersoCom · 2026-06-08T07:11:15Z

Nice machine! Full BF16 makes the decode heavy enough that MTP is a real win for you. On the Spark you'll have to quantize to fit and probably hit my problem, where the MTP gain gets eaten. Main thing I'm curious about: is your model reasoning clean on the monster machine, or does the CoT ever switch to Chinese like mine? This allows the patch to be validated!
Also I run a q8_0 KV cache (-ctk q8_0 -ctv q8_0): worth testing on your end too to give the patch more coverage, the quantized-KV path hits the Hadamard rotation that default f16 skips.

ggerganov · 2026-06-08T07:22:58Z

@ServeurpersoCom I think you have to use --spec-draft-n-max 1 because more than 1 was not implemented for this model.

ServeurpersoCom · 2026-06-08T07:28:33Z

> @ServeurpersoCom I think you have to use --spec-draft-n-max 1 because more than 1 was not implemented for this model.

Good catch! I ran a test just now:
No more Chinese in my reasoning, 160.33 t/s for MTP and No-MTP

forforever73 · 2026-06-08T07:42:44Z

@ServeurpersoCom On the Spark I use

-m Step-3.7-IQ4_XS.gguf \
    --spec-type draft-mtp \
    --spec-draft-model Step3.7-flash-mtp-Q8_0.gguf \
    -ngl all \
    --spec-draft-ngl all \
    -c 35000 \
    -np 1 \
    -b 2048 \
    -ub 1024 \
    --temp 0 \
    --spec-draft-n-max 1 \
    --spec-draft-p-min 0.6

no mtp

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.8
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.8
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  qa_factual         pred= 154 draft=   0 acc=   0 rate=n/a tok/s=27.9
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.8
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1690,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 76.51
}

with mtp

  code_python        pred= 192 draft=  91 acc=  85 rate=0.934 tok/s=32.4
  code_cpp           pred= 192 draft=  90 acc=  86 rate=0.956 tok/s=33.0
  explain_concept    pred= 192 draft=  91 acc=  81 rate=0.890 tok/s=31.6
  summarize          pred= 192 draft=  90 acc=  83 rate=0.922 tok/s=32.2
  qa_factual         pred= 157 draft=  74 acc=  72 rate=0.973 tok/s=34.0
  translation        pred= 192 draft=  87 acc=  83 rate=0.954 tok/s=32.6
  creative_short     pred= 192 draft=  74 acc=  71 rate=0.960 tok/s=29.7
  stepwise_math      pred= 192 draft=  85 acc=  84 rate=0.988 tok/s=32.6
  long_code_review   pred= 192 draft=  89 acc=  83 rate=0.933 tok/s=31.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1693,
  "total_draft": 771,
  "total_draft_accepted": 728,
  "aggregate_accept_rate": 0.9442,
  "wall_s_total": 57.89
}
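As a sanity check on the log above, the aggregate numbers can be recomputed from the per-prompt lines. The following is a minimal sketch, assuming the line layout shown in this comment; the regex and the `parse` helper are illustrative and are not part of mtp-bench.

```python
import re

# Per-prompt lines copied from the "with mtp" run above.
LOG = """\
  code_python        pred= 192 draft=  91 acc=  85 rate=0.934 tok/s=32.4
  code_cpp           pred= 192 draft=  90 acc=  86 rate=0.956 tok/s=33.0
  explain_concept    pred= 192 draft=  91 acc=  81 rate=0.890 tok/s=31.6
  summarize          pred= 192 draft=  90 acc=  83 rate=0.922 tok/s=32.2
  qa_factual         pred= 157 draft=  74 acc=  72 rate=0.973 tok/s=34.0
  translation        pred= 192 draft=  87 acc=  83 rate=0.954 tok/s=32.6
  creative_short     pred= 192 draft=  74 acc=  71 rate=0.960 tok/s=29.7
  stepwise_math      pred= 192 draft=  85 acc=  84 rate=0.988 tok/s=32.6
  long_code_review   pred= 192 draft=  89 acc=  83 rate=0.933 tok/s=31.1
"""

# Hypothetical pattern for one result line; the field order is taken from the log.
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+pred=\s*(?P<pred>\d+)\s+draft=\s*(?P<draft>\d+)"
    r"\s+acc=\s*(?P<acc>\d+)\s+rate=(?P<rate>\S+)\s+tok/s=(?P<tps>[\d.]+)\s*$"
)


def parse(log):
    """Return one dict per result line; lines that do not match are skipped."""
    rows = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            rows.append({
                "name": m["name"],
                "pred": int(m["pred"]),
                "draft": int(m["draft"]),
                "acc": int(m["acc"]),
                "tps": float(m["tps"]),
            })
    return rows


rows = parse(LOG)
total_pred = sum(r["pred"] for r in rows)
total_draft = sum(r["draft"] for r in rows)
total_acc = sum(r["acc"] for r in rows)
# The no-MTP run has no drafts, so guard the division.
accept_rate = round(total_acc / total_draft, 4) if total_draft else None

print(total_pred, total_draft, total_acc, accept_rate)
```

The recomputed totals agree with the Aggregate block of this run (1693 predicted, 771 drafted, 728 accepted, acceptance rate 0.9442).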

with mtp and -ctk q8_0 -ctv q8_0

  code_python        pred= 192 draft=  93 acc=  87 rate=0.935 tok/s=32.6
  code_cpp           pred= 192 draft=  90 acc=  75 rate=0.833 tok/s=30.3
  explain_concept    pred= 192 draft=  93 acc=  83 rate=0.892 tok/s=32.1
  summarize          pred= 192 draft=  84 acc=  77 rate=0.917 tok/s=30.7
  qa_factual         pred= 156 draft=  75 acc=  74 rate=0.987 tok/s=34.9
  translation        pred= 192 draft=  86 acc=  79 rate=0.919 tok/s=31.4
  creative_short     pred= 192 draft=  77 acc=  63 rate=0.818 tok/s=27.8
  stepwise_math      pred= 192 draft=  86 acc=  85 rate=0.988 tok/s=32.9
  long_code_review   pred= 192 draft=  84 acc=  80 rate=0.952 tok/s=30.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1692,
  "total_draft": 768,
  "total_draft_accepted": 703,
  "aggregate_accept_rate": 0.9154,
  "wall_s_total": 59.55
}

and the reasoning:

Currently --spec-draft-n-max 3 performs worse than 1. I'm working on supporting the step3.5 3-layer MTP, but due to a conflict with Gemma 4 it will still take some time.
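To put the three Spark runs side by side, here is a small sketch that turns the Aggregate blocks into end-to-end throughput and speedup figures. The run labels and helper names are made up for illustration. It also assumes that `wall_s_total` covers the whole request, so these ratios are not the same thing as the per-prompt decode tok/s.

```python
# Aggregates copied from the three Spark runs in this comment.
runs = {
    "no_mtp":    {"total_predicted": 1690, "wall_s_total": 76.51},
    "mtp":       {"total_predicted": 1693, "wall_s_total": 57.89},
    "mtp_q8_kv": {"total_predicted": 1692, "wall_s_total": 59.55},
}


def throughput(run):
    """Predicted tokens per wall-clock second over the whole benchmark."""
    return run["total_predicted"] / run["wall_s_total"]


def speedup(name, base="no_mtp"):
    """Throughput of a run relative to the baseline run."""
    return throughput(runs[name]) / throughput(runs[base])


for name in runs:
    print(f"{name:10s} {throughput(runs[name]):6.2f} tok/s  x{speedup(name):.2f}")
```

On these numbers MTP gives roughly a 1.3x end-to-end gain on the Spark, and switching the KV cache to q8_0 costs only a few percent of that gain.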

ServeurpersoCom · 2026-06-08T07:50:07Z

As ggerganov said, use --spec-draft-n-max 1 because more than 1 has not been implemented for this model.

I'm trying a combination of the two patches:

(root|~/llama.cpp.pascal) git diff
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index da7a92955..4cc4a4a16 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -567,7 +567,10 @@ void llm_graph_input_attn_kv_iswa::set_input(const llama_ubatch * ubatch) {
         mctx->get_base()->set_input_v_idxs(self_v_idxs, ubatch);
     }

-    mctx->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
+    // the kq mask guards on its own buffer: shared cells leave idxs unbacked while the mask stays live
+    if (self_kq_mask && self_kq_mask->buffer) {
+        mctx->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
+    }

     // swa tensors may not be allocated if there are no SWA attention layers
     if (self_k_idxs_swa && self_k_idxs_swa->buffer) {
@@ -575,7 +578,9 @@ void llm_graph_input_attn_kv_iswa::set_input(const llama_ubatch * ubatch) {
         mctx->get_swa()->set_input_v_idxs(self_v_idxs_swa, ubatch);
     }

-    mctx->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
+    if (self_kq_mask_swa && self_kq_mask_swa->buffer) {
+        mctx->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
+    }

     if (self_k_rot) {
         mctx->get_base()->set_input_k_rot(self_k_rot);
@@ -607,7 +612,9 @@ bool llm_graph_input_attn_kv_iswa::can_reuse(const llm_graph_params & params) {
       //res &= self_v_idxs->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
     }

-    res &= can_reuse_kq_mask(self_kq_mask, mctx->get_base(), params.ubatch, params.cparams);
+    if (self_kq_mask && self_kq_mask->buffer) {
+        res &= can_reuse_kq_mask(self_kq_mask, mctx->get_base(), params.ubatch, params.cparams);
+    }

     // swa tensors may not be allocated if there are no SWA attention layers
     if (self_k_idxs_swa && self_k_idxs_swa->buffer) {
@@ -615,7 +622,9 @@ bool llm_graph_input_attn_kv_iswa::can_reuse(const llm_graph_params & params) {
       //res &= self_v_idxs_swa->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
     }

-    res &= can_reuse_kq_mask(self_kq_mask_swa, mctx->get_swa(), params.ubatch, params.cparams);
+    if (self_kq_mask_swa && self_kq_mask_swa->buffer) {
+        res &= can_reuse_kq_mask(self_kq_mask_swa, mctx->get_swa(), params.ubatch, params.cparams);
+    }

     return res;
 }

ServeurpersoCom · 2026-06-08T07:57:12Z

All working. The combined patch is the cleanest, @ggerganov: keep your can_reuse guards, and use the mask's own buffer for set_input. Your base guard keyed off self_k_idxs_swa, which is allocated for a SWA-only draft head (StepFun's MTP head is SWA-only), so it still wrote the null base mask and crashed at load. Guarding each mask on its own buffer covers both cases, at all 4 sites. @forforever73, you can try this latest version; it should work.
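The difference between the two guards can be illustrated with a toy model. This is a Python stand-in, not llama.cpp code: `Tensor`, `fill_mask` and both `set_input_*` functions are hypothetical, and the unallocated write is modelled as an exception rather than a crash.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tensor:
    """Stand-in for a graph input tensor; buffer is None when never allocated."""
    name: str
    buffer: Optional[bytearray] = None


def fill_mask(mask: Tensor) -> None:
    # Writing to an unallocated tensor is the failure mode described above.
    if mask.buffer is None:
        raise RuntimeError(f"write to unallocated tensor {mask.name}")
    mask.buffer[:] = b"\x01" * len(mask.buffer)


def set_input_guard_on_idxs(k_idxs_swa: Tensor, kq_mask: Tensor) -> None:
    # Guard keyed on a *different* tensor: it passes whenever the SWA
    # indices are allocated, even if the mask itself is not.
    if k_idxs_swa is not None and k_idxs_swa.buffer is not None:
        fill_mask(kq_mask)


def set_input_guard_on_own_buffer(kq_mask: Tensor) -> None:
    # Guard keyed on the mask's own buffer: the write is skipped exactly
    # when there is nothing to write into.
    if kq_mask is not None and kq_mask.buffer is not None:
        fill_mask(kq_mask)


# SWA-only draft head: SWA indices are allocated, the base mask is not.
swa_idxs = Tensor("self_k_idxs_swa", bytearray(4))
base_mask = Tensor("self_kq_mask", None)

try:
    set_input_guard_on_idxs(swa_idxs, base_mask)
    crashed = False
except RuntimeError:
    crashed = True

set_input_guard_on_own_buffer(base_mask)  # skipped, no error
print("guard on idxs crashed:", crashed)
```

The own-buffer guard still fills a mask that is allocated, so the regular (non-SWA-only) path behaves as before.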

forforever73 · 2026-06-08T08:12:37Z

@ServeurpersoCom Yes, it works as well.

vbooka1 · 2026-06-08T10:00:22Z

I confirm that patch from #23398 (comment) fixes StepFun 3.7 MTP

ServeurpersoCom · 2026-06-08T10:06:32Z

> I confirm that patch from #23398 (comment) fixes StepFun 3.7 MTP

Thanks for testing. Just 2 more runners #24294 and it'll merge :)

(cherry picked from commit 04eb4c4)

Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb` through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the refactor while keeping the fork's n_layer_kv_from_start; restore the swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl / nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum, talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip). Kernels (ggml-cuda / ggml-metal) untouched.

Add support for gemma4-assistant models as MTP (Multi-Token Prediction) draft heads for speculative decoding with gemma4 target models.

## Key Features

### Automatic Assistant Detection
- Detect gemma4-assistant models via 'gemma4.assistant.type = mtp' metadata
- Automatically route to gemma4-assistant implementation even when GGUF declares 'general.architecture = gemma4'
- Read 'gemma4.assistant.backbone_hidden_size' to get target model's hidden size

### Architecture Alignment with Upstream
- Rename LLM_ARCH_GEMMA4_MTP to LLM_ARCH_GEMMA4_ASSISTANT
- Rename gemma4_mtp.cpp to gemma4-assistant.cpp
- Add ctx_other integration for shared memory between target and assistant
- Align layer counting with upstream (n_layer_all vs n_layer)

### Tensor Support
- Add LLM_TENSOR_ASSISTANT_PRE_PROJ and LLM_TENSOR_ASSISTANT_POST_PROJ
- Map 'assistant.pre_projection' and 'assistant.post_projection' tensor names
- Make rope_freqs optional (assistant GGUFs don't include this tensor)
- Fix layer_output_scale tensor name (remove 'weight' suffix)
- Add optional MTP projection tensors to gemma4.cpp

### Layer Counting Alignment
- Use n_layer_all for iterating all layers in assistant models
- Use n_layer() for regular layers (n_layer_all - n_layer_nextn)
- Assistant models have n_layer() = 0 (all layers are nextn layers)

### Stride and Dimension Fixes
- Use n_embd_out() for stride in output_reorder()
- Use target's n_embd_out for k==0 nextn fallback
- Add embeddings_pre_norm to allow_reuse() check

## Testing

Assistant model loads successfully:
- gemma-4-E2B-it-assistant-BF16.gguf: ✓ Loads (requires ctx_other for inference)
- Architecture detection: ✓ Automatically detects as gemma4-assistant
- Tensor loading: ✓ All 48 tensors found

Note: Full MTP speculative decoding requires a working target model. The gemma4 target models in our test environment have separate tensor count issues unrelated to this PR.

## Usage

```bash
./llama-server \
  -m target-gemma4.gguf \
  -md assistant-gemma4.gguf \
  --spec-type mtp \
  --draft-block-size 3 \
  --draft-max 8
```

## Files Changed
- 147 files modified
- +659 insertions, -455 deletions
- New: src/models/gemma4-assistant.cpp
- Deleted: src/models/gemma4_mtp.cpp

## References
- Upstream PR: ggml-org#23398
- Model card: https://huggingface.co/google/gemma-4-E2B-it-assistant
- GGUF repo: https://huggingface.co/AtomicChat/gemma-4-E2B-it-assistant-GGUF

Assisted-by: opencode

Pulls in ggml-org/llama.cpp#23398 (gemma4-assistant draft arch) and ggml-org/llama.cpp#24282 (E2B/E4B assistants). Without this, loading any mtp-gemma-4-*.gguf drafter fails with: unknown model architecture 'gemma4-assistant'. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The Backend & quantization table omitted two HT-specific speculative decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for partial-accept feature extraction): landed via PR #62 (b0daec5), integrates the z-lab DFlash block-diffusion drafter against Gemma4 31B targets.
- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp): vendored via PR #93 (4c09765) ahead of upstream PR ggml-org#23398 merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and flows through a normal master sync.

Found during a §7 documentation freshness sweep: the inventory exists to be authoritative ("consult it before assuming a behaviour is upstream stock" per AGENTS.md), so omissions defeat the purpose. Docs-only, no code touched.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>

github-actions Bot added model Model specific examples python python script changes server labels May 20, 2026

am17an force-pushed the gemma4-mtp branch from cd2e5b2 to a03120c Compare May 20, 2026 16:28

mixa3607 added a commit to mixa3607/ML-gfx906 that referenced this pull request May 20, 2026

llamacpp: ggml-org/llama.cpp#23398

7343f0e

aldehir mentioned this pull request May 21, 2026

Eval bug: Model type gemma4_assistant not supported #23161

Open

am17an force-pushed the gemma4-mtp branch from a03120c to 4b1d1ae Compare May 23, 2026 07:01

ServeurpersoCom mentioned this pull request Jun 8, 2026

graph: guard iswa kq_mask on its own buffer #24294

Merged

localai-bot mentioned this pull request Jun 8, 2026

feat(gallery): add Gemma 4 QAT family + MTP speculative-decoding pairs mudler/LocalAI#10215

Merged

mann1x mentioned this pull request Jun 8, 2026

mtp: support for gemma-4 E2B and E4B assistants #24282

Merged

deadprogram mentioned this pull request Jun 8, 2026

context: add new parent context for MTP support added in llama.cpp hybridgroup/yzma#257

Merged

TheTom mentioned this pull request Jun 8, 2026

Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ TheTom/llama-cpp-turboquant#172

Merged

TheTom pushed a commit to TheTom/llama-cpp-turboquant that referenced this pull request Jun 8, 2026

llama : add Gemma4 MTP (ggml-org#23398)

d1e70aa

(cherry picked from commit 04eb4c4)

so-dimm mentioned this pull request Jun 9, 2026

Eval bug: llama.cpp-b9568/ggml/src/ggml-cuda/fattn.cu:579: fatal error #24324

Open

xhochy mentioned this pull request Jun 9, 2026

llama.cpp 9574 conda-forge/llama.cpp-feedstock#106

Merged

TheLonelyDevil9 mentioned this pull request Jun 9, 2026

[BUG?] (Kobold v1.113) - Was MTP integrated from upstream? LostRuins/koboldcpp#2211

Open

turbo-tan mentioned this pull request Jun 10, 2026

feat: add gemma4-assistant MTP speculative decoding with automatic model detection turbo-tan/llama.cpp-tq3#27

Merged

kostich mentioned this pull request Jun 11, 2026

Eval bug: Gemma 4 31B MTP (draft-mtp) crashes on Vulkan backend, pre-allocated tensor cannot run operation NONE #24492

Open

sammcj mentioned this pull request Jun 12, 2026

Gemma 4 MTP drafter (gemma4-assistant) is indexed but never offered in the speculative decoding draft model dropdown on macOS lmstudio-ai/lmstudio-bug-tracker#2044

Open

ggerganov mentioned this pull request Jun 12, 2026

server : unify mtmd image processing with post-decode callback #24520

Draft

1 task

Blueforcer mentioned this pull request Jun 12, 2026

Gemma 4 MTP draft models fail to load: unknown model architecture 'gemma4-assistant' xorbitsai/xllamacpp#157

Open

marksverdhei mentioned this pull request Jun 13, 2026

chore(sync): upstream master → ht (114-commit sync 2026-06-13) heiervang-technologies/ht-llama.cpp#107

Closed

BootsSiR commented May 20, 2026

No MTP

llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -c 16384

MTP Enabled

llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -md ~/ai-models/mtp/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --device-draft CUDA1
