llama : add Gemma4 MTP #23398

Merged
am17an merged 30 commits into ggml-org:master from am17an:gemma4-mtp
Jun 7, 2026

Conversation

@am17an

@am17an am17an commented May 20, 2026

Contributor

Overview

This PR adds MTP (multi-token prediction) support for Gemma 4 models. For the MoE model I don't observe a speed-up on my system, but the dense model has on average >2x speedup. Correctness-wise, I am able to replicate the AIME-26 (~87%) results advertised by the Gemma team. This works for the 31B and 26B-4B variants, but not for the E4B and E2B variants for now.

Note

Multi-GPU works but you may need to specify --spec-draft-device with -sm layer

Additional information

Performance on mtp-bench on a DGX Spark 🧵

No MTP

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.1
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.2
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.0
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=5.9
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=5.9
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.2
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.2
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.0
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 290.01
}

--spec-draft-n-max 4

  code_python        pred= 192 draft= 231 acc= 133 rate=0.576 tok/s=14.9
  code_cpp           pred= 192 draft= 197 acc= 141 rate=0.716 tok/s=18.0
  explain_concept    pred= 192 draft= 268 acc= 123 rate=0.459 tok/s=12.9
  summarize          pred= 192 draft= 208 acc= 138 rate=0.663 tok/s=16.2
  qa_factual         pred= 192 draft= 211 acc= 138 rate=0.654 tok/s=16.4
  translation        pred= 192 draft= 235 acc= 131 rate=0.557 tok/s=14.6
  creative_short     pred= 192 draft= 292 acc= 117 rate=0.401 tok/s=11.4
  stepwise_math      pred= 192 draft= 180 acc= 146 rate=0.811 tok/s=19.3
  long_code_review   pred= 192 draft= 222 acc= 135 rate=0.608 tok/s=14.9

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2044,
  "total_draft_accepted": 1202,
  "aggregate_accept_rate": 0.5881,
  "wall_s_total": 120.65
}
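
As a quick sanity check on the two aggregate blocks above, the overall speedup and acceptance rate can be recomputed from the reported totals. This is only a minimal sketch: the dictionary names and the `summarize` helper are illustrative and not part of `mtp-bench`, and the speedup is taken as the ratio of total wall-clock times, which also includes prompt processing.

```python
# Aggregates copied from the two runs above (DGX Spark).
no_mtp = {"total_predicted": 1728, "total_draft": 0,
          "total_draft_accepted": 0, "wall_s_total": 290.01}
mtp_n4 = {"total_predicted": 1728, "total_draft": 2044,
          "total_draft_accepted": 1202, "wall_s_total": 120.65}


def summarize(baseline, candidate):
    """Return (wall-clock speedup, draft acceptance rate or None)."""
    speedup = baseline["wall_s_total"] / candidate["wall_s_total"]
    drafted = candidate["total_draft"]
    # Without drafting there is no acceptance rate to report.
    accept = candidate["total_draft_accepted"] / drafted if drafted else None
    return speedup, accept


speedup, accept = summarize(no_mtp, mtp_n4)
print(f"speedup: {speedup:.2f}x, acceptance: {accept:.4f}")
# prints "speedup: 2.40x, acceptance: 0.5881"
```

The recomputed acceptance rate matches the reported `aggregate_accept_rate`, and the wall-clock ratio is consistent with the ">2x" figure in the overview.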

How to use

If you have lots of VRAM

llama-server -hf am17an/Gemma4-31B-it-GGUF --spec-type draft-mtp --spec-draft-n-max 4

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for mainly adding code to share the kv-cache and testing against the transformers implementation.

@github-actions github-actions Bot added the model (Model specific), examples, python (python script changes), and server labels May 20, 2026
@fabriciomalta

fabriciomalta commented May 20, 2026

Thank you. My tests on dual RTX 3080s (20 GB each) show a decrease in performance. Logs follow:

Setup: Gemma4-31B-Q8_0 (the same file as in your HF repo).
Full logs: https://pastebin.com/DRjGrZ9R
Without MTP:

  • ~19.3 t/s

With MTP enabled, performance is the same with a draft size of 1, 2, 3, or 4 (--spec-type draft-mtp --spec-draft-n-max 2):

  • ~9.3 t/s

The logs show 0 draft acceptance:

draft acceptance = 0.00000 (0 accepted / 1090 generated)
#gen tokens = 1090, #acc tokens = 0

So speculative decoding appears to be active, but all draft tokens are rejected, resulting in a significant performance decrease instead of acceleration.

Commands used:

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --no-warmup
./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf --model-draft mtp-gemma-4-31B-it.gguf -c 32768 -fa on -ngl 999 -ctk q8_0 -ctv q8_0 --spec-type draft-mtp --spec-draft-n-max 2 --no-warmup

@BootsSiR

I did a few quick tests with my system. MTP was actually slightly slower for me. I assume it's because of my hardware setup.

The prompt was 52 tokens, asking it to code an HTML animation for me.

Hardware:

0.00.237.123 I device_info:
0.00.303.668 I   - CUDA0   : NVIDIA GeForce RTX 5090 (32108 MiB, 29101 MiB free)
0.00.380.610 I   - CUDA1   : NVIDIA GeForce RTX 4090 (24082 MiB, 23671 MiB free)

Without MTP:

1.51.859.646 I slot print_timing: id  3 | task 0 | prompt eval time =      68.32 ms /    52 tokens (    1.31 ms per token,   761.16 tokens per second)
1.51.859.648 I slot print_timing: id  3 | task 0 |        eval time =   96783.23 ms /  3114 tokens (   31.08 ms per token,    32.17 tokens per second)
1.51.859.649 I slot print_timing: id  3 | task 0 |       total time =   96851.55 ms /  3166 tokens
1.51.859.653 I slot print_timing: id  3 | task 0 |    graphs reused =       3101
1.51.859.672 I slot      release: id  3 | task 0 | stop processing: n_tokens = 3165, truncated = 0

With MTP:

2.26.014.320 I slot print_timing: id  3 | task 0 | prompt eval time =     111.03 ms /    52 tokens (    2.14 ms per token,   468.34 tokens per second)
2.26.014.322 I slot print_timing: id  3 | task 0 |        eval time =  114817.54 ms /  3308 tokens (   34.71 ms per token,    28.81 tokens per second)
2.26.014.323 I slot print_timing: id  3 | task 0 |       total time =  114928.57 ms /  3360 tokens
2.26.014.326 I slot print_timing: id  3 | task 0 |    graphs reused =       1015
2.26.014.327 I slot print_timing: id  3 | task 0 | draft acceptance = 0.55447 ( 2280 accepted /  4112 generated)
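
For anyone comparing runs from logs like these, the acceptance line can be parsed and cross-checked against its own counts. The sketch below assumes the `draft acceptance = ...` line format seen in this thread (it may differ between builds); `DRAFT_RE` and `parse_draft_acceptance` are hypothetical names, not part of llama.cpp.

```python
import re

# Matches lines such as:
#   ... | draft acceptance = 0.55447 ( 2280 accepted /  4112 generated)
DRAFT_RE = re.compile(
    r"draft acceptance\s*=\s*([0-9.]+)\s*"
    r"\(\s*(\d+)\s+accepted\s*/\s*(\d+)\s+generated\)"
)


def parse_draft_acceptance(line):
    """Return (rate, accepted, generated), or None if the line does not match."""
    m = DRAFT_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


line = ("2.26.014.327 I slot print_timing: id  3 | task 0 | "
        "draft acceptance = 0.55447 ( 2280 accepted /  4112 generated)")
rate, accepted, generated = parse_draft_acceptance(line)

# Recompute the rate from the counts as a consistency check.
recomputed = accepted / generated
print(rate, accepted, generated, round(recomputed, 5))
```

The same helper also handles the zero-acceptance line reported earlier in the thread, which is a useful signal that something is wrong rather than merely slow.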

@am17an

am17an commented May 20, 2026

Contributor Author

Multi GPU is currently broken, I will push a fix in a bit.

@BootsSiR

Quoting @am17an: "Multi GPU is currently broken, I will push a fix in a bit."

That explains it. I'll rerun my test when you push a fix.

@IIIIIllllIIIIIlllll

Thank you for your work! Here are my test results (I had to use Qwen3.6-35B-A3B to translate them).

Compared to the other two commenters, my test results were quite surprising.


Environment:

  • Hardware: 2x NVIDIA GeForce RTX 3090 (Tensor Parallel)
  • Model: gemma-4-31B-it-Q8_0.gguf (Q8_0 Quantization)
  • Input: 32,767 tokens (Random noise)

1. Baseline Test (No Speculative Decoding)

Launch Command:

llama-server -m /mnt/disk_2t/Models/gemma-4-31B-it-Q8_0/gemma-4-31B-it-Q8_0.gguf --ctx-size 65536 --flash-attn on --no-mmap --cache-ram 32768 --fit on --temp 1 --samplers top_k;top_p;temperature --top-p 0.95 --top-k 64 --ctx-checkpoints 1 --split-mode tensor --batch-size 2048 --ubatch-size 512 --parallel 1 --threads -1 --seed -1 -dio

Log Output:

1.03.338.149 I slot print_timing: id  0 | task 1 | prompt eval time =   27857.24 ms / 32767 tokens (    0.85 ms per token,  1176.25 tokens per second)
1.03.338.152 I slot print_timing: id  0 | task 1 |        eval time =    7167.21 ms /   256 tokens (   28.00 ms per token,    35.72 tokens per second)

Metrics:

  • Prompt Eval: 1,176.25 tok/s
  • Decode (256 tokens): 35.72 tok/s

2. Draft-MTP Test (With Speculative Decoding)

Draft Model: /home/mark/MTP/mtp-gemma-4-31B-it.gguf

Launch Command:

llama-server -m /mnt/disk_2t/Models/gemma-4-31B-it-Q8_0/gemma-4-31B-it-Q8_0.gguf --ctx-size 65536 --spec-type draft-mtp --flash-attn on --spec-draft-n-max 4 --no-mmap --cache-ram 32768 --fit on --spec-draft-model /home/mark/MTP/mtp-gemma-4-31B-it.gguf --temp 1 --samplers top_k;top_p;temperature --top-p 0.95 --top-k 64 --ctx-checkpoints 1 --split-mode tensor --batch-size 2048 --ubatch-size 512 --parallel 1 --threads -1 --seed -1 -dio

Log Output:

3.44.872.979 I slot print_timing: id  0 | task 554 | prompt eval time =   28147.80 ms / 32767 tokens (    0.86 ms per token,  1164.11 tokens per second)
3.44.872.982 I slot print_timing: id  0 | task 554 |        eval time =    4106.41 ms /   256 tokens (   16.04 ms per token,    62.34 tokens per second)
3.44.872.984 I slot print_timing: id  0 | task 554 | draft acceptance = 0.43902 (  162 accepted /   369 generated)
3.44.872.994 I statistics        draft-mtp: #calls(b,g,a) =    4    637    637, #gen drafts =    637, #acc drafts =   413, #gen tokens =   2545, #acc tokens =  1111, dur(b,g,a) = 0.006,  9834.855, 0.569 ms

Metrics:

  • Prompt Eval: 1,164.11 tok/s
  • Decode (256 tokens): 62.34 tok/s
  • Draft Acceptance Rate: 43.9% (162 accepted / 369 generated)

3. Comparison Summary

| Metric | Baseline (No MTP) | With Draft-MTP | Improvement |
|---|---|---|---|
| Prompt Throughput | 1,176.25 tok/s | 1,164.11 tok/s | ~ -1% (Negligible) |
| Decode Throughput | 35.72 tok/s | 62.34 tok/s | +74.5% Speedup |
| Decode Latency | 28.00 ms/tok | 16.04 ms/tok | Significant Reduction |
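
One way to read the acceptance numbers in this comment: under the assumption (not stated in the thread) that every verification pass of the target model emits its accepted draft tokens plus one token of its own, the number of target passes and the average tokens per pass follow directly from the decoded and accepted counts. The helper name `tokens_per_target_pass` is illustrative only.

```python
def tokens_per_target_pass(n_decoded, n_accepted):
    """Estimate target-model passes and mean tokens emitted per pass.

    Assumption: every pass emits (accepted drafts in that pass) + 1 token,
    so passes = n_decoded - n_accepted.
    """
    passes = n_decoded - n_accepted
    if passes <= 0:
        raise ValueError("accepted tokens must be fewer than decoded tokens")
    return passes, n_decoded / passes


# Values from the Draft-MTP log above: 256 decoded, 162 accepted, 369 drafted.
passes, per_pass = tokens_per_target_pass(256, 162)
drafted_per_pass = 369 / passes
print(passes, round(per_pass, 2), round(drafted_per_pass, 2))
# prints "94 2.72 3.93"
```

About 3.9 drafted tokens per pass is close to `--spec-draft-n-max 4`, which is at least consistent with the assumption. The measured decode speedup (about 1.75x) is lower than the roughly 2.7 tokens per pass, presumably because drafting and verifying a longer batch are not free.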

@am17an

am17an commented May 20, 2026

Contributor Author

@BootsSiR for me on 1x4090, 1x5090 on this test https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090

MTP: "wall_s_total": 18.23
no-MTP: "wall_s_total": 47.13

You may need to specify --spec-device-draft

@am17an

am17an commented May 20, 2026

Contributor Author

@fabriciomalta I think you may have a wrong file; a 0% acceptance rate is highly unusual. I couldn't replicate it.

@theDTV2

theDTV2 commented May 20, 2026

Replying to @fabriciomalta's report above (0% draft acceptance with -ctk q8_0 -ctv q8_0):

I have the same issue when I use Q8 cache quantization with Vulkan. If you turn it off, it works properly.
@am17an

@theo77186

Contributor

I can reproduce the 0% acceptance rate when the main model's KV cache is quantized to q8_0. With f16 KV cache, the acceptance rate seems normal. It seems quantizing the KV cache breaks it.

@am17an

am17an commented May 20, 2026

Contributor Author

Thanks, that's a real bug then. I will fix

@BootsSiR

BootsSiR commented May 20, 2026

Replying to @am17an's numbers above (MTP 18.23 s vs. no-MTP 47.13 s wall time on the gist test, with the suggestion to specify the draft device):

Tested with the latest code and that python test.

Device Info

CUDA0   : NVIDIA GeForce RTX 5090 (32108 MiB, 28786 MiB free)
CUDA1   : NVIDIA GeForce RTX 4090 (24082 MiB, 23671 MiB free)

No MTP

llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -c 16384

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.6
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.8
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.6
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.9
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.8
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=33.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 52.52
}

MTP Enabled

llama-server -m ~/ai-models/mtp/Gemma4-31B-Q8_0.gguf -md ~/ai-models/mtp/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --device-draft CUDA1

  code_python        pred= 192 draft= 207 acc= 139 rate=0.671 tok/s=94.7
  code_cpp           pred= 192 draft= 212 acc= 138 rate=0.651 tok/s=94.4
  explain_concept    pred= 192 draft= 255 acc= 127 rate=0.498 tok/s=78.0
  summarize          pred= 192 draft= 188 acc= 143 rate=0.761 tok/s=104.0
  qa_factual         pred= 192 draft= 221 acc= 135 rate=0.611 tok/s=89.5
  translation        pred= 192 draft= 226 acc= 133 rate=0.589 tok/s=86.9
  creative_short     pred= 192 draft= 272 acc= 122 rate=0.449 tok/s=72.6
  stepwise_math      pred= 192 draft= 201 acc= 140 rate=0.697 tok/s=97.6
  long_code_review   pred= 192 draft= 225 acc= 134 rate=0.596 tok/s=85.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2007,
  "total_draft_accepted": 1211,
  "aggregate_accept_rate": 0.6034,
  "wall_s_total": 21.05
}

👏

@fabriciomalta

@am17an Update: it is working now.

I deleted the directory and pulled again. The issue was the quantized KV cache. With -ctk q8_0 -ctv q8_0, Draft-MTP initialized but had 0% acceptance. After rebuilding the latest PR code and removing the Q8 KV cache, acceptance became normal.

Hardware:

  • 2x RTX 3080 20GB
  • Gemma4-31B-Q8_0
  • Draft model: mtp-gemma-4-31B-it.gguf

Working command:

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --flash-attn on --no-mmap --temp 1 --top-p 0.95 --top-k 64 --parallel 1 --batch-size 2048 --ubatch-size 512 -ngl 999 --device-draft CUDA1 --no-warmup

Result:

eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
#gen tokens = 792, #acc tokens = 472

So the previous 0% acceptance was caused by Q8 KV cache. With f16/default KV cache, Draft-MTP works correctly on my dual 3080 setup.

Additional confirmation: I re-tested with Q8 KV cache enabled again (-ctk q8_0 -ctv q8_0) using the same working setup.

Command:

./build-cuda/bin/llama-server -m Gemma4-31B-Q8_0.gguf -md mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --flash-attn on --no-mmap --temp 1 --top-p 0.95 --top-k 64 --parallel 1 --batch-size 2048 --ubatch-size 512 -ngl 999 --device-draft CUDA1 -ctk q8_0 -ctv q8_0 --no-warmup

With Q8 KV cache enabled, performance dropped again:

n_decoded = 100, tg = 14.61 t/s
n_decoded = 145, tg = 14.63 t/s
n_decoded = 189, tg = 14.63 t/s

Without Q8 KV cache, the same setup reached:

eval time = 13229.08 ms / 671 tokens (19.72 ms per token, 50.72 tokens per second)
draft acceptance = 0.59596 (472 accepted / 792 generated)
#gen tokens = 792, #acc tokens = 472

So this confirms the issue is related to Q8 KV cache. With default/f16 KV cache, Draft-MTP works correctly; with -ctk q8_0 -ctv q8_0, it degrades heavily / previously reached 0% acceptance.
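
The A/B comparison described here comes down to two launches that differ only in the KV cache flags. A minimal sketch of generating both command lines from one place, so nothing else drifts between runs, could look like this. The function name `build_server_args` is hypothetical; the flags are the ones used in the commands above, and the sketch only builds the argument lists without launching anything.

```python
def build_server_args(model, draft, n_max=4, kv_type=None):
    """Build a llama-server argument list for a Draft-MTP run.

    kv_type=None leaves the KV cache at its default; a value such as
    "q8_0" adds -ctk/-ctv as in the commands above.
    """
    args = ["llama-server", "-m", model, "-md", draft,
            "--spec-type", "draft-mtp", "--spec-draft-n-max", str(n_max)]
    if kv_type is not None:
        args += ["-ctk", kv_type, "-ctv", kv_type]
    return args


f16_run = build_server_args("Gemma4-31B-Q8_0.gguf", "mtp-gemma-4-31B-it.gguf")
q8_run = build_server_args("Gemma4-31B-Q8_0.gguf", "mtp-gemma-4-31B-it.gguf",
                           kv_type="q8_0")

# The two runs should differ only in the KV cache flags.
extra = [a for a in q8_run if a not in f16_run]
print(" ".join(f16_run))
print(" ".join(q8_run))
```

Either list can be passed to `subprocess.run` as-is; keeping every other option identical makes it easier to attribute a change in acceptance rate to the KV cache type alone.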

mixa3607 added a commit to mixa3607/ML-gfx906 that referenced this pull request May 20, 2026
@exander77

Strix Halo:

$ llama-server -m models/Gemma4-31B-Q8_0.gguf --port 18080

$ python3 scripts/mtp-bench.py --url http://127.0.0.1:18080
  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.6
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.6
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.6
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  creative_short     pred=  36 draft=   0 acc=   0 rate=n/a tok/s=6.8
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.7
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=6.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1572,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 244.6
}
$ llama-server -m models/Gemma4-31B-Q8_0.gguf -md models/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 4 --port 18080

$ python3 scripts/mtp-bench.py --url http://127.0.0.1:18080
  code_python        pred= 192 draft= 209 acc= 138 rate=0.660 tok/s=17.7
  code_cpp           pred= 192 draft= 167 acc= 149 rate=0.892 tok/s=22.1
  explain_concept    pred= 192 draft= 188 acc= 144 rate=0.766 tok/s=19.9
  summarize          pred= 192 draft= 180 acc= 145 rate=0.806 tok/s=20.3
  qa_factual         pred= 192 draft= 169 acc= 148 rate=0.876 tok/s=21.7
  translation        pred= 192 draft= 327 acc= 107 rate=0.327 tok/s=11.2
  creative_short     pred=  36 draft=  68 acc=  19 rate=0.279 tok/s=10.4
  stepwise_math      pred= 192 draft= 192 acc= 142 rate=0.740 tok/s=19.1
  long_code_review   pred= 192 draft= 317 acc= 111 rate=0.350 tok/s=11.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1572,
  "total_draft": 1817,
  "total_draft_accepted": 1103,
  "aggregate_accept_rate": 0.607,
  "wall_s_total": 103.42
}

Best results for me:

$ llama-server -m models/gemma-4-31B-it-Q4_K_M.gguf -md models/mtp-gemma-4-31B-it.gguf -c 16384 --spec-type draft-mtp --spec-draft-n-max 3 --port 18080

  code_python        pred= 192 draft= 177 acc= 132 rate=0.746 tok/s=26.5
  code_cpp           pred= 192 draft= 201 acc= 124 rate=0.617 tok/s=22.0
  explain_concept    pred= 192 draft= 222 acc= 116 rate=0.522 tok/s=19.7
  summarize          pred= 192 draft= 156 acc= 138 rate=0.885 tok/s=27.9
  qa_factual         pred= 192 draft= 188 acc= 128 rate=0.681 tok/s=23.5
  translation        pred= 192 draft= 180 acc= 130 rate=0.722 tok/s=24.3
  creative_short     pred=  36 draft=  54 acc=  18 rate=0.333 tok/s=15.5
  stepwise_math      pred= 192 draft= 151 acc= 140 rate=0.927 tok/s=28.9
  long_code_review   pred= 192 draft= 276 acc=  97 rate=0.351 tok/s=14.8

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1572,
  "total_draft": 1605,
  "total_draft_accepted": 1023,
  "aggregate_accept_rate": 0.6374,
  "wall_s_total": 78.44
}

Q4_K_M with --spec-draft-n-max 3 seems to be pretty fast.

@aldehir

aldehir commented May 21, 2026

Contributor

My earlier results were scuffed. This should be more representative for this hardware. Looks good!

[Figure: throughput]
CUDA0   : NVIDIA RTX PRO 6000 Blackwell Server Edition (97249 MiB, 96691 MiB free)
CPU     : AMD EPYC 9355 32-Core Processor (1547705 MiB, 1547705 MiB free)
Detailed Results

No MTP (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=39.9
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.0
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.0
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.1
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.2
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.2
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.2
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=40.1
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=39.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 44.51
}

MTP --spec-draft-n-max 2 (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=72.8
  code_cpp           pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=74.5
  explain_concept    pred= 192 draft= 171 acc= 104 rate=0.608 tok/s=66.2
  summarize          pred= 192 draft= 149 acc= 115 rate=0.772 tok/s=76.0
  qa_factual         pred= 192 draft= 158 acc= 111 rate=0.703 tok/s=72.3
  translation        pred= 192 draft= 150 acc= 115 rate=0.767 tok/s=76.1
  creative_short     pred= 192 draft= 184 acc=  99 rate=0.538 tok/s=62.9
  stepwise_math      pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=77.0
  long_code_review   pred= 192 draft= 161 acc= 110 rate=0.683 tok/s=69.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1430,
  "total_draft_accepted": 997,
  "aggregate_accept_rate": 0.6972,
  "wall_s_total": 25.6
}

MTP --spec-draft-n-max 3 (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=83.1
  code_cpp           pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=87.8
  explain_concept    pred= 192 draft= 237 acc= 111 rate=0.468 tok/s=66.4
  summarize          pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=86.5
  qa_factual         pred= 192 draft= 191 acc= 127 rate=0.665 tok/s=82.8
  translation        pred= 192 draft= 186 acc= 129 rate=0.694 tok/s=85.8
  creative_short     pred= 192 draft= 238 acc= 110 rate=0.462 tok/s=66.0
  stepwise_math      pred= 192 draft= 174 acc= 132 rate=0.759 tok/s=90.2
  long_code_review   pred= 192 draft= 230 acc= 113 rate=0.491 tok/s=66.7

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1803,
  "total_draft_accepted": 1111,
  "aggregate_accept_rate": 0.6162,
  "wall_s_total": 23.58
}

MTP --spec-draft-n-max 4 (Q8)

./llama-server -m ../Gemma4-31B-Q8_0.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 4

  code_python        pred= 192 draft= 220 acc= 135 rate=0.614 tok/s=84.5
  code_cpp           pred= 192 draft= 208 acc= 139 rate=0.668 tok/s=91.8
  explain_concept    pred= 192 draft= 285 acc= 118 rate=0.414 tok/s=65.7
  summarize          pred= 192 draft= 190 acc= 143 rate=0.753 tok/s=98.9
  qa_factual         pred= 192 draft= 211 acc= 138 rate=0.654 tok/s=89.8
  translation        pred= 192 draft= 230 acc= 132 rate=0.574 tok/s=81.4
  creative_short     pred= 192 draft= 287 acc= 119 rate=0.415 tok/s=66.7
  stepwise_math      pred= 192 draft= 207 acc= 139 rate=0.671 tok/s=91.7
  long_code_review   pred= 192 draft= 235 acc= 132 rate=0.562 tok/s=79.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2073,
  "total_draft_accepted": 1195,
  "aggregate_accept_rate": 0.5765,
  "wall_s_total": 22.62
}

No MTP (Q4)

./llama-server -m ../gemma4-31B-it-Q4_K_M.gguf -np 1

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=61.8
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=61.9
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.1
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.1
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.4
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.6
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.8
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=62.5
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=61.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 29.09
}

MTP --spec-draft-n-max 2 (Q4)

./llama-server -m ../gemma-4-31B-it-Q4_0-mtp.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 166 acc= 107 rate=0.645 tok/s=93.6
  code_cpp           pred= 192 draft= 165 acc= 108 rate=0.654 tok/s=95.5
  explain_concept    pred= 192 draft= 158 acc= 111 rate=0.703 tok/s=99.3
  summarize          pred= 192 draft= 154 acc= 113 rate=0.734 tok/s=101.5
  qa_factual         pred= 192 draft= 154 acc= 114 rate=0.740 tok/s=102.8
  translation        pred= 192 draft= 169 acc= 105 rate=0.621 tok/s=92.4
  creative_short     pred= 192 draft= 184 acc=  99 rate=0.538 tok/s=86.3
  stepwise_math      pred= 192 draft= 150 acc= 116 rate=0.773 tok/s=105.4
  long_code_review   pred= 192 draft= 155 acc= 113 rate=0.729 tok/s=97.4

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1455,
  "total_draft_accepted": 986,
  "aggregate_accept_rate": 0.6777,
  "wall_s_total": 19.26
}

MTP --spec-draft-n-max 3 (Q4)

./llama-server -m ../gemma-4-31B-it-Q4_0-mtp.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 208 acc= 121 rate=0.582 tok/s=94.4
  code_cpp           pred= 192 draft= 190 acc= 127 rate=0.668 tok/s=104.6
  explain_concept    pred= 192 draft= 199 acc= 123 rate=0.618 tok/s=99.2
  summarize          pred= 192 draft= 189 acc= 128 rate=0.677 tok/s=106.0
  qa_factual         pred= 192 draft= 184 acc= 129 rate=0.701 tok/s=107.9
  translation        pred= 192 draft= 183 acc= 128 rate=0.700 tok/s=106.7
  creative_short     pred= 192 draft= 238 acc= 111 rate=0.466 tok/s=83.8
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=111.3
  long_code_review   pred= 192 draft= 213 acc= 120 rate=0.563 tok/s=91.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1782,
  "total_draft_accepted": 1118,
  "aggregate_accept_rate": 0.6274,
  "wall_s_total": 18.72
}

MTP --spec-draft-n-max 4 (Q4)

./llama-server -m ../gemma-4-31B-it-Q4_0-mtp.gguf -np 1 --spec-draft-model ../mtp-gemma4-31B-it.gguf --spec-type draft-mtp --spec-draft-n-max 4

  code_python        pred= 192 draft= 232 acc= 133 rate=0.573 tok/s=95.0
  code_cpp           pred= 192 draft= 220 acc= 136 rate=0.618 tok/s=102.4
  explain_concept    pred= 192 draft= 244 acc= 130 rate=0.533 tok/s=93.0
  summarize          pred= 192 draft= 215 acc= 137 rate=0.637 tok/s=104.8
  qa_factual         pred= 192 draft= 225 acc= 133 rate=0.591 tok/s=98.9
  translation        pred= 192 draft= 198 acc= 140 rate=0.707 tok/s=112.3
  creative_short     pred= 192 draft= 290 acc= 116 rate=0.400 tok/s=76.9
  stepwise_math      pred= 192 draft= 188 acc= 144 rate=0.766 tok/s=120.0
  long_code_review   pred= 192 draft= 281 acc= 119 rate=0.423 tok/s=78.3

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 2093,
  "total_draft_accepted": 1188,
  "aggregate_accept_rate": 0.5676,
  "wall_s_total": 19.39
}

Summary

| Configuration | Accept rate | Wall (s) | Mean tok/s | Min tok/s | Max tok/s | Speedup vs. baseline |
|---|---|---|---|---|---|---|
| Q8 no MTP | n/a | 44.51 | 40.0 | 39.5 | 40.2 | 1.00x |
| Q8 MTP n=2 | 0.6972 | 25.60 | 71.9 | 62.9 | 77.0 | 1.74x |
| Q8 MTP n=3 | 0.6162 | 23.58 | 79.5 | 66.0 | 90.2 | 1.89x |
| Q8 MTP n=4 | 0.5765 | 22.62 | 83.3 | 65.7 | 98.9 | 1.97x |
| Q4 no MTP | n/a | 29.09 | 62.1 | 61.0 | 62.8 | 1.00x |
| Q4 MTP n=2 | 0.6777 | 19.26 | 97.1 | 86.3 | 105.4 | 1.51x |
| Q4 MTP n=3 | 0.6274 | 18.72 | 100.6 | 83.8 | 111.3 | 1.55x |
| Q4 MTP n=4 | 0.5676 | 19.39 | 98.0 | 76.9 | 120.0 | 1.50x |
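
The summary columns can be reproduced from the detailed results. The sketch below assumes that "Mean tok/s" is the arithmetic mean of the per-prompt rates and that the speedup is the ratio of total wall times against the no-MTP run at the same quantization; both assumptions match the reported numbers for the row checked here. The `summary_row` helper is illustrative.

```python
from statistics import mean

# Per-prompt tok/s for "MTP --spec-draft-n-max 2 (Q8)" from the detailed results.
q8_n2_tok_s = [72.8, 74.5, 66.2, 76.0, 72.3, 76.1, 62.9, 77.0, 69.3]
q8_baseline_wall_s = 44.51
q8_n2_wall_s = 25.60


def summary_row(tok_s, wall_s, baseline_wall_s):
    """Return mean/min/max tok/s and wall-clock speedup vs. the baseline."""
    return {
        "mean": round(mean(tok_s), 1),
        "min": min(tok_s),
        "max": max(tok_s),
        "speedup": round(baseline_wall_s / wall_s, 2),
    }


row = summary_row(q8_n2_tok_s, q8_n2_wall_s, q8_baseline_wall_s)
print(row)
# prints "{'mean': 71.9, 'min': 62.9, 'max': 77.0, 'speedup': 1.74}"
```

Note that the speedup is measured on wall time, so it also reflects prompt processing, which is why it can differ slightly from the ratio of mean decode rates.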

@ruixiang63

ruixiang63 commented May 21, 2026

Contributor

Quoting the PR description: "For the MoE model I don't observe a speed-up on my system, but the dense model has on average >2x speedup."

Thanks very much for this PR! The performance numbers with MTP on the dense model look great!

Regarding the MoE model, I also tried a related experiment with an Eagle3 checkpoint on DGX Spark, and it appears to provide some speedup there. This may be a useful reference point for understanding why MTP does not show the same speedup on the MoE model.

One possible explanation is that Eagle3 is more lightweight: it uses a single-layer transformer and incorporates d2t vocabulary mapping, which may reduce the draft-model overhead compared with MTP.
In addition, I found that tuning --spec-draft-p-min also helps improve the speedup.
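
To illustrate why a minimum-probability setting can help, here is a minimal sketch assuming that `--spec-draft-p-min` acts as a confidence threshold on the drafter's token probability (the thread does not spell out its semantics). The function `draft_until_uncertain` and the example tokens are hypothetical and do not reflect llama.cpp internals.

```python
def draft_until_uncertain(candidates, p_min, n_max):
    """Keep drafting tokens while the drafter is confident enough.

    candidates: iterable of (token, probability) pairs proposed in order.
    Drafting stops at the first token whose probability is below p_min,
    or once n_max tokens have been collected.
    """
    drafted = []
    for token, prob in candidates:
        if len(drafted) >= n_max or prob < p_min:
            break
        drafted.append(token)
    return drafted


# Hypothetical drafter output: confidence drops on the third token.
proposed = [("def", 0.95), (" main", 0.80), ("(", 0.40), (")", 0.90)]
print(draft_until_uncertain(proposed, p_min=0.75, n_max=4))  # ['def', ' main']
print(draft_until_uncertain(proposed, p_min=0.0, n_max=4))   # all four tokens
```

Cutting the draft short at a low-confidence token avoids spending drafter and verification time on tokens that are likely to be rejected, at the cost of occasionally missing a token that would have been accepted.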

A possible future direction could be to explore whether Eagle3 and MTP can be combined. MTP is generally strong across broad tasks because it is paired with target-model pretraining, while Eagle3 may be easier to adapt for domain-specific use cases, since users can train Eagle3 separately on their own customized datasets.

For reference, here are the Eagle3 performance numbers with Gemma4-A4B-26B (BF16) on DGX Spark:

  • Without Eagle3
  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=28.3
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 64.01
}
  • With Eagle3
  code_python        pred= 192 draft= 215 acc= 116 rate=0.539 tok/s=41.4
  code_cpp           pred= 192 draft= 181 acc= 102 rate=0.564 tok/s=38.6
  explain_concept    pred= 192 draft= 172 acc=  99 rate=0.576 tok/s=37.0
  summarize          pred= 192 draft= 211 acc= 119 rate=0.564 tok/s=42.3
  qa_factual         pred= 192 draft= 181 acc= 108 rate=0.597 tok/s=40.7
  translation        pred= 192 draft= 176 acc=  95 rate=0.540 tok/s=36.0
  creative_short     pred= 192 draft= 182 acc=  80 rate=0.440 tok/s=32.4
  stepwise_math      pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=44.2
  long_code_review   pred= 192 draft= 155 acc= 108 rate=0.697 tok/s=40.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1677,
  "total_draft_accepted": 948,
  "aggregate_accept_rate": 0.5653,
  "wall_s_total": 46.12
}

Details can be found in Eagle3 PR: #18039 (comment)
Q4_K_M quantized models are a bit worse: #18039 (comment)

@Handyfff

Why is the E4B/E2B not supported yet? Is it that different?

@am17an

am17an commented May 23, 2026

Contributor Author

Quantized KV cache should now work; it was missing the Hadamard rotation for Q.

@Handyfff it will be added later
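
The Hadamard fix mentioned above can be illustrated with a toy example. The thread only says that the rotation for Q was missing; the sketch assumes the quantized-KV path stores K in a rotated basis, in which case Q has to be rotated by the same orthonormal matrix for the attention scores to stay the same. It uses a normalized 4x4 Hadamard matrix in pure Python and is not the llama.cpp implementation.

```python
# Normalized 4x4 Hadamard matrix: H @ H.T is the identity.
H = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, -1, 1, -1),
    (1, 1, -1, -1),
    (1, -1, -1, 1),
)]


def rotate(v):
    """Apply the Hadamard rotation to a 4-dimensional vector."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, -1.0, 0.5]
k = [0.5, -1.0, 2.0, 1.0]

reference = dot(q, k)                 # attention score without any rotation
both = dot(rotate(q), rotate(k))      # Q and K rotated: score preserved
k_only = dot(q, rotate(k))            # only K rotated: score changes
print(reference, both, k_only)
```

The first two values agree while the third does not, which is the kind of mismatch that would make a drafter reading a rotated cache with an unrotated Q produce tokens the target never accepts.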

@thot-experiment

thot-experiment commented May 23, 2026


I'm hitting a hard crash when trying to use multi-GPU across a GV100 and a 5070Ti. It works on either card alone, but if I try to split the model residency between the cards I get a hard crash when the model finishes loading. -lv 99 reveals nothing worthwhile.

[gemma-4-31B-UD-Q4_K_XL-MTP]
sm = layer
device = CUDA1
spec-draft-device = CUDA1
chat-template-file = .\models\gemma_31b_fixed.jinja
model = .\models\google_gemma-4-31B-it-Q4_K_L.gguf
md = .\models\mtp-gemma-4-31B-it.gguf
spec-type = draft-mtp 
ngld = 99
spec-draft-n-max = 3
temp = 1.0
ctk = q8_0
ctv = q8_0
b = 4096
ub = 1024
top-k = 64
top-p = 0.95
ctx-size = 131072
ctx-checkpoints = 12

This works fine, and setting both to CUDA0 also works fine (with offloading to CPU). However, the ideal case, where the draft model sits on one GPU and the main model is split across both, doesn't work no matter what I try (built from 4b1d1ae this morning).

On the GV100 I get:

| Configuration | Performance |
|---|---|
| Baseline q4 model | 18 tok/s |
| MTP q8 (Drafter q4 model) | 35 tok/s |

I didn't do comparative testing on the 5070Ti, but with ngld 99 and ngl 27 I get about 15 tok/s, and about 7 tok/s on pure CPU. Overall amazing work; it really brings the best local model within usable reach of the average gamer. Really game changing.

EDIT: also, prefill dropped from ~900 tok/s (baseline) to ~700 tok/s.

@am17an

am17an commented May 24, 2026

Contributor Author

@thot-experiment can you create a debug build and see where it crashes?

@ServeurpersoCom

ServeurpersoCom commented Jun 8, 2026

Contributor

@forforever73 Great, 0.80 acceptance! Do you actually beat your no-MTP tok/s with it, and does your CoT stay coherent or ever switch to Chinese like mine? Asking because my Q2_K_XL setup isn't memory-bandwidth representative, so the MTP win gets eaten and it's not worth it on my end (96GB VRAM). Also, which target and draft GGUFs are you running exactly (vendor/repo)?

@forforever73

Contributor

@ServeurpersoCom I use

-m step3.7-text-bf16.gguf \
-md Step3.7-flash-mtp-bf16.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 999 -ngld 999 \
-sm layer \
-fa on \
-ctk bf16 -ctv bf16 \
--no-mmap \
-b 4096 -ub 2048 \
-np 1 \
-t 32 \
--spec-type draft-mtp --spec-draft-n-max 1 \
--jinja \

and the no-MTP result is

code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
qa_factual         pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.9
creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.8
long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=69.2

Aggregate: {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 0,
"total_draft_accepted": 0,
"aggregate_accept_rate": null,
"wall_s_total": 27.79
}

Hmm, the H800 has way more than enough memory. Let me run it on the Spark real quick.

@ServeurpersoCom

ServeurpersoCom commented Jun 8, 2026

Contributor

Nice machine! Full BF16 makes the decode heavy enough that MTP is a real win for you. On the Spark you'll have to quantize to fit and probably hit my problem, where the MTP gain gets eaten. Main thing I'm curious about: is your model reasoning clean on the monster machine, or does the CoT ever switch to Chinese like mine? This allows the patch to be validated!
Also I run a q8_0 KV cache (-ctk q8_0 -ctv q8_0): worth testing on your end too to give the patch more coverage, the quantized-KV path hits the Hadamard rotation that default f16 skips.

@ggerganov

ggerganov commented Jun 8, 2026

Member

@ServeurpersoCom I think you have to use --spec-draft-n-max 1 because more than 1 was not implemented for this model.

@ServeurpersoCom

ServeurpersoCom commented Jun 8, 2026

Contributor

> @ServeurpersoCom I think you have to use --spec-draft-n-max 1 because more than 1 was not implemented for this model.

Good catch! I ran a test just now:
No more Chinese in my reasoning; 160.33 t/s for both MTP and no-MTP.

@forforever73

Contributor

@ServeurpersoCom On the Spark I use

-m Step-3.7-IQ4_XS.gguf \
    --spec-type draft-mtp \
    --spec-draft-model Step3.7-flash-mtp-Q8_0.gguf \
    -ngl all \
    --spec-draft-ngl all \
    -c 35000 \
    -np 1 \
    -b 2048 \
    -ub 1024 \
    --temp 0 \
    --spec-draft-n-max 1 \
    --spec-draft-p-min 0.6 \

no mtp

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.8
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.8
  summarize          pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  qa_factual         pred= 154 draft=   0 acc=   0 rate=n/a tok/s=27.9
  translation        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.8
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.9
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=27.2

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1690,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 76.51
}

with mtp

  code_python        pred= 192 draft=  91 acc=  85 rate=0.934 tok/s=32.4
  code_cpp           pred= 192 draft=  90 acc=  86 rate=0.956 tok/s=33.0
  explain_concept    pred= 192 draft=  91 acc=  81 rate=0.890 tok/s=31.6
  summarize          pred= 192 draft=  90 acc=  83 rate=0.922 tok/s=32.2
  qa_factual         pred= 157 draft=  74 acc=  72 rate=0.973 tok/s=34.0
  translation        pred= 192 draft=  87 acc=  83 rate=0.954 tok/s=32.6
  creative_short     pred= 192 draft=  74 acc=  71 rate=0.960 tok/s=29.7
  stepwise_math      pred= 192 draft=  85 acc=  84 rate=0.988 tok/s=32.6
  long_code_review   pred= 192 draft=  89 acc=  83 rate=0.933 tok/s=31.1

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1693,
  "total_draft": 771,
  "total_draft_accepted": 728,
  "aggregate_accept_rate": 0.9442,
  "wall_s_total": 57.89
}

with mtp and -ctk q8_0 -ctv q8_0

  code_python        pred= 192 draft=  93 acc=  87 rate=0.935 tok/s=32.6
  code_cpp           pred= 192 draft=  90 acc=  75 rate=0.833 tok/s=30.3
  explain_concept    pred= 192 draft=  93 acc=  83 rate=0.892 tok/s=32.1
  summarize          pred= 192 draft=  84 acc=  77 rate=0.917 tok/s=30.7
  qa_factual         pred= 156 draft=  75 acc=  74 rate=0.987 tok/s=34.9
  translation        pred= 192 draft=  86 acc=  79 rate=0.919 tok/s=31.4
  creative_short     pred= 192 draft=  77 acc=  63 rate=0.818 tok/s=27.8
  stepwise_math      pred= 192 draft=  86 acc=  85 rate=0.988 tok/s=32.9
  long_code_review   pred= 192 draft=  84 acc=  80 rate=0.952 tok/s=30.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1692,
  "total_draft": 768,
  "total_draft_accepted": 703,
  "aggregate_accept_rate": 0.9154,
  "wall_s_total": 59.55
}
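
The three Spark aggregates above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the acceptance rate, the wall-clock speedup over the no-MTP run, and an estimate of tokens emitted per target pass. The helper names are illustrative. The tokens-per-pass figure assumes each verification pass yields exactly one token sampled by the target plus any accepted draft tokens, so the number of passes is approximately pred minus acc. Note that wall_s_total covers the whole run, not only decode.

```python
def accept_rate(accepted, drafted):
    """Fraction of drafted tokens the target accepted (None when nothing was drafted)."""
    return accepted / drafted if drafted else None


def wall_throughput(total_predicted, wall_s_total):
    """Tokens per second over the whole run (wall time, not decode-only)."""
    return total_predicted / wall_s_total


def tokens_per_pass(total_predicted, accepted):
    """Estimated tokens per target pass, assuming one target-sampled token per pass."""
    return total_predicted / (total_predicted - accepted)


# Aggregates copied from the three runs above.
runs = {
    "no mtp":        {"pred": 1690, "draft": 0,   "acc": 0,   "wall": 76.51},
    "mtp":           {"pred": 1693, "draft": 771, "acc": 728, "wall": 57.89},
    "mtp + q8_0 kv": {"pred": 1692, "draft": 768, "acc": 703, "wall": 59.55},
}

baseline = wall_throughput(runs["no mtp"]["pred"], runs["no mtp"]["wall"])
for name, r in runs.items():
    rate = accept_rate(r["acc"], r["draft"])
    rate_s = "n/a" if rate is None else f"{rate:.4f}"
    speedup = wall_throughput(r["pred"], r["wall"]) / baseline
    per_pass = tokens_per_pass(r["pred"], r["acc"])
    print(f"{name:14s} accept={rate_s} speedup={speedup:.2f}x tokens/pass={per_pass:.2f}")
```

The recomputed acceptance rates match the reported aggregate_accept_rate values. The wall-clock speedup is noticeably smaller than the tokens-per-pass estimate, which is consistent with drafting and verification adding their own cost per pass.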

and the reasoning:
image

Currently --spec-draft-n-max 3 performs worse than 1. I'm working on support for the step3.5 3-layer MTP, but due to a conflict with Gemma 4 it will still take some time.

@ServeurpersoCom

Contributor

As ggerganov said, use --spec-draft-n-max 1 because more than 1 has not been implemented for this model.

I'm trying a combination of the two patches:

(root|~/llama.cpp.pascal) git diff
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index da7a92955..4cc4a4a16 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -567,7 +567,10 @@ void llm_graph_input_attn_kv_iswa::set_input(const llama_ubatch * ubatch) {
         mctx->get_base()->set_input_v_idxs(self_v_idxs, ubatch);
     }

-    mctx->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
+    // the kq mask guards on its own buffer: shared cells leave idxs unbacked while the mask stays live
+    if (self_kq_mask && self_kq_mask->buffer) {
+        mctx->get_base()->set_input_kq_mask(self_kq_mask, ubatch, cparams.causal_attn);
+    }

     // swa tensors may not be allocated if there are no SWA attention layers
     if (self_k_idxs_swa && self_k_idxs_swa->buffer) {
@@ -575,7 +578,9 @@ void llm_graph_input_attn_kv_iswa::set_input(const llama_ubatch * ubatch) {
         mctx->get_swa()->set_input_v_idxs(self_v_idxs_swa, ubatch);
     }

-    mctx->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
+    if (self_kq_mask_swa && self_kq_mask_swa->buffer) {
+        mctx->get_swa()->set_input_kq_mask(self_kq_mask_swa, ubatch, cparams.causal_attn);
+    }

     if (self_k_rot) {
         mctx->get_base()->set_input_k_rot(self_k_rot);
@@ -607,7 +612,9 @@ bool llm_graph_input_attn_kv_iswa::can_reuse(const llm_graph_params & params) {
       //res &= self_v_idxs->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
     }

-    res &= can_reuse_kq_mask(self_kq_mask, mctx->get_base(), params.ubatch, params.cparams);
+    if (self_kq_mask && self_kq_mask->buffer) {
+        res &= can_reuse_kq_mask(self_kq_mask, mctx->get_base(), params.ubatch, params.cparams);
+    }

     // swa tensors may not be allocated if there are no SWA attention layers
     if (self_k_idxs_swa && self_k_idxs_swa->buffer) {
@@ -615,7 +622,9 @@ bool llm_graph_input_attn_kv_iswa::can_reuse(const llm_graph_params & params) {
       //res &= self_v_idxs_swa->ne[0] == params.ubatch.n_tokens; // TODO: need to move this to the unified cache and check there
     }

-    res &= can_reuse_kq_mask(self_kq_mask_swa, mctx->get_swa(), params.ubatch, params.cparams);
+    if (self_kq_mask_swa && self_kq_mask_swa->buffer) {
+        res &= can_reuse_kq_mask(self_kq_mask_swa, mctx->get_swa(), params.ubatch, params.cparams);
+    }

     return res;
 }

@ServeurpersoCom

ServeurpersoCom commented Jun 8, 2026

Contributor

All working. The combined patch is the cleanest, @ggerganov: keep your can_reuse guards, and use the mask's own buffer for set_input. Your base guard keyed off self_k_idxs_swa, which is allocated for a SWA-only draft head (StepFun's MTP head is SWA-only), so it still wrote the null base mask and crashed at load. Guarding each mask on its own buffer covers both cases, at all 4 sites. @forforever73, you can try this latest version; it should work.
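
The difference between the two guard styles can be shown with a toy model. This is a Python sketch, not the llama.cpp C++ code: `Tensor`, `write_mask`, and both `set_input_*` functions are hypothetical stand-ins, and the "sibling guard" only mirrors the failure as described in the comment above (a base-mask write gated on `self_k_idxs_swa`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tensor:
    """Toy graph input tensor; buffer is None when it was never allocated."""
    name: str
    buffer: Optional[list] = None


def write_mask(mask: Optional[Tensor]) -> None:
    """Toy stand-in for set_input_kq_mask: a write needs a backing buffer."""
    if mask is None or mask.buffer is None:
        raise RuntimeError("write to unallocated mask")
    mask.buffer.append("mask")


def set_input_sibling_guard(base_mask, swa_idxs) -> None:
    """Gate the base-mask write on a different tensor (the failing pattern)."""
    if swa_idxs is not None and swa_idxs.buffer is not None:
        write_mask(base_mask)


def set_input_own_guard(base_mask) -> bool:
    """Gate the base-mask write on the mask's own buffer; return True if written."""
    if base_mask is not None and base_mask.buffer is not None:
        write_mask(base_mask)
        return True
    return False


# SWA-only draft head: the SWA index tensor is allocated, the base mask is not.
base_mask = Tensor("self_kq_mask", buffer=None)
swa_idxs = Tensor("self_k_idxs_swa", buffer=[])

try:
    set_input_sibling_guard(base_mask, swa_idxs)
    sibling_ok = True
except RuntimeError:
    sibling_ok = False

own_guard_skipped = set_input_own_guard(base_mask) is False
print("sibling guard survived:", sibling_ok)
print("own-buffer guard skipped the write safely:", own_guard_skipped)
```

In this toy setup the sibling guard passes its check and then attempts the write on the unallocated base mask, while the own-buffer guard simply skips it. The same check still performs the write when the mask is allocated.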

@forforever73

Contributor

@ServeurpersoCom Yes, it works as well.

@vbooka1

vbooka1 commented Jun 8, 2026

I confirm that patch from #23398 (comment) fixes StepFun 3.7 MTP

@ServeurpersoCom

Contributor

> I confirm that patch from #23398 (comment) fixes StepFun 3.7 MTP

Thanks for testing. Just 2 more runners #24294 and it'll merge :)

TheTom pushed a commit to TheTom/llama-cpp-turboquant that referenced this pull request Jun 8, 2026
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request Jun 8, 2026
Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on
this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb`
  through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the
  fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation
  tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF) — now nested
  under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the
  refactor while keeping the fork's n_layer_kv_from_start; restore the
  swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl /
  nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and
  wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the
  ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on
  this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum,
  talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip).
Kernels (ggml-cuda / ggml-metal) untouched.
turbo-tan added a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 10, 2026
Add support for gemma4-assistant models as MTP (Multi-Token Prediction)
draft heads for speculative decoding with gemma4 target models.

## Key Features

### Automatic Assistant Detection
- Detect gemma4-assistant models via 'gemma4.assistant.type = mtp' metadata
- Automatically route to gemma4-assistant implementation even when GGUF declares
  'general.architecture = gemma4'
- Read 'gemma4.assistant.backbone_hidden_size' to get target model's hidden size

### Architecture Alignment with Upstream
- Rename LLM_ARCH_GEMMA4_MTP to LLM_ARCH_GEMMA4_ASSISTANT
- Rename gemma4_mtp.cpp to gemma4-assistant.cpp
- Add ctx_other integration for shared memory between target and assistant
- Align layer counting with upstream (n_layer_all vs n_layer)

### Tensor Support
- Add LLM_TENSOR_ASSISTANT_PRE_PROJ and LLM_TENSOR_ASSISTANT_POST_PROJ
- Map 'assistant.pre_projection' and 'assistant.post_projection' tensor names
- Make rope_freqs optional (assistant GGUFs don't include this tensor)
- Fix layer_output_scale tensor name (remove 'weight' suffix)
- Add optional MTP projection tensors to gemma4.cpp

### Layer Counting Alignment
- Use n_layer_all for iterating all layers in assistant models
- Use n_layer() for regular layers (n_layer_all - n_layer_nextn)
- Assistant models have n_layer() = 0 (all layers are nextn layers)

### Stride and Dimension Fixes
- Use n_embd_out() for stride in output_reorder()
- Use target's n_embd_out for k==0 nextn fallback
- Add embeddings_pre_norm to allow_reuse() check

## Testing

Assistant model loads successfully:
- gemma-4-E2B-it-assistant-BF16.gguf: ✓ Loads (requires ctx_other for inference)
- Architecture detection: ✓ Automatically detects as gemma4-assistant
- Tensor loading: ✓ All 48 tensors found

Note: Full MTP speculative decoding requires a working target model. The
gemma4 target models in our test environment have separate tensor count
issues unrelated to this PR.

## Usage

```bash
./llama-server \
    -m target-gemma4.gguf \
    -md assistant-gemma4.gguf \
    --spec-type mtp \
    --draft-block-size 3 \
    --draft-max 8
```

## Files Changed

- 147 files modified
- +659 insertions, -455 deletions
- New: src/models/gemma4-assistant.cpp
- Deleted: src/models/gemma4_mtp.cpp

## References

- Upstream PR: ggml-org#23398
- Model card: https://huggingface.co/google/gemma-4-E2B-it-assistant
- GGUF repo: https://huggingface.co/AtomicChat/gemma-4-E2B-it-assistant-GGUF

Assisted-by: opencode
Blueforcer added a commit to aleph-garden/xllamacpp that referenced this pull request Jun 12, 2026
Pulls in ggml-org/llama.cpp#23398 (gemma4-assistant draft arch) and
ggml-org/llama.cpp#24282 (E2B/E4B assistants). Without this, loading any
mtp-gemma-4-*.gguf drafter fails with: unknown model architecture
'gemma4-assistant'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
marksverdhei added a commit to heiervang-technologies/ht-llama.cpp that referenced this pull request Jun 12, 2026
The Backend & quantization table omitted two HT-specific speculative
decoding features that have shipped to ht:

- DFlash (LLM_ARCH_DFLASH, --spec-type dflash, custom CUDA kernels for
  partial-accept feature extraction) — landed via PR #62 (b0daec5),
  integrates the z-lab DFlash block-diffusion drafter against Gemma4
  31B targets.

- Gemma4 MTP (gemma4-assistant arch + --spec-type draft-mtp) — vendored
  via PR #93 (4c09765) ahead of upstream PR ggml-org#23398
  merge so the gemma-4-12b-qat-mtp preset can ship on titan. Marked
  with Tracked-upstream=ggml-org#23398 since it retires when that PR merges and
  flows through a normal master sync.

Found during a §7 documentation freshness sweep — the inventory exists
to be authoritative ("consult it before assuming a behaviour is
upstream stock" per AGENTS.md), so omissions defeat the purpose.

Docs-only, no code touched.

Co-authored-by: marksverdhei <mark.sverdhei@gmail.com>

Labels

examples, model (Model specific), python (python script changes), server, testing (Everything test related)
