llama: avoid copying logits during prompt decode in MTP by am17an · Pull Request #23198 · ggml-org/llama.cpp

am17an · 2026-05-17T10:22:31Z

Overview

Avoid copying the logits for every token in the batch during prompt processing with MTP, since MTP only requires the pre-norm hidden state. This reduces memory traffic quite a bit and in turn increases PP speed with MTP.
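For a rough sense of scale, the sketch below compares the bytes moved when copying one logits row per token against one hidden-state row per token. The batch size, vocabulary size, and hidden size are illustrative assumptions, not values taken from this PR or from any particular model, and `copy_bytes` is a hypothetical helper.

```python
# Back-of-the-envelope estimate; every size below is assumed for illustration.
BYTES_PER_F32 = 4


def copy_bytes(n_tokens: int, row_width: int, bytes_per_elem: int = BYTES_PER_F32) -> int:
    """Bytes moved when copying one row of `row_width` elements per token."""
    return n_tokens * row_width * bytes_per_elem


n_tokens = 2048     # assumed prompt batch size
n_vocab = 150_000   # assumed vocabulary size (width of a logits row)
n_embd = 5_120      # assumed hidden size (width of a hidden-state row)

logits_bytes = copy_bytes(n_tokens, n_vocab)
hidden_bytes = copy_bytes(n_tokens, n_embd)
ratio = logits_bytes / hidden_bytes

print(f"logits copy: {logits_bytes / 2**20:.1f} MiB per batch")
print(f"hidden copy: {hidden_bytes / 2**20:.1f} MiB per batch")
print(f"ratio:       {ratio:.1f}x")
```

Under these assumed sizes the ratio is simply `n_vocab / n_embd`, which is why skipping the per-token logits copy can matter during prompt processing.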

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, for debugging and reviewing

ggerganov

A quick bench on RTX 5090 with Qwen3.6 27B Q4_K

d-r-e · 2026-05-17T16:04:56Z

> A quick bench on RTX 5090 with Qwen3.6 27B Q4_K

Are the legend colors swapped?

pwilkin · 2026-05-17T16:06:30Z

@d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.

cb88 · 2026-05-17T17:03:01Z

2x MI50 with Qwen 27B Q4_1 does see some improvement with this PR:
MI50 without MTP = 500 t/s
with MTP = 250 t/s
with MTP, this PR = 300 t/s
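The three MI50 throughput figures can be turned into relative numbers with a few lines of arithmetic. This is only a sketch over the values quoted in the comment; `slowdown_pct` is a hypothetical helper name.

```python
def slowdown_pct(baseline: float, measured: float) -> float:
    """Percentage of prompt-processing throughput lost relative to the baseline."""
    return (baseline - measured) / baseline * 100.0


no_mtp = 500.0       # t/s without MTP (from the comment)
mtp_before = 250.0   # t/s with MTP, before this PR
mtp_after = 300.0    # t/s with MTP, with this PR

hit_before = slowdown_pct(no_mtp, mtp_before)
hit_after = slowdown_pct(no_mtp, mtp_after)
gain = (mtp_after / mtp_before - 1.0) * 100.0

print(f"PP hit before the PR: {hit_before:.0f}%")
print(f"PP hit with the PR:   {hit_after:.0f}%")
print(f"PP gain from the PR:  {gain:.0f}%")
```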

0cc4m · 2026-05-17T18:27:48Z

> @d-r-e no, MTP does negatively impact prompt processing, but under this PR the negative impact is halved.

Why does it affect prompt processing?

Mithras · 2026-05-17T21:55:41Z

Made sure this PR is included and re-tested:
unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q5_K_XL.gguf

| model     |             test |             t/s |     peak t/s |          ttfr (ms) |       est_ppt (ms) |      e2e_ttft (ms) |
|:----------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| qwen36-27 |  pp2048 @ d16384 | 1843.93 ± 14.82 |              |   9098.06 ± 124.65 |   9097.48 ± 124.65 |   9098.06 ± 124.65 |
| qwen36-27 |   tg128 @ d16384 |    74.46 ± 3.24 | 84.00 ± 0.82 |                    |                    |                    |
| qwen36-27 |  pp2048 @ d65536 |  1449.72 ± 9.78 |              |  42344.98 ± 292.27 |  42344.40 ± 292.27 |  42344.98 ± 292.27 |
| qwen36-27 |   tg128 @ d65536 |    61.97 ± 2.35 | 68.33 ± 4.78 |                    |                    |                    |
| qwen36-27 | pp2048 @ d131072 |  1075.30 ± 2.36 |              | 112238.48 ± 281.78 | 112237.90 ± 281.78 | 112238.48 ± 281.78 |
| qwen36-27 |  tg128 @ d131072 |    48.40 ± 2.53 | 55.00 ± 0.00 |                    |                    |                    |

Pretty much the same as #22673 (comment), which probably already included this PR. Still an almost 50% PP hit.

pwilkin · 2026-05-17T21:59:25Z

> Why does it affect prompt processing?

Due to the embeddings copy, most likely.

* llama: avoid copying logits during prompt decode in MTP
* review: update comment
* llama-graph: call set_output for t_h_pre_norm

jtjstock · 2026-05-21T02:54:48Z

This particular commit seems to regress the acceptance rate: I lose about 5% at n = 4 in coding tasks. Otherwise the prefill improvement is amazing.

am17an · 2026-05-21T03:27:36Z

@jtjstock this commit just stops copying data we never read. In a way it's a free optimization, and it should not affect anything except the prefill.

jtjstock · 2026-05-21T04:19:34Z

@am17an Just did a bunch of runs. One at latest (b9254): draft acceptance = 0.75134 (3913 accepted / 5208 generated),
and one with this and a dependent commit reverted: draft acceptance = 0.83896 (4376 accepted / 5216 generated).
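The two acceptance rates follow directly from the accepted and generated draft-token counts. A minimal check of that arithmetic, using only the counts reported in this comment (`acceptance` is a hypothetical helper):

```python
def acceptance(accepted: int, generated: int) -> float:
    """Draft acceptance rate: accepted draft tokens over generated draft tokens."""
    return accepted / generated


latest = acceptance(3913, 5208)     # build b9254
reverted = acceptance(4376, 5216)   # build with the two commits reverted

gap_points = (reverted - latest) * 100.0

print(f"latest:   {latest:.5f}")
print(f"reverted: {reverted:.5f}")
print(f"gap:      {gap_points:.2f} percentage points")
```

The gap here is measured in percentage points of acceptance, not as a relative change.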

Same prompt, same seed, temp 0, same hardware minutes apart, same compiler. The only oddity is in the outputs from latest (b9254): they are slightly different each time, though highly similar, whereas with the build that reverts the mainline commits (3e12fbd plus dependent 49c21f9, the latter included for a clean revert and build) they are identical each time.

I'm running this on Windows. The command used:
llama-server.exe ^
-m models/localweights/Qwen3.6-35B-A3B-MTP-IQ4_XS-Q8nextn-GGUF/Qwen3.6-35B-A3B-MTP-IQ4_XS-Q8nextn.gguf ^
--no-mmap --ctx-size 16384 --port 12345 ^
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 ^
--spec-type draft-mtp --spec-draft-n-max 4 ^
-np 1 --temp 0 --seed 0 ^
-sm layer -ngl 999 --tensor-split 20,18 ^
-t 4 -tb 8 --no-warmup --metrics

The prompt: write fizzbuzz in 16 different programming languages

jtjstock · 2026-05-21T12:38:49Z

@am17an Also just tested the official build of b9254 (llama-b9254-bin-win-cuda-13.1-x64), and I'm seeing the same variability in the output and a similarly degraded acceptance rate.

Maybe it's specific to my config? It's running on 2x 5060 Ti 16GB.

Edit: did some more testing. The issue is still present without quantized KV, but it disappears when I switch to -sm tensor; output from -sm tensor is consistent across runs.

jtjstock · 2026-06-02T04:14:41Z

@am17an I think the issue I see, which is still present at head, is that the MTP heads are being fed the raw pre-norm residual, but Qwen seems to do better with the post-norm. Before this PR, Qwen was effectively reading a post-norm state because t_h_pre_norm aliased the buffer that the final norm was writing to. So it was better before because of a side effect.

It is a very small change to fix. Move the final build_norm before res->t_h_pre_norm = cur in the main graph, and move res->t_h_pre_norm = cur after the build_norm in the mtp one, in qwen35moe.cpp and qwen35.cpp. Two lines moved in each file.
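The difference between the two orderings can be illustrated with a toy example. This is plain Python, not llama.cpp or ggml code: `rms_norm`, `final_stage`, and `capture_post_norm` are hypothetical names, and the sketch assumes the final norm is an RMSNorm. It only shows that the point at which the hidden state is captured, before or after the final norm, changes what the MTP head receives while leaving the normalized output unchanged.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: scale x by the reciprocal of its root mean square, then by weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def final_stage(residual, norm_weight, capture_post_norm):
    """Toy stand-in for the tail of a graph.

    Returns (state handed to the MTP head, normalized output).
    capture_post_norm=False captures the state before the final norm.
    capture_post_norm=True captures the state after the final norm.
    """
    if capture_post_norm:
        cur = rms_norm(residual, norm_weight)
        mtp_input = cur
    else:
        mtp_input = residual
        cur = rms_norm(residual, norm_weight)
    return mtp_input, cur


residual = [3.0, -4.0, 0.0, 0.0]   # made-up residual stream values
weight = [1.0, 1.0, 1.0, 1.0]      # made-up norm weights

pre, out_a = final_stage(residual, weight, capture_post_norm=False)
post, out_b = final_stage(residual, weight, capture_post_norm=True)

print("pre-norm capture: ", pre)
print("post-norm capture:", [round(v, 4) for v in post])
print("normalized output identical:", out_a == out_b)
```

In this toy the normalized output is the same either way; only the tensor captured for the MTP head differs, which is the ordering question raised in the comment.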

I could be wrong, but it does result in a noticeable improvement to acceptance with the corresponding bump in token generation. I see acceptance move from ~75% to ~80% at n-max 4, and generation from ~203 t/s to ~225 t/s on tensor split. Prefill was of course unaffected.

I'm not looking at the -sm layer issue I noted before, as I'm not using layer split anymore, but it was still present when I tested.

am17an · 2026-06-02T04:31:54Z

@jtjstock the model's MTP head was trained on the pre-norm hidden state. I don't understand how the post-norm residual would make it better.

jtjstock · 2026-06-02T04:37:25Z

Try it?

am17an · 2026-06-02T04:38:29Z

No thanks. Please stop posting irrelevant comments

jtjstock · 2026-06-02T13:11:14Z

Sorry to bother you, but I don't see any source that specifically says Qwen 3.5/3.6's MTP is trained on the pre-norm hidden state. I do see a paper that says the training method is undisclosed (https://arxiv.org/html/2605.09992v1), and vLLM seems to use the post-norm hidden state in the branch that Qwen 3.5's MTP follows.

Again, I do apologise for bothering you, you've done a tremendous amount of work on this and it is not my intention to waste your time.

am17an · 2026-06-02T13:29:48Z

Okay, I just checked vLLM and you're right, they do pass in the post-norm hidden state. My assumption was that it was a DeepSeek-like MTP, which passes the pre-norm hidden state. Thanks for pointing this out.

* llama: avoid copying logits during prompt decode in MTP
* review: update comment
* llama-graph: call set_output for t_h_pre_norm

(cherry picked from commit 3e12fbd)

Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb` through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the refactor while keeping the fork's n_layer_kv_from_start; restore the swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl / nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum, talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip). Kernels (ggml-cuda / ggml-metal) untouched.

llama: avoid copying logits during prompt decode in MTP

0abcf8f

am17an requested review from a team, CISC and ggerganov as code owners May 17, 2026 10:22

ggerganov reviewed May 17, 2026

Comment thread src/llama-context.cpp Outdated

review: update comment

e964f98

ggerganov reviewed May 17, 2026

Comment thread src/models/qwen35moe.cpp

llama-graph: call set_output for t_h_pre_norm

70a7d0e

CISC approved these changes May 17, 2026

github-actions bot added the model, examples, and server labels May 17, 2026

ggerganov approved these changes May 17, 2026

am17an merged commit 3e12fbd into ggml-org:master May 17, 2026
75 of 81 checks passed

am17an deleted the mtp-pp-fix branch May 17, 2026 15:30

tha80 mentioned this pull request May 17, 2026

llama + spec: MTP Support #22673

Merged

11 tasks

DrBearJew referenced this pull request in DrBearJew/RoxxY May 17, 2026

llama: avoid copying logits during prompt decode in MTP (#23198)

899097b

* llama: avoid copying logits during prompt decode in MTP * review: update comment * llama-graph: call set_output for t_h_pre_norm

github-actions Bot mentioned this pull request May 18, 2026

Reddit News Daily 2026-05-18 gitlawr/reddit-daily-news#248

Open

abetlen mentioned this pull request May 18, 2026

fix: initialize embeddings_pre_norm_masked=false in llama_context #23256

Merged

de-wim added a commit to de-wim/llama.cpp that referenced this pull request May 18, 2026

fix: update imatrix code to work after PR ggml-org#23198 changes

0f185b5

xbezdick pushed a commit to xbezdick/llama.cpp that referenced this pull request May 18, 2026

fix: update imatrix code to work after PR ggml-org#23198 changes

0697342

DoradusResearch mentioned this pull request May 18, 2026

Eval bug: crash when evaluating images with MTP with Qwen3.6 #23233

Closed

eugenehp added a commit to eugenehp/llama-cpp-rs that referenced this pull request May 19, 2026

updated api based on ggml-org/llama.cpp#23198

6d064aa

xbezdick pushed a commit to xbezdick/llama.cpp that referenced this pull request May 19, 2026

fix: update imatrix code to work after PR ggml-org#23198 changes

aa1aa65

TheTom mentioned this pull request Jun 8, 2026

Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ TheTom/llama-cpp-turboquant#172

Merged
