Skip to content

Remove padding and multiple D2D copies for MTP#24086

Merged
gaugarg-nv merged 3 commits into
ggml-org:masterfrom
gaugarg-nv:gdn-changes
Jun 10, 2026
Merged

Remove padding and multiple D2D copies for MTP#24086
gaugarg-nv merged 3 commits into
ggml-org:masterfrom
gaugarg-nv:gdn-changes

Conversation

@gaugarg-nv

@gaugarg-nv gaugarg-nv commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Based on @ggerganov's suggestion at #23940 (comment)

Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and pass the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy

This improves MTP performance by about 4% on DGX Spark.

Command: llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 165 acc= 135 rate=0.818 tok/s=98.1
  code_cpp           pred= 192 draft= 194 acc= 125 rate=0.644 tok/s=83.9
  explain_concept    pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=83.0
  summarize          pred= 192 draft= 161 acc= 137 rate=0.851 tok/s=101.3
  qa_factual         pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=92.3
  translation        pred= 192 draft= 202 acc= 122 rate=0.604 tok/s=81.3
  creative_short     pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=88.1
  stepwise_math      pred= 192 draft= 164 acc= 136 rate=0.829 tok/s=100.7
  long_code_review   pred= 192 draft= 201 acc= 123 rate=0.612 tok/s=80.8

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1652,
  "total_draft_accepted": 1160,
  "aggregate_accept_rate": 0.7022,
  "wall_s_total": 21.71
}

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 165 acc= 135 rate=0.818 tok/s=102.0
  code_cpp           pred= 192 draft= 194 acc= 125 rate=0.644 tok/s=87.7
  explain_concept    pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=86.4
  summarize          pred= 192 draft= 161 acc= 137 rate=0.851 tok/s=105.2
  qa_factual         pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=95.8
  translation        pred= 192 draft= 202 acc= 122 rate=0.604 tok/s=84.3
  creative_short     pred= 192 draft= 189 acc= 127 rate=0.672 tok/s=91.5
  stepwise_math      pred= 192 draft= 164 acc= 136 rate=0.829 tok/s=104.9
  long_code_review   pred= 192 draft= 201 acc= 123 rate=0.612 tok/s=84.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1728,
  "total_draft": 1652,
  "total_draft_accepted": 1160,
  "aggregate_accept_rate": 0.7022,
  "wall_s_total": 20.91
}

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, paired with Codex. Code changes in all backends except CUDA were done by AI. I have reviewed all changes.

@github-actions github-actions Bot added model Model specific testing Everything test related Nvidia GPU Issues specific to Nvidia GPUs Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Jun 3, 2026
@gaugarg-nv

Copy link
Copy Markdown
Contributor Author

Marking this as a draft as it has changes only for the CUDA and Vulkan backends. Changes in the op implementation are fairly simple. I will work on changes for other backends once I get initial feedback.

@ggerganov ggerganov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's do the change. I think we the "determine K from the tensor shape" was never really needed.

Comment thread src/models/delta-net-base.cpp Outdated
// TODO: remove pad + simplify
ggml_tensor * s_3d = ggml_reshape_3d(ctx0, s, D, 1, n_seqs);
ggml_tensor * s_3d_pad = ggml_pad (ctx0, s_3d, 0, K - 1, 0, 0);
ggml_tensor * s_3d = ggml_reshape_3d(ctx0, s, D, 1, n_seqs);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to collapse the state anymore. Mainly for consistency reasons I think the state should remain a 4D tensor with shape: [S_v, S_v, H_v, n_seqs].

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Comment thread ggml/include/ggml.h Outdated
Comment on lines +2545 to +2558
// state holds the initial state s0 only, shape (S_v*S_v*H, 1, n_seqs). K is the number of
// output snapshot slots:
// K == 1: output carries the final state only.
// K > 1: output carries K snapshot slots; the kernel writes the last min(n_tokens, K)
// per-token snapshots into the trailing slots
// K > 1: output carries K snapshots, most-recent first (slot 0 = final state, slot s =
// state s tokens back); when n_tokens < K only slots 0..n_tokens-1 are written.
GGML_API struct ggml_tensor * ggml_gated_delta_net(
struct ggml_context * ctx,
struct ggml_tensor * q,
struct ggml_tensor * k,
struct ggml_tensor * v,
struct ggml_tensor * g,
struct ggml_tensor * beta,
struct ggml_tensor * state);
struct ggml_tensor * state,
int64_t K);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be useful to document in a comment the shapes of the tensors of this op.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

@github-actions github-actions Bot added SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language Apple Metal https://en.wikipedia.org/wiki/Metal_(API) OpenCL Issues specific to the OpenCL backend Hexagon WebGPU labels Jun 9, 2026
@gaugarg-nv gaugarg-nv marked this pull request as ready for review June 9, 2026 18:52
@gaugarg-nv gaugarg-nv requested review from a team, CISC and JohannesGaessler as code owners June 9, 2026 18:52
@gaugarg-nv

Copy link
Copy Markdown
Contributor Author

Yes, let's do the change. I think we the "determine K from the tensor shape" was never really needed.

Changes are now made in all backends. I have thoroughly reviewed the change, but I have only been able to test CUDA, Vulkan, and CPU. I am hoping CI will be able to help capture any issues on other backends.

It will be great if maintainers can help test these changes on other backends as well.

Comment on lines +588 to +603
const int64_t n_written = std::min<int64_t>(n_seq_tokens, K);

ggml_build_forward_expand(gf, ggml_cpy(ctx0, src, dst));
}
// write the produced snapshots into the recurrent cache (snapshot slot i -> rollback group i)
ggml_tensor * src = ggml_view_3d(ctx0, gdn_out,
D, n_seqs, n_written,
ggml_row_size(gdn_out->type, D),
ggml_row_size(gdn_out->type, state_size_per_snap),
ggml_row_size(gdn_out->type, attn_score_elems));

ggml_tensor * dst = ggml_view_3d(ctx0, ssm_states_all,
D, n_seqs, n_written,
ssm_states_all->nb[1],
(size_t) mem_size * row_size,
(size_t) kv_head * row_size);

ggml_build_forward_expand(gf, ggml_cpy(ctx0, src, dst));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggerganov could you please help review this part of the code? The earlier code was copying uninitialized snapshots if n_seq_tokens < K.

I have now limited the number of snapshot copies to min(n_seq_tokens, K).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks fine. It's unfortunate we can't deduce the n_written from the output shape (technically we can, but too complicated).

…, n_seqs) and passes the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy
@arthw

arthw commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

@gaugarg-nv
In SYCL backend, the performance impact is smaller:
-5.4% -1.5% 0.5% 1.7%

It's no bad.
So, please go ahead!

Thank you!

@ggerganov ggerganov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing on Mac seems to work OK after the changes.

Comment on lines +588 to +603
const int64_t n_written = std::min<int64_t>(n_seq_tokens, K);

ggml_build_forward_expand(gf, ggml_cpy(ctx0, src, dst));
}
// write the produced snapshots into the recurrent cache (snapshot slot i -> rollback group i)
ggml_tensor * src = ggml_view_3d(ctx0, gdn_out,
D, n_seqs, n_written,
ggml_row_size(gdn_out->type, D),
ggml_row_size(gdn_out->type, state_size_per_snap),
ggml_row_size(gdn_out->type, attn_score_elems));

ggml_tensor * dst = ggml_view_3d(ctx0, ssm_states_all,
D, n_seqs, n_written,
ssm_states_all->nb[1],
(size_t) mem_size * row_size,
(size_t) kv_head * row_size);

ggml_build_forward_expand(gf, ggml_cpy(ctx0, src, dst));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, looks fine. It's unfortunate we can't deduce the n_written from the output shape (technically we can, but too complicated).

@ggerganov ggerganov added the merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. label Jun 10, 2026
@gaugarg-nv

Copy link
Copy Markdown
Contributor Author

@ggml-org/maintainers, could I please get another approval?

@ggml-org/ggml-hexagon @ggml-org/ggml-webgpu @ggml-org/ggml-opencl Could you please take a look at this as well?

@gaugarg-nv gaugarg-nv merged commit e95dae1 into ggml-org:master Jun 10, 2026
51 checks passed
@gaugarg-nv gaugarg-nv deleted the gdn-changes branch June 10, 2026 17:51
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request Jun 11, 2026
* upstream/HEAD: (329 commits)
  vendor : update LibreSSL to 4.3.2 (ggml-org#24397)
  Remove padding and multiple D2D copies for MTP (ggml-org#24086)
  chat: fix LFM2/LFM2.5 ignoring json_schema (ggml-org#24377)
  CUDA: Fix ssm_scan_f32 data-races (ggml-org#24360)
  ci : bump komac version (ggml-org#24396)
  speculative : fix "ngram-map-k4v" name in logging (ggml-org#24253)
  webui: implement pinned conversations support (ggml-org#21387)
  graph: Fix granite speech model inference by applying embedding scale when deepstack is not used (ggml-org#24357)
  ci : fix windows release (ggml-org#24369)
  ui: add opt-in run_javascript frontend tool (ggml-org#24244)
  mtmd: build_vit batching (ggml-org#24352)
  vulkan: reduce iq1 shared memory usage for mul_mm (ggml-org#24287)
  vulkan: add `v_dot2_f32_f16` support in matrix-matrix multiplication and Flash Attention (ggml-org#24123)
  ui: Fix excessive style recalculation on hover (ggml-org#24243)
  mtmd: refactor video subproc handling (ggml-org#24316)
  server: log prompts to directory (ggml-org#22031)
  ui: fix mobile chat form overflow and bust stale bundle cache (ggml-org#24158)
  ggml : add GGML_OP_COL2IM_1D (ggml-org#24206)
  server : do not clear slots without unified KV cache (ggml-org#24190)
  models : fix plamo2 attention_key/value_length regression (ggml-org#24317)
  ...
wmeddie added a commit to XpressAI/ve_llama.cpp that referenced this pull request Jun 13, 2026
Sync 205 upstream commits. The VE backend is out-of-tree (ggml-ve/), so
the merge is conflict-free. The one upstream change that touches the
backend is ggml-org#24086 (e95dae1, the GATED_DELTA_NET op-contract change),
re-ported in the following commit.
wmeddie added a commit to XpressAI/ve_llama.cpp that referenced this pull request Jun 13, 2026
Upstream ggml-org#24086 (e95dae1) changed GATED_DELTA_NET's recurrent state
from the old 3D [S_v*S_v*H, K, n_seqs] layout to a 4D s0
[S_v, S_v, H, n_seqs], moved the snapshot slot count K to op_params[0],
and reversed the snapshot ordering (slot 0 = newest).

Without this the VE GDN supports() rejects the new 4D state, the op
falls to CPU, and Qwen3.5/3.6 garble. Make supports() accept BOTH the
pre- and post-ggml-org#24086 contracts, read K from op_params[0] (new) or
state->ne[1] (old), pass the correct per-seq state stride (nb[3] new /
nb[2] old), and thread a new_contract flag into the kernel so the K>1
snapshot slot uses (n_tokens-1-t) under the new order.

Validated on ve-backend-port after the 205-commit master sync:
Llama-3.2-3B stays coherent; Qwen3.5-4B-rys (GDN) produces coherent
output ("...Paris.") instead of garble.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
platima pushed a commit to platima/llama.cpp-SpacemiT-K3 that referenced this pull request Jun 14, 2026
Scanned ggml-org/llama.cpp 160 commits ahead of base 354ebac for any
MTP perf fixes that might obviate the planned patch 9 (MTP net-negative
on small models / Gemma E2B).

Findings:
- e95dae1 (Remove padding and multiple D2D copies for MTP, ggml-org#24086) is
  the only MTP-tagged perf commit, but it targets ggml_gated_delta_net,
  the Qwen3.5 DeltaNet recurrent path. Gemma4 is a plain transformer so
  this commit does not address the E2B net-negative case.
- a66d505 / 88a3927 / 260862b / 7acb4e8 noted as not applicable
  or pure refactor.

Conclusion: patch 9 would need original investigation work (IME2
coverage of the MTP block, n_max tuning, backend sampling
re-enablement). With E4B/12B already net-positive on the same binary
the patch may be deprioritisable.
platima pushed a commit to platima/llama.cpp-SpacemiT-K3 that referenced this pull request Jun 14, 2026
* Make ggml_gated_delta_net take only the initial recurrent state (D, 1, n_seqs) and passes the snapshot count K as an op parameter instead of inferring it from state->ne[1].

Remove the padding hack and copy all emitted snapshots into the recurrent cache with a single strided ggml_cpy

* Make GDN changes in all backends. Address review comments.

* Fix CI build errors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Apple Metal https://en.wikipedia.org/wiki/Metal_(API) ggml changes relating to the ggml tensor library for machine learning Hexagon merge ready A maintainer can use this label to indicate that they consider the changes final and ready to merge. model Model specific Nvidia GPU Issues specific to Nvidia GPUs OpenCL Issues specific to the OpenCL backend SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language testing Everything test related Vulkan Issues specific to the Vulkan backend WebGPU

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants