Conversation
|
I think it is too early to implement the dedicated delta net ops. There are still many things to optimize in the existing implementation (you can keep track of my progress in #19375). After that we have to consolidate the KDA version of the delta net (#18792). Btw, the L2 norm should not be part of this op - fixed in my branch. Also, I'm not sure how to handle the 2 variants of this operator (autoregressive and chunked). So I think we can experiment with a dedicated op in a branch, but let's hold off on merging it in for now. |
|
@ggerganov I defer to your judgement; my thinking was that Qwen3.5 is already a major model series, so even if the op is just for that model it makes sense. For KDA, AFAIK the gate is a matrix, so it will just be another dot product instead of a scale. For chunked vs autoregressive, we have the vec FA path for CPU, which now serves as a reference kernel. I was thinking it would be the same here: the autoregressive kernel remains the simple kernel while chunking is the optimisation, and both solve the same recurrence. |
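To illustrate the distinction being made here (my reading of it, sketched in plain Python; this is not the actual kernel code): with a scalar gate (GDA) the decay is a single multiply of the whole per-head state, while with the per-row gate (KDA) each row of the state gets its own decay factor, so the scalar multiply becomes an elementwise product.

```python
# Decay step applied to one head's state matrix S (a list of rows).

# GDA: one scalar gate per head -> plain scale of the whole state.
def apply_gda_decay(S, alpha):
    return [[alpha * x for x in row] for row in S]

# KDA: one gate value per state row (the "per-row gate" mode) ->
# row i is scaled by a[i], i.e. an elementwise product along rows.
def apply_kda_decay(S, a):
    return [[a[i] * x for x in row] for i, row in enumerate(S)]
```

With a uniform gate vector, the KDA form reduces to the GDA form, which is why one autoregressive kernel can plausibly support both modes with the same structure.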
|
Ok, let's prototype a branch that also has this op together with the CUDA implementation rebased on #19375. I will then add the Metal version of the kernel and from there we can consider a quicker merge if things are looking good. Also, want to see if having this op will allow the CUDA graphs to be more easily enabled. |
|
So this is basically what the Transformers implementations have as the "recurrent" implementation, right? No chunking, just iterating over tokens. |
|
@pwilkin yes, just calculating the recurrence token by token |
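For anyone following along, the recurrence in question is the gated delta rule. A minimal pure-Python reference sketch of the token-by-token computation (dimensions and helper names are mine for illustration, not the kernel's; assuming the standard formulation S_t = a_t * S_{t-1} + b_t * (v_t - S_{t-1} k_t) k_t^T with output o_t = S_t q_t):

```python
# Reference sketch of the gated delta rule recurrence, processed one
# token at a time (no chunking). Dimensions here are tiny; the real op
# works per head with head_size 64/128.

def gated_delta_step(S, q, k, v, alpha, beta):
    """One token: S <- alpha * S + beta * (v - S @ k) k^T,  o = S @ q.
    S is a d_v x d_k state matrix; q, k are d_k vectors; v is d_v."""
    d_v, d_k = len(S), len(S[0])
    # S @ k: the state's current prediction of v
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # delta-rule update: write the prediction error back along k
    err = [beta * (v[i] - Sk[i]) for i in range(d_v)]
    for i in range(d_v):
        for j in range(d_k):
            S[i][j] = alpha * S[i][j] + err[i] * k[j]
    # output: o = S @ q
    return [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]

def gated_delta_net(qs, ks, vs, alphas, betas, d_v, d_k):
    """Iterate autoregressively over tokens, carrying the state forward."""
    S = [[0.0] * d_k for _ in range(d_v)]
    return [gated_delta_step(S, q, k, v, a, b)
            for q, k, v, a, b in zip(qs, ks, vs, alphas, betas)]
```

The chunked path must produce the same outputs (up to fp error) as this loop, which is what makes the simple autoregressive version useful as a reference kernel.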
|
Btw, small batch sizes larger than 1 should also be handled by this operator. I'm not sure where the break-even point would be, but I imagine that processing a few tokens auto-regressively (i.e. more than 1 and fewer than ~16) would be more efficient than the chunking path. Also don't forget that dim 3 will handle separate sequences - though from a quick look, this implementation already accounts for that. |
Yes, for a small number of tokens we can just run a loop, even in CUDA. I have not looked into the chunked impl yet, but I will invest some time in finding the break-even point.
I think this should be fine, the work is split among dim1 * dim3 (heads * sequences) |
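To make the point about the work split explicit: the recurrent state for each (head, sequence) pair evolves independently, and only the token dimension carries a sequential dependency. A hedged sketch of that decomposition (plain Python standing in for the kernel's block mapping; names are illustrative):

```python
# The state is independent per (head, seq) pair, so a GPU kernel can
# assign one block per pair (dim1 * dim3 blocks total) and keep the
# serial loop only over tokens. This loop nest mirrors that mapping.
def run_all_pairs(n_heads, n_seqs, n_tokens, step_fn, states):
    for h in range(n_heads):           # parallel on GPU (per head)
        for s in range(n_seqs):        # parallel on GPU (per sequence)
            for t in range(n_tokens):  # inherently sequential recurrence
                step_fn(h, s, t, states[(h, s)])
```

Because no (head, seq) pair ever touches another pair's state, the outer two loops can run fully concurrently without synchronization.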
|
Great performance gain for inference. Looking forward to seeing your implementation done for the major backends. If you plan to do the chunked version as well, it would be great if it is based on the block implementation done in fla: https://github.com/fla-org/flash-linear-attention/blob/main/fla/ops/kda/chunk_intra_token_parallel.py |
pwilkin left a comment:
Looks clean to me. Are you planning on doing the chunked version here as well, or as a separate op/PR?
|
Converted to draft since I am not sure if my comment was clear: #19504 (comment). First we will be prototyping a new branch and after that we will consider adding the new op. |
|
Should we use this PR or will you create a dedicated branch? |
(force-pushed f655ba4 to 01eda69)
|
@ggerganov I removed the norm, and also added the autoregressive CUDA op in 01eda69; it passes |
(force-pushed 01eda69 to 54ea122)
|
Just a heads up, I will be rebasing the #19375 branch from time to time. Hope it's not a big issue. Just always put your commits on top. I'm hoping to merge in a day or two. |
|
I did a quick perf test of this PR + #19375 + replacing the autoregressive path for qwen3next with the fused op.

master: (benchmark details collapsed)
PR: (benchmark details collapsed)
|
|
For reference, what do you get with CUDA graphs force-enabled:

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index f3d8317e1..605cb3ed4 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2894,7 +2894,7 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
#endif
}
- if (node->op == GGML_OP_ADD &&
+ if (false && node->op == GGML_OP_ADD &&
node->src[1] && node->src[1]->ne[1] > 1 &&
(node->src[0] ? node->src[0]->name != gemma3n_per_layer_proj_src0_name : true) &&
(node->src[1] ? node->src[1]->name != gemma3n_per_layer_proj_src1_name : true) && |
|
With force-enabled CUDA graphs: (benchmark details collapsed)
|
|
Very nice increase in TG speeds for CPU-only here! I didn't measure any increase for PP however (which may be expected). Finally, this gaming rig CPU is Zen5.

Details:

ik_llama.cpp main@277fc1d2
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--merge-qkv \
--threads 16 \
--warmup-batch \
-n 128
mainline llama.cpp master@e68f2fb8 + ug/port-sweep-bench
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--threads 16 \
-n 128
mainline llama.cpp PR19504 gated_delta_net@e0fbfc01 + ug/port-sweep-bench
model=/mnt/astrodata/llm/models/ubergarm/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-Q4_0.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-ctk q8_0 -ctv q8_0 \
-c 69632 \
-ub 1024 -b 2048 \
--threads 16 \
-n 128
I didn't compare hybrid CPU+GPU performance, but I expect it will see better TG throughput as well. Some more details on how I compiled, plus similar benchmarks without this PR, here. Thanks and great work! |
|
Actually, I see a huge difference in PP on CPU when just using the autoregressive kernel instead of the current one, i.e. using the fused op regardless of n_tokens. But I think I will optimize this later. |
|
Hi @ProgenyAlpha, just wanted to check whether you still plan to submit a PR for Vulkan backend support. |
|
Huh, not sure exactly what's happening, but the MUSA build is now throwing an ICE. Edit: this killed our Docker release as well. |
|
@CISC not sure who maintains the MUSA backend, but it seems like a compiler bug |
Add a fused Metal kernel for the gated delta net recurrence op (ggml-org#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@yeahdongcn PTAL at the MUSA issue above. @am17an In the meantime we can change |
No problem. I'll try a local build first and see if I should open an internal ticket. Thanks! |
I wasn't sure where the thread was going so I wanted to let you guys cook and see how things unfolded before I jump back in. I'll rebase and work on that this week if I have time. Thanks for pinging me! |
* ggml : add GATED_DELTA_NET op
* remove the transpose
* add KDA
* add qwen35 dense
* llama : check for fused gated delta net backend support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* metal : add Metal backend for GGML_OP_GATED_DELTA_NET

  Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU.

  Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)

* metal : validate contiguity of all input tensors in supports_op
* metal : add algorithm equivalence comment for GDA decay path
* cont : unslop + optimize
* cont : clean-up

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment (Co-authored-by: Aman Gupta <amangupta052@gmail.com>)
* cont : fix the fix (Co-authored-by: Aman Gupta <amangupta052@gmail.com>)
* cont : fix
* metal : add GDN kernel (#20361) - same squashed commit message as above
* CUDA: AR gated delta net improvements (#20391)
  * Add FastDiv to gated_delta_net_cuda
  * Shard columns across warps - reduces register pressure (avoids spills for S_v = 128) and gives the warp scheduler more CTAs to schedule (thus hiding data-access latencies)
  * Remove unneeded include in gated_delta_net.cu
  * Improve comments
  * Apply code formatting
  * Make sharding HIP-compatible:
    1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
    2. Add test with partial warp to test sum reduction on CUDA
  * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
  * Rename variables
  * Enable GDN also for prefill, move TODO for chunked GDN
  * Actually remove the TODO from 2068908
  * Get warp size at runtime (warp_size is not known at compile time in HIP host code)
  * Don't expose ggml_cuda_get_physical_warp_size on host
* llama : refactor llm_build_delta_net_base API

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>





Add a CPU/CUDA impl for GATED_DELTA_NET, used in qwen3next and a number of upcoming attention models. This is a basic vector impl, not the chunked impl, although it should work for n_tokens > 1 as a reference implementation. I tested this vs build_delta_net_autoregressive and the results were good. I plan to add the chunked implementation for CPU and CUDA.

master:
sched_reserve: graph nodes = 14990 (with bs=512), 6242 (with bs=1)

with ggml_op_gated_delta_net added to the qwen3next graph (not added in this PR):
sched_reserve: graph nodes = 14990 (with bs=512), 5342 (with bs=1)