llama : enable chunked fused GDN path #20340
Conversation
am17an left a comment:
BTW I tried the chunked kernel; it's just about equal in performance to master. It seems like cuBLAS is hard to beat, even with the sequential loop over chunks.
Hm, I actually disabled the chunked kernel in the CUDA backend only because I thought it wasn't implemented. If it is ready, then we should enable it - at the very least the ggml graph will become constant. Btw, on DGX Spark it seems to perform better compared to the unfused path:

`GGML_CUDA=ON ./scripts/compare-commits.sh master gg/llama-allow-gdn-ch llama-bench -m ~/models/qwen3-next-q4_0.gguf -m ~/models/Kimi-Linear-48B-A3B-Instruct-jp-imatrix.Q4_K_M.gguf -m ~/models/unsloth_Qwen3.5-27B-GGUF_Qwen3.5-27B-Q4_K_M.gguf -ngl 99 -fa 1 -t 1 -dio 1 -p 512,2048 -n 32 -ub 2048 -r 3`
If you can confirm correctness of the implementation, I think it is fine to enable it.
I think what you're enabling currently is just the autoregressive kernel, right?
Ok, I see. It's still the autoregressive kernel iterating over all tokens.
Strange that it's faster for Kimi-Linear by so much. I think it makes sense to enable it when KDA is true? I see it's faster on the 5090 as well. cc: @ymcki
Is the current branch much slower on 5090 with non-KDA models? |
Yes, it's much slower for Qwen3.5.
force-pushed from 444eeed to 39b6f5a
Ok, enabled it only for KDA for now. Also changed the broadcast pattern to interleaved, since this is what is used for Qwen3.5 and it helps avoid explicit repeats of the Q and K tensors. Added TODOs to make the broadcast configurable, which will allow avoiding the repeats for Qwen3 Next in a similar way. After a few tests and making sure this branch works correctly, we can merge.
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Originally, pwilkin implemented a backend-agnostic chunking implementation for pp and a recurrent implementation for inference. Recurrent is a special case of chunking in which chunk_size == tokens_len. Later pwilkin found that for inference, doing tokens one by one in autoregressive mode is faster than recurrent, so the recurrent form was replaced by the autoregressive form. Separately, cacaview implemented CPU and CUDA recurrent forms: cacaview's recurrent CPU impl. As you can see, aside from some hard-coded numbers, it is a cleaner implementation than the reshape and solve_tri used in pwilkin's and my backend-agnostic chunking implementations. So if your implementation is along this line, then it is not surprising that it is much faster. But this is a recurrent impl; not sure if the chunking version of it can be faster or not.
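For readers following along: the autoregressive (token-by-token) form being discussed is, as I understand it, a gated delta-rule recurrence. Below is a minimal pure-Python sketch of the math only - a `d_k x d_v` state matrix `S` that is decayed by `exp(g_t)` and then corrected by a rank-1 delta-rule update - not the actual ggml kernels, and the function name is hypothetical:

```python
import math

def gdn_recurrent(q, k, v, g, beta):
    """Schematic gated delta-rule recurrence, one token at a time.

    q, k: lists of d_k-dim vectors; v: list of d_v-dim vectors;
    g: per-token log-decay scalars; beta: per-token learning rates.
    """
    d_k, d_v = len(k[0]), len(v[0])
    S = [[0.0] * d_v for _ in range(d_k)]  # state matrix, starts at zero
    out = []
    for t in range(len(q)):
        decay = math.exp(g[t])
        # 1) apply the gate: decay the whole state
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] *= decay
        # 2) delta rule: error between v_t and what S currently reads out for k_t
        err = [v[t][j] - sum(S[i][j] * k[t][i] for i in range(d_k))
               for j in range(d_v)]
        # 3) rank-1 correction: S += beta_t * k_t err^T
        for i in range(d_k):
            for j in range(d_v):
                S[i][j] += beta[t] * k[t][i] * err[j]
        # output: read the state out with the query
        out.append([sum(S[i][j] * q[t][i] for i in range(d_k))
                    for j in range(d_v)])
    return out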
Yes, this is roughly what the current recurrent version in master looks like. We need to figure out the boundary between the chunked and autoregressive versions; clearly it's not 1, and it is also device dependent.
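To make the "recurrent is a special case of chunking" relationship concrete, here is a toy scalar-state sketch (hypothetical helper names; the real op works on per-head state matrices). A chunked driver carries the state across chunk boundaries, so `chunk_size == n_tokens` degenerates to the recurrent form and `chunk_size == 1` to the autoregressive one:

```python
import math

def gdn_step(S, q, k, v, g, beta):
    # one toy delta-rule step on a *scalar* state (deliberate simplification)
    S = math.exp(g) * S + beta * (v - S * k) * k
    return S, S * q

def gdn_chunked(qs, ks, vs, gs, betas, chunk_size):
    # process tokens in chunks of chunk_size, carrying the state across
    # chunk boundaries; the final outputs are independent of chunk_size
    S, out = 0.0, []
    for c0 in range(0, len(qs), chunk_size):
        for t in range(c0, min(c0 + chunk_size, len(qs))):
            S, o = gdn_step(S, qs[t], ks[t], vs[t], gs[t], betas[t])
            out.append(o)
    return out
```

In a real chunked kernel the inner per-token loop would be replaced by a batched intra-chunk computation; the invariant that matters is that any chunking produces the same outputs, which is what makes the chunked/autoregressive boundary purely a performance (and device-dependent) question.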
Change rq1 (tiled: head_id / rq1) to neq1 (interleaved: head_id % neq1) to match the broadcast semantics from PR ggml-org#20340. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
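The two broadcast conventions in this commit message reduce to a small index calculation. A toy sketch, under my reading of the names (`rq1` being the q-to-kv head ratio and `neq1` the number of source heads - both labels are assumptions, not taken from the source):

```python
def src_head_tiled(head_id: int, rq1: int) -> int:
    # tiled broadcast: consecutive runs of rq1 query heads
    # share one source head: 0,0,...,1,1,...
    return head_id // rq1

def src_head_interleaved(head_id: int, neq1: int) -> int:
    # interleaved broadcast: query head h wraps around the
    # neq1 source heads: 0,1,...,neq1-1,0,1,...
    return head_id % neq1
```

With 4 query heads and 2 source heads, tiled maps heads to `[0, 0, 1, 1]` while interleaved maps them to `[0, 1, 0, 1]`; a kernel written for one convention reads the wrong source head under the other, which is why the downstream backends had to adopt the same `head_id % neq1` mapping.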
Please try to follow our policy; it is there for a reason, and it is extremely tiresome for everyone every time it is not.
Is it just me, or did the GitHub runners become very slow lately? I.e., it takes very long for jobs to get picked up.
I see the same. For me, it now takes about a day to finish all CI jobs. |
It is not just you, I've been regularly slaying queues once required tests have finished (or PR has merged) just to ensure everything doesn't completely grind to a halt... |
What exactly about my comment was tiresome? I shouldn't have to explain this, but I have dyslexia. I use AI the same way someone might use spellcheck to format and catch errors. That is the extent of it. Spellcheck works for short discussions like this, not technical posts. It's 100% my words, my concepts, and my posts, and I have strict guardrails in place to keep it that way.

Perhaps a CODE_OF_CONDUCT.md needs to be made, as am17an's response was unreasonable, and the time spent on this exchange and the sudden commentary on multiple PRs has wasted far more time than briefly reading my formatted technical post ever would have. If this is how outside contributors are treated for using accessibility tools, I have no choice but to stop contributing to this project once I've completed my PRs.

This is what my post looks like without formatting tools
If you really think this would be better than what I originally wrote... then I owe you ALL an apology.
I'll reshuffle the words; It is extremely tiresome for everyone every time the policy is not followed. |
It's about perception: one does not feel valued in a conversation if it seems artificially one-sided. Nothing against your wording or your need for tools; TBH I find your original text just fine, just insert a few newlines. Have more faith in your skills. :)
@ProgenyAlpha the issue is that when one sees a clearly AI-written or at least AI-formatted comment, it's hard to tell how much effort a human put into it, and whether or not it's worth human time and attention reading and responding to it. LLMs can quickly and cheaply produce millions of long, detailed, and coherent sounding comments that may or may not be bullshit. With the amount of LLM-generated content getting produced these days, including PRs and comments on repos, it becomes tiresome to read and understand everything that LLMs produce. A policy of requiring all comments to be written by a human makes it easier to decide if it's worth another human spending their own time to read and respond to it. Your writing is understandable as is, just use some punctuation, newlines, and perhaps backticks for inline code snippets. |
@sultanqasim Thank you for your feedback. I have no issue with the policy if someone is blatantly violating it; I'm not violating it. The spam concern is valid in the abstract but doesn't apply here. I did not submit a long, detailed, LLM-generated post, and all of the things you say my original post needed are exactly what was done to it. I have open PRs, I'm a new, responsive, friendly, and active contributor, and my comment was a direct technical proposal to two people about a problem they both named; it was personal and friendly, not AI. What was NOT personal and friendly was am17an's response, which was far more egregious than what everyone seems to be trying to protect against. @CISC Ironically, one does not feel valued when being unfairly nitpicked.
Yes you are. Please read https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md
Seems like you already acknowledged this in #20334 (comment). You are also violating another policy.
Sorry for being rude; I understand your intentions might be good. However, these policies exist so that maintainers do not get overloaded, and to ensure human-driven communication happens.
Except I'm not. I wrote everything. Nothing was written by AI. Nowhere in CONTRIBUTING.md does it prohibit using AI to format a technical post before posting it.
Another missed detail: this comment on my PR was posted immediately following your rude comment here, and I gave it a thumbs up out of respect for the POLITE way it was brought up and to keep things isolated to this PR.
This policy A) was written 3 days ago, after I started contributing weeks ago, and B) the additional PRs were explicitly made after being directed to do so by @ggerganov and @0cc4m.
I genuinely think your reaction, tone, and delivery have nothing to do with your desire to maintain the integrity of anything. Every communication from me has been human-driven, and to act like I'm responsible for overloading the team after the drama you've created with your callous reply is a bit silly. I have dyslexia. Using accessibility tools is not the same as AI authorship, and conflating the two is something I'd hope a project of this caliber would understand. I hope we can move past this and focus on the work.
Perhaps it is not clear to you or your AI, but formatting a technical post counts as being written by AI. At this point I think you're being disingenuous and I will not engage with you anymore. Good luck. |
Hard disagree, and your ignoring my other talking points and refusing to have dialogue just reinforces that your sole focus here had nothing to do with maintaining human to human communication. I disclose to you my disability and you return with a dismissive and dehumanizing "it's not clear to you and your AI" comment. Anyone else find it ironic the thread was locked by an AI bot for being too heated? Lol, too funny. |
* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix the fix
  Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* cont : fix
* metal : add GDN kernel (ggml-org#20361)
  * metal : add Metal backend for GGML_OP_GATED_DELTA_NET
    Add a fused Metal kernel for the gated delta net recurrence op (ggml-org#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU.
    Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * metal : validate contiguity of all input tensors in supports_op
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * metal : add algorithm equivalence comment for GDA decay path
    Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  * cont : unslop + optimize
  * cont : clean-up
  ---------
  Co-authored-by: Paul Flynn <paul@arkavo.com>
  Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* CUDA: AR gated delta net improvements (ggml-org#20391)
  * Add FastDiv to gated_delta_net_cuda
  * Shard columns across warps
    This reduces register pressure (avoids spill for S_v = 128) and gives the warp scheduler more CTAs to schedule (thus hiding data-access latencies).
  * Remove unneeded include in gated_delta_net.cu
  * Improve comments
  * Apply code formatting
  * Make sharding HIP-compatible
    1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
    2. Add test with partial warp to test sum reduction on CUDA
  * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
  * Rename variables
  * Enable GDN also for prefill, move TODO for chunked_GDN
  * Actually remove the TODO from 2068908
  * Get warp size at runtime
    warp_size is not known at compile time in HIP host code.
  * Don't expose ggml_cuda_get_physical_warp_size on host
  ---------
  Co-authored-by: uvos <devnull@uvos.xyz>
* llama : refactor llm_build_delta_net_base API
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
Adapt to the interleaved broadcast convention from ggml-org#20340: head_id / rq1 → head_id % neq1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...
* vulkan: add GATED_DELTA_NET op support
  Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time.
  Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: optimize GATED_DELTA_NET shader (Phase 1)
  - vec4 dot products on all inner loops (dp4 hardware intrinsic)
  - Cache exp(g) in shared memory for KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token
  - vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
  - Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops
  KDA TG: +5.4% throughput. Non-KDA: no regressions.
  13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: address review feedback for GATED_DELTA_NET
  Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: add explicit FLOAT_TYPE casts for buffer loads
  Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* vulkan: fix Q/K broadcast for interleaved head layout
  Adapt to the interleaved broadcast convention from #20340: head_id / rq1 → head_id % neq1
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b). Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations. Regression introduced by ggml-org#20340 (d28961d). Same class of bug as ggml-org#12517, fixed by ggml-org#12545.
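The mismatch described in this commit message boils down to a single dimension check. A toy sketch (the function name is hypothetical; the real code paths are build_inp_mean(), the out_ids reduction, and the ggml_can_mul_mat assertion):

```python
def pooling_shapes_compatible(n_tokens: int, n_outputs: int) -> bool:
    # build_inp_mean() produces an n_tokens-wide pooling matrix, while the
    # out_ids reduction leaves t_embd with n_outputs rows; the mul_mat in
    # build_pooling() needs the shared dimension to agree
    return n_tokens == n_outputs

n_seqs = 4
n_tokens = 16 * n_seqs          # the chunked-GDN reservation uses 16*n_seqs tokens
broken = pooling_shapes_compatible(n_tokens, n_seqs)    # buggy: n_outputs = n_seqs
fixed = pooling_shapes_compatible(n_tokens, n_tokens)   # fix: n_outputs = n_tokens
```

This mirrors why the fix is simply to pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pp/tg worst-case reservations.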
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by ggml-org#20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path.
cont #19504
Backends can now implement the chunked version of the fused GDN operator.
Implementations: