vulkan: add GATED_DELTA_NET op support #20334
Conversation
Quickly did a couple of benchmarks on Strix Halo (8060S). PP basically unchanged (maybe a little slower), TG quite a bit faster:
PP performance still exhibits the same issue as described in #18725.

Thanks for testing! Those TG numbers are great — 22% on Coder-Next, 21% on 35B-A3B, 18% on 122B-A10B. The PP issue from #18725 makes sense — this shader only affects the deltanet recurrence layers, and PP throughput is still bottlenecked by the autoregressive token loop in the current kernel. A chunked parallel kernel (Phase 2) would fix that, but it's a much bigger piece of work. Hope to work on it tomorrow. Updated the PR description with your benchmark numbers. Really helpful to have data from an 8060S alongside the integrated 890M results.

Same 8060S setup. In my testing, although the current PR shows a performance improvement compared to master, Qwen3.5-35B-A3B cannot reach that speed (50 token/s), quite strange :( command:
0cc4m left a comment
Thank you for getting this out so quick! I have a few comments, but overall it looks good. I can confirm it runs correctly.
Let's focus on getting it running first, optimization can be a follow-up PR.

@IIIIIllllIIIIIlllll The difference in performance is most likely caused by the quantization:
Additionally, my Minisforum MS-S1 has a very generous power budget (I think the highest of all available boards, with 160W short term and 130W long term).

For what it's worth, I'm seeing some crazy numbers here in terms of performance on my R9700s:
These are anecdotal numbers from ad-hoc testing, but they're pretty consistent across prompts - just trying to give an indication of why my jaw dropped. The single-GPU boost is great, but I had to check the dual-GPU result with Qwen3-Coder-Next a bunch of times; it's genuinely usable in an agentic context now. With Cline at the helm, it never drops below 50t/s even as the context fills up past 180k, and both GPUs have gone from ~63W constant to 70W with spikes up to 180W (still nowhere near the 300W max, but that's one for the tensor parallel PR when it's fixed for Vulkan...?). @ProgenyAlpha - I feel like some of us may owe you a beer or five. I might actually be getting some value out of these cards now!

Thanks for the review — all 7 items addressed. Pipeline declarations moved to a

Pushed the review fixes — all 7 items from @0cc4m addressed. Also ran a full model bench on 890M (integrated): Qwen3-Coder-Next REAM Q4_K_M (60B)
@digitalscream those dual-GPU numbers are wild — 42% improvement and usable agentic speeds at 180k context is exactly the kind of thing that makes this worth doing. I have a tip jar, but I'm doing this for the community, so it's not necessary. I love the feedback from different platforms, so feel free to keep testing changes.
@lemmi thanks again for the Strix Halo data. The PP issue is upstream of this shader (SSM_CONV workgroup scaling, #18725), so it won't change here.
@IIIIIllllIIIIIlllll lemmi nailed it — UD-Q8_K_XL uses importance-weighted mixed precision, which generally outperforms uniform Q8_0 across backends.

Great work @ProgenyAlpha

@ProgenyAlpha thanks for this, 24% to 42% on R9700 is a godsend! If I may, you can have a look at the following for your next adventure

Note that for the chunked version you'll need to have the changes from #20340.

Pushed f16 mixed-precision state — stores the 128-element state array in float16_t, keeps all arithmetic in float32. No precision loss (16/16 tests), lower register pressure. 890M benchmarks (Qwen3-Coder-Next REAM Q4_K_M):
The f16 pipeline auto-selects when the device supports shaderFloat16, and falls back to f32 otherwise. @ggerganov saw your note about #20340 — will rebase onto that once it lands. @zedbytes interesting find on the nocompute flag. That's a driver-level queue scheduling optimization — separate from the shader work, but worth investigating for AMD Vulkan in general.

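To make the "f16 storage, f32 arithmetic" split concrete, here is a small pure-Python stand-in (no GPU involved): the persistent state is round-tripped through IEEE 754 half precision between steps, while each update itself is computed in full precision. The array size and update values are illustrative, not taken from the shader.

```python
import struct

def to_f16(x):
    # round-trip a Python float through IEEE 754 half precision
    # ('e' is the binary16 struct format, available since Python 3.6)
    return struct.unpack('e', struct.pack('e', x))[0]

# persistent state lives in half precision between steps...
state = [to_f16(0.0)] * 4
for update in (0.1, 0.2, 0.3):
    # ...but each arithmetic step runs in full precision before the
    # result is quantized back to f16 for storage
    state = [to_f16(s + update) for s in state]
```

The rounding error stays small relative to the state magnitude, which matches the "no precision loss in tests" observation above: the recurrence arithmetic never sees more than one quantization step of error per update.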
@digitalscream @lemmi if you get a chance, would be great to see updated numbers with the latest push. The f16 state change gave +8.3% PP on my 890M — curious if the improvement scales differently on discrete cards with more CUs.

@ProgenyAlpha - OK, with the latest push:
So... +7% PP and -0.8% TG for single GPU, negligible changes for dual GPU.

Please split out the chunked support into a follow-up instead of expanding the scope of this PR. Even float16 might be further than step 1 should be.

Wasn't able to see any meaningful changes on 23fbfcb for PP on my end. |
Sorry to bother everyone. |
Force-pushed from a57f2ac to 2007841

@0cc4m Done — stripped this PR back to autoregressive-only. Removed the chunked shaders, f16 state, and all the infrastructure that came with them. This is now just the base GATED_DELTA_NET op support + review fixes. Split into two follow-up PRs:
13/13 backend-ops tests passing. Benchmarks unchanged from before since the autoregressive path is the same.

@lemmi @IIIIIllllIIIIIlllll Thanks for retesting. The f16 state improvement looks hardware-dependent — +8.3% PP on my 890M, +7% on @digitalscream's R9700S, but flat on Strix Halo. Might be related to how the register file handles f16 on different RDNA3.5 configs. Either way, f16 is now split out into its own PR (#20376) so it won't hold up the base support. This PR is stripped back to autoregressive-only per @0cc4m's review — should be ready for another look.

Mark the PR as ready when you are done, please.

Rebased on master and fixed the Q/K broadcast to use the interleaved layout from #20340. 13/13 tests passing.

No, that was not a correct rebase. |
Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- vec4 dot products on all inner loops (dp4 hardware intrinsic)
- Cache exp(g) in shared memory for KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token
- vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops)
- Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops

KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
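The exp(g) caching in this commit is a hoist-the-transcendental optimization: compute the per-row decay factor once and reuse it across every state element in that row. A hypothetical pure-Python analogue (dimensions and function names are mine; the shader uses a per-workgroup shared-memory array rather than a Python list):

```python
import math

def decay_naive(S, g):
    # recompute exp(g[row]) for every state element: one exp() per element
    return [[S[i][j] * math.exp(g[i]) for j in range(len(S[0]))]
            for i in range(len(S))]

def decay_cached(S, g):
    # compute exp(g[row]) once per row and reuse it, mirroring the
    # shared-memory cache this commit describes: one exp() per row
    eg = [math.exp(gi) for gi in g]
    return [[S[i][j] * eg[i] for j in range(len(S[0]))]
            for i in range(len(S))]
```

Both produce identical results; only the exp() count changes, which is where the ~16K saved calls per token in the commit message come from.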
Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adapt to the interleaved broadcast convention from ggml-org#20340: head_id / rq1 → head_id % neq1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
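To visualize what the division-to-modulo swap changes, here is a toy sketch of two ways to broadcast query heads across KV heads: consecutive blocks (division) versus interleaved wrapping (modulo). All head counts and variable names below are hypothetical, chosen for clarity; the commit's actual change (head_id / rq1 → head_id % neq1) applies the modulo-style convention of ggml-org#20340 inside the shader.

```python
# hypothetical head counts for illustration only
n_q, n_kv = 8, 2
rq = n_q // n_kv  # query heads per KV head ("rq1" plays this role)

# blocked: consecutive runs of query heads share one KV head
blocked = [h // rq for h in range(n_q)]       # [0, 0, 0, 0, 1, 1, 1, 1]
# interleaved: query heads wrap around the KV heads via modulo
interleaved = [h % n_kv for h in range(n_q)]  # [0, 1, 0, 1, 0, 1, 0, 1]
```

The two layouts touch the same heads but in a different order, which is why a shader assuming one convention reads the wrong per-head data under the other.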
Force-pushed from d72268f to d5300db

Saw it immediately after: the rebase picked up duplicate commits from master, and I didn't catch it before pushing. Cleaned up with cherry-pick; there should be 6 commits on master now.

Thanks for the correct rebase! EDIT: Checked.

I'm gonna skip waiting for the CI to unblock other PRs. I checked locally that it works. |
Summary
First pass at a Vulkan compute shader for GGML_OP_GATED_DELTA_NET, covering both the standard (scalar gate) and KDA (per-row vector gate) variants. This is the core recurrence op used by Qwen3.5 and Qwen3-Next models.

What's here:
Benchmarks — AMD Radeon 8060S / Strix Halo (by @lemmi, vs master):
Benchmarks — AMD Radeon 890M / Strix Point (integrated):
Vulkan vs CPU on 9B: 1.7x PP, 7.5x TG.
Op-level perf (Phase 1 vs baseline scalar shader):
What's next:
13/13 test-backend-ops passing. All head sizes, GQA, multi-seq, permuted layouts, both KDA and non-KDA.
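For readers new to the op, the recurrence can be sketched in scalar form: decay the state by the gate, read out the current key, apply a rank-1 delta correction, then project with the query. The following is a hedged pure-Python reading of that structure; the shader's actual loop order, vectorization, and state layout differ, and the dimension names (d_k, d_v) are illustrative.

```python
import math

def gated_delta_step(S, q, k, v, g, beta, kda=False):
    """One recurrence step for a single head and token.
    S is a d_v x d_k state matrix (list of rows); g is a scalar
    log-decay, or a per-row vector in the KDA variant."""
    d_v, d_k = len(S), len(S[0])
    # 1) decay: S *= exp(g)  (per-row exp(g[i]) when kda=True)
    for i in range(d_v):
        decay = math.exp(g[i]) if kda else math.exp(g)
        S[i] = [s * decay for s in S[i]]
    # 2) readout of the current key: kv = S @ k
    kv = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # 3) rank-1 delta update: S += beta * (v - kv) outer k
    for i in range(d_v):
        err = beta * (v[i] - kv[i])
        S[i] = [S[i][j] + err * k[j] for j in range(d_k)]
    # 4) output: o = S @ q
    return [sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
```

Each token depends on the state left by the previous one, which is why the autoregressive kernel serializes over tokens and why a chunked formulation is needed to parallelize PP.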
Open to collaboration — if anyone wants to work on the chunked kernel or test on different hardware, happy to coordinate.
cc @jhen0409 (re: #14909)