vulkan: chunked parallel kernel for GATED_DELTA_NET #20377
ProgenyAlpha wants to merge 1 commit into ggml-org:master
Conversation
Benchmarks, Strix Halo: master (e1a3999):
PR (795f15c):
That's 10-20% better PP performance, depending on the model.
@lemmi Great numbers, thanks for testing. I've updated the PR to actually enable the chunked Vulkan dispatch: it's now gated on shader core count (> 16 CUs) instead of being disabled outright. On my 890M (16 CUs) the three-dispatch overhead makes the chunked path slower than the autoregressive one, so it stays off there. On your 8060S (32 CUs) it should activate automatically for n_tokens > 64 on d128 non-KDA configs. I can't validate the chunked dispatch path myself since I only have the integrated 890M. If you get a chance to test the latest push, that would tell us whether the chunked shaders actually help PP on discrete hardware, or whether they need more work (a coopmat output kernel is the next step if so).
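For reference, the gating heuristic described above could be sketched roughly like this. The function and parameter names are hypothetical, not the PR's actual code; only the conditions (CU count, token count, head size, non-KDA) come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the chunked-dispatch gate. The chunked path only
// pays off when there are enough CUs for the extra workgroups to occupy.
bool gdn_use_chunked(uint32_t shader_core_count,
                     uint32_t n_tokens,
                     uint32_t head_dim,
                     bool is_kda) {
    return shader_core_count > 16   // 890M (16 CUs): 3-dispatch overhead loses
        && n_tokens > 64            // prompt processing only, not decode
        && head_dim == 128          // only the d128 variant is covered
        && !is_kda;                 // KDA configs stay on the autoregressive path
}
```

With these assumed conditions, a 32-CU 8060S running PP-512 on a d128 non-KDA model would take the chunked path, while the 16-CU 890M never would.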
Small clarification: the 8060S is the iGPU on Strix Halo (aka Ryzen AI MAX+ 395), and it has 40 CUs. Performance tanked with the latest patch:

After:
Force-pushed from 795f15c to c0d0341
Three-dispatch chunked pipeline for prompt processing acceleration: intra-chunk WY decomposition, inter-chunk state propagation, output combination. Currently disabled (threshold=UINT32_MAX). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from c0d0341 to dbbe2a9
@0cc4m Rebased on master. The chunked kernels work, but the scalar output kernel is too slow without coopmat, so the threshold is disabled for now. I already have a coopmat output kernel in the works. Would you like me to add it here, keep this PR as infrastructure and open a separate PR for the coopmat kernel, or stop here?
Do I understand correctly that to see a gain you need to merge this PR with another? What exact command line are you using where you see a 30% gain? I only see GDN taking about 5% of the time running
@jeffbolznv Hey! As noted in the PR description, the 30% PP gain comes from #20340's chunked op path on the graph side feeding my GDN Vulkan autoregressive shader (#20334) more efficiently, not from the Vulkan chunked shaders. Both #20334 and #20340 are already merged into master, so that improvement is already live. The Vulkan chunked dispatch in this PR is actually disabled (`GDN_CHUNK_THRESHOLD = UINT32_MAX`), so with this PR as-is you'd see near-identical performance to master, since the chunked path doesn't activate. I was waiting to hear how @0cc4m (or anyone in a position to give feedback) would like to handle this. I can close this PR until I've done more thorough validation and testing, and reopen it then, if preferred.
I mostly want to understand what kind of use case/benchmark you're trying to accelerate, so I can see how much theoretical upside there is.
The autoregressive kernel dispatches one workgroup per attention head. For Qwen3-Next (n_head_kv=2), that's 2 workgroups per GDN layer. On my 890M (16 CUs), GDN is ~8% of PP-512 time; everything else (MLP, matmuls, norms) saturates all 16 CUs. Chunking shows no improvement here because there's nothing for the iGPU to give.

On a 7900 XTX (96 CUs, ~960 GB/s), the non-GDN ops scale with both CU count and bandwidth, roughly 10x faster than my shared DDR5. The GDN op also gets faster from bandwidth (~3x), but it doesn't scale with CU count: still 2 workgroups, 94 CUs idle.

Dirty math (Amdahl's law, all estimates): GDN's share grows from ~8% to ~25% of the pipeline. Chunked dispatches 16 workgroups instead of 2 for PP-512. If chunking allowed something like a ~4x improvement on the GDN portion, the rough Amdahl math would put total PP time around ~9.75 vs ~12 in relative units (~19% less time). Obviously that depends heavily on whether the kernel actually scales that way.

These are rough numbers: the bandwidth scaling of the GDN op, the actual compute- vs memory-bound split, and the dispatch overhead of three stages all need real profiling data to pin down. Based on this rough Amdahl model, GDN's relative share grows on larger GPUs, where the rest of the pipeline scales with CU count but the autoregressive kernel remains limited to a small number of workgroups. I can't prove the exact crossover point locally on 16 CUs, but the theoretical upside on larger GPUs makes it seem worth exploring.
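The dirty math above, written out. All inputs (12-unit baseline, 25% GDN share, 4x kernel speedup) are the rough guesses from this comment, not measurements:

```cpp
#include <cassert>

// Amdahl's law: the unaccelerated fraction stays, the accelerated
// fraction shrinks by `speedup`.
double amdahl_time(double total, double frac_accel, double speedup) {
    return total * (1.0 - frac_accel) + total * frac_accel / speedup;
}

// amdahl_time(12.0, 0.25, 4.0) = 12*0.75 + 12*0.25/4 = 9.75 time units,
// i.e. ~19% less wall time (12/9.75 ~ 1.23x throughput).
// At the 890M's ~8% GDN share the same 4x speedup barely moves the needle:
// amdahl_time(12.0, 0.08, 4.0) ~ 11.28, only ~6% less time.
```

The same formula also shows why the iGPU result is flat: shrinking an 8% slice by 4x recovers only 6% of the total, within benchmark noise.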
Follow-up to #20334. Adds the chunked parallel kernel infrastructure for Vulkan GATED_DELTA_NET, split out per @0cc4m's review feedback.
Depends on #20334 and #20340
Three new compute shaders implementing the chunked algorithm:

- `gated_delta_net_chunk_intra.comp` — intra-chunk parallel computation
- `gated_delta_net_chunk_inter.comp` — inter-chunk state propagation
- `gated_delta_net_chunk_output.comp` — output reconstruction

Includes the `rq1` → `neq1` broadcast fix to match #20340's interleaved Q/K layout (`head_id % neq1` instead of `head_id / rq1`).

Chunked dispatch is currently disabled (`GDN_CHUNK_THRESHOLD = UINT32_MAX`); the autoregressive path handles all token counts. Enabling it will need cooperative matrix support for the output kernel to be competitive.

16/16 backend-ops tests passing (includes chunked-specific test configs with n_seq_tokens=64/128).
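For readers unfamiliar with the intra/inter/output split, here is a toy illustration of the three-pass structure on a scalar gated recurrence `h[t] = a[t]*h[t-1] + x[t]`. This only mirrors the dispatch shape — the real shaders operate on matrix-valued state with a WY decomposition — and all names are made up for illustration:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Three-pass chunked evaluation of h[t] = a[t]*h[t-1] + x[t].
// Pass 1 runs per chunk in parallel, pass 2 is a short sequential scan over
// chunk boundaries, pass 3 combines the two — the same roles the intra,
// inter, and output shaders play.
std::vector<double> chunked_scan(const std::vector<double>& a,
                                 const std::vector<double>& x,
                                 size_t chunk) {
    size_t n = x.size();
    size_t n_chunks = (n + chunk - 1) / chunk;
    std::vector<double> local(n), decay(n);
    std::vector<double> chunk_end(n_chunks), chunk_A(n_chunks);

    // Pass 1 (intra): each chunk independently, assuming zero incoming state.
    for (size_t c = 0; c < n_chunks; ++c) {
        double h = 0.0, A = 1.0;
        for (size_t t = c * chunk; t < std::min(n, (c + 1) * chunk); ++t) {
            h = a[t] * h + x[t];
            A *= a[t];
            local[t] = h;   // result if the chunk started from state 0
            decay[t] = A;   // product of gates since the chunk start
        }
        chunk_end[c] = h;
        chunk_A[c]   = A;
    }

    // Pass 2 (inter): sequentially propagate state across chunk boundaries.
    std::vector<double> carry(n_chunks, 0.0);  // state entering each chunk
    double H = 0.0;
    for (size_t c = 0; c < n_chunks; ++c) {
        carry[c] = H;
        H = chunk_A[c] * H + chunk_end[c];
    }

    // Pass 3 (output): combine the local result with the propagated carry.
    std::vector<double> y(n);
    for (size_t t = 0; t < n; ++t) {
        y[t] = local[t] + decay[t] * carry[t / chunk];
    }
    return y;
}
```

Because the recurrence is linear in the incoming state, the per-chunk result decomposes exactly into a local part plus a decayed carry, which is what makes passes 1 and 3 embarrassingly parallel.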
890M benchmarks (Qwen3-Coder-Next REAM Q4_K_M):
The PP improvement comes from #20340's chunked op path feeding our autoregressive shader more efficiently. The Vulkan chunked dispatch itself isn't active yet — that's the next optimization pass.