CUDA: AR gated delta net improvements #20391
ggerganov merged 13 commits into ggml-org:gg/llama-allow-gdn-ch
Conversation
This reduces register pressure (avoids spill for S_v = 128) and gives the warp-scheduler more CTAs to schedule (thus hiding data-access latencies).
How does the PP perf look if you enable the new kernel for larger batches?

```diff
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 8b9330d63..d709007d3 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -5001,7 +5001,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
 #else
         // KDA is faster using the AR kernel even when n_tokens >= 512
         //TODO: Add chunked kernel
-        return op->src[0]->ne[2] == 1 || op->src[3]->ne[0] == op->src[2]->ne[0];
+        return true;
 #endif // GGML_USE_MUSA
     case GGML_OP_FLASH_ATTN_EXT:
         return ggml_cuda_flash_attn_ext_supported(dev_ctx->device, op);
```

In the Metal backend this implementation is good both for TG and PP, so I plan to enable it for both.
Yes, my guess is that this will eliminate the need for using the shared mem for CDNA.
FWIW, I'm seeing PP improvements both on DGX Spark and RTX 5090, using the new sharded AR kernel:
CDNA warps are 64 wide. We generally don't want to use WARP_SIZE (which is just hard-coded to 32) in new code, but rather ggml_cuda_get_physical_warp_size().
am17an
left a comment
We can remove the fast_div_64 stuff for now
@ggerganov can you try 8ea2990 on your devices? On a 5090 I get ~25% speedup using this kernel, but not as much as on a 4090. Note that it doesn't support KDA yet, so just try the Qwen models.
1. Use ggml_cuda_get_physical_warp_size() to determine the warp size flexibly
2. Add a test with a partial warp to test sum reduction on CUDA
This should work now with 0211798.
@am17an Here are the results between 8ea2990 and this PR with PP enabled:
For Q3.5 on RTX 5090 your version has an edge.
Yeah, my guess is it will scale with the
ggerganov
left a comment
IMO merging the sharded version and enabling it for all batch sizes is a good start, because it dramatically reduces the number of nodes in the ggml graph and makes the graph topology constant regardless of the batch size. We can build the chunked kernels on top of it.
Just need to double-check the correctness of the computation. PPL looks OK, but it wouldn't hurt to cross-check the logprobs against vLLM again, just in case.
@am17an All tests here are with
Agree, let's merge this. I feel the chunked version would be useful for large
@IMbackK might verify perf and correctness on HIP, especially compared to the current SMEM approach (note this PR targets the `gg/llama-allow-gdn-ch` branch).
We can merge this PR into `gg/llama-allow-gdn-ch`.
I used the new `llama-results` tool. I did it like this:

```shell
# go to baseline
git co 0cd4f4720b71dd7eb5fb3e3e86ffdd8ec5ac7c9f
make -j

# some random text for prompt
PROMPT="The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum."

# dump logits with bs=1 and bs>1 to exercise the 2 paths
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub1.gguf -p "$PROMPT" -ub 1
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 512

# compare logits between the 2 to get a sense of the expected variance
# i.e. evaluate the prompt using `ub=1` and compare to the result with `ub=512`
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 1 --check
# NMSE=1.641e-03

# checkout and build this branch
git-pr 20391
make -j

# compare the logits
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub1.gguf -p "$PROMPT" -ub 1 --check
# NMSE=2.278e-03
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 512 --check
# NMSE=5.088e-04
```

So I think it is safe to merge this. Let's wait for @IMbackK to confirm CDNA is good.
Head sizes filling at least one CDNA warp currently fail:
```cpp
                                int64_t neqk1, int64_t rq3,
                                float scale, cudaStream_t stream) {
    //TODO: Add chunked kernel for even faster pre-fill
    constexpr uint32_t warp_size = ggml_cuda_get_physical_warp_size();
```
ggml_cuda_get_physical_warp_size() is not valid in host code; you need to get the warp size from the device info struct.
Thanks for the patch! Please re-test when you have the time.
warp_size is not known at compile time in HIP host code.
Force-pushed 7a3da50 to 226a1ac (Get warp size at runtime).
IMbackK
left a comment
OK, it's correct now, and performance looks good too:
Master:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 @ d32000 | 58.02 ± 0.11 |
Pr:
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 @ d32000 | 61.03 ± 0.78 |
Thanks for checking perf also!
@IMbackK Could you also post the PP performance between master and this PR?
It's consistently faster or on par.
2048 is actually a crossover point, but the slowdown is very mild, even at very large batch sizes.
OK, thanks. @ORippler feel free to merge this into the `gg/llama-allow-gdn-ch` branch.
ggerganov merged commit d1b2301 into ggml-org:gg/llama-allow-gdn-ch
Squashed commit log:

* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
* cont : fix the fix
* cont : fix
* metal : add GDN kernel (#20361)
  * metal : add Metal backend for GGML_OP_GATED_DELTA_NET. Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU. Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)
  * metal : validate contiguity of all input tensors in supports_op
  * metal : add algorithm equivalence comment for GDA decay path
  * cont : unslop + optimize
  * cont : clean-up
* CUDA: AR gated delta net improvements (#20391)
  * Add FastDiv to gated_delta_net_cuda
  * Shard columns across warps. This reduces register pressure (avoids spill for S_v = 128) and gives the warp scheduler more CTAs to schedule (thus hiding data-access latencies).
  * Remove unneeded include in gated_delta_net.cu
  * Improve comments
  * Apply code formatting
  * Make sharding HIP-compatible: 1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly 2. Add test with partial warp to test sum reduction on CUDA
  * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
  * Rename variables
  * Enable GDN also for prefill, move TODO for chunked GDN
  * Actually remove the TODO from 2068908
  * Get warp size at runtime (warp_size is not known at compile time in HIP host code)
  * Don't expose ggml_cuda_get_physical_warp_size on host
* llama : refactor llm_build_delta_net_base API

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
I profiled the AR gated delta net and improved perf by:

1. Adding FastDiv to gated_delta_net_cuda
2. Sharding columns across warps, which reduces register pressure (avoids spill for S_v = 128) and gives the warp scheduler more CTAs to schedule (thus hiding data-access latencies)
I see we have since added SMEM support for CDNA in #20366; it might be worth seeing whether sharding makes sense there too, to get rid of spills. We could also process more than one column per warp if we don't see perf gains on lower-tier GPUs with fewer SMs.