CUDA: AR gated delta net improvements #20391
ggerganov merged 13 commits into ggml-org:gg/llama-allow-gdn-ch
Conversation
This reduces register pressure (avoids spill for S_v = 128) and gives the warp-scheduler more CTAs to schedule (thus hiding data-access latencies).
How does the PP perf look if you enable the new kernel for larger batches?

```diff
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 8b9330d63..d709007d3 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -5001,7 +5001,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
 #else
         // KDA is faster using the AR kernel even when n_tokens >= 512
         //TODO: Add chunked kernel
-        return op->src[0]->ne[2] == 1 || op->src[3]->ne[0] == op->src[2]->ne[0];
+        return true;
 #endif // GGML_USE_MUSA
     case GGML_OP_FLASH_ATTN_EXT:
         return ggml_cuda_flash_attn_ext_supported(dev_ctx->device, op);
```

In the Metal backend this implementation is good both for TG and PP, so I plan to enable it for both.
Yes, my guess is that this will eliminate the need for using the shared mem for CDNA.
FWIW, I'm seeing PP improvements both on DGX Spark and RTX 5090, using the new sharded AR kernel:
CDNA warps are 64 wide. We generally don't want to use WARP_SIZE (which is just hard-coded to 32) in new code, but rather ggml_cuda_get_physical_warp_size().
am17an
left a comment
We can remove the fast_div_64 stuff for now
@ggerganov can you try 8ea2990 on your devices? On a 5090 I get ~25% speedup using this kernel, but not as much as on a 4090. Note that it doesn't support KDA yet, so just try the Qwen models.
1. Use ggml_cuda_get_physical_warp_size() to determine the warp size flexibly
2. Add a test with a partial warp to test sum reduction on CUDA
This should work now with 0211798.
@am17an Here are the results between 8ea2990 and this PR with PP enabled:
For Q3.5 on RTX 5090 your version has an edge.
Yeah, my guess is it will scale with the
ggerganov
left a comment
IMO merging the sharded version and enabling it for all batch sizes is a good start, because it dramatically reduces the number of nodes in the ggml graph and makes the graph topology constant regardless of the batch size. We can build the chunked kernels on top of it.
Just need to double-check the correctness of the computation. PPL looks OK, but it wouldn't hurt to cross-check the logprobs against vLLM again, just in case.
@am17an All tests here are with
Agree, let's merge this. I feel the chunked version would be useful for large
@IMbackK might verify perf and correctness on HIP, especially compared to the current SMEM approach (note this PR targets the `gg/llama-allow-gdn-ch` branch).
We can merge this PR into `gg/llama-allow-gdn-ch`.
I used the new `llama-results` tool. I did it like this:

```shell
# go to baseline
git co 0cd4f4720b71dd7eb5fb3e3e86ffdd8ec5ac7c9f
make -j

# some random text for prompt
PROMPT="The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum."

# dump logits with bs=1 and bs>1 to exercise the 2 paths
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub1.gguf -p "$PROMPT" -ub 1
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 512

# compare logits between the 2 to get a sense of the expected variance
# i.e. evaluate the prompt using `ub=1` and compare to the result with `ub=512`
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 1 --check
# NMSE=1.641e-03

# checkout and build this branch
git-pr 20391
make -j

# compare the logits
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub1.gguf -p "$PROMPT" -ub 1 --check
# NMSE=2.278e-03
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 512 --check
# NMSE=5.088e-04
```

So I think it is safe to merge this. Let's wait for @IMbackK to confirm CDNA is good.
Head sizes filling at least one CDNA warp currently fail:
```cpp
                                int64_t neqk1, int64_t rq3,
                                float scale, cudaStream_t stream) {
    //TODO: Add chunked kernel for even faster pre-fill
    constexpr uint32_t warp_size = ggml_cuda_get_physical_warp_size();
```
ggml_cuda_get_physical_warp_size() is not valid in host code; you need to get the warp size from the device info struct.
Thanks for the patch! Please re-test when you have the time.
warp_size is not known at compile time in HIP host code.
Force-pushed 7a3da50 to 226a1ac (Get warp size at runtime).
IMbackK
left a comment
OK, it's correct now, and performance looks good too:
Master:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 @ d32000 | 58.02 ± 0.11 |
Pr:
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)
| model | size | params | backend | ngl | fa | test | t/s |
|---|---|---|---|---|---|---|---|
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm | 99 | 1 | tg128 @ d32000 | 61.03 ± 0.78 |
Thanks for checking perf also!
@IMbackK Could you also post the PP performance between master and this PR?
It's consistently faster or on par.
2048 is actually a crossover point, but the slowdown is very mild, even at very large batch sizes.
OK, thanks. @ORippler feel free to merge this into the `gg/llama-allow-gdn-ch` branch.
ggerganov merged commit d1b2301 into ggml-org:gg/llama-allow-gdn-ch
Squashed commit log:

* llama : enable chunked fused GDN path
* models : avoid Q and K repeats when using fused GDA
* cont : fix comment
* cont : fix the fix
* cont : fix
* metal : add GDN kernel (#20361)
  * metal : add Metal backend for GGML_OP_GATED_DELTA_NET. Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU. Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%)
  * metal : validate contiguity of all input tensors in supports_op
  * metal : add algorithm equivalence comment for GDA decay path
  * cont : unslop + optimize
  * cont : clean-up
* CUDA: AR gated delta net improvements (#20391)
  * Add FastDiv to gated_delta_net_cuda
  * Shard columns across warps. This reduces register pressure (avoids spill for S_v = 128) and gives the warp scheduler more CTAs to schedule (thus hiding data-access latencies).
  * Remove unneeded include in gated_delta_net.cu
  * Improve comments
  * Apply code formatting
  * Make sharding HIP-compatible: 1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly 2. Add test with partial warp to test sum reduction on CUDA
  * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
  * Rename variables
  * Enable GDN also for prefill, move TODO for chunked GDN
  * Actually remove the TODO from 2068908
  * Get warp size at runtime (warp_size is not known at compile time in HIP host code)
  * Don't expose ggml_cuda_get_physical_warp_size on host
* llama : refactor llm_build_delta_net_base API

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
I profiled the AR gated delta net and improved perf by:

1. Adding FastDiv to gated_delta_net_cuda
2. Sharding columns across warps, which reduces register pressure (avoids spill for S_v = 128) and gives the warp scheduler more CTAs to schedule (thus hiding data-access latencies)
I see we have since added SMEM support for CDNA in #20366; it might be worth seeing whether sharding makes sense there too, to get rid of spills. We could also process more than one column per warp if we don't see perf gains on lower-tier GPUs with fewer SMs.