CUDA: AR gated delta net improvements #20391

Merged
ggerganov merged 13 commits into ggml-org:gg/llama-allow-gdn-ch from ORippler:osimons/gated_delta_net_improvements
Mar 11, 2026

Conversation

@ORippler
Collaborator

@ORippler ORippler commented Mar 11, 2026

I profiled the AR gated delta net, and improved perf by:

  1. Adding fastdiv/fastrem for s64 int (do we even need this arithmetic to happen in 64-bit?)
  2. Sharding a column across a full warp instead of using only a single thread. We don't fill the SMs (at least on higher-tier GPUs) with the existing launch config (I saw 16-32 CTAs with low thread counts vs. 80+ SMs on e.g. a 5080), so that was some free perf, while also reducing register pressure in the S_v = 128 case (I saw some spill there)
GGML_CUDA=ON ./scripts/compare-commits.sh master osimons/gated_delta_net_improvements llama-bench -m /mnt/share/gguf/Qwen/Qwen3-Coder-Next-GGUF/Qwen3-Coder-Next-Q4_K_M/Qwen3-Coder-Next-Q4_K_M-00001-of-00004.gguf -m /mnt/share/gguf/bartowski/Qwen_Qwen3.5-0.8B-GGUF/Qwen_Qwen3.5-0.8B-Q8_0.gguf -mmp 0 -dio 1 -fa 1

+ ./scripts/compare-llama-bench.py -b master -c osimons/gated_delta_net_improvements --tool llama-bench -i llama-bench.sqlite
| Model                    | Test   |   t/s master |   t/s osimons/gated_delta_net_improvements |   Speedup |
|:-------------------------|:-------|-------------:|-------------------------------------------:|----------:|
| qwen35 0.8B Q8_0         | tg128  |       460.86 |                                     465.81 |      1.01 |
| qwen3next 80B.A3B Q4_K_M | tg128  |       138.64 |                                     141.66 |      1.02 |

I see we have since added SMEM support for CDNA in #20366; might be worth seeing if sharding makes sense there too, to get rid of the spills. We could also process more than one column per warp if we don't see perf gains on lower-tier GPUs with fewer SMs.

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Mar 11, 2026
@ggerganov
Member

How does the PP perf look if you enable the new kernel for larger batches:

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 8b9330d63..d709007d3 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -5001,7 +5001,7 @@ static bool ggml_backend_cuda_device_supports_op(ggml_backend_dev_t dev, const g
 #else
             // KDA is faster using the AR kernel even when n_tokens >= 512
             //TODO: Add chunked kernel
-            return op->src[0]->ne[2] == 1 || op->src[3]->ne[0] == op->src[2]->ne[0];
+            return true;
 #endif // GGML_USE_MUSA
         case GGML_OP_FLASH_ATTN_EXT:
             return ggml_cuda_flash_attn_ext_supported(dev_ctx->device, op);

In the Metal backend, using this implementation, it is good both for TG and PP so I plan to enable it for both.

@ggerganov
Member

I see we have since added SMEM support for CDNA in #20366; might be worth seeing if sharding makes sense there too, to get rid of the spills.

Yes, my guess is that this will eliminate the need for using the shared mem for CDNA.

@ggerganov
Member

How does the PP perf look if you enable the new kernel for larger batches:

FWIW, I'm seeing PP improvements both on DGX Spark and RTX 5090, using the new sharded AR kernel:

  • DGX Spark:
| Model                      | Test   | t/s gg/llama-allow-gdn-ch | t/s pr + patch | Speedup |
|:---------------------------|:-------|--------------------------:|---------------:|--------:|
| kimi-linear 48B.A3B Q4_K_M | pp512  |                   1533.28 |        1698.87 |    1.11 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 |                   2100.06 |        2396.93 |    1.14 |
| qwen35 27B Q4_K_M          | pp512  |                    663.65 |         786.76 |    1.19 |
| qwen35 27B Q4_K_M          | pp2048 |                    601.05 |         724.44 |    1.21 |
| qwen3next 80B.A3B Q4_0     | pp512  |                   1186.95 |        1332.82 |    1.12 |
| qwen3next 80B.A3B Q4_0     | pp2048 |                   1373.22 |        1651.27 |    1.20 |
  • RTX 5090:
| Model                      | Test   | t/s gg/llama-allow-gdn-ch | t/s pr + patch | Speedup |
|:---------------------------|:-------|--------------------------:|---------------:|--------:|
| kimi-linear 48B.A3B Q4_K_M | pp512  |                   5224.63 |        6508.49 |    1.25 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 |                   5626.62 |        9369.98 |    1.67 |
| qwen35moe 35B.A3B Q4_K_M   | pp512  |                   5932.08 |        7009.35 |    1.18 |
| qwen35moe 35B.A3B Q4_K_M   | pp2048 |                   8088.89 |        9109.19 |    1.13 |

@IMbackK
Collaborator

IMbackK commented Mar 11, 2026

CDNA warps are 64 wide. We generally don't want to use WARP_SIZE (which is just hard-coded to 32) in new code, but rather ggml_cuda_get_physical_warp_size

Contributor

@am17an am17an left a comment


We can remove the fast_div_64 stuff for now

@am17an
Contributor

am17an commented Mar 11, 2026

@ggerganov can you try 8ea2990 on your devices? On a 5090 I get ~25% speedup using this kernel, but not as much as on a 4090. Note that it doesn't support KDA yet so just try the qwen models

1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA
@ORippler ORippler requested a review from ggerganov as a code owner March 11, 2026 14:32
@github-actions github-actions bot added the testing Everything test related label Mar 11, 2026
@ORippler
Collaborator Author

CDNA warps are 64 wide. We generally don't want to use WARP_SIZE (which is just hard-coded to 32) in new code, but rather ggml_cuda_get_physical_warp_size

This should work now with 0211798.

@ggerganov
Member

@am17an Here are the results between 8ea2990 and this PR with PP enabled:

  • DGX Spark
| Model                      | Test   | t/s 8ea2990 | t/s pr + patch | Speedup |
|:---------------------------|:-------|------------:|---------------:|--------:|
| kimi-linear 48B.A3B Q4_K_M | pp512  |     1534.09 |        1674.96 |    1.09 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 |     2105.79 |        2414.34 |    1.15 |
| qwen35 27B Q4_K_M          | pp512  |      747.65 |         777.76 |    1.04 |
| qwen35 27B Q4_K_M          | pp2048 |      692.72 |         716.31 |    1.03 |
| qwen3next 80B.A3B Q4_0     | pp512  |     1323.21 |        1335.70 |    1.01 |
| qwen3next 80B.A3B Q4_0     | pp2048 |     1624.30 |        1635.22 |    1.01 |
  • RTX 5090
| Model                      | Test   | t/s 8ea2990 | t/s pr + patch | Speedup |
|:---------------------------|:-------|------------:|---------------:|--------:|
| kimi-linear 48B.A3B Q4_K_M | pp512  |     5221.50 |        6507.67 |    1.25 |
| kimi-linear 48B.A3B Q4_K_M | pp2048 |     5611.91 |        9368.58 |    1.67 |
| qwen35moe 35B.A3B Q4_K_M   | pp512  |     6968.99 |        7013.34 |    1.01 |
| qwen35moe 35B.A3B Q4_K_M   | pp2048 |     9890.69 |        9116.13 |    0.92 |

For Q3.5 on RTX 5090 your version has an edge.

@am17an
Contributor

am17an commented Mar 11, 2026

Yeah, my guess is it will scale with the -ub parameter, since the AR kernel loops over all tokens in the batch.

Member

@ggerganov ggerganov left a comment


IMO merging the sharded version and enabling it for all batch sizes is a good start, because it dramatically reduces the number of nodes in the ggml graph and makes the graph topology constant regardless of batch size. We can build the chunked kernels on top of it.

Just need to double check the correctness of the computation. PPL looks OK, but it wouldn't hurt to cross check the logprobs against vLLM again, just in case.

@ggerganov
Member

@am17an All tests here are with -ub 2048.

@am17an
Contributor

am17an commented Mar 11, 2026

Agree, let's merge this. I feel the chunked version would mainly be useful for large ubatch sizes, which are mostly used for training rather than inference (since the default -ub is 512 and remains so regardless of context size). In my tests, perhaps unsurprisingly, the current chunked version maintains its speed at larger ub sizes while the AR kernel does not, but that shouldn't matter for most users.

| Model         | Microbatch size | Test    | t/s c96f608 | t/s gated_delta_net_chunk | Speedup |
|:--------------|----------------:|:--------|------------:|--------------------------:|--------:|
| qwen35 9B Q8_0 |             512 | pp32768 |     8813.84 |                  10677.44 |    1.21 |
| qwen35 9B Q8_0 |            1024 | pp32768 |     9531.17 |                  11521.71 |    1.21 |
| qwen35 9B Q8_0 |            2048 | pp32768 |     9228.76 |                  11747.14 |    1.27 |
| qwen35 9B Q8_0 |            4096 | pp32768 |     9207.69 |                  11718.04 |    1.27 |
| qwen35 9B Q8_0 |            8192 | pp32768 |     9194.76 |                  11699.47 |    1.27 |
| qwen35 9B Q8_0 |           16384 | pp32768 |     9192.71 |                  11695.46 |    1.27 |
| qwen35 9B Q8_0 |           32768 | pp32768 |     9184.83 |                  11682.84 |    1.27 |

@ORippler
Collaborator Author

@IMbackK could you verify perf and correctness on HIP, especially compared to the current SMEM approach? (Note this PR targets gg/llama-allow-gdn-ch and not master, so there would be another round of conflicts to resolve should we wish to push this into master directly.)

@ggerganov
Member

there would be another round of conflicts to resolve should we wish to push this into master directly

We can merge this PR into gg/llama-allow-gdn-ch. And after that we can merge gg/llama-allow-gdn-ch into master.

@ggerganov
Member

ggerganov commented Mar 11, 2026

I used the new llama-results tool that @JohannesGaessler implemented to compare the results of this branch against the commit right before gg/llama-allow-gdn-ch, and it looks like they match.

I did like this:

# go to baseline
git co 0cd4f4720b71dd7eb5fb3e3e86ffdd8ec5ac7c9f
make -j

# some random text for prompt
PROMPT="The above example is using an intermediate build b5046 of the library. This can be modified to use a different version by changing the URL and checksum."

# dump logits with bs=1 and bs>1 to exercise the 2 paths
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub1.gguf   -p "$PROMPT" -ub 1
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 512

# compare logits between the 2 to get a sense of the expected variance that we can expect
# i.e. evaluate the prompt using `ub=1` and compare to the result with `ub=512`
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 1   --check
NMSE=1.641e-03

# checkout and build this branch
git-pr 20391
make -j

# compare the logits
./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub1.gguf   -p "$PROMPT" -ub 1   --check
NMSE=2.278e-03

./bin/llama-results -m ~/models/qwen3-next-q4_0.gguf --output logits-ub512.gguf -p "$PROMPT" -ub 512 --check
NMSE=5.088e-04

So I think it is safe to merge this.

Let's wait for @IMbackK to confirm CDNA is good.

@IMbackK
Collaborator

IMbackK commented Mar 11, 2026

Head sizes that fill at least one CDNA warp currently fail:

[GATED_DELTA_NET] ERR = 1.659305311 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=32,head_size=128,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): FAIL
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=0): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=1,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=32,head_size=16,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
[GATED_DELTA_NET] ERR = 1.409794362 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=16,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=0): FAIL
[GATED_DELTA_NET] ERR = 1.214608885 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=0): FAIL
[GATED_DELTA_NET] ERR = 0.955333196 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=0): FAIL
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=0): OK
[GATED_DELTA_NET] ERR = 1.128912112 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=0): FAIL
[GATED_DELTA_NET] ERR = 1.380463547 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=1,kda=0): FAIL
[GATED_DELTA_NET] ERR = 1.905914609 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=1,v_repeat=1,permuted=0,kda=1): FAIL
[GATED_DELTA_NET] ERR = 1.479928983 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): FAIL
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=1,n_seqs=2,v_repeat=1,permuted=0,kda=1): OK
  GATED_DELTA_NET(type=f32,head_count=4,head_size=32,n_seq_tokens=4,n_seqs=1,v_repeat=1,permuted=0,kda=1): OK
[GATED_DELTA_NET] ERR = 1.747710940 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=0,kda=1): FAIL
  GATED_DELTA_NET(type=f32,head_count=8,head_size=32,n_seq_tokens=4,n_seqs=2,v_repeat=2,permuted=0,kda=1): OK
[GATED_DELTA_NET] ERR = 1.361838023 > 0.000000100   GATED_DELTA_NET(type=f32,head_count=4,head_size=64,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): FAIL
  GATED_DELTA_NET(type=f32,head_count=4,head_size=16,n_seq_tokens=4,n_seqs=2,v_repeat=1,permuted=1,kda=1): OK

int64_t neqk1, int64_t rq3,
float scale, cudaStream_t stream) {
//TODO: Add chunked kernel for even faster pre-fill
constexpr uint32_t warp_size = ggml_cuda_get_physical_warp_size();
Collaborator

@IMbackK IMbackK Mar 11, 2026


ggml_cuda_get_physical_warp_size is not valid in host code you need to get the warp size from the device info struct

Collaborator Author


Thanks for the patch! Please re-test when you have the time

warp_size is not known at compile time in hip host code.
@IMbackK
Collaborator

IMbackK commented Mar 11, 2026

ORippler#1

@ORippler ORippler force-pushed the osimons/gated_delta_net_improvements branch from 7a3da50 to 226a1ac Compare March 11, 2026 17:44
Collaborator

@IMbackK IMbackK left a comment


Ok, it's correct now and performance looks good too:

Master:
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)

| model                  | size      | params  | backend | ngl | fa | test           | t/s          |
|:-----------------------|----------:|--------:|:--------|----:|---:|:---------------|-------------:|
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm    |  99 |  1 | tg128 @ d32000 | 58.02 ± 0.11 |

PR:
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 32752 MiB):
Device 0: AMD Instinct MI100, gfx908:sramecc+:xnack- (0x908), VMM: no, Wave Size: 64, VRAM: 32752 MiB (32724 MiB free)

| model                  | size      | params  | backend | ngl | fa | test           | t/s          |
|:-----------------------|----------:|--------:|:--------|----:|---:|:---------------|-------------:|
| qwen35moe 35B.A3B Q8_0 | 28.21 GiB | 34.66 B | ROCm    |  99 |  1 | tg128 @ d32000 | 61.03 ± 0.78 |

@ORippler
Collaborator Author

Thanks for checking perf also!

@ggerganov
Member

@IMbackK Could you also post the PP performance between master and this PR?

@IMbackK
Collaborator

IMbackK commented Mar 11, 2026

It's consistently faster or on par:

| Model                  | Microbatch size | Test   | t/s master | t/s gated_delta_net_improvements | Speedup |
|:-----------------------|----------------:|:-------|-----------:|---------------------------------:|--------:|
| qwen35moe 35B.A3B Q8_0 |              64 | pp4096 |     338.57 |                           346.96 |    1.02 |
| qwen35moe 35B.A3B Q8_0 |             256 | pp4096 |     655.80 |                           660.60 |    1.01 |
| qwen35moe 35B.A3B Q8_0 |             512 | pp4096 |     953.99 |                           963.31 |    1.01 |
| qwen35moe 35B.A3B Q8_0 |            1024 | pp4096 |    1253.56 |                          1258.87 |    1.00 |
| qwen35moe 35B.A3B Q8_0 |            2048 | pp4096 |    1470.79 |                          1477.46 |    1.00 |

@IMbackK
Collaborator

IMbackK commented Mar 11, 2026

2048 is actually a crossover point, but the slowdown is very mild, even at very large batch sizes.

@ggerganov
Member

Ok thanks. @ORippler feel free to merge this into the gg/llama-allow-gdn-ch branch.

@ggerganov ggerganov merged commit d1b2301 into ggml-org:gg/llama-allow-gdn-ch Mar 11, 2026
72 of 74 checks passed
ggerganov added a commit that referenced this pull request Mar 11, 2026
* llama : enable chunked fused GDN path

* models : avoid Q and K repeats when using fused GDA

* cont : fix comment

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix the fix

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix

* metal : add GDN kernel (#20361)

* metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
  tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cont : unslop + optimize

* cont : clean-up

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* CUDA: AR gated delta net improvements (#20391)

* Add FastDiv to gated_delta_net_cuda

* Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

* Remove unneded include in gated_delta_net.cu

* Improve comments

* Apply code-formating

* Make sharding HIP-compatible

1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA

* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t

* Rename variables

* Enable GDN also for prefill, move TODO for chunked_GDN

* Actually remove the TODO from 2068908

* Get warp size at runtime

warp_size is not known at compile time in hip host code.

* Don't expose ggml_cuda_get_physical_warp_size on host

---------

Co-authored-by: uvos <devnull@uvos.xyz>

* llama : refactor llm_build_delta_net_base API

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
@ORippler ORippler deleted the osimons/gated_delta_net_improvements branch March 12, 2026 08:11