vulkan: Implement topk_moe fused shader, ported from CUDA by jeffbolznv · Pull Request #16641 · ggml-org/llama.cpp

jeffbolznv · 2025-10-17T20:14:12Z

This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.
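For readers unfamiliar with the approach, here is a minimal CPU sketch of top-k expert selection done with per-lane registers and a subgroup-style reduction instead of shared memory. It is not the shader's code: the function names (`subgroup_argmax`, `topk_moe`), the lane-to-expert layout, and the tie-breaking rule are all assumptions made for illustration, and the subgroup size is assumed to be a power of two.

```python
import math


def subgroup_argmax(values, indices):
    """Simulated butterfly reduction: returns the max value and its index.

    Mimics a shuffle-xor pattern across lanes; len(values) is the
    (hypothetical) subgroup size and is assumed to be a power of two.
    """
    n = len(values)
    vals, idxs = list(values), list(indices)
    offset = n // 2
    while offset > 0:
        new_vals, new_idxs = list(vals), list(idxs)
        for lane in range(n):
            partner = lane ^ offset
            pv, pi = vals[partner], idxs[partner]
            # Prefer the larger value; break ties toward the lower expert index.
            if pv > vals[lane] or (pv == vals[lane] and pi < idxs[lane]):
                new_vals[lane], new_idxs[lane] = pv, pi
        vals, idxs = new_vals, new_idxs
        offset //= 2
    return vals[0], idxs[0]


def topk_moe(logits, k, subgroup_size=32):
    """Softmax over expert logits, then pick the k largest probabilities."""
    n_experts = len(logits)
    per_lane = (n_experts + subgroup_size - 1) // subgroup_size

    # Softmax. On a GPU the max and sum would be subgroup-wide reductions;
    # plain Python is used here to keep the sketch short.
    row_max = max(logits)
    exps = [math.exp(x - row_max) for x in logits]
    total = sum(exps)

    # Lane l keeps experts l, l + subgroup_size, ... in its own "registers".
    # Unused slots stay at -inf so they are never selected.
    regs = [[-math.inf] * per_lane for _ in range(subgroup_size)]
    for e, v in enumerate(exps):
        regs[e % subgroup_size][e // subgroup_size] = v / total

    ids, weights = [], []
    for _ in range(k):
        # Each lane scans its own registers for its best remaining candidate.
        lane_vals, lane_ids = [], []
        for lane in range(subgroup_size):
            slot = max(range(per_lane), key=lambda s: regs[lane][s])
            lane_vals.append(regs[lane][slot])
            lane_ids.append(lane + slot * subgroup_size)
        best_val, best_id = subgroup_argmax(lane_vals, lane_ids)
        ids.append(best_id)
        weights.append(best_val)
        # The owning lane masks the winner so it cannot be picked again.
        regs[best_id % subgroup_size][best_id // subgroup_size] = -math.inf
    return ids, weights
```

With `logits = [0.0, 2.0, 1.0, 3.0, -1.0, 0.5, 2.5, 1.5]` and `k = 2`, `topk_moe` selects expert ids `[3, 6]` whether `subgroup_size` is 4, 8, 32, or 64, since only the number of registers per lane changes. Any renormalization of the selected weights is left out of this sketch.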

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |       260.12 ± 24.04 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |        328.18 ± 9.83 |

build: 66b0dbcb2 (6791)

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -fa 1 -n 128 -p 0 -r 10 --prio 1 -m c:\models\Qwen_Qwen3-30B-A3B-Q2_K.gguf -m c:\models\\deepseek-v2-lite-safetensors\deepseek-v2-lite-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q2_K - Medium |  10.15 GiB |    30.53 B | Vulkan     |  99 |  1 |           tg128 |        285.47 ± 7.80 |
| deepseek2 16B Q4_K - Medium    |   9.65 GiB |    15.71 B | Vulkan     |  99 |  1 |           tg128 |       339.16 ± 16.07 |

build: e0f7fa913 (6792)
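To put the two tables side by side, the relative change in tg128 throughput can be worked out from the reported means. The helper name `speedup_percent` below is made up for this note; the numbers are copied from the before and after runs above.

```python
def speedup_percent(before_tps, after_tps):
    """Relative throughput change in percent (positive means faster)."""
    return (after_tps / before_tps - 1.0) * 100.0


# Mean tg128 tokens/s from the before and after tables.
results = {
    "qwen3moe 30B.A3B Q2_K": (260.12, 285.47),
    "deepseek2 16B Q4_K": (328.18, 339.16),
}

for model, (before, after) in results.items():
    print(f"{model}: {speedup_percent(before, after):+.1f}%")
```

This prints roughly +9.7% for qwen3moe and +3.3% for deepseek2. The deepseek2 difference (about 11 t/s) is smaller than the ± 16.07 reported for the after run, so that figure should be read with some caution.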

0cc4m

LGTM

vulkan: Implement topk_moe fused shader, ported from CUDA

e0f7fa9

This is similar to the CUDA shader from ggml-org#16130, but doesn't use shared memory and handles different subgroup sizes.

jeffbolznv requested review from 0cc4m, ggerganov and slaren as code owners October 17, 2025 20:14

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels Oct 17, 2025

jeffbolznv mentioned this pull request Oct 17, 2025

CUDA: add a fused top-K MoE kernel #16130

Merged

am17an added a commit to am17an/llama.cpp that referenced this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe

06cd6bd

Uses the technique used in the vulkan PR ggml-org#16641. Neat trick!

am17an mentioned this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe #16647

Merged

JohannesGaessler pushed a commit that referenced this pull request Oct 18, 2025

CUDA: use registers instead of smem in topk-moe (#16647)

38355c6

Uses the technique used in the vulkan PR #16641. Neat trick!

0cc4m approved these changes Oct 18, 2025

0cc4m merged commit e56abd2 into ggml-org:master Oct 18, 2025
69 of 70 checks passed

pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 23, 2025

CUDA: use registers instead of smem in topk-moe (ggml-org#16647)

ac040c3

Uses the technique used in the vulkan PR ggml-org#16641. Neat trick!

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026

CUDA: use registers instead of smem in topk-moe (ggml-org#16647)

d0ade76

Uses the technique used in the vulkan PR ggml-org#16641. Neat trick!

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

CUDA: use registers instead of smem in topk-moe (#16647)

f2d0793

Uses the technique used in the vulkan PR #16641. Neat trick!

blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

vulkan: Implement topk_moe fused shader, ported from CUDA (#16641)

a289949

This is similar to the CUDA shader from #16130, but doesn't use shared memory and handles different subgroup sizes.

0cc4m merged 1 commit into ggml-org:master from jeffbolznv:topk
