
CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) #19126

Merged
am17an merged 3 commits into ggml-org:master from am17an:topk-cuda-refactor on Jan 29, 2026

Conversation

@am17an (Contributor) commented Jan 27, 2026

Refactor topk-moe to enable various combinations of the topk-moe ops. Hopefully this will cover most models. I removed some templates from the code and kept only the bias template, because it has an extra warp shuffle; the rest of the template code does not provide any significant speedup.
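For readers unfamiliar with the op sequence being fused: the routing can be sketched on the CPU roughly as below. This is a hedged reference sketch of the logic, not the CUDA kernel; `topk_moe_select` and its signature are hypothetical, and the bias handling follows DeepSeek-style routing where the bias shifts the selection score only (this extra selection score is what costs the kernel one more warp shuffle).

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// CPU sketch of topk-moe routing (hypothetical helper, not the ggml code):
// softmax the expert logits, optionally add a per-expert bias used only for
// selection, then return the indices of the top-k experts.
std::vector<int> topk_moe_select(const std::vector<float> & logits,
                                 const std::vector<float> * bias, // may be null
                                 int k) {
    const int n_expert = (int) logits.size();

    // softmax over the expert logits
    const float maxv = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_expert);
    float sum = 0.0f;
    for (int i = 0; i < n_expert; i++) {
        probs[i] = std::exp(logits[i] - maxv);
        sum += probs[i];
    }
    for (float & p : probs) { p /= sum; }

    // the bias shifts the *selection* score only; in the kernel this extra
    // score is the reason the bias variant needs one more warp shuffle
    std::vector<float> sel = probs;
    if (bias) {
        for (int i = 0; i < n_expert; i++) { sel[i] += (*bias)[i]; }
    }

    // argsort by selection score and keep the top-k indices
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return sel[a] > sel[b]; });
    idx.resize(k);
    return idx;
}
```

The fused kernel performs the equivalent of this plus the weight normalization in one pass, instead of separate softmax/argsort/view nodes in the graph.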

3090

| Model | Test | t/s master | t/s topk-cuda-refactor | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 ?B Q4_K_M | pp512 | 3096.25 | 3107.03 | 1.00 |
| deepseek2 ?B Q4_K_M | tg128 | 124.49 | 132.89 | 1.07 |
| gpt-oss 20B MXFP4 MoE | pp512 | 4773.87 | 4751.48 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 211.17 | 210.60 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 3707.88 | 3682.90 | 0.99 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 190.88 | 190.27 | 1.00 |

4090

| Model | Test | t/s master | t/s topk-cuda-refactor | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 ?B Q4_K_M | pp512 | 6447.25 | 6503.07 | 1.01 |
| deepseek2 ?B Q4_K_M | tg128 | 162.90 | 175.63 | 1.08 |
| gpt-oss 20B MXFP4 MoE | pp512 | 9820.61 | 9815.71 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 264.89 | 264.37 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 7827.77 | 7821.77 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 253.11 | 252.62 | 1.00 |

5090

| Model | Test | t/s master | t/s topk-cuda-refactor | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 ?B Q4_K_M | pp512 | 6512.76 | 6582.78 | 1.01 |
| deepseek2 ?B Q4_K_M | tg128 | 170.66 | 187.88 | 1.10 |
| gpt-oss 20B MXFP4 MoE | pp512 | 13815.72 | 13798.87 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 384.08 | 383.34 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 7990.44 | 8005.32 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 285.53 | 284.53 | 1.00 |

@github-actions bot added labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Jan 27, 2026
```cpp
ggml_tensor * get_rows = cgraph->nodes[node_idx + 4];
ggml_tensor * argsort  = cgraph->nodes[node_idx + 2];
int n_expert = cgraph->nodes[node_idx]->src[0]->ne[0];
node_idx++;
```
Contributor
Generally speaking I'm not a fan of changing function arguments in the body (though in this case I think it's still fine).
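The "mutate `node_idx` as you match" style under discussion can be illustrated with a minimal mock, assuming simplified node structs (the names `node_t`, `match_softmax_reshape`, and the enum are hypothetical; ggml's real structs and op enums differ):

```cpp
#include <vector>

// Minimal mock of the graph-walking pattern used to detect a fusable op
// sequence. The index is advanced past each matched node, which is the
// in-body argument mutation the review comment refers to.
enum op_t { OP_SOFTMAX, OP_RESHAPE, OP_OTHER };

struct node_t {
    op_t     op;
    node_t * src0 = nullptr;
};

// Returns true if nodes[node_idx] starts a SOFTMAX -> RESHAPE chain where
// the RESHAPE consumes the SOFTMAX output; on success node_idx points past
// the matched nodes.
bool match_softmax_reshape(const std::vector<node_t *> & nodes, int & node_idx) {
    const int n_nodes = (int) nodes.size();
    if (node_idx >= n_nodes || nodes[node_idx]->op != OP_SOFTMAX) {
        return false;
    }
    node_idx++;
    if (node_idx >= n_nodes || nodes[node_idx]->op != OP_RESHAPE ||
            nodes[node_idx]->src0 != nodes[node_idx - 1]) {
        return false;
    }
    node_idx++;
    return true;
}
```

The trade-off is readability: a caller-visible index parameter that moves makes the sequential "consume one node, check the next" checks short, at the cost of the argument no longer meaning what it did at entry.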

```cpp
if (args.sigmoid || args.softmax) {
    // SOFTMAX -> RESHAPE
    if (node_idx >= n_nodes || nodes[node_idx]->op != GGML_OP_RESHAPE ||
        nodes[node_idx]->src[0] != nodes[node_idx - 1]) {
```
Contributor
Suggested change:

```diff
-    nodes[node_idx]->src[0] != nodes[node_idx - 1]) {
+        nodes[node_idx]->src[0] != nodes[node_idx - 1]) {
```

I think it makes sense to indent by 8 spaces rather than 4 to make the logic visually distinct from the code that would oftentimes follow on this line.

Contributor Author
I ran this through clang-format and it does this, though I agree with you.

@am17an force-pushed the topk-cuda-refactor branch 2 times, most recently from 3ad63db to eeb9b04, on January 28, 2026 12:54
@am17an am17an merged commit 3bcc990 into ggml-org:master Jan 29, 2026
79 of 85 checks passed
@am17an am17an deleted the topk-cuda-refactor branch January 29, 2026 07:15
4b1tQu4ntN3k0 pushed a commit to 4b1tQu4ntN3k0/llama.cpp that referenced this pull request Feb 2, 2026
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026

2 participants