
CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) #19126

Merged
am17an merged 3 commits into ggml-org:master from am17an:topk-cuda-refactor on Jan 29, 2026

Conversation

@am17an (Contributor) commented Jan 27, 2026

Refactor topk-moe to enable various combinations of the topk-moe ops. Hopefully this will cover most models. I removed some templates from the code and kept only the bias template, because it has an extra warp shuffle; the rest of the template code does not provide any significant speedup.
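For readers unfamiliar with the op sequence being fused: the routing can be sketched on the CPU roughly as below. This is a hedged reference sketch of the logic, not the CUDA kernel; `topk_moe_select` and its signature are hypothetical, and the bias handling follows DeepSeek-style routing where the bias shifts the selection score only (this extra selection score is what costs the kernel one more warp shuffle).

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// CPU sketch of topk-moe routing (hypothetical helper, not the ggml code):
// softmax the expert logits, optionally add a per-expert bias used only for
// selection, then return the indices of the top-k experts.
std::vector<int> topk_moe_select(const std::vector<float> & logits,
                                 const std::vector<float> * bias, // may be null
                                 int k) {
    const int n_expert = (int) logits.size();

    // softmax over the expert logits
    const float maxv = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(n_expert);
    float sum = 0.0f;
    for (int i = 0; i < n_expert; i++) {
        probs[i] = std::exp(logits[i] - maxv);
        sum += probs[i];
    }
    for (float & p : probs) { p /= sum; }

    // the bias shifts the *selection* score only; in the kernel this extra
    // score is the reason the bias variant needs one more warp shuffle
    std::vector<float> sel = probs;
    if (bias) {
        for (int i = 0; i < n_expert; i++) { sel[i] += (*bias)[i]; }
    }

    // argsort by selection score and keep the top-k indices
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return sel[a] > sel[b]; });
    idx.resize(k);
    return idx;
}
```

The fused kernel performs the equivalent of this plus the weight normalization in one pass, instead of separate softmax/argsort/view nodes in the graph.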

3090

| Model | Test | t/s master | t/s topk-cuda-refactor | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 ?B Q4_K_M | pp512 | 3096.25 | 3107.03 | 1.00 |
| deepseek2 ?B Q4_K_M | tg128 | 124.49 | 132.89 | 1.07 |
| gpt-oss 20B MXFP4 MoE | pp512 | 4773.87 | 4751.48 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 211.17 | 210.60 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 3707.88 | 3682.90 | 0.99 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 190.88 | 190.27 | 1.00 |

4090

| Model | Test | t/s master | t/s topk-cuda-refactor | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 ?B Q4_K_M | pp512 | 6447.25 | 6503.07 | 1.01 |
| deepseek2 ?B Q4_K_M | tg128 | 162.90 | 175.63 | 1.08 |
| gpt-oss 20B MXFP4 MoE | pp512 | 9820.61 | 9815.71 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 264.89 | 264.37 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 7827.77 | 7821.77 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 253.11 | 252.62 | 1.00 |

5090

| Model | Test | t/s master | t/s topk-cuda-refactor | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 ?B Q4_K_M | pp512 | 6512.76 | 6582.78 | 1.01 |
| deepseek2 ?B Q4_K_M | tg128 | 170.66 | 187.88 | 1.10 |
| gpt-oss 20B MXFP4 MoE | pp512 | 13815.72 | 13798.87 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 384.08 | 383.34 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 7990.44 | 8005.32 | 1.00 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 285.53 | 284.53 | 1.00 |

@github-actions bot added labels "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Jan 27, 2026
```cpp
ggml_tensor * get_rows = cgraph->nodes[node_idx + 4];
ggml_tensor * argsort  = cgraph->nodes[node_idx + 2];
int n_expert = cgraph->nodes[node_idx]->src[0]->ne[0];
node_idx++;
```
Contributor
Generally speaking I'm not a fan of changing function arguments in the body (though in this case I think it's still fine).
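The "mutate `node_idx` as you match" style under discussion can be illustrated with a minimal mock, assuming simplified node structs (the names `node_t`, `match_softmax_reshape`, and the enum are hypothetical; ggml's real structs and op enums differ):

```cpp
#include <vector>

// Minimal mock of the graph-walking pattern used to detect a fusable op
// sequence. The index is advanced past each matched node, which is the
// in-body argument mutation the review comment refers to.
enum op_t { OP_SOFTMAX, OP_RESHAPE, OP_OTHER };

struct node_t {
    op_t     op;
    node_t * src0 = nullptr;
};

// Returns true if nodes[node_idx] starts a SOFTMAX -> RESHAPE chain where
// the RESHAPE consumes the SOFTMAX output; on success node_idx points past
// the matched nodes.
bool match_softmax_reshape(const std::vector<node_t *> & nodes, int & node_idx) {
    const int n_nodes = (int) nodes.size();
    if (node_idx >= n_nodes || nodes[node_idx]->op != OP_SOFTMAX) {
        return false;
    }
    node_idx++;
    if (node_idx >= n_nodes || nodes[node_idx]->op != OP_RESHAPE ||
            nodes[node_idx]->src0 != nodes[node_idx - 1]) {
        return false;
    }
    node_idx++;
    return true;
}
```

The trade-off is readability: a caller-visible index parameter that moves makes the sequential "consume one node, check the next" checks short, at the cost of the argument no longer meaning what it did at entry.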

```cpp
if (args.sigmoid || args.softmax) {
    // SOFTMAX -> RESHAPE
    if (node_idx >= n_nodes || nodes[node_idx]->op != GGML_OP_RESHAPE ||
        nodes[node_idx]->src[0] != nodes[node_idx - 1]) {
```
Contributor
Suggested change:

```diff
-    nodes[node_idx]->src[0] != nodes[node_idx - 1]) {
+        nodes[node_idx]->src[0] != nodes[node_idx - 1]) {
```

I think it makes sense to indent by 8 spaces rather than 4 to make the logic visually distinct from the code that would oftentimes follow on this line.

Contributor Author
I ran this through clang-format and it does this, though I agree with you.

@am17an force-pushed the topk-cuda-refactor branch 2 times, most recently from 3ad63db to eeb9b04, on January 28, 2026 12:54
@am17an am17an merged commit 3bcc990 into ggml-org:master Jan 29, 2026
79 of 85 checks passed
@am17an am17an deleted the topk-cuda-refactor branch January 29, 2026 07:15
4b1tQu4ntN3k0 pushed a commit to 4b1tQu4ntN3k0/llama.cpp that referenced this pull request Feb 2, 2026
shaofeiqi pushed a commit to qualcomm/llama.cpp that referenced this pull request Feb 6, 2026

2 participants