CUDA: add a fused top-K MoE kernel#16130
Conversation
Force-pushed from 4d6c41a to 7345668
Some more performance numbers
Force-pushed from 324ecbb to 613b6c3
JohannesGaessler left a comment
I would recommend you cache the logits in registers from the start instead of reading the same data from VRAM twice.
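For illustration, the register-caching pattern being suggested might look roughly like this (a minimal sketch with assumed names and a one-warp-per-token layout, not the PR's actual code):

```cuda
// Sketch only: WARP_SIZE, the struct name, and the layout are assumptions.
#include <math.h>

#define WARP_SIZE 32

// One thread's slice of a token's logits, cached in registers.
template <int n_experts>
struct logit_cache {
    static constexpr int experts_per_thread = (n_experts + WARP_SIZE - 1) / WARP_SIZE;
    float vals[experts_per_thread];

    __device__ void load(const float * row_logits) {
#pragma unroll
        for (int i = 0; i < experts_per_thread; ++i) {
            const int e = threadIdx.x + i * WARP_SIZE;
            // single read from VRAM; later passes reuse vals[] instead
            vals[i] = e < n_experts ? row_logits[e] : -INFINITY;
        }
    }
};
```

Every later pass (max, exp/sum, top-k argmax) would then operate on `vals[]` in registers rather than re-reading the logits from global memory.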
Force-pushed from 613b6c3 to 17c9e7c
Did not see perf improvements after the changes; TG still improves by 6-7%.
ggml/src/ggml-cuda/ggml-cuda.cu (Outdated)
```cpp
//special case for topk-moe
if (ops.size() == 5 && ops.begin()[0] == GGML_OP_SOFT_MAX && ops.begin()[1] == GGML_OP_RESHAPE && ops.begin()[2] == GGML_OP_ARGSORT
    && ops.begin()[3] == GGML_OP_VIEW && ops.begin()[4] == GGML_OP_GET_ROWS) {

    for (int i = 0; i < 5; i++) {
        if (cgraph->nodes[node_idx + i]->op != ops.begin()[i]) return false;
    }
```
Isn't all this redundant since it is performed in ggml_can_fuse?
I think the logic of this function should be that you always apply ggml_can_fuse first and only then do special-cases.
Edit: got it, the RESHAPE and the VIEW are problematic in this case.
Yeah, as I mentioned in #16102 (comment), if I have to remove the empty ops before passing them to ggml_can_fuse then it would still be a special case.
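For context, the "strip the empty ops first" alternative would look something like the sketch below (a hypothetical helper, not the PR's code); even then, the caller has to know that RESHAPE/VIEW are the ops allowed to disappear, so it remains a special case:

```cpp
// Hypothetical sketch: collect the nodes of the candidate window while
// skipping the data-less RESHAPE/VIEW ops, so a generic ggml_can_fuse-style
// check could run on the remainder. The knowledge of *which* ops may be
// skipped is still pattern-specific.
#include <vector>
#include "ggml.h"
#include "ggml-impl.h"   // for struct ggml_cgraph

static std::vector<const ggml_tensor *> collect_non_empty_nodes(
        const ggml_cgraph * cgraph, int node_idx, int count) {
    std::vector<const ggml_tensor *> out;
    for (int i = 0; i < count && node_idx + i < cgraph->n_nodes; ++i) {
        const ggml_tensor * node = cgraph->nodes[node_idx + i];
        if (node->op == GGML_OP_RESHAPE || node->op == GGML_OP_VIEW) {
            continue; // these only reinterpret the data, they do no work
        }
        out.push_back(node);
    }
    return out;
}
```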
Force-pushed from 17c9e7c to 7a258bf
Hmm, looks like this causes an illegal memory access when running
Compile the CUDA code with
Yeah, I did that; however, I don't understand it yet. It seems like some tensor is not getting propagated downstream properly.
It looks like the dimensions are not right when using
Yes, there is no problem with calling
It does not crash when doing
Changing this line to use n_expert_used makes everything work: Line 545 in 8ba548d
Interestingly, if I move
I think the special-case path added in this PR does not have a bounds check for the number of nodes in the graph - could this be causing the illegal memory access?
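A bounds check of the kind being suggested might look like this (a sketch under the same assumptions as the snippet quoted earlier, with a hypothetical helper name, not the actual patch):

```cpp
// Sketch only: reject the candidate window when it would run past the end of
// the graph, before any of cgraph->nodes[node_idx + i] is dereferenced.
#include <initializer_list>
#include "ggml.h"
#include "ggml-impl.h"

static bool topk_moe_window_in_bounds(const ggml_cgraph * cgraph, int node_idx,
                                      std::initializer_list<ggml_op> ops) {
    return node_idx >= 0 && node_idx + (int) ops.size() <= cgraph->n_nodes;
}
```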
Added the bounds check and the problem is still there. Since it only happens on warmup and
Does it still happen if you keep the build forward expand and remove the new fusing logic?
No, it doesn't happen if I remove the fusing logic and keep the build forward expand.
Force-pushed from 7a258bf to a275f10
OK, the bug was not handling ties properly in the kernel; after fixing that it all works. I'm not exactly sure why, though.
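For what it's worth, deterministic tie handling in a warp argmax reduction usually means fixing the index preference, along these lines (a sketch assuming a lower-index-wins policy, not necessarily the exact fix in the kernel):

```cuda
// Sketch only: when two lanes hold equal values, always keep the lower expert
// index, so the fused kernel selects the same experts as the reference
// argsort-based path regardless of reduction order.
__device__ void warp_argmax_tiebreak(float & best_val, int & best_idx) {
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        const float other_val = __shfl_xor_sync(0xffffffff, best_val, offset);
        const int   other_idx = __shfl_xor_sync(0xffffffff, best_idx, offset);
        if (other_val > best_val || (other_val == best_val && other_idx < best_idx)) {
            best_val = other_val;
            best_idx = other_idx;
        }
    }
}
```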
Force-pushed from 4b2d2b9 to 639e954
Force-pushed from 639e954 to e772b28
Force-pushed from 941bc9e to 53acfe6
@JohannesGaessler will you merge once CI passes?
As described in
I don't have write access. Let me ping you once the CI passes to merge.
The CI failures don't seem related, so this should be good to merge. @JohannesGaessler
* CUDA: add a fused top-K MoE kernel

  This kernel does the following:
  1. softmax over the logits per token [n_experts, n_tokens]
  2. argmax reduce over the top-k (n_experts_used) logits
  3. write weights + ids to global memory

  It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
This is similar to the CUDA shader from ggml-org#16130, but doesn't use shared memory and handles different subgroup sizes.
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as a fusion of the softmax->top-k->get_rows pipeline for MoE models.
Should be more useful in TG (token generation) than PP (prompt processing).
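Putting the three steps together, a hedged end-to-end sketch of such a fused kernel could look like the following. The kernel name, the one-warp-per-token launch configuration, and the parameter names are assumptions for illustration, not the merged implementation (which, per the commit notes, additionally supports an optional norm and stages the final write through shared memory):

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define WARP_SIZE 32

// Sketch only: launch with one block of WARP_SIZE threads per token.
template <int n_experts>
__global__ void topk_moe_sketch(const float * logits,   // [n_experts, n_tokens]
                                float       * weights,  // [n_expert_used, n_tokens]
                                int         * ids,      // [n_expert_used, n_tokens]
                                const int     n_expert_used) {
    constexpr int experts_per_thread = (n_experts + WARP_SIZE - 1) / WARP_SIZE;

    const int     row        = blockIdx.x;                    // one token per block
    const float * row_logits = logits + (size_t) row * n_experts;

    // 1. load the logits once into registers and compute the softmax
    float vals[experts_per_thread];
    float max_val = -INFINITY;
#pragma unroll
    for (int i = 0; i < experts_per_thread; ++i) {
        const int e = threadIdx.x + i * WARP_SIZE;
        vals[i] = e < n_experts ? row_logits[e] : -INFINITY;  // single VRAM read
        max_val = fmaxf(max_val, vals[i]);
    }
#pragma unroll
    for (int off = WARP_SIZE / 2; off > 0; off >>= 1) {
        max_val = fmaxf(max_val, __shfl_xor_sync(0xffffffff, max_val, off));
    }
    float sum = 0.0f;
#pragma unroll
    for (int i = 0; i < experts_per_thread; ++i) {
        vals[i] = expf(vals[i] - max_val);
        sum    += vals[i];
    }
#pragma unroll
    for (int off = WARP_SIZE / 2; off > 0; off >>= 1) {
        sum += __shfl_xor_sync(0xffffffff, sum, off);
    }
    const float inv_sum = 1.0f / sum;

    // 2. + 3. n_expert_used rounds of warp argmax; write one weight + id per round
    for (int k = 0; k < n_expert_used; ++k) {
        float best_val = -INFINITY;
        int   best_idx = -1;
#pragma unroll
        for (int i = 0; i < experts_per_thread; ++i) {
            const int e = threadIdx.x + i * WARP_SIZE;
            if (e < n_experts && vals[i] > best_val) {
                best_val = vals[i];
                best_idx = e;
            }
        }
#pragma unroll
        for (int off = WARP_SIZE / 2; off > 0; off >>= 1) {
            const float other_val = __shfl_xor_sync(0xffffffff, best_val, off);
            const int   other_idx = __shfl_xor_sync(0xffffffff, best_idx, off);
            // ties go to the lower expert index (cf. the tie-breaker fix)
            if (other_val > best_val || (other_val == best_val && other_idx < best_idx)) {
                best_val = other_val;
                best_idx = other_idx;
            }
        }
        if (threadIdx.x == 0) {
            weights[(size_t) row * n_expert_used + k] = best_val * inv_sum;
            ids    [(size_t) row * n_expert_used + k] = best_idx;
        }
        // drop the selected expert from this thread's candidates
#pragma unroll
        for (int i = 0; i < experts_per_thread; ++i) {
            const int e = threadIdx.x + i * WARP_SIZE;
            if (e == best_idx) {
                vals[i] = -INFINITY;
            }
        }
    }
}
```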