[2/2] Use moe_sum_reduce CUDA kernel #10654
Conversation
Force-pushed 4d390a4 to 64d7b40
Code Review
This pull request replaces the conditional logic for MoE sum reduction, which previously switched between Triton and torch.compile implementations, with a single, more efficient custom CUDA kernel. This change simplifies the codebase in both fused_moe.py and triton.py and, according to the provided benchmarks, improves performance. My review includes suggestions to remove some redundant code for better clarity and style.
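The dispatch simplification described above can be sketched as follows. The function names and the NumPy stand-in bodies are illustrative assumptions, not the actual sglang symbols:

```python
import numpy as np

# Illustrative stand-ins for the real kernels; each sums the top-k expert
# outputs per token. x has shape [num_tokens, topk, hidden_dim].
def triton_moe_sum(x):
    return x.sum(axis=1)

def compiled_moe_sum(x):  # stand-in for the torch.compile'd path
    return x.sum(axis=1)

def moe_sum_reduce(x):  # stand-in for the fused CUDA kernel
    return x.sum(axis=1)

def moe_sum_old(x, use_triton):
    # Old path: branch between a Triton kernel and a torch.compile'd sum.
    if use_triton:
        return triton_moe_sum(x)
    return compiled_moe_sum(x)

def moe_sum_new(x):
    # New path: a single fused CUDA kernel covers all cases.
    return moe_sum_reduce(x)
```

Collapsing the branch removes a per-call decision and leaves one code path to optimize and test.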
cc @ch-wan
The CI error is because sgl-kernel has not been upgraded yet, so the new kernel cannot be recognized. Waiting for the new kernel release.
Force-pushed f2f5751 to cb30bf3
The CI failed because the tensor dtype is np.float64, and the CUDA kernel doesn't support float64 at the moment.
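Until the kernel gains float64 support, a common workaround is a dtype guard that routes unsupported inputs through a widened compute path. A minimal sketch, where the function names are hypothetical and a NumPy reduction stands in for the fused CUDA kernel:

```python
import numpy as np

# Assumed supported dtype set for the fused kernel (illustrative only).
KERNEL_DTYPES = {np.float16, np.float32}

def fused_moe_sum(x):
    # Stand-in for the fused CUDA kernel: sum expert outputs over the
    # top-k axis. x has shape [num_tokens, topk, hidden_dim].
    return x.sum(axis=1)

def moe_sum_safe(x):
    # Route unsupported dtypes (e.g. float64) around the fused kernel
    # by computing in float32 and casting back.
    if x.dtype.type in KERNEL_DTYPES:
        return fused_moe_sum(x)
    return fused_moe_sum(x.astype(np.float32)).astype(x.dtype)
```

Note that the PR itself resolves this by adding float64 support to the kernel; the guard above is only the interim-fallback pattern.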
Force-pushed fe2e3db to 7f55d35
Force-pushed 7f55d35 to 3f26045
Now that the CUDA kernel supports float64, this issue has been fixed.
The CUDA kernel still has an accuracy issue which makes some CI unhappy. I'll follow up.
The CUDA kernel's accuracy is better than the Triton kernel's. Moving forward with this PR.
Yes! According to my test, the CUDA kernel's result is more precise than the Triton kernel's.
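One plausible source of such accuracy differences (an assumption about reduction kernels in general, not a claim about either kernel's actual implementation) is the accumulation precision. A small NumPy demo of why accumulating in a wider type matters:

```python
import numpy as np

vals = np.full(10_000, 0.01, dtype=np.float16)

# Accumulate entirely in float16: once the running sum grows large enough,
# each 0.01 increment falls below half an ulp and is rounded away, so the
# sum stalls far short of the true value (~100).
acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)

# Accumulate in float32 and cast the final result back to float16.
acc32 = np.float16(vals.astype(np.float32).sum())

print(acc16, acc32)  # acc16 stalls well below acc32
```

A kernel that keeps its partial sums in a wider register type can therefore beat one that accumulates in the input dtype, even for the same mathematical reduction.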
Motivation
This PR uses the moe_sum_reduce CUDA kernel implemented in #10321.
gsm8k result:
Tried TP4 for Qwen3 MoE; the accuracy and throughput look more reasonable.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist