[Kimi K2] num_experts extends to 384 #8007
panpan0000 wants to merge 2 commits into sgl-project:main
Conversation
Summary of Changes
Hello @panpan0000, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a necessary update to the topk_sigmoid_cpu operation, enabling it to correctly handle configurations with 384 experts. This is a targeted fix to prevent a specific runtime error and expand the operational range of the kernel, likely supporting larger or different model architectures.
Highlights
- Feature Extension: This pull request extends the supported number of experts for the `topk_sigmoid_cpu` kernel. Specifically, it adds support for 384 experts.
- Error Prevention: The change directly addresses and prevents the `"Unexpected num_experts: "` runtime error that would previously occur if a model configuration with 384 experts was used.
Code Review
This pull request extends support for num_experts to 384, specifically within the topk_sigmoid_cpu function. The change is straightforward, adding a new case to a switch statement. My review points out that similar functions within the same file (topk_softmax_cpu and grouped_topk_cpu) have not been updated, which could lead to unexpected errors if they are used with 384 experts. I've recommended updating these as well for consistent behavior.
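The pattern under discussion is a compile-time dispatch: the expert count arrives at runtime, but the kernel is templated on it so the inner loops can be unrolled, and a switch maps one to the other. Below is a minimal, hypothetical sketch of that pattern; the signatures and the naive top-k body are illustrative, not the actual sgl-kernel code. Only the switch-case structure and the error string mirror the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical templated implementation: NUM_EXPERTS is a compile-time
// constant so loops over experts can be unrolled by the compiler.
template <int NUM_EXPERTS>
void topk_sigmoid_impl(const float* logits, float* weights, int32_t* ids,
                       int64_t num_tokens, int64_t topk) {
  for (int64_t t = 0; t < num_tokens; ++t) {
    float scores[NUM_EXPERTS];
    for (int e = 0; e < NUM_EXPERTS; ++e)
      scores[e] = 1.0f / (1.0f + std::exp(-logits[t * NUM_EXPERTS + e]));
    // Naive top-k for illustration: sort expert indices by score.
    std::vector<int32_t> order(NUM_EXPERTS);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + topk, order.end(),
                      [&](int32_t a, int32_t b) { return scores[a] > scores[b]; });
    for (int64_t k = 0; k < topk; ++k) {
      ids[t * topk + k] = order[k];
      weights[t * topk + k] = scores[order[k]];
    }
  }
}

// Runtime-to-compile-time dispatch. The PR effectively adds one more case
// like the 384 one below; the review asks for the same case in
// topk_softmax_cpu and grouped_topk_cpu for consistent behavior.
void topk_sigmoid_cpu(const float* logits, float* weights, int32_t* ids,
                      int64_t num_tokens, int64_t num_experts, int64_t topk) {
  switch (num_experts) {
    case 256:
      topk_sigmoid_impl<256>(logits, weights, ids, num_tokens, topk);
      break;
    case 384:  // new: Kimi K2 ("n_routed_experts": 384)
      topk_sigmoid_impl<384>(logits, weights, ids, num_tokens, topk);
      break;
    default:
      throw std::runtime_error("Unexpected num_experts: " + std::to_string(num_experts));
  }
}
```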
Force-pushed from a6a4cf7 to 4e3bb15.
It looks like ...
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
I tried to create another PR, #8013, but I have no confidence in it at all... T_T
@yhyang201 The best GPU I have on hand is Ampere, so I cannot test this myself. But ...
This operator is designed to reduce latency for small batches. If that isn't a major concern, you may simply use torch.nn.Linear for now.
Yes, @yhyang201. I meant, so changing ...
```diff
- static constexpr int MAX_VPT = 32;   // maximum VPT we support, > params.VPT = num_expert / num_expert_group
+ // DeepSeek V3/R1: num_experts = 256, n_group = 8
+ // Kimi K2: num_experts = 384, n_group = 1
+ static constexpr int MAX_VPT = 384;  // maximum VPT we support, > params.VPT = num_expert / num_expert_group
```
@ispobock I feel that directly changing MAX_VPT to 384 here, without any register-reuse strategy, will cause this kernel to spill registers; the modification here is extremely sensitive. I believe the closest implementation to what we want is #6946, which uses a tiling approach for register reuse and also prefetches data between tiles. With minor modifications, it should support the Kimi K2 (384) case here. I'll push this forward as soon as possible.
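To make the tiling idea concrete, here is a hypothetical sketch of the per-thread logic, shown host-side in plain C++ for clarity. `tiled_topk`, `TILE`, and `TOPK` are illustrative names, not the actual #6946 implementation, which operates inside a CUDA kernel with warp cooperation and real prefetching. The point is that only `TILE` scores plus a small running top-k state are live at once, instead of all 384 values.

```cpp
#include <cfloat>

// Streaming top-k over num_experts scores, processed TILE values at a time.
// In a CUDA kernel the tile[] buffer and the top-k state would live in
// registers and be reused across iterations, avoiding the spilling that
// a single 384-wide register array (MAX_VPT = 384) would cause.
template <int TILE, int TOPK>
void tiled_topk(const float* scores, int num_experts,
                float topk_val[TOPK], int topk_idx[TOPK]) {
  for (int k = 0; k < TOPK; ++k) { topk_val[k] = -FLT_MAX; topk_idx[k] = -1; }

  float tile[TILE];  // the only large live buffer, reused per tile
  for (int base = 0; base < num_experts; base += TILE) {
    // Load one tile; in the real kernel this load for tile i+1 can be
    // issued before the compare loop of tile i to hide memory latency.
    for (int i = 0; i < TILE; ++i)
      tile[i] = (base + i < num_experts) ? scores[base + i] : -FLT_MAX;

    // Merge the tile into the running top-k via insertion into a sorted list.
    for (int i = 0; i < TILE; ++i) {
      float v = tile[i];
      for (int k = 0; k < TOPK; ++k) {
        if (v > topk_val[k]) {
          for (int j = TOPK - 1; j > k; --j) {  // shift down to make room
            topk_val[j] = topk_val[j - 1];
            topk_idx[j] = topk_idx[j - 1];
          }
          topk_val[k] = v;
          topk_idx[k] = base + i;
          break;
        }
      }
    }
  }
}
```

With this structure, register usage is bounded by `TILE + TOPK` rather than `num_experts`, so 384 (or 512) experts no longer force `MAX_VPT` upward.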
@ispobock On H20 (sm90) there is a compile error, caused by the 384 case:
```
/sgl-workspace/sglang/sgl-kernel/build/_deps/repo-cutlass-src/include/cutlass/array.h(2849): error: invalid alignment value specified by attribute
  class alignas(Alignment) AlignedArray: public Array<T, N> {
        ^
detected during:
  instantiation of class "cutlass::AlignedArray<T, N, Alignment> [with T=bfloat16_t, N=384, Alignment=768]"
    at line 86 of /sgl-workspace/sglang/sgl-kernel/csrc/moe/moe_fused_gate.cu
  instantiation of "void moe_fused_gate_impl<T,Params>(void *, void *, float *, int32_t *, int64_t, int64_t, int64_t, int64_t, double, Params)
    [with T=bfloat16_t, Params=KernelParams<384, 384, 1, 32, 192, 6>]"
    at line 299 of /sgl-workspace/sglang/sgl-kernel/csrc/moe/moe_fused_gate.cu
  instantiation of "void moe_fused_gate_kernel<T,VPT,NUM_EXPERTS,THREADS_PER_ROW,ROWS_PER_WARP,ROWS_PER_CTA,WARPS_PER_CTA>(void *, void *, float *, int32_t *, int64_t, int64_t, int64_t, int64_t, double)
    [with T=bfloat16_t, VPT=384, NUM_EXPERTS=384, THREADS_PER_ROW=1, ROWS_PER_WARP=32, ROWS_PER_CTA=192, WARPS_PER_CTA=6]"
    at line 426 of /sgl-workspace/sglang/sgl-kernel/csrc/moe/moe_fused_gate.cu
```
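A likely reading of this error (my interpretation, not stated in the thread): C++ requires every alignment value to be a power of two, and the `AlignedArray<bfloat16_t, 384, 768>` instantiation requests an alignment of 384 × sizeof(bfloat16_t) = 768 bytes, which is not a power of two, so the `alignas(Alignment)` attribute is rejected. A minimal standalone illustration:

```cpp
#include <cstdint>

// alignas only accepts power-of-two values. 384 bfloat16 elements occupy
// 384 * 2 = 768 bytes, which is not a power of two; this mirrors the
// AlignedArray<bfloat16_t, 384, 768> failure above.
// struct alignas(768) Bad { uint16_t data[384]; };  // error: invalid alignment value
struct alignas(1024) Ok { uint16_t data[384]; };     // next power of two compiles
```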
I also encountered this compilation error
May I ask if you are interested in working with me on #6946? It extends num_experts to 512 with tiling.
Kimi K2 has `"n_routed_experts": 384`. This PR changes the code to avoid hitting the `"Unexpected num_experts: "` error.

Test result for the CUDA code on A800: