Add deepseek style fused moe group gate selection kernel #4445
qingquansong wants to merge into sgl-project:main
Conversation
qingquansong force-pushed from d6dc100 to fa12f60
qingquansong force-pushed from fa12f60 to e5ba381
qingquansong force-pushed from b647b14 to ec3a1f3
qingquansong force-pushed from 206cf4b to 5127d70
@qingquansong Great job! Would you like to consider using AlignedArray and native datatypes? Of course I can create an amd/ck/fused_moe_gate.cu counterpart, but it would be great if I could reuse the code. I think the fusion algorithm is great, and we can do some deeper engineering work later.
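For context, a minimal sketch of the kind of vectorized-load wrapper the comment suggests; the name `AlignedArray` here is illustrative, not the exact TRT-LLM or sgl-kernel type:

```cpp
#include <cuda_bf16.h>

// Illustrative sketch (hypothetical name, not the exact type in this PR):
// N elements packed into one over-aligned struct so the compiler can move
// them with a single wide (e.g., 128-bit) load/store. N * sizeof(T) should
// be a power of two for the alignas to be valid.
template <typename T, int N>
struct alignas(sizeof(T) * N) AlignedArray {
  T data[N];
  __host__ __device__ T& operator[](int i) { return data[i]; }
  __host__ __device__ const T& operator[](int i) const { return data[i]; }
};

// e.g., 8 bf16 gate logits fetched as one 16-byte transaction:
using Bf16x8 = AlignedArray<__nv_bfloat16, 8>;
static_assert(sizeof(Bf16x8) == 16 && alignof(Bf16x8) == 16, "");
```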
Hi @qingquansong, TRT-LLM moved "tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.h" (present in v0.16.0) to "tensorrt_llm/kernels/internal_cutlass_kernels/include/moe_kernels.h" in v0.17.0 and hid the implementation behind a static library.
Could you confirm that moe_fused_gate_impl was done by our team without referencing any previous implementation (#3191)? (copyright issue)
If it does not reference one, that is great! Do we have ncu profiling to share?
Hey @yiakwy-xpu-ml-framework-team, thanks! I can't seem to find it in https://github.com/NVIDIA/TensorRT-LLM/blob/v0.16.0/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu; only our previous moeSoftmax kernel is adapted from there. Maybe this is a new implementation, or one that just referred to some similar code. @BBuf do you happen to find it somewhere in the old TRT version so we can add it as a reference? Thanks both!
qingquansong force-pushed from 5127d70 to 9e61c59
Definitely, pushed one version with a
qingquansong force-pushed from 1ad3b5c to e433ba5
qingquansong force-pushed from 50e18a9 to fc5b464
qingquansong force-pushed from b66591f to f81a27f
The PR is picked up at #4530. cc @zhyncs @yiakwy-xpu-ml-framework-team @BBuf @hebiao064 @zcnrex @HandH1998
Motivation
This PR is adapted and improved from #3191.
Rewrote the macro and extended it to support all power-of-2 `# experts` / `# expert groups`, as well as all `# topk_group` / `# topk` use cases, with dtype support for fp16/bf16/fp32.

TODO:

NOTE:
- `# experts` must be a power of 2.
- `# experts / # expert groups` must be <= 32, since we fix the size of the AlignedArray in the expression (`MAX_VPT=32`; later we can make this dynamic and equal to `params.VPT`, which can improve speed for smaller cases; see the dispatch sketch below).
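To illustrate where the `MAX_VPT=32` restriction comes from, here is a minimal sketch of a power-of-2 compile-time dispatch; the names (`launch_moe_fused_gate`, `moe_fused_gate_dispatch`) are hypothetical, and the real macro in the PR may differ:

```cpp
#include <cstdio>
#include <stdexcept>

// VPT (= num_experts / num_expert_groups, "values per thread") is baked in
// at compile time, but the backing array is always sized MAX_VPT = 32,
// hence the num_experts / num_expert_groups <= 32 limit noted above.
constexpr int MAX_VPT = 32;

template <int VPT>
void launch_moe_fused_gate() {
  static_assert(VPT > 0 && VPT <= MAX_VPT && (VPT & (VPT - 1)) == 0,
                "VPT must be a power of 2 no larger than MAX_VPT");
  std::printf("launching kernel specialized for VPT=%d\n", VPT);
}

void moe_fused_gate_dispatch(int num_experts, int num_expert_groups) {
  switch (num_experts / num_expert_groups) {  // power-of-2 cases only
    case 2:  launch_moe_fused_gate<2>();  break;
    case 4:  launch_moe_fused_gate<4>();  break;
    case 8:  launch_moe_fused_gate<8>();  break;
    case 16: launch_moe_fused_gate<16>(); break;
    case 32: launch_moe_fused_gate<32>(); break;
    default: throw std::invalid_argument("unsupported experts-per-group");
  }
}
```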
Test
- Unit tests (in PR)
- Speed test for the DeepSeek V3 case: 256 experts + 8 expert groups; first select the top-4 expert groups, then the top-8 final experts (a host-side reference of this two-stage selection is sketched below).
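For readers unfamiliar with the routing scheme, here is a minimal host-side C++ reference of the two-stage selection the speed test exercises. It is a sketch, not the kernel itself: group scoring uses the max logit per group for simplicity, whereas the fused kernel's exact scoring (e.g., bias-corrected sigmoid gating as in DeepSeek V3) may differ.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Two-stage DeepSeek-style selection: rank groups, keep topk_group of them,
// then pick the global topk experts among the surviving groups.
// For the V3 speed-test case: num_groups=8, topk_group=4, topk=8,
// logits.size()=256.
std::vector<int> group_gate_select(const std::vector<float>& logits,
                                   int num_groups, int topk_group, int topk) {
  const int experts_per_group = static_cast<int>(logits.size()) / num_groups;

  // Stage 1: score each group by its best expert logit, keep topk_group.
  std::vector<int> group_ids(num_groups);
  std::iota(group_ids.begin(), group_ids.end(), 0);
  auto group_score = [&](int g) {
    return *std::max_element(logits.begin() + g * experts_per_group,
                             logits.begin() + (g + 1) * experts_per_group);
  };
  std::partial_sort(
      group_ids.begin(), group_ids.begin() + topk_group, group_ids.end(),
      [&](int a, int b) { return group_score(a) > group_score(b); });

  // Stage 2: gather experts from surviving groups, keep the global top-k.
  std::vector<int> candidates;
  for (int i = 0; i < topk_group; ++i)
    for (int e = 0; e < experts_per_group; ++e)
      candidates.push_back(group_ids[i] * experts_per_group + e);
  std::partial_sort(candidates.begin(), candidates.begin() + topk,
                    candidates.end(),
                    [&](int a, int b) { return logits[a] > logits[b]; });
  candidates.resize(topk);
  return candidates;  // indices of the topk selected experts
}
```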
Modifications
Checklist