Integrate DeepGemm contiguous group gemm into Fused MoE #4343
laixinn wants to merge 9 commits into sgl-project:main
Conversation
Co-authored-by: yinfan98 <1106310035@qq.com>
Force-pushed from f2ac813 to ce2677b
Force-pushed from 13b3cdc to e5d3d3a
@laixinn Is there any plan to support masked gemm? It can be integrated with low_latency_dispatch seamlessly.
@ch-wan I heard some EP features are being developed with the masked gemm.
This PR can pass the unit tests and supports CUDA graph, but the pre- and post-processing overhead is currently unacceptable. Optimizing this overhead will take a while.
It sounds reasonable. The current fused_moe kernel matches the data layout of inference without EP. How about we focus on integrating DeepGEMM with DeepEP? Our current implementation also requires pre- and post-processing before computing GroupedGeMM (see #4643); DeepGEMM may achieve better performance for this case. Note that CUDA graph has to be disabled if we focus on DeepEP.
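To make the pre-/post-processing cost concrete, here is a minimal sketch (not this PR's actual code) of the permutation step: each (token, expert) row is sorted by expert id and every expert's block is padded so that a contiguous grouped GEMM can consume it. The helper name, the alignment value of 128, and the padding-row handling are assumptions for illustration only.

```python
import torch

def permute_for_grouped_gemm(x, topk_ids, num_experts, align=128):
    """Sort (token, expert) rows by expert id and pad each expert block to `align` rows."""
    top_k = topk_ids.shape[1]
    flat_expert = topk_ids.reshape(-1).to(torch.int64)
    sort_idx = torch.argsort(flat_expert)             # rows grouped by expert id
    token_of_row = sort_idx // top_k                  # source token of each sorted row
    counts = torch.bincount(flat_expert, minlength=num_experts)
    padded = (counts + align - 1) // align * align    # per-expert m rounded up to `align`

    total = int(padded.sum())
    gathered = x.new_zeros(total, x.shape[1])         # padding rows stay zero
    m_indices = torch.empty(total, dtype=torch.int32, device=x.device)
    row_map = torch.empty(flat_expert.numel(), dtype=torch.long, device=x.device)

    src, dst = 0, 0
    for e in range(num_experts):
        n, p = int(counts[e]), int(padded[e])
        gathered[dst:dst + n] = x[token_of_row[src:src + n]]
        m_indices[dst:dst + p] = e                    # padding rows reuse the expert id (assumption)
        row_map[src:src + n] = torch.arange(dst, dst + n, device=x.device)
        src += n
        dst += p
    return gathered, m_indices, sort_idx, row_map
```

Post-processing is the inverse: gather the per-(token, expert) rows back out of the padded buffer with row_map, undo the sort with sort_idx, and reduce over top-k with the gating weights. This kind of host-driven gather/scatter is exactly the overhead discussed above.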
@ch-wan Exactly, I suppose the DeepGemm kernels are designed for EP.
Motivation
Integrate DeepGemm m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as the default group GEMM kernel for Fused MoE, depending on #4165.

Modifications
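For illustration only (not part of the original description): a minimal sketch of what the Fused MoE call site might look like, assuming DeepGEMM exposes m_grouped_gemm_fp8_fp8_bf16_nt_contiguous taking FP8 activations and weights with their block scales, a bf16 output buffer, and per-row expert indices in the contiguous layout built by the pre-processing step. All names and shapes here are assumptions, not this PR's actual code.

```python
import torch
import deep_gemm  # assumed to provide the contiguous grouped-GEMM entry point

def moe_up_proj(x_fp8, x_scale, w_fp8, w_scale, m_indices):
    # x_fp8:   [m_sum, k]           float8 activations (expert-sorted, padded layout)
    # x_scale: per-block scales for x_fp8
    # w_fp8:   [num_experts, n, k]  float8 expert weights
    # w_scale: per-block scales for w_fp8
    # m_indices: [m_sum] int32, expert id of each row
    out = torch.empty(x_fp8.shape[0], w_fp8.shape[1],
                      dtype=torch.bfloat16, device=x_fp8.device)
    deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(
        (x_fp8, x_scale), (w_fp8, w_scale), out, m_indices)
    return out
```

The same call pattern would presumably apply to the down projection; the result is then un-permuted and weighted by the gating scores as in the existing fused_moe path.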
Checklist