Integrate DeepGemm contiguous group gemm into Fused MoE #4343

Closed
laixinn wants to merge 9 commits into sgl-project:main from laixinn:deep-gemm-contiguous

Conversation

laixinn (Contributor) commented Mar 12, 2025

Motivation

Integrate DeepGemm's m_grouped_gemm_fp8_fp8_bf16_nt_contiguous as the default group GEMM kernel for Fused MoE. Depends on #4165.
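
For context, this is roughly how the contiguous kernel is invoked (a minimal sketch following the DeepGEMM README of the time; the shapes, scale layouts, and helper names are assumptions, not code from this PR):

```python
import torch
import deep_gemm

# Per-expert problem sizes; each expert's row block is padded to the
# M alignment the contiguous layout requires (128 at the time of writing).
num_experts, n, k = 8, 4096, 7168
m_per_expert = deep_gemm.get_m_alignment_for_contiguous_layout()
m_sum = num_experts * m_per_expert

# LHS: all tokens concatenated expert by expert, FP8 values plus one scale
# per 128-column block (the LHS scales need a TMA-aligned layout, hence the helper).
x_fp8 = torch.randn(m_sum, k, device="cuda").to(torch.float8_e4m3fn)
x_scale = deep_gemm.get_col_major_tma_aligned_tensor(
    torch.ones(m_sum, k // 128, device="cuda", dtype=torch.float32))

# RHS: one [n, k] weight matrix per expert, FP8 values plus 128x128-block scales.
w_fp8 = torch.randn(num_experts, n, k, device="cuda").to(torch.float8_e4m3fn)
w_scale = torch.ones(num_experts, n // 128, k // 128, device="cuda",
                     dtype=torch.float32)

out = torch.empty(m_sum, n, device="cuda", dtype=torch.bfloat16)

# m_indices[i] = expert that owns row i; rows of one expert sit contiguously.
m_indices = torch.arange(num_experts, device="cuda", dtype=torch.int32)
m_indices = m_indices.repeat_interleave(m_per_expert)

deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_contiguous(
    (x_fp8, x_scale), (w_fp8, w_scale), out, m_indices)
```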

Modifications

Checklist

Co-authored-by: yinfan98 <1106310035@qq.com>
sleepcoo requested review from sleepcoo and zhyncs on March 12, 2025
laixinn force-pushed the deep-gemm-contiguous branch from f2ac813 to ce2677b on March 14, 2025
laixinn force-pushed the deep-gemm-contiguous branch from 13b3cdc to e5d3d3a on March 14, 2025
merrymercy mentioned this pull request on Feb 24, 2025
ch-wan (Collaborator) commented Mar 26, 2025

@laixinn Is there any plan to support masked gemm? It can be integrated with low_latency_dispatch seamlessly.
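
(For reference, the masked variant looks roughly like this per the DeepGEMM README of the time; the names and shapes here are assumptions, not code from this PR. Each group computes only its first masked_m[g] rows, which matches the fixed-capacity buffers low_latency_dispatch returns.)

```python
import torch
import deep_gemm

num_groups, m_max, n, k = 8, 128, 4096, 7168

# Fixed-capacity per-expert buffers, as a low-latency dispatch would produce.
x_fp8 = torch.randn(num_groups, m_max, k, device="cuda").to(torch.float8_e4m3fn)
x_scale = torch.ones(num_groups, m_max, k // 128, device="cuda",
                     dtype=torch.float32)
w_fp8 = torch.randn(num_groups, n, k, device="cuda").to(torch.float8_e4m3fn)
w_scale = torch.ones(num_groups, n // 128, k // 128, device="cuda",
                     dtype=torch.float32)
out = torch.empty(num_groups, m_max, n, device="cuda", dtype=torch.bfloat16)

# masked_m[g]: number of valid rows for expert g; rows beyond it are skipped.
masked_m = torch.randint(1, m_max, (num_groups,), device="cuda",
                         dtype=torch.int32)
expected_m = m_max // 2  # scheduling hint for the kernel

deep_gemm.m_grouped_gemm_fp8_fp8_bf16_nt_masked(
    (x_fp8, x_scale), (w_fp8, w_scale), out, masked_m, expected_m)
```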

laixinn (Contributor, Author) commented Mar 26, 2025

@ch-wan I heard some EP features are being developed on top of the masked GEMM.

ch-wan mentioned this pull request on Mar 26, 2025
laixinn (Contributor, Author) commented Mar 26, 2025

This PR passes the unit tests and supports CUDA graph, but the pre- and post-processing overhead is currently unacceptable. Optimizing it will take a while.
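
To make the cost concrete, the pre-processing amounts to roughly the permutation below: rows must be regrouped per expert and padded to the GEMM's M alignment before the kernel call, then scattered back to token order afterwards (a hypothetical sketch, not this PR's actual code):

```python
import torch

def permute_for_contiguous_gemm(topk_ids: torch.Tensor, num_experts: int,
                                alignment: int):
    """topk_ids: flat [num_tokens * top_k] tensor of expert ids, one per
    (token, expert) pair. Returns a gather order plus the m_indices the
    contiguous GEMM expects."""
    sorted_ids = torch.argsort(topk_ids)              # rows grouped by expert
    counts = torch.bincount(topk_ids, minlength=num_experts)
    padded = (counts + alignment - 1) // alignment * alignment  # align groups
    m_indices = torch.repeat_interleave(
        torch.arange(num_experts, dtype=torch.int32, device=topk_ids.device),
        padded)
    # A real integration must also map each padded slot back to its source row
    # (or mark it invalid) and scatter the GEMM output back to token order.
    return sorted_ids, m_indices, padded

# Example: 6 (token, expert) pairs over 4 experts, alignment 4.
topk_ids = torch.tensor([2, 0, 1, 0, 2, 2])
sorted_ids, m_indices, padded = permute_for_contiguous_gemm(topk_ids, 4, 4)
print(padded.tolist())  # [4, 4, 4, 0] -> expert 3 holds no rows
```

The gather into the expert-contiguous layout, plus the matching scatter back after the GEMM, is pure data movement on top of the kernel itself, which is where this overhead comes from.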

ch-wan (Collaborator) commented Mar 26, 2025

That sounds reasonable. The current fused_moe kernel matches the data layout of inference both without EP and with --enable-ep, so changing the data layout just to call DeepGEMM may incur unnecessary overhead.

How about we focus on integrating DeepGEMM with DeepEP? Our current implementation also requires pre- and post-processing before computing the grouped GEMM (see #4643), and DeepGEMM may achieve better performance in that case. Note that CUDA graph has to be disabled if we focus on DeepEP.

laixinn (Contributor, Author) commented Mar 26, 2025

@ch-wan Exactly. I suppose the DeepGEMM kernels are designed for EP.

merrymercy closed this on Apr 21, 2025