Add deepseek_v3 fused gate #3191
Conversation
```python
# Your module under test
output, indices_my = deepseekv3_fused_gate(tensor, bias, seq_length)

###### Reference Implementation ######
```
Please refactor this code into a standalone function, which can be directly used from https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/layers/moe/topk.py#L111-L147.
Do you mean I should separate the reference implementation into a standalone function?
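For reference, a standalone version of the PyTorch gate (in the spirit of the `biased_grouped_topk` path in `topk.py`) could look roughly like the sketch below. The function name and the DeepSeek V3 defaults (256 experts, 8 groups, top-2 scores per group, `topk_group=4`, `top_k=8`) are illustrative assumptions, not code from this PR:

```python
import torch

def reference_deepseekv3_gate(scores, bias, top_k=8, num_groups=8, topk_group=4):
    # Group-limited top-k routing with a per-expert bias, as in DeepSeek V3.
    num_tokens, num_experts = scores.shape
    scores = scores.sigmoid()
    scores_with_bias = scores + bias  # bias steers selection only, not the weights

    # Score each group by the sum of its top-2 biased scores; keep the best groups.
    group_scores = (
        scores_with_bias.view(num_tokens, num_groups, -1).topk(2, dim=-1)[0].sum(dim=-1)
    )
    group_idx = group_scores.topk(topk_group, dim=-1, sorted=False)[1]
    group_mask = torch.zeros_like(group_scores).scatter_(1, group_idx, 1.0)
    expert_mask = (
        group_mask.unsqueeze(-1)
        .expand(num_tokens, num_groups, num_experts // num_groups)
        .reshape(num_tokens, -1)
    )

    # Top-k over experts in the surviving groups; weights come from unbiased scores.
    masked = scores_with_bias.masked_fill(expert_mask == 0, float("-inf"))
    indices = masked.topk(top_k, dim=-1, sorted=False)[1]
    weights = scores.gather(1, indices)
    weights = weights / weights.sum(dim=-1, keepdim=True)
    # A routed scaling factor (cf. the kernel's route_scale argument) could be applied here.
    return weights, indices
```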
```python
output_ref = weights.type_as(scores)

# Assertions
output_check = torch.allclose(output_ref.sort()[0], output.sort()[0], rtol=1e-04, atol=1e-05)
```
Why not directly compare output and output_ref instead of sorting them?
This is weird: the kernel sometimes produces exactly the same outputs, just in a different order. I checked the downstream steps and the output order does not matter, so I compared the sorted outputs in the unit test. Is this OK?
We need to determine at which specific step of the fused kernel this inconsistency in order occurs. Additionally, we need to clarify whether running the PyTorch implementation twice with the same input would result in inconsistent output orders. Finally, if you believe that the current order inconsistency does not affect the fused MoE accuracy, you need to provide an end-to-end result, such as running the GSM8K test with the DeepSeek V3 model.
I see, I will check the inconsistency inside the kernel. I cannot run the e2e test on my server; Yineng will help me run it.
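If the order difference turns out to be benign, one way to make the assertion stricter than comparing independently sorted tensors (which could pass even when weights and indices are paired differently) is to sort each row by expert index and compare weights expert-by-expert. A minimal sketch; the helper name and tolerances are illustrative:

```python
import torch

def assert_topk_equal(weights, indices, weights_ref, indices_ref, rtol=1e-4, atol=1e-5):
    # Order-insensitive per-row comparison: sort both results by expert index
    # so each weight is checked against the weight for the same expert.
    order = indices.argsort(dim=-1)
    order_ref = indices_ref.argsort(dim=-1)
    assert torch.equal(indices.gather(-1, order), indices_ref.gather(-1, order_ref))
    torch.testing.assert_close(
        weights.gather(-1, order), weights_ref.gather(-1, order_ref),
        rtol=rtol, atol=atol,
    )
```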
```python
from sgl_kernel import deepseekv3_fused_gate

@pytest.mark.parametrize("seq_length", range(1, 20000))
```
Can you add a benchmark script? Maybe refer to https://github.com/sgl-project/sglang/tree/main/sgl-kernel/benchmark
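A minimal benchmark in the style of the scripts in sgl-kernel/benchmark might look like the sketch below. It assumes the `deepseekv3_fused_gate(x, bias, seq_length)` signature from the test above; the shapes and dtype (256 experts, fp32) are illustrative:

```python
import torch
import triton
from sgl_kernel import deepseekv3_fused_gate  # assumes the extension is built

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["seq_length"],
        x_vals=[64, 128, 256, 512, 1024, 2048],
        line_arg="provider",
        line_vals=["sgl_kernel"],
        line_names=["deepseekv3_fused_gate"],
        ylabel="latency (ms)",
        plot_name="deepseekv3-fused-gate",
        args={},
    )
)
def benchmark(seq_length, provider):
    x = torch.randn(seq_length, 256, device="cuda", dtype=torch.float32)
    bias = torch.randn(256, device="cuda", dtype=torch.float32)
    # do_bench handles warmup and repetition and reports latency in ms.
    return triton.testing.do_bench(lambda: deepseekv3_fused_gate(x, bias, seq_length))

if __name__ == "__main__":
    benchmark.run(print_data=True)
```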
```python
bmm_fp8,
custom_dispose,
custom_reduce,
deepseekv3_fused_gate,
```
It seems more appropriate to name it deepseek_fused_gate here, as models from the deepseek series can all go through this gate function.
This is not a generalized kernel; it only works for the DeepSeek V3 671B model.
I think it also works for DeepSeek V2 VL
```cpp
input.data_ptr(), bias.data_ptr(), output.data_ptr(), indices.data_ptr<int64_t>(), num_rows, k, route_scale
);

CHECK_CUDA_SUCCESS(cudaDeviceSynchronize());
```
Synchronization is not allowed in the kernel's host-side code, as it will cause CUDA graph capture to crash. Can you remove it?
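For background on why the sync breaks CUDA graphs: `cudaDeviceSynchronize()` is illegal during stream capture, so any attempt to capture the op fails. A quick graph-safety check, sketched with assumed shapes and dtype:

```python
import torch
from sgl_kernel import deepseekv3_fused_gate

x = torch.randn(8, 256, device="cuda", dtype=torch.float32)
bias = torch.randn(256, device="cuda", dtype=torch.float32)

# Warm up on a side stream so lazy initialization happens before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    deepseekv3_fused_gate(x, bias, 8)
torch.cuda.current_stream().wait_stream(s)

# Capture raises a CUDA error if the op calls cudaDeviceSynchronize().
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    deepseekv3_fused_gate(x, bias, 8)
g.replay()
```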
```cpp
@@ -0,0 +1,219 @@
#include <cfloat>
```
Please add a comment noting this was adapted from https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/mixtureOfExperts/moe_kernels.cu#L231.
Note that this code was open source up to v0.16.0 but has been moved into a closed-source static library as of v0.17.0.
In TensorRT-LLM, the fused MoE module, in addition to the …
sounds good @BBuf |
Yeah, I can have a try. |
@NovTi, are you still working on this PR? Have you referenced any other open-sourced implementation?
Add deepseek v3 fused gate module