Optimize Permute Kernel in DeepEP #4643
Conversation
```python
def deepep_run_moe_deep_preprocess(topk_ids: torch.Tensor, num_experts: int):
    reorder_topk_ids, reorder_ids = torch.sort(topk_ids.view(-1), stable=True)
    seg_indptr = torch.zeros(num_experts + 1, device=topk_ids.device, dtype=torch.int64)
```
`seg_indptr` can be initialized with `torch.empty` instead of `torch.zeros`, since every element is written afterward.
```python
deepep_compute_src2dst_triton_kernel[grid](
    reorder_ids, src2dst, topk_ids.numel(), num_minus_one, BLOCK_SIZE
)
# src2dst -= num_minus_one
```
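For readers following along, the mapping this launch computes can be sketched in plain Python (illustrative values; the real kernel does this in parallel over blocks): given `reorder_ids`, the stable argsort of the flattened `topk_ids`, `src2dst[src]` is the destination slot of source element `src` in the expert-sorted layout.

```python
# Plain-Python sketch (illustrative values) of the src2dst mapping that
# deepep_compute_src2dst_triton_kernel builds in parallel.
topk_ids = [2, 0, 1, 2, 0, 3]  # flattened expert ids, one per (token, slot) pair

# Stable argsort of topk_ids, as produced by torch.sort(..., stable=True)
reorder_ids = sorted(range(len(topk_ids)), key=lambda i: topk_ids[i])

# Invert the permutation: source position -> destination slot
src2dst = [0] * len(topk_ids)
for dst, src in enumerate(reorder_ids):
    src2dst[src] = dst

assert src2dst == [3, 0, 2, 4, 1, 5]
```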
```python
@triton.jit
def compute_src2dst_triton_kernel(
```
compute_src2dst_triton_kernel and deepep_compute_src2dst_triton_kernel are defined twice.
```python
@triton.jit
def deepep_compute_src2dst_triton_kernel(
```
Why is developing a Triton kernel necessary here? Is it faster?
```python
@triton.jit
def deepep_permute_triton_kernel(
```
```python
@triton.jit
def deepep_post_reorder_triton_kernel(
```

```python
output = torch.zeros(
    (num_tokens, hidden_states.shape[1]),
    device=hidden_states.device,
    dtype=hidden_states.dtype,
)
```
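As a rough sketch (plain Python, hypothetical shapes and weights, not the actual kernel), the post-reorder step gathers each token's top-k expert outputs from the permuted buffer via `src2dst` and accumulates them weighted by `topk_weights`:

```python
# Plain-Python sketch (illustrative values) of the per-token weighted sum
# that deepep_post_reorder_triton_kernel performs.
topk = 2
gateup_output = [10.0, 20.0, 30.0, 40.0]   # permuted expert outputs, hidden size 1
src2dst = [2, 0, 3, 1]                     # (token * topk + k) -> destination slot
topk_weights = [[0.5, 0.5], [0.25, 0.75]]  # per-token routing weights

output = []
for t in range(2):
    acc = 0.0
    for k in range(topk):
        dst = src2dst[t * topk + k]
        acc += topk_weights[t][k] * gateup_output[dst]
    output.append(acc)

# token 0: 0.5 * 30.0 + 0.5 * 10.0 = 20.0
assert output == [20.0, 25.0]
```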
```diff
 )
 if self.tp_size > 1:
-    recv_hidden_states, topk_idx, topk_weights, tokens_per_expert = (
+    recv_hidden_states, reorder_topk_ids, seg_indptr = (
```
Should we add some short comments on the meaning/examples of reorder_topk_ids and seg_indptr for readability?
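For example (illustrative values, plain Python rather than torch): `reorder_topk_ids` is the flattened `topk_ids` in sorted order, and `seg_indptr` is a CSR-style offset array, so expert `e`'s entries occupy reorder positions `[seg_indptr[e], seg_indptr[e + 1])`.

```python
# Illustrative example (plain Python, hypothetical values) of the two
# tensors produced by deepep_run_moe_deep_preprocess.
from itertools import accumulate

num_experts = 4
topk_ids = [2, 0, 1, 2, 0, 3]  # flattened: expert chosen per (token, slot) pair

reorder_topk_ids = sorted(topk_ids)          # expert ids in ascending order
counts = [reorder_topk_ids.count(e) for e in range(num_experts)]
seg_indptr = [0] + list(accumulate(counts))  # per-expert segment boundaries

assert reorder_topk_ids == [0, 0, 1, 2, 2, 3]
assert seg_indptr == [0, 2, 3, 5, 6]
# Expert 2's segment is positions [3, 5) of the reordered buffer
assert reorder_topk_ids[seg_indptr[2]:seg_indptr[3]] == [2, 2]
```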
Will there be further optimization plans for this permute kernel?

We will continue to optimize the permute kernel, but it is not our top priority at the moment.
The observed issue could potentially be attributed to the RoCE network configuration. To verify this hypothesis, we recommend running the inter-node communication test from DeepEP's validation suite, specifically the internode connectivity check.


Motivation
The current performance of DeepEP is suboptimal due to the low efficiency of PyTorch's native permute function, which is used for formatting data before and after DeepEP communication. To address this limitation, we have implemented high-efficiency Triton kernels that significantly improve overall performance.
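As a rough illustration of what is being replaced (plain Python, hypothetical shapes; the actual code operates on torch tensors), the pre-communication permute is a gather that duplicates each token once per selected expert and groups the copies by expert:

```python
# Plain-Python sketch (illustrative values) of the permute that groups
# token copies by expert id before DeepEP communication.
hidden_states = [[1.0], [2.0], [3.0]]  # 3 tokens, hidden size 1
topk_ids = [[2, 0], [0, 1], [1, 2]]    # each token routed to top-2 experts
topk = 2

flat = [e for row in topk_ids for e in row]                    # [2, 0, 0, 1, 1, 2]
reorder_ids = sorted(range(len(flat)), key=lambda i: flat[i])  # stable argsort

# Gather each (token, slot) copy into expert-sorted order
permuted = [hidden_states[src // topk] for src in reorder_ids]
assert permuted == [[1.0], [2.0], [2.0], [3.0], [1.0], [3.0]]
```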
Performance on H20
Single Node
Command
Multi Node
Command
Modifications
Checklist