[Step3p5] Optimize allreduce in MoE layers by yhyang201 · Pull Request #22773 · sgl-project/sglang

yhyang201 · 2026-04-14T08:14:30Z

Motivation

Modifications

Defer the o_proj and share_expert all-reduces and combine them with the MoE output into a single all-reduce per layer (previously 3 separate all-reduces)
Enable allreduce fusion and reduce-scatter for Step3p5
Add Step3p5ForCausalLM to flashinfer allreduce fusion whitelist
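
The first modification relies on all-reduce being a sum, so partial outputs that are later added together can be summed locally first and reduced once. The sketch below is a toy simulation with plain Python lists, not the code in step3p5.py: the helper names `all_reduce` and `add` and the per-rank values are hypothetical, and it ignores where the real layer applies normalization or residual connections.

```python
# Toy model of tensor-parallel ranks: each rank holds one partial vector.
# all_reduce() and add() are illustrative helpers, not SGLang APIs.

def all_reduce(partials):
    """Sum per-rank vectors elementwise; every rank receives the total."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

def add(*vectors):
    """Elementwise sum of vectors held on a single rank."""
    return [sum(vals) for vals in zip(*vectors)]

world_size = 4
# Hypothetical per-rank partial outputs of the three branches.
o_proj = [[float(r + 1), 2.0 * r] for r in range(world_size)]
shared = [[0.5 * r, 1.0] for r in range(world_size)]
moe = [[1.0, float(r)] for r in range(world_size)]

# Before: reduce each branch separately (3 all-reduces), then add.
separate = [
    add(a, b, c)
    for a, b, c in zip(all_reduce(o_proj), all_reduce(shared), all_reduce(moe))
]

# After: add the partials locally, then reduce once (1 all-reduce).
fused = all_reduce([add(a, b, c) for a, b, c in zip(o_proj, shared, moe)])

print(separate[0] == fused[0])
```

Both paths give every rank the same vector, which is why the three collectives can be replaced by one whenever the outputs are only combined by addition.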

Performance

Good Performance Launch Command:
H200x8
63,213 TPS

  python3 -m sglang.launch_server \
      --model-path stepfun-ai/Step-3.5-Flash-FP8 \
      --tp 8 \
      --ep 4 \
      --trust-remote-code \
      --mem-fraction-static 0.75 \
      --chunked-prefill-size 16384 \
      --port 30000 \
      --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 64}'

Prefill throughput (TP=8, EP=4, input_len=8192, output_len=1, 200 prompts):

Before: ~46k tok/s
After: ~56k tok/s (+21%)
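
As a quick check of the figures above, the following sketch recomputes the relative speedup and the implied prefill time for the stated workload. The throughput values are the rounded "~46k" and "~56k" numbers from this description, so the results are approximate.

```python
# Workload from the benchmark description.
prompts = 200
input_len = 8192
total_tokens = prompts * input_len  # 1,638,400 prefill tokens

# Rounded throughput figures as reported ("~46k" and "~56k" tok/s).
before_tps = 46_000
after_tps = 56_000

speedup_pct = (after_tps / before_tps - 1) * 100
before_s = total_tokens / before_tps
after_s = total_tokens / after_tps

print(f"speedup: {speedup_pct:.1f}%")
print(f"prefill time: {before_s:.1f}s -> {after_s:.1f}s")
```

With the rounded inputs the speedup comes out slightly under 22%, which is roughly consistent with the reported +21%.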

Accuracy Tests

GSM8K Full Test (1319 questions)

Server command:

python3 -m sglang.launch_server --model-path stepfun-ai/Step-3.5-Flash-FP8 --tp 8 --ep 4 --trust-remote-code --port 30000

Benchmark command:

python3 -m sglang.test.few_shot_gsm8k --num-q 1319 --port 30000

Branch	Accuracy	Invalid
main	0.875	0.001
PR (step3p5-optimize-allreduce)	0.879	0.001

The difference is 0.4 percentage points (~5 of 1319 questions), within normal sampling variance. No accuracy regression.
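
A rough way to check the sampling-variance claim is to compare the accuracy gap with a binomial standard error. The sketch below assumes the two runs can be treated as independent samples of 1319 questions each. That is a simplification, since both runs use the same questions, so it is only an order-of-magnitude check.

```python
import math

n = 1319                          # GSM8K questions per run
acc_main, acc_pr = 0.875, 0.879   # accuracies from the table above

diff = acc_pr - acc_main
diff_questions = diff * n         # gap expressed in number of questions

# Pooled accuracy and standard error of the difference between two
# independent binomial proportions (assumption: runs are independent).
p = (acc_main + acc_pr) / 2
se_diff = math.sqrt(2 * p * (1 - p) / n)
z = diff / se_diff

print(f"gap: {diff_questions:.1f} questions, SE of gap: {se_diff:.4f}, z = {z:.2f}")
```

Under these assumptions the gap is well below one standard error, which supports reading it as noise rather than a real change.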

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

yhyang201 · 2026-04-14T08:14:36Z

/tag-and-rerun-ci

gemini-code-assist

Code Review

This pull request implements communication optimizations for the Step3p5 model, specifically adding support for all-reduce fusion and reduce-scatter to minimize Tensor Parallel overhead. It also optimizes layer sparsity checks and registers the model for server-side adjustments. Review feedback identifies a logic gap in the dense MLP path where internal all-reduces are not skipped during fusion, which could lead to redundant operations. A correction was also suggested for the debug tensor output to ensure the correct residual state is captured.

…reduction - Defer o_proj and share_expert all-reduce, combine with MoE output for one all-reduce per layer - Enable allreduce fusion and reduce-scatter support - Add Step3p5ForCausalLM to flashinfer allreduce fusion whitelist

Dense MLP (reduce_results=True) already performs an internal all-reduce. Without this fix, should_allreduce_fusion could still be True for dense layers during decode (batch_size <= 2048), causing the next layer to all-reduce again and multiplying values by world_size at each dense layer.
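
The scaling effect described above can be reproduced with a toy simulation. This is only a sketch: `all_reduce` is a hypothetical list-based helper, not an SGLang or torch.distributed call, and the per-rank values are made up.

```python
# Toy model: each rank holds a partial vector; all_reduce sums across ranks.
# all_reduce() is an illustrative helper, not a real library call.

def all_reduce(partials):
    """Sum per-rank vectors elementwise; every rank receives the total."""
    total = [sum(vals) for vals in zip(*partials)]
    return [list(total) for _ in partials]

world_size = 8
# Hypothetical partial outputs of a dense MLP on each rank.
partials = [[1.0, 2.0] for _ in range(world_size)]

# Dense MLP with reduce_results=True: the internal all-reduce already
# leaves the full result on every rank.
correct = all_reduce(partials)

# If the next layer all-reduces again, it sums world_size identical
# copies of the full result, scaling every value by world_size.
doubled = all_reduce(correct)

print(correct[0], doubled[0])
```

Because each rank already holds the complete sum after the first all-reduce, the second one multiplies it by `world_size`, and the error compounds at every dense layer where it happens.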

github-actions Bot added the run-ci label Apr 14, 2026

gemini-code-assist Bot reviewed Apr 14, 2026

Comment thread python/sglang/srt/models/step3p5.py

Comment thread python/sglang/srt/models/step3p5.py Outdated

mickqian approved these changes Apr 14, 2026

yhyang201 requested review from JustinTong0323 and yuan-luo as code owners April 15, 2026 10:16

yhyang201 added 5 commits April 15, 2026 12:18

Fix post_attn_residual debug dump to use pre-norm value

15f46e9

Clean up redundant variables in Step3p5DecoderLayer

9ac60c3

Remove Step3p5 from flashinfer allreduce fusion whitelist to fix corr…

5d359e0

…ectness bug

yhyang201 force-pushed the step3p5-optimize-allreduce branch from 7ed2974 to 5d359e0 Compare April 15, 2026 12:22

Remove unused _dump_tensor debug logic

1b5039b

yhyang201 merged commit b8794ba into sgl-project:main Apr 16, 2026
363 of 460 checks passed

jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026

[Step3p5] Optimize allreduce in MoE layers (sgl-project#22773)

6d08560

yhyang201 added a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[Step3p5] Optimize allreduce in MoE layers (sgl-project#22773)

be72361

zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026

[Step3p5] Optimize allreduce in MoE layers (sgl-project#22773)

351d301

kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026

[Step3p5] Optimize allreduce in MoE layers (sgl-project#22773)

bd97a47

yhyang201 merged 6 commits into sgl-project:main from yhyang201:step3p5-optimize-allreduce
