[Step3p5] Optimize allreduce in MoE layers #22773

Merged
yhyang201 merged 6 commits into sgl-project:main from yhyang201:step3p5-optimize-allreduce
Apr 16, 2026

@yhyang201
Collaborator

@yhyang201 yhyang201 commented Apr 14, 2026

Motivation

Modifications

  • Defer the o_proj and share_expert all-reduces and combine them with the MoE output into a single all-reduce per layer (previously 3 separate all-reduces)
  • Enable allreduce fusion and reduce-scatter for Step3p5
  • Add Step3p5ForCausalLM to flashinfer allreduce fusion whitelist
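
Merging the three all-reduces is valid because a sum all-reduce is linear: summing partial results locally on each rank and reducing once gives the same result as reducing each tensor separately and adding afterwards. A minimal sketch of that property follows. The `all_reduce` helper, the rank count, and the tensor values are illustrative assumptions (plain Python lists standing in for per-rank tensors), and the sketch ignores residual connections and normalization, so it does not reproduce the actual data flow in step3p5.py.

```python
# Toy model of tensor parallelism: each "rank" holds a partial result as a list.
# all_reduce is a plain-Python stand-in for a sum all-reduce, not a real collective.

def all_reduce(per_rank):
    """Elementwise sum across ranks; returns the vector every rank would hold."""
    return [sum(vals) for vals in zip(*per_rank)]


def add(a, b):
    """Elementwise addition of two vectors."""
    return [x + y for x, y in zip(a, b)]


WORLD_SIZE = 4  # hypothetical TP size, chosen only for the example

# Hypothetical per-rank partial outputs standing in for the o_proj,
# shared-expert, and MoE contributions.
attn = [[float(r + 1), 2.0 * r] for r in range(WORLD_SIZE)]
shared = [[0.5 * r, 1.0] for r in range(WORLD_SIZE)]
moe = [[-1.0 * r, 3.0 + r] for r in range(WORLD_SIZE)]

# Before: three collectives, results added afterwards.
separate = add(add(all_reduce(attn), all_reduce(shared)), all_reduce(moe))

# After: each rank sums its own partials, then a single collective runs.
local_sum = [add(add(attn[r], shared[r]), moe[r]) for r in range(WORLD_SIZE)]
fused = all_reduce(local_sum)

print(separate == fused)
```

The values here are exact in floating point, so the comparison is exact. With real half-precision tensors the two orderings can differ by rounding, which is one reason an accuracy check is still worth running.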

Performance

Good Performance Launch Command:
H200x8
63,213 TPS

  python3 -m sglang.launch_server \
      --model-path stepfun-ai/Step-3.5-Flash-FP8 \
      --tp 8 \
      --ep 4 \
      --trust-remote-code \
      --mem-fraction-static 0.75 \
      --chunked-prefill-size 16384 \
      --port 30000 \
      --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 64}'

Prefill throughput (TP=8, EP=4, input_len=8192, output_len=1, 200 prompts):

  • Before: ~46k tok/s
  • After: ~56k tok/s (+21%)
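
As a quick arithmetic check on the figures above, the relative gain can be recomputed from the two rounded throughputs. The inputs are the approximate values quoted in this description, so the result is only as precise as those roundings.

```python
# Approximate prefill throughputs quoted above (tokens per second).
before_tok_s = 46_000
after_tok_s = 56_000

# Relative improvement of "after" over "before".
gain = (after_tok_s - before_tok_s) / before_tok_s

# Total prompt tokens in the benchmark: 200 prompts of 8192 input tokens each.
total_tokens = 200 * 8192

print(f"relative gain: {gain:.1%}")
print(f"prefill time before: {total_tokens / before_tok_s:.1f} s")
print(f"prefill time after:  {total_tokens / after_tok_s:.1f} s")
```

With these rounded inputs the gain comes out near 21.7%, in line with the reported +21%.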

Accuracy Tests

GSM8K Full Test (1319 questions)

Server command:

python3 -m sglang.launch_server --model-path stepfun-ai/Step-3.5-Flash-FP8 --tp 8 --ep 4 --trust-remote-code --port 30000

Benchmark command:

python3 -m sglang.test.few_shot_gsm8k --num-q 1319 --port 30000

| Branch | Accuracy | Invalid |
|---|---|---|
| main | 0.875 | 0.001 |
| PR (step3p5-optimize-allreduce) | 0.879 | 0.001 |

The difference is 0.4 percentage points (~5 questions), within normal sampling variance. No accuracy regression.
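
A rough way to size "normal sampling variance" is the binomial standard error of a single accuracy measurement on 1319 questions. Treating each question as an independent Bernoulli trial is an assumption made here for illustration, not something the benchmark reports, and it ignores that both runs use the same questions.

```python
import math

n = 1319          # GSM8K questions evaluated
acc_main = 0.875  # accuracy on main
acc_pr = 0.879    # accuracy on the PR branch

# Observed difference, as a fraction and as a number of questions.
diff = acc_pr - acc_main
diff_questions = diff * n

# Binomial standard error of one accuracy measurement (assumed independence).
std_err = math.sqrt(acc_main * (1 - acc_main) / n)

print(f"difference: {diff:.3f} (~{diff_questions:.1f} questions)")
print(f"standard error of one run: {std_err:.4f}")
print(f"difference below one standard error: {abs(diff) < std_err}")
```

Under that assumption the standard error is about 0.9 percentage points, so a 0.4-point gap is well inside the run-to-run noise.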

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@yhyang201
Collaborator Author

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements communication optimizations for the Step3p5 model, specifically adding support for all-reduce fusion and reduce-scatter to minimize Tensor Parallel overhead. It also optimizes layer sparsity checks and registers the model for server-side adjustments. Review feedback identifies a logic gap in the dense MLP path where internal all-reduces are not skipped during fusion, which could lead to redundant operations. A correction was also suggested for the debug tensor output to ensure the correct residual state is captured.

Comment thread python/sglang/srt/models/step3p5.py
Comment thread python/sglang/srt/models/step3p5.py Outdated
…reduction

- Defer o_proj and share_expert all-reduce, combine with MoE output for one all-reduce per layer
- Enable allreduce fusion and reduce-scatter support
- Add Step3p5ForCausalLM to flashinfer allreduce fusion whitelist
Dense MLP (reduce_results=True) already performs an internal all-reduce.
Without this fix, should_allreduce_fusion could still be True for dense
layers during decode (batch_size <= 2048), causing the next layer to
all-reduce again and multiplying values by world_size at each dense layer.
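
The failure mode described in this commit message can be reproduced with a toy sum all-reduce: reducing a tensor that has already been reduced scales it by the world size. The sketch below is illustrative only. `all_reduce` is a plain-Python stand-in for the collective, and `should_fuse_next_allreduce` is a hypothetical, simplified guard, not the actual `should_allreduce_fusion` logic.

```python
# Toy sum all-reduce over per-rank vectors; every rank receives the same total.
def all_reduce(per_rank):
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


WORLD_SIZE = 8  # matches --tp 8 in the launch command

# Each rank holds the same partial output in this toy case.
partials = [[1.0, 2.0] for _ in range(WORLD_SIZE)]

# The dense MLP's internal all-reduce: every rank now holds [8.0, 16.0].
reduced_once = all_reduce(partials)

# A second, deferred all-reduce on already-reduced data multiplies the
# values by WORLD_SIZE: every rank now holds [64.0, 128.0].
reduced_twice = all_reduce(reduced_once)


def should_fuse_next_allreduce(layer_is_dense, fusion_enabled):
    """Hypothetical guard: never defer the all-reduce for a dense layer,
    because that layer has already reduced its output internally."""
    return fusion_enabled and not layer_is_dense


print(reduced_once[0], reduced_twice[0])
print(should_fuse_next_allreduce(layer_is_dense=True, fusion_enabled=True))
```
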
@yhyang201 yhyang201 force-pushed the step3p5-optimize-allreduce branch from 7ed2974 to 5d359e0 Compare April 15, 2026 12:22
@yhyang201 yhyang201 merged commit b8794ba into sgl-project:main Apr 16, 2026
363 of 460 checks passed
jmamou pushed a commit to jmamou/sglang that referenced this pull request Apr 20, 2026
yhyang201 added a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request Apr 23, 2026
kyx1999 pushed a commit to KMSorSMS/sglang that referenced this pull request Apr 27, 2026

2 participants