Replace all-reduce + dp_scatter with reduce_scatterv for DP attention #22642
Merged
Fridge003 merged 2 commits into sgl-project:main from Apr 14, 2026
Conversation
For DP attention with EP, the default MoE combine path performs an all-reduce followed by dp_scatter, i.e. two separate communication steps. This PR replaces them with a single reduce_scatterv call that fuses the reduce and scatter into one operation, improving throughput by ~7.7% (53k -> 57k tok/s on Qwen3.5-397B-A17B-FP8 DEP4). Only the post-kernel communication (combine phase) is changed; the dispatch phase and kernel inputs remain untouched. Made-with: Cursor
Contributor
Code Review
This pull request implements a reduce_scatterv optimization for MoE layers when using Data Parallel attention with Expert Parallelism. The review feedback suggests refactoring the communication logic in LayerCommunicator to use pre-allocated buffers for better memory efficiency and compatibility with symmetric memory. Additionally, a potential synchronization issue was identified in the qwen2_moe model where the shared expert might perform an inconsistent all-reduce, leading to tensor mismatches.
Ensures symmetric memory compatibility by using the standard DP buffer allocation path, and avoids an extra torch.empty inside reduce_scatterv. Made-with: Cursor
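The buffer-reuse point above can be illustrated with a toy pool. `get_local_dp_buffer` here is a hypothetical stand-in for sglang's helper, using NumPy arrays instead of CUDA tensors; the real helper returns a pre-allocated, symmetric-memory-compatible DP buffer rather than allocating with `torch.empty` on every call.

```python
import numpy as np

# Cache of pre-allocated buffers, keyed by (shape, dtype).
_dp_buffer_pool: dict = {}

def get_local_dp_buffer(shape, dtype=np.float32):
    # Hypothetical stand-in: return a cached buffer instead of a fresh
    # allocation, mimicking the "standard DP buffer allocation path".
    key = (tuple(shape), np.dtype(dtype).str)
    if key not in _dp_buffer_pool:
        _dp_buffer_pool[key] = np.empty(shape, dtype=dtype)
    return _dp_buffer_pool[key]

a = get_local_dp_buffer((4, 8))
b = get_local_dp_buffer((4, 8))
assert a is b  # same buffer reused; no per-call allocation
```

The design benefit is that a collective writing into a stable, pre-registered buffer can use symmetric-memory fast paths, which a freshly allocated output tensor cannot.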
Collaborator
Author
/tag-and-rerun-ci
Fridge003 approved these changes Apr 13, 2026
Collaborator
Author
/rerun-failed-ci
1 similar comment
Collaborator
Author
/rerun-failed-ci
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
This was referenced Apr 25, 2026
ByronHsu pushed a commit to ByronHsu/sglang that referenced this pull request Apr 25, 2026
Follow-up to sgl-project#23731 (Qwen3 MoE) — PR sgl-project#22642 introduced should_use_dp_reduce_scatterv() to fuse the post-MoE all-reduce with dp_scatter into a single reduce_scatterv inside LayerCommunicator, but only patched qwen2_moe.py to skip the model-side tensor_model_parallel_all_reduce when the fast path is active. Every other MoE model that does the same post-experts all-reduce double-reduces under DP attention + EP, exactly as Qwen3 did. Reported in sgl-project#23431 with a real GSM8K nightly: 0.951 pre-sgl-project#22642 → 0.002–0.010 post → 0.980 with the guard.

Mirror the guard onto the affected MoE models:
- bailing_moe.py
- bailing_moe_linear.py
- deepseek_v2.py (forward_normal + dual-stream variant; forward_cpu intentionally untouched since the CPU path doesn't trigger the fast path)
- exaone_moe.py
- glm4_moe.py (both forward_normal and dual-stream)
- hunyuan_v3.py (uses moe_expert_parallel_all_reduce + moe_tensor_model_parallel_all_reduce like qwen3_moe; both branches must be skipped when the fast path is active)
- llada2.py
- llama4.py
- mimo_v2_flash.py
- minimax_m2.py
- sarvam_moe.py (forward_normal + dual-stream)
- sdar_moe.py
- step3p5.py

Each file gains the same one-line `and not should_use_dp_reduce_scatterv()` guard alongside the existing `should_use_flashinfer_cutlass_moe_fp4_allgather` guard (or its equivalent), matching the pattern used in qwen2_moe.py and qwen3_moe.py.

Supersedes sgl-project#23431 (same diff for the 12 files there) and adds hunyuan_v3.py. Refs sgl-project#23729 sgl-project#23731 sgl-project#23431
ByronHsu pushed a commit to ByronHsu/sglang that referenced this pull request Apr 25, 2026
PR sgl-project#22642 introduced should_use_dp_reduce_scatterv() to fuse the post-MoE all-reduce with dp_scatter into a single reduce_scatterv inside LayerCommunicator. The qwen2_moe.py forward path was patched to skip the explicit tensor_model_parallel_all_reduce when this fast path is active, but qwen3_moe.py was missed.

As a result, Qwen3 MoE models running with DP attention + EP=DP (e.g. --tp 2 --dp 2 --ep 2 --enable-dp-attention, no --moe-a2a-backend) double-reduce the MoE output: once explicitly via moe_expert_parallel_all_reduce in forward_normal, then again inside reduce_scatterv from the communicator. The output is silently corrupted; the model still produces fluent text but logprobs differ from a tp-only baseline by 0.5–2 nats.

Repro: see sgl-project#23729 — same prompt, temperature 0, two servers tp=2 vs tp=2 dp=2 ep=2 dp_attention. Pre-fix the two configurations diverge at the second sampled token with max |Δlogprob|=2.03; post-fix they agree on 100/100 tokens with max |Δlogprob|=0.28 (within float drift).

Mirror the qwen2_moe.py guard onto both reduce branches in Qwen3MoeSparseMoeBlock.forward_normal.

Fixes sgl-project#23729

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
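The guard pattern these fixes describe can be sketched as follows. This is a hypothetical, simplified signature (`combine_moe_output`, `all_reduce`, and `use_dp_reduce_scatterv` are illustration names); the real code calls sglang's `should_use_dp_reduce_scatterv()` and `tensor_model_parallel_all_reduce` inside each model's forward.

```python
def combine_moe_output(hidden, tp_size, all_reduce, use_dp_reduce_scatterv):
    # Hypothetical sketch of the one-line guard mirrored into each MoE model.
    if tp_size > 1 and not use_dp_reduce_scatterv:
        # Legacy path: reduce here; the communicator later does dp_scatter.
        return all_reduce(hidden)
    # Fast path: LayerCommunicator's reduce_scatterv performs the reduction,
    # so reducing here as well would double-reduce the output (the exact bug
    # described above).
    return hidden

calls = []
def fake_all_reduce(x):
    calls.append(x)
    return x * 2  # stand-in for summing across 2 ranks

assert combine_moe_output(3.0, 2, fake_all_reduce, True) == 3.0 and not calls
assert combine_moe_output(3.0, 2, fake_all_reduce, False) == 6.0 and len(calls) == 1
```

Missing the guard is equivalent to always taking the legacy branch while the communicator also reduces, which sums the expert outputs twice.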
Motivation
For DP attention with Expert Parallelism (EP), the default MoE communication path performs two separate operations after the MoE kernel:

- tensor_model_parallel_all_reduce — reduces expert outputs across all DP workers
- dp_scatter — extracts each worker's local token slice from the global result

This is functionally equivalent to a single reduce_scatterv, which fuses the reduce and scatter into one NCCL collective, cutting the number of communication rounds in half.

Modifications
4 files changed, ~35 lines added
- python/sglang/srt/layers/moe/utils.py: Added should_use_dp_reduce_scatterv() — activates when DP attention + EP is enabled, no DeepEP or FP4 allgather path is active, and ep_size == dp_size.
- python/sglang/srt/models/qwen2_moe.py: Skip tensor_model_parallel_all_reduce when should_use_dp_reduce_scatterv() is true (the reduction is deferred to the communicator).
- python/sglang/srt/layers/communicator.py: In CommunicateSummableTensorPairFn._scatter_hidden_states, use reduce_scatterv (via get_tp_group().reduce_scatterv) instead of dp_scatter when the flag is active. Output is allocated from get_local_dp_buffer() for symmetric memory compatibility.
- python/sglang/srt/layers/moe/__init__.py: Export the new utility function.

The dispatch phase and kernel inputs are completely untouched — only the post-kernel communication (combine/scatter) is changed.
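As a sketch, the activation condition described for should_use_dp_reduce_scatterv() might look like the following. This is a self-contained approximation with explicit arguments; the real function in python/sglang/srt/layers/moe/utils.py reads sglang's global server and parallel state instead.

```python
def should_use_dp_reduce_scatterv(enable_dp_attention, ep_size, dp_size,
                                  moe_a2a_backend, use_fp4_allgather):
    # Hypothetical approximation of the fast-path condition described above.
    return (
        enable_dp_attention            # DP attention + EP must be enabled
        and moe_a2a_backend is None    # no DeepEP-style a2a path active
        and not use_fp4_allgather      # no FP4 allgather path active
        and ep_size == dp_size         # fast path requires ep_size == dp_size
    )

assert should_use_dp_reduce_scatterv(True, 4, 4, None, False)
assert not should_use_dp_reduce_scatterv(True, 2, 4, None, False)      # ep != dp
assert not should_use_dp_reduce_scatterv(True, 4, 4, "deepep", False)  # DeepEP active
```

Gating on ep_size == dp_size is what guarantees each rank's reduce_scatterv output chunk corresponds exactly to one DP worker's token slice.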
Accuracy Tests
GSM8K 8-shot on Qwen3.5-397B-A17B-FP8, DP4 EP4, 1319 examples, max_tokens=16384:
Accuracy is identical — the optimization is mathematically equivalent (reduce + scatter = reduce_scatter).
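The reduce + scatter = reduce_scatter equivalence can be checked with a small single-process NumPy simulation (helper names here are illustrative; the real code runs these as NCCL collectives on CUDA tensors):

```python
import numpy as np

def allreduce_then_scatter(rank_tensors, rank):
    # Baseline: every rank receives the full reduced tensor (all-reduce),
    # then slices out its own chunk (dp_scatter).
    full = np.sum(rank_tensors, axis=0)
    chunks = np.split(full, len(rank_tensors))
    return chunks[rank]

def reduce_scatter(rank_tensors, rank):
    # Fused: reduce each chunk and deliver only the owner's slice.
    n = len(rank_tensors)
    chunked = [np.split(t, n) for t in rank_tensors]
    return np.sum([c[rank] for c in chunked], axis=0)

rng = np.random.default_rng(0)
world = 4
tensors = [rng.standard_normal(8) for _ in range(world)]
for r in range(world):
    assert np.allclose(allreduce_then_scatter(tensors, r), reduce_scatter(tensors, r))
```

Both paths compute the same sum per chunk; the fused version simply never materializes the chunks a rank does not own.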
Speed Tests and Profiling
Throughput
Max-throughput benchmark on Qwen3.5-397B-A17B-FP8, 1×GB200 node (4 GPUs), DP4 EP4 TP4, ISL=1000 OSL=1, concurrency=4096:
Profiling (100 decode steps, Torch Profiler)
NCCL communication summary (DP0, TP0):
Per-kernel latency (single decode step, steady state):
- ncclDevKernel_Reduce_Sum_bf16 (reduce_scatterv)
- ncclDevKernel_AllReduce_Sum_bf16 (baseline)

Why reduce_scatterv is faster:

AllReduce = ReduceScatter + AllGather: it reduces data across all ranks and broadcasts the full result back to every rank. In DP attention, each rank only needs its own token subset for the next attention layer — the AllGather half is wasted work. reduce_scatterv performs only the reduce-scatter phase, delivering each rank exactly the tokens it owns. This cuts the communication volume roughly in half (~37% per-kernel latency reduction, ~13.6% total NCCL time reduction), directly translating to the +7.7% end-to-end throughput gain.
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci