Replace all-reduce + dp_scatter with reduce_scatterv for DP attention #22642
Merged
Fridge003 merged 2 commits into sgl-project:main from Apr 14, 2026
Conversation
For DP attention with EP, the default MoE combine path performs an all-reduce followed by dp_scatter, i.e. two separate communication steps. This PR replaces them with a single reduce_scatterv call that fuses the reduce and scatter into one operation, improving throughput by ~7.7% (53k -> 57k tok/s on Qwen3.5-397B-A17B-FP8 DEP4). Only the post-kernel communication (combine phase) is changed; the dispatch phase and kernel inputs remain untouched. Made-with: Cursor
Contributor
Code Review
This pull request implements a reduce_scatterv optimization for MoE layers when using Data Parallel attention with Expert Parallelism. The review feedback suggests refactoring the communication logic in LayerCommunicator to use pre-allocated buffers for better memory efficiency and compatibility with symmetric memory. Additionally, a potential synchronization issue was identified in the qwen2_moe model where the shared expert might perform an inconsistent all-reduce, leading to tensor mismatches.
Ensures symmetric memory compatibility by using the standard DP buffer allocation path, and avoids an extra torch.empty inside reduce_scatterv. Made-with: Cursor
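The buffer-reuse point above can be illustrated with a toy pool. `get_local_dp_buffer` here is a hypothetical stand-in for sglang's helper, using NumPy arrays instead of CUDA tensors; the real helper returns a pre-allocated, symmetric-memory-compatible DP buffer rather than allocating with `torch.empty` on every call.

```python
import numpy as np

# Cache of pre-allocated buffers, keyed by (shape, dtype).
_dp_buffer_pool: dict = {}

def get_local_dp_buffer(shape, dtype=np.float32):
    # Hypothetical stand-in: return a cached buffer instead of a fresh
    # allocation, mimicking the "standard DP buffer allocation path".
    key = (tuple(shape), np.dtype(dtype).str)
    if key not in _dp_buffer_pool:
        _dp_buffer_pool[key] = np.empty(shape, dtype=dtype)
    return _dp_buffer_pool[key]

a = get_local_dp_buffer((4, 8))
b = get_local_dp_buffer((4, 8))
assert a is b  # same buffer reused; no per-call allocation
```

The design benefit is that a collective writing into a stable, pre-registered buffer can use symmetric-memory fast paths, which a freshly allocated output tensor cannot.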
Collaborator
Author
/tag-and-rerun-ci
Fridge003 approved these changes Apr 13, 2026
Collaborator
Author
/rerun-failed-ci
1 similar comment
Collaborator
Author
/rerun-failed-ci
pyc96 pushed a commit to pyc96/sglang that referenced this pull request Apr 14, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
This was referenced Apr 25, 2026
ByronHsu pushed a commit to ByronHsu/sglang that referenced this pull request Apr 25, 2026
Follow-up to sgl-project#23731 (Qwen3 MoE) — PR sgl-project#22642 introduced should_use_dp_reduce_scatterv() to fuse the post-MoE all-reduce with dp_scatter into a single reduce_scatterv inside LayerCommunicator, but only patched qwen2_moe.py to skip the model-side tensor_model_parallel_all_reduce when the fast path is active. Every other MoE model that does the same post-experts all-reduce double-reduces under DP attention + EP, exactly as Qwen3 did. Reported in sgl-project#23431 with a real GSM8K nightly: 0.951 pre-sgl-project#22642 → 0.002–0.010 post → 0.980 with the guard.

Mirror the guard onto the affected MoE models:
- bailing_moe.py
- bailing_moe_linear.py
- deepseek_v2.py (forward_normal + dual-stream variant; forward_cpu intentionally untouched since the CPU path doesn't trigger the fast path)
- exaone_moe.py
- glm4_moe.py (both forward_normal and dual-stream)
- hunyuan_v3.py (uses moe_expert_parallel_all_reduce + moe_tensor_model_parallel_all_reduce like qwen3_moe; both branches must be skipped when the fast path is active)
- llada2.py
- llama4.py
- mimo_v2_flash.py
- minimax_m2.py
- sarvam_moe.py (forward_normal + dual-stream)
- sdar_moe.py
- step3p5.py

Each file gains the same one-line `and not should_use_dp_reduce_scatterv()` guard alongside the existing `should_use_flashinfer_cutlass_moe_fp4_allgather` guard (or its equivalent), matching the pattern used in qwen2_moe.py and qwen3_moe.py.

Supersedes sgl-project#23431 (same diff for the 12 files there) and adds hunyuan_v3.py. Refs sgl-project#23729 sgl-project#23731 sgl-project#23431
ByronHsu pushed a commit to ByronHsu/sglang that referenced this pull request Apr 25, 2026
PR sgl-project#22642 introduced should_use_dp_reduce_scatterv() to fuse the post-MoE all-reduce with dp_scatter into a single reduce_scatterv inside LayerCommunicator. The qwen2_moe.py forward path was patched to skip the explicit tensor_model_parallel_all_reduce when this fast path is active, but qwen3_moe.py was missed.

As a result, Qwen3 MoE models running with DP attention + EP=DP (e.g. --tp 2 --dp 2 --ep 2 --enable-dp-attention, no --moe-a2a-backend) double-reduce the MoE output: once explicitly via moe_expert_parallel_all_reduce in forward_normal, then again inside reduce_scatterv from the communicator. The output is silently corrupted; the model still produces fluent text but logprobs differ from a tp-only baseline by 0.5–2 nats.

Repro: see sgl-project#23729 — same prompt, temperature 0, two servers tp=2 vs tp=2 dp=2 ep=2 dp_attention. Pre-fix the two configurations diverge at the second sampled token with max |Δlogprob|=2.03; post-fix they agree on 100/100 tokens with max |Δlogprob|=0.28 (within float drift).

Mirror the qwen2_moe.py guard onto both reduce branches in Qwen3MoeSparseMoeBlock.forward_normal.

Fixes sgl-project#23729

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
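The guard pattern these fixes describe can be sketched as follows. This is a hypothetical, simplified signature (`combine_moe_output`, `all_reduce`, and `use_dp_reduce_scatterv` are illustration names); the real code calls sglang's `should_use_dp_reduce_scatterv()` and `tensor_model_parallel_all_reduce` inside each model's forward.

```python
def combine_moe_output(hidden, tp_size, all_reduce, use_dp_reduce_scatterv):
    # Hypothetical sketch of the one-line guard mirrored into each MoE model.
    if tp_size > 1 and not use_dp_reduce_scatterv:
        # Legacy path: reduce here; the communicator later does dp_scatter.
        return all_reduce(hidden)
    # Fast path: LayerCommunicator's reduce_scatterv performs the reduction,
    # so reducing here as well would double-reduce the output (the exact bug
    # described above).
    return hidden

calls = []
def fake_all_reduce(x):
    calls.append(x)
    return x * 2  # stand-in for summing across 2 ranks

assert combine_moe_output(3.0, 2, fake_all_reduce, True) == 3.0 and not calls
assert combine_moe_output(3.0, 2, fake_all_reduce, False) == 6.0 and len(calls) == 1
```

Missing the guard is equivalent to always taking the legacy branch while the communicator also reduces, which sums the expert outputs twice.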
Motivation
For DP attention with Expert Parallelism (EP), the default MoE communication path performs two separate operations after the MoE kernel:

- tensor_model_parallel_all_reduce — reduces expert outputs across all DP workers
- dp_scatter — extracts each worker's local token slice from the global result

This is functionally equivalent to a single reduce_scatterv, which fuses the reduce and scatter into one NCCL collective, cutting the number of communication rounds in half.

Modifications
4 files changed, ~35 lines added
- python/sglang/srt/layers/moe/utils.py: Added should_use_dp_reduce_scatterv() — activates when DP attention + EP is enabled, no DeepEP or FP4 allgather path is active, and ep_size == dp_size.
- python/sglang/srt/models/qwen2_moe.py: Skip tensor_model_parallel_all_reduce when should_use_dp_reduce_scatterv() is true (the reduction is deferred to the communicator).
- python/sglang/srt/layers/communicator.py: In CommunicateSummableTensorPairFn._scatter_hidden_states, use reduce_scatterv (via get_tp_group().reduce_scatterv) instead of dp_scatter when the flag is active. Output is allocated from get_local_dp_buffer() for symmetric memory compatibility.
- python/sglang/srt/layers/moe/__init__.py: Export the new utility function.

The dispatch phase and kernel inputs are completely untouched — only the post-kernel communication (combine/scatter) is changed.
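As a sketch, the activation condition described for should_use_dp_reduce_scatterv() might look like the following. This is a self-contained approximation with explicit arguments; the real function in python/sglang/srt/layers/moe/utils.py reads sglang's global server and parallel state instead.

```python
def should_use_dp_reduce_scatterv(enable_dp_attention, ep_size, dp_size,
                                  moe_a2a_backend, use_fp4_allgather):
    # Hypothetical approximation of the fast-path condition described above.
    return (
        enable_dp_attention            # DP attention + EP must be enabled
        and moe_a2a_backend is None    # no DeepEP-style a2a path active
        and not use_fp4_allgather      # no FP4 allgather path active
        and ep_size == dp_size         # fast path requires ep_size == dp_size
    )

assert should_use_dp_reduce_scatterv(True, 4, 4, None, False)
assert not should_use_dp_reduce_scatterv(True, 2, 4, None, False)      # ep != dp
assert not should_use_dp_reduce_scatterv(True, 4, 4, "deepep", False)  # DeepEP active
```

Gating on ep_size == dp_size is what guarantees each rank's reduce_scatterv output chunk corresponds exactly to one DP worker's token slice.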
Accuracy Tests
GSM8K 8-shot on Qwen3.5-397B-A17B-FP8, DP4 EP4, 1319 examples, max_tokens=16384:
Accuracy is identical — the optimization is mathematically equivalent (reduce + scatter = reduce_scatter).
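The reduce + scatter = reduce_scatter equivalence can be checked with a small single-process NumPy simulation (helper names here are illustrative; the real code runs these as NCCL collectives on CUDA tensors):

```python
import numpy as np

def allreduce_then_scatter(rank_tensors, rank):
    # Baseline: every rank receives the full reduced tensor (all-reduce),
    # then slices out its own chunk (dp_scatter).
    full = np.sum(rank_tensors, axis=0)
    chunks = np.split(full, len(rank_tensors))
    return chunks[rank]

def reduce_scatter(rank_tensors, rank):
    # Fused: reduce each chunk and deliver only the owner's slice.
    n = len(rank_tensors)
    chunked = [np.split(t, n) for t in rank_tensors]
    return np.sum([c[rank] for c in chunked], axis=0)

rng = np.random.default_rng(0)
world = 4
tensors = [rng.standard_normal(8) for _ in range(world)]
for r in range(world):
    assert np.allclose(allreduce_then_scatter(tensors, r), reduce_scatter(tensors, r))
```

Both paths compute the same sum per chunk; the fused version simply never materializes the chunks a rank does not own.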
Speed Tests and Profiling
Throughput
Max-throughput benchmark on Qwen3.5-397B-A17B-FP8, 1×GB200 node (4 GPUs), DP4 EP4 TP4, ISL=1000 OSL=1, concurrency=4096:
Profiling (100 decode steps, Torch Profiler)
NCCL communication summary (DP0, TP0):
Per-kernel latency (single decode step, steady state):
- ncclDevKernel_Reduce_Sum_bf16 (reduce_scatterv)
- ncclDevKernel_AllReduce_Sum_bf16 (baseline)

Why reduce_scatterv is faster:

AllReduce = ReduceScatter + AllGather: it reduces data across all ranks and broadcasts the full result back to every rank. In DP attention, each rank only needs its own token subset for the next attention layer — the AllGather half is wasted work. reduce_scatterv performs only the reduce-scatter phase, delivering each rank exactly the tokens it owns. This cuts the communication volume roughly in half (~37% per-kernel latency reduction, ~13.6% total NCCL time reduction), directly translating to the +7.7% end-to-end throughput gain.
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci