Fix Qwen3 MoE double-reduce when DP attention + EP + reduce_scatterv (#23729) #23731
Conversation
PR sgl-project#22642 introduced `should_use_dp_reduce_scatterv()` to fuse the post-MoE all-reduce with the dp_scatter into a single `reduce_scatterv` inside `LayerCommunicator`. The `qwen2_moe.py` forward path was patched to skip the explicit `tensor_model_parallel_all_reduce` when this fast path is active, but `qwen3_moe.py` was missed.

As a result, Qwen3 MoE models running with DP attention + EP=DP (e.g. `--tp 2 --dp 2 --ep 2 --enable-dp-attention`, no `--moe-a2a-backend`) double-reduce the MoE output: once explicitly via `moe_expert_parallel_all_reduce` in `forward_normal`, then again inside `reduce_scatterv` from the communicator. The output is silently corrupted; the model still produces fluent text, but logprobs differ from a TP-only baseline by 0.5–2 nats.

Repro (see sgl-project#23729): same prompt, temperature 0, two servers, tp=2 vs tp=2 dp=2 ep=2 dp_attention. Pre-fix, the two configurations diverge at the second sampled token with max |Δlogprob| = 2.03; post-fix they agree on 100/100 tokens with max |Δlogprob| = 0.28 (within float drift).

Mirror the `qwen2_moe.py` guard onto both reduce branches in `Qwen3MoeSparseMoeBlock.forward_normal`.

Fixes sgl-project#23729
Code Review
This pull request integrates the should_use_dp_reduce_scatterv check into the qwen3_moe model's forward pass to conditionally skip all-reduce operations. The review feedback suggests also incorporating the use_reduce_scatter flag check within the expert parallel block to ensure consistency with the tensor parallel logic and prevent redundant reductions.
```python
if (
    self.ep_size > 1
    and not should_allreduce_fusion
    and not should_use_dp_reduce_scatterv()
):
```
The expert parallel (EP) all-reduce block should also check the use_reduce_scatter flag, mirroring the logic used in the tensor parallel (TP) block at line 345. This ensures that redundant EP all-reduces are skipped when the communicator is configured to perform a reduce-scatter operation instead of an all-reduce.
Suggested change:

```diff
 if (
     self.ep_size > 1
     and not should_allreduce_fusion
+    and not use_reduce_scatter
     and not should_use_dp_reduce_scatterv()
 ):
```
Follow-up to sgl-project#23731 (Qwen3 MoE). PR sgl-project#22642 introduced `should_use_dp_reduce_scatterv()` to fuse the post-MoE all-reduce with dp_scatter into a single `reduce_scatterv` inside `LayerCommunicator`, but only patched `qwen2_moe.py` to skip the model-side `tensor_model_parallel_all_reduce` when the fast path is active. Every other MoE model that does the same post-experts all-reduce double-reduces under DP attention + EP, exactly as Qwen3 did. Reported in sgl-project#23431 with a real GSM8K nightly: 0.951 pre-sgl-project#22642 → 0.002–0.010 post → 0.980 with the guard.

Mirror the guard onto the affected MoE models:

- bailing_moe.py
- bailing_moe_linear.py
- deepseek_v2.py (forward_normal + dual-stream variant; forward_cpu intentionally untouched since the CPU path doesn't trigger the fast path)
- exaone_moe.py
- glm4_moe.py (both forward_normal and dual-stream)
- hunyuan_v3.py (uses moe_expert_parallel_all_reduce + moe_tensor_model_parallel_all_reduce like qwen3_moe; both branches must be skipped when the fast path is active)
- llada2.py
- llama4.py
- mimo_v2_flash.py
- minimax_m2.py
- sarvam_moe.py (forward_normal + dual-stream)
- sdar_moe.py
- step3p5.py

Each file gains the same one-line `and not should_use_dp_reduce_scatterv()` guard alongside the existing `should_use_flashinfer_cutlass_moe_fp4_allgather` guard (or its equivalent), matching the pattern used in qwen2_moe.py and qwen3_moe.py.

Supersedes sgl-project#23431 (same diff for the 12 files there) and adds hunyuan_v3.py. Refs sgl-project#23729 sgl-project#23731 sgl-project#23431
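For reference, a minimal sketch of what the shared pattern looks like after the change. This is illustrative, not the literal diff of any one file: `self.tp_size`, the surrounding assignment, and the exact set of coexisting conditions (e.g. the flashinfer cutlass allgather guard) vary per model.

```python
# Illustrative only: the TP-side reduce after adding the new guard.
# Names follow the qwen2_moe.py / qwen3_moe.py pattern; per-file conditions differ.
if (
    self.tp_size > 1
    and not should_allreduce_fusion
    and not use_reduce_scatter
    and not should_use_dp_reduce_scatterv()  # new: LayerCommunicator reduces via reduce_scatterv
):
    final_hidden_states = tensor_model_parallel_all_reduce(final_hidden_states)
```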
…gl-project#23729) (sgl-project#23731) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com>
…follow-up to sgl-project#23731) (sgl-project#23734) Co-authored-by: Byron Hsu <byron@periodiclabs.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llow-up to sgl-project#23731) (sgl-project#23732) Co-authored-by: Byron Hsu <byronhsu@noreply.github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Motivation
PR #22642 introduced `should_use_dp_reduce_scatterv()`, which fuses the post-MoE all-reduce with the dp_scatter into a single `reduce_scatterv` inside `LayerCommunicator`. The `qwen2_moe.py` forward path was patched to skip the explicit `tensor_model_parallel_all_reduce` when the fast path is active, but `qwen3_moe.py` was missed.

As a result, Qwen3 MoE models running with DP attention + EP (e.g. `--tp 2 --dp 2 --ep 2 --enable-dp-attention`, no `--moe-a2a-backend`) double-reduce the MoE output: once explicitly via `moe_expert_parallel_all_reduce` (and `moe_tensor_model_parallel_all_reduce`) in `forward_normal`, then again inside `reduce_scatterv` from the communicator. The output is silently corrupted; the model still produces fluent text, but logprobs differ from a TP-only baseline by 0.5–2 nats.

Reported as #23729.
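To make the failure mode concrete, a toy illustration (plain Python, not SGLang code) of why reducing twice corrupts the output: once the model has already all-reduced the partial expert outputs, the communicator's `reduce_scatterv` sums the already-complete values again, scaling them by the rank count.

```python
# Two "ranks", each holding a partial MoE output for the same token.
partials = [1.0, 2.0]

correct = sum(partials)                       # fused reduce_scatterv alone -> 3.0
after_model_allreduce = [sum(partials)] * 2   # explicit all-reduce already ran on every rank
double_reduced = sum(after_model_allreduce)   # communicator reduces again -> 6.0 (2x too large)

print(correct, double_reduced)  # 3.0 6.0
```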
Reproduction (no DP attention vs DP+EP, same model, same prompt, temperature 0)
v0.5.9 is also clean (76/100 prefix, max |Δlp|=0.147 — pure float drift), confirming the regression was introduced after that release.
Agreement between configuration A (the TP-only baseline) and configuration D (DP attention + EP, the previously broken setup) is now within normal float drift, matching the level of agreement between any of the (TP-only, DP-attn-only, EP-only) configurations on this model.
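For completeness, a minimal comparison harness in the spirit of the repro above. It is a sketch under stated assumptions: the two servers are assumed to be launched as in the comments, the ports, model name, and prompt are placeholders, and per-token logprobs are assumed to be readable from the OpenAI-compatible completions endpoint with `logprobs` set.

```python
# Hypothetical comparison harness, not part of the PR.
# Baseline server:  python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 \
#                       --tp 2 --port 30000
# Affected server:  python -m sglang.launch_server --model-path Qwen/Qwen3-30B-A3B-Instruct-2507 \
#                       --tp 2 --dp 2 --ep 2 --enable-dp-attention --port 30001
import requests

PROMPT = "Write a short proof that sqrt(2) is irrational."  # placeholder prompt


def greedy_logprobs(port: int, n_tokens: int = 100) -> list[float]:
    """Greedy-decode n_tokens on one server and return its per-token logprobs."""
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={
            "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",  # placeholder name
            "prompt": PROMPT,
            "max_tokens": n_tokens,
            "temperature": 0,
            "logprobs": 1,
        },
        timeout=300,
    ).json()
    return resp["choices"][0]["logprobs"]["token_logprobs"]


baseline = greedy_logprobs(30000)   # tp=2
candidate = greedy_logprobs(30001)  # tp=2 dp=2 ep=2 dp_attention
deltas = [abs(a - b) for a, b in zip(baseline, candidate)]
print(f"max |dlogprob| over {len(deltas)} tokens: {max(deltas):.3f}")
```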
Modifications
In `Qwen3MoeSparseMoeBlock.forward_normal`, mirror the `qwen2_moe.py` guard onto both the EP and TP all-reduce branches: skip them when `should_use_dp_reduce_scatterv()` returns True (the communicator now performs the reduce inside `reduce_scatterv`).
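A simplified sketch of the two guarded branches after the change. Only the collective calls are shown; the if/elif structure and surrounding logic are a simplification, and the final merged code may also include the `use_reduce_scatter` check on the EP branch suggested in the review above.

```python
# Post-fix shape of the reduce branches in Qwen3MoeSparseMoeBlock.forward_normal (simplified).
if (
    self.ep_size > 1
    and not should_allreduce_fusion
    and not should_use_dp_reduce_scatterv()  # EP branch: communicator will reduce_scatterv
):
    final_hidden_states = moe_expert_parallel_all_reduce(final_hidden_states)
elif (
    self.tp_size > 1
    and not should_allreduce_fusion
    and not use_reduce_scatter
    and not should_use_dp_reduce_scatterv()  # TP branch: same guard
):
    final_hidden_states = moe_tensor_model_parallel_all_reduce(final_hidden_states)
```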
Accuracy

The fix only deletes redundant collectives along the affected branch; numerics on every other path are identical. Validated above on Qwen3-30B-A3B-Instruct-2507 in the broken configuration; configurations that didn't trigger `should_use_dp_reduce_scatterv()` (TP-only, DP-attn alone, EP alone) are unaffected.

Checklist
cc @YAMY1234 (PR #22642 author)
Fixes #23729