Support Qwen3 MoE context parallel #18233
Merged
Fridge003 merged 13 commits into sgl-project:main on Mar 22, 2026
Conversation
Contributor (Author)
/tag-and-rerun-ci
added 2 commits on March 18, 2026
Contributor (Author)
/tag-and-rerun-ci
Contributor (Author)
/tag-and-rerun-ci
Contributor (Author)
/rerun-failed-ci
Fridge003 approved these changes, Mar 21, 2026
Collaborator
/rerun-ut registered/spec/eagle/test_eagle_dp_attention.py
Contributor
✅ Triggered
OrangeRedeng pushed a commit to OrangeRedeng/sglang that referenced this pull request, Mar 22, 2026
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Collaborator
This PR introduced test_qwen3_30b.py, which broke CI. @Fridge003 @alisonshao @Kangyan-Zhou Impacted CI:
0-693 pushed a commit to 0-693/sglang that referenced this pull request, Mar 25, 2026
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
dutsc pushed a commit to dutsc/sglang that referenced this pull request, Mar 30, 2026
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request, Apr 2, 2026
PR sgl-project#18233 (bb737d7) switched MoE allreduce from the _TP group to the dedicated _MOE_TP group but did not add _MOE_TP to graph_capture(). During CUDA graph replay the custom-allreduce kernel dereferences unregistered IPC handles, causing illegal-memory-access crashes on every MoE model launched with 1 < ep < tp (e.g. Qwen3-235B-FP8 --tp=8 --ep=2).
Nightly CI confirms the breaking point:
• Mar 20 (before sgl-project#18233): model loads; fails with a different test-framework error
• Mar 23 (after sgl-project#18233): exit code -9 (OOM-killed / segfault)
Validated on 8xH200 with Qwen3-235B-A22B-Instruct-2507-FP8 --tp=8 --ep=2 (gsm8k accuracy 96.4%, 3124 tok/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fridge003 added a commit that referenced this pull request, Apr 2, 2026
When ep_size > 1 and ep_size < tp_size, the _MOE_TP group is distinct from _TP. PR #18233 switched MoE allreduce to use _MOE_TP but forgot to register it in graph_capture(). This causes illegal memory access during CUDA graph replay because custom allreduce IPC handles from _MOE_TP are never registered.
Use ExitStack to register both _MOE_EP and _MOE_TP groups (when they differ from _TP) during graph capture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
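For context, a minimal sketch of the fix described in this commit message, not the actual sglang code: it assumes each communication group exposes a graph_capture() context manager that registers its custom-allreduce IPC buffers before CUDA graph capture, and the function name is illustrative.

```python
from contextlib import ExitStack

def capture_with_all_groups(tp_group, moe_tp_group, moe_ep_group):
    """Enter graph_capture() on the TP group and on any MoE group that differs
    from it, so allreduce IPC handles needed during graph replay are registered."""
    stack = ExitStack()
    stack.enter_context(tp_group.graph_capture())
    for group in (moe_ep_group, moe_tp_group):
        if group is not None and group is not tp_group:
            stack.enter_context(group.graph_capture())
    # Caller wraps graph capture in `with capture_with_all_groups(...):` so the
    # contexts are unwound once capture finishes.
    return stack
```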
Fridge003 added a commit that referenced this pull request, Apr 2, 2026
When ep_size > 1 and ep_size < tp_size, the _MOE_TP group is distinct from _TP. PR #18233 switched MoE allreduce to use _MOE_TP but forgot to register it in graph_capture(). This causes illegal memory access during CUDA graph replay because custom allreduce IPC handles from _MOE_TP are never registered.
Use ExitStack to register both _MOE_EP and _MOE_TP groups (when they differ from _TP) during graph capture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request, Apr 7, 2026
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request, Apr 22, 2026
Co-authored-by: Shunkang <182541032+Shunkangz@users.noreply.github.co>
Co-authored-by: Jiying Dong <87510204+dongjiyingdjy@users.noreply.github.com>
Motivation
Context parallelism is essential in long context LLM inference. It splits a long input sequence across multiple GPUs so attention can be computed in parallel, drastically reducing latency, which enables practical million-token context windows.
In this PR, we add support for the context-parallel form of Qwen3-MoE. With this update, context parallelism can now be enabled during the prefill phase under various parallel configurations.
For the attention layers, users can use CP, TP, or CP + TP.
For the MoE layers, users can use CP, TP, CP + TP, or CP + EP.
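To make the prefill context-parallel idea concrete, here is a minimal sketch of splitting a prompt into contiguous per-rank chunks. This is not sglang's actual sharding code; the function name and the equal-chunk assumption are illustrative.

```python
# Illustrative only: split a prompt into contiguous chunks, one per CP rank,
# so each rank computes attention for its own slice of the query tokens.
def split_for_cp(input_ids: list[int], cp_size: int) -> list[list[int]]:
    """Split into cp_size contiguous chunks; the last chunk may be shorter."""
    chunk = (len(input_ids) + cp_size - 1) // cp_size  # ceiling division
    return [input_ids[i * chunk:(i + 1) * chunk] for i in range(cp_size)]

# Example: a 10-token prompt with attn-cp-size=2 yields two 5-token shards.
assert split_for_cp(list(range(10)), cp_size=2) == [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
```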
Modifications
In this implementation, we allocate a full-sequence KV cache on each CP rank. This approach simplifies both KV cache management and reuse by replicating the KV cache across all CP ranks. Before performing the attention computation, we use an allgather operation to collect the KV cache from all ranks, and then apply the FlashAttention backend for the calculation.
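A rough sketch of the allgather-then-attend flow described above follows. This is not the sglang implementation; the function name, the [seq, heads, head_dim] tensor layout, and the assumption of equal-length KV shards per rank are illustrative.

```python
import torch
import torch.distributed as dist

def gather_kv_for_cp(k_local: torch.Tensor, v_local: torch.Tensor, cp_group):
    """All-gather locally computed K/V shards across CP ranks so every rank
    holds the full-sequence KV before running the attention backend."""
    cp_size = dist.get_world_size(group=cp_group)
    k_parts = [torch.empty_like(k_local) for _ in range(cp_size)]
    v_parts = [torch.empty_like(v_local) for _ in range(cp_size)]
    dist.all_gather(k_parts, k_local, group=cp_group)
    dist.all_gather(v_parts, v_local, group=cp_group)
    # Shards are contiguous slices of the prompt, so concatenating along the
    # sequence dimension (dim 0) reconstructs the full-sequence K and V.
    return torch.cat(k_parts, dim=0), torch.cat(v_parts, dim=0)

# The gathered K/V can then be passed to the FlashAttention backend together
# with each rank's local query shard.
```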
Accuracy Tests
Command
sglang serve --model-path /home/scratch.trt_llm_data/llm-models/Qwen3/Qwen3-30B-A3B-FP8/ --trust-remote-code --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 64}' --tp=4 --moe-dp-size=2 --ep-size=2 --attn-cp-size=2 --enable-prefill-context-parallel --cuda-graph-max-bs=32 --max-running-requests=32
Results
Accuracy: 0.785  Invalid: 0.000  Latency: 43.704 s  Output throughput: 1027.630 token/s
H200 Qwen3-235B

Benchmarking and Profiling
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci