
[NPU] ascend backend support qwen3 moe attention cp#21685

Merged
iforgetmyname merged 3 commits into sgl-project:main from AndyLi429:qwen3_pcp_0330
Apr 29, 2026

Conversation


@AndyLi429 AndyLi429 commented Mar 30, 2026

Motivation

Qwen3 MoE models on Ascend NPU already support Prefill Context Parallel (PCP) for the MLA attention path, but the standard (non-MLA) attention path lacked CP support. This PR completes CP support for standard attention in the Ascend backend, enabling Qwen3-30B-A3B to run correctly under co-located deployment (TP=4 / MOE_DP=2 / ATTN_CP=2), reducing peak HBM usage during long-sequence prefill and improving TTFT.

Modifications

  1. ascend_backend.py (core implementation)

    • Added _cp_allgather_and_save_kv_npu(): concatenates K and V along the feature dimension and performs a single all-gather instead of two separate ones, halving communication overhead. GQA is handled correctly even when tp_k_head_num != tp_v_head_num.
    • Added do_cp_attn_fia(): implements CP-aware attention using npu_fused_infer_attention_score (FIA). Splits Q into prev/next halves following the zigzag pattern, computes attention separately for each half, then concatenates the outputs.
    • Extended forward_extend() with a CP branch: when is_context_parallel_extend is detected, the flow goes through all-gather KV → CP attention. Non-FIA paths raise a clear NotImplementedError.
    • Reads attn_cp_size from ModelRunner and stores it on the backend instance.
  2. qwen2_moe.py (minor fix)

    • Replaced torch.cuda.current_stream() with get_current_device_stream_fast() to ensure device-agnostic stream handling on Ascend NPU.
  3. test_npu_qwen3_30b_attn_cp.py (new test)

    • Registers a nightly NPU CI test on 4 NPUs (nightly-4-npu-a3).
    • Validates GSM8K accuracy for Qwen3-30B-A3B under co-located TP=4 / MOE_DP=2 / ATTN_CP=2 deployment, requiring accuracy ≥ 0.92 (measured: 0.96).
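The merged K/V all-gather described above can be sketched as follows. This is a pure-Python stand-in, not the actual torch_npu implementation; `allgather` here is a hypothetical placeholder for the real collective, and the tensors are plain nested lists for illustration:

```python
def allgather_kv_merged(k_local, v_local, allgather):
    """Gather K and V across CP ranks with one collective instead of two.

    K and V rows are concatenated along the feature dimension, the merged
    buffer is all-gathered once, and the gathered result is split back.
    Because the split point is the local K width, this stays correct under
    GQA even when the K and V head dims differ.
    """
    k_dim = len(k_local[0])
    # concat K and V per token -> one buffer, one collective
    merged = [k_row + v_row for k_row, v_row in zip(k_local, v_local)]
    gathered = allgather(merged)  # one merged buffer per CP rank
    # split each rank's buffer back into K and V parts
    k_all = [[row[:k_dim] for row in buf] for buf in gathered]
    v_all = [[row[k_dim:] for row in buf] for buf in gathered]
    return k_all, v_all
```

The point of the merge is that collective launch latency is paid once rather than twice per layer; the split back into K and V is a cheap local view/slice.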
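The prev/next Q split in `do_cp_attn_fia` follows the zigzag pattern; a minimal sketch of the index math, assuming the standard layout of 2 * cp_size equal chunks (the exact chunking in ascend_backend.py may differ):

```python
def zigzag_q_halves(seq_len, cp_size, cp_rank):
    """Token ranges of the "prev" and "next" Q halves owned by one CP rank.

    The sequence is cut into 2 * cp_size equal chunks; rank r takes chunk r
    (the "prev" half, early tokens) and chunk 2*cp_size-1-r (the "next"
    half, late tokens).  Pairing an early chunk with a late one balances
    causal-attention FLOPs across ranks, which is why the two halves are
    attended separately and their outputs concatenated.
    """
    assert seq_len % (2 * cp_size) == 0, "seq_len must divide into 2*cp_size chunks"
    chunk = seq_len // (2 * cp_size)
    prev = (cp_rank * chunk, (cp_rank + 1) * chunk)
    mirror = 2 * cp_size - 1 - cp_rank
    nxt = (mirror * chunk, (mirror + 1) * chunk)
    return prev, nxt
```

For example, with seq_len=16 and cp_size=2, rank 0 owns tokens [0, 4) and [12, 16) while rank 1 owns [4, 8) and [8, 12), so every rank sees a comparable mix of short and long causal attention spans.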

Accuracy Tests

======================start test gsm8k =================
parallel= 16
Accuracy: 0.960
Invalid: 0.000
Latency: 36.943 s
Output throughput: 154.996 token/s
metrics={'accuracy': np.float64(0.96), 'invalid': np.float64(0.0), 'latency': 36.942977759987116, 'output_throughput': 154.99562696870154}
parallel=16
metrics['accuracy']=np.float64(0.96)
======================start test gsm8k =================
parallel= 32
Accuracy: 0.960
Invalid: 0.000
Latency: 28.165 s
Output throughput: 193.860 token/s
metrics={'accuracy': np.float64(0.96), 'invalid': np.float64(0.0), 'latency': 28.164654500083998, 'output_throughput': 193.8600027912189}
parallel=32
metrics['accuracy']=np.float64(0.96)

Speed Tests and Profiling

attn_cp=2


============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1.0       
Max request concurrency:                 16        
Successful requests:                     1         
Benchmark duration (s):                  75.20     
Total input tokens:                      256000    
Total input text tokens:                 256000    
Total generated tokens:                  128       
Total generated tokens (retokenized):    129       
Request throughput (req/s):              0.01      
Input token throughput (tok/s):          3404.31   
Output token throughput (tok/s):         1.70      
Peak output token throughput (tok/s):    6.00      
Peak concurrent requests:                1         
Total token throughput (tok/s):          3406.01   
Concurrency:                             1.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75193.51  
Median E2E Latency (ms):                 75193.51  
P90 E2E Latency (ms):                    75193.51  
P99 E2E Latency (ms):                    75193.51  
---------------Time to First Token----------------
Mean TTFT (ms):                          49533.88  
Median TTFT (ms):                        49533.88  
P99 TTFT (ms):                           49533.88  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          202.04    
Median TPOT (ms):                        202.04    
P99 TPOT (ms):                           202.04    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           202.04    
Median ITL (ms):                         175.49    
P95 ITL (ms):                            188.89    
P99 ITL (ms):                            200.62    
Max ITL (ms):                            3533.68   
==================================================

| Metric | attn_cp=1 | attn_cp=2 | Delta |
|---|---|---|---|
| Benchmark Duration (s) | 84.68 | 75.20 | -9.48 (-11%) |
| Input Token Throughput (tok/s) | 3023.25 | 3404.31 | +381 (+13%) |
| Output Token Throughput (tok/s) | 1.51 | 1.70 | +0.19 (+13%) |
| Total Token Throughput (tok/s) | 3024.76 | 3406.01 | +381 (+13%) |
| Mean E2E Latency (ms) | 84672.02 | 75193.51 | -9478 (-11%) |
| Mean TTFT (ms) | 58039.34 | 49533.88 | -8505 (-15%) |
| Mean TPOT (ms) | 209.71 | 202.04 | -7.67 (-4%) |
| Mean ITL (ms) | 209.71 | 202.04 | -7.67 (-4%) |
| Median ITL (ms) | 181.25 | 175.49 | -5.76 (-3%) |
| P99 ITL (ms) | 208.63 | 200.62 | -8.01 (-4%) |
| Max ITL (ms) | 3834.20 | 3533.68 | -300 (-8%) |

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements context parallel (CP) support for the Ascend NPU backend, introducing a merged K/V all-gather optimization to reduce communication latency and CP-aware attention using the FIA path. It also adds a new end-to-end GSM8K accuracy test for Qwen3-30B-A3B with PD disaggregation. Review feedback identifies a potential correctness bug where a CUDA stream is used instead of an NPU stream, suggests refactoring duplicated attention scoring logic into a helper method, and points out a redundant argument in the test configuration.

@AndyLi429
Contributor Author

@claude review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecb778867a


@AndyLi429
Contributor Author

/tag-run-ci-label

@AndyLi429
Contributor Author

/run-ci npu

@AndyLi429
Contributor Author

/tag-and-rerun-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

@iforgetmyname please review the code and start CI

@sglang-npu-bot
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci and documentation labels Apr 3, 2026
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429 AndyLi429 force-pushed the qwen3_pcp_0330 branch 3 times, most recently from c6799ce to e8c837e Compare April 7, 2026 07:20
@AndyLi429
Contributor Author

/rerun-failed-ci

1 similar comment
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429 AndyLi429 changed the title from "ascend backend support qwen3 moe attention cp" to "[NPU] ascend backend support qwen3 moe attention cp" Apr 15, 2026
@AndyLi429 AndyLi429 force-pushed the qwen3_pcp_0330 branch 3 times, most recently from e2c3cbb to 4e306c7 Compare April 18, 2026 13:43
@AndyLi429
Contributor Author

/rerun-failed-ci

1 similar comment
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

3 similar comments
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

@iforgetmyname iforgetmyname self-assigned this Apr 28, 2026
@iforgetmyname
Collaborator

/tag-and-rerun-ci

@iforgetmyname iforgetmyname merged commit 4c1eefc into sgl-project:main Apr 29, 2026
218 of 259 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

Labels

documentation, npu, run-ci


3 participants