
[NPU] ascend backend support qwen3 moe attention cp#21685

Merged
iforgetmyname merged 3 commits into sgl-project:main from AndyLi429:qwen3_pcp_0330
Apr 29, 2026

Conversation


@AndyLi429 AndyLi429 commented Mar 30, 2026

Motivation

Qwen3 MoE models on Ascend NPU already support Prefill Context Parallel (PCP) for the MLA attention path, but the standard (non-MLA) attention path lacked CP support. This PR completes CP support for standard attention in the Ascend backend, enabling Qwen3-30B-A3B to run correctly under co-located deployment (TP=4 / MOE_DP=2 / ATTN_CP=2), reducing peak HBM usage during long-sequence prefill and improving TTFT.

Modifications

  1. ascend_backend.py (core implementation)

    • Added _cp_allgather_and_save_kv_npu(): concatenates K and V along the feature dimension and performs a single all-gather instead of two separate ones, halving communication overhead. GQA is handled correctly even when tp_k_head_num != tp_v_head_num.
    • Added do_cp_attn_fia(): implements CP-aware attention using npu_fused_infer_attention_score (FIA). Splits Q into prev/next halves following the zigzag pattern, computes attention separately for each half, then concatenates the outputs.
    • Extended forward_extend() with a CP branch: when is_context_parallel_extend is detected, the flow goes through all-gather KV → CP attention. Non-FIA paths raise a clear NotImplementedError.
    • Reads attn_cp_size from ModelRunner and stores it on the backend instance.
  2. qwen2_moe.py (minor fix)

    • Replaced torch.cuda.current_stream() with get_current_device_stream_fast() to ensure device-agnostic stream handling on Ascend NPU.
  3. test_npu_qwen3_30b_attn_cp.py (new test)

    • Registers a nightly NPU CI test on 4 NPUs (nightly-4-npu-a3).
    • Validates GSM8K accuracy for Qwen3-30B-A3B under co-located TP=4 / MOE_DP=2 / ATTN_CP=2 deployment, requiring accuracy ≥ 0.92 (measured: 0.96).
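The merged K/V all-gather described above can be sketched as follows. This is a pure-Python stand-in, not the actual torch_npu implementation; `allgather` here is a hypothetical placeholder for the real collective, and the tensors are plain nested lists for illustration:

```python
def allgather_kv_merged(k_local, v_local, allgather):
    """Gather K and V across CP ranks with one collective instead of two.

    K and V rows are concatenated along the feature dimension, the merged
    buffer is all-gathered once, and the gathered result is split back.
    Because the split point is the local K width, this stays correct under
    GQA even when the K and V head dims differ.
    """
    k_dim = len(k_local[0])
    # concat K and V per token -> one buffer, one collective
    merged = [k_row + v_row for k_row, v_row in zip(k_local, v_local)]
    gathered = allgather(merged)  # one merged buffer per CP rank
    # split each rank's buffer back into K and V parts
    k_all = [[row[:k_dim] for row in buf] for buf in gathered]
    v_all = [[row[k_dim:] for row in buf] for buf in gathered]
    return k_all, v_all
```

The point of the merge is that collective launch latency is paid once rather than twice per layer; the split back into K and V is a cheap local view/slice.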
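The prev/next Q split in `do_cp_attn_fia` follows the zigzag pattern; a minimal sketch of the index math, assuming the standard layout of 2 * cp_size equal chunks (the exact chunking in ascend_backend.py may differ):

```python
def zigzag_q_halves(seq_len, cp_size, cp_rank):
    """Token ranges of the "prev" and "next" Q halves owned by one CP rank.

    The sequence is cut into 2 * cp_size equal chunks; rank r takes chunk r
    (the "prev" half, early tokens) and chunk 2*cp_size-1-r (the "next"
    half, late tokens).  Pairing an early chunk with a late one balances
    causal-attention FLOPs across ranks, which is why the two halves are
    attended separately and their outputs concatenated.
    """
    assert seq_len % (2 * cp_size) == 0, "seq_len must divide into 2*cp_size chunks"
    chunk = seq_len // (2 * cp_size)
    prev = (cp_rank * chunk, (cp_rank + 1) * chunk)
    mirror = 2 * cp_size - 1 - cp_rank
    nxt = (mirror * chunk, (mirror + 1) * chunk)
    return prev, nxt
```

For example, with seq_len=16 and cp_size=2, rank 0 owns tokens [0, 4) and [12, 16) while rank 1 owns [4, 8) and [8, 12), so every rank sees a comparable mix of short and long causal attention spans.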

Accuracy Tests

======================start test gsm8k =================
parallel= 16
Accuracy: 0.960
Invalid: 0.000
Latency: 36.943 s
Output throughput: 154.996 token/s
metrics={'accuracy': np.float64(0.96), 'invalid': np.float64(0.0), 'latency': 36.942977759987116, 'output_throughput': 154.99562696870154}
parallel=16
metrics['accuracy']=np.float64(0.96)
======================start test gsm8k =================
parallel= 32
Accuracy: 0.960
Invalid: 0.000
Latency: 28.165 s
Output throughput: 193.860 token/s
metrics={'accuracy': np.float64(0.96), 'invalid': np.float64(0.0), 'latency': 28.164654500083998, 'output_throughput': 193.8600027912189}
parallel=32
metrics['accuracy']=np.float64(0.96)

Speed Tests and Profiling

attn_cp=2


============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1.0       
Max request concurrency:                 16        
Successful requests:                     1         
Benchmark duration (s):                  75.20     
Total input tokens:                      256000    
Total input text tokens:                 256000    
Total generated tokens:                  128       
Total generated tokens (retokenized):    129       
Request throughput (req/s):              0.01      
Input token throughput (tok/s):          3404.31   
Output token throughput (tok/s):         1.70      
Peak output token throughput (tok/s):    6.00      
Peak concurrent requests:                1         
Total token throughput (tok/s):          3406.01   
Concurrency:                             1.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   75193.51  
Median E2E Latency (ms):                 75193.51  
P90 E2E Latency (ms):                    75193.51  
P99 E2E Latency (ms):                    75193.51  
---------------Time to First Token----------------
Mean TTFT (ms):                          49533.88  
Median TTFT (ms):                        49533.88  
P99 TTFT (ms):                           49533.88  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          202.04    
Median TPOT (ms):                        202.04    
P99 TPOT (ms):                           202.04    
---------------Inter-Token Latency----------------
Mean ITL (ms):                           202.04    
Median ITL (ms):                         175.49    
P95 ITL (ms):                            188.89    
P99 ITL (ms):                            200.62    
Max ITL (ms):                            3533.68   
==================================================

| Metric | attn_cp=1 | attn_cp=2 | Delta |
|---|---|---|---|
| Benchmark Duration (s) | 84.68 | 75.20 | -9.48 (-11%) |
| Input Token Throughput (tok/s) | 3023.25 | 3404.31 | +381 (+13%) |
| Output Token Throughput (tok/s) | 1.51 | 1.70 | +0.19 (+13%) |
| Total Token Throughput (tok/s) | 3024.76 | 3406.01 | +381 (+13%) |
| Mean E2E Latency (ms) | 84672.02 | 75193.51 | -9478 (-11%) |
| Mean TTFT (ms) | 58039.34 | 49533.88 | -8505 (-15%) |
| Mean TPOT (ms) | 209.71 | 202.04 | -7.67 (-4%) |
| Mean ITL (ms) | 209.71 | 202.04 | -7.67 (-4%) |
| Median ITL (ms) | 181.25 | 175.49 | -5.76 (-3%) |
| P99 ITL (ms) | 208.63 | 200.62 | -8.01 (-4%) |
| Max ITL (ms) | 3834.20 | 3533.68 | -300 (-8%) |

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements context parallel (CP) support for the Ascend NPU backend, introducing a merged K/V all-gather optimization to reduce communication latency and CP-aware attention using the FIA path. It also adds a new end-to-end GSM8K accuracy test for Qwen3-30B-A3B with PD disaggregation. Review feedback identifies a potential correctness bug where a CUDA stream is used instead of an NPU stream, suggests refactoring duplicated attention scoring logic into a helper method, and points out a redundant argument in the test configuration.

@AndyLi429
Contributor Author

@claude review
@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ecb778867a


@AndyLi429
Contributor Author

/tag-run-ci-label

@AndyLi429
Contributor Author

/run-ci npu

@AndyLi429
Contributor Author

/tag-and-rerun-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

@iforgetmyname please review the code and start CI

@sglang-npu-bot
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci and documentation labels Apr 3, 2026
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429 AndyLi429 force-pushed the qwen3_pcp_0330 branch 3 times, most recently from c6799ce to e8c837e Compare April 7, 2026 07:20
@AndyLi429
Contributor Author

/rerun-failed-ci

1 similar comment
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429 AndyLi429 changed the title from "ascend backend support qwen3 moe attention cp" to "[NPU] ascend backend support qwen3 moe attention cp" Apr 15, 2026
@AndyLi429 AndyLi429 force-pushed the qwen3_pcp_0330 branch 3 times, most recently from e2c3cbb to 4e306c7 Compare April 18, 2026 13:43
@AndyLi429
Contributor Author

/rerun-failed-ci

1 similar comment
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

3 similar comments
@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

@AndyLi429
Contributor Author

/rerun-failed-ci

@iforgetmyname iforgetmyname self-assigned this Apr 28, 2026
@iforgetmyname
Collaborator

/tag-and-rerun-ci

@iforgetmyname iforgetmyname merged commit 4c1eefc into sgl-project:main Apr 29, 2026
218 of 259 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026

Labels

documentation, npu, run-ci


3 participants