[NPU] ascend backend support qwen3 moe attention cp #21685
iforgetmyname merged 3 commits into sgl-project:main
Conversation
Code Review
This pull request implements context parallel (CP) support for the Ascend NPU backend, introducing a merged K/V all-gather optimization to reduce communication latency and CP-aware attention using the FIA path. It also adds a new end-to-end GSM8K accuracy test for Qwen3-30B-A3B with PD disaggregation. Review feedback identifies a potential correctness bug where a CUDA stream is used instead of an NPU stream, suggests refactoring duplicated attention scoring logic into a helper method, and points out a redundant argument in the test configuration.
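The merged K/V all-gather described above can be illustrated with a minimal single-process sketch. Plain Python lists stand in for NPU tensors and a trivial helper stands in for the real collective, so every name here is illustrative rather than the PR's actual API; the point is only that concatenating K and V along the feature dimension lets one collective replace two.

```python
def simulated_all_gather(shards):
    # Stand-in for a real all-gather collective: in a single process,
    # every rank's shard is already visible, so we just copy the list.
    return list(shards)

def merged_kv_all_gather(k_shards, v_shards):
    """Gather K and V across CP ranks with one collective instead of two.

    k_shards[r] / v_shards[r] are rank r's local K / V (feature vectors here).
    K and V may have different widths, as when tp_k_head_num != tp_v_head_num.
    """
    # Concatenate K and V along the feature dimension on each rank...
    merged = [k + v for k, v in zip(k_shards, v_shards)]
    # ...issue a single all-gather on the merged buffer...
    gathered = simulated_all_gather(merged)
    # ...then split every gathered buffer back into its K and V parts.
    k_len = len(k_shards[0])
    all_k = [kv[:k_len] for kv in gathered]
    all_v = [kv[k_len:] for kv in gathered]
    return all_k, all_v

if __name__ == "__main__":
    # Two CP ranks; K is 2-wide and V is 1-wide to mimic differing head counts.
    k_shards = [[1.0, 2.0], [5.0, 6.0]]
    v_shards = [[3.0], [7.0]]
    all_k, all_v = merged_kv_all_gather(k_shards, v_shards)
    print(all_k, all_v)
```

In the real backend the collective would be a `torch.distributed.all_gather`-style call on device tensors, and the win is issuing one communication launch instead of two per layer.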
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ecb778867a
/tag-run-ci-label
/run-ci npu
/tag-and-rerun-ci
/rerun-failed-ci
@iforgetmyname please review the code and start CI
Motivation
Qwen3 MoE models on Ascend NPU already support Prefill Context Parallel (PCP) for the MLA attention path, but the standard (non-MLA) attention path lacked CP support. This PR completes CP support for standard attention in the Ascend backend, enabling Qwen3-30B-A3B to run correctly under co-located deployment (TP=4 / MOE_DP=2 / ATTN_CP=2), reducing peak HBM usage during long-sequence prefill and improving TTFT.
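For illustration, here is a minimal pure-Python sketch of one common zigzag sequence split used in context parallelism; the PR's exact chunk mapping may differ. With cp_size ranks the sequence is cut into 2*cp_size chunks, and rank r keeps chunk r (the "prev" half) plus chunk 2*cp_size-1-r (the "next" half), which balances causal-attention work across ranks:

```python
def zigzag_shard(tokens, cp_size, rank):
    """Return the (prev_half, next_half) token chunks owned by `rank`.

    Zigzag assignment: split the sequence into 2*cp_size equal chunks;
    rank r owns chunk r and chunk 2*cp_size - 1 - r, so early (cheap) and
    late (expensive) causal-attention positions are paired on each rank.
    """
    n = len(tokens)
    assert n % (2 * cp_size) == 0, "sequence must divide into 2*cp_size chunks"
    chunk = n // (2 * cp_size)
    chunks = [tokens[i * chunk:(i + 1) * chunk] for i in range(2 * cp_size)]
    prev_half = chunks[rank]
    next_half = chunks[2 * cp_size - 1 - rank]
    return prev_half, next_half

if __name__ == "__main__":
    toks = list(range(8))            # 8 positions, cp_size = 2
    print(zigzag_shard(toks, 2, 0))  # rank 0 owns chunks 0 and 3
    print(zigzag_shard(toks, 2, 1))  # rank 1 owns chunks 1 and 2
```

Each rank then holds only its two Q chunks locally, which is what lets the CP attention path compute the prev/next halves separately before concatenating the outputs.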
Modifications
- ascend_backend.py (core implementation)
  - _cp_allgather_and_save_kv_npu(): concatenates K and V along the feature dimension and performs a single all-gather instead of two separate ones, halving communication overhead. GQA is handled correctly even when tp_k_head_num != tp_v_head_num.
  - do_cp_attn_fia(): implements CP-aware attention using npu_fused_infer_attention_score (FIA). Splits Q into prev/next halves following the zigzag pattern, computes attention separately for each half, then concatenates the outputs.
  - Extends forward_extend() with a CP branch: when is_context_parallel_extend is detected, the flow goes through all-gather KV → CP attention. Non-FIA paths raise a clear NotImplementedError.
  - Reads attn_cp_size from ModelRunner and stores it on the backend instance.
- qwen2_moe.py (minor fix)
  - Replaces torch.cuda.current_stream() with get_current_device_stream_fast() to ensure device-agnostic stream handling on Ascend NPU.
- test_npu_qwen3_30b_attn_cp.py (new test)
  - Adds an end-to-end GSM8K accuracy test for Qwen3-30B-A3B with PD disaggregation (runner: nightly-4-npu-a3).

Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci