
[npu] support features of qwen3_next; fixup accuracy bugs in qwen3_next#14391

Open
zhuyijie88 wants to merge 1 commit into sgl-project:main from zhuyijie88:main_20251204_qwen3_next

Conversation

Contributor

@zhuyijie88 commented Dec 4, 2025

Motivation

  • NPU support for Qwen3_next features (PD disaggregation, MTP, NPU graph, fused operators, W8A8 quantization);
  • fix accuracy bugs in Qwen3_next on NPU.
    This PR needs to run together with the change in [fix] fixup bug in conv1d_update_fn sgl-kernel-npu#259, and requires the latest CANN/PTA packages, which include our new torch_npu kernels.

Modifications

This pull request advances NPU (Ascend) compatibility and performance for the Qwen3_next model in SGLang. It integrates NPU-specific optimizations and features, such as disaggregation and fused operators, and fixes known accuracy problems. The changes span attention mechanisms, quantization, and general device management, making the model more robust and efficient on Ascend hardware.

Highlights

  • NPU Feature Support: Introduced comprehensive NPU (Ascend) support for Qwen3_next model features, including disaggregation, multi-token prediction (MTP), NPU graph integration, fused operators, and W8A8 quantization.
  • Accuracy Bug Fixes: Addressed and resolved accuracy issues identified in the Qwen3_next model when running on NPU hardware.
  • Attention Backend Enhancements: Implemented a new AscendGDNAttnBackend for Mamba kernel attention on NPU, refining the handling of sequence lengths and cache locations for improved performance and correctness.
  • Quantization Improvements: Updated the W8A8 quantization logic to better support dynamic quantization on NPU and introduced a new utility script (convert_model_qwen3_next.py) for INT8 quantization of Qwen3_next models.
  • Device Agnostic Operations: Refactored cache management and memory pool initialization to be device-agnostic, moving away from hardcoded 'cuda' references to torch.get_device_module() (see the sketch after this list).
  • New AscendC Fused Operator Integration: Added integration of the AscendC fused operator torch_npu.npu_recurrent_gated_delta_rule, which can be enabled by setting the ENABLE_ASCENDC_FUSION_GDN environment variable.
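
For the device-agnostic refactor above, a minimal sketch of the pattern, assuming a PyTorch build that provides torch.get_device_module; the PR's actual call sites may differ:

import torch

def clear_and_sync(device: torch.device) -> None:
    # Resolve the accelerator module (torch.cuda, torch.npu, ...) for the
    # given device instead of hardcoding torch.cuda.
    device_module = torch.get_device_module(device)
    device_module.empty_cache()
    device_module.synchronize()

# clear_and_sync(torch.device("npu")) on Ascend plays the role that the old
# torch.cuda.empty_cache(); torch.cuda.synchronize() pair did on GPU.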

Accuracy Tests

export HCCL_OP_EXPANSION_MODE=AIV
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export ASCEND_USE_FIA=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export ENABLE_ASCENDC_FUSION_GDN=1
export MODEL_PATH=Qwen3-Next-80B-A3B-Instruct-A8W8

python -m sglang.launch_server --model-path ${MODEL_PATH} --trust-remote-code \
    --host 0.0.0.0 --port 8000 --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu --quantization modelslim \
    --max-running-requests 16 \
    --context-length 102400 --chunked-prefill-size 71680 --max-prefill-tokens 102400 \
    --tp-size 8 \
    --mem-fraction-static 0.7 --max-total-tokens 1126400 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --disable-overlap-schedule  --disable-radix-cache \
    --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
python few_shot_gsm8k.py --data-path test.jsonl.txt --parallel 16 --num-question 200 --num-shots 5 --port 8000 --temperature 0
# Accuracy: 0.945
# Invalid: 0.000
# Latency: 389.107s
# Output throughput: 83.327 token/s

Benchmarking and Profiling

With the AscendC fused operator enabled, performance was measured as follows, using this startup command:

python -m sglang.launch_server --model-path ${MODEL_PATH} --trust-remote-code \
    --host 0.0.0.0 --port 8000 --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --max-running-requests 96 --context-length 8192  --disable-radix-cache \
    --chunked-prefill-size 327680 --max-prefill-tokens 4000 \
    --tp-size 16 --enable-dp-attention --dp-size 4 --enable-dp-lm-head --mem-fraction-static 0.7 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 24 \
    --max-total-tokens 229504
python -m sglang.bench_serving --base-url http://7.150.8.78:8000 --dataset-path /data/h00910141/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name=random --random-range-ratio 1 --random-input 3500 --random-output 1500 --max-concurrency 24 --num-prompts 24

baseline tpot: 53.05ms, ttft: 6500ms

(screenshot: baseline benchmark output)

pr tpot: 44.30ms, ttft: 6363ms

(screenshot: PR benchmark output)


@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@zhuyijie88
Contributor Author

@gemini-code-assist review

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for qwen3_next features on Ascend NPU and fixes some accuracy bugs. The changes are extensive, touching attention backends, memory management, model definitions, and quantization logic. Overall, the changes seem to be moving in the right direction for NPU enablement. I've identified a few areas for improvement regarding code duplication, performance, potential correctness issues, and hardcoded values. Addressing these points will enhance the maintainability, performance, and robustness of the implementation.

Comment on lines +1391 to +1396
if conv_state_update is not None:
    for i in range(seq_len):
        end = i - seq_len + 1
        start = end - state_len
        slice_range = slice(start, end if end != 0 else None)
        conv_state_update[:, i] = hidden_states_new[:, :, slice_range]
Contributor


Severity: high

The for loop over seq_len here can be a performance bottleneck, especially when this function is called from forward_extend during the verification phase of speculative decoding, where seq_len can be greater than 1. Consider vectorizing this operation or using a custom NPU kernel to improve performance.
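
A vectorized sketch of that loop, assuming (as the indexing implies) hidden_states_new is laid out [batch, dim, time] and conv_state_update is [batch, seq_len, dim, state_len]; this is an illustration, not code from the PR:

import torch

def update_conv_state_vectorized(conv_state_update: torch.Tensor,
                                 hidden_states_new: torch.Tensor,
                                 seq_len: int, state_len: int) -> None:
    # Step i of the loop copied the width-state_len window ending
    # (seq_len - 1 - i) positions before the end of the time axis, so the
    # copied slices are exactly the last seq_len sliding windows.
    windows = hidden_states_new.unfold(2, state_len, 1)  # [B, D, T-state_len+1, state_len]
    conv_state_update.copy_(windows[:, :, -seq_len:].permute(0, 2, 1, 3))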

Comment thread: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Comment thread: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Comment thread: python/sglang/srt/layers/attention/fla/fused_gdn_gating.py (outdated)
@zhuyijie88
Contributor Author

@gemini-code-assist summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request advances NPU (Ascend) compatibility and performance for the Qwen3_next model in SGLang. It integrates NPU-specific optimizations and features, such as disaggregation and fused operators, and fixes known accuracy problems. The changes span attention mechanisms, quantization, and general device management, making the model more robust and efficient on Ascend hardware.

Highlights

  • NPU Feature Support: Introduced comprehensive NPU (Ascend) support for Qwen3_next model features, including disaggregation, multi-token prediction (MTP), NPU graph integration, fused operators, and W8A8 quantization.
  • Accuracy Bug Fixes: Addressed and resolved accuracy issues identified in the Qwen3_next model when running on NPU hardware.
  • Attention Backend Enhancements: Implemented a new AscendGDNAttnBackend for Mamba kernel attention on NPU, refining the handling of sequence lengths and cache locations for improved performance and correctness.
  • Quantization Improvements: Updated the W8A8 quantization logic to better support dynamic quantization on NPU and introduced a new utility script (convert_model_qwen3_next.py) for INT8 quantization of Qwen3_next models.
  • Device Agnostic Operations: Refactored cache management and memory pool initialization to be device-agnostic, moving away from hardcoded 'cuda' references to torch.get_device_module().
Changelog
  • python/sglang/srt/disaggregation/ascend/conn.py
    • Removed the register_buffer_to_engine method.
  • python/sglang/srt/layers/attention/ascend_backend.py
    • Added NPU-specific imports for FLA and Mamba kernels.
    • Extended ForwardMetadata with actual_seq_lengths and actual_seq_lengths_kv.
    • Refined init_forward_metadata to populate new sequence length metadata for verify and draft extend modes.
    • Ensured forward_batch.out_cache_loc is cast to torch.int32 in forward_mtp.
    • Refactored forward_mtp to conditionally use MLA or general NPU fused attention score, incorporating new sequence length parameters.
    • Modified forward_decode_graph to include self.use_fia check for attention tensor parallelism.
    • Introduced AscendGDNAttnBackend for Mamba kernel attention, including NPU-specific causal_conv1d_update and fused_recurrent_gated_delta_rule_update methods.
  • python/sglang/srt/layers/attention/attention_registry.py
    • Imported and conditionally registered AscendGDNAttnBackend for hybrid GDN models on NPU.
  • python/sglang/srt/layers/attention/fla/fused_gdn_gating.py
    • Added fused_gdn_gating_kernel_v3 and fused_gdn_gating_v3 for NPU-optimized GDN gating (a reference sketch of this gating appears after the changelog).
  • python/sglang/srt/layers/attention/fla/layernorm_gated.py
    • Renamed rms_norm_ref to rms_norm and updated its return signature.
    • Refactored _layer_norm_fwd to directly utilize the rms_norm function.
  • python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
    • Added logging import and conditionally imported NPU-specific kernels.
    • Aliased fused_gdn_gating to fused_gdn_gating_v3 for NPU.
    • Adjusted _forward_metadata to use is_extend(True).
    • Added placeholder methods for get_verify_buffers_to_fill_after_draft and update_verify_buffers_to_fill_after_draft.
  • python/sglang/srt/layers/layernorm.py
    • Modified forward_npu to use npu_add_rms_norm for residual connections and npu_gemma_rms_norm otherwise, with a calculated gamma.
  • python/sglang/srt/layers/linear.py
    • Added an assertion to ensure int8 dtype consistency during weight loading.
  • python/sglang/srt/layers/quantization/compressed_tensors/utils.py
    • Included fused_mapping['model'] in should_ignore_layer logic.
    • Modified check_equal_or_regex_match to pass check_contains=True for regex matching.
  • python/sglang/srt/layers/quantization/w8a8_int8.py
    • Added an early return for dynamic quantization within the NPU initialization block.
    • Updated _get_quant_method_npu to check for 'compressed-tensors' to determine dynamic quantization.
    • Improved is_layer_skipped to use .get() for more robust quant_description lookups.
  • python/sglang/srt/mem_cache/memory_pool.py
    • Changed hardcoded device='cuda' to device=device for temporal_state and conv_state initialization.
    • Adjusted logging format for memory sizes to include more precision.
  • python/sglang/srt/model_executor/model_runner.py
    • Updated init_memory_pool to conditionally initialize token_to_kv_pool for Ascend only when mambaish_config is None.
  • python/sglang/srt/model_executor/npu_graph_runner.py
    • Imported get_bool_env_var and introduced self.use_fia based on an environment variable.
    • Modified _get_update_attr_name and _get_update_attr_type to incorporate self.use_fia for conditional logic.
  • python/sglang/srt/models/qwen3_next.py
    • Passed quant_config to Linear layer initialization for ba_proj.
    • Extended DP-Attn padding condition to include NPU.
    • Replaced torch.cuda.empty_cache() and torch.cuda.synchronize() with device-agnostic torch.get_device_module() calls.
  • python/sglang/srt/speculative/eagle_worker_v2.py
    • Added logic to update Mamba state after MTP verification for hybrid GDN models, including calculation of max_relative_indices_per_req.
  • python/sglang/srt/utils/convert_model_qwen3_next.py
    • Added a new utility script for quantizing Qwen3_next models to INT8 (A8W8) for NPU, including functions for weight quantization, ignore list generation, and configuration management.
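
For context on what the fused gating computes, a reference-PyTorch sketch of the gated-delta-rule decay gate, based on the common GDN formulation; the v3 kernel's exact math, dtypes, and tiling may differ:

import torch
import torch.nn.functional as F

def gdn_gating_ref(A_log: torch.Tensor, a: torch.Tensor,
                   dt_bias: torch.Tensor) -> torch.Tensor:
    # Log-space decay gate g = -exp(A_log) * softplus(a + dt_bias), which
    # the fused Triton/AscendC kernels evaluate in a single pass.
    return -torch.exp(A_log.float()) * F.softplus(a.float() + dt_bias.float())
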
Activity
  • Comment 1 from @gemini-code-assist[bot]: A warning about reaching the daily quota limit.
  • Comment 2 from @zhuyijie88: Requested a review from @gemini-code-assist.
  • Comment 3 from @zhuyijie88: Requested a summary from @gemini-code-assist.
  • Review Comment (2592066247) from @gemini-code-assist[bot]: High priority - The for loop in torch_causal_conv1d_update_npu can be a performance bottleneck, suggesting vectorization or a custom NPU kernel.
  • Review Comment (2592066252) from @gemini-code-assist[bot]: High priority - Potential bug in AscendGDNAttnBackend.forward_extend regarding the last_recurrent_state transpose in the else block, suggesting it might be needed for correctness.
  • Review Comment (2592066257) from @gemini-code-assist[bot]: Medium priority - Code duplication in AscendAttnBackend.forward_mtp for calculating actual_seq_lengths_kv and actual_seq_lengths, recommending refactoring to use pre-computed values.
  • Review Comment (2592066260) from @gemini-code-assist[bot]: Medium priority - Hardcoded num_cores in fused_gdn_gating_v3 can lead to suboptimal performance or errors, suggesting dynamic querying of device properties (see the sketch after this list).
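
On the num_cores point, a hedged sketch of dynamic querying; the property attribute name here is an assumption (torch_npu's device-properties fields vary by version), hence the fallback:

import torch
import torch_npu  # noqa: F401 -- registers the torch.npu namespace

def npu_num_cores(default: int = 40) -> int:
    # Attribute name is assumed, not guaranteed; fall back to a fixed
    # count if the installed torch_npu build does not expose it.
    props = torch.npu.get_device_properties(torch.npu.current_device())
    return getattr(props, "multi_processor_count", default)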

@StepyHan

StepyHan commented Dec 9, 2025

Adding to the PR description:

  1. New AscendC Fused Operator Integration: Added integration of the AscendC fused operator torch_npu.npu_recurrent_gated_delta_rule, which can be enabled with export ENABLE_ASCENDC_FUSION_GDN="true" (a parsing sketch follows).
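
A minimal sketch of that gate, assuming get_bool_env_var-style truthy parsing (SGLang's helper may accept other spellings):

import os

def ascendc_gdn_enabled() -> bool:
    # Treat "1"/"true" (case-insensitive) as on; anything else as off.
    return os.environ.get("ENABLE_ASCENDC_FUSION_GDN", "false").lower() in ("1", "true")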

Performance Comparison:
With the AscendC fused operator enabled, performance was measured as follows, using this startup command:

python -m sglang.launch_server --model-path ${MODEL_PATH} --trust-remote-code \
    --host 141.61.39.231 --port 8000 --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --max-running-requests 96 --context-length 8192  --disable-radix-cache \
    --chunked-prefill-size 327680 --max-prefill-tokens 4000 \
    --tp-size 16 --enable-dp-attention --dp-size 4 --enable-dp-lm-head --mem-fraction-static 0.7 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 24 \
    --max-total-tokens 229504
python -m sglang.bench_serving --base-url http://7.150.8.78:8000 --dataset-path ${DATASET_PATH}/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name=random --random-range-ratio 1 --random-input 3500 --random-output 1500 --max-concurrency 24 --num-prompts 24

baseline tpot: 53.05ms, ttft: 6500ms
(screenshot: baseline benchmark output)

pr tpot: 44.30ms, ttft: 6363ms
(screenshot: PR benchmark output)

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from aa9dfc2 to 61012fd on December 11, 2025 03:15
@zhuyijie88
Contributor Author

@gemini-code-assist review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from 4db529d to 441e237 on December 15, 2025 11:02
Comment thread: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py (outdated)
@plusls

plusls commented Jan 16, 2026

any progress?

@gbdjxgp

gbdjxgp commented Jan 19, 2026

any progress? @zhuyijie88 @ping1jing2

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from fb6de01 to 52f3068 on March 9, 2026 01:38
@zhuyijie88 requested a review from Qiaolin-Yu as a code owner on March 9, 2026 01:38
@zhuyijie88 force-pushed the main_20251204_qwen3_next branch 2 times, most recently from 8bee0a9 to 69b14f0 on March 9, 2026 02:25
@zhuyijie88
Contributor Author

any progress? @zhuyijie88 @ping1jing2
Work on this will continue shortly.

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from 69b14f0 to 2dc9568 on March 9, 2026 02:33
…, fused operator, w8a8 quantization); fixup accuracy bugs in qwen3_next

Co-authored-by: zhuyijie88 <762412795@qq.com>
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
Co-authored-by: StepyHan <936072483@qq.com>
@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from 2dc9568 to de06d64 on March 10, 2026 02:12