
[npu] support features of qwen3_next; fixup accuracy bugs in qwen3_next#14391

Open
zhuyijie88 wants to merge 1 commit into sgl-project:main from zhuyijie88:main_20251204_qwen3_next

Conversation

Contributor

@zhuyijie88 commented Dec 4, 2025

Motivation

  • NPU support for Qwen3_next features (PD disaggregation, MTP, NPU graph, fused operators, W8A8 quantization);
  • fix accuracy bugs in Qwen3_next on NPU.
    This PR needs to run together with the change in [fix] fixup bug in conv1d_update_fn sgl-kernel-npu#259, and requires the latest CANN/PTA packages, which include our new torch_npu kernels.

Modifications

This pull request advances NPU (Ascend) compatibility and performance for the Qwen3_next model in SGLang. It integrates NPU-specific optimizations and features, such as disaggregation and fused operators, and fixes known accuracy problems. The changes span attention mechanisms, quantization, and general device management, making the model more robust and efficient on Ascend hardware.

Highlights

  • NPU Feature Support: Introduced comprehensive NPU (Ascend) support for Qwen3_next model features, including disaggregation, multi-token prediction (MTP), NPU graph integration, fused operators, and W8A8 quantization.
  • Accuracy Bug Fixes: Addressed and resolved accuracy issues identified in the Qwen3_next model when running on NPU hardware.
  • Attention Backend Enhancements: Implemented a new AscendGDNAttnBackend for Mamba kernel attention on NPU, refining the handling of sequence lengths and cache locations for improved performance and correctness.
  • Quantization Improvements: Updated the W8A8 quantization logic to better support dynamic quantization on NPU and introduced a new utility script (convert_model_qwen3_next.py) for INT8 quantization of Qwen3_next models.
  • Device Agnostic Operations: Refactored cache management and memory pool initialization to be device-agnostic, moving away from hardcoded 'cuda' references to torch.get_device_module() (see the sketch after this list).
  • New AscendC Fused Operator Integration: Added integration of the AscendC fused operator torch_npu.npu_recurrent_gated_delta_rule, which can be enabled by setting the ENABLE_ASCENDC_FUSION_GDN environment variable.
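
For the device-agnostic refactor above, a minimal sketch of the pattern, assuming a PyTorch build that provides torch.get_device_module; the PR's actual call sites may differ:

import torch

def clear_and_sync(device: torch.device) -> None:
    # Resolve the accelerator module (torch.cuda, torch.npu, ...) for the
    # given device instead of hardcoding torch.cuda.
    device_module = torch.get_device_module(device)
    device_module.empty_cache()
    device_module.synchronize()

# clear_and_sync(torch.device("npu")) on Ascend plays the role that the old
# torch.cuda.empty_cache(); torch.cuda.synchronize() pair did on GPU.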

Accuracy Tests

export HCCL_OP_EXPANSION_MODE=AIV
export DEEP_NORMAL_MODE_USE_INT8_QUANT=1
export ASCEND_USE_FIA=1
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1
export ENABLE_ASCENDC_FUSION_GDN=1
export MODEL_PATH=Qwen3-Next-80B-A3B-Instruct-A8W8

python -m sglang.launch_server --model-path ${MODEL_PATH} --trust-remote-code \
    --host 0.0.0.0 --port 8000 --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu --quantization modelslim \
    --max-running-requests 16 \
    --context-length 102400 --chunked-prefill-size 71680 --max-prefill-tokens 102400 \
    --tp-size 8 \
    --mem-fraction-static 0.7 --max-total-tokens 1126400 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --disable-overlap-schedule  --disable-radix-cache \
    --speculative-algorithm NEXTN --speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
python few_shot_gsm8k.py --data-path test.jsonl.txt --parallel 16 --num-question 200 --num-shots 5 --port 8000 --temperature 0
# Accuracy: 0.945
# Invalid: 0.000
# Latency: 389.107s
# Output throughput: 83.327 token/s

Benchmarking and Profiling

With the AscendC fused operator enabled, performance was measured as follows, using this startup command:

python -m sglang.launch_server --model-path ${MODEL_PATH} --trust-remote-code \
    --host 0.0.0.0 --port 8000 --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --max-running-requests 96 --context-length 8192  --disable-radix-cache \
    --chunked-prefill-size 327680 --max-prefill-tokens 4000 \
    --tp-size 16 --enable-dp-attention --dp-size 4 --enable-dp-lm-head --mem-fraction-static 0.7 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 24 \
    --max-total-tokens 229504
python -m sglang.bench_serving --base-url http://7.150.8.78:8000 --dataset-path /data/h00910141/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name=random --random-range-ratio 1 --random-input 3500 --random-output 1500 --max-concurrency 24 --num-prompts 24

baseline tpot: 53.05ms, ttft: 6500ms

(screenshot: baseline benchmark output)

pr tpot: 44.30ms, ttft: 6363ms

(screenshot: PR benchmark output)


@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@zhuyijie88
Contributor Author

@gemini-code-assist review

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request adds support for qwen3_next features on Ascend NPU and fixes some accuracy bugs. The changes are extensive, touching attention backends, memory management, model definitions, and quantization logic. Overall, the changes seem to be moving in the right direction for NPU enablement. I've identified a few areas for improvement regarding code duplication, performance, potential correctness issues, and hardcoded values. Addressing these points will enhance the maintainability, performance, and robustness of the implementation.

Comment on lines +1391 to +1396
if conv_state_update is not None:
    for i in range(seq_len):
        end = i - seq_len + 1
        start = end - state_len
        slice_range = slice(start, end if end != 0 else None)
        conv_state_update[:, i] = hidden_states_new[:, :, slice_range]
Contributor


Severity: high

The for loop over seq_len here can be a performance bottleneck, especially when this function is called from forward_extend during the verification phase of speculative decoding, where seq_len can be greater than 1. Consider vectorizing this operation or using a custom NPU kernel to improve performance.
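
A vectorized sketch of that loop, assuming (as the indexing implies) hidden_states_new is laid out [batch, dim, time] and conv_state_update is [batch, seq_len, dim, state_len]; this is an illustration, not code from the PR:

import torch

def update_conv_state_vectorized(conv_state_update: torch.Tensor,
                                 hidden_states_new: torch.Tensor,
                                 seq_len: int, state_len: int) -> None:
    # Step i of the loop copied the width-state_len window ending
    # (seq_len - 1 - i) positions before the end of the time axis, so the
    # copied slices are exactly the last seq_len sliding windows.
    windows = hidden_states_new.unfold(2, state_len, 1)  # [B, D, T-state_len+1, state_len]
    conv_state_update.copy_(windows[:, :, -seq_len:].permute(0, 2, 1, 3))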

Comment thread: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Comment thread: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Comment thread: python/sglang/srt/layers/attention/fla/fused_gdn_gating.py (outdated)
@zhuyijie88
Contributor Author

@gemini-code-assist summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request advances NPU (Ascend) compatibility and performance for the Qwen3_next model in SGLang. It integrates NPU-specific optimizations and features, such as disaggregation and fused operators, and fixes known accuracy problems. The changes span attention mechanisms, quantization, and general device management, making the model more robust and efficient on Ascend hardware.

Highlights

  • NPU Feature Support: Introduced comprehensive NPU (Ascend) support for Qwen3_next model features, including disaggregation, multi-token prediction (MTP), NPU graph integration, fused operators, and W8A8 quantization.
  • Accuracy Bug Fixes: Addressed and resolved accuracy issues identified in the Qwen3_next model when running on NPU hardware.
  • Attention Backend Enhancements: Implemented a new AscendGDNAttnBackend for Mamba kernel attention on NPU, refining the handling of sequence lengths and cache locations for improved performance and correctness.
  • Quantization Improvements: Updated the W8A8 quantization logic to better support dynamic quantization on NPU and introduced a new utility script (convert_model_qwen3_next.py) for INT8 quantization of Qwen3_next models.
  • Device Agnostic Operations: Refactored cache management and memory pool initialization to be device-agnostic, moving away from hardcoded 'cuda' references to torch.get_device_module().
Changelog
  • python/sglang/srt/disaggregation/ascend/conn.py
    • Removed the register_buffer_to_engine method.
  • python/sglang/srt/layers/attention/ascend_backend.py
    • Added NPU-specific imports for FLA and Mamba kernels.
    • Extended ForwardMetadata with actual_seq_lengths and actual_seq_lengths_kv.
    • Refined init_forward_metadata to populate new sequence length metadata for verify and draft extend modes.
    • Ensured forward_batch.out_cache_loc is cast to torch.int32 in forward_mtp.
    • Refactored forward_mtp to conditionally use MLA or general NPU fused attention score, incorporating new sequence length parameters.
    • Modified forward_decode_graph to include self.use_fia check for attention tensor parallelism.
    • Introduced AscendGDNAttnBackend for Mamba kernel attention, including NPU-specific causal_conv1d_update and fused_recurrent_gated_delta_rule_update methods.
  • python/sglang/srt/layers/attention/attention_registry.py
    • Imported and conditionally registered AscendGDNAttnBackend for hybrid GDN models on NPU.
  • python/sglang/srt/layers/attention/fla/fused_gdn_gating.py
    • Added fused_gdn_gating_kernel_v3 and fused_gdn_gating_v3 for NPU-optimized GDN gating (a reference sketch of this gating appears after the changelog).
  • python/sglang/srt/layers/attention/fla/layernorm_gated.py
    • Renamed rms_norm_ref to rms_norm and updated its return signature.
    • Refactored _layer_norm_fwd to directly utilize the rms_norm function.
  • python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
    • Added logging import and conditionally imported NPU-specific kernels.
    • Aliased fused_gdn_gating to fused_gdn_gating_v3 for NPU.
    • Adjusted _forward_metadata to use is_extend(True).
    • Added placeholder methods for get_verify_buffers_to_fill_after_draft and update_verify_buffers_to_fill_after_draft.
  • python/sglang/srt/layers/layernorm.py
    • Modified forward_npu to use npu_add_rms_norm for residual connections and npu_gemma_rms_norm otherwise, with a calculated gamma.
  • python/sglang/srt/layers/linear.py
    • Added an assertion to ensure int8 dtype consistency during weight loading.
  • python/sglang/srt/layers/quantization/compressed_tensors/utils.py
    • Included fused_mapping['model'] in should_ignore_layer logic.
    • Modified check_equal_or_regex_match to pass check_contains=True for regex matching.
  • python/sglang/srt/layers/quantization/w8a8_int8.py
    • Added an early return for dynamic quantization within the NPU initialization block.
    • Updated _get_quant_method_npu to check for 'compressed-tensors' to determine dynamic quantization.
    • Improved is_layer_skipped to use .get() for more robust quant_description lookups.
  • python/sglang/srt/mem_cache/memory_pool.py
    • Changed hardcoded device='cuda' to device=device for temporal_state and conv_state initialization.
    • Adjusted logging format for memory sizes to include more precision.
  • python/sglang/srt/model_executor/model_runner.py
    • Updated init_memory_pool to conditionally initialize token_to_kv_pool for Ascend only when mambaish_config is None.
  • python/sglang/srt/model_executor/npu_graph_runner.py
    • Imported get_bool_env_var and introduced self.use_fia based on an environment variable.
    • Modified _get_update_attr_name and _get_update_attr_type to incorporate self.use_fia for conditional logic.
  • python/sglang/srt/models/qwen3_next.py
    • Passed quant_config to Linear layer initialization for ba_proj.
    • Extended DP-Attn padding condition to include NPU.
    • Replaced torch.cuda.empty_cache() and torch.cuda.synchronize() with device-agnostic torch.get_device_module() calls.
  • python/sglang/srt/speculative/eagle_worker_v2.py
    • Added logic to update Mamba state after MTP verification for hybrid GDN models, including calculation of max_relative_indices_per_req.
  • python/sglang/srt/utils/convert_model_qwen3_next.py
    • Added a new utility script for quantizing Qwen3_next models to INT8 (A8W8) for NPU, including functions for weight quantization, ignore list generation, and configuration management.
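
For context on what the fused gating computes, a reference-PyTorch sketch of the gated-delta-rule decay gate, based on the common GDN formulation; the v3 kernel's exact math, dtypes, and tiling may differ:

import torch
import torch.nn.functional as F

def gdn_gating_ref(A_log: torch.Tensor, a: torch.Tensor,
                   dt_bias: torch.Tensor) -> torch.Tensor:
    # Log-space decay gate g = -exp(A_log) * softplus(a + dt_bias), which
    # the fused Triton/AscendC kernels evaluate in a single pass.
    return -torch.exp(A_log.float()) * F.softplus(a.float() + dt_bias.float())
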
Activity
  • Comment 1 from @gemini-code-assist[bot]: A warning about reaching the daily quota limit.
  • Comment 2 from @zhuyijie88: Requested a review from @gemini-code-assist.
  • Comment 3 from @zhuyijie88: Requested a summary from @gemini-code-assist.
  • Review Comment (2592066247) from @gemini-code-assist[bot]: High priority - The for loop in torch_causal_conv1d_update_npu can be a performance bottleneck, suggesting vectorization or a custom NPU kernel.
  • Review Comment (2592066252) from @gemini-code-assist[bot]: High priority - Potential bug in AscendGDNAttnBackend.forward_extend regarding the last_recurrent_state transpose in the else block, suggesting it might be needed for correctness.
  • Review Comment (2592066257) from @gemini-code-assist[bot]: Medium priority - Code duplication in AscendAttnBackend.forward_mtp for calculating actual_seq_lengths_kv and actual_seq_lengths, recommending refactoring to use pre-computed values.
  • Review Comment (2592066260) from @gemini-code-assist[bot]: Medium priority - Hardcoded num_cores in fused_gdn_gating_v3 can lead to suboptimal performance or errors, suggesting dynamic querying of device properties (see the sketch after this list).
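
On the num_cores point, a hedged sketch of dynamic querying; the property attribute name here is an assumption (torch_npu's device-properties fields vary by version), hence the fallback:

import torch
import torch_npu  # noqa: F401 -- registers the torch.npu namespace

def npu_num_cores(default: int = 40) -> int:
    # Attribute name is assumed, not guaranteed; fall back to a fixed
    # count if the installed torch_npu build does not expose it.
    props = torch.npu.get_device_properties(torch.npu.current_device())
    return getattr(props, "multi_processor_count", default)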

@StepyHan

StepyHan commented Dec 9, 2025

Adding to the PR description:

  1. New AscendC Fused Operator Integration: Added integration of the AscendC fused operator torch_npu.npu_recurrent_gated_delta_rule, which can be enabled with export ENABLE_ASCENDC_FUSION_GDN="true" (a parsing sketch follows).
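
A minimal sketch of that gate, assuming get_bool_env_var-style truthy parsing (SGLang's helper may accept other spellings):

import os

def ascendc_gdn_enabled() -> bool:
    # Treat "1"/"true" (case-insensitive) as on; anything else as off.
    return os.environ.get("ENABLE_ASCENDC_FUSION_GDN", "false").lower() in ("1", "true")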

Performance Comparison:
With the AscendC fused operator enabled, performance was measured as follows, using this startup command:

python -m sglang.launch_server --model-path ${MODEL_PATH} --trust-remote-code \
    --host 141.61.39.231 --port 8000 --nnodes 1 --node-rank 0 \
    --attention-backend ascend --device npu \
    --max-running-requests 96 --context-length 8192  --disable-radix-cache \
    --chunked-prefill-size 327680 --max-prefill-tokens 4000 \
    --tp-size 16 --enable-dp-attention --dp-size 4 --enable-dp-lm-head --mem-fraction-static 0.7 \
    --moe-a2a-backend deepep --deepep-mode auto \
    --cuda-graph-bs 24 \
    --max-total-tokens 229504
python -m sglang.bench_serving --base-url http://7.150.8.78:8000 --dataset-path ${DATASET_PATH}/ShareGPT_V3_unfiltered_cleaned_split.json --dataset-name=random --random-range-ratio 1 --random-input 3500 --random-output 1500 --max-concurrency 24 --num-prompts 24

baseline tpot: 53.05ms, ttft: 6500ms
(screenshot: baseline benchmark output)

pr tpot: 44.30ms, ttft: 6363ms
(screenshot: PR benchmark output)

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from aa9dfc2 to 61012fd on December 11, 2025 03:15
@zhuyijie88
Contributor Author

@gemini-code-assist review

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from 4db529d to 441e237 on December 15, 2025 11:02
Comment thread: python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py (outdated)
@plusls

plusls commented Jan 16, 2026

any progress?

@gbdjxgp

gbdjxgp commented Jan 19, 2026

any progress? @zhuyijie88 @ping1jing2

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from fb6de01 to 52f3068 on March 9, 2026 01:38
@zhuyijie88 requested a review from Qiaolin-Yu as a code owner on March 9, 2026 01:38
@zhuyijie88 force-pushed the main_20251204_qwen3_next branch 2 times, most recently from 8bee0a9 to 69b14f0 on March 9, 2026 02:25
@zhuyijie88
Contributor Author

any progress? @zhuyijie88 @ping1jing2
Work on this will continue shortly.

@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from 69b14f0 to 2dc9568 on March 9, 2026 02:33
…, fused operator, w8a8 quantization); fixup accuracy bugs in qwen3_next

Co-authored-by: zhuyijie88 <762412795@qq.com>
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
Co-authored-by: StepyHan <936072483@qq.com>
@zhuyijie88 force-pushed the main_20251204_qwen3_next branch from 2dc9568 to de06d64 on March 10, 2026 02:12