[npu] support features of qwen3_next; fixup accuracy bugs in qwen3_next #14391

zhuyijie88 wants to merge 1 commit into sgl-project:main from
Conversation
@gemini-code-assist review

Code Review
This pull request adds support for qwen3_next features on Ascend NPU and fixes some accuracy bugs. The changes are extensive, touching attention backends, memory management, model definitions, and quantization logic. Overall, the changes seem to be moving in the right direction for NPU enablement. I've identified a few areas for improvement regarding code duplication, performance, potential correctness issues, and hardcoded values. Addressing these points will enhance the maintainability, performance, and robustness of the implementation.
```python
if conv_state_update is not None:
    for i in range(seq_len):
        end = i - seq_len + 1
        start = end - state_len
        slice_range = slice(start, end if end != 0 else None)
        conv_state_update[:, i] = hidden_states_new[:, :, slice_range]
```
The for loop over seq_len here can be a performance bottleneck, especially when this function is called from forward_extend during the verification phase of speculative decoding, where seq_len can be greater than 1. Consider vectorizing this operation or using a custom NPU kernel to improve performance.
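A minimal NumPy sketch of the vectorization this comment suggests. The shapes `[B, C, T]` for `hidden_states_new` and `[B, seq_len, C, state_len]` for `conv_state_update` are inferred from the diff, and `T >= seq_len + state_len - 1` is assumed; on device, `torch.Tensor.unfold` provides the analogous strided view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_state_loop(hidden, seq_len, state_len):
    # Reference implementation mirroring the per-step loop in the diff.
    batch, channels, _ = hidden.shape
    out = np.empty((batch, seq_len, channels, state_len), dtype=hidden.dtype)
    for i in range(seq_len):
        end = i - seq_len + 1
        start = end - state_len
        out[:, i] = hidden[:, :, start: end if end != 0 else None]
    return out

def conv_state_vectorized(hidden, seq_len, state_len):
    # sliding_window_view over the last axis yields shape
    # [B, C, T - state_len + 1, state_len]; the final seq_len windows are
    # exactly the slices the loop extracts, so one strided view plus a
    # transpose replaces the Python-level loop over seq_len.
    windows = sliding_window_view(hidden, state_len, axis=-1)
    return np.ascontiguousarray(windows[:, :, -seq_len:].transpose(0, 2, 1, 3))
```

The same windows-then-select pattern ports to PyTorch as `hidden.unfold(-1, state_len, 1)[:, :, -seq_len:].permute(0, 2, 1, 3)`, which keeps all the work inside the tensor library instead of a Python loop.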
Add PR description:
Performance Comparison:
@gemini-code-assist review
any progress?
any progress? @zhuyijie88 @ping1jing2 |
…, fused operator, w8a8 quantization); fixup accuracy bugs in qwen3_next

Co-authored-by: zhuyijie88 <762412795@qq.com>
Co-authored-by: ZhengdQin <zhengdqin@gmail.com>
Co-authored-by: StepyHan <936072483@qq.com>
Motivation
This PR needs to run together with the changes in [fix] fixup bug in conv1d_update_fn sgl-kernel-npu#259. You must also install the latest CANN/PTA packages, which include the new torch_npu kernels.
Modifications
This pull request significantly advances NPU (Ascend) compatibility and performance for the Qwen3_next model within the SGLang framework. It integrates crucial NPU-specific optimizations and features, such as disaggregation and fused operators, while also rectifying known accuracy problems. The changes span attention mechanisms, quantization processes, and general device management, aiming to provide a more robust and efficient experience on Ascend hardware.
Highlights
Accuracy Tests
Benchmarking and Profiling
After enabling the aforementioned AscendC fused operator, the performance metrics are as follows:
By configuring the startup command:
baseline: tpot 53.05 ms, ttft 6500 ms
PR: tpot 44.30 ms, ttft 6363 ms
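As a quick sanity check on these figures, the relative improvements can be computed directly; `pct_lower` below is a hypothetical helper (not part of the PR), and the values are the ones reported above.

```python
# Reported figures (milliseconds) from the benchmark above.
baseline = {"tpot": 53.05, "ttft": 6500.0}
pr = {"tpot": 44.30, "ttft": 6363.0}

def pct_lower(before: float, after: float) -> float:
    """Percentage reduction of `after` relative to `before`."""
    return (before - after) / before * 100.0

for metric in baseline:
    print(f"{metric}: {pct_lower(baseline[metric], pr[metric]):.1f}% lower")
# tpot improves by roughly 16.5%, ttft by roughly 2.1%
```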
Checklist