[NPU] Support MTP for Qwen3.5 #20918
Conversation
Summary of Changes (Gemini Code Assist): This pull request enables Multi-Token Prediction (MTP) speculative decoding for the Qwen3.5 model on Ascend NPU in the SGLang framework. It introduces a specialized GDN attention backend, refines attention routing, and integrates NPU-specific memory and unquantization logic to ensure efficient and stable model operation on Ascend hardware. These changes are crucial for leveraging the performance capabilities of NPUs for advanced language model inference.
Code Review
This pull request introduces support for MTP (Multi-Token Prediction) speculative decoding for Qwen3.5 models on Ascend NPUs. The changes include a new NPU-specific GDN attention backend, updates to memory management and metadata handling, and model-level modifications to enable this functionality. My review identified two critical thread-safety issues in the model forward passes related to the modification of global environment variables, which could lead to race conditions in a server environment. Additionally, a minor performance issue due to a redundant computation was found in the new attention backend.
```python
if is_npu() and self.quant_config is None:
    # ascend mtp unquant
    os.environ["SGLANG_DEEPEP_BF16_DISPATCH"] = "1"
    os.environ["DEEP_NORMAL_MODE_USE_INT8_QUANT"] = "0"
```
Modifying environment variables (os.environ) within the forward method is not thread-safe and can lead to race conditions in a concurrent server environment. When multiple requests are processed in parallel, one request might change the environment variables while another is in the middle of its execution, leading to incorrect and unpredictable behavior. This is a critical issue for a production server.
Configuration should ideally be passed through function arguments or other thread-safe mechanisms. If the underlying DEEPEP library only supports configuration via environment variables, this section of code should be protected by a lock, though this would have a significant performance impact. The best approach would be to investigate if the library can be configured in a thread-safe manner.
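For illustration only (none of these names come from this PR or from SGLang), the two flags could be resolved once at model construction time into an immutable object that threads can share safely, rather than written into `os.environ` inside `forward`:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DispatchConfig:
    # Hypothetical immutable snapshot of the two flags this diff toggles.
    bf16_dispatch: bool
    int8_quant: bool

def resolve_dispatch_config(on_npu: bool, quant_config) -> DispatchConfig:
    # Decided once at model build time; the frozen dataclass cannot be
    # mutated afterwards, so concurrent requests cannot race on it.
    if on_npu and quant_config is None:
        return DispatchConfig(bf16_dispatch=True, int8_quant=False)
    return DispatchConfig(bf16_dispatch=False, int8_quant=True)
```

The instance would then be passed down to whatever currently consumes the flags via the environment, assuming the DEEPEP layer can accept explicit arguments.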
```python
if is_npu() and self.quant_config is None:
    # ascend mtp unquant
    os.environ["SGLANG_DEEPEP_BF16_DISPATCH"] = "0"
    os.environ["DEEP_NORMAL_MODE_USE_INT8_QUANT"] = "1"
```
Similar to the modification at the beginning of the forward method, resetting environment variables here is not thread-safe and can cause race conditions. A request could reset the variables while another concurrent request requires them to be set, leading to incorrect behavior. This pattern of setting and unsetting environment variables per-request is unsafe in a multithreaded context.
```python
if is_npu() and self.quant_config is None:
    # ascend mtp unquant
    os.environ["SGLANG_DEEPEP_BF16_DISPATCH"] = "1"
    os.environ["DEEP_NORMAL_MODE_USE_INT8_QUANT"] = "0"
```
Modifying environment variables (os.environ) within the forward method is not thread-safe. In a concurrent server environment, this can lead to race conditions where one request's configuration leaks into another, causing unpredictable behavior. This is a critical issue.
Please consider a thread-safe way to pass this configuration. If the underlying library absolutely requires environment variables, access to this part of the code might need to be serialized (e.g., with a lock), which would be a major performance bottleneck. The preferred solution is to avoid environment variables for per-request configuration.
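If the underlying library truly reads these variables only at call time, the serialized fallback the review mentions would look roughly like the sketch below; `_env_lock` and `run_with_env` are illustrative names, not SGLang or DEEPEP APIs:

```python
import os
import threading

_env_lock = threading.Lock()

def run_with_env(flags: dict, fn, *args, **kwargs):
    # Serialize the set -> call -> restore sequence so that no concurrent
    # request can observe another request's environment flags.
    with _env_lock:
        saved = {k: os.environ.get(k) for k in flags}
        try:
            os.environ.update(flags)
            return fn(*args, **kwargs)
        finally:
            for k, v in saved.items():
                if v is None:
                    os.environ.pop(k, None)
                else:
                    os.environ[k] = v
```

Because every forward pass that touches the flags would queue behind this lock, the sketch also illustrates the review's point that argument passing is the better fix.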
```python
if is_npu() and self.quant_config is None:
    # ascend mtp unquant
    os.environ["SGLANG_DEEPEP_BF16_DISPATCH"] = "0"
    os.environ["DEEP_NORMAL_MODE_USE_INT8_QUANT"] = "1"
```
```python
@triton.jit
def fused_gdn_gating_kernel_without_sigmoid_kernel(
```
Is this op Ascend-only? If it is only used on Ascend, please move it to the sgl-kernel-npu repo.
I have moved it to the sgl-kernel-npu repo (sgl-project/sgl-kernel-npu#429).
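For context, a common pattern for hardware-specific ops (a sketch with illustrative module paths, not the actual sgl-kernel-npu layout) is to gate the import on the platform so the generic code path stays untouched:

```python
# Hypothetical dispatch for a platform-specific kernel; both import paths
# below are placeholders, not the real sgl-kernel-npu or SGLang modules.
from sglang.srt.utils import is_npu

if is_npu():
    from sgl_kernel_npu.gdn import fused_gdn_gating  # Ascend Triton kernel
else:
    from sglang.srt.layers.attention.fla import fused_gdn_gating  # generic path
```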
```diff
  req_pool_indices[bs - num_padding :] = 0
  mamba_indices = self.req_to_token_pool.get_mamba_indices(req_pool_indices)
- mamba_indices[bs - num_padding :] = -1
+ mamba_indices[bs - num_padding :] = 0
```
This change needs to be reviewed to determine whether it affects the GPU implementation!
This change does not affect the GPU implementation.
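As a standalone illustration of why the sentinel value matters (made-up tensors, not the SGLang buffers): in PyTorch-style indexing, `-1` silently wraps to the last row of the pool, and a raw gather kernel without wrap-around semantics may instead read out of bounds, whereas `0` is always a valid slot whose padded output is presumably discarded anyway.

```python
import torch

state_pool = torch.arange(12.0).reshape(4, 3)  # 4 hypothetical mamba state slots
indices = torch.tensor([2, 1, -1])             # last entry belongs to a padded request

# -1 wraps to the LAST slot in torch indexing, silently aliasing live state;
# an NPU gather kernel may read out of bounds instead of wrapping.
print(state_pool[indices][-1])  # tensor([ 9., 10., 11.])

indices[-1] = 0                 # padded entries now reference a valid slot
print(state_pool[indices][-1])  # tensor([0., 1., 2.])
```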
…st 27&35(gsm8k), accuracy ok
fix extra-buffer check
```python
and get_global_server_args().quantization is not None
):
    # ascend mtp unquant
    os.environ["SGLANG_DEEPEP_BF16_DISPATCH"] = "1"
```
Sorry to bother you, but I wonder what's your […]? I test this PR on […] and hit the following error:

```
(SGLangEngine pid=3061557) [2026-04-22 09:32:22 TP2] Scheduler hit an exception: Traceback (most recent call last):
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/triton/compiler/compiler.py", line 310, in compile
(SGLangEngine pid=3061557)     next_module = compile_ir(module, metadata)
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/triton/backends/ascend/compiler.py", line 778, in <lambda>
(SGLangEngine pid=3061557)     lambda src, metadata: linalg_to_bin_enable_npu_compile_A2_A3(
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/triton/backends/ascend/compiler.py", line 549, in linalg_to_bin_enable_npu_compile_A2_A3
(SGLangEngine pid=3061557)     ret = subprocess.run(cmd_list, capture_output=True, check=True)
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/subprocess.py", line 571, in run
(SGLangEngine pid=3061557)     raise CalledProcessError(retcode, process.args,
(SGLangEngine pid=3061557) subprocess.CalledProcessError: Command '['/home//851b080/cann-8.5.1/bin/bishengir-compile', '/tmp/tmpp8bki1m1/kernel.ttadapter.mlir', '--target=Ascend910_9392', '--enable-auto-multi-buffer=True', '--enable-auto-bind-sub-block=True', '--enable-hfusion-compile=true', '--enable-hivm-compile=true', '--enable-triton-kernel-compile=true', '-o', '/tmp/tmpp8bki1m1/kernel']' returned non-zero exit status 1.
(SGLangEngine pid=3061557)
(SGLangEngine pid=3061557) During handling of the above exception, another exception occurred:
(SGLangEngine pid=3061557)
(SGLangEngine pid=3061557) Traceback (most recent call last):
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/managers/scheduler.py", line 3623, in run_scheduler_process
(SGLangEngine pid=3061557)     scheduler.run_event_loop()
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/managers/scheduler.py", line 1307, in run_event_loop
(SGLangEngine pid=3061557)     dispatch_event_loop(self)
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/managers/scheduler.py", line 3506, in dispatch_event_loop
(SGLangEngine pid=3061557)     scheduler.event_loop_normal()
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(SGLangEngine pid=3061557)     return func(*args, **kwargs)
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/managers/scheduler.py", line 1326, in event_loop_normal
(SGLangEngine pid=3061557)     result = self.run_batch(batch)
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/managers/scheduler.py", line 2731, in run_batch
(SGLangEngine pid=3061557)     batch_result = self.model_worker.forward_batch_generation(
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/speculative/eagle_worker.py", line 320, in forward_batch_generation
(SGLangEngine pid=3061557)     self.verify(batch, spec_info)
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/speculative/eagle_worker.py", line 782, in verify
(SGLangEngine pid=3061557)     self._mamba_verify_update(
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/speculative/eagle_worker.py", line 874, in _mamba_verify_update
(SGLangEngine pid=3061557)     self.target_worker.model_runner.attn_backend.update_mamba_state_after_mtp_verify(
(SGLangEngine pid=3061557)   File "/home//slime-proj/sglang/python/sglang/srt/hardware_backend/npu/attention/ascend_hybrid_linear_attn_backend.py", line 259, in update_mamba_state_after_mtp_verify
(SGLangEngine pid=3061557)     move_intermediate_cache(
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/sgl_kernel_npu/mamba/mamba_state_update_triton.py", line 126, in move_intermediate_cache
(SGLangEngine pid=3061557)     move_cache_dynamic_last_kernel_h_block[grid](
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/triton/runtime/jit.py", line 353, in <lambda>
(SGLangEngine pid=3061557)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/triton/runtime/jit.py", line 660, in run
(SGLangEngine pid=3061557)     kernel = self.compile(
(SGLangEngine pid=3061557)     ^^^^^^^^^^^^^
(SGLangEngine pid=3061557)   File "/root/anaconda3/envs/_slime_/lib/python3.11/site-packages/triton/compiler/compiler.py", line 320, in compile
(SGLangEngine pid=3061557)     raise MLIRCompilationError(stage_name, error_detail)
(SGLangEngine pid=3061557) triton.compiler.errors.MLIRCompilationError:
(SGLangEngine pid=3061557) ///------------------[ERROR][Triton][BEG]------------------
(SGLangEngine pid=3061557) [ConvertLinalgRToBinary] encounters error:
(SGLangEngine pid=3061557) loc("/tmp/tmpp8bki1m1/kernel.ttadapter.mlir":1:1): error: Failed to run BiShengHIR pipeline
(SGLangEngine pid=3061557)
(SGLangEngine pid=3061557) loc("/tmp/tmpp8bki1m1/kernel.ttadapter.mlir":2:3): error: ub overflow, requires 2097152 bits while 1572864 bits available! (possible reason: tiling basic block is too large or block number is more than what user expect due to multi-buffer feature is enabled and some ops need extra local buffer.)
[... the two errors above repeat 4 more times ...]
(SGLangEngine pid=3061557) [ERROR] Failed to run BiShengIR pipeline
(SGLangEngine pid=3061557)
(SGLangEngine pid=3061557) [INFO]: The compiled kernel cache is in /root/.triton/cache/xaNvcZrMuMy_f7dS6Iq88V2w0VfI4uQln5MWDIdXUQM
(SGLangEngine pid=3061557)
(SGLangEngine pid=3061557) ///------------------[ERROR][Triton][END]------------------
```

Do you have any ideas?
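For what it's worth, the `ub overflow` message says the kernel's per-program tile needs 2,097,152 bits of unified buffer while only 1,572,864 are available, so shrinking the tile by roughly a quarter would already fit. The usual knob is the kernel's block-size constexpr, sketched below with made-up names rather than the actual `move_cache_dynamic_last_kernel_h_block` parameters:

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(src_ptr, dst_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized tile; the tile's working
    # set is what must fit into the NPU's unified buffer.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(dst_ptr + offs, tl.load(src_ptr + offs, mask=mask), mask=mask)

# Halving BLOCK (e.g. 4096 -> 2048) halves the tile's buffer footprint at the
# cost of twice as many program instances:
#   grid = (triton.cdiv(n, BLOCK),)
#   copy_kernel[grid](src, dst, n, BLOCK=2048)
```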
/tag-run-ci-label

/tag-and-rerun-ci

/rerun-failed-ci

/rerun-failed-ci

@iridiumine Hi! Which version of sgl-kernel-npu are you using? It seems I have problems with our default 2026.03.10.rc1 version; I think we need to update it in https://github.com/sgl-project/sglang/blob/main/scripts/ci/npu/npu_ci_install_dependency.sh

@OrangeRedeng Hi, I'm using the following sgl-kernel-npu version: https://github.com/sgl-project/sgl-kernel-npu/releases/tag/2026.04.15.rc4

Motivation
Adapt the MTP (Multi-Token Prediction) speculative decoding feature for the Qwen3.5 model on the Ascend NPU platform, fix inference errors, and ensure stable and efficient model operation.
Modifications
Accuracy Tests
Script
Result
Benchmarking and Profiling
No performance impact.
Checklist