[NPU][eagle3] support qwen eagle3 on NPU #14820
iforgetmyname merged 11 commits into sgl-project:main
Conversation
Summary of Changes
This pull request introduces significant enhancements to SGLang by enabling the Qwen Eagle3 model to run efficiently on NPU hardware. The changes involve adapting the attention mechanism to support various model architectures, refining the speculative decoding process for multi-step operations, and optimizing NPU-specific data handling. These modifications aim to improve performance and compatibility for large language models on NPU platforms, as demonstrated by the provided accuracy and benchmarking results for Qwen3-32B-Int8.
Code Review
This pull request adds support for Qwen Eagle3 models on NPU. The main changes include adding a Multi-Head Attention (MHA) path for non-MLA models, modifying the NPU graph runners to handle different attention architectures dynamically, and adjusting data types for NPU compatibility. My review focuses on improving code maintainability by reducing duplication and cleaning up redundant code.
```python
if not self.use_mla:
    k_cache = forward_batch.token_to_kv_pool.get_key_buffer(layer.layer_id).view(
        -1, self.page_size, layer.tp_k_head_num * layer.qk_head_dim
    )
    v_cache = forward_batch.token_to_kv_pool.get_value_buffer(layer.layer_id).view(
        -1, self.page_size, layer.tp_v_head_num * layer.v_head_dim
    )
    query = q.reshape(-1, layer.tp_q_head_num, layer.qk_head_dim).contiguous()
    if not self.graph_mode:
        num_token_padding = query.shape[0]
        query = query[: forward_batch.num_token_non_padded_cpu]
    if self.forward_metadata.seq_lens_cpu_int is None:
        actual_seq_lengths_kv = self.forward_metadata.seq_lens_cpu_list
    else:
        actual_seq_lengths_kv = (
            self.forward_metadata.seq_lens_cpu_int.cpu().int().tolist()
        )
    if forward_batch.forward_mode.is_draft_extend():
        actual_seq_lengths = (
            np.array(forward_batch.extend_seq_lens_cpu).cumsum().tolist()
        )
    else:
        actual_seq_lengths = np.arange(
            self.speculative_num_draft_tokens,
            self.speculative_num_draft_tokens + query.shape[0],
            self.speculative_num_draft_tokens,
        )
    attn_output, _ = torch.ops.npu.npu_fused_infer_attention_score(
        query,
        k_cache,
        v_cache,
        block_table=self.forward_metadata.block_tables,
        block_size=self.page_size,
        num_heads=layer.tp_q_head_num,
        num_key_value_heads=layer.tp_k_head_num,
        input_layout="TND",
        atten_mask=self.mtp_mask,
        scale=layer.scaling,
        actual_seq_lengths=actual_seq_lengths,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        sparse_mode=3,
    )
    attn_output = attn_output.view(-1, layer.tp_q_head_num * layer.v_head_dim)
    if (
        not self.graph_mode
        and forward_batch.num_token_non_padded_cpu != num_token_padding
    ):
        attn_output = torch.cat(
            [
                attn_output,
                attn_output.new_zeros(
                    num_token_padding - forward_batch.num_token_non_padded_cpu,
                    *attn_output.shape[1:],
                ),
            ],
            dim=0,
        )
    return attn_output
else:
    c_kv, k_rope = forward_batch.token_to_kv_pool.get_kv_buffer(layer.layer_id)
    k_rope_cache = k_rope.view(
        -1, layer.tp_k_head_num, self.page_size, self.qk_rope_head_dim
    )
    c_kv_cache = c_kv.view(
        -1, layer.tp_v_head_num, self.page_size, self.kv_lora_rank
    )
    q_nope = q.view(-1, layer.tp_q_head_num, self.kv_lora_rank).contiguous()
    q_rope = q_rope.view(-1, layer.tp_q_head_num, self.qk_rope_head_dim)
    if not self.graph_mode:
        num_token_padding = q.shape[0]
        q_nope = q_nope[: forward_batch.num_token_non_padded_cpu]
        q_rope = q_rope[: forward_batch.num_token_non_padded_cpu]
    if self.forward_metadata.seq_lens_cpu_int is None:
        actual_seq_lengths_kv = self.forward_metadata.seq_lens_cpu_list
    else:
        actual_seq_lengths_kv = (
            self.forward_metadata.seq_lens_cpu_int.cpu().int().tolist()
        )
    if forward_batch.forward_mode.is_draft_extend():
        actual_seq_lengths = (
            np.array(forward_batch.extend_seq_lens_cpu).cumsum().tolist()
        )
    else:
        actual_seq_lengths = np.arange(
            self.speculative_num_draft_tokens,
            self.speculative_num_draft_tokens + q_nope.shape[0],
            self.speculative_num_draft_tokens,
        )
    workspace = torch_npu._npu_fused_infer_attention_score_get_max_workspace(
        q_nope,
        c_kv_cache,
        c_kv_cache,
        query_rope=q_rope,
        key_rope=k_rope_cache,
        num_heads=layer.tp_q_head_num,
        num_key_value_heads=layer.tp_k_head_num,
        input_layout="TND",
        scale=layer.scaling,
        antiquant_mode=0,
        antiquant_scale=None,
        block_table=self.forward_metadata.block_tables,
        block_size=self.page_size,
        sparse_mode=3,
        atten_mask=self.mtp_mask,
        actual_seq_lengths=actual_seq_lengths,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
    )
    attn_output = torch.empty_like(q_nope, dtype=q.dtype, device=q.device)
    softmax_lse = torch.empty(1, dtype=q.dtype, device=q.device)
    torch_npu.npu_fused_infer_attention_score.out(
        q_nope,
        c_kv_cache,
        c_kv_cache,
        query_rope=q_rope,
        key_rope=k_rope_cache,
        num_heads=layer.tp_q_head_num,
        num_key_value_heads=layer.tp_k_head_num,
        input_layout="TND",
        scale=layer.scaling,
        antiquant_mode=0,
        antiquant_scale=None,
        block_table=self.forward_metadata.block_tables,
        block_size=self.page_size,
        sparse_mode=3,
        atten_mask=self.mtp_mask,
        actual_seq_lengths=actual_seq_lengths,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        workspace=workspace,
        out=[attn_output, softmax_lse],
    )
    attn_output = attn_output.view(-1, layer.tp_q_head_num * layer.v_head_dim)
    if (
        not self.graph_mode
        and forward_batch.num_token_non_padded_cpu != num_token_padding
    ):
        attn_output = torch.cat(
            [
                attn_output,
                attn_output.new_zeros(
                    num_token_padding - attn_output.shape[0],
                    *attn_output.shape[1:],
                ),
            ],
            dim=0,
        )
    return attn_output
```
There is significant code duplication between the `if not self.use_mla:` block and the `else:` block. The logic for calculating `actual_seq_lengths_kv` and `actual_seq_lengths`, the graph-mode handling that determines `num_token_padding`, and the final padding of `attn_output` are all nearly identical.
To improve maintainability and reduce redundancy, this common logic should be refactored out of the `if`/`else` blocks. Compute these values once, and let the conditional branches focus only on what differs (the MHA vs. MLA attention computation). This will make the code cleaner and easier to maintain.
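A minimal sketch of that refactor, using the names from the diff above; the helper functions themselves are hypothetical, and `numpy`/`torch` are assumed to be available in the backend module:

```python
import numpy as np
import torch


def compute_actual_seq_lengths(
    forward_batch, forward_metadata, speculative_num_draft_tokens, num_query_tokens
):
    """Hypothetical helper hoisting the length logic shared by the MHA and MLA paths."""
    if forward_metadata.seq_lens_cpu_int is None:
        actual_seq_lengths_kv = forward_metadata.seq_lens_cpu_list
    else:
        actual_seq_lengths_kv = forward_metadata.seq_lens_cpu_int.cpu().int().tolist()
    if forward_batch.forward_mode.is_draft_extend():
        actual_seq_lengths = (
            np.array(forward_batch.extend_seq_lens_cpu).cumsum().tolist()
        )
    else:
        actual_seq_lengths = np.arange(
            speculative_num_draft_tokens,
            speculative_num_draft_tokens + num_query_tokens,
            speculative_num_draft_tokens,
        )
    return actual_seq_lengths, actual_seq_lengths_kv


def pad_attn_output(attn_output, num_token_padding):
    """Hypothetical helper for the shared zero-padding step after the kernel call."""
    if attn_output.shape[0] != num_token_padding:
        attn_output = torch.cat(
            [
                attn_output,
                attn_output.new_zeros(
                    num_token_padding - attn_output.shape[0], *attn_output.shape[1:]
                ),
            ],
            dim=0,
        )
    return attn_output
```

Each branch would then call these helpers and keep only its own kernel invocation.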
```diff
         out_cache_loc,
     )
-    out_cache_loc = out_cache_loc.to(dtype=torch.int64)
+    out_cache_loc = out_cache_loc.to(dtype=torch.int32)
```
/tag-and-rerun-ci
```python
if not _is_npu:
    batch.out_cache_loc = torch.empty(
        (bs * topk * num_steps,),
        dtype=torch.int64,
        device=batch.input_ids.device,
    )
else:
    batch.out_cache_loc = torch.empty(
        (bs * topk * num_steps,),
        dtype=torch.int32,
        device=batch.input_ids.device,
    )
```
Suggested change:

```diff
-if not _is_npu:
-    batch.out_cache_loc = torch.empty(
-        (bs * topk * num_steps,),
-        dtype=torch.int64,
-        device=batch.input_ids.device,
-    )
-else:
-    batch.out_cache_loc = torch.empty(
-        (bs * topk * num_steps,),
-        dtype=torch.int32,
-        device=batch.input_ids.device,
-    )
+batch.out_cache_loc = torch.empty(
+    (bs * topk * num_steps,),
+    dtype=cuda_graph_runner.get_cache_loc_dtype(),
+    device=batch.input_ids.device,
+)
```
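`get_cache_loc_dtype` is not defined anywhere in this diff; a minimal sketch of what such a helper could look like, assuming it can see the same `_is_npu` flag used above:

```python
import torch

# Placeholder: in the real runner this flag would come from the hardware backend.
_is_npu = False


def get_cache_loc_dtype() -> torch.dtype:
    # NPU kernels index the KV cache with int32 cache locations,
    # while the default CUDA path keeps int64.
    return torch.int32 if _is_npu else torch.int64
```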
```python
# In the Eagle3 scenario, the small model is unquantized.
if _disable_eagle3_quant:
    self.server_args.quantization = None
```

Should rework this modification here. Can we read the quant config from the Eagle model's config files?
Remove this import as well; theoretically, importing once during startup should work.
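A minimal illustration of that point (the import path is illustrative): resolve the flag once at module import time instead of re-importing inside the hot path.

```python
# Module level: evaluated once at startup.
from sglang.srt.utils import is_npu  # illustrative import path

_is_npu = is_npu()


def some_hot_path_function(batch):
    # No per-call import here; reuse the cached module-level flag.
    if _is_npu:
        ...
```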
@Liwansi Hi, could you put your startup parameters and hardware info in a README? I'm working on this PR to test performance. Thank you.
`--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx/Qwen3-32B-Eagle3`
Thank you, but I already hit an error when starting up with these parameters; please give some guidance. Thank you! STARTUP PARAMETERS: ERRORS:
Please ensure that you are using the latest version of the code and that no residual processes remain in the environment. If you encounter further issues, please open an issue for tracking. Thank you!
```bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1536
```
We don't need that much buffer size for a dense model.
```bash
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
```
Don't set ASCEND_RT_VISIBLE_DEVICES here; use `--base-gpu-id` instead.
```python
# In speculative scenario:
# - If `speculative_draft_model_quantization` is specified, the draft model uses this quantization method.
# - Otherwise, the draft model defaults to the same quantization as the target model.
if model_config.is_draft_model:
    draft_quant = model_config.speculative_draft_model_quantization
    quantization = (
        None
        if draft_quant == "unquant"
        else draft_quant or model_config.quantization
    )
else:
    quantization = model_config.quantization
```

This part should be handled inside `server_args.py`, e.g.:

```python
if self.speculative_draft_model_quantization is None:
    self.speculative_draft_model_quantization = self.quantization
elif unquant:
    self.speculative_draft_model_quantization = None
```
```python
sampling_defaults=server_args.sampling_defaults,
quantize_and_serve=server_args.quantize_and_serve,
override_config_file=server_args.decrypted_config_file,
speculative_draft_model_quantization=server_args.speculative_draft_model_quantization,
```

Suggestion:

```python
quantization = (
    server_args.speculative_draft_model_quantization
    if is_draft_model
    else server_args.quantization
)
```
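A self-contained sketch of that one-liner, with a stand-in for the real `ServerArgs` class (the dataclass below is illustrative only):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgsStub:
    """Illustrative stand-in for the real ServerArgs."""

    quantization: Optional[str] = None
    speculative_draft_model_quantization: Optional[str] = None


def pick_quantization(server_args: ServerArgsStub, is_draft_model: bool) -> Optional[str]:
    # The reviewer's suggestion: choose the field at the call site so
    # ModelConfig itself never branches on draft-specific state.
    return (
        server_args.speculative_draft_model_quantization
        if is_draft_model
        else server_args.quantization
    )


args = ServerArgsStub(quantization="w8a8_int8")
assert pick_quantization(args, is_draft_model=False) == "w8a8_int8"
assert pick_quantization(args, is_draft_model=True) is None
```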
/tag-and-rerun-ci
@Liwansi Thanks for supporting this feature. I built the image from the main branch and followed the startup command, but ran into an error. Any guidance would be greatly appreciated. Thanks!
Could you please provide me with the complete error log and the script?
@Liwansi First, I'd like to confirm whether building the 910B image in this way is correct. It might be best to rule out any image build–related issues first.
OK. I pulled the latest image to reproduce your issue, but everything worked fine. Perhaps you could try pulling a fresh image with `docker image pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.3.rc2-910b` and running it again.
Motivation
Enable SGLang Eagle3 on the NPU platform.
Tested models:
Qwen3-32B-Int8
Modifications
1. Add an MHA attention op in `forward_mtp` to support non-MLA models.
2. Modify `eagle_draft_npu_graph_runner.py` to support `speculative-num-steps > 1` in the Eagle3 scenario.
3. Add a `--speculative-draft-model-quantization` parameter to handle cases where the target and draft models use different quantization methods.
Rules for `speculative_draft_model_quantization` (see the sketch after this list):
- If `speculative_draft_model_quantization` is not specified, the draft model inherits the quantization config from the target model.
- If `speculative_draft_model_quantization` is `unquant` (the target model is quantized but the draft model is not), `speculative_draft_model_quantization` is reset to `None`.
- The resulting `speculative_draft_model_quantization` is then passed into the draft model's `model_config` to initialize its `quantization` entry.
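Read together, the rules amount to a small resolution step; a sketch (the helper is hypothetical, not the PR's exact code):

```python
from typing import Optional


def resolve_draft_quantization(
    draft_quant: Optional[str], target_quant: Optional[str]
) -> Optional[str]:
    """Hypothetical helper mirroring the three rules above."""
    if draft_quant is None:
        # Rule 1: not specified -> inherit the target model's quantization.
        return target_quant
    if draft_quant == "unquant":
        # Rule 2: target quantized, draft unquantized -> reset to None.
        return None
    # Otherwise the explicit method is passed through to the draft model_config.
    return draft_quant


# Example: int8 target model with an unquantized Eagle3 draft.
assert resolve_draft_quantization("unquant", "w8a8_int8") is None
assert resolve_draft_quantization(None, "w8a8_int8") == "w8a8_int8"
```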
Accuracy Tests
Qwen3-32B-Int8:

Qwen3-32B-Int8 with eagle:

Benchmarking and Profiling
Qwen3-32B-Int8:

Qwen3-32B-Int8 with eagle step 1:

Qwen3-32B-Int8 with eagle step 2:

Checklist