[NPU][eagle3] support qwen eagle3 on NPU #14820
iforgetmyname merged 11 commits into sgl-project:main
Conversation
Summary of Changes
This pull request introduces significant enhancements to SGLang by enabling the Qwen Eagle3 model to run efficiently on NPU hardware. The changes involve adapting the attention mechanism to support various model architectures, refining the speculative decoding process for multi-step operations, and optimizing NPU-specific data handling. These modifications aim to improve performance and compatibility for large language models on NPU platforms, as demonstrated by the provided accuracy and benchmarking results for Qwen3-32B-Int8.
Code Review
This pull request adds support for Qwen Eagle3 models on NPU. The main changes include adding a Multi-Head Attention (MHA) path for non-MLA models, modifying the NPU graph runners to handle different attention architectures dynamically, and adjusting data types for NPU compatibility. My review focuses on improving code maintainability by reducing duplication and cleaning up redundant code.
```python
if not self.use_mla:
    k_cache = forward_batch.token_to_kv_pool.get_key_buffer(layer.layer_id).view(
        -1, self.page_size, layer.tp_k_head_num * layer.qk_head_dim
    )
    v_cache = forward_batch.token_to_kv_pool.get_value_buffer(layer.layer_id).view(
        -1, self.page_size, layer.tp_v_head_num * layer.v_head_dim
    )
    query = q.reshape(-1, layer.tp_q_head_num, layer.qk_head_dim).contiguous()
    if not self.graph_mode:
        num_token_padding = query.shape[0]
        query = query[: forward_batch.num_token_non_padded_cpu]
    if self.forward_metadata.seq_lens_cpu_int is None:
        actual_seq_lengths_kv = self.forward_metadata.seq_lens_cpu_list
    else:
        actual_seq_lengths_kv = (
            self.forward_metadata.seq_lens_cpu_int.cpu().int().tolist()
        )
    if forward_batch.forward_mode.is_draft_extend():
        actual_seq_lengths = (
            np.array(forward_batch.extend_seq_lens_cpu).cumsum().tolist()
        )
    else:
        actual_seq_lengths = np.arange(
            self.speculative_num_draft_tokens,
            self.speculative_num_draft_tokens + query.shape[0],
            self.speculative_num_draft_tokens,
        )
    attn_output, _ = torch.ops.npu.npu_fused_infer_attention_score(
        query,
        k_cache,
        v_cache,
        block_table=self.forward_metadata.block_tables,
        block_size=self.page_size,
        num_heads=layer.tp_q_head_num,
        num_key_value_heads=layer.tp_k_head_num,
        input_layout="TND",
        atten_mask=self.mtp_mask,
        scale=layer.scaling,
        actual_seq_lengths=actual_seq_lengths,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        sparse_mode=3,
    )
    attn_output = attn_output.view(-1, layer.tp_q_head_num * layer.v_head_dim)
    if (
        not self.graph_mode
        and forward_batch.num_token_non_padded_cpu != num_token_padding
    ):
        attn_output = torch.cat(
            [
                attn_output,
                attn_output.new_zeros(
                    num_token_padding - forward_batch.num_token_non_padded_cpu,
                    *attn_output.shape[1:],
                ),
            ],
            dim=0,
        )
    return attn_output
else:
    c_kv, k_rope = forward_batch.token_to_kv_pool.get_kv_buffer(layer.layer_id)
    k_rope_cache = k_rope.view(
        -1, layer.tp_k_head_num, self.page_size, self.qk_rope_head_dim
    )
    c_kv_cache = c_kv.view(
        -1, layer.tp_v_head_num, self.page_size, self.kv_lora_rank
    )
    q_nope = q.view(-1, layer.tp_q_head_num, self.kv_lora_rank).contiguous()
    q_rope = q_rope.view(-1, layer.tp_q_head_num, self.qk_rope_head_dim)
    if not self.graph_mode:
        num_token_padding = q.shape[0]
        q_nope = q_nope[: forward_batch.num_token_non_padded_cpu]
        q_rope = q_rope[: forward_batch.num_token_non_padded_cpu]
    if self.forward_metadata.seq_lens_cpu_int is None:
        actual_seq_lengths_kv = self.forward_metadata.seq_lens_cpu_list
    else:
        actual_seq_lengths_kv = (
            self.forward_metadata.seq_lens_cpu_int.cpu().int().tolist()
        )
    if forward_batch.forward_mode.is_draft_extend():
        actual_seq_lengths = (
            np.array(forward_batch.extend_seq_lens_cpu).cumsum().tolist()
        )
    else:
        actual_seq_lengths = np.arange(
            self.speculative_num_draft_tokens,
            self.speculative_num_draft_tokens + q_nope.shape[0],
            self.speculative_num_draft_tokens,
        )
    workspace = torch_npu._npu_fused_infer_attention_score_get_max_workspace(
        q_nope,
        c_kv_cache,
        c_kv_cache,
        query_rope=q_rope,
        key_rope=k_rope_cache,
        num_heads=layer.tp_q_head_num,
        num_key_value_heads=layer.tp_k_head_num,
        input_layout="TND",
        scale=layer.scaling,
        antiquant_mode=0,
        antiquant_scale=None,
        block_table=self.forward_metadata.block_tables,
        block_size=self.page_size,
        sparse_mode=3,
        atten_mask=self.mtp_mask,
        actual_seq_lengths=actual_seq_lengths,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
    )
    attn_output = torch.empty_like(q_nope, dtype=q.dtype, device=q.device)
    softmax_lse = torch.empty(1, dtype=q.dtype, device=q.device)
    torch_npu.npu_fused_infer_attention_score.out(
        q_nope,
        c_kv_cache,
        c_kv_cache,
        query_rope=q_rope,
        key_rope=k_rope_cache,
        num_heads=layer.tp_q_head_num,
        num_key_value_heads=layer.tp_k_head_num,
        input_layout="TND",
        scale=layer.scaling,
        antiquant_mode=0,
        antiquant_scale=None,
        block_table=self.forward_metadata.block_tables,
        block_size=self.page_size,
        sparse_mode=3,
        atten_mask=self.mtp_mask,
        actual_seq_lengths=actual_seq_lengths,
        actual_seq_lengths_kv=actual_seq_lengths_kv,
        workspace=workspace,
        out=[attn_output, softmax_lse],
    )
    attn_output = attn_output.view(-1, layer.tp_q_head_num * layer.v_head_dim)
    if (
        not self.graph_mode
        and forward_batch.num_token_non_padded_cpu != num_token_padding
    ):
        attn_output = torch.cat(
            [
                attn_output,
                attn_output.new_zeros(
                    num_token_padding - attn_output.shape[0],
                    *attn_output.shape[1:],
                ),
            ],
            dim=0,
        )
    return attn_output
```
There is significant code duplication between the `if not self.use_mla:` block and the `else:` block. The logic for calculating `actual_seq_lengths_kv` and `actual_seq_lengths`, the graph-mode handling that determines `num_token_padding`, and the final padding of `attn_output` are all nearly identical.
To improve maintainability and reduce redundancy, this common logic should be refactored out of the `if`/`else` blocks. Compute these values once, and let the conditional branches focus only on what differs (the MHA vs. MLA attention computation). This will make the code cleaner and easier to maintain.
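A minimal sketch of that refactor, using the names from the diff above; the helper functions themselves are hypothetical, and `numpy`/`torch` are assumed to be available in the backend module:

```python
import numpy as np
import torch


def compute_actual_seq_lengths(
    forward_batch, forward_metadata, speculative_num_draft_tokens, num_query_tokens
):
    """Hypothetical helper hoisting the length logic shared by the MHA and MLA paths."""
    if forward_metadata.seq_lens_cpu_int is None:
        actual_seq_lengths_kv = forward_metadata.seq_lens_cpu_list
    else:
        actual_seq_lengths_kv = forward_metadata.seq_lens_cpu_int.cpu().int().tolist()
    if forward_batch.forward_mode.is_draft_extend():
        actual_seq_lengths = (
            np.array(forward_batch.extend_seq_lens_cpu).cumsum().tolist()
        )
    else:
        actual_seq_lengths = np.arange(
            speculative_num_draft_tokens,
            speculative_num_draft_tokens + num_query_tokens,
            speculative_num_draft_tokens,
        )
    return actual_seq_lengths, actual_seq_lengths_kv


def pad_attn_output(attn_output, num_token_padding):
    """Hypothetical helper for the shared zero-padding step after the kernel call."""
    if attn_output.shape[0] != num_token_padding:
        attn_output = torch.cat(
            [
                attn_output,
                attn_output.new_zeros(
                    num_token_padding - attn_output.shape[0], *attn_output.shape[1:]
                ),
            ],
            dim=0,
        )
    return attn_output
```

Each branch would then call these helpers and keep only its own kernel invocation.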
```diff
         out_cache_loc,
     )
-    out_cache_loc = out_cache_loc.to(dtype=torch.int64)
+    out_cache_loc = out_cache_loc.to(dtype=torch.int32)
```
/tag-and-rerun-ci
```python
if not _is_npu:
    batch.out_cache_loc = torch.empty(
        (bs * topk * num_steps,),
        dtype=torch.int64,
        device=batch.input_ids.device,
    )
else:
    batch.out_cache_loc = torch.empty(
        (bs * topk * num_steps,),
        dtype=torch.int32,
        device=batch.input_ids.device,
    )
```
Suggested change:

```diff
-if not _is_npu:
-    batch.out_cache_loc = torch.empty(
-        (bs * topk * num_steps,),
-        dtype=torch.int64,
-        device=batch.input_ids.device,
-    )
-else:
-    batch.out_cache_loc = torch.empty(
-        (bs * topk * num_steps,),
-        dtype=torch.int32,
-        device=batch.input_ids.device,
-    )
+batch.out_cache_loc = torch.empty(
+    (bs * topk * num_steps,),
+    dtype=cuda_graph_runner.get_cache_loc_dtype(),
+    device=batch.input_ids.device,
+)
```
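`get_cache_loc_dtype` is not defined anywhere in this diff; a minimal sketch of what such a helper could look like, assuming it can see the same `_is_npu` flag used above:

```python
import torch

# Placeholder: in the real runner this flag would come from the hardware backend.
_is_npu = False


def get_cache_loc_dtype() -> torch.dtype:
    # NPU kernels index the KV cache with int32 cache locations,
    # while the default CUDA path keeps int64.
    return torch.int32 if _is_npu else torch.int64
```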
```python
# In the Eagle3 scenario, the small model is unquantized.
if _disable_eagle3_quant:
    self.server_args.quantization = None
```

Should rework this modification here. Can we read the quant config from the Eagle model's config files?
Remove this import as well; theoretically, importing once during startup should work.
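A minimal illustration of that point (the import path is illustrative): resolve the flag once at module import time instead of re-importing inside the hot path.

```python
# Module level: evaluated once at startup.
from sglang.srt.utils import is_npu  # illustrative import path

_is_npu = is_npu()


def some_hot_path_function(batch):
    # No per-call import here; reuse the cached module-level flag.
    if _is_npu:
        ...
```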
@Liwansi Hi, could you put your startup parameters and hardware info in a README? I'm working on this PR to test performance. Thank you.
`--speculative-algorithm EAGLE3 --speculative-draft-model-path xxx/Qwen3-32B-Eagle3`
Thank you, but I already hit an error when starting up with these parameters; please give some guidance. Thank you! STARTUP PARAMETERS: ERRORS:
Please ensure that you are using the latest version of the code and that no residual processes remain in the environment. If you encounter further issues, please open an issue for tracking. Thank you!
```bash
export SGLANG_SET_CPU_AFFINITY=1
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export STREAMS_PER_DEVICE=32
export HCCL_BUFFSIZE=1536
```
We don't need that much buffer size for a dense model.
```bash
export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=1
export SGLANG_ENABLE_SPEC_V2=1

ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 python -m sglang.launch_server \
```
Don't set ASCEND_RT_VISIBLE_DEVICES here; use `--base-gpu-id` instead.
```python
# In speculative scenario:
# - If `speculative_draft_model_quantization` is specified, the draft model uses this quantization method.
# - Otherwise, the draft model defaults to the same quantization as the target model.
if model_config.is_draft_model:
    draft_quant = model_config.speculative_draft_model_quantization
    quantization = (
        None
        if draft_quant == "unquant"
        else draft_quant or model_config.quantization
    )
else:
    quantization = model_config.quantization
```

This part should be handled inside `server_args.py`, e.g.:

```python
if self.speculative_draft_model_quantization is None:
    self.speculative_draft_model_quantization = self.quantization
elif unquant:
    self.speculative_draft_model_quantization = None
```
```python
sampling_defaults=server_args.sampling_defaults,
quantize_and_serve=server_args.quantize_and_serve,
override_config_file=server_args.decrypted_config_file,
speculative_draft_model_quantization=server_args.speculative_draft_model_quantization,
```

Suggestion:

```python
quantization = (
    server_args.speculative_draft_model_quantization
    if is_draft_model
    else server_args.quantization
)
```
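A self-contained sketch of that one-liner, with a stand-in for the real `ServerArgs` class (the dataclass below is illustrative only):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgsStub:
    """Illustrative stand-in for the real ServerArgs."""

    quantization: Optional[str] = None
    speculative_draft_model_quantization: Optional[str] = None


def pick_quantization(server_args: ServerArgsStub, is_draft_model: bool) -> Optional[str]:
    # The reviewer's suggestion: choose the field at the call site so
    # ModelConfig itself never branches on draft-specific state.
    return (
        server_args.speculative_draft_model_quantization
        if is_draft_model
        else server_args.quantization
    )


args = ServerArgsStub(quantization="w8a8_int8")
assert pick_quantization(args, is_draft_model=False) == "w8a8_int8"
assert pick_quantization(args, is_draft_model=True) is None
```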
/tag-and-rerun-ci
@Liwansi Thanks for supporting this feature. I built the image from the main branch and followed the startup command, but ran into an error. Any guidance would be greatly appreciated. Thanks!
Could you please provide me with the complete error log and the script?
@Liwansi First, I'd like to confirm whether building the 910B image in this way is correct. It might be best to rule out any image build–related issues first.
OK. I pulled the latest image to reproduce your issue, but everything worked fine. Perhaps you could try pulling a fresh image with `docker image pull swr.cn-southwest-2.myhuaweicloud.com/base_image/dockerhub/lmsysorg/sglang:main-cann8.3.rc2-910b` and running it again.
Motivation
Enable SGLang Eagle3 on the NPU platform.
Tested models:
Qwen3-32B-Int8
Modifications
1. Add an MHA attention op in `forward_mtp` to support non-MLA models.
2. Modify `eagle_draft_npu_graph_runner.py` to support `speculative-num-steps > 1` in the Eagle3 scenario.
3. Add a `--speculative-draft-model-quantization` parameter to handle cases where the target and draft models use different quantization methods.
Rules for `speculative_draft_model_quantization` (see the sketch after this list):
- If `speculative_draft_model_quantization` is not specified, the draft model inherits the quantization config from the target model.
- If `speculative_draft_model_quantization` is `unquant` (the target model is quantized but the draft model is not), `speculative_draft_model_quantization` is reset to `None`.
- The resulting `speculative_draft_model_quantization` is then passed into the draft model's `model_config` to initialize its `quantization` entry.
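Read together, the rules amount to a small resolution step; a sketch (the helper is hypothetical, not the PR's exact code):

```python
from typing import Optional


def resolve_draft_quantization(
    draft_quant: Optional[str], target_quant: Optional[str]
) -> Optional[str]:
    """Hypothetical helper mirroring the three rules above."""
    if draft_quant is None:
        # Rule 1: not specified -> inherit the target model's quantization.
        return target_quant
    if draft_quant == "unquant":
        # Rule 2: target quantized, draft unquantized -> reset to None.
        return None
    # Otherwise the explicit method is passed through to the draft model_config.
    return draft_quant


# Example: int8 target model with an unquantized Eagle3 draft.
assert resolve_draft_quantization("unquant", "w8a8_int8") is None
assert resolve_draft_quantization(None, "w8a8_int8") == "w8a8_int8"
```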
Accuracy Tests
Qwen3-32B-Int8:

Qwen3-32B-Int8 with eagle:

Benchmarking and Profiling
Qwen3-32B-Int8:

Qwen3-32B-Int8 with eagle step 1:

Qwen3-32B-Int8 with eagle step 2:

Checklist