[Feature] Support DeepSeek MTP on NPU #11897

Merged
hnyls2002 merged 18 commits into sgl-project:main from iforgetmyname:feature/mtp
Oct 30, 2025

@iforgetmyname
Collaborator

@iforgetmyname iforgetmyname commented Oct 21, 2025

Motivation

This PR primarily aims to support DeepSeek's MTP (multi-token prediction) on Ascend NPUs.

Modifications

  • Introduces NPU support for the newest EAGLE framework
  • Includes Ascend-specific ops for draft tree build/verify

Accuracy Tests

[Image: pr]

Benchmarking and Profiling

Checklist

@iforgetmyname iforgetmyname changed the title from "[Feature] Support MTP on NPU" to "[Feature] Support DeepSeek MTP on NPU" Oct 22, 2025
@Alcanderian
Collaborator

Can we add a PR test for NPU MTP?

@Yellowhappy

Hi, which DeepSeek model is this, and is it running on one machine or two?

@iforgetmyname iforgetmyname marked this pull request as draft October 25, 2025 02:14
@iforgetmyname iforgetmyname marked this pull request as ready for review October 25, 2025 03:04
@iforgetmyname
Collaborator Author

> Hi, which DeepSeek model is this, and is it running on one machine or two?

Hi, this supports both V3 and V3.2, and it can run on a single machine if HBM capacity allows.

@iforgetmyname
Collaborator Author

> Can we add a PR test for NPU MTP?

For sure, we now have test_ascend_deepseek_mtp.py for pr-test.

Comment thread python/sglang/srt/speculative/eagle_info.py Outdated
Comment thread python/sglang/srt/speculative/eagle_info.py Outdated
Comment thread python/sglang/srt/speculative/eagle_info.py Outdated
Comment thread python/sglang/srt/speculative/eagle_info.py Outdated
Comment thread python/sglang/srt/speculative/eagle_info_v2.py Outdated
Comment thread python/sglang/srt/speculative/eagle_info_v2.py Outdated
Comment thread python/sglang/srt/speculative/eagle_utils.py Outdated
Comment thread python/sglang/srt/speculative/eagle_utils.py
Comment thread python/sglang/srt/speculative/eagle_worker.py Outdated
Comment thread python/sglang/srt/speculative/spec_utils.py Outdated
@hnyls2002 hnyls2002 merged commit ce6b17c into sgl-project:main Oct 30, 2025
55 of 71 checks passed
export PATH="/usr/local/Ascend/8.3.RC1/compiler/bishengir/bin:${PATH}"
cd test/srt
- python3 run_suite.py --suite per-commit-16-ascend-a3 --timeout-per-file 3600
+ python3 run_suite.py --suite per-commit-16-ascend-a3 --timeout-per-file 3600 --auto-partition-id ${{ matrix.part }} --auto-partition-size 2
Member

- If a single test file runs longer than 500 seconds, split it into multiple smaller files (e.g., `test_eagle_infer_a.py`, `test_eagle_infer_b.py`).
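
To illustrate why one long file hurts a partitioned suite, here is a minimal sketch of balancing test files across partitions by estimated runtime. It is not the logic used by run_suite.py. The function names, the greedy longest-first strategy, the file names, and the timings are all assumptions made for this example; only the 500-second limit comes from the guideline above.

```python
def over_budget(estimated_seconds: dict[str, float], limit: float = 500.0) -> list[str]:
    """Return the files whose estimated runtime exceeds the per-file limit."""
    return sorted(name for name, secs in estimated_seconds.items() if secs > limit)


def partition_tests(estimated_seconds: dict[str, float], size: int) -> list[list[str]]:
    """Greedy longest-first assignment of files to `size` partitions (illustrative only)."""
    partitions: list[list[str]] = [[] for _ in range(size)]
    totals = [0.0] * size
    # Place the longest files first; break ties by name so the result is deterministic.
    ordered = sorted(estimated_seconds.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, secs in ordered:
        idx = totals.index(min(totals))  # currently lightest partition
        partitions[idx].append(name)
        totals[idx] += secs
    return partitions


if __name__ == "__main__":
    # Hypothetical timings, in seconds.
    estimates = {"test_a.py": 300, "test_b.py": 200, "test_c.py": 100, "test_d.py": 100}
    print(partition_tests(estimates, 2))
    print(over_budget({"test_big.py": 900, "test_small.py": 60}))
```

A single file that dominates the total runtime cannot be balanced by any assignment, which is one reason to split it into smaller files.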

if not _is_npu:
    device: str = "cuda"
else:
    device: str = "npu"
Member

  1. Set this value to "npu" when you create the object, instead of adding an if/else here.
  2. Or do this:
    device: str = "cuda" if not is_npu else "npu"
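
Both options can be sketched together: resolve the device string once, then use it as a field default that callers may override at construction time. This is only an illustration under stated assumptions. `resolve_device`, `DEFAULT_DEVICE`, and `DraftState` are hypothetical names, and `_is_npu` is a stand-in for the platform flag seen in the diff, not the PR's actual code.

```python
from dataclasses import dataclass


def resolve_device(is_npu: bool) -> str:
    """Map the platform flag to a device string using one conditional expression."""
    return "npu" if is_npu else "cuda"


# Stand-in for the module-level platform flag (assumption for this sketch).
_is_npu = False

# Computed once at import time so the class body needs no if/else.
DEFAULT_DEVICE = resolve_device(_is_npu)


@dataclass
class DraftState:  # hypothetical class, not the one touched by the PR
    device: str = DEFAULT_DEVICE


if __name__ == "__main__":
    print(DraftState())              # uses the resolved default
    print(DraftState(device="npu"))  # option 1: caller sets it at creation
```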

self.lm_head.weight = head
torch.cuda.empty_cache()
torch.cuda.synchronize()
if not _is_npu:
Member

remove if/else

if not _is_npu:
    device = "cuda"
else:
    device = "npu"
Member

Can this be read from the global variable instead?

)

- if is_all_greedy or not TREE_SPEC_KERNEL_AVAILABLE:
+ if is_all_greedy or not TREE_SPEC_KERNEL_AVAILABLE or _is_npu:
Member

Style: use a more general field to replace is_npu.
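
One way to read this suggestion is to fold the platform check into a capability flag, so that call sites ask "is the tree sampling kernel usable?" rather than "is this an NPU?". The sketch below assumes that reading. `SpecCapabilities`, `detect_capabilities`, and `should_verify_greedily` are hypothetical names, and the rule that NPU lacks the kernel is taken only from the condition in the diff above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecCapabilities:
    """Hypothetical capability record, resolved once at startup."""
    tree_sampling_kernel: bool


def detect_capabilities(is_npu: bool, kernel_imported: bool) -> SpecCapabilities:
    # The platform check lives here only; the rest of the code sees a capability.
    return SpecCapabilities(tree_sampling_kernel=kernel_imported and not is_npu)


def should_verify_greedily(is_all_greedy: bool, caps: SpecCapabilities) -> bool:
    """Fall back to greedy verification when the sampling kernel is unavailable."""
    return is_all_greedy or not caps.tree_sampling_kernel


if __name__ == "__main__":
    npu_caps = detect_capabilities(is_npu=True, kernel_imported=True)
    gpu_caps = detect_capabilities(is_npu=False, kernel_imported=True)
    print(should_verify_greedily(False, npu_caps))  # NPU: greedy path
    print(should_verify_greedily(False, gpu_caps))  # GPU with kernel: sampling path
```

With this shape, supporting another backend means changing `detect_capabilities` rather than every call site.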


# Sample tokens
- if sampling_info.is_all_greedy:
+ if sampling_info.is_all_greedy or _is_npu:
Member

do not use is_npu

bs,
)
else:
sgl_build_tree_kernel_efficient(
Member

The GPU code should be in the first branch of the if/else.

if _is_cuda or _is_hip:
    from sgl_kernel import verify_tree_greedy

    verify_tree_greedy(
Member

You can try adding more arguments to the sgl kernel and removing these.

@iforgetmyname iforgetmyname deleted the feature/mtp branch November 3, 2025 01:03
@wangtiance
Contributor

Hello, is DeepSeek the only model supporting speculative decoding on NPU? Will Qwen3 etc. be supported?

8 participants