
[ROCm] Enable MTP (NextN) on AMD GPU #3670

Closed

alexsun07 wants to merge 7 commits into sgl-project:main from alexsun07:main

Conversation

@alexsun07 (Contributor) commented Feb 18, 2025

Motivation

Support MTP (NextN) on AMD GPUs.

Modifications

MTP (NextN) relies on the build_tree_kernel and build_tree_kernel_efficient kernels. This PR adds them to torch_extension_rocm.cc so they are available in the ROCm build.
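
As a quick sanity check (not part of the change itself), the new bindings can be smoke-tested from Python on a ROCm build. The sgl_kernel import path below is an assumption based on the description above, not something this PR guarantees:

import torch

# Assumed to run on a ROCm build of PyTorch (torch.version.hip is set there).
assert torch.cuda.is_available() and torch.version.hip is not None, "expects a ROCm build"

# Hypothetical entry points for the newly bound ops; names taken from the PR description.
from sgl_kernel import build_tree_kernel, build_tree_kernel_efficient

print(build_tree_kernel, build_tree_kernel_efficient)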

Checklist

@HaiShaw (Collaborator) left a comment

LG

@zhyncs (Collaborator) left a comment

Please enable this for AMD

if torch.cuda.is_available() and torch.version.cuda:
    other_args.extend(
        [
            "--cuda-graph-max-bs",
            "2",
            "--disable-radix",
            "--enable-torch-compile",
            "--torch-compile-max-bs",
            "1",
            "--speculative-algorithm",
            "NEXTN",
            "--speculative-draft",
            "sgl-project/sglang-ci-dsv3-test-NextN",
            "--speculative-num-steps",
            "2",
            "--speculative-eagle-topk",
            "4",
            "--speculative-num-draft-tokens",
            "4",
        ]
    )
@zhyncs changed the title from "[ROCm] Support MTP (NextN) on AMD GPU" to "[ROCm] Enable MTP (NextN) on AMD GPU" on Feb 18, 2025
tot0 pushed a commit to tot0/sglang that referenced this pull request Feb 18, 2025
@tot0 commented Feb 19, 2025

The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung:
[screenshot]

@alexsun07 (Contributor, Author)

Tested on 8*MI308X GPUs.

# server
python3 -m sglang.launch_server --host 0.0.0.0 --trust-remote-code --tp 8 --model-path deepseek-ai/DeepSeek-V3  --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4  --mem-fraction-static 0.5

# bench
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256

@alexsun07 (Contributor, Author)

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]

Can you share more details? For example, your launch_server and bench commands.

@tot0 commented Feb 19, 2025

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]
>
> Can you share more details? For example, your launch_server and bench commands.

Yeah, sorry.

python3 -m sglang.launch_server --model-path /mnt/models/DeepSeek-R1/DeepSeek-R1/ --port 5000 --host 0.0.0.0 --served-model-name deepseek-r1 --trust-remote-code --tp-size 8 --enable-metrics --mem-fraction-static 0.7 --speculative-algo NEXTN --speculative-draft /mnt/models/DeepSeek-R1/DeepSeek-R1//MTP-NextN-Weights/ --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
I also tried --speculative-eagle-topk 2 --speculative-num-draft-tokens 4; it booted and logged a Prefill batch, then hung again.

I generated the nextn weights using scripts/export_deepseek_nextn.py.

Haven't had time to dig deeper yet, e.g. using something like py-spy to attach to the Python processes and inspect the current stack.
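
For reference, a typical way to do that, assuming py-spy is installed and you know the PID of the stuck process (the PID placeholder below is hypothetical):

py-spy dump --pid <scheduler-pid>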

@alexsun07 (Contributor, Author)

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]
>
> Can you share more details? For example, your launch_server and bench commands.
>
> Yeah, sorry.
>
> python3 -m sglang.launch_server --model-path /mnt/models/DeepSeek-R1/DeepSeek-R1/ --port 5000 --host 0.0.0.0 --served-model-name deepseek-r1 --trust-remote-code --tp-size 8 --enable-metrics --mem-fraction-static 0.7 --speculative-algo NEXTN --speculative-draft /mnt/models/DeepSeek-R1/DeepSeek-R1//MTP-NextN-Weights/ --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
>
> I also tried --speculative-eagle-topk 2 --speculative-num-draft-tokens 4; it booted and logged a Prefill batch, then hung again.
>
> I generated the nextn weights using scripts/export_deepseek_nextn.py.
>
> Haven't had time to dig deeper yet, e.g. using something like py-spy to attach to the Python processes and inspect the current stack.

I cannot reproduce this, and I don't have MI300X access. Can you try a different machine or environment? Since the server boots successfully, the kernels this commit adds should be working.

@zhaochenyang20 (Collaborator)

@HaiShaw Hai, could you take a look? Thanks!

@andyluo7 (Contributor)

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]
>
> Can you share more details? For example, your launch_server and bench commands.

I can run it on MI300X; however, I got lower perf with it than without it.

I used 0.4.3post2-rocm630.

Server cmdline:
python3 -m sglang.launch_server --host 0.0.0.0 --trust-remote-code --tp 8 --model-path /models/DeepSeek-R1 --speculative-algo NEXTN --speculative-draft /models/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --mem-fraction-static 0.5

Client cmdline:
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 3200 --output-len 800

Perf with and without NextN:

| Config        | Batch size | Latency (s) | Output throughput (token/s) | (Input + output) throughput (token/s) |
|---------------|------------|-------------|-----------------------------|----------------------------------------|
| With NextN    | 16         | 8.79        | 29.14                       | 1894.00                                |
| With NextN    | 1          | 50.04       | 15.99                       | 79.93                                  |
| Without NextN | 16         | 6.94        | 36.87                       | 2396.78                                |
| Without NextN | 1          | 26.61       | 30.06                       | 150.30                                 |

Do you get better perf with it on AMD GPU?

@alexsun07 (Contributor, Author)

> Do you get better perf with it on AMD GPU?

@andyluo7 Yes, with --batch-size 1 --input-len 256 --output-len 256 on MI308X.

@zhaochenyang20 (Collaborator)

@yiakwy-xpu-ml-framework-team @HaiShaw Could you take a look at this? Thanks!

@yiakwy-xpu-ml-framework-team (Contributor) commented Feb 21, 2025

> @yiakwy-xpu-ml-framework-team @HaiShaw Could you take a look at this? Thanks!

Thank you @zhaochenyang20 for including me. Yes, I am watching it; really great job @alexsun07.

I am taking a look at the algorithm design, if that is helpful to facilitate merging.

vLLM is also working on this important feature with a different approach.

@HaiShaw (Collaborator) left a comment

Please change/update sgl-project/sglang-ci-dsv3-test-NextN and sgl-project/sglang-ci-dsv3-test in the test script.

@zhaochenyang20 (Collaborator)

@yiakwy-xpu-ml-framework-team, what shall we do next?

@HaiShaw (Collaborator) commented Feb 23, 2025

@zhaochenyang20 I am looking into it.

@HaiShaw (Collaborator) commented Feb 23, 2025

Hit an out-of-bounds error when running the gsm8k tests.

python3 -m sglang.launch_server --host 0.0.0.0 --trust-remote-code --tp 8 --model-path /data/deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --mem-fraction-static 0.7 --disable-radix

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

[2025-02-23 09:27:42 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 479, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 1120, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 1181, in process_batch_result_prefill
    self.tree_cache.cache_unfinished_req(req)
  File "/dockerx/0222/sglang/python/sglang/srt/mem_cache/chunk_cache.py", line 62, in cache_unfinished_req
    kv_indices = self.req_to_token_pool.req_to_token[
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index 2280 is out of bounds for dimension 0 with size 2280

@zhaochenyang20 (Collaborator)

Sure, thanks!

@alexsun07 (Contributor, Author)

> Hit an out-of-bounds error when running the gsm8k tests.

This issue has been fixed by merging upstream.
The root cause is that, in the MTP case, req_to_token_pool failed to reset its req_pool_idx, so req_pool_idx eventually went past the bound and hit the IndexError you saw. The community has since completely reworked the original logic, and the issue is resolved now that I have merged upstream into my repo.
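
For illustration only, a hypothetical minimal sketch of the failure mode (not the actual pool code): once a slot index is never reset or recycled, it eventually equals the pool size, and indexing the req_to_token table raises exactly this kind of IndexError.

import torch

pool_size = 2280
req_to_token = torch.zeros(pool_size, 4, dtype=torch.int64)  # stand-in for the real table

req_pool_idx = pool_size  # an index that was never reset has walked one past the last row
try:
    kv_indices = req_to_token[req_pool_idx]
except IndexError as e:
    print(e)  # index 2280 is out of bounds for dimension 0 with size 2280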

@alexsun07 requested a review from HaiShaw on March 19, 2025, 08:04
@HaiShaw (Collaborator) left a comment

Glad that issue was addressed.

@HaiShaw (Collaborator) commented Mar 19, 2025

@saienduri can we have test/srt/test_mla_deepseek_v3.py covered?

Comment thread on python/sglang/srt/speculative/eagle_utils.py (outdated)
@alexsun07 (Contributor, Author)

> Please enable this for AMD

Added in the test. @zhyncs, would you please review again?

@alexsun07 (Contributor, Author) commented Mar 20, 2025

Didn't notice that I shouldn't use the main branch. Closing this one; tracking with the new PR: #4631.
