
[ROCm] Enable MTP (NextN) on AMD GPU #3670

Closed

alexsun07 wants to merge 7 commits into sgl-project:main from alexsun07:main

Conversation

@alexsun07 (Contributor) commented Feb 18, 2025

Motivation

Support MTP (NextN) on AMD GPUs.

Modifications

MTP (NextN) relies on the build_tree_kernel and build_tree_kernel_efficient kernels. This PR adds them to torch_extension_rocm.cc so they are available in the ROCm build.
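
As a quick sanity check (not part of the change itself), the new bindings can be smoke-tested from Python on a ROCm build. The sgl_kernel import path below is an assumption based on the description above, not something this PR guarantees:

import torch

# Assumed to run on a ROCm build of PyTorch (torch.version.hip is set there).
assert torch.cuda.is_available() and torch.version.hip is not None, "expects a ROCm build"

# Hypothetical entry points for the newly bound ops; names taken from the PR description.
from sgl_kernel import build_tree_kernel, build_tree_kernel_efficient

print(build_tree_kernel, build_tree_kernel_efficient)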

Checklist

@HaiShaw (Collaborator) left a comment

LG

@zhyncs (Collaborator) left a comment

Please enable this for AMD

if torch.cuda.is_available() and torch.version.cuda:
    other_args.extend(
        [
            "--cuda-graph-max-bs",
            "2",
            "--disable-radix",
            "--enable-torch-compile",
            "--torch-compile-max-bs",
            "1",
            "--speculative-algorithm",
            "NEXTN",
            "--speculative-draft",
            "sgl-project/sglang-ci-dsv3-test-NextN",
            "--speculative-num-steps",
            "2",
            "--speculative-eagle-topk",
            "4",
            "--speculative-num-draft-tokens",
            "4",
        ]
    )
@zhyncs changed the title from "[ROCm] Support MTP (NextN) on AMD GPU" to "[ROCm] Enable MTP (NextN) on AMD GPU" on Feb 18, 2025
tot0 pushed a commit to tot0/sglang that referenced this pull request Feb 18, 2025
@tot0 commented Feb 19, 2025

The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung:
[screenshot]

@alexsun07 (Contributor, Author)

Tested on 8*MI308X GPUs.

# server
python3 -m sglang.launch_server --host 0.0.0.0 --trust-remote-code --tp 8 --model-path deepseek-ai/DeepSeek-V3  --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4  --mem-fraction-static 0.5

# bench
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 256 --output-len 256

@alexsun07 (Contributor, Author)

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]

Can you share more details? For example, your launch_server and bench commands.

@tot0 commented Feb 19, 2025

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]
>
> Can you share more details? For example, your launch_server and bench commands.

Yeah, sorry.

python3 -m sglang.launch_server --model-path /mnt/models/DeepSeek-R1/DeepSeek-R1/ --port 5000 --host 0.0.0.0 --served-model-name deepseek-r1 --trust-remote-code --tp-size 8 --enable-metrics --mem-fraction-static 0.7 --speculative-algo NEXTN --speculative-draft /mnt/models/DeepSeek-R1/DeepSeek-R1//MTP-NextN-Weights/ --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
I also tried --speculative-eagle-topk 2 --speculative-num-draft-tokens 4; it booted and logged a Prefill batch, then hung again.

I generated the nextn weights using scripts/export_deepseek_nextn.py.

Haven't had time to dig deeper yet, e.g. using something like py-spy to attach to the Python processes and inspect the current stack.
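
For reference, a typical way to do that, assuming py-spy is installed and you know the PID of the stuck process (the PID placeholder below is hypothetical):

py-spy dump --pid <scheduler-pid>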

@alexsun07 (Contributor, Author)

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]
>
> Can you share more details? For example, your launch_server and bench commands.
>
> Yeah, sorry.
>
> python3 -m sglang.launch_server --model-path /mnt/models/DeepSeek-R1/DeepSeek-R1/ --port 5000 --host 0.0.0.0 --served-model-name deepseek-r1 --trust-remote-code --tp-size 8 --enable-metrics --mem-fraction-static 0.7 --speculative-algo NEXTN --speculative-draft /mnt/models/DeepSeek-R1/DeepSeek-R1//MTP-NextN-Weights/ --speculative-num-steps 2 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2
>
> I also tried --speculative-eagle-topk 2 --speculative-num-draft-tokens 4; it booted and logged a Prefill batch, then hung again.
>
> I generated the nextn weights using scripts/export_deepseek_nextn.py.
>
> Haven't had time to dig deeper yet, e.g. using something like py-spy to attach to the Python processes and inspect the current stack.

I cannot reproduce this, and I don't have MI300X access. Can you try a different machine or environment? Since the server boots successfully, the kernels this commit adds should be working.

@zhaochenyang20 (Collaborator)

@HaiShaw Hai, could you take a look? Thanks!

@andyluo7 (Contributor)

> The first time I tested this commit (on an 8*MI300X node) I got two Decode Batch logs and then sglang hung: [screenshot]
>
> Can you share more details? For example, your launch_server and bench commands.

I can run it on MI300X; however, I got lower perf with it than without it.

I used 0.4.3post2-rocm630.

Server cmdline:
python3 -m sglang.launch_server --host 0.0.0.0 --trust-remote-code --tp 8 --model-path /models/DeepSeek-R1 --speculative-algo NEXTN --speculative-draft /models/DeepSeek-R1-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --mem-fraction-static 0.5

Client cmdline:
python3 -m sglang.bench_one_batch_server --model None --base-url http://127.0.0.1:30000 --batch-size 1 --input-len 3200 --output-len 800

Perf with and without NextN:

| Config        | Batch size | Latency (s) | Output throughput (token/s) | (Input + output) throughput (token/s) |
|---------------|------------|-------------|-----------------------------|----------------------------------------|
| With NextN    | 16         | 8.79        | 29.14                       | 1894.00                                |
| With NextN    | 1          | 50.04       | 15.99                       | 79.93                                  |
| Without NextN | 16         | 6.94        | 36.87                       | 2396.78                                |
| Without NextN | 1          | 26.61       | 30.06                       | 150.30                                 |

Do you get better perf with it on AMD GPU?

@alexsun07 (Contributor, Author)

> Do you get better perf with it on AMD GPU?

@andyluo7 Yes, with --batch-size 1 --input-len 256 --output-len 256 on MI308X.

@zhaochenyang20 (Collaborator)

@yiakwy-xpu-ml-framework-team @HaiShaw Could you take a look at this? Thanks!

@yiakwy-xpu-ml-framework-team (Contributor) commented Feb 21, 2025

> @yiakwy-xpu-ml-framework-team @HaiShaw Could you take a look at this? Thanks!

Thank you @zhaochenyang20 for including me. Yes, I am watching it; really great job @alexsun07.

I am taking a look at the algorithm design, if that is helpful to facilitate merging.

vLLM is also working on this important feature with a different approach.

@HaiShaw (Collaborator) left a comment

Please change/update sgl-project/sglang-ci-dsv3-test-NextN and sgl-project/sglang-ci-dsv3-test in the test script.

@zhaochenyang20 (Collaborator)

@yiakwy-xpu-ml-framework-team, what shall we do next?

@HaiShaw (Collaborator) commented Feb 23, 2025

@zhaochenyang20 I am looking into it.

@HaiShaw (Collaborator) commented Feb 23, 2025

Hit an out-of-bounds error when running the gsm8k tests.

python3 -m sglang.launch_server --host 0.0.0.0 --trust-remote-code --tp 8 --model-path /data/deepseek-ai/DeepSeek-V3 --speculative-algo NEXTN --speculative-draft SGLang/DeepSeek-V3-NextN --speculative-num-steps 2 --speculative-eagle-topk 4 --speculative-num-draft-tokens 4 --mem-fraction-static 0.7 --disable-radix

python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000

[2025-02-23 09:27:42 TP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 1827, in run_scheduler_process
    scheduler.event_loop_normal()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 479, in event_loop_normal
    self.process_batch_result(batch, result)
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 1120, in process_batch_result
    self.process_batch_result_prefill(batch, result)
  File "/dockerx/0222/sglang/python/sglang/srt/managers/scheduler.py", line 1181, in process_batch_result_prefill
    self.tree_cache.cache_unfinished_req(req)
  File "/dockerx/0222/sglang/python/sglang/srt/mem_cache/chunk_cache.py", line 62, in cache_unfinished_req
    kv_indices = self.req_to_token_pool.req_to_token[
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index 2280 is out of bounds for dimension 0 with size 2280

@zhaochenyang20 (Collaborator)

Sure, thanks!

@alexsun07 (Contributor, Author)

> Hit an out-of-bounds error when running the gsm8k tests.

This issue has been fixed by merging upstream.
The root cause is that, in the MTP case, req_to_token_pool failed to reset its req_pool_idx, so req_pool_idx eventually went past the bound and hit the IndexError you saw. The community has since completely reworked the original logic, and the issue is resolved now that I have merged upstream into my repo.
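
For illustration only, a hypothetical minimal sketch of the failure mode (not the actual pool code): once a slot index is never reset or recycled, it eventually equals the pool size, and indexing the req_to_token table raises exactly this kind of IndexError.

import torch

pool_size = 2280
req_to_token = torch.zeros(pool_size, 4, dtype=torch.int64)  # stand-in for the real table

req_pool_idx = pool_size  # an index that was never reset has walked one past the last row
try:
    kv_indices = req_to_token[req_pool_idx]
except IndexError as e:
    print(e)  # index 2280 is out of bounds for dimension 0 with size 2280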

@alexsun07 requested a review from HaiShaw on March 19, 2025, 08:04
@HaiShaw (Collaborator) left a comment

Glad that issue was addressed.

@HaiShaw (Collaborator) commented Mar 19, 2025

@saienduri can we have test/srt/test_mla_deepseek_v3.py covered?

Comment thread on python/sglang/srt/speculative/eagle_utils.py (outdated)
@alexsun07 (Contributor, Author)

> Please enable this for AMD

Added in the test. @zhyncs, would you please review again?

@alexsun07 (Contributor, Author) commented Mar 20, 2025

Didn't notice that I shouldn't use the main branch. Closing this one; tracking with the new PR: #4631.
