
[Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm#1420

Merged
merrymercy merged 8 commits into sgl-project:main from HaiShaw:main
Sep 17, 2024

Conversation

@HaiShaw (Collaborator) commented Sep 14, 2024

Motivation

  • Enable SGLang on AMD GPUs

Modifications

  • Bypass the FlashInfer backend until it is available on AMD/ROCm

  • Add a proper fix for AMD FP8 e4m3fnuz to support Fused_MoE

  • Depends on vLLM>=0.5.5; I modified pyproject.toml to confirm it also works up to 0.6.0.

  • Misc.

  • TODO: follow-up to address one error (below) when cuda-graph is enabled.

File "/sglang/python/sglang/srt/layers/sampler.py", line 164, in top_k_top_p_min_p_sampling_from_probs_torch
    min_p_thresholds = probs_sort[:, 0] * min_ps
TypeError: unsupported operand type(s) for *: 'Tensor' and 'NoneType' (where min_ps is None)
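The failure is an unguarded multiply when no min-p values were supplied. A minimal sketch of the defensive guard (apply_min_p is a hypothetical helper for illustration, not SGLang's actual code):

```python
import torch

def apply_min_p(probs_sort: torch.Tensor, min_ps) -> torch.Tensor:
    """Zero out tokens whose probability falls below min_p * top_prob.

    Hypothetical helper illustrating the guard. probs_sort is assumed to be
    sorted in descending order along dim 1; min_ps holds one value per row,
    or None when no min-p filtering was requested.
    """
    if min_ps is None:
        # Unguarded code multiplied by None here, raising the TypeError above.
        return probs_sort
    min_p_thresholds = probs_sort[:, 0] * min_ps  # per-row threshold
    return probs_sort.masked_fill(probs_sort < min_p_thresholds.unsqueeze(1), 0.0)

probs = torch.tensor([[0.5, 0.3, 0.2]])
print(apply_min_p(probs, None))                 # unchanged, no error
print(apply_min_p(probs, torch.tensor([0.5])))  # 0.2 < 0.25 is zeroed
```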

How to run

  • An example on one MI3xx node (not a performance benchmark):
root@x:/sglang# VLLM_MOE_PADDING=0 python -m sglang.bench_latency --batch-size 32 --input 1024 --output 8 --model dummy_half_grok1/ --tokenizer-path Xenova/grok-1-tokenizer --load-format dummy --tp 8 --quant fp8  --attention-backend triton --sampling-backend  pytorch --disable-cuda-graph
Warmup ...
Prefill. latency: 25.79838 s, throughput:   1270.16 token/s
Decode.  latency: 0.53607 s, throughput:     59.69 token/s
Decode.  latency: 0.31325 s, throughput:    102.15 token/s
Decode.  latency: 0.04105 s, throughput:    779.47 token/s
Decode.  latency: 0.04075 s, throughput:    785.31 token/s
Decode.  median latency: 0.17715 s, median throughput:    180.64 token/s
Total. latency: 26.730 s, throughput:   1230.70 token/s
Benchmark ...
Prefill. latency: 1.11868 s, throughput:  29291.72 token/s
Decode.  latency: 0.02593 s, throughput:   1234.10 token/s
Decode.  latency: 0.02575 s, throughput:   1242.87 token/s
Decode.  latency: 0.02559 s, throughput:   1250.46 token/s
Decode.  latency: 0.02646 s, throughput:   1209.23 token/s
Decode.  latency: 0.02574 s, throughput:   1243.42 token/s
Decode.  median latency: 0.02574 s, median throughput:   1243.15 token/s
Total. latency:  1.325 s, throughput:  24925.60 token/s
root@x:/sglang#

Checklist

  • [x] Format your code according to the Contributor Guide.
  • [x] Add unit tests as outlined in the Contributor Guide.
  • [x] Update documentation as needed, including docstrings or example tutorials.

@HaiShaw HaiShaw changed the title Enable SGLang on AMD GPUs via PyTorch for ROCm (#1419) Enable SGLang on AMD GPUs via PyTorch for ROCm Sep 14, 2024
@Ying1123 Ying1123 mentioned this pull request Sep 14, 2024
@merrymercy (Contributor) left a comment


  1. We deprecated --disable-flashinfer and --disable-flashinfer-sampling. Please use --attention-backend triton --sampling-backend pytorch instead.
  2. Alternatively, you can set these backends automatically here https://github.com/HaiShaw/sglang/blob/8715deff22727382ce6f74768213e3e72f413f71/python/sglang/srt/server_args.py#L155
if is_hip():
    self.attention_backend = "triton"
    self.sampling_backend = "pytorch"
  3. The sampler + cuda-graph issue has been fixed by #1392. You can rebase and try it again:
     python -m sglang.bench_latency --model-path TinyLlama/TinyLlama-1.1B-Chat-v0.4 --attention-backend triton --sampling-backend pytorch
     This runs correctly.
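The is_hip() check in the snippet above can be as simple as inspecting which backend PyTorch was built for; a minimal sketch (an assumption for illustration, not necessarily the exact helper in SGLang/vLLM):

```python
import torch

def is_hip() -> bool:
    # torch.version.hip is a version string on ROCm (HIP) builds of PyTorch
    # and None on CUDA or CPU-only builds.
    return torch.version.hip is not None

print(is_hip())  # False on a CUDA or CPU-only build, True on ROCm
```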

@HaiShaw HaiShaw requested a review from merrymercy September 16, 2024 05:21
@Ying1123 Ying1123 changed the title Enable SGLang on AMD GPUs via PyTorch for ROCm [Feature, Hardware] Enable SGLang on AMD GPUs via PyTorch for ROCm Sep 16, 2024
@merrymercy merrymercy enabled auto-merge (squash) September 17, 2024 07:31
@merrymercy merrymercy merged commit 3a6e041 into sgl-project:main Sep 17, 2024
@linqingxu

On ROCm 6.1.2:
Memory access fault by GPU node-1 (Agent handle: 0x5629173ccc50) on address 0x7fcbccfeb000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@merrymercy (Contributor) commented

@linqingxu try --disable-cuda-graph for now? We are working on more fixes.

