Checklist
Describe the bug
After applying the fix from "fix aiter failure at gfx90a" to the docker image "lmsysorg/sglang:v0.4.7-rocm630", single-GPU inference with sglang works. However, when the --tp-size option is used, the inference result is incorrect (garbled output).
Tested with Llama 3 8B, Llama 3 70B, and Llama 2 7B on a single MI250 node (8 GPUs).
This does not reproduce on MI300.
Reproduction
- docker pull lmsysorg/sglang:v0.4.7-rocm630
- fix the fp8.py code as suggested in the PR "fix aiter failure at gfx90a in docker"
- reinstall hipblaslt, since the docker image ships only the gfx942 build (apt remove hipblaslt; apt install hipblaslt)
- reinstall any packages that were removed along with hipblaslt
- (SERVER) python3 -m sglang.launch_server --attention-backend triton --sampling-backend pytorch --model-path /model/llama3_8b --host 0.0.0.0 --port 30000 --tp-size 8
- (CLIENT) run the test script below:
import requests
from sglang.utils import print_highlight

port = 30000
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print_highlight(response.json())
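As a quick pass/fail check (a sketch against the same /generate endpoint; expecting the greedy completion to mention "Paris" is an assumption about this prompt, not part of the original report):

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
text = response.json()["text"]
# A correct run should produce a sensible continuation; the garbled
# --tp-size output shown below fails this check.
assert "Paris" in text, f"unexpected completion: {text!r}"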
Sample result
# python3 -m sglang.launch_server --attention-backend triton --sampling-backend pytorch --model-path /model/llama3_8b --tp-size 8 --host 0.0.0.0 --port 30000
# python3 test_req.py
{'text': 'zemควควควemouthemouthemouthemouthemouthemouthemouthemouthemouthemouth442442442442ets759unganungan(___(___羊laceongyangongyangongyangongyang drill drill', 'meta_info': {'id': '548ae1102ed44f0a89a5dfb915ed4f40', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 6, 'completion_tokens': 32, 'cached_tokens': 0, 'e2e_latency': 0.6615102291107178}}
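To confirm that tensor parallelism is the only variable, the same greedy request can be sent to the tp=8 server and to a single-GPU baseline (a sketch; the second server instance on port 30001 launched with --tp-size 1 is a hypothetical addition, not part of the original setup):

import requests

PROMPT = "The capital of France is"
PARAMS = {"temperature": 0, "max_new_tokens": 32}  # greedy, so outputs are deterministic

def generate(port: int) -> str:
    # Hit the /generate endpoint of a running sglang server on the given port.
    response = requests.post(
        f"http://localhost:{port}/generate",
        json={"text": PROMPT, "sampling_params": PARAMS},
    )
    return response.json()["text"]

tp8 = generate(30000)  # server from the reproduction steps (--tp-size 8)
tp1 = generate(30001)  # hypothetical single-GPU baseline (--tp-size 1)
print("tp=8:", tp8)
print("tp=1:", tp1)
print("outputs match:", tp8 == tp1)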
Environment
root@mi250:/sgl-workspace# python3 -m sglang.check_env
Python: 3.12.8 (main, Dec 4 2024, 08:54:12) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI250X/MI250
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version: 6.8.5
PyTorch: 2.6.0a0+git8d4926e
sglang: 0.4.7
sgl_kernel: 0.1.7
flashinfer_python: Module Not Found
triton: 3.2.0+gitcddf0fc3
transformers: 4.52.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
python-multipart: 0.0.20
pyzmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0.rocm630
xgrammar: 0.1.19
openai: 1.85.0
tiktoken: 0.7.0
anthropic: 0.53.0
litellm: 1.72.2
decord: 0.6.0
AMD Topology:
============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0   0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU1   XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU2   XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI
GPU3   XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI
GPU4   XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI
GPU5   XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI
GPU6   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI
GPU7   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0
================================== End of ROCm SMI Log ===================================
ulimit soft: 1048576