Checklist
Describe the bug
After applying the fix from "fix aiter failure at gfx90a" to the docker image "lmsysorg/sglang:v0.4.7-rocm630", single-GPU inference with sglang works. However, when the --tp-size option is used, the inference result is incorrect (garbled output).
Tested with Llama 3 8B, Llama 3 70B, and Llama 2 7B on a single MI250 node (8 GPUs).
This does not reproduce on MI300.
Reproduction
- docker pull lmsysorg/sglang:v0.4.7-rocm630
- fix the fp8.py code as suggested in the PR "fix aiter failure at gfx90a in docker"
- reinstall hipblaslt, since the docker image ships only the gfx942 build (apt remove hipblaslt; apt install hipblaslt)
- reinstall any packages that were removed along with hipblaslt
- (SERVER) python3 -m sglang.launch_server --attention-backend triton --sampling-backend pytorch --model-path /model/llama3_8b --host 0.0.0.0 --port 30000 --tp-size 8
- (CLIENT) run the test script below:
import requests
from sglang.utils import print_highlight

port = 30000
response = requests.post(
    f"http://localhost:{port}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print_highlight(response.json())
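As a quick pass/fail check (a sketch against the same /generate endpoint; expecting the greedy completion to mention "Paris" is an assumption about this prompt, not part of the original report):

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 32},
    },
)
text = response.json()["text"]
# A correct run should produce a sensible continuation; the garbled
# --tp-size output shown below fails this check.
assert "Paris" in text, f"unexpected completion: {text!r}"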
Sample result
# python3 -m sglang.launch_server --attention-backend triton --sampling-backend pytorch --model-path /model/llama3_8b --tp-size 8 --host 0.0.0.0 --port 30000
# python3 test_req.py
{'text': 'zemควควควemouthemouthemouthemouthemouthemouthemouthemouthemouthemouth442442442442ets759unganungan(___(___羊laceongyangongyangongyangongyang drill drill', 'meta_info': {'id': '548ae1102ed44f0a89a5dfb915ed4f40', 'finish_reason': {'type': 'length', 'length': 32}, 'prompt_tokens': 6, 'completion_tokens': 32, 'cached_tokens': 0, 'e2e_latency': 0.6615102291107178}}
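To confirm that tensor parallelism is the only variable, the same greedy request can be sent to the tp=8 server and to a single-GPU baseline (a sketch; the second server instance on port 30001 launched with --tp-size 1 is a hypothetical addition, not part of the original setup):

import requests

PROMPT = "The capital of France is"
PARAMS = {"temperature": 0, "max_new_tokens": 32}  # greedy, so outputs are deterministic

def generate(port: int) -> str:
    # Hit the /generate endpoint of a running sglang server on the given port.
    response = requests.post(
        f"http://localhost:{port}/generate",
        json={"text": PROMPT, "sampling_params": PARAMS},
    )
    return response.json()["text"]

tp8 = generate(30000)  # server from the reproduction steps (--tp-size 8)
tp1 = generate(30001)  # hypothetical single-GPU baseline (--tp-size 1)
print("tp=8:", tp8)
print("tp=1:", tp1)
print("outputs match:", tp8 == tp1)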
Environment
root@mi250:/sgl-workspace# python3 -m sglang.check_env
Python: 3.12.8 (main, Dec 4 2024, 08:54:12) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI250X/MI250
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.3.42131-fa1d09cbd
ROCM Driver Version: 6.8.5
PyTorch: 2.6.0a0+git8d4926e
sglang: 0.4.7
sgl_kernel: 0.1.7
flashinfer_python: Module Not Found
triton: 3.2.0+gitcddf0fc3
transformers: 4.52.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
python-multipart: 0.0.20
pyzmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0.rocm630
xgrammar: 0.1.19
openai: 1.85.0
tiktoken: 0.7.0
anthropic: 0.53.0
litellm: 1.72.2
decord: 0.6.0
AMD Topology:
============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0   0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU1   XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU2   XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI
GPU3   XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI
GPU4   XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI
GPU5   XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI
GPU6   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI
GPU7   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0
================================== End of ROCm SMI Log ===================================
ulimit soft: 1048576