Checklist
Describe the bug
GLM5 FP8: MI355X is slower than H200 across all workloads
Reproduction
https://github.com/SemiAnalysisAI/InferenceX/blob/main/benchmarks/single_node/glm5_fp8_mi355x.sh
Logs for the MI355X run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22792161490/job/66170603760
```shell
# GLM-5 requires a transformers build with glm_moe_dsa model-type support,
# which the image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 does not ship.
python3 -m pip install -U --no-cache-dir \
    "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2"

export SGLANG_ROCM_FUSED_DECODE_MLA=0
export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export SAFETENSORS_FAST_GPU=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
```
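`start_gpu_monitor` is defined elsewhere in the benchmark harness. For reference, a minimal sketch of what such a helper can look like on ROCm (hypothetical implementation; assumes `rocm-smi` is on PATH and supports the `--showpower --showtemp --showclocks --csv` flags found in recent ROCm releases):

```shell
# Hypothetical stand-in for start_gpu_monitor: append power/temperature/clock
# readings to a CSV file once per second, and remember the background PID so
# the monitor can be stopped when the benchmark finishes.
start_gpu_monitor() {
  local log=${1:-/workspace/gpu_monitor.csv}
  (
    while true; do
      rocm-smi --showpower --showtemp --showclocks --csv >> "$log" 2>&1
      sleep 1
    done
  ) &
  GPU_MONITOR_PID=$!
}

stop_gpu_monitor() {
  [ -n "${GPU_MONITOR_PID:-}" ] && kill "$GPU_MONITOR_PID"
}
```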
```shell
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --tensor-parallel-size "$TP" \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang > "$SERVER_LOG" 2>&1 &
```
Environment
rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219