Checklist
Describe the bug
GLM5 FP8: MI355X is slower than H200 across all workloads
Reproduction
https://github.com/SemiAnalysisAI/InferenceX/blob/main/benchmarks/single_node/glm5_fp8_mi355x.sh
Logs for the MI355X run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22792161490/job/66170603760
```shell
# GLM-5 requires a transformers build with glm_moe_dsa model-type support,
# which the image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 does not ship.
python3 -m pip install -U --no-cache-dir \
    "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2"

export SGLANG_ROCM_FUSED_DECODE_MLA=0
export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export SAFETENSORS_FAST_GPU=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor
```
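`start_gpu_monitor` is defined elsewhere in the benchmark harness. For reference, a minimal sketch of what such a helper can look like on ROCm (hypothetical implementation; assumes `rocm-smi` is on PATH and supports the `--showpower --showtemp --showclocks --csv` flags found in recent ROCm releases):

```shell
# Hypothetical stand-in for start_gpu_monitor: append power/temperature/clock
# readings to a CSV file once per second, and remember the background PID so
# the monitor can be stopped when the benchmark finishes.
start_gpu_monitor() {
  local log=${1:-/workspace/gpu_monitor.csv}
  (
    while true; do
      rocm-smi --showpower --showtemp --showclocks --csv >> "$log" 2>&1
      sleep 1
    done
  ) &
  GPU_MONITOR_PID=$!
}

stop_gpu_monitor() {
  [ -n "${GPU_MONITOR_PID:-}" ] && kill "$GPU_MONITOR_PID"
}
```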
```shell
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --tensor-parallel-size "$TP" \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang > "$SERVER_LOG" 2>&1 &
```
Environment
rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219