### Checklist

### Describe the bug
When I run the Qwen3-235B-A22B model with DP + TP + EP enabled, the server crashes with a GPU memory access fault.

I traced the issue to the Triton kernel `pre_reorder_triton_kernel`: its launch grid is set from `hidden_states.shape[0]`, which is 92029 in this run. The corresponding unit test also fails for any batch size greater than 65536.
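One plausible (unconfirmed) explanation for the 65536 threshold: with the test parameters below, `batch_size * topk * hidden_size` crosses 2**31 exactly at `batch_size = 65536`, so if the kernel computed flat element offsets in 32-bit integers, they would wrap negative and produce out-of-bounds accesses. The offset formula in this sketch is hypothetical, not taken from the kernel source; the arithmetic can be checked without Triton:

```python
import numpy as np

# Hypothetical illustration: if the kernel computed flat offsets like
#   (batch_idx * topk + k) * hidden_size + h
# in 32-bit integers, the largest offset would overflow int32 once
# batch_size * topk * hidden_size exceeds 2**31 - 1.
batch_size, hidden_size, topk = 92029, 4096, 8

exact = batch_size * topk * hidden_size  # Python ints never overflow
print(exact)  # 3015606272, well above 2**31 - 1 = 2147483647

# 65536 * 8 * 4096 == 2**31 exactly, matching the observed failure
# threshold of batch size > 65536 in the unit test.
assert 65536 * topk * hidden_size == 2**31

# The same product in wrapping 32-bit arithmetic comes out negative:
with np.errstate(over="ignore"):
    wrapped = (
        np.array([batch_size], dtype=np.int32)
        * np.int32(topk)
        * np.int32(hidden_size)
    )[0]
print(int(wrapped))  # -1279361024
```

If this is indeed the cause, promoting the offset computation to 64-bit inside the kernel (rather than capping the grid) would be the robust fix.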


### Reproduction

Server:

```shell
python3 -m sglang.launch_server \
--model-path /data/models/Qwen3-235B-A22B/ \
--port 30000 \
--host 0.0.0.0 \
--served-model-name Qwen3-235B-A22B \
--trust-remote-code \
--chunked-prefill-size 130172 \
--max-running-requests 128 \
--mem-fraction-static 0.85 \
--enable-torch-compile \
--dp-size 8 \
--enable-ep-moe \
--tp-size 8 \
--enable-dp-attention
```
Benchmark client (`bench_serving`):

```shell
python3 -m sglang.bench_serving \
--dataset-name random \
--dataset-path /data/ShareGPT_V3_unfiltered_cleaned_split.json \
--model /data/models/Qwen3-235B-A22B \
--random-input-len 6144 \
--random-output-len 1024 \
--num-prompt 64 \
--random-range-ratio 1.0 \
--max-concurrency 16 \
--host 0.0.0.0 \
--port 30000 \
--seed $(date +%s)
```
Additionally, here are the unit test parameters and command. The parameters below were used in `sgl-kernel/tests/test_ep_moe_pre_reorder_kernel.py`:
```python
@pytest.mark.parametrize(
    "batch_size,hidden_size,topk",
    list(itertools.product([92029], [4096], [8])),
)
@pytest.mark.parametrize("dtype", [torch.bfloat16])
@pytest.mark.parametrize("use_per_token_if_dynamic", [True])
def test_ep_moe_pre_reorder_vs_triton(
    batch_size: int,
    hidden_size: int,
    topk: int,
    dtype: torch.dtype,
    use_per_token_if_dynamic: bool,
):
```
Unit test command:

```shell
pytest sgl-kernel/tests/test_ep_moe_pre_reorder_kernel.py
```
### Environment

```
Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI308X
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.4.43483-e0d58c107
ROCM Driver Version: 6.14.0
PyTorch: 2.6.0+rocm6.4.1.lw.git9d0a4a1a
sglang: 0.4.7
sgl_kernel: 0.1.7
flashinfer_python: Module Not Found
triton: 3.2.0+gitcddf0fc3
transformers: 4.52.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.12
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0e.rocm641
xgrammar: 0.1.19
openai: 1.85.0
tiktoken: 0.9.0
anthropic: 0.53.0
litellm: 1.72.2
decord: 0.6.0
AMD Topology:
============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0   0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU1   XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU2   XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI
GPU3   XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI
GPU4   XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI
GPU5   XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI
GPU6   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI
GPU7   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0
================================== End of ROCm SMI Log ===================================
ulimit soft: 1048576
```