[Bug] pre_reorder_triton_kernel memory violation with large batch size (batch size > 65536) #7545

@rujiacai

Description


Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

When I run the Qwen3-235B-A22B model with DP+TP+EP enabled, a memory access fault occurs.


I found that this issue is caused by the Triton kernel pre_reorder_triton_kernel. Its launch grid is set to hidden_states.shape[0], which is 92029 here; launching more than 65536 blocks triggers the memory access fault. The unit test is also broken for batch sizes > 65536.
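A common workaround for this kind of per-dimension launch limit is to cap the grid and have each program instance cover multiple rows with a grid stride, instead of launching one block per row. The pure-Python sketch below shows only the index arithmetic such a kernel would use (hypothetical names, not the actual sgl-kernel fix; MAX_GRID is an assumed limit taken from this report, not a queried device property). Inside the Triton kernel this corresponds to looping `for row in range(pid, batch_size, grid)`.

```python
# Sketch of grid-capping arithmetic for a pre_reorder_triton_kernel-style
# launch. MAX_GRID is an assumed per-dimension limit (from this report),
# not a value queried from the device.
MAX_GRID = 65536


def capped_grid(batch_size: int) -> int:
    """Number of program instances to launch, capped at MAX_GRID."""
    return min(batch_size, MAX_GRID)


def rows_for_program(pid: int, batch_size: int, grid: int) -> list:
    """Rows one program id covers under a grid-stride loop."""
    return list(range(pid, batch_size, grid))


# With the failing batch size from this report, a capped launch still
# covers every row exactly once:
batch = 92029
grid = capped_grid(batch)
covered = sorted(
    r for pid in range(grid) for r in rows_for_program(pid, batch, grid)
)
assert grid <= MAX_GRID
assert covered == list(range(batch))
```

With this scheme the first `batch % grid` program ids process two rows each and the rest process one, so correctness no longer depends on the batch size fitting in a single grid dimension.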


Reproduction

server:

python3 -m sglang.launch_server \
--model-path /data/models/Qwen3-235B-A22B/ \
--port 30000 \
--host 0.0.0.0 \
--served-model-name Qwen3-235B-A22B \
--trust-remote-code \
--chunked-prefill-size 130172 \
--max-running-requests 128 \
--mem-fraction-static 0.85 \
--enable-torch-compile \
--dp-size 8 \
--enable-ep-moe \
--tp-size 8 \
--enable-dp-attention

bench_serving:

python3 -m sglang.bench_serving \
        --dataset-name random \
        --dataset-path /data/ShareGPT_V3_unfiltered_cleaned_split.json \
        --model /data/models/Qwen3-235B-A22B \
        --random-input-len 6144 \
        --random-output-len 1024 \
        --num-prompt 64 \
        --random-range-ratio 1.0 \
        --max-concurrency 16 \
        --host 0.0.0.0 \
        --port 30000 \
        --seed $(date +%s)

Additionally, here are the unit test code and command.
UT parameters for sgl-kernel/tests/test_ep_moe_pre_reorder_kernel.py are shown below:

@pytest.mark.parametrize(
    "batch_size,hidden_size,topk",
    list(itertools.product([92029], [4096], [8])),
)
@pytest.mark.parametrize("dtype", [torch.bfloat16])
@pytest.mark.parametrize("use_per_token_if_dynamic", [True])
def test_ep_moe_pre_reorder_vs_triton(
    batch_size: int,
    hidden_size: int,
    topk: int,
    dtype: torch.dtype,
    use_per_token_if_dynamic: bool,
):
    ...

UT command:
pytest sgl-kernel/tests/test_ep_moe_pre_reorder_kernel.py
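To guard against this regression at the boundary, the parametrization could also exercise batch sizes just around the 65536 limit. The values below are a suggestion (the original test only used 92029); hidden_size and topk match the reproduction above.

```python
import itertools

# Suggested boundary batch sizes around the observed 65536 launch limit;
# 92029 is the failing size from this report.
BATCH_SIZES = [65535, 65536, 65537, 92029]
HIDDEN_SIZES = [4096]
TOPKS = [8]

# Same shape of parametrization as the existing test:
# @pytest.mark.parametrize("batch_size,hidden_size,topk", cases)
cases = list(itertools.product(BATCH_SIZES, HIDDEN_SIZES, TOPKS))
```

Testing both sides of the limit ensures the fix is verified exactly where the launch first overflows, not only at one large size.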

Environment

Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI308X
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.4.43483-e0d58c107
ROCM Driver Version: 6.14.0
PyTorch: 2.6.0+rocm6.4.1.lw.git9d0a4a1a
sglang: 0.4.7
sgl_kernel: 0.1.7
flashinfer_python: Module Not Found
triton: 3.2.0+gitcddf0fc3
transformers: 4.52.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.12
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0e.rocm641
xgrammar: 0.1.19
openai: 1.85.0
tiktoken: 0.9.0
anthropic: 0.53.0
litellm: 1.72.2
decord: 0.6.0
AMD Topology:

============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0  0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU1  XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU2  XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI
GPU3  XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI
GPU4  XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI
GPU5  XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI
GPU6  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI
GPU7  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0
================================== End of ROCm SMI Log ===================================

ulimit soft: 1048576
