### Checklist

### Describe the bug
When I run the Qwen3-235B-A22B model with DP + TP + EP enabled, the server crashes with a GPU memory access fault.

I traced the issue to the Triton kernel `pre_reorder_triton_kernel`: its launch grid is set from `hidden_states.shape[0]`, which is 92029 in this run. The corresponding unit test also fails for any batch size greater than 65536.
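One plausible (unconfirmed) explanation for the 65536 threshold: with the test parameters below, `batch_size * topk * hidden_size` crosses 2**31 exactly at `batch_size = 65536`, so if the kernel computed flat element offsets in 32-bit integers, they would wrap negative and produce out-of-bounds accesses. The offset formula in this sketch is hypothetical, not taken from the kernel source; the arithmetic can be checked without Triton:

```python
import numpy as np

# Hypothetical illustration: if the kernel computed flat offsets like
#   (batch_idx * topk + k) * hidden_size + h
# in 32-bit integers, the largest offset would overflow int32 once
# batch_size * topk * hidden_size exceeds 2**31 - 1.
batch_size, hidden_size, topk = 92029, 4096, 8

exact = batch_size * topk * hidden_size  # Python ints never overflow
print(exact)  # 3015606272, well above 2**31 - 1 = 2147483647

# 65536 * 8 * 4096 == 2**31 exactly, matching the observed failure
# threshold of batch size > 65536 in the unit test.
assert 65536 * topk * hidden_size == 2**31

# The same product in wrapping 32-bit arithmetic comes out negative:
with np.errstate(over="ignore"):
    wrapped = (
        np.array([batch_size], dtype=np.int32)
        * np.int32(topk)
        * np.int32(hidden_size)
    )[0]
print(int(wrapped))  # -1279361024
```

If this is indeed the cause, promoting the offset computation to 64-bit inside the kernel (rather than capping the grid) would be the robust fix.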


### Reproduction

Server:

```shell
python3 -m sglang.launch_server \
--model-path /data/models/Qwen3-235B-A22B/ \
--port 30000 \
--host 0.0.0.0 \
--served-model-name Qwen3-235B-A22B \
--trust-remote-code \
--chunked-prefill-size 130172 \
--max-running-requests 128 \
--mem-fraction-static 0.85 \
--enable-torch-compile \
--dp-size 8 \
--enable-ep-moe \
--tp-size 8 \
--enable-dp-attention
```
Benchmark client (`bench_serving`):

```shell
python3 -m sglang.bench_serving \
--dataset-name random \
--dataset-path /data/ShareGPT_V3_unfiltered_cleaned_split.json \
--model /data/models/Qwen3-235B-A22B \
--random-input-len 6144 \
--random-output-len 1024 \
--num-prompt 64 \
--random-range-ratio 1.0 \
--max-concurrency 16 \
--host 0.0.0.0 \
--port 30000 \
--seed $(date +%s)
```
Additionally, here are the unit test parameters and command. The parameters below were used in `sgl-kernel/tests/test_ep_moe_pre_reorder_kernel.py`:
```python
@pytest.mark.parametrize(
    "batch_size,hidden_size,topk",
    list(itertools.product([92029], [4096], [8])),
)
@pytest.mark.parametrize("dtype", [torch.bfloat16])
@pytest.mark.parametrize("use_per_token_if_dynamic", [True])
def test_ep_moe_pre_reorder_vs_triton(
    batch_size: int,
    hidden_size: int,
    topk: int,
    dtype: torch.dtype,
    use_per_token_if_dynamic: bool,
):
```
Unit test command:

```shell
pytest sgl-kernel/tests/test_ep_moe_pre_reorder_kernel.py
```
### Environment

```
Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
ROCM available: True
GPU 0,1,2,3,4,5,6,7: AMD Instinct MI308X
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.4
ROCM_HOME: /opt/rocm
HIPCC: HIP version: 6.4.43483-e0d58c107
ROCM Driver Version: 6.14.0
PyTorch: 2.6.0+rocm6.4.1.lw.git9d0a4a1a
sglang: 0.4.7
sgl_kernel: 0.1.7
flashinfer_python: Module Not Found
triton: 3.2.0+gitcddf0fc3
transformers: 4.52.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.12.12
fastapi: 0.115.12
hf_transfer: 0.1.9
huggingface_hub: 0.32.4
interegular: 0.3.3
modelscope: 1.26.0
orjson: 3.10.18
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.5
python-multipart: 0.0.20
pyzmq: 26.4.0
uvicorn: 0.34.3
uvloop: 0.21.0
vllm: 0.6.7.dev2+g113274a0e.rocm641
xgrammar: 0.1.19
openai: 1.85.0
tiktoken: 0.9.0
anthropic: 0.53.0
litellm: 1.72.2
decord: 0.6.0
AMD Topology:
============================ ROCm System Management Interface ============================
=============================== Link Type between two GPUs ===============================
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
GPU0   0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU1   XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI  XGMI
GPU2   XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI  XGMI
GPU3   XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI  XGMI
GPU4   XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI  XGMI
GPU5   XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI  XGMI
GPU6   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0     XGMI
GPU7   XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  XGMI  0
================================== End of ROCm SMI Log ===================================
ulimit soft: 1048576
```