Describe the bug
Running the setup below triggers a CUDA device-side assertion in PyTorch's scatter/gather kernel:
ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Reproduction
On each of the two nodes, start the container:
docker run --entrypoint bash -it --gpus all \
  --network host --shm-size 32g \
  -v /data/:/data/ \
  --privileged=true \
  --ipc=host \
  -e HF_HUB_OFFLINE=1 -e GLOO_SOCKET_IFNAME=eth0 -e NCCL_SOCKET_IFNAME=eth0 \
  -e ENABLE_JIT_DEEPGEMM=1 -e NCCL_CUMEM_ENABLE=1 \
  docker.1ms.run/lmsysorg/sglang:v0.4.9.post2-cu126
Inside the container on node 0:
python3 /sgl-workspace/sglang/python/sglang/launch_server.py \
  --model-path /data/models/DeepSeek-V3-0324 \
  --served-model-name deepseek \
  --host 0.0.0.0 --port 8000 --trust-remote-code --tp 16 --max-running-requests 100 \
  --dist-init-addr 172.31.16.3:20000 --nnodes 2 --node-rank 0 --mem-fraction-static 0.8 \
  --context-length 32768 --enable-mixed-chunk --chunked-prefill-size 2048 --attention-backend fa3 \
  --enable-eplb --enable-hierarchical-cache \
  --enable-dp-attention --dp 16 --enable-dp-lm-head --cuda-graph-max-bs 8 --ep-size 8
Inside the container on node 1:
python3 /sgl-workspace/sglang/python/sglang/launch_server.py \
  --model-path /data/models/DeepSeek-V3-0324 \
  --served-model-name deepseek \
  --host 0.0.0.0 --port 8000 --trust-remote-code --tp 16 --max-running-requests 100 \
  --dist-init-addr 172.31.16.3:20000 --nnodes 2 --node-rank 1 --mem-fraction-static 0.8 \
  --context-length 32768 --enable-mixed-chunk --chunked-prefill-size 2048 --attention-backend fa3 \
  --enable-eplb --enable-hierarchical-cache \
  --enable-dp-attention --dp 16 --enable-dp-lm-head --cuda-graph-max-bs 8 --ep-size 8
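The two launch_server.py commands above are identical except for --node-rank (0 on the first node, 1 on the second). A minimal sketch, assuming bash 4+ (for mapfile), that keeps the shared flags in one place so the two nodes cannot drift apart. The function name sglang_args is hypothetical and not part of SGLang; every flag is copied from this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of SGLang: prints the launch_server.py
# arguments from this report, one per line, for the given node rank.
# All flags are copied from the report; only --node-rank varies.
sglang_args() {
  local node_rank="${1:?usage: sglang_args <node-rank>}"
  printf '%s\n' \
    --model-path /data/models/DeepSeek-V3-0324 \
    --served-model-name deepseek \
    --host 0.0.0.0 --port 8000 --trust-remote-code \
    --tp 16 --max-running-requests 100 \
    --dist-init-addr 172.31.16.3:20000 --nnodes 2 --node-rank "$node_rank" \
    --mem-fraction-static 0.8 \
    --context-length 32768 --enable-mixed-chunk --chunked-prefill-size 2048 \
    --attention-backend fa3 \
    --enable-eplb --enable-hierarchical-cache \
    --enable-dp-attention --dp 16 --enable-dp-lm-head \
    --cuda-graph-max-bs 8 --ep-size 8
}

# Usage inside the container (rank 0 on the first node, 1 on the second):
#   mapfile -t args < <(sglang_args 0)
#   python3 /sgl-workspace/sglang/python/sglang/launch_server.py "${args[@]}"
```

Printing one argument per line and reading it back with mapfile keeps each flag and value as a separate array element, so nothing is re-split by the shell.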
Environment
2 nodes x 8 NVIDIA H20 GPUs (16 GPUs total)