Describe the bug
Notice: this bug has been mentioned in #6309 before, and this issue provides a more easily reproducible version of it.
sglang=0.5.0rc0, sgl-kernel=0.3.3, flash_attn=2.7.4.post1, flashinfer-python=0.2.10
Running the GSM8K benchmark against a server launched with EAGLE speculative decoding and CUDA graphs enabled crashes the scheduler with an illegal memory access:
[2025-08-22 02:45:57] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2568, in run_scheduler_process
scheduler.event_loop_normal()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 762, in event_loop_normal
result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1732, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 346, in forward_batch_speculative_generation
self.verify(batch, spec_info)
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 686, in verify
self.target_worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 238, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1720, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1747, in _forward_raw
ret = self.cuda_graph_runner.replay(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 786, in replay
self.replay_prepare(forward_batch, pp_proxy_tensors)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 763, in replay_prepare
self.model_runner.attn_backend.init_forward_metadata_replay_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 435, in init_forward_metadata_replay_cuda_graph
self.indices_updater_prefill.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 879, in update_single_wrapper
self.call_begin_forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1051, in call_begin_forward
wrapper_paged.begin_forward(
File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 1654, in plan
qo_indptr_host = qo_indptr.to("cpu")
^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
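Note that the error surfaces at `qo_indptr.to("cpu")`, a device-to-host copy that forces a synchronization; CUDA reports illegal memory accesses asynchronously, so the kernel that actually faulted most likely ran earlier (e.g. during graph replay), not in flashinfer's `plan`. Below is a minimal, generic sketch of how to localize such a fault by forcing synchronous launches and adding explicit sync points; it is an illustration of the debugging technique, not a fix, and the names are placeholders:

```python
import os

# CUDA reports illegal memory accesses asynchronously, so a traceback can
# point at a later sync point (here the .to("cpu") copy) instead of the
# kernel that faulted. Setting this BEFORE torch initializes CUDA makes
# each kernel launch synchronous, so the faulting kernel raises directly.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checkpoint(label: str) -> None:
    # An explicit device sync between steps narrows down which step faults.
    torch.cuda.synchronize()
    print(f"ok after: {label}")

# Bracket suspect steps with checkpoints (stand-in op shown here).
x = torch.randn(4, 4, device="cuda")
y = x @ x
checkpoint("matmul")
```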
Reproduction
Launch the SGLang server:
export CUDA_LAUNCH_BLOCKING=1
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --enable-memory-saver \
--trust-remote-code --skip-server-warmup --speculative-algorithm EAGLE --speculative-num-steps 20 \
--speculative-eagle-topk 10 --speculative-num-draft-tokens 100 --mem-fraction 0.7
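Since `--skip-server-warmup` is passed, the server starts serving without a warmup request, so it can help to confirm it is actually up before launching the benchmark. A minimal sketch, assuming the default port 30000 and sglang's `/health` route (adjust if you pass `--port`):

```python
import time
import urllib.request

BASE_URL = "http://127.0.0.1:30000"  # assumed default sglang port

def wait_until_healthy(timeout_s: float = 300.0) -> None:
    """Poll the server's /health route until it answers 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/health", timeout=5) as resp:
                if resp.status == 200:
                    print("server is up")
                    return
        except OSError:
            pass  # not listening yet; keep polling
        time.sleep(2)
    raise TimeoutError("server did not become healthy in time")

if __name__ == "__main__":
    wait_until_healthy()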
Run the data set:
cd benchmark/gsm8k
python3 bench_sglang.py --num-questions 2000 --parallel 17
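Because the trigger probability depends heavily on the `--parallel` value (see the notice list just below), a small driver that reruns the benchmark across several concurrency levels can speed up reproduction. A hedged sketch; the script path and flags mirror the commands above, and the value set is taken from the observations below:

```python
import subprocess

# 17 and 27 reportedly trigger the crash far more often than 30 or 32,
# so sweep the risky values first. Run from the sglang repository root.
for parallel in (17, 27, 30, 32):
    for round_idx in (1, 2):  # one or two rounds is usually enough
        print(f"--- parallel={parallel}, round {round_idx} ---", flush=True)
        proc = subprocess.run(
            ["python3", "bench_sglang.py",
             "--num-questions", "2000",
             "--parallel", str(parallel)],
            cwd="benchmark/gsm8k",
        )
        if proc.returncode != 0:
            print("benchmark run failed; check the scheduler log "
                  "for the CUDA illegal-memory-access error")
            raise SystemExit(1)
```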
Notice:
- --disable-cuda-graph makes this bug disappear.
- --disable-cuda-graph-padding may make this bug disappear (needs more tests).
- The fa3 and triton attention backends also have this issue.
- The parallel parameter is almost like a magic number: I have tried 30 and 32, which rarely trigger it, but 27 and 17 have a much higher chance. Running just one or two rounds of the GSM8K benchmark is usually enough to see it.
- With small speculative settings (e.g. --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4), there is a very low chance of triggering this bug.

Environment
An H200 GPU.
A partial pip list is shown below.
You can also use the docker image zhuzilin/slime:latest.
Package Version Editable project location
---------------------------------- ------------- ----------------------------
cryptography 3.4.8
cuda-bindings 13.0.0
cuda-pathfinder 1.1.0
cuda-python 13.0.0
cumem_allocator 0.0.1
flash_attn 2.7.4.post1
flashinfer-python 0.2.10
nixl 0.5.0
numpy 1.26.4
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.13.0
nvidia-cufft-cu12 11.3.0.4
nvidia-cufile-cu12 1.11.1.6
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 12.575.51
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.6.77
openai 1.99.1
ray 2.48.0
scikit_build_core 0.11.5
sgl-kernel 0.3.3
sglang 0.5.0rc0 /sgl-workspace/sglang/python
sglang-router 0.1.9
slime 0.0.1 /root/slime
tabulate 0.9.0
tiktoken 0.11.0
timm 1.0.16
tokenizers 0.21.4
torch 2.8.0+cu126
torch_memory_saver 0.0.8
torchao 0.9.0+cu126
torchaudio 2.8.0+cu126
torchvision 0.23.0+cu126
transformer_engine 2.5.0+af9467c
transformers 4.55.0
triton 3.4.0
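For completeness, the key versions above can also be captured programmatically when filing a report (sglang additionally ships an environment dump via `python3 -m sglang.check_env`, if available in this build). A minimal sketch using standard APIs:

```python
import importlib.metadata as md
import torch

print("torch      :", torch.__version__)
print("torch CUDA :", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU        :", torch.cuda.get_device_name(0))

# Distribution names as they appear on PyPI; missing ones are reported.
for dist in ("sglang", "sgl-kernel", "flashinfer-python", "flash-attn"):
    try:
        print(f"{dist:18s}:", md.version(dist))
    except md.PackageNotFoundError:
        print(f"{dist:18s}: not found")
```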