Describe the bug
Notice: this bug has been mentioned in #6309 before, and this issue provides a more easily reproducible version of it.
sglang=0.5.0rc0, sgl-kernel=0.3.3, flash_attn=2.7.4.post1, flashinfer-python=0.2.10
Running the GSM8K benchmark against a server launched with EAGLE speculative decoding and CUDA graphs enabled crashes the scheduler with an illegal memory access:
[2025-08-22 02:45:57] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2568, in run_scheduler_process
scheduler.event_loop_normal()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 762, in event_loop_normal
result = self.run_batch(batch)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1732, in run_batch
) = self.draft_worker.forward_batch_speculative_generation(batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 346, in forward_batch_speculative_generation
self.verify(batch, spec_info)
File "/sgl-workspace/sglang/python/sglang/srt/speculative/eagle_worker.py", line 686, in verify
self.target_worker.forward_batch_generation(
File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 238, in forward_batch_generation
logits_output, can_run_cuda_graph = self.model_runner.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1720, in forward
output = self._forward_raw(
^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1747, in _forward_raw
ret = self.cuda_graph_runner.replay(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 786, in replay
self.replay_prepare(forward_batch, pp_proxy_tensors)
File "/sgl-workspace/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 763, in replay_prepare
self.model_runner.attn_backend.init_forward_metadata_replay_cuda_graph(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 435, in init_forward_metadata_replay_cuda_graph
self.indices_updater_prefill.update(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 879, in update_single_wrapper
self.call_begin_forward(
File "/sgl-workspace/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 1051, in call_begin_forward
wrapper_paged.begin_forward(
File "/usr/local/lib/python3.12/dist-packages/flashinfer/prefill.py", line 1654, in plan
qo_indptr_host = qo_indptr.to("cpu")
^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
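Note that the error surfaces at `qo_indptr.to("cpu")`, a device-to-host copy that forces a synchronization; CUDA reports illegal memory accesses asynchronously, so the kernel that actually faulted most likely ran earlier (e.g. during graph replay), not in flashinfer's `plan`. Below is a minimal, generic sketch of how to localize such a fault by forcing synchronous launches and adding explicit sync points; it is an illustration of the debugging technique, not a fix, and the names are placeholders:

```python
import os

# CUDA reports illegal memory accesses asynchronously, so a traceback can
# point at a later sync point (here the .to("cpu") copy) instead of the
# kernel that faulted. Setting this BEFORE torch initializes CUDA makes
# each kernel launch synchronous, so the faulting kernel raises directly.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checkpoint(label: str) -> None:
    # An explicit device sync between steps narrows down which step faults.
    torch.cuda.synchronize()
    print(f"ok after: {label}")

# Bracket suspect steps with checkpoints (stand-in op shown here).
x = torch.randn(4, 4, device="cuda")
y = x @ x
checkpoint("matmul")
```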
Reproduction
Launch the SGLang server:
export CUDA_LAUNCH_BLOCKING=1
python3 -m sglang.launch_server --model-path XiaomiMiMo/MiMo-7B-RL --enable-memory-saver \
--trust-remote-code --skip-server-warmup --speculative-algorithm EAGLE --speculative-num-steps 20 \
--speculative-eagle-topk 10 --speculative-num-draft-tokens 100 --mem-fraction 0.7
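Since `--skip-server-warmup` is passed, the server starts serving without a warmup request, so it can help to confirm it is actually up before launching the benchmark. A minimal sketch, assuming the default port 30000 and sglang's `/health` route (adjust if you pass `--port`):

```python
import time
import urllib.request

BASE_URL = "http://127.0.0.1:30000"  # assumed default sglang port

def wait_until_healthy(timeout_s: float = 300.0) -> None:
    """Poll the server's /health route until it answers 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/health", timeout=5) as resp:
                if resp.status == 200:
                    print("server is up")
                    return
        except OSError:
            pass  # not listening yet; keep polling
        time.sleep(2)
    raise TimeoutError("server did not become healthy in time")

if __name__ == "__main__":
    wait_until_healthy()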
Run the data set:
cd benchmark/gsm8k
python3 bench_sglang.py --num-questions 2000 --parallel 17
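Because the trigger probability depends heavily on the `--parallel` value (see the notice list just below), a small driver that reruns the benchmark across several concurrency levels can speed up reproduction. A hedged sketch; the script path and flags mirror the commands above, and the value set is taken from the observations below:

```python
import subprocess

# 17 and 27 reportedly trigger the crash far more often than 30 or 32,
# so sweep the risky values first. Run from the sglang repository root.
for parallel in (17, 27, 30, 32):
    for round_idx in (1, 2):  # one or two rounds is usually enough
        print(f"--- parallel={parallel}, round {round_idx} ---", flush=True)
        proc = subprocess.run(
            ["python3", "bench_sglang.py",
             "--num-questions", "2000",
             "--parallel", str(parallel)],
            cwd="benchmark/gsm8k",
        )
        if proc.returncode != 0:
            print("benchmark run failed; check the scheduler log "
                  "for the CUDA illegal-memory-access error")
            raise SystemExit(1)
```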
Notice:
- --disable-cuda-graph makes this bug disappear.
- --disable-cuda-graph-padding may make this bug disappear (needs more tests).
- The fa3 and triton attention backends also have this issue.
- The parallel parameter is almost like a magic number: I have tried 30 and 32, which rarely trigger it, but 27 and 17 have a much higher chance. Running just one or two rounds of the GSM8K benchmark is usually enough to see it.
- With small speculative settings (e.g. --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4), there is a very low chance of triggering this bug.

Environment
An H200 GPU.
A partial pip list is shown below.
You can also use the docker image zhuzilin/slime:latest.
Package Version Editable project location
---------------------------------- ------------- ----------------------------
cryptography 3.4.8
cuda-bindings 13.0.0
cuda-pathfinder 1.1.0
cuda-python 13.0.0
cumem_allocator 0.0.1
flash_attn 2.7.4.post1
flashinfer-python 0.2.10
nixl 0.5.0
numpy 1.26.4
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.10.2.21
nvidia-cudnn-frontend 1.13.0
nvidia-cufft-cu12 11.3.0.4
nvidia-cufile-cu12 1.11.1.6
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 12.575.51
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvshmem-cu12 3.3.20
nvidia-nvtx-cu12 12.6.77
openai 1.99.1
ray 2.48.0
scikit_build_core 0.11.5
sgl-kernel 0.3.3
sglang 0.5.0rc0 /sgl-workspace/sglang/python
sglang-router 0.1.9
slime 0.0.1 /root/slime
tabulate 0.9.0
tiktoken 0.11.0
timm 1.0.16
tokenizers 0.21.4
torch 2.8.0+cu126
torch_memory_saver 0.0.8
torchao 0.9.0+cu126
torchaudio 2.8.0+cu126
torchvision 0.23.0+cu126
transformer_engine 2.5.0+af9467c
transformers 4.55.0
triton 3.4.0
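For completeness, the key versions above can also be captured programmatically when filing a report (sglang additionally ships an environment dump via `python3 -m sglang.check_env`, if available in this build). A minimal sketch using standard APIs:

```python
import importlib.metadata as md
import torch

print("torch      :", torch.__version__)
print("torch CUDA :", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU        :", torch.cuda.get_device_name(0))

# Distribution names as they appear on PyPI; missing ones are reported.
for dist in ("sglang", "sgl-kernel", "flashinfer-python", "flash-attn"):
    try:
        print(f"{dist:18s}:", md.version(dist))
    except md.PackageNotFoundError:
        print(f"{dist:18s}: not found")
```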