Checklist
Describe the bug
error:
[2026-02-04 19:36:48 DP7 PP0 TP7] Decode batch, #running-req: 52, #token: 382208, token usage: 0.94, cuda graph: False, gen throughput (token/s): 112.22, #queue-req: 215,
[2026-02-04 19:36:48 DP5 PP0 TP5] Decode batch, #running-req: 56, #token: 378880, token usage: 0.93, cuda graph: False, gen throughput (token/s): 120.51, #queue-req: 226,
[2026-02-04 19:36:48 DP3 PP0 TP3] Prefill batch, #new-seq: 1, #new-token: 3264, #cached-token: 640, token usage: 0.92, #running-req: 50, #queue-req: 203,
[2026-02-04 19:36:48 DP6 PP0 TP6] Decode batch, #running-req: 56, #token: 379200, token usage: 0.93, cuda graph: False, gen throughput (token/s): 122.81, #queue-req: 254,
[2026-02-04 19:36:48 DP4 PP0 TP4] Decode batch, #running-req: 56, #token: 383232, token usage: 0.94, cuda graph: False, gen throughput (token/s): 118.84, #queue-req: 200,
[2026-02-04 19:36:48 DP1 PP0 TP1] Scheduler hit an exception: Traceback (most recent call last):
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 2974, in run_scheduler_process
scheduler.event_loop_pp()
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_pp_mixin.py", line 90, in event_loop_pp
self.mbs[mb_id] = self.get_next_batch_to_run()
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 1880, in get_next_batch_to_run
ret = self.maybe_prepare_mlp_sync_batch_and_log_stats(
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 253, in maybe_prepare_mlp_sync_batch_and_log_stats
batch = self.prepare_mlp_sync_batch(batch)
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 225, in prepare_mlp_sync_batch
return prepare_mlp_sync_batch_raw(
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 197, in prepare_mlp_sync_batch_raw
mlp_sync_info.all_gather(device=device, group=group)
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 73, in all_gather
local_info_tensor = self._get_local_tensor(device=device)
File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 45, in _get_local_tensor
return torch.tensor(
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
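For reference, the quickest way to act on the hint in that message is to relaunch with synchronous kernel launches so the reported frame is the one that actually faulted. The snippet below is only a sketch; launch_dsv32.sh is a hypothetical stand-in for whatever per-node script runs the launch_server command from the Reproduction section.

# Debugging sketch (hypothetical wrapper, not part of the original run).
# CUDA_LAUNCH_BLOCKING=1 forces kernel launches to be synchronous, so the
# illegal memory access surfaces at the offending call instead of a later,
# unrelated API call. It costs throughput, so use it only while reproducing.
export CUDA_LAUNCH_BLOCKING=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True   # same allocator setting as in Environment
bash launch_dsv32.sh   # hypothetical script wrapping the sglang.launch_server command below
# TORCH_USE_CUDA_DSA is a build-time option; setting it at runtime on a
# prebuilt PyTorch wheel has no effect.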
Reproduction
python3 -m sglang.launch_server \
  --model-path /local_ssd/DeepSeek-V3.2 \
  --nccl-init-addr "$MASTER_IP_ADDRESS:20000" \
  --nnodes 4 \
  --node-rank "$RANK" \
  --trust-remote-code \
  --host 0.0.0.0 \
  --schedule-policy fcfs \
  --port "$PORT" \
  --decode-log-interval 1 \
  --context-length 128000 \
  --tokenizer-worker-num 4 \
  $ARGS
ARGS="--enable-hierarchical-cache --hicache-ratio 3 --cuda-graph-bs 1 2 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 --tp-size 8 --pp-size 4 --pp-async-batch-depth 1 --dp-size 8 --enable-dp-attention --max-running-requests 5120 --pp-max-micro-batch-size 1024 --chunked-prefill-size 32768 --schedule-conservativeness 3.333 --tokenizer-worker-num 1 --tool-call-parser deepseekv32 --mem-fraction-static 0.82 --reasoning-parser deepseek-v3 --disable-custom-all-reduce"
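For readability, the fully expanded per-node command with $ARGS inlined is sketched below. It assumes MASTER_IP_ADDRESS, RANK, and PORT are exported on each of the four nodes before launching, and it is a reconstruction from the flags above rather than a copy of the actual script.

# Expanded per-node launch (sketch assembled from the command and ARGS above).
# Note that --tokenizer-worker-num appears twice (4 in the base command, 1 in
# ARGS); with argparse-style parsing the later value, 1, is the one that takes effect.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python3 -m sglang.launch_server \
  --model-path /local_ssd/DeepSeek-V3.2 \
  --nccl-init-addr "$MASTER_IP_ADDRESS:20000" \
  --nnodes 4 \
  --node-rank "$RANK" \
  --trust-remote-code \
  --host 0.0.0.0 \
  --schedule-policy fcfs \
  --port "$PORT" \
  --decode-log-interval 1 \
  --context-length 128000 \
  --tokenizer-worker-num 4 \
  --enable-hierarchical-cache \
  --hicache-ratio 3 \
  --cuda-graph-bs 1 2 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 \
  --tp-size 8 \
  --pp-size 4 \
  --pp-async-batch-depth 1 \
  --dp-size 8 \
  --enable-dp-attention \
  --max-running-requests 5120 \
  --pp-max-micro-batch-size 1024 \
  --chunked-prefill-size 32768 \
  --schedule-conservativeness 3.333 \
  --tokenizer-worker-num 1 \
  --tool-call-parser deepseekv32 \
  --mem-fraction-static 0.82 \
  --reasoning-parser deepseek-v3 \
  --disable-custom-all-reduce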
Environment
sglang 0.5.8 / H800 / 32 cards
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True