
[Bug] HiCache CUDA illegal memory #18166

@dongyibo

Description


Checklist

  • I searched related issues but found no solution.
  • The bug persists in the latest version.
  • Issues without environment info and a minimal reproducible demo are hard to resolve and may receive no feedback.
  • If this is not a bug report but a general question, please start a discussion at https://github.com/sgl-project/sglang/discussions. Otherwise, it will be closed.
  • Please use English. Otherwise, it will be closed.

Describe the bug

error:
[2026-02-04 19:36:48 DP7 PP0 TP7] Decode batch, #running-req: 52, #token: 382208, token usage: 0.94, cuda graph: False, gen throughput (token/s): 112.22, #queue-req: 215,
[2026-02-04 19:36:48 DP5 PP0 TP5] Decode batch, #running-req: 56, #token: 378880, token usage: 0.93, cuda graph: False, gen throughput (token/s): 120.51, #queue-req: 226,
[2026-02-04 19:36:48 DP3 PP0 TP3] Prefill batch, #new-seq: 1, #new-token: 3264, #cached-token: 640, token usage: 0.92, #running-req: 50, #queue-req: 203,
[2026-02-04 19:36:48 DP6 PP0 TP6] Decode batch, #running-req: 56, #token: 379200, token usage: 0.93, cuda graph: False, gen throughput (token/s): 122.81, #queue-req: 254,
[2026-02-04 19:36:48 DP4 PP0 TP4] Decode batch, #running-req: 56, #token: 383232, token usage: 0.94, cuda graph: False, gen throughput (token/s): 118.84, #queue-req: 200,
[2026-02-04 19:36:48 DP1 PP0 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 2974, in run_scheduler_process
    scheduler.event_loop_pp()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_pp_mixin.py", line 90, in event_loop_pp
    self.mbs[mb_id] = self.get_next_batch_to_run()
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler.py", line 1880, in get_next_batch_to_run
    ret = self.maybe_prepare_mlp_sync_batch_and_log_stats(
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 253, in maybe_prepare_mlp_sync_batch_and_log_stats
    batch = self.prepare_mlp_sync_batch(batch)
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 225, in prepare_mlp_sync_batch
    return prepare_mlp_sync_batch_raw(
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 197, in prepare_mlp_sync_batch_raw
    mlp_sync_info.all_gather(device=device, group=group)
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 73, in all_gather
    local_info_tensor = self._get_local_tensor(device=device)
  File "/local-ssd/pv0/python/sglang/srt/managers/scheduler_dp_attn_mixin.py", line 45, in _get_local_tensor
    return torch.tensor(
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
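Since the illegal access is reported asynchronously, the stack trace above (ending in a host-to-device `torch.tensor` call) may only show the victim of an earlier faulting kernel. A minimal debug re-run sketch, assuming the same launch command as in the Reproduction section, would set synchronous kernel launches first so the trace points at the kernel that actually faulted:

```shell
# Force synchronous CUDA kernel launches so errors surface at the
# kernel that raised them, not at a later unrelated API call.
export CUDA_LAUNCH_BLOCKING=1
# Keep the allocator setting from the original environment.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
echo "CUDA_LAUNCH_BLOCKING=$CUDA_LAUNCH_BLOCKING"
```

Note this slows the server down considerably, so it is only suitable for reproducing the crash, not for benchmarking throughput.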

Reproduction

python3 -m sglang.launch_server \
  --model-path /local_ssd/DeepSeek-V3.2 \
  --nccl-init-addr "$MASTER_IP_ADDRESS:20000" \
  --nnodes 4 \
  --node-rank "$RANK" \
  --trust-remote-code \
  --host 0.0.0.0 \
  --schedule-policy fcfs \
  --port "$PORT" \
  --decode-log-interval 1 \
  --context-length 128000 \
  --tokenizer-worker-num 4 \
  $ARGS

ARGS = --enable-hierarchical-cache --hicache-ratio 3 --cuda-graph-bs 1 2 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64 72 80 88 96 104 112 120 128 --tp-size 8 --pp-size 4 --pp-async-batch-depth 1 --dp-size 8 --enable-dp-attention --max-running-requests 5120 --pp-max-micro-batch-size 1024 --chunked-prefill-size 32768 --schedule-conservativeness 3.333 --tokenizer-worker-num 1 --tool-call-parser deepseekv32 --mem-fraction-static 0.82 --reasoning-parser deepseek-v3 --disable-custom-all-reduce

Environment

sglang 0.5.8 / H800 / 32 cards
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
