[Bug] offload_kv_cache error #7819

@zhangxiaolei123456

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

`usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 96.87, #queue-req: 0
[2025-07-04 16:59:49 DP4 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 2817, in run_scheduler_process
    scheduler.event_loop_normal_disagg_decode()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/disaggregation/decode.py", line 645, in event_loop_normal_disagg_decode
    batch = self.get_next_disagg_decode_batch_to_run()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/disaggregation/decode.py", line 794, in get_next_disagg_decode_batch_to_run
    self.running_batch = self.update_running_batch(self.running_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1736, in update_running_batch
    retracted_reqs, new_token_ratio = batch.retract_decode(self.server_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/schedule_batch.py", line 1449, in retract_decode
    req.offload_kv_cache(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/schedule_batch.py", line 776, in offload_kv_cache
    self.kv_cache_cpu = token_to_kv_pool_allocator.get_cpu_copy(token_indices)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/mem_cache/allocator.py", line 81, in get_cpu_copy
    raise NotImplementedError()
NotImplementedError

[2025-07-04 16:59:55] Child process unexpectedly failed with exitcode=131. pid=19676`
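
The traceback ends in a `get_cpu_copy` that only raises `NotImplementedError`, which suggests the allocator selected by this configuration does not override a base-class stub. The sketch below reproduces that failure pattern in isolation. All class and function names here are hypothetical stand-ins for illustration, not the actual sglang classes, and which real allocator is in use is an assumption I have not confirmed.

```python
class BaseAllocator:
    """Hypothetical stand-in for a base allocator whose CPU copy is a stub."""

    def get_cpu_copy(self, indices):
        raise NotImplementedError()


class TokenAllocator(BaseAllocator):
    """Hypothetical allocator that does implement the CPU copy."""

    def __init__(self, kv):
        self.kv = kv

    def get_cpu_copy(self, indices):
        # Copy the selected KV entries out of the (simulated) device pool.
        return [self.kv[i] for i in indices]


class PagedAllocator(BaseAllocator):
    """Hypothetical allocator that inherits the stub without overriding it."""

    def __init__(self, kv):
        self.kv = kv


def offload_kv_cache(allocator, token_indices):
    # Mirrors the call shape in the traceback: the request asks the
    # allocator for a CPU copy of its KV entries when it is retracted.
    return allocator.get_cpu_copy(token_indices)


def try_offload(allocator, token_indices):
    """Return the CPU copy, or None if the allocator cannot offload."""
    try:
        return offload_kv_cache(allocator, token_indices)
    except NotImplementedError:
        return None


if __name__ == "__main__":
    print(try_offload(TokenAllocator([10, 20, 30]), [0, 2]))
    print(try_offload(PagedAllocator([10, 20, 30]), [0]))
```

If this is what happens here, the error would only appear once the decode scheduler actually retracts a request, which matches the log: the server ran normally until `retract_decode` was called.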

Reproduction

My commands are below:
`# prefill
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server \
  --attention-backend flashmla --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
  --model-path /data/models/DeepSeek-R1/ \
  --tp 8 --disaggregation-mode prefill \
  --host --port 30300 --trust-remote-code --enable-deepep-moe --deepep-mode normal --disable-radix-cache --max-running-requests 8 --chunked-prefill-size 0 \
  --trust-remote-code --watchdog-timeout 1000000 \
  --mem-fraction-static 0.8 \
  --show-time-cost --kv-cache-dtype fp8_e4m3 --page-size 64

# decode
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server \
  --attention-backend flashmla \
  --model-path /data/models/DeepSeek-R1/ \
  --tp 8 --disaggregation-mode decode --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
  --host --port 30300 --trust-remote-code --dist-init-addr --enable-deepep-moe --deepep-mode low_latency --disable-radix-cache --mem-fraction-static 0.7 --max-running-requests 256 --moe-dense-tp-size 1 --cuda-graph-bs 1 2 4 8 10 12 14 16 18 20 22 24 26 28 30 32 --watchdog-timeout 1000000 \
  --enable-dp-attention --dp-size 8 \
  --context-length 40000 \
  --trust-remote-code --page-size 64 \
  --show-time-cost --enable-dp-lm-head \
  --speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 \
  --kv-cache-dtype fp8_e4m3

# mini load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill --decode --port 8000 --neat-room`

python3 benchmark_serving.py --backend vllm --model /data/models/DeepSeek-R1/ --base-url http://127.0.0.1:8000/ --endpoint /v1/completions --num-prompts 128 --request-rate 1 --goodput ttft:5000 tpot:50 --max-concurrency 32 --random-input-len 25000 --random-output-len 5000 --dataset-name random --ignore-eos --seed 5
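
For context on why the decode side might need to retract requests at all, here is a rough estimate of the peak KV-cache demand implied by the benchmark flags above. It uses only numbers from the commands. The even split of requests across DP ranks is an assumption, and the actual KV pool capacity is not known from this report, so this only sizes the demand.

```python
# Values taken from the benchmark and decode commands above.
max_concurrency = 32      # --max-concurrency 32
input_len = 25_000        # --random-input-len 25000
output_len = 5_000        # --random-output-len 5000 (with --ignore-eos)
dp_size = 8               # --dp-size 8 on the decode server

# Upper bound on tokens held per request once decoding finishes.
tokens_per_request = input_len + output_len

# Upper bound on tokens resident across all concurrent requests.
peak_tokens_total = max_concurrency * tokens_per_request

# Assumption: requests are spread evenly across the DP ranks.
peak_tokens_per_rank = peak_tokens_total / dp_size

if __name__ == "__main__":
    print(tokens_per_request, peak_tokens_total, peak_tokens_per_rank)
```

If the per-rank KV pool is smaller than this demand, retraction would be expected under load, and that is the code path that fails in the traceback.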

Environment

Prefill: 1 node, H20-141G
Decode: 1 node, H20-141G
