[Bug] offload_kv_cache error #7819

### Checklist

- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.

### Describe the bug

```
usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 96.87, #queue-req: 0
[2025-07-04 16:59:49 DP4 TP4] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 2817, in run_scheduler_process
    scheduler.event_loop_normal_disagg_decode()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/disaggregation/decode.py", line 645, in event_loop_normal_disagg_decode
    batch = self.get_next_disagg_decode_batch_to_run()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/disaggregation/decode.py", line 794, in get_next_disagg_decode_batch_to_run
    self.running_batch = self.update_running_batch(self.running_batch)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 1736, in update_running_batch
    retracted_reqs, new_token_ratio = batch.retract_decode(self.server_args)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/schedule_batch.py", line 1449, in retract_decode
    req.offload_kv_cache(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/schedule_batch.py", line 776, in offload_kv_cache
    self.kv_cache_cpu = token_to_kv_pool_allocator.get_cpu_copy(token_indices)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/mem_cache/allocator.py", line 81, in get_cpu_copy
    raise NotImplementedError()
NotImplementedError

[2025-07-04 16:59:55] Child process unexpectedly failed with exitcode=131. pid=19676
```
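The last frame of the traceback is a `get_cpu_copy` method that raises `NotImplementedError`, reached from `retract_decode` via `offload_kv_cache`. The sketch below shows this failure pattern in isolation. The class names `BaseAllocator`, `PagedAllocator`, `OffloadingAllocator` and the `retract` helper are hypothetical stand-ins, not sglang's actual classes. The sketch assumes only that the allocator in use does not override the base method.

```python
class BaseAllocator:
    """Hypothetical stand-in for an allocator base class."""

    def get_cpu_copy(self, indices):
        # No implementation in the base class, as in the last traceback frame.
        raise NotImplementedError()


class PagedAllocator(BaseAllocator):
    """Hypothetical subclass that never overrides get_cpu_copy."""


class OffloadingAllocator(BaseAllocator):
    """Hypothetical subclass that implements the CPU copy."""

    def __init__(self, kv_pool):
        self.kv_pool = kv_pool  # maps token index -> KV entry

    def get_cpu_copy(self, indices):
        return [self.kv_pool[i] for i in indices]


def retract(allocator, token_indices):
    """Mimic a retraction path that offloads a request's KV entries."""
    try:
        return allocator.get_cpu_copy(token_indices)
    except NotImplementedError:
        # In the reported run this exception is not caught,
        # so the scheduler process exits instead.
        return None


missing = retract(PagedAllocator(), [0, 1])
working = retract(OffloadingAllocator({0: "kv0", 1: "kv1"}), [0, 1])
print(missing, working)
```

Under this reading, the crash would only appear once a request is actually retracted, which may be why the server runs normally until then.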

### Reproduction

The commands I used are below.
```
# prefill
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server \
--attention-backend flashmla --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
--model-path /data/models/DeepSeek-R1/ \
--tp 8 --disaggregation-mode prefill \
--host --port 30300 --trust-remote-code --enable-deepep-moe --deepep-mode normal --disable-radix-cache --max-running-requests 8 --chunked-prefill-size 0 \
--trust-remote-code --watchdog-timeout 1000000 \
--mem-fraction-static 0.8 \
--show-time-cost --kv-cache-dtype fp8_e4m3 --page-size 64

# decode
GLOO_SOCKET_IFNAME=eth0 NCCL_IB_HCA=mlx5_ NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_IB_GID_INDEX=3 \
SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=1 \
NCCL_MIN_NCHANNELS=24 \
NCCL_IB_QPS_PER_CONNECTION=8 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
python3 -m sglang.launch_server \
--attention-backend flashmla \
--model-path /data/models/DeepSeek-R1/ \
--tp 8 --disaggregation-mode decode --disaggregation-ib-device "mlx5_1,mlx5_2,mlx5_3,mlx5_4" \
--host --port 30300 --trust-remote-code --dist-init-addr --enable-deepep-moe --deepep-mode low_latency --disable-radix-cache --mem-fraction-static 0.7 --max-running-requests 256 --moe-dense-tp-size 1 --cuda-graph-bs 1 2 4 8 10 12 14 16 18 20 22 24 26 28 30 32 --watchdog-timeout 1000000 \
--enable-dp-attention --dp-size 8 \
--context-length 40000 \
--trust-remote-code --page-size 64 \
--show-time-cost --enable-dp-lm-head \
--speculative-algo NEXTN --speculative-num-steps 3 --speculative-eagle-topk 2 --speculative-num-draft-tokens 4 \
--kv-cache-dtype fp8_e4m3

# load balancer
python3 -m sglang.srt.disaggregation.mini_lb --prefill --decode --port 8000 --neat-room
```


`python3 benchmark_serving.py --backend vllm --model /data/models/DeepSeek-R1/  --base-url http://127.0.0.1:8000/ --endpoint /v1/completions --num-prompts 128 --request-rate 1 --goodput ttft:5000 tpot:50 --max-concurrency 32 --random-input-len 25000 --random-output-len 5000 --dataset-name random --ignore-eos --seed 5`
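As a rough gauge of how much KV cache this benchmark can demand on the decode side, the arithmetic below uses only the flag values from the commands above. It is a sketch under stated assumptions: the random lengths are treated as their nominal values, requests are assumed to spread evenly across DP ranks, and speculative draft tokens are ignored. The actual KV pool capacity is not given in this report, so the sketch does not show whether the pool was exhausted.

```python
# Values taken from the benchmark command in this report.
max_concurrency = 32
input_len = 25_000
output_len = 5_000

# Value taken from the decode launch command (--dp-size 8).
dp_size = 8

# Tokens held by one request once it has generated its full output.
tokens_per_request = input_len + output_len

# Upper bound across all in-flight requests.
peak_tokens_total = max_concurrency * tokens_per_request

# Assumption: requests are spread evenly across DP ranks.
requests_per_rank = max_concurrency // dp_size
peak_tokens_per_rank = requests_per_rank * tokens_per_request

print(tokens_per_request, peak_tokens_total, peak_tokens_per_rank)
```

If the per-rank figure exceeds what the decode KV pool can hold, the scheduler would need to retract requests, which is the code path shown in the traceback.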

### Environment

- Prefill: 1 node, H20-141G
- Decode: 1 node, H20-141G

