[Bug] sglang decode out of memory #4602

### Checklist

- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [x] 5. Please use English, otherwise it will be closed.

### Describe the bug

The SGLang server crashes with a "Decode out of memory" error when `--page-size` is set to a value other than 1. As shown in the log below, 2048 token slots are available for new tokens, yet an attempt to allocate 148 tokens fails.

`RuntimeError: Decode out of memory. Try to lower your batch size.
Try to allocate 148 tokens.
Avaliable tokens: 2048
self.token_to_kv_pool_allocator.available_size()=2048
self.tree_cache.evictable_size()=0`
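
One possible explanation, offered as a guess rather than a confirmed root cause: with `--page-size 64` the KV cache is presumably handed out in whole pages, so a request whose last page is full needs 64 free slots to decode one more token. The toy model below is not SGLang code. The function names and the assumption that all 148 requests sit exactly on a page boundary are hypothetical. It only shows how a token-count check can pass while a page-granular allocation cannot.

```python
def pages_needed_for_decode(seq_lens, page_size):
    """Count requests whose next token would start a new page.

    A request with n cached tokens has a full last page when
    n % page_size == 0, so its next token needs a fresh page.
    """
    return sum(1 for n in seq_lens if n % page_size == 0)


def can_decode(seq_lens, page_size, free_tokens):
    """Page-aware check: compare whole pages needed with whole pages free."""
    free_pages = free_tokens // page_size
    return pages_needed_for_decode(seq_lens, page_size) <= free_pages


# Numbers from the log: 148 tokens requested, 2048 token slots free.
# Hypothetical assumption: every request is exactly on a page boundary.
batch = [256] * 148

naive_ok = len(batch) <= 2048            # token-count check: 148 <= 2048
paged_ok = can_decode(batch, 64, 2048)   # 148 pages needed, 2048 // 64 = 32 free

print(naive_ok, paged_ok)  # True False
```

With `page_size=1` the same batch passes both checks, which would match the observation that the crash only appears when `--page-size` is not 1.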

### Reproduction

server command:
`python -m sglang.launch_server --mem-fraction-static 0.90 --model-path <deepseek-v3/deepseek-r1> --trust-remote-code --tp-size 8 --disable-cuda-graph --page-size 64`

client command:
`python -m sglang.bench_serving --dataset-name random --dataset-path <path-to-sharegpt> --random-range-ratio 1 --random-input-len 200 --random-output-len 200 --num-prompts 256`
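
For reference, a small sketch of where this workload would cross page boundaries. It assumes every prompt is exactly 200 tokens after tokenization and that requests decode in lockstep, neither of which I have verified. The helper `boundary_decode_steps` is made up for illustration. If the assumptions hold, many requests would need a fresh 64-token page on the same decode step.

```python
def boundary_decode_steps(input_len, output_len, page_size):
    """Return the 1-based decode steps whose new token opens a fresh page."""
    steps = []
    for step in range(1, output_len + 1):
        cached = input_len + step - 1  # tokens already stored before this step
        if cached % page_size == 0:    # last page is full, next token needs a new one
            steps.append(step)
    return steps


# --random-input-len 200, --random-output-len 200, --page-size 64
steps = boundary_decode_steps(200, 200, 64)
print(steps)  # [57, 121, 185]
```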

### Environment

INFO 03-20 02:45:53 __init__.py:194] No platform detected, vLLM is running on UnspecifiedPlatform
Python: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.77
CUDA Driver Version: 535.183.06
PyTorch: 2.5.1+cu124
sgl_kernel: 0.0.5
flashinfer: 0.2.3+cu124torch2.5
triton: 3.1.0
transformers: 4.48.3
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.115.11
hf_transfer: 0.1.9
huggingface_hub: 0.29.3
interegular: 0.3.3
modelscope: 1.24.0
orjson: 3.10.15
packaging: 23.2
psutil: 6.0.0
pydantic: 2.9.2
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.7.2
openai: 1.66.5
tiktoken: 0.9.0
anthropic: 0.49.0
decord: 0.6.0

