Describe the bug
As discussed in other issues, when sending requests with images to an sglang server hosting a VLM, memory usage continuously increases until an OOM occurs. This ticket tracks the issue and publishes our experimental setup and results.
Known Issue:
- The v0.4.9.post6 version has a severe bug that causes memory leakage for Visual Language Models when the fast image processor is used. We therefore strongly recommend avoiding this version.
Some partial guesses
- A memory leak appears consistently across both VLMs and LLMs, which suggests the problem likely originates in the language model component itself. VLMs hit OOM errors more often simply because they manage a larger memory footprint, including the ViT and its activations, and sometimes a fast image processor.
- This implies the memory growth comes from the non-static part of memory, e.g. temporary tensors (activations) or just memory fragmentation; see the monitoring sketch after this list.
- Interestingly, this behavior isn't universal. For instance, we haven't observed any leaks with text-only requests on the gpt-oss model.
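To separate the two candidate causes above (retained temporary tensors vs. allocator fragmentation), here is a minimal sketch of what could be logged inside the server process. `log_cuda_memory` is a hypothetical helper, not part of sglang; it would need to be called periodically from within the server (e.g. in the scheduler loop), since CUDA memory stats are per-process. If `memory_allocated()` stays flat while `memory_reserved()` grows, the growth is fragmentation in the PyTorch caching allocator; if `memory_allocated()` itself grows, tensors are actually being retained.

```python
import torch

def log_cuda_memory(tag: str = "") -> None:
    # memory_allocated(): bytes held by live tensors.
    # memory_reserved(): bytes held by the caching allocator,
    # including fragmented blocks that are not backing any tensor.
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[mem]{tag} allocated={allocated:.1f} MiB reserved={reserved:.1f} MiB")
```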
TODO
- Expand testing to include additional models like Llama and GLM to see if the leak persists. (help wanted)
- Add a test for VLMs without initializing the mm_processor
Reproduction
Follow the instructions in this gist. Here are some experiment results:
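As a rough illustration of the kind of load loop used (the exact script is in the gist), here is a sketch that repeatedly sends image requests to the OpenAI-compatible endpoint and records GPU memory from nvidia-smi. The port, model field, and image URL are placeholders, not values from the gist.

```python
import subprocess
import time

import requests

PORT = 30323  # match the --port passed to sglang.launch_server
IMAGE_URL = "https://example.com/test.jpg"  # placeholder; any reachable image works

def gpu_memory_mib(device: int = 0) -> int:
    """Query used GPU memory (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(device)]
    )
    return int(out.decode().strip())

for i in range(5000):
    requests.post(
        f"http://localhost:{PORT}/v1/chat/completions",
        json={
            "model": "default",  # sglang serves whatever model was launched
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }],
            "max_tokens": 32,
        },
        timeout=300,
    )
    if i % 50 == 0:
        print(f"req {i}: {gpu_memory_mib()} MiB used")
        time.sleep(1)
```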
Setup: Pure text model, gpt-oss 20b
python -m sglang.launch_server --model-path openai/gpt-oss-20b --port 30324 --disable-radix-cache
Setup: Pure text model, meta-llama/Llama-3.2-3B-Instruct
python -m sglang.launch_server --model-path meta-llama/Llama-3.2-3B-Instruct --port 30324 --disable-radix-cache
Setup: Pure text model, qwen2.5-3b
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B --port 30322 --disable-radix-cache

Setup: Qwen2.5_vl (w/ fast image processor), disable-radix-cache, no flush, image req, mm-attn-backend=fa3
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --port 30323 --mm-attention-backend fa3 --disable-radix-cache

Same behavior with --mm-attention-backend sdpa:

Setup: Qwen2.5_vl (w/ fast image processor), disable-radix-cache, no flush, text-only req
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --port 30324 --mm-attention-backend sdpa --disable-radix-cache
Observation: Apart from the constantly increasing memory usage, Qwen-VL also shows occasional surges.
Setup: Qwen2.5_vl (w/o fast image processor), disable-radix-cache, no flush, image req, fa3
(Ignore the peak near the end; the run was disrupted by another job.)

Setup: Internvl3 (w/o fast image processor), disable-radix-cache, no flush, image req
Observation: Memory usage constantly increases, about 20 MB per 500 requests.
Setup: Internvl3 (w/o fast image processor), disable-radix-cache, no flush, text only req
Oddly, we got two different results across runs.
python -m sglang.launch_server --model-path internlm/Intern-S1-mini --trust-remote-code --grammar-backend none --port 30323

Setup: Qwen2 (dense text model), disable-radix-cache, flush every 50 reqs
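For the flush-every-50-reqs setup, the client side looks roughly like the sketch below, assuming the server exposes the /flush_cache endpoint; the port and prompt are placeholders.

```python
import requests

PORT = 30322
for i in range(2000):
    requests.post(
        f"http://localhost:{PORT}/generate",
        json={"text": "Hello, world.", "sampling_params": {"max_new_tokens": 32}},
        timeout=300,
    )
    if (i + 1) % 50 == 0:
        # Drop the radix/KV cache between batches to rule out cache growth.
        requests.post(f"http://localhost:{PORT}/flush_cache")
```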
Environment
sglang==0.5.0rc2
transformers==4.55.2
Special thanks to @Swipe4057, @handoku, and @jinleic for their efforts in identifying this issue.