
[Bug] [Tracking] VLM/LLM OOM related issues #9365

@JustinTong0323

Description


Describe the bug

As discussed in other issues, when requests with images are sent to an sglang server hosting a VLM, memory usage keeps increasing until an OOM occurs. This ticket tracks the issue and publishes our experimental setup and results.

Known Issue:

  1. Version v0.4.9.post6 has a severe bug that causes a memory leak in vision-language models when the fast image processor is used, so we strongly recommend avoiding this version. (See the sketch below for checking which processor variant a model loads.)
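
A quick way to see which image-processor variant a given checkpoint resolves to (this sketch is not from the original report; the printed class names are illustrative and depend on the installed transformers version):

```python
# Hedged sketch: inspect which image processor class transformers loads for a
# VLM checkpoint. On the affected sglang version, the *Fast variant is the one
# associated with the leak.
from transformers import AutoProcessor

proc = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
# Prints e.g. "Qwen2VLImageProcessorFast" (fast path) or "Qwen2VLImageProcessor"
# (slow path); exact names depend on the transformers version installed.
print(type(proc.image_processor).__name__)
```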

Some preliminary guesses

  • The memory leak appears consistently across both VLMs and LLMs, which suggests the problem likely originates in the language-model component itself. VLMs hit OOM errors more often simply because they carry a larger memory footprint, including the ViT and its activations, and sometimes a fast image processor.

  • This implies the growth comes from the non-static part of memory, e.g. temporary tensors (activations) or simply memory fragmentation; see the sketch after this list for one way to tell the two apart.

  • Interestingly, this behavior isn't universal. For instance, we haven't observed any leaks with text-only requests on the gpt-oss model.
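
One rough way to separate the two hypotheses on the serving process, assuming a CUDA device and PyTorch's caching allocator (a sketch added here for illustration, not part of the original experiments): if memory_allocated() stays flat while memory_reserved() keeps climbing, fragmentation is the more likely cause; if both climb, some tensor is being retained.

```python
# Sketch: compare live-tensor memory vs. allocator-reserved memory over time.
import torch

def log_cuda_memory(tag: str, device: int = 0) -> None:
    allocated = torch.cuda.memory_allocated(device) / 1024**2  # live tensors, MiB
    reserved = torch.cuda.memory_reserved(device) / 1024**2    # allocator pool, MiB
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")

# Example: call this from the scheduler / model-runner loop every N requests,
# e.g. log_cuda_memory(f"after {n} reqs").
```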

TODO

  • Expand testing to include additional models like Llama and GLM to see if the leak persists. (help wanted)
  • Add a test for VLMs without initializing the mm_processor.

Reproduction

Follow the instructions in this gist. A hedged sketch of the request loop is shown below, followed by the experiment results.
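
The sketch below approximates that loop; the gist is authoritative. Assumptions not in the original report: the server exposes the OpenAI-compatible /v1/chat/completions route, GPU memory is sampled with nvidia-smi, and the image URL is a placeholder to be replaced with any test image.

```python
# Hedged reproduction sketch: send many identical image requests and print GPU
# memory every 50 requests to watch for monotonic growth.
import subprocess
import requests

URL = "http://127.0.0.1:30323/v1/chat/completions"  # match the --port of the setup
IMAGE_URL = "https://example.com/test.png"          # placeholder; use any test image

def gpu_memory_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().split()[0])

for i in range(1, 2001):
    payload = {
        "model": "Qwen/Qwen2.5-VL-3B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": IMAGE_URL}},
            ],
        }],
        "max_tokens": 32,
    }
    requests.post(URL, json=payload, timeout=120)
    if i % 50 == 0:
        print(f"after {i} reqs: {gpu_memory_mib()} MiB used")
```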

Setup: Pure text model, gpt-oss 20b

python -m sglang.launch_server --model-path openai/gpt-oss-20b --port 30324 --disable-radix-cache

[memory usage plot]

Setup: Pure text model, meta-llama/Llama-3.2-3B-Instruct

python -m sglang.launch_server --model-path meta-llama/Llama-3.2-3B-Instruct --port 30324 --disable-radix-cache

[memory usage plot]

Setup: Pure text model, qwen2.5-3b

python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B --port 30322 --disable-radix-cache
[memory usage plot]

Setup: Qwen2.5_vl (w/ fast image processor), disable-radix-cache, no flush, image req, mm-attn-backend=fa3

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --port 30323 --mm-attention-backend fa3 --disable-radix-cache
[memory usage plot]
Same with backend=sdpa:
[memory usage plot]

Setup: Qwen2.5_vl (w/ fast image processor), disable-radix-cache, no flush, text-only req

python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --port 30324 --mm-attention-backend sdpa --disable-radix-cache
Observation: Besides the steadily increasing memory usage, Qwen2.5-VL also shows occasional surges.

[memory usage plot]

Setup: Qwen2.5_vl (w/o fast image processor), disable-radix-cache, no flush, image req, fa3

(Ignore the peak near the end; the run was disrupted by another job.)
[memory usage plot]

Setup: Internvl3 (w/o fast image processor), disable-radix-cache, no flush, image req

Observation: Memory usage increases steadily, by roughly 20 MB per 500 requests.

[memory usage plot]

Setup: Internvl3 (w/o fast image processor), disable-radix-cache, no flush, text only req

Oddly, two different results were observed across runs:
python -m sglang.launch_server --model-path internlm/Intern-S1-mini --trust-remote-code --grammar-backend none --port 30323
[memory usage plot, run 1]

[memory usage plot, run 2]

Setup: Qwen2 (dense text model), disable-radix-cache, flush every 50 reqs

[memory usage plot]
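
A hedged sketch of the flush-every-50-requests client side (assumptions: the server's /flush_cache route clears the radix/KV cache and accepts a plain GET, and the model name below is an illustrative Qwen2-class checkpoint, not necessarily the one used in this run):

```python
# Sketch: text-only load with a cache flush every 50 requests.
import requests

BASE = "http://127.0.0.1:30324"  # match the server port

for i in range(1, 2001):
    requests.post(
        f"{BASE}/v1/chat/completions",
        json={
            "model": "Qwen/Qwen2-7B-Instruct",  # illustrative Qwen2 dense model
            "messages": [{"role": "user", "content": "Write a short poem."}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    if i % 50 == 0:
        requests.get(f"{BASE}/flush_cache", timeout=30)  # periodic cache flush
```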

Environment

sglang==0.5.0rc2
transformers==4.55.2

Special thanks to @Swipe4057, @handoku, and @jinleic for their efforts in identifying this issue.
