Describe the bug
As discussed in other issues, when sending requests with images to an sglang server hosting a VLM, memory usage continuously increases until an OOM occurs. This ticket tracks the issue and publishes our experimental setup and results.
Known Issue:
- The v0.4.9.post6 version has a severe bug that causes memory leakage for Visual Language Models when the fast image processor is used. We therefore strongly recommend avoiding this version.
Some partial guesses
- A memory leak appears consistently across both VLMs and LLMs, which suggests the problem likely originates in the language model component itself. VLMs hit OOM errors more often simply because they manage a larger memory footprint, including the ViT and its activations, and sometimes a fast image processor.
- This implies the memory growth comes from the non-static part of memory, e.g. temporary tensors (activations) or just memory fragmentation; see the monitoring sketch after this list.
- Interestingly, this behavior isn't universal. For instance, we haven't observed any leaks with text-only requests on the gpt-oss model.
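To separate the two candidate causes above (retained temporary tensors vs. allocator fragmentation), here is a minimal sketch of what could be logged inside the server process. `log_cuda_memory` is a hypothetical helper, not part of sglang; it would need to be called periodically from within the server (e.g. in the scheduler loop), since CUDA memory stats are per-process. If `memory_allocated()` stays flat while `memory_reserved()` grows, the growth is fragmentation in the PyTorch caching allocator; if `memory_allocated()` itself grows, tensors are actually being retained.

```python
import torch

def log_cuda_memory(tag: str = "") -> None:
    # memory_allocated(): bytes held by live tensors.
    # memory_reserved(): bytes held by the caching allocator,
    # including fragmented blocks that are not backing any tensor.
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[mem]{tag} allocated={allocated:.1f} MiB reserved={reserved:.1f} MiB")
```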
TODO
- Expand testing to include additional models like Llama and GLM to see if the leak persists. (help wanted)
- Add a test for VLMs without initializing the mm_processor
Reproduction
Follow the instructions in this gist. Here are some experiment results:
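As a rough illustration of the kind of load loop used (the exact script is in the gist), here is a sketch that repeatedly sends image requests to the OpenAI-compatible endpoint and records GPU memory from nvidia-smi. The port, model field, and image URL are placeholders, not values from the gist.

```python
import subprocess
import time

import requests

PORT = 30323  # match the --port passed to sglang.launch_server
IMAGE_URL = "https://example.com/test.jpg"  # placeholder; any reachable image works

def gpu_memory_mib(device: int = 0) -> int:
    """Query used GPU memory (MiB) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(device)]
    )
    return int(out.decode().strip())

for i in range(5000):
    requests.post(
        f"http://localhost:{PORT}/v1/chat/completions",
        json={
            "model": "default",  # sglang serves whatever model was launched
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": IMAGE_URL}},
                    {"type": "text", "text": "Describe this image."},
                ],
            }],
            "max_tokens": 32,
        },
        timeout=300,
    )
    if i % 50 == 0:
        print(f"req {i}: {gpu_memory_mib()} MiB used")
        time.sleep(1)
```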
Setup: Pure text model, gpt-oss 20b
python -m sglang.launch_server --model-path openai/gpt-oss-20b --port 30324 --disable-radix-cache
Setup: Pure text model, meta-llama/Llama-3.2-3B-Instruct
python -m sglang.launch_server --model-path meta-llama/Llama-3.2-3B-Instruct --port 30324 --disable-radix-cache
Setup: Pure text model, qwen2.5-3b
python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B --port 30322 --disable-radix-cache

Setup: Qwen2.5_vl (w/ fast image processor), disable-radix-cache, no flush, image req, mm-attn-backend=fa3
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --port 30323 --mm-attention-backend fa3 --disable-radix-cache

Same behavior with --mm-attention-backend sdpa:

Setup: Qwen2.5_vl (w/ fast image processor), disable-radix-cache, no flush, text-only req
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --port 30324 --mm-attention-backend sdpa --disable-radix-cache
Observation: Apart from the constantly increasing memory usage, Qwen-VL also shows occasional surges.
Setup: Qwen2.5_vl (w/o fast image processor), disable-radix-cache, no flush, image req, fa3
(Ignore the peak near the end; the run was disrupted by another job.)

Setup: Internvl3 (w/o fast image processor), disable-radix-cache, no flush, image req
Observation: Memory usage constantly increases, about 20 MB per 500 requests.
Setup: Internvl3 (w/o fast image processor), disable-radix-cache, no flush, text only req
Oddly, we got two different results across runs.
python -m sglang.launch_server --model-path internlm/Intern-S1-mini --trust-remote-code --grammar-backend none --port 30323

Setup: Qwen2 (dense text model), disable-radix-cache, flush every 50 reqs
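For the flush-every-50-reqs setup, the client side looks roughly like the sketch below, assuming the server exposes the /flush_cache endpoint; the port and prompt are placeholders.

```python
import requests

PORT = 30322
for i in range(2000):
    requests.post(
        f"http://localhost:{PORT}/generate",
        json={"text": "Hello, world.", "sampling_params": {"max_new_tokens": 32}},
        timeout=300,
    )
    if (i + 1) % 50 == 0:
        # Drop the radix/KV cache between batches to rule out cache growth.
        requests.post(f"http://localhost:{PORT}/flush_cache")
```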
Environment
sglang==0.5.0rc2
transformers==4.55.2
Special thanks to @Swipe4057, @handoku, and @jinleic for their efforts in identifying this issue.