
[VLM] Introduce Cache for positional embedding ids for Qwen-VL family #14292

Merged
yuan-luo merged 2 commits into sgl-project:main from antgroup:qwen3_cache_rot_pos
Dec 4, 2025

Conversation

@yuan-luo (Collaborator) commented Dec 2, 2025

Motivation

Introduce a cache for the rot_pos_emb index computation to speed it up, and add a mixin class so the mechanism can be reused by other models. Caching the rotary position embedding indices yields a significant improvement. In addition, refining the index computation to use numpy gives extra speedup, improving E2E by 7%.
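A minimal sketch of what such a cache could look like, keyed by the image grid shape and using numpy for the host-side index math. The class and method names here are hypothetical, not the PR's actual implementation:

```python
import numpy as np
import torch


class RotaryPosIdCache:
    """Caches position-id tensors for 2D rotary embeddings, keyed by the
    image grid shape, so repeated grids skip the index computation.
    Illustrative sketch only; names and layout are hypothetical."""

    def __init__(self, max_entries: int = 128):
        self._cache: dict[tuple[int, int], torch.Tensor] = {}
        self._max_entries = max_entries

    def get_pos_ids(self, grid_h: int, grid_w: int) -> torch.Tensor:
        key = (grid_h, grid_w)
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        # Build the (h, w) index pairs with numpy: vectorized host-side
        # index math is typically faster than per-element torch ops.
        hpos = np.repeat(np.arange(grid_h), grid_w)  # 0,0,...,1,1,...
        wpos = np.tile(np.arange(grid_w), grid_h)    # 0,1,...,0,1,...
        pos_ids = torch.from_numpy(np.stack([hpos, wpos], axis=-1))
        if len(self._cache) >= self._max_entries:
            self._cache.pop(next(iter(self._cache)))  # evict oldest entry
        self._cache[key] = pos_ids
        return pos_ids
```

Since images with the same grid shape recur frequently across requests, a cache like this could be shared by several vision encoders through a mixin class, which is the reuse pattern the PR describes.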

Server:
$SGLANG_MM_FEATURE_CACHE_MB=4096 \
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=512 \
python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-8B-Instruct/ \
--host 0.0.0.0 --port 8188 --trust-remote-code --tp-size 2 --enable-cache-report \
--log-level info --max-running-requests 48 --mem-fraction-static 0.7 \
--chunked-prefill-size 8192  --attention-backend flashinfer --mm-attention-backend fa3 \
--log-level debug --log-requests --log-requests-level 1

Client:
$python3 -m sglang.bench_serving --backend sglang-oai-chat --dataset-name image \
  --num-prompts 500 --apply-chat-template --random-output-len 1 --random-input-len 100 \
  --image-resolution 1120x700 --image-format jpeg --image-count 1 --image-content random \
  --random-range-ratio 1 --port 8188 --max-concurrency 20

TTFT drops by about 2%.

PR:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  48.98     
Total input tokens:                      442616    
Total input text tokens:                 56615     
Total input vision tokens:               386001    
Total generated tokens:                  500       
Total generated tokens (retokenized):    500       
Request throughput (req/s):              10.21     
Input token throughput (tok/s):          9037.23   
Output token throughput (tok/s):         10.21     
Peak output token throughput (tok/s):    18.00     
Peak concurrent requests:                38        
Total token throughput (tok/s):          9047.44   
Concurrency:                             19.67     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1926.98   
Median E2E Latency (ms):                 1908.01   
---------------Time to First Token----------------
Mean TTFT (ms):                          1926.97   
Median TTFT (ms):                        1907.99   
P99 TTFT (ms):                           2565.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Main:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  49.58     
Total input tokens:                      442632    
Total input text tokens:                 56632     
Total input vision tokens:               386000    
Total generated tokens:                  500       
Total generated tokens (retokenized):    500       
Request throughput (req/s):              10.08     
Input token throughput (tok/s):          8927.52   
Output token throughput (tok/s):         10.08     
Peak output token throughput (tok/s):    20.00     
Peak concurrent requests:                40        
Total token throughput (tok/s):          8937.61   
Concurrency:                             19.67     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1950.94   
Median E2E Latency (ms):                 1895.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          1950.93   
Median TTFT (ms):                        1895.20   
P99 TTFT (ms):                           2609.81   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@yhyang201 (Collaborator)

A quick question — is a 35s TTFT workload considered reasonable?

@yuan-luo (Collaborator, Author) commented Dec 2, 2025

> A quick question — is a 35s TTFT workload considered reasonable?

This is just a benchmark test that feeds in a considerable amount of multi-modal data, e.g. 7 images in one request.

@yudian0504 (Contributor)

> A quick question — is a 35s TTFT workload considered reasonable?
>
> This is just a benchmark test that feeds in a considerable amount of multi-modal data, e.g. 7 images in one request.

What are the timing comparisons for send_one?

@yuan-luo (Collaborator, Author) commented Dec 2, 2025

> A quick question — is a 35s TTFT workload considered reasonable?
>
> This is just a benchmark test that feeds in a considerable amount of multi-modal data, e.g. 7 images in one request.
>
> What are the timing comparisons for send_one?

E2E latency improved by 7% for send_one.

@yuan-luo (Collaborator, Author) commented Dec 2, 2025

/tag-and-rerun-ci

import torch


class RotaryPosMixin:
Collaborator

I’m not certain this is the right place for this class. Would there be a more appropriate location for it?

@yuan-luo (Collaborator, Author) replied Dec 3, 2025

This class is only used by models, and more models will reuse it going forward, so I'd prefer to keep it in this folder. Or do you have another suggestion?

Comment thread on python/sglang/srt/models/qwen3_vl.py (outdated)
@yuan-luo yuan-luo force-pushed the qwen3_cache_rot_pos branch from e873ff9 to 3068c73 Compare December 3, 2025 11:59
@BBuf (Collaborator) left a comment

LGTM.

@yuan-luo yuan-luo force-pushed the qwen3_cache_rot_pos branch from 3068c73 to 8d7f5ed Compare December 3, 2025 15:19
@yuan-luo yuan-luo force-pushed the qwen3_cache_rot_pos branch from 8d7f5ed to 5af1887 Compare December 4, 2025 02:15
@yudian0504 (Contributor)

From the data you posted above, only the TTFT metric has been optimized, but both TPOT and E2E times have worsened?

@yuan-luo (Collaborator, Author) commented Dec 4, 2025

> From the data you posted above, only the TTFT metric has been optimized, but both TPOT and E2E times have worsened?

This vision encoder change mainly targets prefill and is unrelated to TPOT. The earlier regression was likely run-to-run noise; I retested the PR and TPOT is now more stable.

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  68.59     
Total input tokens:                      442557    
Total input text tokens:                 56557     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49819     
Request throughput (req/s):              7.29      
Input token throughput (tok/s):          6452.28   
Output token throughput (tok/s):         728.98    
Peak output token throughput (tok/s):    1986.00   
Peak concurrent requests:                41        
Total token throughput (tok/s):          7181.26   
Concurrency:                             19.65     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2696.13   
Median E2E Latency (ms):                 2717.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          1504.30   
Median TTFT (ms):                        1638.13   
P99 TTFT (ms):                           2196.11   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.04     
Median TPOT (ms):                        10.47     
P99 TPOT (ms):                           25.12     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           12.53     
Median ITL (ms):                         6.33      
P95 ITL (ms):                            22.87     
P99 ITL (ms):                            79.58     
Max ITL (ms):                            1748.90   
==================================================

@yudian0504 (Contributor)

> From the data you posted above, only the TTFT metric has been optimized, but both TPOT and E2E times have worsened?
>
> This vision encoder change mainly targets prefill and is unrelated to TPOT. The earlier regression was likely run-to-run noise; I retested the PR and TPOT is now more stable.

Can we restart the engine each time and run three separate tests?

@yuan-luo (Collaborator, Author) commented Dec 4, 2025

I set the output length to 0, restarted the server, and retested. TTFT improved by 2%.



Benchmark:
$python3 -m sglang.bench_serving --backend sglang-oai-chat --dataset-name image \
  --num-prompts 500 --apply-chat-template --random-output-len 1 --random-input-len 100 \
  --image-resolution 1120x700 --image-format jpeg --image-count 1 --image-content random \
  --random-range-ratio 1 --port 8188 --max-concurrency 20


@yuan-luo (Collaborator, Author) commented Dec 4, 2025

After many test runs, I found the variance was caused by num-prompts being too small (100). With num-prompts set to 500, TPOT is stable.

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  68.98     
Total input tokens:                      442607    
Total input text tokens:                 56607     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49900     
Request throughput (req/s):              7.25      
Input token throughput (tok/s):          6416.63   
Output token throughput (tok/s):         724.87    
Peak output token throughput (tok/s):    1973.00   
Peak concurrent requests:                40        
Total token throughput (tok/s):          7141.50   
Concurrency:                             19.71     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2719.13   
Median E2E Latency (ms):                 2796.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          1412.12   
Median TTFT (ms):                        1577.80   
P99 TTFT (ms):                           2246.61   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.20     
Median TPOT (ms):                        12.22     
P99 TPOT (ms):                           25.46     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.91     
Median ITL (ms):                         6.33      
P95 ITL (ms):                            46.99     
P99 ITL (ms):                            137.08    
Max ITL (ms):                            1725.75   
==================================================

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  69.61     
Total input tokens:                      442600    
Total input text tokens:                 56600     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49861     
Request throughput (req/s):              7.18      
Input token throughput (tok/s):          6358.58   
Output token throughput (tok/s):         718.32    
Peak output token throughput (tok/s):    1978.00   
Peak concurrent requests:                40        
Total token throughput (tok/s):          7076.91   
Concurrency:                             19.71     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2744.26   
Median E2E Latency (ms):                 2835.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          1501.72   
Median TTFT (ms):                        1630.36   
P99 TTFT (ms):                           2210.19   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.55     
Median TPOT (ms):                        11.50     
P99 TPOT (ms):                           25.65     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.20     
Median ITL (ms):                         6.33      
P95 ITL (ms):                            26.37     
P99 ITL (ms):                            117.48    
Max ITL (ms):                            1663.39   
==================================================

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  71.41     
Total input tokens:                      442668    
Total input text tokens:                 56668     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49805     
Request throughput (req/s):              7.00      
Input token throughput (tok/s):          6198.68   
Output token throughput (tok/s):         700.15    
Peak output token throughput (tok/s):    1935.00   
Peak concurrent requests:                40        
Total token throughput (tok/s):          6898.83   
Concurrency:                             19.71     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2815.27   
Median E2E Latency (ms):                 2880.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          1542.45   
Median TTFT (ms):                        1726.60   
P99 TTFT (ms):                           2734.81   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.86     
Median TPOT (ms):                        11.62     
P99 TPOT (ms):                           27.60     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.55     
Median ITL (ms):                         6.32      
P95 ITL (ms):                            31.19     
P99 ITL (ms):                            120.84    
Max ITL (ms):                            1767.98   
==================================================

@yuan-luo yuan-luo merged commit b2b09f5 into sgl-project:main Dec 4, 2025
139 of 143 checks passed
@yuan-luo yuan-luo deleted the qwen3_cache_rot_pos branch December 4, 2025 05:30
tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025