
[VLM] Introduce Cache for positional embedding ids for Qwen-VL family #14292

Merged
yuan-luo merged 2 commits into sgl-project:main from antgroup:qwen3_cache_rot_pos
Dec 4, 2025

Conversation

@yuan-luo (Collaborator) commented Dec 2, 2025

Motivation

Introduce a cache for the rot_pos_emb index computation to speed it up, and add a mixin class so the mechanism can be reused by other models. Caching the rotary position embedding indices yields a significant improvement. In addition, refining the index computation to use numpy gives extra speedup, improving E2E by 7%.
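A minimal sketch of what such a cache could look like, keyed by the image grid shape and using numpy for the host-side index math. The class and method names here are hypothetical, not the PR's actual implementation:

```python
import numpy as np
import torch


class RotaryPosIdCache:
    """Caches position-id tensors for 2D rotary embeddings, keyed by the
    image grid shape, so repeated grids skip the index computation.
    Illustrative sketch only; names and layout are hypothetical."""

    def __init__(self, max_entries: int = 128):
        self._cache: dict[tuple[int, int], torch.Tensor] = {}
        self._max_entries = max_entries

    def get_pos_ids(self, grid_h: int, grid_w: int) -> torch.Tensor:
        key = (grid_h, grid_w)
        cached = self._cache.get(key)
        if cached is not None:
            return cached
        # Build the (h, w) index pairs with numpy: vectorized host-side
        # index math is typically faster than per-element torch ops.
        hpos = np.repeat(np.arange(grid_h), grid_w)  # 0,0,...,1,1,...
        wpos = np.tile(np.arange(grid_w), grid_h)    # 0,1,...,0,1,...
        pos_ids = torch.from_numpy(np.stack([hpos, wpos], axis=-1))
        if len(self._cache) >= self._max_entries:
            self._cache.pop(next(iter(self._cache)))  # evict oldest entry
        self._cache[key] = pos_ids
        return pos_ids
```

Since images with the same grid shape recur frequently across requests, a cache like this could be shared by several vision encoders through a mixin class, which is the reuse pattern the PR describes.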

Server:
$SGLANG_MM_FEATURE_CACHE_MB=4096 \
SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
SGLANG_VLM_CACHE_SIZE_MB=512 \
python -m sglang.launch_server --model-path /home/admin/Qwen3-VL-8B-Instruct/ \
--host 0.0.0.0 --port 8188 --trust-remote-code --tp-size 2 --enable-cache-report \
--log-level info --max-running-requests 48 --mem-fraction-static 0.7 \
--chunked-prefill-size 8192  --attention-backend flashinfer --mm-attention-backend fa3 \
--log-level debug --log-requests --log-requests-level 1

Client:
$python3 -m sglang.bench_serving --backend sglang-oai-chat --dataset-name image \
  --num-prompts 500 --apply-chat-template --random-output-len 1 --random-input-len 100 \
  --image-resolution 1120x700 --image-format jpeg --image-count 1 --image-content random \
  --random-range-ratio 1 --port 8188 --max-concurrency 20

TTFT drops by about 2%.

PR:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  48.98     
Total input tokens:                      442616    
Total input text tokens:                 56615     
Total input vision tokens:               386001    
Total generated tokens:                  500       
Total generated tokens (retokenized):    500       
Request throughput (req/s):              10.21     
Input token throughput (tok/s):          9037.23   
Output token throughput (tok/s):         10.21     
Peak output token throughput (tok/s):    18.00     
Peak concurrent requests:                38        
Total token throughput (tok/s):          9047.44   
Concurrency:                             19.67     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1926.98   
Median E2E Latency (ms):                 1908.01   
---------------Time to First Token----------------
Mean TTFT (ms):                          1926.97   
Median TTFT (ms):                        1907.99   
P99 TTFT (ms):                           2565.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Main:
============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  49.58     
Total input tokens:                      442632    
Total input text tokens:                 56632     
Total input vision tokens:               386000    
Total generated tokens:                  500       
Total generated tokens (retokenized):    500       
Request throughput (req/s):              10.08     
Input token throughput (tok/s):          8927.52   
Output token throughput (tok/s):         10.08     
Peak output token throughput (tok/s):    20.00     
Peak concurrent requests:                40        
Total token throughput (tok/s):          8937.61   
Concurrency:                             19.67     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1950.94   
Median E2E Latency (ms):                 1895.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          1950.93   
Median TTFT (ms):                        1895.20   
P99 TTFT (ms):                           2609.81   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          0.00      
Median TPOT (ms):                        0.00      
P99 TPOT (ms):                           0.00      
---------------Inter-Token Latency----------------
Mean ITL (ms):                           0.00      
Median ITL (ms):                         0.00      
P95 ITL (ms):                            0.00      
P99 ITL (ms):                            0.00      
Max ITL (ms):                            0.00      
==================================================

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@yhyang201 (Collaborator)

A quick question — is a 35s TTFT workload considered reasonable?

@yuan-luo (Collaborator, Author) commented Dec 2, 2025

> A quick question — is a 35s TTFT workload considered reasonable?

This is just a benchmark test that feeds in a considerable amount of multi-modal data, e.g. 7 images in one request.

@yudian0504 (Contributor)

> A quick question — is a 35s TTFT workload considered reasonable?
>
> This is just a benchmark test that feeds in a considerable amount of multi-modal data, e.g. 7 images in one request.

What are the timing comparisons for send_one?

@yuan-luo (Collaborator, Author) commented Dec 2, 2025

> A quick question — is a 35s TTFT workload considered reasonable?
>
> This is just a benchmark test that feeds in a considerable amount of multi-modal data, e.g. 7 images in one request.
>
> What are the timing comparisons for send_one?

E2E latency improved by 7% for send_one.

@yuan-luo (Collaborator, Author) commented Dec 2, 2025

/tag-and-rerun-ci

import torch


class RotaryPosMixin:
Collaborator

I’m not certain this is the right place for this class. Would there be a more appropriate location for it?

@yuan-luo (Collaborator, Author) replied Dec 3, 2025

This class is only used by models, and more models will reuse it going forward, so I'd prefer to keep it in this folder. Or do you have another suggestion?

Comment thread on python/sglang/srt/models/qwen3_vl.py (outdated)
@yuan-luo yuan-luo force-pushed the qwen3_cache_rot_pos branch from e873ff9 to 3068c73 Compare December 3, 2025 11:59
@BBuf (Collaborator) left a comment

LGTM.

@yuan-luo yuan-luo force-pushed the qwen3_cache_rot_pos branch from 3068c73 to 8d7f5ed Compare December 3, 2025 15:19
@yuan-luo yuan-luo force-pushed the qwen3_cache_rot_pos branch from 8d7f5ed to 5af1887 Compare December 4, 2025 02:15
@yudian0504 (Contributor)

From the data you posted above, only the TTFT metric has been optimized, but both TPOT and E2E times have worsened?

@yuan-luo (Collaborator, Author) commented Dec 4, 2025

> From the data you posted above, only the TTFT metric has been optimized, but both TPOT and E2E times have worsened?

This vision encoder change mainly targets prefill and is unrelated to TPOT. The earlier regression was likely run-to-run noise; I retested the PR and TPOT is now more stable.

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  68.59     
Total input tokens:                      442557    
Total input text tokens:                 56557     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49819     
Request throughput (req/s):              7.29      
Input token throughput (tok/s):          6452.28   
Output token throughput (tok/s):         728.98    
Peak output token throughput (tok/s):    1986.00   
Peak concurrent requests:                41        
Total token throughput (tok/s):          7181.26   
Concurrency:                             19.65     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2696.13   
Median E2E Latency (ms):                 2717.70   
---------------Time to First Token----------------
Mean TTFT (ms):                          1504.30   
Median TTFT (ms):                        1638.13   
P99 TTFT (ms):                           2196.11   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.04     
Median TPOT (ms):                        10.47     
P99 TPOT (ms):                           25.12     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           12.53     
Median ITL (ms):                         6.33      
P95 ITL (ms):                            22.87     
P99 ITL (ms):                            79.58     
Max ITL (ms):                            1748.90   
==================================================

@yudian0504 (Contributor)

> From the data you posted above, only the TTFT metric has been optimized, but both TPOT and E2E times have worsened?
>
> This vision encoder change mainly targets prefill and is unrelated to TPOT. The earlier regression was likely run-to-run noise; I retested the PR and TPOT is now more stable.

Can we restart the engine each time and run three separate tests?

@yuan-luo (Collaborator, Author) commented Dec 4, 2025

I set the output length to 0, restarted the server, and retested. TTFT improved by 2%.



Benchmark:
$python3 -m sglang.bench_serving --backend sglang-oai-chat --dataset-name image \
  --num-prompts 500 --apply-chat-template --random-output-len 1 --random-input-len 100 \
  --image-resolution 1120x700 --image-format jpeg --image-count 1 --image-content random \
  --random-range-ratio 1 --port 8188 --max-concurrency 20


@yuan-luo (Collaborator, Author) commented Dec 4, 2025

After many test runs, I found the variance was caused by num-prompts being too small (100). With num-prompts set to 500, TPOT is stable.

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  68.98     
Total input tokens:                      442607    
Total input text tokens:                 56607     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49900     
Request throughput (req/s):              7.25      
Input token throughput (tok/s):          6416.63   
Output token throughput (tok/s):         724.87    
Peak output token throughput (tok/s):    1973.00   
Peak concurrent requests:                40        
Total token throughput (tok/s):          7141.50   
Concurrency:                             19.71     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2719.13   
Median E2E Latency (ms):                 2796.18   
---------------Time to First Token----------------
Mean TTFT (ms):                          1412.12   
Median TTFT (ms):                        1577.80   
P99 TTFT (ms):                           2246.61   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.20     
Median TPOT (ms):                        12.22     
P99 TPOT (ms):                           25.46     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.91     
Median ITL (ms):                         6.33      
P95 ITL (ms):                            46.99     
P99 ITL (ms):                            137.08    
Max ITL (ms):                            1725.75   
==================================================

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  69.61     
Total input tokens:                      442600    
Total input text tokens:                 56600     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49861     
Request throughput (req/s):              7.18      
Input token throughput (tok/s):          6358.58   
Output token throughput (tok/s):         718.32    
Peak output token throughput (tok/s):    1978.00   
Peak concurrent requests:                40        
Total token throughput (tok/s):          7076.91   
Concurrency:                             19.71     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2744.26   
Median E2E Latency (ms):                 2835.02   
---------------Time to First Token----------------
Mean TTFT (ms):                          1501.72   
Median TTFT (ms):                        1630.36   
P99 TTFT (ms):                           2210.19   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.55     
Median TPOT (ms):                        11.50     
P99 TPOT (ms):                           25.65     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.20     
Median ITL (ms):                         6.33      
P95 ITL (ms):                            26.37     
P99 ITL (ms):                            117.48    
Max ITL (ms):                            1663.39   
==================================================

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 20        
Successful requests:                     500       
Benchmark duration (s):                  71.41     
Total input tokens:                      442668    
Total input text tokens:                 56668     
Total input vision tokens:               386000    
Total generated tokens:                  50000     
Total generated tokens (retokenized):    49805     
Request throughput (req/s):              7.00      
Input token throughput (tok/s):          6198.68   
Output token throughput (tok/s):         700.15    
Peak output token throughput (tok/s):    1935.00   
Peak concurrent requests:                40        
Total token throughput (tok/s):          6898.83   
Concurrency:                             19.71     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2815.27   
Median E2E Latency (ms):                 2880.00   
---------------Time to First Token----------------
Mean TTFT (ms):                          1542.45   
Median TTFT (ms):                        1726.60   
P99 TTFT (ms):                           2734.81   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          12.86     
Median TPOT (ms):                        11.62     
P99 TPOT (ms):                           27.60     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           13.55     
Median ITL (ms):                         6.32      
P95 ITL (ms):                            31.19     
P99 ITL (ms):                            120.84    
Max ITL (ms):                            1767.98   
==================================================

@yuan-luo yuan-luo merged commit b2b09f5 into sgl-project:main Dec 4, 2025
139 of 143 checks passed
@yuan-luo yuan-luo deleted the qwen3_cache_rot_pos branch December 4, 2025 05:30
tom-jerr pushed a commit to tom-jerr/sglang that referenced this pull request Dec 4, 2025
yingluosanqian pushed a commit to yingluosanqian/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
yuchengz816-bot pushed a commit to yuchengz816-bot/sglang that referenced this pull request Dec 8, 2025
Kevin-XiongC pushed a commit to novitalabs/sglang that referenced this pull request Dec 9, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 12, 2025