
[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack #14155

Merged
ByronHsu merged 4 commits into main from byron/refactor-pcg-mm on Dec 1, 2025
Conversation

ByronHsu (Collaborator) commented Nov 30, 2025

Motivation

The current way to enable PCG for VLMs is hacky and requires changing every VLM model file. This PR implements a cleaner approach that needs no model-file changes and incurs no performance drop.

Modifications

  1. Remove the external embedder hack from [VLM] Support Piecewise CUDA Graph for Qwen2.5-VL #13055.
  2. Pass input_embeds into the VLM as a pre-allocated buffer; the model copies the embedder output into this buffer at runtime (see the sketch below).
  3. Wrap the multimodal forward in a new context manager, use_original_ca_comm, which restores the original custom-allreduce (CA) communicator, since CA only needs to be disabled in the language model.
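
Conceptually, the flow looks like the following minimal sketch. This is illustrative only, not the sglang code: VLMSketch, the holder dict, and the stand-in encoder/LM layers are assumptions made for this example, and use_original_ca_comm here only mimics the real context manager's swap-and-restore behavior.

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def use_original_ca_comm(holder: dict, original_comm):
    # Temporarily restore the original custom-allreduce (CA) communicator.
    # CA only needs to be disabled inside the captured language model, so
    # the multimodal (vision) forward can keep using the original comm.
    saved = holder["ca_comm"]
    holder["ca_comm"] = original_comm
    try:
        yield
    finally:
        holder["ca_comm"] = saved


class VLMSketch(nn.Module):
    IMAGE_TOKEN_ID = 151655  # image-placeholder token id (illustrative)

    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.vision_tower = nn.Linear(hidden, hidden)    # stand-in encoder
        self.language_model = nn.Linear(hidden, hidden)  # stand-in LM

    def forward(self, input_ids, pixel_values, input_embeds, holder, original_comm):
        # The multimodal forward runs under the original CA comm ...
        with use_original_ca_comm(holder, original_comm):
            embeds = self.embed_tokens(input_ids)
            image_embeds = self.vision_tower(pixel_values)
            # Scatter image embeddings into the image-token slots; the caller
            # supplies one image embedding per image token.
            embeds[input_ids == self.IMAGE_TOKEN_ID] = image_embeds
        # ... then the result is copied (not reassigned) into the caller-owned
        # buffer, so the piecewise CUDA graph replays against a stable address.
        input_embeds.copy_(embeds)
        # Only the language model runs inside the captured graph region.
        return self.language_model(input_embeds)


# Usage: the runner allocates the buffer once and reuses it every step.
holder = {"ca_comm": None}                 # stand-in for the comm state
model = VLMSketch(vocab_size=152064, hidden=8)
input_ids = torch.tensor([1, 2, VLMSketch.IMAGE_TOKEN_ID, 3])
pixel_values = torch.randn(1, 8)           # one image token -> one embedding
input_embeds = torch.empty(4, 8)           # fixed buffer across graph replays
out = model(input_ids, pixel_values, input_embeds, holder, original_comm=None)
```

The essential detail is input_embeds.copy_(embeds): because the buffer is caller-owned and its address never changes, the graph can be captured against it once, and model files no longer need to hoist their embedder out of forward.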

Accuracy Tests

Following the setup from #13055. No accuracy drop.

2025-11-30 07:59:11 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:239 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=Qwen/Qwen2.5-VL-7B-Instruct), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.5067|±  |   N/A|
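
For reference, the run above can be reproduced with an lmms-eval invocation along these lines (reconstructed from the log header; the exact flags are an assumption, not copied from this PR):

```bash
python3 -m lmms_eval --model openai_compatible \
    --model_args model_version=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks mmmu_val --batch_size 16
```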

Benchmarking and Profiling

On H100

Following the setup from #13055. No performance drop after the use_original_ca_comm fix.
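
For reference, a bench_serving invocation along these lines matches the result headers below (reconstructed, not copied from the PR; dataset selection follows #13055 and is omitted here as an assumption):

```bash
python3 -m sglang.bench_serving --backend sglang-oai-chat \
    --num-prompts 256 --max-concurrency 32 --request-rate inf
```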

This PR

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  9.05      
Total input tokens:                      126333    
Total input text tokens:                 23421     
Total input vision tokens:               102912    
Total generated tokens:                  4541      
Total generated tokens (retokenized):    4539      
Request throughput (req/s):              28.30     
Input token throughput (tok/s):          13965.55  
Output token throughput (tok/s):         501.99    
Total token throughput (tok/s):          14467.54  
Concurrency:                             31.45     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1111.34   
Median E2E Latency (ms):                 1111.86   
---------------Time to First Token----------------
Mean TTFT (ms):                          351.08    
Median TTFT (ms):                        288.96    
P99 TTFT (ms):                           1069.64   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.58     
Median TPOT (ms):                        45.67     
P99 TPOT (ms):                           74.79     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           45.45     
Median ITL (ms):                         11.53     
P95 ITL (ms):                            240.69    
P99 ITL (ms):                            314.75    
Max ITL (ms):                            794.27    
==================================================

Main

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  9.20      
Total input tokens:                      126345    
Total input text tokens:                 23433     
Total input vision tokens:               102912    
Total generated tokens:                  4541      
Total generated tokens (retokenized):    4541      
Request throughput (req/s):              27.83     
Input token throughput (tok/s):          13733.83  
Output token throughput (tok/s):         493.61    
Total token throughput (tok/s):          14227.44  
Concurrency:                             31.13     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1118.60   
Median E2E Latency (ms):                 1097.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          362.51    
Median TTFT (ms):                        292.88    
P99 TTFT (ms):                           1055.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.40     
Median TPOT (ms):                        44.79     
P99 TPOT (ms):                           84.21     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           45.17     
Median ITL (ms):                         10.54     
P95 ITL (ms):                            255.83    
P99 ITL (ms):                            361.42    
Max ITL (ms):                            788.53    
==================================================

Profile

We can see the modules are still compiled and captured correctly.

[Screenshot: profiler trace showing the modules compiled and captured as expected]


@ByronHsu changed the title from "[piecewise] Pass in input_embeds buffer to the model to avoid moving embedder outside of the model file" to "[piecewise] Pass the input_embeds buffer to the model to avoid moving the embedder outside the model file" Nov 30, 2025
@ByronHsu force-pushed the byron/refactor-pcg-mm branch from c3ab849 to 7f524e3 November 30, 2025 07:25
@ByronHsu changed the title to "[piecewise] Refactor PCG VLM to support input embed buffer and remove external embedder hack" Nov 30, 2025
@ByronHsu changed the title to "[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack" Nov 30, 2025
@ByronHsu marked this pull request as ready for review November 30, 2025 07:29
@yuan-luo self-requested a review November 30, 2025 07:33
@yuan-luo added the Multi-modal and vlm labels Nov 30, 2025
@hebiao064 self-assigned this Nov 30, 2025
yuan-luo (Collaborator) commented Dec 1, 2025

Thanks for the refactor. LGTM.

yhyang201 (Collaborator) commented
/tag-and-rerun-ci

@ByronHsu merged commit 0825d7f into main on Dec 1, 2025
139 of 143 checks passed
@ByronHsu deleted the byron/refactor-pcg-mm branch December 1, 2025 05:43
@Lzhang-hub mentioned this pull request Dec 1, 2025
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
