
[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack #14155

Merged
ByronHsu merged 4 commits into main from byron/refactor-pcg-mm on Dec 1, 2025
Conversation

ByronHsu (Collaborator) commented Nov 30, 2025

Motivation

The current way to enable PCG for VLMs is hacky and requires changing every VLM model file. This PR implements a cleaner approach that needs no model-file changes and incurs no performance drop.

Modifications

  1. Remove the external embedder hack from [VLM] Support Piecewise CUDA Graph for Qwen2.5-VL #13055.
  2. Pass input_embeds into the VLM as a pre-allocated buffer; the model copies the embedder output into this buffer at runtime (see the sketch below).
  3. Wrap the multimodal forward in a new context manager, use_original_ca_comm, which restores the original custom-allreduce (CA) communicator, since CA only needs to be disabled in the language model.
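
Conceptually, the flow looks like the following minimal sketch. This is illustrative only, not the sglang code: VLMSketch, the holder dict, and the stand-in encoder/LM layers are assumptions made for this example, and use_original_ca_comm here only mimics the real context manager's swap-and-restore behavior.

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def use_original_ca_comm(holder: dict, original_comm):
    # Temporarily restore the original custom-allreduce (CA) communicator.
    # CA only needs to be disabled inside the captured language model, so
    # the multimodal (vision) forward can keep using the original comm.
    saved = holder["ca_comm"]
    holder["ca_comm"] = original_comm
    try:
        yield
    finally:
        holder["ca_comm"] = saved


class VLMSketch(nn.Module):
    IMAGE_TOKEN_ID = 151655  # image-placeholder token id (illustrative)

    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden)
        self.vision_tower = nn.Linear(hidden, hidden)    # stand-in encoder
        self.language_model = nn.Linear(hidden, hidden)  # stand-in LM

    def forward(self, input_ids, pixel_values, input_embeds, holder, original_comm):
        # The multimodal forward runs under the original CA comm ...
        with use_original_ca_comm(holder, original_comm):
            embeds = self.embed_tokens(input_ids)
            image_embeds = self.vision_tower(pixel_values)
            # Scatter image embeddings into the image-token slots; the caller
            # supplies one image embedding per image token.
            embeds[input_ids == self.IMAGE_TOKEN_ID] = image_embeds
        # ... then the result is copied (not reassigned) into the caller-owned
        # buffer, so the piecewise CUDA graph replays against a stable address.
        input_embeds.copy_(embeds)
        # Only the language model runs inside the captured graph region.
        return self.language_model(input_embeds)


# Usage: the runner allocates the buffer once and reuses it every step.
holder = {"ca_comm": None}                 # stand-in for the comm state
model = VLMSketch(vocab_size=152064, hidden=8)
input_ids = torch.tensor([1, 2, VLMSketch.IMAGE_TOKEN_ID, 3])
pixel_values = torch.randn(1, 8)           # one image token -> one embedding
input_embeds = torch.empty(4, 8)           # fixed buffer across graph replays
out = model(input_ids, pixel_values, input_embeds, holder, original_comm=None)
```

The essential detail is input_embeds.copy_(embeds): because the buffer is caller-owned and its address never changes, the graph can be captured against it once, and model files no longer need to hoist their embedder out of forward.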

Accuracy Tests

Following the setup from #13055. No accuracy drop.

2025-11-30 07:59:11 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:239 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=Qwen/Qwen2.5-VL-7B-Instruct), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.5067|±  |   N/A|
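
For reference, the run above can be reproduced with an lmms-eval invocation along these lines (reconstructed from the log header; the exact flags are an assumption, not copied from this PR):

```bash
python3 -m lmms_eval --model openai_compatible \
    --model_args model_version=Qwen/Qwen2.5-VL-7B-Instruct \
    --tasks mmmu_val --batch_size 16
```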

Benchmarking and Profiling

On H100

Following the setup from #13055. No performance drop after the use_original_ca_comm fix.
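
For reference, a bench_serving invocation along these lines matches the result headers below (reconstructed, not copied from the PR; dataset selection follows #13055 and is omitted here as an assumption):

```bash
python3 -m sglang.bench_serving --backend sglang-oai-chat \
    --num-prompts 256 --max-concurrency 32 --request-rate inf
```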

This PR

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  9.05      
Total input tokens:                      126333    
Total input text tokens:                 23421     
Total input vision tokens:               102912    
Total generated tokens:                  4541      
Total generated tokens (retokenized):    4539      
Request throughput (req/s):              28.30     
Input token throughput (tok/s):          13965.55  
Output token throughput (tok/s):         501.99    
Total token throughput (tok/s):          14467.54  
Concurrency:                             31.45     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1111.34   
Median E2E Latency (ms):                 1111.86   
---------------Time to First Token----------------
Mean TTFT (ms):                          351.08    
Median TTFT (ms):                        288.96    
P99 TTFT (ms):                           1069.64   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.58     
Median TPOT (ms):                        45.67     
P99 TPOT (ms):                           74.79     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           45.45     
Median ITL (ms):                         11.53     
P95 ITL (ms):                            240.69    
P99 ITL (ms):                            314.75    
Max ITL (ms):                            794.27    
==================================================

Main

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf       
Max request concurrency:                 32        
Successful requests:                     256       
Benchmark duration (s):                  9.20      
Total input tokens:                      126345    
Total input text tokens:                 23433     
Total input vision tokens:               102912    
Total generated tokens:                  4541      
Total generated tokens (retokenized):    4541      
Request throughput (req/s):              27.83     
Input token throughput (tok/s):          13733.83  
Output token throughput (tok/s):         493.61    
Total token throughput (tok/s):          14227.44  
Concurrency:                             31.13     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   1118.60   
Median E2E Latency (ms):                 1097.04   
---------------Time to First Token----------------
Mean TTFT (ms):                          362.51    
Median TTFT (ms):                        292.88    
P99 TTFT (ms):                           1055.66   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          42.40     
Median TPOT (ms):                        44.79     
P99 TPOT (ms):                           84.21     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           45.17     
Median ITL (ms):                         10.54     
P95 ITL (ms):                            255.83    
P99 ITL (ms):                            361.42    
Max ITL (ms):                            788.53    
==================================================

Profile

We can see the modules are still compiled and captured correctly.

[Screenshot: profiler trace showing the modules compiled and captured as expected]


@ByronHsu changed the title from "[piecewise] Pass in input_embeds buffer to the model to avoid moving embedder outside of the model file" to "[piecewise] Pass the input_embeds buffer to the model to avoid moving the embedder outside the model file" Nov 30, 2025
@ByronHsu force-pushed the byron/refactor-pcg-mm branch from c3ab849 to 7f524e3 November 30, 2025 07:25
@ByronHsu changed the title to "[piecewise] Refactor PCG VLM to support input embed buffer and remove external embedder hack" Nov 30, 2025
@ByronHsu changed the title to "[piecewise] Refactor VLM to support input embed buffer and remove external embedder hack" Nov 30, 2025
@ByronHsu marked this pull request as ready for review November 30, 2025 07:29
@yuan-luo self-requested a review November 30, 2025 07:33
@yuan-luo added the Multi-modal and vlm labels Nov 30, 2025
@hebiao064 self-assigned this Nov 30, 2025
yuan-luo (Collaborator) commented Dec 1, 2025

Thanks for the refactor. LGTM.

yhyang201 (Collaborator) commented
/tag-and-rerun-ci

@ByronHsu merged commit 0825d7f into main on Dec 1, 2025
139 of 143 checks passed
@ByronHsu deleted the byron/refactor-pcg-mm branch December 1, 2025 05:43
@Lzhang-hub mentioned this pull request Dec 1, 2025
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request Dec 4, 2025
tonyluj pushed a commit to openanolis/sglang that referenced this pull request Dec 5, 2025
