Refactor mm_utils to make it more general-purpose.#13130

Closed
yhyang201 wants to merge 6 commits into sgl-project:main from yhyang201:qwen3vl_refactor

Conversation

Collaborator

@yhyang201 yhyang201 commented Nov 12, 2025

Motivation

Removed the deepstack logic from mm_utils and refactored qwen3vl.

Preparing for multimodal piecewise CUDA graph.
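
The direction of the refactor can be sketched as follows: mm_utils keeps only the generic placeholder-merging logic, and anything model-specific (such as Qwen3-VL's deepstack handling) is injected by the model itself. This is a minimal illustrative sketch; the function and parameter names below are assumptions for this example, not SGLang's actual mm_utils signatures.

```python
# Illustrative sketch only: general_mm_embed_routine and the
# post_process hook are assumed names, not the real SGLang API.

def general_mm_embed_routine(input_ids, mm_embeds, placeholder_id,
                             post_process=None):
    """Merge multimodal embeddings into the token sequence.

    Model-specific handling (e.g. Qwen3-VL deepstack) is supplied
    via the optional post_process hook instead of being hard-coded
    in the shared utility.
    """
    if post_process is not None:
        mm_embeds = post_process(mm_embeds)
    merged, it = [], iter(mm_embeds)
    for tok in input_ids:
        # Placeholder tokens are replaced by vision embeddings;
        # ordinary tokens keep their text-embedding lookup.
        merged.append(next(it) if tok == placeholder_id else ("txt", tok))
    return merged

# A model passes its own hook; the utility stays model-agnostic.
seq = general_mm_embed_routine(
    [1, 99, 99, 2], ["imgA", "imgB"], placeholder_id=99,
    post_process=lambda embeds: [e.upper() for e in embeds])
print(seq)  # [('txt', 1), 'IMGA', 'IMGB', ('txt', 2)]
```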

Accuracy test:

main:
(accuracy results screenshot)

this PR:
(accuracy results screenshot)

Speed test:

python -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name image \
    --num-prompts 96 \
    --image-count 3 \
    --image-resolution 1080p \
    --random-input-len 512 \
    --random-output-len 512 \
    --max-concurrency 16

main:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     96
Benchmark duration (s):                  81.09
Total input tokens:                      615580
Total input text tokens:                 27484
Total input vision tokens:               588096
Total generated tokens:                  25821
Total generated tokens (retokenized):    16275
Request throughput (req/s):              1.18
Input token throughput (tok/s):          7591.67
Output token throughput (tok/s):         318.44
Total token throughput (tok/s):          7910.11
Concurrency:                             15.63
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13199.61
Median E2E Latency (ms):                 12783.86
---------------Time to First Token----------------
Mean TTFT (ms):                          3370.49
Median TTFT (ms):                        2350.90
P99 TTFT (ms):                           12031.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.74
Median TPOT (ms):                        37.67
P99 TPOT (ms):                           74.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           49.18
Median ITL (ms):                         9.23
P95 ITL (ms):                            262.85
P99 ITL (ms):                            1142.26
Max ITL (ms):                            9325.15
==================================================

this PR:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     96
Benchmark duration (s):                  81.06
Total input tokens:                      615639
Total input text tokens:                 27543
Total input vision tokens:               588096
Total generated tokens:                  25821
Total generated tokens (retokenized):    15091
Request throughput (req/s):              1.18
Input token throughput (tok/s):          7594.97
Output token throughput (tok/s):         318.55
Total token throughput (tok/s):          7913.51
Concurrency:                             15.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13208.09
Median E2E Latency (ms):                 12687.83
---------------Time to First Token----------------
Mean TTFT (ms):                          3184.26
Median TTFT (ms):                        2014.51
P99 TTFT (ms):                           11299.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.69
Median TPOT (ms):                        37.62
P99 TPOT (ms):                           66.99
---------------Inter-Token Latency----------------
Mean ITL (ms):                           51.16
Median ITL (ms):                         8.48
P95 ITL (ms):                            259.76
P99 ITL (ms):                            1477.37
Max ITL (ms):                            8078.35
==================================================


@gemini-code-assist
Contributor

Summary of Changes

Hello @yhyang201, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the multimodal utility functions to enhance their generality and prepare the system for advanced performance optimizations. It achieves this by extracting model-specific deepstack embedding logic from the shared mm_utils module and integrating it directly into the qwen3_vl model. This change streamlines the core multimodal embedding pipeline and sets the stage for future improvements like multimodal piecewise CUDA graph support.

Highlights

  • Decoupled Deepstack Logic: The deepstack embedding processing, previously intertwined with general multimodal utilities, has been moved into the qwen3_vl model's specific implementation.
  • Generalized Multimodal Utilities: Functions like embed_mm_inputs and general_mm_embed_routine in mm_utils.py are now more generic, removing model-specific parameters and logic related to deepstack.
  • Enhanced Qwen3-VL Multimodal Handling: The qwen3_vl.py model now includes a post_process method to manage and separate deepstack embeddings, which are then passed via the ForwardBatch object.
  • Preparation for CUDA Graph Optimization: This refactoring is explicitly stated as a step towards enabling multimodal piecewise CUDA graph integration, suggesting performance improvements.
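
The post_process separation described above can be illustrated with a toy sketch. The layout assumption (deepstack features concatenated along the hidden dimension, one slice per deepstack layer) and all names here are hypothetical, not the actual qwen3_vl.py implementation.

```python
# Hypothetical sketch of splitting deepstack embeddings off the
# main multimodal embedding; sizes and names are toy assumptions.

HIDDEN = 4          # toy hidden size
NUM_DEEPSTACK = 2   # toy number of deepstack layers

def post_process(mm_embeds):
    """Split per-token vectors of length HIDDEN * (1 + NUM_DEEPSTACK)
    into the main embedding plus one embedding list per deepstack
    layer; in the PR's design the latter travel on ForwardBatch."""
    main = [vec[:HIDDEN] for vec in mm_embeds]
    deepstack = [
        [vec[HIDDEN * (1 + i): HIDDEN * (2 + i)] for vec in mm_embeds]
        for i in range(NUM_DEEPSTACK)
    ]
    return main, deepstack

# Three toy vision tokens, each carrying main + 2 deepstack slices.
tokens = [list(range(t, t + HIDDEN * (1 + NUM_DEEPSTACK)))
          for t in range(3)]
main, deepstack = post_process(tokens)
print(main[0], deepstack[0][0], deepstack[1][0])
```

With this split, the shared embedding routine never needs to know the deepstack layout; only the model interprets the extra slices.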

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors mm_utils.py to make it more general-purpose by moving model-specific logic, like the deepstack implementation for Qwen3-VL, into the model's own file. This is a good architectural improvement. I've found a bug in the newly introduced post_process method in qwen3_vl.py and have suggested a fix that also improves the code's clarity and efficiency.

Review comment thread: python/sglang/srt/models/qwen3_vl.py
@yhyang201 yhyang201 marked this pull request as draft November 12, 2025 08:35
@yhyang201 yhyang201 removed the run-ci label Nov 12, 2025
@yhyang201 yhyang201 marked this pull request as ready for review November 12, 2025 12:43
@yhyang201 yhyang201 requested a review from yuan-luo November 12, 2025 12:57
@yuan-luo
Collaborator

yuan-luo commented Nov 12, 2025

Thanks @yhyang201. This is the right direction towards piecewise CUDA graph VLM support. I'm working on a refactor that moves inputs_embedding outside the model. Will keep you updated ASAP.

@yhyang201 yhyang201 closed this Nov 21, 2025
@yhyang201 yhyang201 deleted the qwen3vl_refactor branch April 16, 2026 07:15