Refactor mm_utils to make it more general-purpose.#13130

Closed
yhyang201 wants to merge 6 commits into sgl-project:main from yhyang201:qwen3vl_refactor

Conversation

Collaborator

@yhyang201 yhyang201 commented Nov 12, 2025

Motivation

Removed the deepstack logic from mm_utils and refactored qwen3vl.

Preparing for multimodal piecewise CUDA graph.
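
The direction of the refactor can be sketched as follows: mm_utils keeps only the generic placeholder-merging logic, and anything model-specific (such as Qwen3-VL's deepstack handling) is injected by the model itself. This is a minimal illustrative sketch; the function and parameter names below are assumptions for this example, not SGLang's actual mm_utils signatures.

```python
# Illustrative sketch only: general_mm_embed_routine and the
# post_process hook are assumed names, not the real SGLang API.

def general_mm_embed_routine(input_ids, mm_embeds, placeholder_id,
                             post_process=None):
    """Merge multimodal embeddings into the token sequence.

    Model-specific handling (e.g. Qwen3-VL deepstack) is supplied
    via the optional post_process hook instead of being hard-coded
    in the shared utility.
    """
    if post_process is not None:
        mm_embeds = post_process(mm_embeds)
    merged, it = [], iter(mm_embeds)
    for tok in input_ids:
        # Placeholder tokens are replaced by vision embeddings;
        # ordinary tokens keep their text-embedding lookup.
        merged.append(next(it) if tok == placeholder_id else ("txt", tok))
    return merged

# A model passes its own hook; the utility stays model-agnostic.
seq = general_mm_embed_routine(
    [1, 99, 99, 2], ["imgA", "imgB"], placeholder_id=99,
    post_process=lambda embeds: [e.upper() for e in embeds])
print(seq)  # [('txt', 1), 'IMGA', 'IMGB', ('txt', 2)]
```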

Accuracy test:

main:
(accuracy results screenshot)

this PR:
(accuracy results screenshot)

Speed test:

python -m sglang.bench_serving \
    --backend sglang-oai-chat \
    --dataset-name image \
    --num-prompts 96 \
    --image-count 3 \
    --image-resolution 1080p \
    --random-input-len 512 \
    --random-output-len 512 \
    --max-concurrency 16

main:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     96
Benchmark duration (s):                  81.09
Total input tokens:                      615580
Total input text tokens:                 27484
Total input vision tokens:               588096
Total generated tokens:                  25821
Total generated tokens (retokenized):    16275
Request throughput (req/s):              1.18
Input token throughput (tok/s):          7591.67
Output token throughput (tok/s):         318.44
Total token throughput (tok/s):          7910.11
Concurrency:                             15.63
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13199.61
Median E2E Latency (ms):                 12783.86
---------------Time to First Token----------------
Mean TTFT (ms):                          3370.49
Median TTFT (ms):                        2350.90
P99 TTFT (ms):                           12031.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.74
Median TPOT (ms):                        37.67
P99 TPOT (ms):                           74.00
---------------Inter-Token Latency----------------
Mean ITL (ms):                           49.18
Median ITL (ms):                         9.23
P95 ITL (ms):                            262.85
P99 ITL (ms):                            1142.26
Max ITL (ms):                            9325.15
==================================================

this PR:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    inf
Max request concurrency:                 16
Successful requests:                     96
Benchmark duration (s):                  81.06
Total input tokens:                      615639
Total input text tokens:                 27543
Total input vision tokens:               588096
Total generated tokens:                  25821
Total generated tokens (retokenized):    15091
Request throughput (req/s):              1.18
Input token throughput (tok/s):          7594.97
Output token throughput (tok/s):         318.55
Total token throughput (tok/s):          7913.51
Concurrency:                             15.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   13208.09
Median E2E Latency (ms):                 12687.83
---------------Time to First Token----------------
Mean TTFT (ms):                          3184.26
Median TTFT (ms):                        2014.51
P99 TTFT (ms):                           11299.38
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.69
Median TPOT (ms):                        37.62
P99 TPOT (ms):                           66.99
---------------Inter-Token Latency----------------
Mean ITL (ms):                           51.16
Median ITL (ms):                         8.48
P95 ITL (ms):                            259.76
P99 ITL (ms):                            1477.37
Max ITL (ms):                            8078.35
==================================================


@gemini-code-assist
Contributor

Summary of Changes

Hello @yhyang201, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the multimodal utility functions to enhance their generality and prepare the system for advanced performance optimizations. It achieves this by extracting model-specific deepstack embedding logic from the shared mm_utils module and integrating it directly into the qwen3_vl model. This change streamlines the core multimodal embedding pipeline and sets the stage for future improvements like multimodal piecewise CUDA graph support.

Highlights

  • Decoupled Deepstack Logic: The deepstack embedding processing, previously intertwined with general multimodal utilities, has been moved into the qwen3_vl model's specific implementation.
  • Generalized Multimodal Utilities: Functions like embed_mm_inputs and general_mm_embed_routine in mm_utils.py are now more generic, removing model-specific parameters and logic related to deepstack.
  • Enhanced Qwen3-VL Multimodal Handling: The qwen3_vl.py model now includes a post_process method to manage and separate deepstack embeddings, which are then passed via the ForwardBatch object.
  • Preparation for CUDA Graph Optimization: This refactoring is explicitly stated as a step towards enabling multimodal piecewise CUDA graph integration, suggesting performance improvements.
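
The post_process separation described above can be illustrated with a toy sketch. The layout assumption (deepstack features concatenated along the hidden dimension, one slice per deepstack layer) and all names here are hypothetical, not the actual qwen3_vl.py implementation.

```python
# Hypothetical sketch of splitting deepstack embeddings off the
# main multimodal embedding; sizes and names are toy assumptions.

HIDDEN = 4          # toy hidden size
NUM_DEEPSTACK = 2   # toy number of deepstack layers

def post_process(mm_embeds):
    """Split per-token vectors of length HIDDEN * (1 + NUM_DEEPSTACK)
    into the main embedding plus one embedding list per deepstack
    layer; in the PR's design the latter travel on ForwardBatch."""
    main = [vec[:HIDDEN] for vec in mm_embeds]
    deepstack = [
        [vec[HIDDEN * (1 + i): HIDDEN * (2 + i)] for vec in mm_embeds]
        for i in range(NUM_DEEPSTACK)
    ]
    return main, deepstack

# Three toy vision tokens, each carrying main + 2 deepstack slices.
tokens = [list(range(t, t + HIDDEN * (1 + NUM_DEEPSTACK)))
          for t in range(3)]
main, deepstack = post_process(tokens)
print(main[0], deepstack[0][0], deepstack[1][0])
```

With this split, the shared embedding routine never needs to know the deepstack layout; only the model interprets the extra slices.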

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors mm_utils.py to make it more general-purpose by moving model-specific logic, like the deepstack implementation for Qwen3-VL, into the model's own file. This is a good architectural improvement. I've found a bug in the newly introduced post_process method in qwen3_vl.py and have suggested a fix that also improves the code's clarity and efficiency.

Review comment thread: python/sglang/srt/models/qwen3_vl.py
@yhyang201 yhyang201 marked this pull request as draft November 12, 2025 08:35
@yhyang201 yhyang201 removed the run-ci label Nov 12, 2025
@yhyang201 yhyang201 marked this pull request as ready for review November 12, 2025 12:43
@yhyang201 yhyang201 requested a review from yuan-luo November 12, 2025 12:57
@yuan-luo
Collaborator

yuan-luo commented Nov 12, 2025

Thanks @yhyang201. This is the right direction towards piecewise CUDA graph VLM support. I'm working on a refactor that moves inputs_embedding outside the model. Will keep you updated ASAP.

@yhyang201 yhyang201 closed this Nov 21, 2025
@yhyang201 yhyang201 deleted the qwen3vl_refactor branch April 16, 2026 07:15