[VLM] Refactor load_mm_data to improve performance#14644
mickqian merged 1 commit into sgl-project:main
Conversation
Gemini Code Assist summary: This pull request refactors the multimodal data loading mechanism to enhance performance and simplify the default processing path.
Code Review
This pull request refactors the load_mm_data method to improve performance by introducing a simpler, faster implementation that avoids prompt scanning. The old logic is preserved in legacy_load_mm_data. A new helper _submit_mm_data_loading_tasks_simple is added to support this.
My review focuses on the new load_mm_data implementation. I've suggested improvements to its signature to remove an unused parameter and correct a type hint. I've also provided a more formal docstring and a refactoring to reduce code duplication. These changes will improve the code's clarity and maintainability.
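The review above describes the new direct-loading path that skips prompt scanning. A minimal sketch of the idea, assuming hypothetical names (`load_mm_data_simple`, `loader`) rather than sglang's actual API:

```python
# Hypothetical sketch of the simplified loading path: instead of scanning
# the prompt for placeholder tokens, every incoming data item is submitted
# directly and the results are collected in submission order.
from concurrent.futures import ThreadPoolExecutor

def load_mm_data_simple(image_data, video_data, audio_data, loader):
    """Load all passed multimodal items directly, preserving input order."""
    items = [("image", d) for d in (image_data or [])]
    items += [("video", d) for d in (video_data or [])]
    items += [("audio", d) for d in (audio_data or [])]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(loader, modality, d) for modality, d in items]
    # Collecting in submission order keeps loaded item i aligned with
    # the i-th multimodal token in the prompt.
    return [f.result() for f in futures]
```

Because the prompt is never scanned, the cost is proportional to the number of data items, not the prompt length.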
/tag-and-rerun-ci
Force-pushed from 91351c0 to 9c6cc52
TestMiniCPMo26Server test case failed, unsurprisingly. Fixing: https://github.com/sgl-project/sglang/actions/runs/20085678223/job/57622671663?pr=14644
Force-pushed from 9c6cc52 to 1b945af
Force-pushed from 1b945af to be874c2
Force-pushed from be874c2 to ce186d0
Fixed by adding a fallback branch.
/rerun-failed-ci
Force-pushed from 5192073 to 1c4b8e8
/rerun-failed-ci

/tag-and-rerun-ci

/rerun-failed-ci
Force-pushed from 1c4b8e8 to ba4c1da
/tag-and-rerun-ci

/rerun-failed-ci
we should consider splitting
There's a test case that didn't pass; not sure whether it's related to this PR or not. I'll manually rerun it and follow up.
This is an error in transformers. It might be related to not rebasing on main when running CI. It should be fine. I will keep a close eye on CI.
The issue does exist in main CI. Investigating. |
I tested main: Qwen3-VL works correctly, but Qwen2.5-VL is broken.
Checking whether it is this PR that breaks Qwen2.5-VL. When the PR was submitted, Qwen2.5-VL was working correctly.
I reverted this PR in my local environment. The problem still exists. |
    ) -> BaseMultiModalProcessorOutput:
        """
        A fast version of `load_mm_data` that loads multimodal data directly.
        This version does not scan the prompt to recognize tokens. It assumes
Should we add a safety check, such as:

    expected_count = (
        len(image_data or []) +
        len(video_data or []) +
        len(audio_data or [])
    )
    assert expected_count == len(tokenizer(prompt))
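One way to make the suggested check concrete is to compare the data-item count against the number of multimodal placeholder tokens in the tokenized prompt, rather than against the total prompt length. A hedged sketch; the function name and `mm_token_ids` parameter are illustrative, not part of sglang:

```python
def check_mm_data_counts(prompt_token_ids, mm_token_ids,
                         image_data, video_data, audio_data):
    """Assert one data item per multimodal placeholder token in the prompt.

    Hypothetical helper: `mm_token_ids` is the set of token IDs that stand
    for an image/video/audio slot in the tokenized prompt.
    """
    expected_count = (
        len(image_data or [])
        + len(video_data or [])
        + len(audio_data or [])
    )
    placeholder_count = sum(1 for t in prompt_token_ids if t in mm_token_ids)
    assert expected_count == placeholder_count, (
        f"got {expected_count} data items but "
        f"{placeholder_count} placeholder tokens"
    )
```

This keeps the check O(prompt length) with no extra tokenizer call, since the caller already has the token IDs.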
There was a problem hiding this comment.
Will update in a new PR.
It's weird that manually running this test case
Running the test suite, it failed in the above test case:
I can still reproduce this error in main without this PR. I believe we can re-land this PR.
…ject#14644) Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
Inspired by @mickqian .
In the `load_mm_data` method, there's currently a redundant approach where we manually detect various tokens and then load them. This could be changed to simply loading all the passed data directly; we previously made it this way because the MiniCPM-o model has a mechanism for adjusting video frames. We decided to simplify it in order to improve the performance of the majority of VLM models.
For example, both `load_mm_data()` and `submit_data_loading_tasks()` contain a redundant `for text_part in text_parts` loop.

In the new implementation, `load_mm_data` works as "1 token → 1 data": it doesn't expand frames or rewrite the prompt; it just loads all the incoming data and aligns it with the tokens in order.

We keep the legacy `load_mm_data` dedicated to the MiniCPM model, to support MiniCPM's "1 token → multiple frames" behavior.

Log printing shows the result is as expected.
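The "1 token → 1 data" alignment, and the fallback when counts don't match, can be sketched as follows (the function and parameter names are hypothetical, not sglang's actual code):

```python
def align_tokens_to_data(prompt_token_ids, mm_token_id, loaded_items):
    """Map the i-th multimodal placeholder token to the i-th loaded item."""
    positions = [i for i, t in enumerate(prompt_token_ids)
                 if t == mm_token_id]
    if len(positions) != len(loaded_items):
        # Mismatch (e.g. MiniCPM's 1-token-to-many-frames case): the caller
        # would fall back to the legacy scanning path instead.
        raise ValueError("placeholder/data count mismatch")
    # Placeholder i gets data item i, in order, with no prompt rewriting.
    return dict(zip(positions, loaded_items))
```

The mapping is purely positional, which is why the fast path can skip token recognition entirely for models that obey the 1:1 convention.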
Server:
Client:
Modifications
Accuracy Tests
lmms_eval results show no drop.
Benchmarking and Profiling
As this function is a small portion of the whole pipeline, there's no significant improvement in e2e performance, but the function's own cost decreases.
Checklist