vlm: Refactor engine vlm params and support processor output as input #14091
mickqian merged 51 commits into sgl-project:main
Conversation
Summary of Changes
Hello @minleminzui, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request refactors the engine's VLM parameter handling to introduce more flexible and performant multimodal input options. By supporting raw images, processor outputs, and precomputed embeddings, it allows users to optimize VLM queries based on their specific needs, from quick prototyping to high-throughput serving. The changes are thoroughly documented and tested, ensuring robust integration with existing VLM models.
@zhaochenyang20
Code Review
This pull request refactors the VLM parameters to support three different input formats for image data: raw images, processor outputs, and precomputed embeddings. The changes are well-implemented across the documentation, core logic, and tests. The new Jupyter notebook tutorial is a great addition, providing clear examples for the new features. My review includes a few suggestions to improve the consistency of the documentation and enhance the user-friendliness of error messages in the processor logic.
Force-pushed from 5049fc2 to dd1dbf9
@BenYao21 Please take a review and rewrite the description.
Move this #10532 to here.
Force-pushed from 9e4c798 to ba327ab
/rerun-failed-ci
The LLaVA models on HF (lmms-lab/LLaVA-OneVision-1.5-8B-Instruct, liuhaotian/llava-v1.5-7b) currently encounter weight-loading issues (in config.json), which cause failures in test_vision_openai_server_a and test_chunked_prefill. We are skipping these tests until the weight information is updated in the upstream repositories. @zhaochenyang20
Summary: Replace implicit boolean check using `or` with explicit `None` check to prevent RuntimeError when `second_per_grid_ts` is a multi-element tensor. Details: - In `process_mm_data_async`, `getattr(ret, "second_per_grid_ts", None)` can return a tensor. - Using `or` triggers a boolean evaluation of the tensor, causing "RuntimeError: Boolean value of Tensor with more than one value is ambiguous". - Fixed by explicitly checking if the value is `None` before falling back to `video_second_per_grid`.
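For illustration, a minimal sketch of the pattern this commit fixes; the `Ret` class and the `video_second_per_grid` value here are stand-ins, not the actual sglang objects:

```python
import torch

class Ret:
    # Stand-in for the processor result described above.
    second_per_grid_ts = torch.tensor([1.0, 2.0])  # multi-element tensor

ret = Ret()
video_second_per_grid = 0.5

# Buggy pattern: `or` evaluates the tensor's truthiness, raising
# "RuntimeError: Boolean value of Tensor with more than one value is ambiguous".
# value = getattr(ret, "second_per_grid_ts", None) or video_second_per_grid

# Fixed pattern: check for None explicitly before falling back.
value = getattr(ret, "second_per_grid_ts", None)
if value is None:
    value = video_second_per_grid
```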
Force-pushed from 83b6aa4 to 54e28e8
/rerun-failed-ci try again
/rerun-failed-ci
Add force_download=True to TestLlama4ImageUnderstandsImage to fix safetensors EOF error caused by broken CI cache.
… single-GPU runners
Force-pushed from 628554f to 5740488
/rerun-failed-ci
Great job!
…sgl-project#14091) Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: zhaochenyang20 <zhaochenyang20@gmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: BenYao21 <cyao22@asu.edu> Co-authored-by: minleminzui <minleminzui@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: 赵晨阳 <zhaochen20@outlook.com>
Summary
This PR refactors how the offline Engine handles multimodal / VLM inputs and exposes a consistent `image_data` API across models.

Motivation

Previously, each VLM had its own ad-hoc way to pass images (raw pixel values, custom dicts, etc.), and multi-image requests could easily break the engine path. We want a single, well-defined `image_data` contract that works for:

- raw images (`PIL`, `numpy`, `torch`)
- processor outputs
- precomputed embeddings

What’s Changed
Engine API
Clarified and unified the supported formats for `image_data` in `Engine.generate` / `async_generate` (see the sketch below), including:

- `image_data=[image]` or `image_data=[[image1, image2, ...]]` for raw images
- `image_data=[dict(processor_output, format="processor_output")]`
- `image_data=[dict(processor_output, format="precomputed_embedding", feature=precomputed_embeddings)]`

Centralized multimodal validation / normalization in the scheduler (`mm_utils`, `schedule_batch`) so batched and multi-image requests follow the same path.
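A hedged sketch of how these three formats could be passed to the offline engine; the model path, prompt text, and the commented processor variables are assumptions for illustration, not part of this PR:

```python
import sglang as sgl
from PIL import Image

# Assumed model path for illustration; any supported VLM should work.
engine = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")
image = Image.open("taxi.png")

# 1) Raw image(s): a flat list for one image per request,
#    or a nested list for a single multi-image request.
out = engine.generate(prompt="Describe the image.", image_data=[image])

# 2) Processor output: the HF processor's dict, tagged with its format.
#    `processor_output` would come from the model's AutoProcessor.
# out = engine.generate(
#     prompt="Describe the image.",
#     image_data=[dict(processor_output, format="processor_output")],
# )

# 3) Precomputed embeddings: reuse vision features computed offline,
#    skipping the vision tower at request time.
# out = engine.generate(
#     prompt="Describe the image.",
#     image_data=[dict(processor_output, format="precomputed_embedding",
#                      feature=precomputed_embeddings)],
# )
```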
Model / Processor Side
- Models use a shared `MultimodalInputFormat` helper instead of model-specific dicts.
- On `refactor-engine-vlm-params`, the engine now correctly handles multiple images per request.

Tests
Added `test_vlm_input_format.py`, which verifies that both Qwen2.5-VL and Gemma-3-VLM work with:

- `image_data=[PIL.Image]`
- `image_data` with `format="processor_output"`
- `image_data` with `format="precomputed_embedding"`

It also checks that the model can correctly understand a 2-image input (e.g., taxi + SGL logo) for all three code paths; a sketch of such a check follows.
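A minimal sketch of what the 2-image check might look like; the engine setup, image file names, and assertion strings are hypothetical stand-ins for the actual test:

```python
import sglang as sgl
from PIL import Image

def test_two_image_raw_input():
    # Hypothetical setup; the real test covers Qwen2.5-VL and Gemma-3-VLM.
    engine = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")
    images = [Image.open("taxi.png"), Image.open("sgl_logo.png")]

    # A nested list sends both images in a single request.
    out = engine.generate(
        prompt="Describe what is in each of the two images.",
        image_data=[images],
    )
    text = out["text"].lower()
    assert "taxi" in text or "cab" in text  # first image understood
    assert "logo" in text                   # second image understood
    engine.shutdown()
```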
Docs
Updated `docs/advanced_features/vlm_query.ipynb` to cover the new `Engine` API input formats.

Testing