vlm: Refactor engine vlm params and support processor output as input #10532
zhaochenyang20 wants to merge 18 commits into main from refactor-engine-vlm-params
Conversation
Summary of Changes
Hello @zhaochenyang20, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces significant refactoring to the engine's Vision-Language Model (VLM) parameter handling. The primary goal is to enhance flexibility and efficiency by allowing users to provide multimodal data in more advanced, pre-processed forms, such as direct processor outputs or precomputed embeddings. This change streamlines the VLM inference pipeline, reduces redundant processing, and provides clearer interfaces for multimodal input, as reflected in updated API docstrings and expanded test coverage.
Highlights
- Expanded VLM Input Formats: The engine now explicitly supports passing pre-processed multimodal data, including raw processor outputs and precomputed embeddings, directly to the `generate` API, enhancing flexibility for VLM tasks.
- Refactored Multimodal Data Handling: Core logic for processing and embedding multimodal data has been updated to accommodate these new input formats, including the introduction of a `MultimodalInputFormat` enum and a `format` field in `MultimodalDataItem` for clearer categorization (a minimal sketch of these types follows below).
- Improved Documentation and Testing: Docstrings for the `generate` functions are updated to reflect the new input capabilities, and VLM input format tests are expanded to cover multiple image inputs and the new processor output and precomputed embedding formats.
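To make the new categorization concrete, here is a minimal sketch of what the enum and the tagged data item could look like. The member names and fields below are assumptions for illustration, not the exact definitions introduced by this PR:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Any, Dict, Optional


class MultimodalInputFormat(Enum):
    """Categories of multimodal input (member names are illustrative)."""

    RAW_IMAGES = auto()             # PIL images, numpy arrays, torch tensors
    PROCESSOR_OUTPUT = auto()       # dict returned by a HuggingFace processor
    PRECOMPUTED_EMBEDDING = auto()  # embeddings computed ahead of time


@dataclass
class MultimodalDataItem:
    """One multimodal input plus the `format` tag used to route it (illustrative)."""

    format: MultimodalInputFormat
    data: Optional[Any] = None                         # raw image(s), if any
    processor_output: Optional[Dict[str, Any]] = None  # pre-processed tensors
    feature: Optional[Any] = None                      # precomputed embedding
```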
Code Review
This pull request refactors how Vision-Language Model (VLM) parameters are handled, introducing new formats for `image_data` such as `processor_output` and `precomputed_embedding`. The changes are extensive, touching documentation, engine entrypoints, multimodal utilities, and tests. My review focuses on ensuring the correctness and robustness of these changes. I've identified a few issues: some debugging print statements left in the code, a case of poor exception handling that could hide bugs, and a misleading comment and docstring. Overall, the refactoring seems to be going in the right direction, but these points should be addressed to improve code quality.
```python
# - Single image for a single request
# - List of images (one per request in a batch)
# - List of lists of images (multiple images per request)
# - List of preprocessed pixel values, each as a dict containing field `format`: 'processor_output' and `feature`: the preprocessed pixel values
```
The documentation for the `processor_output` format is misleading. It states that the dictionary should contain a `feature` key, but the implementation and examples show that the entire processor output dictionary is passed directly (e.g., `image_data=[dict(processor_output, format="processor_output")]`). Please update the docstring to accurately reflect this usage.
```diff
- # - List of preprocessed pixel values, each as a dict containing field `format`: 'processor_output' and `feature`: the preprocessed pixel values
+ # - List of preprocessed outputs from a Huggingface processor, each as a dict containing `format`: 'processor_output' and other data.
```
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
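For context, the usage this comment refers to passes the whole processor output dict through, merged with a `format` tag. A minimal sketch, assuming a Qwen2.5-VL checkpoint and a local image file as placeholders:

```python
from PIL import Image
from transformers import AutoProcessor

# Run the HuggingFace processor once, outside the engine.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
image = Image.open("taxi.jpg")
processor_output = processor(
    images=[image], text="Describe the image.", return_tensors="pt"
)

# The entire processor output is passed through, tagged with its format;
# there is no separate `feature` key for this format.
image_data = [dict(processor_output, format="processor_output")]
```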
Force-pushed from 3049d02 to 8459da7.
@JustinTong0323 @mickqian what's left for this PR 🤔

Yep. This PR is still valid. I will find someone to see it through.

I opened a follow-up PR #12755 to relax the test assertions in `test_vlm_input_format.py`. @zhaochenyang20 please review it.

@minleminzui I think the change is almost correct. Could you please check whether we should modify the document? (I updated it two months ago, so it may need further changes now.) Also, please rebase. Once CI passes, let me merge it. Thanks so much!

I opened a follow-up PR #12831 to update `vlm_query.ipynb` to include a Qwen2.5-VL example that passes HuggingFace `processor_output` into `Engine.generate`, aligning the docs. @zhaochenyang20 please review it.

Also, see #12831.

@zhaochenyang20 Previously, the CI job unit-test-backend-1-gpu (0), which runs `pytest test/srt/test_vision_openai_server_a.py`, failed.

@zhaochenyang20
…-test-backend-1-gpu (0), (#14080) Co-authored-by: BenYao21 <cyao22@asu.edu>


Summary
This PR refactors how the offline Engine handles multimodal / VLM inputs and exposes a consistent `image_data` API across models.
Motivation
Previously, each VLM had its own ad-hoc way to pass images (raw pixel values, custom dicts, etc.), and multi-image requests could easily break the engine path.
We want a single, well-defined `image_data` contract that works for raw images (`PIL`, `numpy`, `torch`) as well as pre-processed inputs.
What's Changed
Engine API
Clarified and unified the supported formats for `image_data` in `Engine.generate` / `async_generate`, including:
- `image_data=[image]` or `image_data=[[image1, image2, ...]]`
- `image_data=[dict(processor_output, format="processor_output")]`
- `image_data=[dict(processor_output, format="precomputed_embedding", feature=precomputed_embeddings)]`

Centralized multimodal validation / normalization in the scheduler (`mm_utils`, `schedule_batch`) so batched and multi-image requests follow the same path.
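As an illustration of the unified contract, here is a sketch of calling the offline engine with the raw-image formats; the model path, image files, and prompts are placeholders, and the exact constructor arguments may differ:

```python
import sglang as sgl
from PIL import Image

# Offline engine; the model path is a placeholder.
engine = sgl.Engine(model_path="Qwen/Qwen2.5-VL-7B-Instruct")

taxi = Image.open("taxi.jpg")
logo = Image.open("sgl_logo.png")

# One image for a single request.
out = engine.generate(prompt="Describe the image.", image_data=[taxi])

# Multiple images for a single request (list of lists).
out = engine.generate(
    prompt="Compare the two images.", image_data=[[taxi, logo]]
)
print(out["text"])
```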
Model / Processor Side
- Models now use the shared `MultimodalInputFormat` helper instead of model-specific dicts.
- `refactor-engine-vlm-params`: the engine now correctly handles multiple images per request.
Tests
Added `test_vlm_input_format.py`, which verifies that both Qwen2.5-VL and Gemma-3-VLM work with:
- `image_data=[PIL.Image]`
- `image_data` with `format="processor_output"`
- `image_data` with `format="precomputed_embedding"`

It also checks that the model can correctly understand a 2-image input (e.g., taxi + SGL logo) for all three code paths.
Docs
Updated `docs/advanced_features/vlm_query.ipynb` to cover the new `Engine` API input formats.
Testing