[WIP][DO NOT MERGE] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding#21462
yhyang201 wants to merge 3 commits into sgl-project:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces substantial optimizations for multi-image processing in the Qwen3-VL model, primarily targeting a reduction in Time To First Token (TTFT). By implementing a per-image chunked encoding strategy and refining shared-memory data transfer, the system can now process multiple images more efficiently, especially in distributed environments. These changes yield noticeable performance gains, making the model more responsive for visual-language tasks involving many images.
Code Review
This pull request introduces significant improvements to multimodal processing, particularly for models like Qwen3-VL, by implementing per-image encoding and caching for bundled items to optimize memory and performance. It also refactors shared memory handling for tensor transfers, improving efficiency and resource management. A new VLM pipeline stage timer has been added and integrated across various stages (processor, transfer, ViT encoding, embedding merge, LLM prefill) to provide detailed performance metrics. Review comments suggest a minor optimization in a contiguity check and highlight a style guide violation regarding local imports of the new timing utilities, recommending they be moved to the top of the file or justified.
```python
is_contiguous = (last - first + 1 == len(image_indices)) and all(
    image_indices[k] == first + k for k in range(len(image_indices))
)
```
The `image_indices` list is derived from `uncached_indices`, which is built by iterating through `overlapping_indices`. Since `overlapping_indices` is generated from `enumerate(items_offset)`, it preserves order, meaning `image_indices` is a sorted list of unique integers.
Given this, the check `(last - first + 1) == len(image_indices)` is sufficient to determine whether the indices are contiguous. The `all(...)` expression is redundant and introduces unnecessary overhead, which could be noticeable for requests with many images.
Suggested change:

```python
is_contiguous = (last - first + 1) == len(image_indices)
```

```python
# We need to forward an embedding_idx to locate the item start-end position in embedding.
embedding_idx = 0
kept_slices: List[torch.Tensor] = []
from sglang.srt.vlm_stage_timer import record_stage
```
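A quick empirical check of the reviewer's claim (a standalone sketch; `full_check` and `length_only` are stand-ins for the original and suggested expressions): for any sorted list of unique integers, the length test alone is equivalent to the full element-wise check.

```python
import random

def full_check(image_indices):
    # Original check: length test plus element-wise verification.
    first, last = image_indices[0], image_indices[-1]
    return (last - first + 1 == len(image_indices)) and all(
        image_indices[k] == first + k for k in range(len(image_indices))
    )

def length_only(image_indices):
    # Suggested check: sufficient when the list is sorted and duplicate-free.
    first, last = image_indices[0], image_indices[-1]
    return (last - first + 1) == len(image_indices)

# Compare both predicates on random sorted, duplicate-free index lists.
for _ in range(1000):
    idx = sorted(random.sample(range(100), random.randint(1, 20)))
    assert full_check(idx) == length_only(idx)
print("equivalent on sorted unique lists")
```

The intuition: unique sorted integers spanning `[first, last]` fill that range exactly when their count equals the span length, so the element-wise scan adds nothing.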
This local import of `record_stage` should be moved to the top of the file. This pattern of local imports for `vlm_stage_timer` and other modules like `time` appears in multiple places in this pull request (e.g., lines 635, 957, 1195, etc., and in other files).
According to PEP 8, imports should be at the top of the file to improve readability and make dependencies explicit. If this is a deliberate optimization to avoid import overhead, please add a comment explaining the reason. Otherwise, please move all local imports to the top of their respective files.
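If import overhead is the concern, it is easy to measure. After a module has been loaded once, a repeated `import` statement is only a `sys.modules` cache lookup plus a local name binding (a standalone sketch using a stdlib module, not the actual `vlm_stage_timer`):

```python
import timeit
import json  # loaded once at module top

def with_local_import():
    import json  # re-import: sys.modules cache hit, the module is not re-executed
    return json.dumps([1, 2, 3])

def with_top_level_import():
    return json.dumps([1, 2, 3])

n = 100_000
t_local = timeit.timeit(with_local_import, number=n)
t_top = timeit.timeit(with_top_level_import, number=n)
print(f"local: {t_local / n * 1e9:.0f} ns/call, top-level: {t_top / n * 1e9:.0f} ns/call")
```

The per-call difference is typically tens of nanoseconds, which is worth quantifying before accepting local imports as an optimization.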
Motivation
Note: this PR will be split into smaller PRs before merging.
We implemented two optimizations for multi-image VLM inference on Qwen3-VL-32B (TP4, 32×720p):
1. Defer shm feature unwrapping (avoid large TP broadcasts)
Previously, `unwrap_shm_features()` was called immediately after ZMQ recv, converting `ShmPointerMMData` back into `torch.Tensor`. As a result, `broadcast_pyobj` ended up pickling and broadcasting the full pixel data (100 MB+) across TP ranks.

This is now changed to:

- Make `ShmPointerMMData` safely pickleable across multiple round-trips (only shm metadata is serialized)
- Defer `materialize()` until after TP broadcast completes

Result: `stage2b_broadcast_reqs` reduced from 3716 ms → ~8 ms, as only lightweight metadata is transmitted.

2. Chunk-aware ViT encoding + per-image cache
The previous chunked prefill path was ineffective for multi-image inputs:

- All ViT compute was concentrated in the first chunk

Additionally:

- Multi-image items were hashed and cached as a single bundle (`combine_hashes`)

This is now improved by:

- Encoding images chunk by chunk rather than all at once
- Caching each image separately, keyed by `(item_hash, image_index)`

Result:
Before:
After:
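The per-image caching described above can be sketched with a toy embedding cache (hypothetical names throughout; `encode_image` stands in for a ViT forward pass, and the key shape follows the `(item_hash, image_index)` scheme from the description):

```python
import hashlib
from typing import Dict, List, Tuple

def item_hash(images: List[bytes]) -> str:
    # Stand-in for the multimodal item's hash over all of its images.
    h = hashlib.sha256()
    for img in images:
        h.update(img)
    return h.hexdigest()

def encode_image(img: bytes) -> str:
    # Placeholder for the real per-image ViT encoding.
    return f"emb({img.decode()})"

cache: Dict[Tuple[str, int], str] = {}

def encode_request(images: List[bytes]) -> List[str]:
    """Encode each image separately, keyed by (item_hash, image_index)."""
    h = item_hash(images)
    out = []
    for i, img in enumerate(images):
        key = (h, i)  # one cache entry per image, not one per bundle
        if key not in cache:
            cache[key] = encode_image(img)
        out.append(cache[key])
    return out
```

With per-image keys, a later chunk that covers only part of the item can reuse exactly the embeddings already computed, instead of either recomputing the whole bundle or missing the cache entirely.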
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`