[WIP][DO NOT MERGE] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding #21462

Closed. yhyang201 wants to merge 3 commits into sgl-project:main from yhyang201:optimize-vlm

Conversation


@yhyang201 yhyang201 commented Mar 26, 2026

Motivation

Note: this PR will be split into smaller PRs.

We implemented two optimizations for multi-image VLM inference on Qwen3-VL-32B (TP4, 32×720p):

1. Defer shm feature unwrapping (avoid large TP broadcasts)

Previously, unwrap_shm_features() was called immediately after ZMQ recv, converting ShmPointerMMData back into torch.Tensor. As a result, broadcast_pyobj ended up pickling and broadcasting the full pixel data (100MB+) across TP ranks.

This is now changed to:

  • Make ShmPointerMMData safely pickleable across multiple round-trips (only shm metadata is serialized)
  • Defer materialize() until after TP broadcast completes

Result: stage2b_broadcast_reqs reduced from 3716 ms → ~8 ms, as only lightweight metadata is transmitted.
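
The deferred-unwrapping idea can be sketched as follows. This is an illustrative standalone example, not the actual sglang `ShmPointerMMData` implementation: the class and function names (`ShmTensorPointer`, `put`) are hypothetical, and it uses NumPy where sglang uses torch. The key property is that pickling the object serializes only shm metadata (name, shape, dtype), so broadcasting the pickled object costs bytes instead of the full pixel buffer, and `materialize()` is called only after the broadcast completes.

```python
# Hypothetical sketch of a shm-backed tensor pointer that stays
# lightweight under pickle: only metadata is serialized, never the data.
from dataclasses import dataclass
from multiprocessing import shared_memory

import numpy as np


@dataclass
class ShmTensorPointer:
    shm_name: str
    shape: tuple
    dtype: str  # e.g. "float32"

    def materialize(self) -> np.ndarray:
        # Attach to the existing shm segment by name and copy it out;
        # deferred until after the (cheap) metadata broadcast completes.
        shm = shared_memory.SharedMemory(name=self.shm_name)
        arr = np.ndarray(self.shape, dtype=self.dtype, buffer=shm.buf).copy()
        shm.close()
        return arr


def put(arr: np.ndarray) -> ShmTensorPointer:
    # Write the array into a fresh shm segment and return only a pointer.
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)[...] = arr
    shm.close()  # segment persists until unlink(); only the handle closes
    return ShmTensorPointer(shm.name, arr.shape, str(arr.dtype))
```

Because a pickle round-trip preserves only the three metadata fields, the pointer survives multiple recv/broadcast hops unchanged, which is the "safely pickleable across multiple round-trips" property described above.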


2. Chunk-aware ViT encoding + per-image cache

The previous chunked prefill path was ineffective for multi-image inputs:

  • Chunk 1 encoded all 32 images (~1.9 s of ViT cost)
  • Subsequent chunks reused cached embeddings
  • Net effect: all ViT compute was concentrated in the first chunk

Additionally:

  • Cache key was based on the full image set (combine_hashes)
  • Any difference in images led to a full cache miss
  • If total image embeddings exceeded cache capacity, frequent eviction would cause every chunk to recompute all 32 images, leading to significant redundant compute

This is now improved by:

  • Encoding only the images whose token ranges overlap with the current chunk
  • Switching cache granularity to per-image ((item_hash, image_index))
  • Falling back to the original path for non-bundled items / EVS

Result:

  • ViT compute is distributed across chunks instead of front-loaded
  • Cache reuse improves for partially overlapping image sets
  • Avoids repeated full recomputation under cache pressure
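
The chunk-overlap selection and per-image cache can be sketched as below. This is a minimal illustration of the technique, not the sglang API: `images_for_chunk`, `encode_chunk`, and the `vit_encode` callback are hypothetical names, and token spans are assumed half-open `[start, end)`.

```python
# Hypothetical sketch: encode only images whose token range overlaps the
# current prefill chunk, with a cache keyed per image rather than per
# image set, so a partial overlap is a partial hit instead of a full miss.
from typing import Callable, Dict, List, Tuple


def images_for_chunk(
    spans: List[Tuple[int, int]],  # (start, end) token range per image
    chunk_start: int,
    chunk_end: int,
) -> List[int]:
    # Half-open interval overlap test.
    return [
        i for i, (s, e) in enumerate(spans)
        if s < chunk_end and e > chunk_start
    ]


def encode_chunk(
    spans: List[Tuple[int, int]],
    chunk_start: int,
    chunk_end: int,
    item_hash: str,
    cache: Dict,
    vit_encode: Callable[[int], object],
):
    out = {}
    for i in images_for_chunk(spans, chunk_start, chunk_end):
        key = (item_hash, i)  # per-image granularity, as in this PR
        if key not in cache:
            cache[key] = vit_encode(i)  # ViT runs only for uncached images
        out[i] = cache[key]
    return out
```

With a key of `(item_hash, image_index)`, chunk N only pays ViT cost for the images it newly touches, which is how the compute gets spread across chunks instead of front-loaded into chunk 1.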

Before:

============================================================
  [1 images] Generating 1x 720p... (0.0s) Sending... OK (TTFT=304ms, e2e=0.5s)
  [2 images] Generating 2x 720p... (0.0s) Sending... OK (TTFT=530ms, e2e=0.8s)
  [3 images] Generating 3x 720p... (0.1s) Sending... OK (TTFT=708ms, e2e=0.9s)
  [4 images] Generating 4x 720p... (0.1s) Sending... OK (TTFT=983ms, e2e=1.2s)
  [5 images] Generating 5x 720p... (0.1s) Sending... OK (TTFT=1117ms, e2e=1.4s)
  [6 images] Generating 6x 720p... (0.1s) Sending... OK (TTFT=1325ms, e2e=1.6s)
  [7 images] Generating 7x 720p... (0.1s) Sending... OK (TTFT=1693ms, e2e=1.9s)
  [8 images] Generating 8x 720p... (0.1s) Sending... OK (TTFT=1754ms, e2e=2.0s)
  [9 images] Generating 9x 720p... (0.2s) Sending... OK (TTFT=2064ms, e2e=2.3s)
  [10 images] Generating 10x 720p... (0.2s) Sending... OK (TTFT=2367ms, e2e=2.6s)
  [11 images] Generating 11x 720p... (0.2s) Sending... OK (TTFT=2636ms, e2e=2.9s)
  [12 images] Generating 12x 720p... (0.2s) Sending... OK (TTFT=2946ms, e2e=3.2s)
  [13 images] Generating 13x 720p... (0.2s) Sending... OK (TTFT=3601ms, e2e=3.9s)
  [14 images] Generating 14x 720p... (0.3s) Sending... OK (TTFT=3814ms, e2e=4.1s)
  [15 images] Generating 15x 720p... (0.3s) Sending... OK (TTFT=3839ms, e2e=4.1s)
  [16 images] Generating 16x 720p... (0.3s) Sending... OK (TTFT=4024ms, e2e=4.3s)
  [17 images] Generating 17x 720p... (0.3s) Sending... OK (TTFT=4351ms, e2e=4.6s)
  [18 images] Generating 18x 720p... (0.3s) Sending... OK (TTFT=4549ms, e2e=4.8s)
  [19 images] Generating 19x 720p... (0.4s) Sending... OK (TTFT=4963ms, e2e=5.2s)
  [20 images] Generating 20x 720p... (0.4s) Sending... OK (TTFT=5332ms, e2e=5.6s)
  [21 images] Generating 21x 720p... (0.4s) Sending... OK (TTFT=5483ms, e2e=5.8s)
  [22 images] Generating 22x 720p... (0.4s) Sending... OK (TTFT=5779ms, e2e=6.0s)
  [23 images] Generating 23x 720p... (0.4s) Sending... OK (TTFT=6163ms, e2e=6.4s)
  [24 images] Generating 24x 720p... (0.5s) Sending... OK (TTFT=6520ms, e2e=6.8s)
  [25 images] Generating 25x 720p... (0.5s) Sending... OK (TTFT=7207ms, e2e=7.5s)
  [26 images] Generating 26x 720p... (0.5s) Sending... OK (TTFT=7493ms, e2e=7.8s)
  [27 images] Generating 27x 720p... (0.5s) Sending... OK (TTFT=7968ms, e2e=8.3s)
  [28 images] Generating 28x 720p... (0.5s) Sending... OK (TTFT=7874ms, e2e=8.2s)
  [29 images] Generating 29x 720p... (0.5s) Sending... OK (TTFT=8374ms, e2e=8.7s)
  [30 images] Generating 30x 720p... (0.5s) Sending... OK (TTFT=8739ms, e2e=9.0s)
  [31 images] Generating 31x 720p... (0.6s) Sending... OK (TTFT=8809ms, e2e=9.1s)
  [32 images] Generating 32x 720p... (0.6s) Sending... OK (TTFT=9063ms, e2e=9.4s)

After:

============================================================
Probing: 720p (1280x720)
============================================================
  [1 images] Generating 1x 720p... (0.0s) Sending... OK (TTFT=246ms, e2e=0.5s)
  [2 images] Generating 2x 720p... (0.0s) Sending... OK (TTFT=351ms, e2e=0.6s)
  [3 images] Generating 3x 720p... (0.1s) Sending... OK (TTFT=450ms, e2e=0.7s)
  [4 images] Generating 4x 720p... (0.1s) Sending... OK (TTFT=644ms, e2e=0.9s)
  [5 images] Generating 5x 720p... (0.1s) Sending... OK (TTFT=694ms, e2e=0.9s)
  [6 images] Generating 6x 720p... (0.1s) Sending... OK (TTFT=826ms, e2e=1.1s)
  [7 images] Generating 7x 720p... (0.1s) Sending... OK (TTFT=1103ms, e2e=1.3s)
  [8 images] Generating 8x 720p... (0.1s) Sending... OK (TTFT=1075ms, e2e=1.3s)
  [9 images] Generating 9x 720p... (0.2s) Sending... OK (TTFT=1238ms, e2e=1.5s)
  [10 images] Generating 10x 720p... (0.2s) Sending... OK (TTFT=1364ms, e2e=1.6s)
  [11 images] Generating 11x 720p... (0.2s) Sending... OK (TTFT=1520ms, e2e=1.8s)
  [12 images] Generating 12x 720p... (0.2s) Sending... OK (TTFT=1692ms, e2e=1.9s)
  [13 images] Generating 13x 720p... (0.2s) Sending... OK (TTFT=2084ms, e2e=2.3s)
  [14 images] Generating 14x 720p... (0.3s) Sending... OK (TTFT=2003ms, e2e=2.2s)
  [15 images] Generating 15x 720p... (0.3s) Sending... OK (TTFT=2078ms, e2e=2.3s)
  [16 images] Generating 16x 720p... (0.3s) Sending... OK (TTFT=2235ms, e2e=2.5s)
  [17 images] Generating 17x 720p... (0.3s) Sending... OK (TTFT=2420ms, e2e=2.7s)
  [18 images] Generating 18x 720p... (0.3s) Sending... OK (TTFT=2472ms, e2e=2.7s)
  [19 images] Generating 19x 720p... (0.3s) Sending... OK (TTFT=2651ms, e2e=2.9s)
  [20 images] Generating 20x 720p... (0.4s) Sending... OK (TTFT=2855ms, e2e=3.1s)
  [21 images] Generating 21x 720p... (0.4s) Sending... OK (TTFT=2905ms, e2e=3.1s)
  [22 images] Generating 22x 720p... (0.4s) Sending... OK (TTFT=3098ms, e2e=3.3s)
  [23 images] Generating 23x 720p... (0.4s) Sending... OK (TTFT=3194ms, e2e=3.4s)
  [24 images] Generating 24x 720p... (0.4s) Sending... OK (TTFT=3447ms, e2e=3.7s)
  [25 images] Generating 25x 720p... (0.5s) Sending... OK (TTFT=4174ms, e2e=4.4s)
  [26 images] Generating 26x 720p... (0.5s) Sending... OK (TTFT=3906ms, e2e=4.1s)
  [27 images] Generating 27x 720p... (0.5s) Sending... OK (TTFT=4041ms, e2e=4.3s)
  [28 images] Generating 28x 720p... (0.5s) Sending... OK (TTFT=4170ms, e2e=4.4s)
  [29 images] Generating 29x 720p... (0.5s) Sending... OK (TTFT=4353ms, e2e=4.6s)
  [30 images] Generating 30x 720p... (0.6s) Sending... OK (TTFT=4492ms, e2e=4.7s)
  [31 images] Generating 31x 720p... (0.6s) Sending... OK (TTFT=4703ms, e2e=4.9s)
  [32 images] Generating 32x 720p... (0.6s) Sending... OK (TTFT=4839ms, e2e=5.1s)

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces substantial optimizations for multi-image processing in the Qwen3-VL model, primarily targeting a reduction in Time To First Token (TTFT). By implementing a sophisticated per-image chunked encoding strategy and refining shared memory data transfer, the system can now process multiple images more efficiently, especially in distributed environments. These changes lead to noticeable performance gains, making the model more responsive for visual language tasks involving numerous images.

Highlights

  • Performance Optimization: Significantly improved Time To First Token (TTFT) for multi-image Qwen3-VL models, as demonstrated by benchmarks showing up to 2x speedup for 32 images.
  • Per-Image Chunked ViT Encoding: Introduced a new strategy for processing bundled multi-image inputs, where only images overlapping with the current prefill chunk are encoded, and individual image embeddings are cached on the GPU.
  • Optimized Shared Memory (SHM) Handling: Enhanced ShmPointerMMData for more efficient cross-process tensor transfer, utilizing zero-copy views and delaying materialization until after broadcast, reducing data duplication.
  • VLM Pipeline Stage Timing: Integrated a new VLM stage timer to provide detailed profiling of different pipeline stages (processor, data transfer, ViT encoding, embedding merge, LLM prefill), aiding in performance analysis.
  • Multimodal Embedding Refactor: Refactored the multimodal utility functions to support the new per-image chunking and caching mechanisms, improving modularity and efficiency.




gemini-code-assist (bot) left a review:

Code Review

This pull request introduces significant improvements to multimodal processing, particularly for models like Qwen3-VL, by implementing per-image encoding and caching for bundled items to optimize memory and performance. It also refactors shared memory handling for tensor transfers, improving efficiency and resource management. A new VLM pipeline stage timer has been integrated across the processor, transfer, ViT encoding, embedding merge, and LLM prefill stages to provide detailed performance metrics. Review comments suggest a minor optimization in a contiguity check and flag a style-guide violation regarding local imports of the new timing utilities, recommending they be moved to the top of the file or justified.

Review comment on python/sglang/srt/managers/mm_utils.py, lines +509 to +511 (Outdated):
is_contiguous = (last - first + 1 == len(image_indices)) and all(
image_indices[k] == first + k for k in range(len(image_indices))
)
Severity: medium

The image_indices list is derived from uncached_indices, which is built by iterating through overlapping_indices. Since overlapping_indices is generated from enumerate(items_offset), it preserves order, meaning image_indices is a sorted list of unique integers.

Given this, the check (last - first + 1) == len(image_indices) is sufficient to determine if the indices are contiguous. The all(...) expression is redundant and introduces unnecessary overhead, which could be noticeable for requests with many images.

    is_contiguous = (last - first + 1) == len(image_indices)
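
The reviewer's equivalence claim can be sanity-checked with a small standalone property test (illustrative code, not sglang's): for a sorted list of unique integers, the length condition alone decides contiguity, so the element-wise scan adds nothing.

```python
# Compare the original two-part check against the length-only check.
# The equivalence holds only under the stated precondition: the index
# list is sorted and duplicate-free.
from typing import List


def is_contiguous_full(idx: List[int]) -> bool:
    first, last = idx[0], idx[-1]
    return (last - first + 1 == len(idx)) and all(
        idx[k] == first + k for k in range(len(idx))
    )


def is_contiguous_fast(idx: List[int]) -> bool:
    # Sufficient on its own for sorted, unique indices: the range can
    # only match the length if no index is missing.
    return idx[-1] - idx[0] + 1 == len(idx)
```

If the sorted-unique precondition ever stopped holding (e.g. indices gathered in arbitrary order), the two checks would diverge and the `all(...)` scan would be needed again.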

Review comment on python/sglang/srt/managers/mm_utils.py (Outdated):
# We need to forward an embedding_idx to locate the item start-end position in embedding.
embedding_idx = 0
kept_slices: List[torch.Tensor] = []
from sglang.srt.vlm_stage_timer import record_stage
Severity: medium

This local import of record_stage should be moved to the top of the file. This pattern of local imports for vlm_stage_timer and other modules like time appears in multiple places in this pull request (e.g., lines 635, 957, 1195, etc., and in other files).

According to PEP 8, imports should be at the top of the file to improve readability and make dependencies explicit. If this is a deliberate optimization to avoid import overhead, please add a comment explaining the reason. Otherwise, please move all local imports to the top of their respective files.

@yhyang201 yhyang201 changed the title [WIP] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding [WIP][DO NOT MERGE] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding Mar 26, 2026

Labels

multi-modal, multi-modal language model
