[WIP][DO NOT MERGE] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding #21462

Closed. yhyang201 wants to merge 3 commits into sgl-project:main from yhyang201:optimize-vlm

Conversation


@yhyang201 yhyang201 commented Mar 26, 2026

Motivation

Note: this PR will be split into smaller PRs.

We implemented two optimizations for multi-image VLM inference on Qwen3-VL-32B (TP4, 32×720p):

1. Defer shm feature unwrapping (avoid large TP broadcasts)

Previously, unwrap_shm_features() was called immediately after ZMQ recv, converting ShmPointerMMData back into torch.Tensor. As a result, broadcast_pyobj ended up pickling and broadcasting the full pixel data (100MB+) across TP ranks.

This is now changed to:

  • Make ShmPointerMMData safely pickleable across multiple round-trips (only shm metadata is serialized)
  • Defer materialize() until after TP broadcast completes

Result: stage2b_broadcast_reqs reduced from 3716 ms → ~8 ms, as only lightweight metadata is transmitted.
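
The deferred-unwrapping idea can be sketched as follows. This is an illustrative standalone example, not the actual sglang `ShmPointerMMData` implementation: the class and function names (`ShmTensorPointer`, `put`) are hypothetical, and it uses NumPy where sglang uses torch. The key property is that pickling the object serializes only shm metadata (name, shape, dtype), so broadcasting the pickled object costs bytes instead of the full pixel buffer, and `materialize()` is called only after the broadcast completes.

```python
# Hypothetical sketch of a shm-backed tensor pointer that stays
# lightweight under pickle: only metadata is serialized, never the data.
from dataclasses import dataclass
from multiprocessing import shared_memory

import numpy as np


@dataclass
class ShmTensorPointer:
    shm_name: str
    shape: tuple
    dtype: str  # e.g. "float32"

    def materialize(self) -> np.ndarray:
        # Attach to the existing shm segment by name and copy it out;
        # deferred until after the (cheap) metadata broadcast completes.
        shm = shared_memory.SharedMemory(name=self.shm_name)
        arr = np.ndarray(self.shape, dtype=self.dtype, buffer=shm.buf).copy()
        shm.close()
        return arr


def put(arr: np.ndarray) -> ShmTensorPointer:
    # Write the array into a fresh shm segment and return only a pointer.
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)[...] = arr
    shm.close()  # segment persists until unlink(); only the handle closes
    return ShmTensorPointer(shm.name, arr.shape, str(arr.dtype))
```

Because a pickle round-trip preserves only the three metadata fields, the pointer survives multiple recv/broadcast hops unchanged, which is the "safely pickleable across multiple round-trips" property described above.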


2. Chunk-aware ViT encoding + per-image cache

The previous chunked prefill path was ineffective for multi-image inputs:

  • Chunk 1 encoded all 32 images (~1.9 s of ViT cost)
  • Subsequent chunks reused cached embeddings
  • Net effect: all ViT compute was concentrated in the first chunk

Additionally:

  • Cache key was based on the full image set (combine_hashes)
  • Any difference in images led to a full cache miss
  • If total image embeddings exceeded cache capacity, frequent eviction would cause every chunk to recompute all 32 images, leading to significant redundant compute

This is now improved by:

  • Encoding only the images whose token ranges overlap with the current chunk
  • Switching cache granularity to per-image ((item_hash, image_index))
  • Falling back to the original path for non-bundled items / EVS

Result:

  • ViT compute is distributed across chunks instead of front-loaded
  • Cache reuse improves for partially overlapping image sets
  • Avoids repeated full recomputation under cache pressure
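
The chunk-overlap selection and per-image cache can be sketched as below. This is a minimal illustration of the technique, not the sglang API: `images_for_chunk`, `encode_chunk`, and the `vit_encode` callback are hypothetical names, and token spans are assumed half-open `[start, end)`.

```python
# Hypothetical sketch: encode only images whose token range overlaps the
# current prefill chunk, with a cache keyed per image rather than per
# image set, so a partial overlap is a partial hit instead of a full miss.
from typing import Callable, Dict, List, Tuple


def images_for_chunk(
    spans: List[Tuple[int, int]],  # (start, end) token range per image
    chunk_start: int,
    chunk_end: int,
) -> List[int]:
    # Half-open interval overlap test.
    return [
        i for i, (s, e) in enumerate(spans)
        if s < chunk_end and e > chunk_start
    ]


def encode_chunk(
    spans: List[Tuple[int, int]],
    chunk_start: int,
    chunk_end: int,
    item_hash: str,
    cache: Dict,
    vit_encode: Callable[[int], object],
):
    out = {}
    for i in images_for_chunk(spans, chunk_start, chunk_end):
        key = (item_hash, i)  # per-image granularity, as in this PR
        if key not in cache:
            cache[key] = vit_encode(i)  # ViT runs only for uncached images
        out[i] = cache[key]
    return out
```

With a key of `(item_hash, image_index)`, chunk N only pays ViT cost for the images it newly touches, which is how the compute gets spread across chunks instead of front-loaded into chunk 1.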

Before:

============================================================
  [1 images] Generating 1x 720p... (0.0s) Sending... OK (TTFT=304ms, e2e=0.5s)
  [2 images] Generating 2x 720p... (0.0s) Sending... OK (TTFT=530ms, e2e=0.8s)
  [3 images] Generating 3x 720p... (0.1s) Sending... OK (TTFT=708ms, e2e=0.9s)
  [4 images] Generating 4x 720p... (0.1s) Sending... OK (TTFT=983ms, e2e=1.2s)
  [5 images] Generating 5x 720p... (0.1s) Sending... OK (TTFT=1117ms, e2e=1.4s)
  [6 images] Generating 6x 720p... (0.1s) Sending... OK (TTFT=1325ms, e2e=1.6s)
  [7 images] Generating 7x 720p... (0.1s) Sending... OK (TTFT=1693ms, e2e=1.9s)
  [8 images] Generating 8x 720p... (0.1s) Sending... OK (TTFT=1754ms, e2e=2.0s)
  [9 images] Generating 9x 720p... (0.2s) Sending... OK (TTFT=2064ms, e2e=2.3s)
  [10 images] Generating 10x 720p... (0.2s) Sending... OK (TTFT=2367ms, e2e=2.6s)
  [11 images] Generating 11x 720p... (0.2s) Sending... OK (TTFT=2636ms, e2e=2.9s)
  [12 images] Generating 12x 720p... (0.2s) Sending... OK (TTFT=2946ms, e2e=3.2s)
  [13 images] Generating 13x 720p... (0.2s) Sending... OK (TTFT=3601ms, e2e=3.9s)
  [14 images] Generating 14x 720p... (0.3s) Sending... OK (TTFT=3814ms, e2e=4.1s)
  [15 images] Generating 15x 720p... (0.3s) Sending... OK (TTFT=3839ms, e2e=4.1s)
  [16 images] Generating 16x 720p... (0.3s) Sending... OK (TTFT=4024ms, e2e=4.3s)
  [17 images] Generating 17x 720p... (0.3s) Sending... OK (TTFT=4351ms, e2e=4.6s)
  [18 images] Generating 18x 720p... (0.3s) Sending... OK (TTFT=4549ms, e2e=4.8s)
  [19 images] Generating 19x 720p... (0.4s) Sending... OK (TTFT=4963ms, e2e=5.2s)
  [20 images] Generating 20x 720p... (0.4s) Sending... OK (TTFT=5332ms, e2e=5.6s)
  [21 images] Generating 21x 720p... (0.4s) Sending... OK (TTFT=5483ms, e2e=5.8s)
  [22 images] Generating 22x 720p... (0.4s) Sending... OK (TTFT=5779ms, e2e=6.0s)
  [23 images] Generating 23x 720p... (0.4s) Sending... OK (TTFT=6163ms, e2e=6.4s)
  [24 images] Generating 24x 720p... (0.5s) Sending... OK (TTFT=6520ms, e2e=6.8s)
  [25 images] Generating 25x 720p... (0.5s) Sending... OK (TTFT=7207ms, e2e=7.5s)
  [26 images] Generating 26x 720p... (0.5s) Sending... OK (TTFT=7493ms, e2e=7.8s)
  [27 images] Generating 27x 720p... (0.5s) Sending... OK (TTFT=7968ms, e2e=8.3s)
  [28 images] Generating 28x 720p... (0.5s) Sending... OK (TTFT=7874ms, e2e=8.2s)
  [29 images] Generating 29x 720p... (0.5s) Sending... OK (TTFT=8374ms, e2e=8.7s)
  [30 images] Generating 30x 720p... (0.5s) Sending... OK (TTFT=8739ms, e2e=9.0s)
  [31 images] Generating 31x 720p... (0.6s) Sending... OK (TTFT=8809ms, e2e=9.1s)
  [32 images] Generating 32x 720p... (0.6s) Sending... OK (TTFT=9063ms, e2e=9.4s)

After:

============================================================
Probing: 720p (1280x720)
============================================================
  [1 images] Generating 1x 720p... (0.0s) Sending... OK (TTFT=246ms, e2e=0.5s)
  [2 images] Generating 2x 720p... (0.0s) Sending... OK (TTFT=351ms, e2e=0.6s)
  [3 images] Generating 3x 720p... (0.1s) Sending... OK (TTFT=450ms, e2e=0.7s)
  [4 images] Generating 4x 720p... (0.1s) Sending... OK (TTFT=644ms, e2e=0.9s)
  [5 images] Generating 5x 720p... (0.1s) Sending... OK (TTFT=694ms, e2e=0.9s)
  [6 images] Generating 6x 720p... (0.1s) Sending... OK (TTFT=826ms, e2e=1.1s)
  [7 images] Generating 7x 720p... (0.1s) Sending... OK (TTFT=1103ms, e2e=1.3s)
  [8 images] Generating 8x 720p... (0.1s) Sending... OK (TTFT=1075ms, e2e=1.3s)
  [9 images] Generating 9x 720p... (0.2s) Sending... OK (TTFT=1238ms, e2e=1.5s)
  [10 images] Generating 10x 720p... (0.2s) Sending... OK (TTFT=1364ms, e2e=1.6s)
  [11 images] Generating 11x 720p... (0.2s) Sending... OK (TTFT=1520ms, e2e=1.8s)
  [12 images] Generating 12x 720p... (0.2s) Sending... OK (TTFT=1692ms, e2e=1.9s)
  [13 images] Generating 13x 720p... (0.2s) Sending... OK (TTFT=2084ms, e2e=2.3s)
  [14 images] Generating 14x 720p... (0.3s) Sending... OK (TTFT=2003ms, e2e=2.2s)
  [15 images] Generating 15x 720p... (0.3s) Sending... OK (TTFT=2078ms, e2e=2.3s)
  [16 images] Generating 16x 720p... (0.3s) Sending... OK (TTFT=2235ms, e2e=2.5s)
  [17 images] Generating 17x 720p... (0.3s) Sending... OK (TTFT=2420ms, e2e=2.7s)
  [18 images] Generating 18x 720p... (0.3s) Sending... OK (TTFT=2472ms, e2e=2.7s)
  [19 images] Generating 19x 720p... (0.3s) Sending... OK (TTFT=2651ms, e2e=2.9s)
  [20 images] Generating 20x 720p... (0.4s) Sending... OK (TTFT=2855ms, e2e=3.1s)
  [21 images] Generating 21x 720p... (0.4s) Sending... OK (TTFT=2905ms, e2e=3.1s)
  [22 images] Generating 22x 720p... (0.4s) Sending... OK (TTFT=3098ms, e2e=3.3s)
  [23 images] Generating 23x 720p... (0.4s) Sending... OK (TTFT=3194ms, e2e=3.4s)
  [24 images] Generating 24x 720p... (0.4s) Sending... OK (TTFT=3447ms, e2e=3.7s)
  [25 images] Generating 25x 720p... (0.5s) Sending... OK (TTFT=4174ms, e2e=4.4s)
  [26 images] Generating 26x 720p... (0.5s) Sending... OK (TTFT=3906ms, e2e=4.1s)
  [27 images] Generating 27x 720p... (0.5s) Sending... OK (TTFT=4041ms, e2e=4.3s)
  [28 images] Generating 28x 720p... (0.5s) Sending... OK (TTFT=4170ms, e2e=4.4s)
  [29 images] Generating 29x 720p... (0.5s) Sending... OK (TTFT=4353ms, e2e=4.6s)
  [30 images] Generating 30x 720p... (0.6s) Sending... OK (TTFT=4492ms, e2e=4.7s)
  [31 images] Generating 31x 720p... (0.6s) Sending... OK (TTFT=4703ms, e2e=4.9s)
  [32 images] Generating 32x 720p... (0.6s) Sending... OK (TTFT=4839ms, e2e=5.1s)

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request introduces substantial optimizations for multi-image processing in the Qwen3-VL model, primarily targeting a reduction in Time To First Token (TTFT). By implementing a sophisticated per-image chunked encoding strategy and refining shared memory data transfer, the system can now process multiple images more efficiently, especially in distributed environments. These changes lead to noticeable performance gains, making the model more responsive for visual language tasks involving numerous images.

Highlights

  • Performance Optimization: Significantly improved Time To First Token (TTFT) for multi-image Qwen3-VL models, as demonstrated by benchmarks showing up to 2x speedup for 32 images.
  • Per-Image Chunked ViT Encoding: Introduced a new strategy for processing bundled multi-image inputs, where only images overlapping with the current prefill chunk are encoded, and individual image embeddings are cached on the GPU.
  • Optimized Shared Memory (SHM) Handling: Enhanced ShmPointerMMData for more efficient cross-process tensor transfer, utilizing zero-copy views and delaying materialization until after broadcast, reducing data duplication.
  • VLM Pipeline Stage Timing: Integrated a new VLM stage timer to provide detailed profiling of different pipeline stages (processor, data transfer, ViT encoding, embedding merge, LLM prefill), aiding in performance analysis.
  • Multimodal Embedding Refactor: Refactored the multimodal utility functions to support the new per-image chunking and caching mechanisms, improving modularity and efficiency.




gemini-code-assist (bot) left a review:

Code Review

This pull request introduces significant improvements to multimodal processing, particularly for models like Qwen3-VL, by implementing per-image encoding and caching for bundled items to optimize memory and performance. It also refactors shared memory handling for tensor transfers, improving efficiency and resource management. A new VLM pipeline stage timer has been integrated across the processor, transfer, ViT encoding, embedding merge, and LLM prefill stages to provide detailed performance metrics. Review comments suggest a minor optimization in a contiguity check and flag a style-guide violation regarding local imports of the new timing utilities, recommending they be moved to the top of the file or justified.

Review comment on python/sglang/srt/managers/mm_utils.py, lines +509 to +511 (Outdated):
is_contiguous = (last - first + 1 == len(image_indices)) and all(
image_indices[k] == first + k for k in range(len(image_indices))
)
Severity: medium

The image_indices list is derived from uncached_indices, which is built by iterating through overlapping_indices. Since overlapping_indices is generated from enumerate(items_offset), it preserves order, meaning image_indices is a sorted list of unique integers.

Given this, the check (last - first + 1) == len(image_indices) is sufficient to determine if the indices are contiguous. The all(...) expression is redundant and introduces unnecessary overhead, which could be noticeable for requests with many images.

    is_contiguous = (last - first + 1) == len(image_indices)
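
The reviewer's equivalence claim can be sanity-checked with a small standalone property test (illustrative code, not sglang's): for a sorted list of unique integers, the length condition alone decides contiguity, so the element-wise scan adds nothing.

```python
# Compare the original two-part check against the length-only check.
# The equivalence holds only under the stated precondition: the index
# list is sorted and duplicate-free.
from typing import List


def is_contiguous_full(idx: List[int]) -> bool:
    first, last = idx[0], idx[-1]
    return (last - first + 1 == len(idx)) and all(
        idx[k] == first + k for k in range(len(idx))
    )


def is_contiguous_fast(idx: List[int]) -> bool:
    # Sufficient on its own for sorted, unique indices: the range can
    # only match the length if no index is missing.
    return idx[-1] - idx[0] + 1 == len(idx)
```

If the sorted-unique precondition ever stopped holding (e.g. indices gathered in arbitrary order), the two checks would diverge and the `all(...)` scan would be needed again.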

Review comment on python/sglang/srt/managers/mm_utils.py (Outdated):
# We need to forward an embedding_idx to locate the item start-end position in embedding.
embedding_idx = 0
kept_slices: List[torch.Tensor] = []
from sglang.srt.vlm_stage_timer import record_stage
Severity: medium

This local import of record_stage should be moved to the top of the file. This pattern of local imports for vlm_stage_timer and other modules like time appears in multiple places in this pull request (e.g., lines 635, 957, 1195, etc., and in other files).

According to PEP 8, imports should be at the top of the file to improve readability and make dependencies explicit. If this is a deliberate optimization to avoid import overhead, please add a comment explaining the reason. Otherwise, please move all local imports to the top of their respective files.

@yhyang201 yhyang201 changed the title [WIP] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding [WIP][DO NOT MERGE] Optimize multi-image Qwen3-VL TTFT with shm-aware TP broadcast and per-image chunked ViT encoding Mar 26, 2026

Labels

multi-modal, multi-modal language model
