[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer#22038

Merged
yhyang201 merged 1 commit into sgl-project:main from yhyang201:chunking-encoding on Apr 4, 2026
Conversation

@yhyang201
Collaborator

Motivation

  • Per-image embedding cache: Switch multimodal embedding cache granularity from per-request (combine_hashes(all_items)) to per-image (item.hash), improving cache reuse under LRU eviction and avoiding full recomputation when any single image is evicted.
  • Chunk-aware ViT encoding: Only encode images whose token ranges overlap with the current chunked prefill chunk, instead of encoding all images in chunk 1. Reduces peak GPU memory and ViT compute for multi-image requests.
  • Lazy device transfer: Remove eager CPU→GPU transfer of all image features from prepare_for_extend(). Defer to the embedding pipeline (mm_utils.py) so only chunk-relevant cache-miss items are moved to GPU.
  • Remove redundant model-level device transfers: Models (qwen3_vl, deepseek_vl2, phi4mm, step3_vl_10b) no longer manually .to(device) their features — handled uniformly by mm_utils.py.
  • Remove dead code: ~240 net lines removed, including unused chunked padding functions and qwen3_vl's internal ViT batching logic.
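The chunk-aware selection above encodes, per prefill chunk, only the images whose token ranges overlap that chunk. A minimal sketch of the selection logic, with illustrative names (`MMItem`, `items_for_chunk`) that are assumptions rather than the actual sglang API:

```python
from dataclasses import dataclass

@dataclass
class MMItem:
    hash: int
    offset: tuple  # (start, end) token range the image occupies in the prompt

def items_for_chunk(items, chunk_start, chunk_end):
    """Keep only items whose token range overlaps [chunk_start, chunk_end)."""
    return [
        it for it in items
        if it.offset[0] < chunk_end and it.offset[1] > chunk_start
    ]

# Three images; a chunk covering tokens [100, 300) touches only the second.
items = [MMItem(1, (0, 80)), MMItem(2, (120, 200)), MMItem(3, (350, 430))]
assert [it.hash for it in items_for_chunk(items, 100, 300)] == [2]
```

Only the selected cache-miss items then need a ViT forward and a device transfer, which is what bounds per-chunk peak memory for multi-image requests.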

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@yhyang201
Collaborator Author

/tag-and-rerun-ci

@github-actions bot added labels `Multi-modal`, `multi-modal language model`, `deepseek`, `run-ci` on Apr 3, 2026
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the multimodal embedding logic to improve efficiency through per-image chunk-aware encoding and centralized device management. Key changes include the introduction of _get_chunked_embedding_by_item for optimized caching and the removal of redundant device transfer logic across various model implementations like Qwen3-VL and DeepSeek-VL2. Feedback highlights that _move_items_to_device should be updated to handle numpy.ndarray features to prevent downstream failures, and a safety check is needed for item.offsets to avoid potential TypeError exceptions.

Comment on lines +467 to +468:

```python
if isinstance(item.feature, torch.Tensor) and item.feature.device != device:
    item.feature = item.feature.to(device, non_blocking=True)
```
Contributor


medium

The _move_items_to_device function only handles torch.Tensor features. However, MultimodalDataItem.feature can also be a numpy.ndarray (as defined in schedule_batch.py). If a feature is a numpy array, it won't be moved to the device, which will cause subsequent model operations (like torch.cat in qwen3_vl.py) to fail. It's safer to convert numpy arrays to tensors before moving them to the device.

```python
for item in items:
    if item.feature is not None:
        if not isinstance(item.feature, torch.Tensor):
            item.feature = torch.from_numpy(item.feature)
        if item.feature.device != device:
            item.feature = item.feature.to(device, non_blocking=True)
```

```python
# Use per-image path when all items have exactly one offset (already
# split per-image) — this avoids encoding images not in this chunk.
# Fall back to combined path for non-split items or EVS.
is_per_image = all(len(item.offsets) == 1 for item in embedding_items_per_req)
```
Contributor


medium

The check len(item.offsets) == 1 will raise a TypeError if item.offsets is None. While most processors set offsets, MultimodalDataItem defines offsets as Optional[list]. A safer check should ensure offsets is not None before accessing its length.

Suggested change:

```diff
-is_per_image = all(len(item.offsets) == 1 for item in embedding_items_per_req)
+is_per_image = all(item.offsets is not None and len(item.offsets) == 1 for item in embedding_items_per_req)
```

@yhyang201 changed the title from "[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" to "[DO NOT MERGE][VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" on Apr 3, 2026
@yhyang201
Collaborator Author

Per-Image ViT Cache Benchmark Results

Model: Qwen/Qwen3-VL-8B-Instruct (tp=1)
Scenario: multiturn_image (each round adds 1 new image, 8 rounds total)
Branch: chunking-encoding vs main
Hardware: NVIDIA H200

ViT Encoding Time per Chunk Prefill

720p (1280x720)

| Round (images) | main ViT (ms) | main imgs encoded | PR ViT (ms) | PR imgs encoded | ViT saved (ms) |
|---|---|---|---|---|---|
| 0 (1 img) | 29 | 1 | 29 | 1 | 0 |
| 1 (2 imgs) | 33 | 2 | 17 | 1 | -16 |
| 2 (3 imgs) | 43 | 3 | 16 | 1 | -27 |
| 3 (4 imgs) | 50 | 4 | 17 | 1 | -33 |
| 4 (5 imgs) | 62 | 5 | 17 | 1 | -45 |
| 5 (6 imgs) | 70 | 6 | 17 | 1 | -53 |
| 6 (7 imgs) | 81 | 7 | 17 | 1 | -64 |
| 7 (8 imgs) | 90 | 8 | 17 | 1 | -73 |

1080p (1920x1080)

| Round (images) | main ViT (ms) | main imgs encoded | PR ViT (ms) | PR imgs encoded | ViT saved (ms) |
|---|---|---|---|---|---|
| 0 (1 img) | 59 | 1 | 49 | 1 | 0 |
| 1 (2 imgs) | 77 | 2 | 40 | 1 | -37 |
| 2 (3 imgs) | 106 | 3 | 39 | 1 | -67 |
| 3 (4 imgs) | 145 | 4 | 39 | 1 | -106 |
| 4 (5 imgs) | 179 | 5 | 39 | 1 | -140 |
| 5 (6 imgs) | 216 | 6 | 39 | 1 | -177 |
| 6 (7 imgs) | 249 | 7 | 39 | 1 | -210 |
| 7 (8 imgs) | 282 | 8 | 39 | 1 | -243 |

2K (2560x1440)

| Round (images) | main ViT (ms) | main imgs encoded | PR ViT (ms) | PR imgs encoded | ViT saved (ms) |
|---|---|---|---|---|---|
| 0 (1 img) | 97 | 1 | 96 | 1 | 0 |
| 1 (2 imgs) | 225 | 2 | 88 | 1 | -137 |
| 2 (3 imgs) | 336 | 3 | 90 | 1 | -246 |
| 3 (4 imgs) | 343 | 4 | 88 | 1 | -255 |
| 4 (5 imgs) | 433 | 5 | 88 | 1 | -345 |
| 5 (6 imgs) | 516 | 6 | 89 | 1 | -427 |
| 6 (7 imgs) | 600 | 7 | 89 | 1 | -511 |
| 7 (8 imgs) | 698 | 8 | 88 | 1 | -610 |

TTFT Comparison (multiturn_image)

720p

| Round (images) | main TTFT (ms) | PR TTFT (ms) | improvement |
|---|---|---|---|
| 0 (1 img) | 846 | 848 | - |
| 1 (2 imgs) | 230 | 184 | -46ms (-20%) |
| 2 (3 imgs) | 297 | 235 | -62ms (-21%) |
| 3 (4 imgs) | 354 | 306 | -48ms (-14%) |
| 4 (5 imgs) | 436 | 355 | -81ms (-19%) |
| 5 (6 imgs) | 507 | 394 | -113ms (-22%) |
| 6 (7 imgs) | 567 | 454 | -113ms (-20%) |
| 7 (8 imgs) | 685 | 516 | -169ms (-25%) |

1080p

| Round (images) | main TTFT (ms) | PR TTFT (ms) | improvement |
|---|---|---|---|
| 0 (1 img) | 995 | 990 | - |
| 1 (2 imgs) | 474 | 392 | -82ms (-17%) |
| 2 (3 imgs) | 642 | 522 | -120ms (-19%) |
| 3 (4 imgs) | 832 | 662 | -170ms (-20%) |
| 4 (5 imgs) | 1062 | 806 | -256ms (-24%) |
| 5 (6 imgs) | 1216 | 975 | -241ms (-20%) |
| 6 (7 imgs) | 1379 | 1084 | -295ms (-21%) |
| 7 (8 imgs) | 1555 | 1223 | -332ms (-21%) |

2K

| Round (images) | main TTFT (ms) | PR TTFT (ms) | improvement |
|---|---|---|---|
| 0 (1 img) | 1225 | 1349 | - |
| 1 (2 imgs) | 923 | 686 | -237ms (-26%) |
| 2 (3 imgs) | 1363 | 945 | -418ms (-31%) |
| 3 (4 imgs) | 1574 | 1197 | -377ms (-24%) |
| 4 (5 imgs) | 1933 | 1460 | -473ms (-24%) |
| 5 (6 imgs) | 2257 | 1688 | -569ms (-25%) |
| 6 (7 imgs) | 2613 | 1965 | -648ms (-25%) |
| 7 (8 imgs) | 2993 | 2172 | -821ms (-27%) |

Summary

| Resolution | Single-image ViT (ms) | ViT saved at 8 imgs (ms) | TTFT improvement at 8 imgs |
|---|---|---|---|
| 720p | ~17 | 73 | -169ms (-25%) |
| 1080p | ~39 | 243 | -332ms (-21%) |
| 2K | ~88 | 610 | -821ms (-27%) |

Root cause: On main, the embedding cache key is the combined hash of all images. In multi-turn conversations, every new image invalidates the combined hash, forcing a full ViT re-encoding of all images. This PR switches to per-image caching, so only the newly added image is encoded — previously seen images are served from the embedding cache.

Remaining TTFT gap: Even with ViT savings, TTFT still grows with prompt length due to radix tree lookup, scheduling overhead, and prefilling uncached tokens. These are independent of ViT and not addressed by this PR.
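The cache-key effect described above can be sketched as follows. This is an illustrative model (the hashing scheme and function names are assumptions), not the sglang implementation:

```python
import hashlib

def combined_key(image_hashes):
    # main: one key over the whole image set -> changes when any image is added
    return hashlib.sha256("|".join(image_hashes).encode()).hexdigest()

cache = {}

def encode_round(image_hashes):
    """Per-image keys: return how many images are actually (re-)encoded."""
    misses = 0
    for h in image_hashes:
        if h not in cache:
            cache[h] = f"embedding({h})"  # stand-in for a ViT forward pass
            misses += 1
    return misses

# Round 1 encodes 1 image; round 2 adds one image and encodes only the new
# one, while the combined key differs between rounds (a full miss on main).
assert encode_round(["img_a"]) == 1
assert encode_round(["img_a", "img_b"]) == 1
assert combined_key(["img_a"]) != combined_key(["img_a", "img_b"])
```

With per-image keys, each round's ViT cost is proportional to the number of new images rather than the total image count, which matches the roughly constant PR ViT times in the tables above.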

@yhyang201
Collaborator Author

Image Limit Probe Benchmark Results

Model: Qwen/Qwen3.5-27B (tp=1)
Tool: sgl-bench probe --max-images 64 --timeout 120
Branch: chunking-encoding vs main
Hardware: NVIDIA H200

Max Image Limit

| Resolution | main | PR | improvement |
|---|---|---|---|
| 720p (1280x720) | 64 (cap) | 64 (cap) | - |
| 1080p (1920x1080) | 32 | 64 (cap) | +100% |
| 2K (2560x1440) | 20 | 51 | +155% |

TTFT per Image Count

720p (1280x720)

| Images | main TTFT (ms) | PR TTFT (ms) | diff |
|---|---|---|---|
| 1 | 187 | 188 | - |
| 2 | 317 | 330 | - |
| 3 | 463 | 505 | - |
| 4 | 719 | 676 | -6% |
| 5 | 897 | 835 | -7% |
| 6 | 1012 | 979 | -3% |
| 7 | 1160 | 1088 | -6% |
| 8 | 1294 | 1294 | 0% |
| 9 | 1446 | 1374 | -5% |
| 10 | 1642 | 1487 | -9% |
| 11 | 1789 | 1607 | -10% |
| 12 | 2093 | 1737 | -17% |
| 13 | 2269 | 1863 | -18% |
| 14 | 2422 | 1996 | -18% |
| 15 | 2561 | 2100 | -18% |
| 16 | 2718 | 2618 | -4% |
| 17 | 2906 | 2369 | -18% |
| 18 | 3065 | 2491 | -19% |
| 19 | 3494 | 2637 | -25% |
| 20 | 3695 | 2799 | -24% |
| 21 | 3895 | 2896 | -26% |
| 22 | 4021 | 3016 | -25% |
| 23 | 4254 | 3145 | -26% |
| 24 | 4436 | 3299 | -26% |
| 25 | 4689 | 3605 | -23% |
| 26 | 4768 | 3668 | -23% |
| 27 | 4953 | 3786 | -24% |
| 28 | 5550 | 3905 | -30% |
| 29 | 5769 | 4060 | -30% |
| 30 | 5946 | 4172 | -30% |
| 31 | 6098 | 4267 | -30% |
| 32 | 6265 | 5279 | -16% |
| 33 | 6467 | 4503 | -30% |
| 34 | 6671 | 4614 | -31% |
| 35 | 6906 | 4748 | -31% |
| 36 | 7061 | 4870 | -31% |
| 37 | 7570 | 4988 | -34% |
| 38 | 8142 | 5093 | -37% |
| 39 | 8317 | 5195 | -38% |
| 40 | 8496 | 5344 | -37% |
| 41 | 8821 | 5494 | -38% |
| 42 | 8992 | 5609 | -38% |
| 43 | 9190 | 5707 | -38% |
| 44 | 9387 | 5815 | -38% |
| 45 | 9617 | 5958 | -38% |
| 46 | 9856 | 6088 | -38% |
| 47 | 10704 | 6241 | -42% |
| 48 | 10932 | 7691 | -30% |
| 49 | 11036 | 7847 | -29% |
| 50 | 11412 | 8026 | -30% |
| 51 | 11638 | 8117 | -30% |
| 52 | 11801 | 8224 | -30% |
| 53 | 12035 | 8466 | -30% |
| 54 | 12245 | 8650 | -29% |
| 55 | 12449 | 8778 | -29% |
| 56 | 13481 | 9045 | -33% |
| 57 | 13799 | 9199 | -33% |
| 58 | 13935 | 9398 | -33% |
| 59 | 14152 | 9512 | -33% |
| 60 | 14478 | 9710 | -33% |
| 61 | 14756 | 9826 | -33% |
| 62 | 15076 | 9999 | -34% |
| 63 | 15262 | 10208 | -33% |
| 64 | 15529 | 10246 | -34% |

1080p (1920x1080)

| Images | main TTFT (ms) | PR TTFT (ms) | diff |
|---|---|---|---|
| 1 | 391 | 391 | - |
| 2 | 726 | 731 | - |
| 3 | 1096 | 1073 | -2% |
| 4 | 1612 | 1618 | 0% |
| 5 | 1955 | 1983 | +1% |
| 6 | 2547 | 2332 | -8% |
| 7 | 2963 | 2697 | -9% |
| 8 | 3397 | 3113 | -8% |
| 9 | 4113 | 3450 | -16% |
| 10 | 4540 | 3799 | -16% |
| 11 | 4977 | 4184 | -16% |
| 12 | 5513 | 4554 | -17% |
| 13 | 6382 | 4991 | -22% |
| 14 | 6871 | 5381 | -22% |
| 15 | 7419 | 5818 | -22% |
| 16 | 8035 | 6357 | -21% |
| 17 | 9129 | 6689 | -27% |
| 18 | 9737 | 7095 | -27% |
| 19 | 10253 | 7549 | -26% |
| 20 | 10967 | 7923 | -28% |
| 21 | 12119 | 8279 | -32% |
| 22 | 12722 | 8705 | -32% |
| 23 | 13259 | 9130 | -31% |
| 24 | 13915 | 9707 | -30% |
| 25 | 15571 | 10056 | -35% |
| 26 | 16218 | 10526 | -35% |
| 27 | 16905 | 10887 | -36% |
| 28 | 17598 | 11415 | -35% |
| 29 | 19312 | 11729 | -39% |
| 30 | 20029 | 12208 | -39% |
| 31 | 20672 | 12557 | -39% |
| 32 | 21530 | 13112 | -39% |
| 33 | OOM | 13536 | - |
| 34 | OOM | 13938 | - |
| 35 | OOM | 14362 | - |
| 36 | OOM | 14980 | - |
| 37 | OOM | 15253 | - |
| 38 | OOM | 15740 | - |
| 39 | OOM | 16292 | - |
| 40 | OOM | 16763 | - |
| 41 | OOM | 17120 | - |
| 42 | OOM | 17572 | - |
| 43 | OOM | 18073 | - |
| 44 | OOM | 18491 | - |
| 45 | OOM | 19073 | - |
| 46 | OOM | 19490 | - |
| 47 | OOM | 19843 | - |
| 48 | OOM | 20529 | - |
| 49 | OOM | 20992 | - |
| 50 | OOM | 22478 | - |
| 51 | OOM | 21834 | - |
| 52 | OOM | 22364 | - |
| 53 | OOM | 22879 | - |
| 54 | OOM | 23319 | - |
| 55 | OOM | 23728 | - |
| 56 | OOM | 24199 | - |
| 57 | OOM | 24744 | - |
| 58 | OOM | 25283 | - |
| 59 | OOM | 25816 | - |
| 60 | OOM | 26180 | - |
| 61 | OOM | 26835 | - |
| 62 | OOM | 27313 | - |
| 63 | OOM | 27880 | - |
| 64 | OOM | 28280 | - |

2K (2560x1440)

| Images | main TTFT (ms) | PR TTFT (ms) | diff |
|---|---|---|---|
| 1 | 667 | 667 | - |
| 2 | 1294 | 1275 | -1% |
| 3 | 2285 | 2047 | -10% |
| 4 | 3170 | 2857 | -10% |
| 5 | 4440 | 3581 | -19% |
| 6 | 5299 | 4209 | -21% |
| 7 | 6844 | 4911 | -28% |
| 8 | 7792 | 5689 | -27% |
| 9 | 8717 | 6343 | -27% |
| 10 | 10723 | 7190 | -33% |
| 11 | 11761 | 7900 | -33% |
| 12 | 14031 | 8673 | -38% |
| 13 | 15266 | 9442 | -38% |
| 14 | 17857 | 10266 | -43% |
| 15 | 19065 | 11023 | -42% |
| 16 | 21886 | 11823 | -46% |
| 17 | 23242 | 12669 | -45% |
| 18 | 24625 | 13385 | -46% |
| 19 | 27727 | 14067 | -49% |
| 20 | 29275 | 15042 | -49% |
| 21 | OOM | 15829 | - |
| 22 | OOM | 16749 | - |
| 23 | OOM | 17572 | - |
| 24 | OOM | 18444 | - |
| 25 | OOM | 19327 | - |
| 26 | OOM | 20160 | - |
| 27 | OOM | 21019 | - |
| 28 | OOM | 21943 | - |
| 29 | OOM | 22816 | - |
| 30 | OOM | 23784 | - |
| 31 | OOM | 24657 | - |
| 32 | OOM | 25581 | - |
| 33 | OOM | 26535 | - |
| 34 | OOM | 27377 | - |
| 35 | OOM | 28305 | - |
| 36 | OOM | 29334 | - |
| 37 | OOM | 30357 | - |
| 38 | OOM | 31252 | - |
| 39 | OOM | 32267 | - |
| 40 | OOM | 33479 | - |
| 41 | OOM | 34321 | - |
| 42 | OOM | 35273 | - |
| 43 | OOM | 36215 | - |
| 44 | OOM | 37278 | - |
| 45 | OOM | 38393 | - |
| 46 | OOM | 39480 | - |
| 47 | OOM | 40483 | - |
| 48 | OOM | 41513 | - |
| 49 | OOM | 42672 | - |
| 50 | OOM | 43586 | - |
| 51 | OOM | 43879 | - |

Summary

| Resolution | Metric | main | PR | improvement |
|---|---|---|---|---|
| 720p | max images | 64 | 64 | - |
| 720p | TTFT @ 32 imgs | 6265ms | 5279ms | -16% |
| 720p | TTFT @ 64 imgs | 15529ms | 10246ms | -34% |
| 1080p | max images | 32 | 64 | +100% |
| 1080p | TTFT @ 32 imgs | 21530ms | 13112ms | -39% |
| 2K | max images | 20 | 51 | +155% |
| 2K | TTFT @ 20 imgs | 29275ms | 15042ms | -49% |

Root cause of OOM on main: The combined embedding cache key changes whenever the image set changes. In probe, every step adds one more image, so the ViT re-encodes all N images from scratch each time. At high resolutions (1080p, 2K), the intermediate ViT activations for large batches exhaust GPU memory. This PR caches per-image embeddings individually, so only the new image is encoded, keeping peak memory constant regardless of total image count.
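The memory behaviour described above can be put in back-of-envelope terms: on main, the ViT batch size grows linearly with the probe step, while with per-image caching it stays constant. A sketch under those assumptions (numbers illustrative, not measured):

```python
def vit_batch_at_round(num_images, per_image_cache):
    """Images encoded in one ViT forward pass at a given probe step."""
    if per_image_cache:
        return 1            # only the newly added image misses the cache
    return num_images       # combined key: the whole image set is re-encoded

# At the 2K OOM point (20 images), main must fit the activations of a
# 20-image ViT batch in GPU memory; the PR only ever needs a 1-image batch.
assert vit_batch_at_round(20, per_image_cache=False) == 20
assert vit_batch_at_round(20, per_image_cache=True) == 1
```

Since ViT activation memory scales with the batch size, a constant 1-image batch is what lets the PR keep adding images past main's OOM point.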

@yhyang201 changed the title from "[DO NOT MERGE][VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" to "[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" on Apr 4, 2026
@yhyang201
Collaborator Author

OCRBench Accuracy Results

Model: Qwen/Qwen3.5-27B (tp=1, enable_thinking=False)
Benchmark: OCRBench (1,000 samples)
Tool: sgl-bench run with Kimi-Vendor-Verifier
Branch: chunking-encoding vs main
Hardware: NVIDIA H200

Configuration

```toml
[server]
model_path = "Qwen/Qwen3.5-27B"
extra_args = "--port 7893 --tp-size 1 --enable-multimodal"

[accuracy]
tasks = ["ocrbench"]
extra_args = "--max-tokens 8192 --stream --think-mode qwen3 --max-connections 10"
# think-mode qwen3 passes: extra_body = {"chat_template_kwargs": {"enable_thinking": False}}
```

Results

| Branch | Accuracy | Stderr | Total Time | Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| PR (chunking-encoding) | 0.845 | 0.011 | 4:52 | 852,194 | 50,873 |
| main | 0.836 | 0.012 | 5:42 | 852,194 | 66,057 |

Analysis

  • Accuracy difference (0.845 vs 0.836) is within 1 standard error — no regression.
  • PR branch completed 15% faster (4:52 vs 5:42).
  • Input tokens are identical (852,194), confirming both branches process the same prompts.
  • Main branch produced more output tokens (66,057 vs 50,873), likely due to minor non-determinism in generation.

@yhyang201 yhyang201 merged commit 34d5765 into sgl-project:main Apr 4, 2026
413 of 471 checks passed
sundar24295s pushed a commit to sundar24295s/sglang that referenced this pull request Apr 4, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
xiezhq-hermann pushed a commit to antgroup/sglang that referenced this pull request Apr 7, 2026
yhyang201 added a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026