[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer#22038

Merged
yhyang201 merged 1 commit into sgl-project:main from yhyang201:chunking-encoding on Apr 4, 2026
Conversation

@yhyang201
Collaborator

Motivation

  • Per-image embedding cache: Switch multimodal embedding cache granularity from per-request (combine_hashes(all_items)) to per-image (item.hash), improving cache reuse under LRU eviction and avoiding full recomputation when any single image is evicted.
  • Chunk-aware ViT encoding: Only encode images whose token ranges overlap with the current chunked prefill chunk, instead of encoding all images in chunk 1. Reduces peak GPU memory and ViT compute for multi-image requests.
  • Lazy device transfer: Remove eager CPU→GPU transfer of all image features from prepare_for_extend(). Defer to the embedding pipeline (mm_utils.py) so only chunk-relevant cache-miss items are moved to GPU.
  • Remove redundant model-level device transfers: Models (qwen3_vl, deepseek_vl2, phi4mm, step3_vl_10b) no longer manually .to(device) their features — handled uniformly by mm_utils.py.
  • Remove dead code: ~240 net lines removed, including unused chunked padding functions and qwen3_vl's internal ViT batching logic.
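The chunk-aware selection above encodes, per prefill chunk, only the images whose token ranges overlap that chunk. A minimal sketch of the selection logic, with illustrative names (`MMItem`, `items_for_chunk`) that are assumptions rather than the actual sglang API:

```python
from dataclasses import dataclass

@dataclass
class MMItem:
    hash: int
    offset: tuple  # (start, end) token range the image occupies in the prompt

def items_for_chunk(items, chunk_start, chunk_end):
    """Keep only items whose token range overlaps [chunk_start, chunk_end)."""
    return [
        it for it in items
        if it.offset[0] < chunk_end and it.offset[1] > chunk_start
    ]

# Three images; a chunk covering tokens [100, 300) touches only the second.
items = [MMItem(1, (0, 80)), MMItem(2, (120, 200)), MMItem(3, (350, 430))]
assert [it.hash for it in items_for_chunk(items, 100, 300)] == [2]
```

Only the selected cache-miss items then need a ViT forward and a device transfer, which is what bounds per-chunk peak memory for multi-image requests.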

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@yhyang201
Collaborator Author

/tag-and-rerun-ci

@github-actions bot added labels `Multi-modal`, `multi-modal language model`, `deepseek`, `run-ci` on Apr 3, 2026
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the multimodal embedding logic to improve efficiency through per-image chunk-aware encoding and centralized device management. Key changes include the introduction of _get_chunked_embedding_by_item for optimized caching and the removal of redundant device transfer logic across various model implementations like Qwen3-VL and DeepSeek-VL2. Feedback highlights that _move_items_to_device should be updated to handle numpy.ndarray features to prevent downstream failures, and a safety check is needed for item.offsets to avoid potential TypeError exceptions.

Comment on lines +467 to +468:

```python
if isinstance(item.feature, torch.Tensor) and item.feature.device != device:
    item.feature = item.feature.to(device, non_blocking=True)
```
Contributor


medium

The _move_items_to_device function only handles torch.Tensor features. However, MultimodalDataItem.feature can also be a numpy.ndarray (as defined in schedule_batch.py). If a feature is a numpy array, it won't be moved to the device, which will cause subsequent model operations (like torch.cat in qwen3_vl.py) to fail. It's safer to convert numpy arrays to tensors before moving them to the device.

```python
for item in items:
    if item.feature is not None:
        if not isinstance(item.feature, torch.Tensor):
            item.feature = torch.from_numpy(item.feature)
        if item.feature.device != device:
            item.feature = item.feature.to(device, non_blocking=True)
```

```python
# Use per-image path when all items have exactly one offset (already
# split per-image) — this avoids encoding images not in this chunk.
# Fall back to combined path for non-split items or EVS.
is_per_image = all(len(item.offsets) == 1 for item in embedding_items_per_req)
```
Contributor


medium

The check len(item.offsets) == 1 will raise a TypeError if item.offsets is None. While most processors set offsets, MultimodalDataItem defines offsets as Optional[list]. A safer check should ensure offsets is not None before accessing its length.

Suggested change:

```diff
-is_per_image = all(len(item.offsets) == 1 for item in embedding_items_per_req)
+is_per_image = all(item.offsets is not None and len(item.offsets) == 1 for item in embedding_items_per_req)
```

@yhyang201 changed the title from "[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" to "[DO NOT MERGE][VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" on Apr 3, 2026
@yhyang201
Collaborator Author

Per-Image ViT Cache Benchmark Results

Model: Qwen/Qwen3-VL-8B-Instruct (tp=1)
Scenario: multiturn_image (each round adds 1 new image, 8 rounds total)
Branch: chunking-encoding vs main
Hardware: NVIDIA H200

ViT Encoding Time per Chunk Prefill

720p (1280x720)

| Round (images) | main ViT (ms) | main imgs encoded | PR ViT (ms) | PR imgs encoded | ViT saved (ms) |
|---|---|---|---|---|---|
| 0 (1 img) | 29 | 1 | 29 | 1 | 0 |
| 1 (2 imgs) | 33 | 2 | 17 | 1 | -16 |
| 2 (3 imgs) | 43 | 3 | 16 | 1 | -27 |
| 3 (4 imgs) | 50 | 4 | 17 | 1 | -33 |
| 4 (5 imgs) | 62 | 5 | 17 | 1 | -45 |
| 5 (6 imgs) | 70 | 6 | 17 | 1 | -53 |
| 6 (7 imgs) | 81 | 7 | 17 | 1 | -64 |
| 7 (8 imgs) | 90 | 8 | 17 | 1 | -73 |

1080p (1920x1080)

| Round (images) | main ViT (ms) | main imgs encoded | PR ViT (ms) | PR imgs encoded | ViT saved (ms) |
|---|---|---|---|---|---|
| 0 (1 img) | 59 | 1 | 49 | 1 | 0 |
| 1 (2 imgs) | 77 | 2 | 40 | 1 | -37 |
| 2 (3 imgs) | 106 | 3 | 39 | 1 | -67 |
| 3 (4 imgs) | 145 | 4 | 39 | 1 | -106 |
| 4 (5 imgs) | 179 | 5 | 39 | 1 | -140 |
| 5 (6 imgs) | 216 | 6 | 39 | 1 | -177 |
| 6 (7 imgs) | 249 | 7 | 39 | 1 | -210 |
| 7 (8 imgs) | 282 | 8 | 39 | 1 | -243 |

2K (2560x1440)

| Round (images) | main ViT (ms) | main imgs encoded | PR ViT (ms) | PR imgs encoded | ViT saved (ms) |
|---|---|---|---|---|---|
| 0 (1 img) | 97 | 1 | 96 | 1 | 0 |
| 1 (2 imgs) | 225 | 2 | 88 | 1 | -137 |
| 2 (3 imgs) | 336 | 3 | 90 | 1 | -246 |
| 3 (4 imgs) | 343 | 4 | 88 | 1 | -255 |
| 4 (5 imgs) | 433 | 5 | 88 | 1 | -345 |
| 5 (6 imgs) | 516 | 6 | 89 | 1 | -427 |
| 6 (7 imgs) | 600 | 7 | 89 | 1 | -511 |
| 7 (8 imgs) | 698 | 8 | 88 | 1 | -610 |

TTFT Comparison (multiturn_image)

720p

| Round (images) | main TTFT (ms) | PR TTFT (ms) | improvement |
|---|---|---|---|
| 0 (1 img) | 846 | 848 | - |
| 1 (2 imgs) | 230 | 184 | -46ms (-20%) |
| 2 (3 imgs) | 297 | 235 | -62ms (-21%) |
| 3 (4 imgs) | 354 | 306 | -48ms (-14%) |
| 4 (5 imgs) | 436 | 355 | -81ms (-19%) |
| 5 (6 imgs) | 507 | 394 | -113ms (-22%) |
| 6 (7 imgs) | 567 | 454 | -113ms (-20%) |
| 7 (8 imgs) | 685 | 516 | -169ms (-25%) |

1080p

| Round (images) | main TTFT (ms) | PR TTFT (ms) | improvement |
|---|---|---|---|
| 0 (1 img) | 995 | 990 | - |
| 1 (2 imgs) | 474 | 392 | -82ms (-17%) |
| 2 (3 imgs) | 642 | 522 | -120ms (-19%) |
| 3 (4 imgs) | 832 | 662 | -170ms (-20%) |
| 4 (5 imgs) | 1062 | 806 | -256ms (-24%) |
| 5 (6 imgs) | 1216 | 975 | -241ms (-20%) |
| 6 (7 imgs) | 1379 | 1084 | -295ms (-21%) |
| 7 (8 imgs) | 1555 | 1223 | -332ms (-21%) |

2K

| Round (images) | main TTFT (ms) | PR TTFT (ms) | improvement |
|---|---|---|---|
| 0 (1 img) | 1225 | 1349 | - |
| 1 (2 imgs) | 923 | 686 | -237ms (-26%) |
| 2 (3 imgs) | 1363 | 945 | -418ms (-31%) |
| 3 (4 imgs) | 1574 | 1197 | -377ms (-24%) |
| 4 (5 imgs) | 1933 | 1460 | -473ms (-24%) |
| 5 (6 imgs) | 2257 | 1688 | -569ms (-25%) |
| 6 (7 imgs) | 2613 | 1965 | -648ms (-25%) |
| 7 (8 imgs) | 2993 | 2172 | -821ms (-27%) |

Summary

| Resolution | Single-image ViT (ms) | ViT saved at 8 imgs (ms) | TTFT improvement at 8 imgs |
|---|---|---|---|
| 720p | ~17 | 73 | -169ms (-25%) |
| 1080p | ~39 | 243 | -332ms (-21%) |
| 2K | ~88 | 610 | -821ms (-27%) |

Root cause: On main, the embedding cache key is the combined hash of all images. In multi-turn conversations, every new image invalidates the combined hash, forcing a full ViT re-encoding of all images. This PR switches to per-image caching, so only the newly added image is encoded — previously seen images are served from the embedding cache.

Remaining TTFT gap: Even with ViT savings, TTFT still grows with prompt length due to radix tree lookup, scheduling overhead, and prefilling uncached tokens. These are independent of ViT and not addressed by this PR.
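The cache-key effect described above can be sketched as follows. This is an illustrative model (the hashing scheme and function names are assumptions), not the sglang implementation:

```python
import hashlib

def combined_key(image_hashes):
    # main: one key over the whole image set -> changes when any image is added
    return hashlib.sha256("|".join(image_hashes).encode()).hexdigest()

cache = {}

def encode_round(image_hashes):
    """Per-image keys: return how many images are actually (re-)encoded."""
    misses = 0
    for h in image_hashes:
        if h not in cache:
            cache[h] = f"embedding({h})"  # stand-in for a ViT forward pass
            misses += 1
    return misses

# Round 1 encodes 1 image; round 2 adds one image and encodes only the new
# one, while the combined key differs between rounds (a full miss on main).
assert encode_round(["img_a"]) == 1
assert encode_round(["img_a", "img_b"]) == 1
assert combined_key(["img_a"]) != combined_key(["img_a", "img_b"])
```

With per-image keys, each round's ViT cost is proportional to the number of new images rather than the total image count, which matches the roughly constant PR ViT times in the tables above.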

@yhyang201
Collaborator Author

Image Limit Probe Benchmark Results

Model: Qwen/Qwen3.5-27B (tp=1)
Tool: sgl-bench probe --max-images 64 --timeout 120
Branch: chunking-encoding vs main
Hardware: NVIDIA H200

Max Image Limit

| Resolution | main | PR | improvement |
|---|---|---|---|
| 720p (1280x720) | 64 (cap) | 64 (cap) | - |
| 1080p (1920x1080) | 32 | 64 (cap) | +100% |
| 2K (2560x1440) | 20 | 51 | +155% |

TTFT per Image Count

720p (1280x720)

| Images | main TTFT (ms) | PR TTFT (ms) | diff |
|---|---|---|---|
| 1 | 187 | 188 | - |
| 2 | 317 | 330 | - |
| 3 | 463 | 505 | - |
| 4 | 719 | 676 | -6% |
| 5 | 897 | 835 | -7% |
| 6 | 1012 | 979 | -3% |
| 7 | 1160 | 1088 | -6% |
| 8 | 1294 | 1294 | 0% |
| 9 | 1446 | 1374 | -5% |
| 10 | 1642 | 1487 | -9% |
| 11 | 1789 | 1607 | -10% |
| 12 | 2093 | 1737 | -17% |
| 13 | 2269 | 1863 | -18% |
| 14 | 2422 | 1996 | -18% |
| 15 | 2561 | 2100 | -18% |
| 16 | 2718 | 2618 | -4% |
| 17 | 2906 | 2369 | -18% |
| 18 | 3065 | 2491 | -19% |
| 19 | 3494 | 2637 | -25% |
| 20 | 3695 | 2799 | -24% |
| 21 | 3895 | 2896 | -26% |
| 22 | 4021 | 3016 | -25% |
| 23 | 4254 | 3145 | -26% |
| 24 | 4436 | 3299 | -26% |
| 25 | 4689 | 3605 | -23% |
| 26 | 4768 | 3668 | -23% |
| 27 | 4953 | 3786 | -24% |
| 28 | 5550 | 3905 | -30% |
| 29 | 5769 | 4060 | -30% |
| 30 | 5946 | 4172 | -30% |
| 31 | 6098 | 4267 | -30% |
| 32 | 6265 | 5279 | -16% |
| 33 | 6467 | 4503 | -30% |
| 34 | 6671 | 4614 | -31% |
| 35 | 6906 | 4748 | -31% |
| 36 | 7061 | 4870 | -31% |
| 37 | 7570 | 4988 | -34% |
| 38 | 8142 | 5093 | -37% |
| 39 | 8317 | 5195 | -38% |
| 40 | 8496 | 5344 | -37% |
| 41 | 8821 | 5494 | -38% |
| 42 | 8992 | 5609 | -38% |
| 43 | 9190 | 5707 | -38% |
| 44 | 9387 | 5815 | -38% |
| 45 | 9617 | 5958 | -38% |
| 46 | 9856 | 6088 | -38% |
| 47 | 10704 | 6241 | -42% |
| 48 | 10932 | 7691 | -30% |
| 49 | 11036 | 7847 | -29% |
| 50 | 11412 | 8026 | -30% |
| 51 | 11638 | 8117 | -30% |
| 52 | 11801 | 8224 | -30% |
| 53 | 12035 | 8466 | -30% |
| 54 | 12245 | 8650 | -29% |
| 55 | 12449 | 8778 | -29% |
| 56 | 13481 | 9045 | -33% |
| 57 | 13799 | 9199 | -33% |
| 58 | 13935 | 9398 | -33% |
| 59 | 14152 | 9512 | -33% |
| 60 | 14478 | 9710 | -33% |
| 61 | 14756 | 9826 | -33% |
| 62 | 15076 | 9999 | -34% |
| 63 | 15262 | 10208 | -33% |
| 64 | 15529 | 10246 | -34% |

1080p (1920x1080)

| Images | main TTFT (ms) | PR TTFT (ms) | diff |
|---|---|---|---|
| 1 | 391 | 391 | - |
| 2 | 726 | 731 | - |
| 3 | 1096 | 1073 | -2% |
| 4 | 1612 | 1618 | 0% |
| 5 | 1955 | 1983 | +1% |
| 6 | 2547 | 2332 | -8% |
| 7 | 2963 | 2697 | -9% |
| 8 | 3397 | 3113 | -8% |
| 9 | 4113 | 3450 | -16% |
| 10 | 4540 | 3799 | -16% |
| 11 | 4977 | 4184 | -16% |
| 12 | 5513 | 4554 | -17% |
| 13 | 6382 | 4991 | -22% |
| 14 | 6871 | 5381 | -22% |
| 15 | 7419 | 5818 | -22% |
| 16 | 8035 | 6357 | -21% |
| 17 | 9129 | 6689 | -27% |
| 18 | 9737 | 7095 | -27% |
| 19 | 10253 | 7549 | -26% |
| 20 | 10967 | 7923 | -28% |
| 21 | 12119 | 8279 | -32% |
| 22 | 12722 | 8705 | -32% |
| 23 | 13259 | 9130 | -31% |
| 24 | 13915 | 9707 | -30% |
| 25 | 15571 | 10056 | -35% |
| 26 | 16218 | 10526 | -35% |
| 27 | 16905 | 10887 | -36% |
| 28 | 17598 | 11415 | -35% |
| 29 | 19312 | 11729 | -39% |
| 30 | 20029 | 12208 | -39% |
| 31 | 20672 | 12557 | -39% |
| 32 | 21530 | 13112 | -39% |
| 33 | OOM | 13536 | - |
| 34 | OOM | 13938 | - |
| 35 | OOM | 14362 | - |
| 36 | OOM | 14980 | - |
| 37 | OOM | 15253 | - |
| 38 | OOM | 15740 | - |
| 39 | OOM | 16292 | - |
| 40 | OOM | 16763 | - |
| 41 | OOM | 17120 | - |
| 42 | OOM | 17572 | - |
| 43 | OOM | 18073 | - |
| 44 | OOM | 18491 | - |
| 45 | OOM | 19073 | - |
| 46 | OOM | 19490 | - |
| 47 | OOM | 19843 | - |
| 48 | OOM | 20529 | - |
| 49 | OOM | 20992 | - |
| 50 | OOM | 22478 | - |
| 51 | OOM | 21834 | - |
| 52 | OOM | 22364 | - |
| 53 | OOM | 22879 | - |
| 54 | OOM | 23319 | - |
| 55 | OOM | 23728 | - |
| 56 | OOM | 24199 | - |
| 57 | OOM | 24744 | - |
| 58 | OOM | 25283 | - |
| 59 | OOM | 25816 | - |
| 60 | OOM | 26180 | - |
| 61 | OOM | 26835 | - |
| 62 | OOM | 27313 | - |
| 63 | OOM | 27880 | - |
| 64 | OOM | 28280 | - |

2K (2560x1440)

| Images | main TTFT (ms) | PR TTFT (ms) | diff |
|---|---|---|---|
| 1 | 667 | 667 | - |
| 2 | 1294 | 1275 | -1% |
| 3 | 2285 | 2047 | -10% |
| 4 | 3170 | 2857 | -10% |
| 5 | 4440 | 3581 | -19% |
| 6 | 5299 | 4209 | -21% |
| 7 | 6844 | 4911 | -28% |
| 8 | 7792 | 5689 | -27% |
| 9 | 8717 | 6343 | -27% |
| 10 | 10723 | 7190 | -33% |
| 11 | 11761 | 7900 | -33% |
| 12 | 14031 | 8673 | -38% |
| 13 | 15266 | 9442 | -38% |
| 14 | 17857 | 10266 | -43% |
| 15 | 19065 | 11023 | -42% |
| 16 | 21886 | 11823 | -46% |
| 17 | 23242 | 12669 | -45% |
| 18 | 24625 | 13385 | -46% |
| 19 | 27727 | 14067 | -49% |
| 20 | 29275 | 15042 | -49% |
| 21 | OOM | 15829 | - |
| 22 | OOM | 16749 | - |
| 23 | OOM | 17572 | - |
| 24 | OOM | 18444 | - |
| 25 | OOM | 19327 | - |
| 26 | OOM | 20160 | - |
| 27 | OOM | 21019 | - |
| 28 | OOM | 21943 | - |
| 29 | OOM | 22816 | - |
| 30 | OOM | 23784 | - |
| 31 | OOM | 24657 | - |
| 32 | OOM | 25581 | - |
| 33 | OOM | 26535 | - |
| 34 | OOM | 27377 | - |
| 35 | OOM | 28305 | - |
| 36 | OOM | 29334 | - |
| 37 | OOM | 30357 | - |
| 38 | OOM | 31252 | - |
| 39 | OOM | 32267 | - |
| 40 | OOM | 33479 | - |
| 41 | OOM | 34321 | - |
| 42 | OOM | 35273 | - |
| 43 | OOM | 36215 | - |
| 44 | OOM | 37278 | - |
| 45 | OOM | 38393 | - |
| 46 | OOM | 39480 | - |
| 47 | OOM | 40483 | - |
| 48 | OOM | 41513 | - |
| 49 | OOM | 42672 | - |
| 50 | OOM | 43586 | - |
| 51 | OOM | 43879 | - |

Summary

| Resolution | Metric | main | PR | improvement |
|---|---|---|---|---|
| 720p | max images | 64 | 64 | - |
| 720p | TTFT @ 32 imgs | 6265ms | 5279ms | -16% |
| 720p | TTFT @ 64 imgs | 15529ms | 10246ms | -34% |
| 1080p | max images | 32 | 64 | +100% |
| 1080p | TTFT @ 32 imgs | 21530ms | 13112ms | -39% |
| 2K | max images | 20 | 51 | +155% |
| 2K | TTFT @ 20 imgs | 29275ms | 15042ms | -49% |

Root cause of OOM on main: The combined embedding cache key changes whenever the image set changes. In probe, every step adds one more image, so the ViT re-encodes all N images from scratch each time. At high resolutions (1080p, 2K), the intermediate ViT activations for large batches exhaust GPU memory. This PR caches per-image embeddings individually, so only the new image is encoded, keeping peak memory constant regardless of total image count.
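The memory behaviour described above can be put in back-of-envelope terms: on main, the ViT batch size grows linearly with the probe step, while with per-image caching it stays constant. A sketch under those assumptions (numbers illustrative, not measured):

```python
def vit_batch_at_round(num_images, per_image_cache):
    """Images encoded in one ViT forward pass at a given probe step."""
    if per_image_cache:
        return 1            # only the newly added image misses the cache
    return num_images       # combined key: the whole image set is re-encoded

# At the 2K OOM point (20 images), main must fit the activations of a
# 20-image ViT batch in GPU memory; the PR only ever needs a 1-image batch.
assert vit_batch_at_round(20, per_image_cache=False) == 20
assert vit_batch_at_round(20, per_image_cache=True) == 1
```

Since ViT activation memory scales with the batch size, a constant 1-image batch is what lets the PR keep adding images past main's OOM point.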

@yhyang201 changed the title from "[DO NOT MERGE][VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" to "[VLM] Chunk-aware ViT encoding with per-image cache and lazy device transfer" on Apr 4, 2026
@yhyang201
Collaborator Author

OCRBench Accuracy Results

Model: Qwen/Qwen3.5-27B (tp=1, enable_thinking=False)
Benchmark: OCRBench (1,000 samples)
Tool: sgl-bench run with Kimi-Vendor-Verifier
Branch: chunking-encoding vs main
Hardware: NVIDIA H200

Configuration

```toml
[server]
model_path = "Qwen/Qwen3.5-27B"
extra_args = "--port 7893 --tp-size 1 --enable-multimodal"

[accuracy]
tasks = ["ocrbench"]
extra_args = "--max-tokens 8192 --stream --think-mode qwen3 --max-connections 10"
# think-mode qwen3 passes: extra_body = {"chat_template_kwargs": {"enable_thinking": False}}
```

Results

| Branch | Accuracy | Stderr | Total Time | Input Tokens | Output Tokens |
|---|---|---|---|---|---|
| PR (chunking-encoding) | 0.845 | 0.011 | 4:52 | 852,194 | 50,873 |
| main | 0.836 | 0.012 | 5:42 | 852,194 | 66,057 |

Analysis

  • Accuracy difference (0.845 vs 0.836) is within 1 standard error — no regression.
  • PR branch completed 15% faster (4:52 vs 5:42).
  • Input tokens are identical (852,194), confirming both branches process the same prompts.
  • Main branch produced more output tokens (66,057 vs 50,873), likely due to minor non-determinism in generation.

@yhyang201 yhyang201 merged commit 34d5765 into sgl-project:main Apr 4, 2026
413 of 471 checks passed
sundar24295s pushed a commit to sundar24295s/sglang that referenced this pull request Apr 4, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
xiezhq-hermann pushed a commit to antgroup/sglang that referenced this pull request Apr 7, 2026
yhyang201 added a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026