
[model-gateway] Implement Zero-Copy Vision Tensor Access #15750

Merged
slin1237 merged 10 commits into sgl-project:main from ppraneth:vision
Dec 24, 2025

Conversation

@ppraneth
Contributor

Motivation

Vision model tensors (e.g., for LLaVA or Qwen-VL) are large, often several megabytes in size. In the previous implementation, the pixel_values_flat method in image_processor.rs performed a linear-time deep copy of the entire vision tensor into a new heap-allocated Vec on every call. For a standard batch of 4 images at 336x336 resolution, this operation allocated and copied ~5.4 MB of data per call.

This created memory pressure and CPU overhead on the hot path for multimodal request processing, leading to millisecond-scale latencies for a simple data access operation.

The goal of this pull request is to implement zero-copy access to these tensors, eliminating redundant allocations and significantly reducing processing latency for multimodal inputs.

Modifications

  • Core Logic Change: Updated src/multimodal/vision/image_processor.rs to return std::borrow::Cow<'_, [f32]> instead of Vec<f32>.
  • Optimization: Leveraged ndarray::as_slice() to check for memory contiguity. If the tensor is contiguous (the case for 100% of standard preprocessor outputs), it now returns a borrowed slice (Cow::Borrowed), bypassing the heap allocator and copy logic entirely.
  • Compatibility: Because Cow<[f32]> implements Deref, all existing call sites in vision processors (e.g., qwen2_vl.rs, qwen3_vl.rs, phi4_vision.rs, llama4_vision.rs) and integration tests continue to function without modification.
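The shape of the change can be sketched with a minimal std-only example. The struct and field names below are hypothetical stand-ins (the real code holds an `ndarray` array and checks contiguity via `as_slice()`, which returns `Some(&[f32])` only for contiguous, standard-layout data); the point is the `Cow` branch: borrow when contiguous, copy only as a fallback.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for the gateway's vision tensor: a flat buffer plus
// a flag for memory contiguity. The real implementation derives this from
// ndarray's `as_slice()` returning `Some` or `None`.
struct VisionTensor {
    data: Vec<f32>,
    contiguous: bool,
}

impl VisionTensor {
    /// Zero-copy when possible: borrow the existing buffer instead of cloning.
    fn pixel_values_flat(&self) -> Cow<'_, [f32]> {
        if self.contiguous {
            // No allocation, no copy: hand back a view of the existing buffer.
            Cow::Borrowed(&self.data)
        } else {
            // Fallback for non-contiguous layouts: materialize a copy
            // (the real code gathers elements in logical order here).
            Cow::Owned(self.data.clone())
        }
    }
}

fn main() {
    let t = VisionTensor { data: vec![0.5; 8], contiguous: true };
    let flat = t.pixel_values_flat();
    // Contiguous input takes the borrowed, allocation-free path.
    assert!(matches!(flat, Cow::Borrowed(_)));
    // Deref lets callers keep treating the result as a plain slice.
    assert_eq!(flat.len(), 8);
}
```

Because both branches return the same `Cow<'_, [f32]>` type, callers need no knowledge of which path was taken.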

Accuracy Tests

  • Golden Tests: Verified compatibility by running cargo test --test vision_golden_tests.

Benchmarking and Profiling

The optimization was validated using a benchmark simulating a 4-image batch (Shape: [4, 3, 336, 336], ~5.4 MB).

| Metric | Before (Current) | Post (Optimized) | Delta |
| --- | --- | --- | --- |
| Mean Latency | 5.6564 ms | 25.456 ns | ~222,000x Speedup |
| Memory Allocation | ~5.4 MB / call | 0 MB | 100% Reduction |
| Execution Path | Deep Copy | Reference | Zero-Copy |
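The "0 MB" row can be checked directly: a borrowed `Cow` points at the original buffer, whereas a copy does not. Below is a std-only sketch (the PR's actual benchmark, `benches/vision_access_bench.rs`, uses a proper harness) contrasting the two paths for the benchmarked [4, 3, 336, 336] shape:

```rust
use std::borrow::Cow;

fn main() {
    // The benchmarked tensor: 4 images x 3 channels x 336 x 336 f32 values,
    // i.e. 1,354,752 elements = 5,419,008 bytes (~5.4 MB).
    let data: Vec<f32> = vec![0.0; 4 * 3 * 336 * 336];

    // Before: deep copy on every access, allocating ~5.4 MB per call.
    // The copy lives at a different address than the source.
    let copied: Vec<f32> = data.clone();
    assert_ne!(copied.as_ptr(), data.as_ptr());

    // After: a borrowed view. The pointer is the original buffer's,
    // demonstrating that no bytes were moved or allocated.
    let borrowed: Cow<'_, [f32]> = Cow::Borrowed(&data[..]);
    assert_eq!(borrowed.as_ptr(), data.as_ptr());

    println!("bytes: {}", data.len() * std::mem::size_of::<f32>());
}
```

The pointer-equality check is also a cheap way to assert zero-copy behavior in tests, independent of timing noise.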

Checklist

@github-actions github-actions Bot added dependencies Pull requests that update a dependency file Multi-modal multi-modal language model model-gateway labels Dec 24, 2025
@gemini-code-assist
Contributor

Summary of Changes

Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the performance of vision tensor processing within the model gateway by implementing a zero-copy access mechanism. By transitioning from deep copies to borrowed slices for contiguous data, it drastically reduces memory allocations and processing latency, ensuring more efficient handling of large multimodal inputs.

Highlights

  • Performance Optimization for Vision Tensors: Previously, the pixel_values_flat method performed a linear-time deep copy of large vision tensors (e.g., ~5.4 MB for a 4-image batch) on every call. This caused significant memory pressure and CPU overhead on the hot path, leading to millisecond-scale latencies for a simple data access operation.
  • Zero-Copy Access Implementation: The pixel_values_flat method in src/multimodal/vision/image_processor.rs has been updated to return std::borrow::Cow<'_, [f32]> instead of Vec<f32>. It now leverages ndarray::as_slice() to return a borrowed slice (Cow::Borrowed) if the tensor is contiguous, thereby eliminating redundant allocations and copies.
  • Significant Performance Gains: Benchmarking validates a ~222,000x speedup in mean latency (from 5.6564 ms to 25.456 ns) and a 100% reduction in memory allocation (from ~5.4 MB/call to 0 MB) for vision tensor access, achieving zero-copy execution.
  • Backward Compatibility: Because Cow<[f32]> implements Deref, all existing call sites in vision processors (e.g., qwen2_vl.rs, qwen3_vl.rs, phi4_vision.rs, llama4_vision.rs) and integration tests continue to function without modification.
  • New Benchmark Added: A new benchmark, vision_access_bench, has been added to Cargo.toml and implemented in benches/vision_access_bench.rs to specifically measure and validate the performance improvements of vision tensor access.
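The backward-compatibility point in the highlights above rests on deref coercion. A short sketch (the `sum` consumer is hypothetical, standing in for the unchanged call sites in `qwen2_vl.rs` and friends) shows why functions taking `&[f32]` keep compiling against a `Cow`:

```rust
use std::borrow::Cow;

// A hypothetical downstream consumer that predates the change and still
// accepts a plain slice.
fn sum(values: &[f32]) -> f32 {
    values.iter().sum()
}

fn main() {
    let owned: Cow<'_, [f32]> = Cow::Owned(vec![1.0, 2.0, 3.0]);
    let backing = [4.0f32, 5.0];
    let borrowed: Cow<'_, [f32]> = Cow::Borrowed(&backing);

    // Deref coercion turns `&Cow<'_, [f32]>` into `&[f32]` at the call
    // site, so existing callers work unchanged with either variant.
    assert_eq!(sum(&owned), 6.0);
    assert_eq!(sum(&borrowed), 9.0);
}
```

This is why the return-type change did not ripple into the vision processors or the integration tests.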




@gemini-code-assist Bot left a comment


Code Review

This is an excellent optimization that significantly improves performance for multimodal request processing by implementing zero-copy access for vision tensors. The use of std::borrow::Cow is idiomatic and effectively avoids unnecessary memory allocations and copies on the hot path. The change is well-motivated, clearly explained, and includes a benchmark to validate the impressive performance gains. The fallback path for non-contiguous tensors ensures correctness is maintained. Great work!

Comment thread sgl-model-gateway/src/multimodal/vision/image_processor.rs
@ppraneth
Contributor Author

@slin1237 I haven’t touched schedule_batch.py in this PR.
The lint failure appears to be related to that file, as the issue also appears in newly opened PRs.

@slin1237
Collaborator

fixed

@slin1237 slin1237 merged commit 370bd27 into sgl-project:main Dec 24, 2025
56 of 60 checks passed
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
Leoyzen pushed a commit to Leoyzen/sglang that referenced this pull request Dec 25, 2025
Leoyzen pushed a commit to Leoyzen/sglang that referenced this pull request Dec 25, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

dependencies Pull requests that update a dependency file model-gateway Multi-modal multi-modal language model run-ci
