[model-gateway] Implement Zero-Copy Vision Tensor Access #15750
slin1237 merged 10 commits into sgl-project:main from
Conversation
Summary of Changes

Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the performance of vision tensor processing within the model gateway by implementing a zero-copy access mechanism. By transitioning from deep copies to borrowed slices for contiguous data, it drastically reduces memory allocations and processing latency, ensuring more efficient handling of large multimodal inputs.
Code Review
This is an excellent optimization that significantly improves performance for multimodal request processing by implementing zero-copy access for vision tensors. The use of std::borrow::Cow is idiomatic and effectively avoids unnecessary memory allocations and copies on the hot path. The change is well-motivated, clearly explained, and includes a benchmark to validate the impressive performance gains. The fallback path for non-contiguous tensors ensures correctness is maintained. Great work!
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@slin1237 I haven’t touched
fixed
Motivation
Vision model tensors (e.g., for LLaVA or Qwen-VL) are large, often several megabytes in size. In the previous implementation, the `pixel_values_flat` method in `image_processor.rs` performed a linear-time deep copy of the entire vision tensor into a new heap-allocated `Vec` on every call. For a standard batch of 4 images at 336x336 resolution, this operation allocated and copied ~5.4 MB of data per call. This created memory pressure and CPU overhead on the hot path for multimodal request processing, leading to millisecond-scale latencies for a simple data access operation.
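The ~5.4 MB figure follows directly from the batch shape; a quick sanity check:

```rust
fn main() {
    // One batch of 4 RGB images at 336x336, stored as f32:
    let elems: usize = 4 * 3 * 336 * 336;            // 1,354,752 elements
    let bytes = elems * std::mem::size_of::<f32>();  // 4 bytes per f32
    assert_eq!(bytes, 5_419_008);                    // ≈ 5.4 MB copied per call
    println!("{} bytes", bytes);
}
```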
The goal of this pull request is to implement zero-copy access to these tensors, eliminating redundant allocations and significantly reducing processing latency for multimodal inputs.
Modifications
- Modified `src/multimodal/vision/image_processor.rs` to return `std::borrow::Cow<'_, [f32]>` instead of `Vec<f32>`.
- Used `ndarray::as_slice()` to check for memory contiguity. If the tensor is contiguous (the case for 100% of standard preprocessor outputs), the method now returns a borrowed slice (`Cow::Borrowed`), bypassing the heap allocator and copy logic entirely.
- Because `Cow<[f32]>` implements `Deref`, all existing call sites in vision processors (e.g., `qwen2_vl.rs`, `qwen3_vl.rs`, `phi4_vision.rs`, `llama4_vision.rs`) and integration tests continue to function without modification.

Accuracy Tests
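A minimal sketch of the borrow-or-copy pattern described above (the function name and the `is_contiguous` flag are illustrative; the real code determines contiguity via `ndarray::as_slice()`, which returns `Some` only for standard-layout tensors):

```rust
use std::borrow::Cow;

// Illustrative stand-in for the gateway's accessor: borrow when the
// backing storage is contiguous, copy only as a fallback.
fn flatten<'a>(tensor: &'a [f32], is_contiguous: bool) -> Cow<'a, [f32]> {
    if is_contiguous {
        Cow::Borrowed(tensor) // fast path: no allocation, no copy
    } else {
        // Fallback for non-contiguous layouts: materialize an owned Vec.
        Cow::Owned(tensor.to_vec())
    }
}

fn main() {
    let data = vec![0.25f32; 16];
    let flat = flatten(&data, true);
    assert!(matches!(flat, Cow::Borrowed(_)));
    // Cow<[f32]> derefs to &[f32], so call sites need no changes:
    assert_eq!(flat.len(), 16);
}
```

Because the return type derefs to `&[f32]`, callers that only read the slice compile unchanged against either variant.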
Accuracy was verified with `cargo test --test vision_golden_tests`.

Benchmarking and Profiling
The optimization was validated using a benchmark simulating a 4-image batch (shape: `[4, 3, 336, 336]`, ~5.4 MB).
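For illustration, a self-contained micro-benchmark in the spirit of the one above (the shape is taken from the PR; the harness itself is a sketch, not the PR's actual benchmark):

```rust
use std::borrow::Cow;
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Same shape as the PR's benchmark: [4, 3, 336, 336] f32 ≈ 5.4 MB.
    let tensor = vec![0.0f32; 4 * 3 * 336 * 336];

    // Old path: deep copy on every access.
    let t0 = Instant::now();
    for _ in 0..100 {
        let copied: Vec<f32> = tensor.clone();
        black_box(&copied); // keep the copy from being optimized away
    }
    let copy_time = t0.elapsed();

    // New path: hand out a borrowed slice instead.
    let t1 = Instant::now();
    for _ in 0..100 {
        let view: Cow<'_, [f32]> = Cow::Borrowed(&tensor[..]);
        black_box(&view);
    }
    let borrow_time = t1.elapsed();

    println!("deep copy: {:?}  borrow: {:?}", copy_time, borrow_time);
    assert!(borrow_time < copy_time);
}
```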