
[Feature][MM] split the images of one request into multiparts #11828

Merged
ShangmingCai merged 21 commits into sgl-project:main from openanolis:Xuchun/mm-dev
Dec 24, 2025

Conversation

@XucSh
Collaborator

@XucSh XucSh commented Oct 19, 2025

Motivation

See issue #11785.

[2025-10-19 19:44:46] Prefill batch. #new-seq: 1, #new-token: 328, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-19 19:44:46] INFO: 127.0.0.1:59486 - "POST /generate HTTP/1.1" 200 OK
[2025-10-19 19:44:46] Prefill batch. #new-seq: 1, #new-token: 327, #cached-token: 328, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-19 19:44:46] INFO: 127.0.0.1:59498 - "POST /generate HTTP/1.1" 200 OK

Cc @stmatengss @ByronHsu

Checklist

Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @XucSh, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug related to the handling of multiple multimodal inputs, specifically images, within a single request. The core changes involve introducing logic to decompose bundled multimodal data into individual items, which then allows for more granular and correct processing. This includes applying unique token padding per item and caching embeddings on an individual item basis, ultimately enhancing the system's ability to manage complex multimodal requests efficiently and accurately.

Highlights

  • Image Splitting Logic: Implemented a new mechanism within MultimodalInputs.from_dict to automatically split bundled multimodal items (e.g., multiple images represented as a single entity) into distinct individual items. This ensures that each image is processed independently.
  • Refined Token Padding for Individual Items: Modified the pad_input_tokens function to apply specific pad values to placeholder tokens for each individual multimodal item. This change ensures correct tokenization and padding when a request contains multiple distinct items of the same modality.
  • Per-Item Embedding Caching: Updated the _get_chunked_prefill_embedding function to compute and cache embeddings for individual multimodal items rather than the entire request. This significantly improves caching efficiency and accuracy, especially for multi-image inputs, by avoiding recomputing embeddings for previously seen individual images.
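The splitting described in the first highlight can be sketched in a few lines. Note this is a minimal, self-contained illustration: the `Item` dataclass and `split_bundled_items` helper below are stand-ins for sglang's `MultimodalDataItem` and the logic in `MultimodalInputs.from_dict`, not the actual API.

```python
import copy
from dataclasses import dataclass, field

# Stand-in for sglang's MultimodalDataItem; field names are illustrative only.
@dataclass
class Item:
    modality: str
    features: list                      # one feature entry per image in the bundle
    model_specific_data: dict = field(default_factory=dict)

def split_bundled_items(items):
    """Expand each bundled multi-image item into one item per image."""
    out = []
    for item in items:
        if item.modality != "image" or len(item.features) <= 1:
            out.append(item)
            continue
        for feat in item.features:
            # Construct a new item and copy only the mutable per-item state,
            # mirroring the reviewer's suggestion to avoid deep-copying the
            # (potentially large) feature tensor.
            out.append(Item(
                modality=item.modality,
                features=[feat],
                model_specific_data=copy.deepcopy(item.model_specific_data),
            ))
    return out

bundle = Item(modality="image", features=["imgA", "imgB", "imgC"])
split = split_bundled_items([bundle])
print(len(split))  # 3 items, one per image
```

With each image in its own item, padding and embedding caching can then operate per image rather than per bundle.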

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a significant improvement by splitting bundled multi-image requests into individual parts. This enables per-image embedding caching, which should improve performance and efficiency for requests with multiple images. The changes in mm_utils.py to handle token padding and embedding caching are well-implemented. The corresponding logic in schedule_batch.py to expand bundled items is also functionally correct. My main feedback is on improving the performance of the item splitting logic to avoid unnecessary memory copies.

continue

for i in range(num_images):
    new_item = copy.deepcopy(item)
Contributor


Severity: high

Using copy.deepcopy(item) inside this loop can be inefficient, especially when dealing with a large number of images. The item.feature tensor can be quite large, and deepcopy will create a full copy of this tensor's data for each new item before it's sliced. This leads to unnecessary memory allocation and copying, potentially impacting performance.

A more efficient approach would be to construct a new MultimodalDataItem and selectively copy the necessary attributes, avoiding the deep copy of the large feature tensor. You only need to deep copy model_specific_data since it's modified in-place.

Suggested change
- new_item = copy.deepcopy(item)
+ new_item = MultimodalDataItem(
+     modality=item.modality,
+     precomputed_embeddings=item.precomputed_embeddings,
+     model_specific_data=copy.deepcopy(item.model_specific_data),
+ )

Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
@XucSh changed the title from "[buf fix] split the images of one request into multi part" to "[bug fix] split the images of one request into multi part" on Oct 20, 2025
@ByronHsu
Collaborator

ByronHsu commented Oct 20, 2025

This breaks the following assumption. I am not sure if it will have any side effect cc @mickqian @JustinTong0323

One MultimodalDataItem contains all inputs for one modality. For example, if there are 3 images and 1 audio input, there will be 2 MultimodalDataItems: one for the images and one for the audio.

@XucSh changed the title from "[bug fix] split the images of one request into multi part" to "[Performance] split the images of one request into multi part" on Oct 20, 2025
@XucSh changed the title from "[Performance] split the images of one request into multi part" to "[bug fix] split the images of one request into multi part" on Oct 20, 2025
Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
@JustinTong0323 JustinTong0323 self-assigned this Oct 20, 2025
@JustinTong0323
Collaborator

This breaks the following assumption. I am not sure if it will have any side effect cc @mickqian @JustinTong0323

One MultimodalDataItem contains all inputs for one modality. For example, if there are 3 images and 1 audio input, there will be 2 MultimodalDataItems: one for the images and one for the audio.

Yes I think it would break the internal processing logic...

@XucSh
Collaborator Author

XucSh commented Oct 21, 2025

This breaks the following assumption. I am not sure if it will have any side effect cc @mickqian @JustinTong0323

One MultimodalDataItem contains all inputs for one modality. For example, if there are 3 images and 1 audio input, there will be 2 MultimodalDataItems: one for the images and one for the audio.

Yes I think it would break the internal processing logic...

Great review, thanks! Any thoughts on the overall architecture? We could pair up on a refactor. Cc @stmatengss @ByronHsu

@mickqian changed the title from "[bug fix] split the images of one request into multi part" to "[bug fix] split the images of one request into multiparts" on Oct 21, 2025
(Outdated comment thread on python/sglang/srt/managers/mm_utils.py)
@XucSh XucSh requested a review from mickqian October 21, 2025 02:50
@llfl llfl requested a review from zhyncs as a code owner November 12, 2025 01:05
@mickqian
Collaborator

#9529

@liusy58
Collaborator

liusy58 commented Dec 4, 2025

Problem

When a request contains multiple images [A, B, C], they are hashed together as one bundle. If another request has [A, B, D], we cannot reuse the cache for images A and B.

Solution

Split bundled images into individual items so each image has its own hash.

Before: hash([A, B, C]) vs hash([A, B, D]) → no cache hit

After: hash(A), hash(B), hash(C) vs hash(A), hash(B), hash(D) → A and B hit cache
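The cache-hit behavior described above can be demonstrated with a toy per-image embedding cache keyed by content hash. This is a hypothetical sketch for illustration only (the `embed` function, the `cache` dict, and the fake embedding values are not sglang APIs):

```python
import hashlib

cache = {}       # content-hash -> embedding
computed = []    # tracks which images actually required computation

def embed(image_bytes: bytes) -> bytes:
    """Return a (fake) embedding, cached per image by content hash."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in cache:
        computed.append(key)                       # real work happens here
        cache[key] = b"embedding-of-" + image_bytes
    return cache[key]

# Request 1: images [A, B, C] -> three computations.
for img in [b"A", b"B", b"C"]:
    embed(img)

# Request 2: images [A, B, D] -> only D is computed; A and B hit the cache.
for img in [b"A", b"B", b"D"]:
    embed(img)

print(len(computed))  # 4 computations in total instead of 6
```

Hashing the bundle `[A, B, C]` as a single key would instead miss entirely on `[A, B, D]`, forcing all six computations.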

This is a great PR! I think we should merge this.

@yhyang201
Collaborator

/tag-and-rerun-ci

@yhyang201
Collaborator

Great work, let's take a look at the CI.

@ShangmingCai
Collaborator

ShangmingCai commented Dec 5, 2025

/tag-and-rerun-ci 2

@ShangmingCai
Collaborator

@yhyang201 CI is all green now, except xpu. Do you think this PR is ready to merge?

@yhyang201
Collaborator

I think it’s okay.

Collaborator

@ShangmingCai ShangmingCai left a comment


LGTM. But it still needs a double-check from @mickqian.

minor: is it possible to wrap some of the logic in def from_dict(obj: dict): into an mm util? The code in the scheduler batch seems pretty long.

@liusy58
Collaborator

liusy58 commented Dec 8, 2025

LGTM. But it still needs a double-check from @mickqian.

minor: is it possible to wrap some of the logic in def from_dict(obj: dict): into an mm util? The code in the scheduler batch seems pretty long.

Done

Collaborator

@ShangmingCai ShangmingCai left a comment


Thx. This looks better to me now.

@stmatengss
Collaborator

/rerun-failed-ci

@stmatengss
Collaborator

This PR is approved. Please ensure the requested changes are implemented. THX. @mickqian

@JustinTong0323
Collaborator

/rerun-failed-ci

@XucSh
Collaborator Author

XucSh commented Dec 23, 2025

/rerun-failed-ci

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
@XucSh
Collaborator Author

XucSh commented Dec 23, 2025

/rerun-failed-ci

1 similar comment
@JustinTong0323
Collaborator

/rerun-failed-ci

@mickqian
Collaborator

@yhyang201 @yuan-luo Would you take a look?

@ShangmingCai
Collaborator

@yhyang201 CI looks OK, if you think this PR is ready to merge after double-checking, you can ping me to merge.

@yhyang201
Collaborator

LGTM. I think it can be merged.
@ShangmingCai

@ShangmingCai
Collaborator

LGTM. I think it can be merged. @ShangmingCai

@yhyang201 OK, let's get it merged. If any bug has been reported, we can do a quick fix or revert.

@ShangmingCai ShangmingCai merged commit 3bf07c6 into sgl-project:main Dec 24, 2025
167 of 171 checks passed
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
…oject#11828)

Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: Kun(llfl) <i@imux.top>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
Co-authored-by: Kun(llfl) <llfl@linux.alibaba.com>
Co-authored-by: Kun(llfl) <i@imux.top>
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026
…oject#11828)

Signed-off-by: Xuchun Shang <xuchun.shang@linux.alibaba.com>
Signed-off-by: Kun(llfl) <i@imux.top>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Co-authored-by: Teng Ma <sima.mt@alibaba-inc.com>
Co-authored-by: Kun(llfl) <llfl@linux.alibaba.com>
Co-authored-by: Kun(llfl) <i@imux.top>
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
9 participants