
[VLM][Reland] Refactor load_mm_data to improve performance#16152

Merged
JustinTong0323 merged 1 commit into sgl-project:main from antgroup:refactor_load_mm_data
Jan 18, 2026

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Dec 30, 2025

Motivation

As discussed with @JustinTong0323, we need this enhancement as part of the token-in and token-out implementation; requesting a re-land of #14644.

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments (/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci) or contact authorized users to do so.
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request re-lands a previous change aimed at significantly enhancing the performance of multimodal data loading within the system. It introduces an optimized data loading mechanism that leverages concurrent I/O operations for models that can benefit from a direct, aligned data processing approach. This improvement is a foundational step for upcoming token-in and token-out functionalities.

Highlights

  • Performance Refactor: The load_mm_data function has been refactored to improve performance, particularly for models that do not require dynamic frame expansion. This is achieved by introducing a new 'fast path' for data loading.
  • New Fast Data Loading Path: A new method, fast_load_mm_data, has been added. This method directly loads multimodal data concurrently using _submit_mm_data_loading_tasks_simple, assuming a 1:1 alignment between tokens and data, bypassing prompt scanning for efficiency.
  • Conditional Data Loading: The main load_mm_data function now acts as a dispatcher, calling either the new fast_load_mm_data or the original legacy_load_mm_data based on whether the processor supports dynamic frame expansion (checked via self.support_dynamic_frame_expansion).
  • MiniCPM Support: The MiniCPMMultimodalProcessor has been updated to explicitly set support_dynamic_frame_expansion = True, indicating it will utilize the legacy data loading path for now, while other models might use the new fast path.
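
The dispatch described in the highlights can be sketched roughly as follows. This is an illustrative reconstruction from the summary above, not the actual sglang code: the method bodies, the `_load_one` helper, and the constructor-free class shape are assumptions; only the names `load_mm_data`, `fast_load_mm_data`, `legacy_load_mm_data`, and `support_dynamic_frame_expansion` come from the PR.

```python
# Hypothetical sketch of the fast/legacy dispatch described in the PR summary.
from concurrent.futures import ThreadPoolExecutor


class BaseMultimodalProcessor:
    # Models that need prompt scanning (e.g. dynamic frame expansion)
    # opt out of the fast path by setting this flag to True.
    support_dynamic_frame_expansion = False

    def load_mm_data(self, prompt, mm_data_urls):
        # Dispatcher: choose the fast path unless the model requires
        # dynamic frame expansion.
        if self.support_dynamic_frame_expansion:
            return self.legacy_load_mm_data(prompt, mm_data_urls)
        return self.fast_load_mm_data(mm_data_urls)

    def fast_load_mm_data(self, mm_data_urls):
        # Fast path: assumes a 1:1 alignment between multimodal tokens and
        # data items, so items can be fetched concurrently without scanning
        # the prompt. pool.map preserves input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(self._load_one, mm_data_urls))

    def legacy_load_mm_data(self, prompt, mm_data_urls):
        # Legacy path: sequential loading, placeholder for the prompt-scanning
        # behavior some models (e.g. MiniCPM) still rely on.
        return [self._load_one(u) for u in mm_data_urls]

    def _load_one(self, url):
        # Placeholder for real image/video fetching and decoding.
        return f"loaded:{url}"
```

The key point is that `fast_load_mm_data` never inspects the prompt, which is what makes the 1:1 token-to-data assumption load-bearing.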


@yuan-luo
Collaborator Author

/tag-and-rerun-ci

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request refactors the multimodal data loading process to improve performance. It introduces a fast_load_mm_data path that bypasses prompt scanning for multimodal tokens, assuming a direct 1:1 mapping of data to tokens. A legacy_load_mm_data path is preserved for models like MiniCPM that require the old behavior. The dispatch is handled by a new support_dynamic_frame_expansion flag. The changes are well-structured and the performance improvement is a logical consequence of the new fast path. I have a couple of suggestions to improve code clarity and consistency.

Comment thread python/sglang/srt/multimodal/processors/base_processor.py
Comment thread python/sglang/srt/multimodal/processors/base_processor.py
@yuan-luo
Collaborator Author

yuan-luo commented Jan 2, 2026

/rerun-failed-ci

2 similar comments
@yuan-luo
Collaborator Author

yuan-luo commented Jan 2, 2026

/rerun-failed-ci

@BBuf
Collaborator

BBuf commented Jan 3, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Jan 4, 2026

/tag-and-rerun-ci

@yuan-luo
Collaborator Author

yuan-luo commented Jan 4, 2026

/rerun-failed-ci

1 similar comment
@yuan-luo
Collaborator Author

yuan-luo commented Jan 5, 2026

/rerun-failed-ci

@yuan-luo
Collaborator Author

yuan-luo commented Jan 5, 2026

CI has an error, which I can reproduce locally. Investigating; setting the PR to draft for the moment.

root@6996fb46042d:/sgl-workspace/sglang_dev/test/srt# python3 -m unittest test_skip_tokenizer_init
......
[2026-01-05 09:28:34] INFO:     127.0.0.1:37038 - "GET /health_generate HTTP/1.1" 200 OK
[CI Test Method] TestSkipTokenizerInitVLM.test_eos_behavior
[2026-01-05 09:28:34] INFO:     127.0.0.1:37054 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2026-01-05 09:28:34] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 119, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.12/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 105, in app
    response = await f(request)
               ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 426, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/fastapi/routing.py", line 312, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 490, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/tokenizer_manager.py", line 694, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 968, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 918, in _process_and_collect_mm_items
    ret = self.process_mm_data(
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
             ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
                       ~~~~~~~~~~~~~~^^^^^^^
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

CI error:

Traceback (most recent call last):
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/utils/common.py", line 2506, in retry
    return fn()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/test/test_utils.py", line 1720, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/test/srt/test_skip_tokenizer_init.py", line 157, in test_simple_decode
    self.run_decode()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
WARNING:sglang.srt.utils.common:retry() failed once (0th try, maximum 1 retries). Will delay 1.98s and retry. Error: Expecting value: line 1 column 1 (char 0)
[2026-01-04 10:24:25] INFO:     127.0.0.1:40848 - "POST /generate HTTP/1.1" 500 Internal Server Error
[2026-01-04 10:24:25] ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/h11_impl.py", line 410, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1135, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 107, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 63, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 716, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 736, in app
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 290, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 115, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 42, in wrapped_app
    await app(scope, receive, sender)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 101, in app
    response = await f(request)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 355, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 243, in run_endpoint_function
    return await dependant.call(**values)
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/entrypoints/http_server.py", line 643, in generate_request
    ret = await _global_state.tokenizer_manager.generate_request(
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 490, in generate_request
    tokenized_obj = await self._tokenize_one_request(obj)
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/managers/tokenizer_manager.py", line 694, in _tokenize_one_request
    mm_inputs: Dict = await self.mm_data_processor.process(
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/managers/async_mm_data_processor.py", line 99, in process
    return await asyncio.wait_for(_invoke(), timeout=self.timeout_s)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/managers/async_mm_data_processor.py", line 70, in _invoke
    return await self._proc_async(
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/multimodal/processors/qwen_vl.py", line 337, in process_mm_data_async
    mm_items, input_ids, ret = self.process_and_combine_mm_data(
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 968, in process_and_combine_mm_data
    collected_items, input_ids, ret = self._process_and_collect_mm_items(
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 918, in _process_and_collect_mm_items
    ret = self.process_mm_data(
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/multimodal/processors/base_processor.py", line 327, in process_mm_data
    result = processor.__call__(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_5_vl/processing_qwen2_5_vl.py", line 177, in __call__
    num_image_tokens = image_grid_thw[index].prod() // merge_length
IndexError: index 1 is out of bounds for dimension 0 with size 1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 976, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/srt/utils/common.py", line 2506, in retry
    return fn()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/python/sglang/test/test_utils.py", line 1720, in <lambda>
    lambda: super(CustomTestCase, self)._callTestMethod(method),
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/test/srt/test_skip_tokenizer_init.py", line 157, in test_simple_decode
    self.run_decode()
  File "/public_sglang_ci/runner-l1c-gpu-45/_work/sglang/sglang/test/srt/test_skip_tokenizer_init.py", line 68, in run_decode
    ret = response.json()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 980, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E[CI Test Method] TestSkipTokenizerInitVLM.test_simple_decode_stream
.

@yuan-luo yuan-luo marked this pull request as draft January 5, 2026 09:36
@yuan-luo
Collaborator Author

yuan-luo commented Jan 5, 2026

This PR breaks the skip-tokenizer-init feature, which is an important one. The fix is non-trivial; investigating in depth.

@yuan-luo yuan-luo force-pushed the refactor_load_mm_data branch from 9b9eefb to a5f0789 Compare January 6, 2026 07:51
@yuan-luo yuan-luo marked this pull request as ready for review January 6, 2026 07:52
@yuan-luo
Collaborator Author

yuan-luo commented Jan 6, 2026

/tag-and-rerun-ci

@yuan-luo
Collaborator Author

yuan-luo commented Jan 6, 2026

In SGLang, the essence of --skip-tokenizer-init is that it splits the responsibility of “turning a prompt into model-usable inputs” and outsources part of it to the client.

skip-tokenizer-init = the server does not load a tokenizer (and therefore does not accept text prompts). It only accepts already-tokenized input_ids (or an even lower-level form like input_embeds), and then runs backend inference directly.

The current implementation is multimodal, which is trickier than the text-only LLM case:

  • When skip-tokenizer-init is enabled: the client sends input_ids
  • At the same time, it still allows / requires sending image_data
  • The server-side mm_data_processor still performs an additional multimodal step via process_mm_data_async

This creates the core contradiction we ran into:
The client has already expanded image-related tokens into input_ids (commonly a large number of image_pad tokens), but the server decodes input_ids back into text and then uses the processor to realign them in a "text + images" manner.

As a result, the “expanded image tokens” get misinterpreted as “multiple image placeholders,” causing alignment to fail (out-of-bounds).

Given this analysis, the fix is to fall back to the legacy slow path when skip-tokenizer-init is enabled.
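
The dispatch condition described above can be sketched as a tiny predicate. This is a hedged illustration, not the actual fix: the function name `choose_mm_load_path` and its boolean parameters are hypothetical, and only the flag semantics (skip-tokenizer-init and dynamic frame expansion forcing the legacy path) come from the discussion.

```python
# Illustrative sketch of the fallback logic described above; names are
# assumptions, not the exact sglang implementation.
def choose_mm_load_path(support_dynamic_frame_expansion: bool,
                        skip_tokenizer_init: bool) -> str:
    # The fast path assumes one placeholder token per data item. With
    # skip-tokenizer-init, the client-supplied input_ids already contain the
    # expanded image_pad tokens, which the fast path would misread as multiple
    # image placeholders (the out-of-bounds IndexError seen in CI), so the
    # legacy slow path must be used instead.
    if skip_tokenizer_init or support_dynamic_frame_expansion:
        return "legacy_load_mm_data"
    return "fast_load_mm_data"
```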

@yuan-luo
Collaborator Author

yuan-luo commented Jan 6, 2026

Manually ran the test case; it passed.

@yuan-luo yuan-luo force-pushed the refactor_load_mm_data branch from a5f0789 to 9c14f49 Compare January 6, 2026 08:03
@yuan-luo
Collaborator Author

yuan-luo commented Jan 6, 2026

This PR has some impact on #14091.
@minleminzui Could you help review this PR as well? Thanks.

@yuan-luo yuan-luo requested a review from minleminzui January 6, 2026 09:21
@yuan-luo
Collaborator Author

yuan-luo commented Jan 7, 2026

/rerun-failed-ci

1 similar comment
@yuan-luo
Collaborator Author

yuan-luo commented Jan 7, 2026

/rerun-failed-ci

@JustinTong0323
Collaborator

/tag-and-rerun-ci

@JustinTong0323 JustinTong0323 merged commit 6d29d8a into sgl-project:main Jan 18, 2026
230 of 243 checks passed
DotSlash-A pushed a commit to DotSlash-A/sglang that referenced this pull request Jan 19, 2026
* fix(ci): recover from corrupted MMMU parquet cache (sgl-project#17256)

* [diffusion] feat: support default 4-step inference for Flux2-Klein distilled models (sgl-project#17225)

Signed-off-by: Lancer <maruixiang6688@gmail.com>

* Add runner utilization report workflow (sgl-project#17234)

* cli: support sglang version (sgl-project#17250)

* Use swa radix cache and memory pool for gpt-oss model (sgl-project#17261)

* [VLM][Reland] Refactor load_mm_data to improve performance (sgl-project#16152)

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

* [Tiny] Improve docs (sgl-project#17264)

* [diffusion] fix: set guidance_scale default to None (sgl-project#17182)

* Tiny fix comment typo (sgl-project#17287)

* [SPEC_V2] Enable cudagraph draft_extend for trtllm_mla_backend and Acclen Fix for DP under cudagraph mode (sgl-project#16974)

* Add kl test for swa radix cache (sgl-project#17281)

* fix: Handle multiple named chat templates in HuggingFace tokenizers (sgl-project#17236)

Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>

* Move radix cache related tests (sgl-project#17295)

* [Refactor] Add `-fp4-gemm-backend` to replace `SGLANG_FLASHINFER_FP4_GEMM_BACKEND` (sgl-project#16534)

Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>

* [Bugfix] Fix PD accuracy when MTP is not configured on the prefill node (sgl-project#17212)

Co-authored-by: Shangming Cai <csmthu@gmail.com>

* [Diffusion] Apply jit qk_norm to flux1 (sgl-project#17296)

* [Refactor] Split out deepseek v2 weight loader function into mixin (sgl-project#16649)

* [NPU]Support GPT-OSS for NPU (sgl-project#14197)

* [jit-kernel] Add CuTe DSL GDN Decode Kernel (sgl-project#15631)

Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>

* [GLM 4.7] Add RTX 6000 Pro aka sm120 (sgl-project#17235)

Co-authored-by: root <root@ubuntu-nvidia.localdomain>

* Update CODEOWNERS for multimodal_gen (sgl-project#17308)

Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>

* [Feature] overlap LoRA weight loading with compute (sgl-project#15512)

* [PD] Optimize MHA models pp util calculation logic (sgl-project#17306)

* [Minor] Correct sglang version when installing from source (sgl-project#17315)

* Use dsv3 optimized routing `fused_topk_deepseek` instead of `moe_fused_gate` (sgl-project#15347)

* [DeepSeek v3.2] Opt MTP decode cuda batch sizes and nsa implementation (sgl-project#16961)

* Update code sync scripts (sgl-project#17319)

* [Auto Sync] Update tokenizer_manager.py (20260119) (sgl-project#17317)

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* support new qwen3_coder_detector (sgl-project#16744)

Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>

* Fix kernel selection in biased_grouped_topk_gpu (sgl-project#17325)

* KV Cache Events with Attention DP bug fix (sgl-project#16030) (sgl-project#16412)

* [Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)

Co-authored-by: Minglei Zhu <zminglei@linkedin.com>

* [CI] Add partition to stage-b-test-large-1-gpu (11->12) (sgl-project#17245)

* fix(ci): rate limit and permission errors in trace publishing (sgl-project#17238)

* Revert "[Perf] fuse q, k norm for Flux2Attention (sgl-project#17241)" (sgl-project#17332)

* Migrate performance, accuracy, and quantization tests to CI registry (sgl-project#17177)

Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>

* Inclusion of nvfp4 blockscale in EPLB Rebalance (sgl-project#17158)

* [Refactor] Set `fp4-gemm-backend=auto` on SM100 and rename `fp4-gemm-backend` with `flashinfer_` prefix (sgl-project#17309)

* [Diffusion] Apply qknorm to flux2 and apply lightx2v rms_norm_one_pass kernel(without residual) (sgl-project#17305)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix v32 continue_final_message not work (sgl-project#16567)

* Evict swa kv cache during decoding (sgl-project#17220)

* [RadixTree][1/N Refactor]: Support unified match_prefix params (sgl-project#17142)

Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>

* [AMD CI] Migrate and Add More Testcases (sgl-project#17116)

Co-authored-by: yctseng0211 <yctseng@amd.com>

* [AMD] CI - add partitions for stage-b-test-small-1-gpu-amd (sgl-project#17345)

* Restore deepseek_v2.py to main's code, except the utils

* Ran `pre-commit`

---------

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Hudson Xing <1277646412@qq.com>
Co-authored-by: Lancer <402430575@qq.com>
Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
Co-authored-by: Yuan Luo <yuan.luo@hotmail.com>
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Mohammad Miadh Angkad <mangkad.bsdsba2027@aim.edu>
Co-authored-by: Changyi Yang <112288487+ChangyiYang@users.noreply.github.com>
Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com>
Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com>
Co-authored-by: b8zhong <b8zhong@uwaterloo.ca>
Co-authored-by: Vincent Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Ch3ngY1 <91232537+Ch3ngY1@users.noreply.github.com>
Co-authored-by: Shangming Cai <csmthu@gmail.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Jerry Ji <jerryjilol@gmail.com>
Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com>
Co-authored-by: Jinyan Chen <93358689+liz-badada@users.noreply.github.com>
Co-authored-by: Jinyan Chen <jinyanc@nvidia.com>
Co-authored-by: Koushik Dutta <koush@koushikdutta.com>
Co-authored-by: root <root@ubuntu-nvidia.localdomain>
Co-authored-by: Glen Liu <62917497+glenliu21@users.noreply.github.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Co-authored-by: Lee Nau <lnau@nvidia.com>
Co-authored-by: Yongfei Xu <xuyongfei.xyf@antgroup.com>
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Gaoji Liu <34803073+attack204@users.noreply.github.com>
Co-authored-by: liugaoji.lgj <liugaoji.lgj@alibaba-inc.com>
Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com>
Co-authored-by: Kartik Ramesh <kartikx2000@gmail.com>
Co-authored-by: Minglei Zhu <mingleizhu1122@gmail.com>
Co-authored-by: Minglei Zhu <zminglei@linkedin.com>
Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com>
Co-authored-by: Shu Wang <shuw@nvidia.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ybyang <10629930+whybeyoung@users.noreply.github.com>
Co-authored-by: zhangheng <hzh0425@apache.org>
Co-authored-by: yizhang2077 <1109276519@qq.com>
Co-authored-by: pansicheng <sicheng.pan.chn@gmail.com>
Co-authored-by: Bingxu Chen <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>