
[CustomOp][MM] Extract MMEncoderAttention as CustomOp and replace the backend of QwenVisionAttention with it. (#30125)

Merged
Isotr0py merged 27 commits into vllm-project:main from shen-shanshan:vit
Dec 15, 2025

Conversation

@shen-shanshan
Contributor

@shen-shanshan shen-shanshan commented Dec 5, 2025

Purpose

To avoid maintaining many model-specific files in vllm-ascend, we propose removing everything under its models directory. After that, the only thing a vLLM plugin needs to do when adding a new model is register its device-specific out-of-tree (OOT) ops with vLLM. Achieving this requires some refactoring in both vLLM and vllm-ascend, such as extracting some general layers as CustomOp; see vllm-project/vllm-ascend#4084 for details.

Following #27919 and #27147, this PR unifies the logic for resolving vit_attn_backend and extracts MMEncoderAttention as a CustomOp.

To be specific, the vision attention backend should only be checked and overridden in the platform-specific implementation; we should not duplicate this logic anywhere else, such as in model_executor/models/<model_name>.py. In addition, I have moved the scattered forward-dispatch logic into this CustomOp, so that no other code needs to check current_platform.

To minimize the blast radius, I only replaced the backend of QwenVisionAttention with this CustomOp, and I have tested this PR on an Ascend A2 NPU; testing on an NVIDIA A100 GPU is still TODO. If this PR is merged, I will migrate the other model files and delete the old MultiHeadAttention in follow-up PRs.
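The dispatch pattern described above can be sketched as follows. All class and method names here are illustrative stand-ins, not vLLM's actual CustomOp API: the point is that the platform-specific forward is selected once at construction time, so per-model files never branch on current_platform themselves.

```python
# Illustrative sketch of CustomOp-style platform dispatch; names are
# hypothetical and do not match vLLM's real classes or registries.

class MMEncoderAttentionSketch:
    """Picks one forward implementation when the layer is built."""

    def __init__(self, platform: str):
        # In vLLM the platform would come from current_platform; here it
        # is passed in explicitly to keep the sketch self-contained.
        dispatch = {
            "cuda": self.forward_cuda,
            "npu": self.forward_npu,
        }
        self._forward = dispatch.get(platform, self.forward_native)

    def forward_native(self, q, k, v):
        return "native-sdpa"

    def forward_cuda(self, q, k, v):
        return "flash-attn"

    def forward_npu(self, q, k, v):
        return "ascend-attn"

    def __call__(self, q, k, v):
        # Model code calls the layer without knowing the platform.
        return self._forward(q, k, v)


attn = MMEncoderAttentionSketch("npu")
print(attn(None, None, None))  # prints "ascend-attn"
```

A plugin such as vllm-ascend would then only need to register its own subclass (the `forward_npu` path here) instead of shipping a whole replacement model file.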

Test Plan

Test Result

✅ Ascend A2 NPU

Run Qwen2.5-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384 \
--max-num-batched-tokens 16384 \
--tensor-parallel-size 2 \
--enforce-eager
```

Output:

```
{"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

Run Qwen3-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max_model_len 16384 \
--tensor-parallel-size 2 \
--enforce-eager
```

Output:

```
{"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: shen-shanshan 467638484@qq.com
Co-authored-by: Isotr0py mozf@mail2.sysu.edu.cn
Co-authored-by: tjtanaa tunjian.tan@embeddedllm.com

@mergify mergify bot added qwen Related to Qwen models nvidia rocm Related to AMD ROCm tpu Related to Google TPUs labels Dec 5, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a good step towards refactoring the attention mechanisms and making the codebase more modular by introducing MMEncoderAttention as a CustomOp. The unification of the vision attention backend logic is also a welcome improvement.

I've found a critical bug in vllm/attention/layer.py where a variable was renamed but not all its usages were updated, which would cause a runtime error. I've also pointed out an instance of code duplication in the new mm_encoder_attention.py file that should be addressed to improve maintainability.

Once these issues are resolved, this PR will be a solid contribution to the project's architecture.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

https://github.com/vllm-project/vllm/blob/a995c1480683198b2f5ee9e5fcc8e149bdae8790/vllm/model_executor/models/paddleocr_vl.py#L612-L616
P1 Badge PaddleOCR vision attention calls helper with outdated signature

maybe_get_vit_flash_attn_backend now only accepts the backend and returns a single function, but the PaddleOCR vision attention still unpacks two return values and passes attn_backend_override. Instantiating this module will now raise TypeError: maybe_get_vit_flash_attn_backend() got an unexpected keyword argument 'attn_backend_override', preventing the model from loading.
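The failure mode can be reproduced in miniature. The function below is a hypothetical stand-in for the refactored helper, not vLLM's real implementation; it only illustrates why a call site still written against the old two-value, keyword-accepting signature now raises TypeError at module construction.

```python
# Hypothetical stand-in for the refactored helper: per the review above,
# it now takes only the backend and returns a single function rather
# than a tuple.
def maybe_get_vit_flash_attn_backend(attn_backend):
    def attn_fn(q, k, v):
        return "ok"
    return attn_fn

# A caller still using the old signature fails before any attention runs:
try:
    fn, backend = maybe_get_vit_flash_attn_backend(
        "FLASH_ATTN", attn_backend_override=None
    )
except TypeError as e:
    print(e)  # ... got an unexpected keyword argument 'attn_backend_override'
```

This is why the PaddleOCR vision attention module would fail to load until its call site is updated to the new signature.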

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@shen-shanshan shen-shanshan marked this pull request as draft December 5, 2025 09:51
@mergify

mergify bot commented Dec 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shen-shanshan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 8, 2025
@shen-shanshan shen-shanshan marked this pull request as ready for review December 8, 2025 12:11
@mergify mergify bot removed the needs-rebase label Dec 8, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@shen-shanshan
Contributor Author

CC @Isotr0py @tjtanaa @DarkLight1337

Member

@Isotr0py Isotr0py left a comment


Thanks for this effort! I left some initial comments, and will further look into this tomorrow. PTAL!

Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

Signed-off-by: shen-shanshan <467638484@qq.com>
iboiko-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Dec 15, 2025
Fix for upstream PR:
vllm-project/vllm#30125

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Dec 15, 2025
… backend of QwenVisionAttention with it. (vllm-project#30125)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
joa-stdn pushed a commit to joa-stdn/vllm that referenced this pull request Dec 15, 2025
… backend of QwenVisionAttention with it. (vllm-project#30125)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Joachim Studnia <joachim@mistral.ai>
teddygood pushed a commit to teddygood/vllm that referenced this pull request Dec 16, 2025
… backend of QwenVisionAttention with it. (vllm-project#30125)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
@JartX
Contributor

JartX commented Dec 16, 2025

Hi everyone,
@shen-shanshan @Isotr0py @tjtanaa this has broken TORCH.SDPA attention on the ViT, and it has started hallucinating again:

#27744 (comment)

I have checked what changed:

https://github.com/vllm-project/vllm/pull/30125/changes#diff-65596b65d014c0a5be414881cb834532b23b593599ae1126aafd575af0218225

The .contiguous() call was removed. This means that when you ask the model about, for example, a bill, it might say it is looking at a cake or a Chinese restaurant.

@tjtanaa
Collaborator

tjtanaa commented Dec 16, 2025

@JartX let me fix it quickly. Thank you for pinging.

@tjtanaa
Collaborator

tjtanaa commented Dec 16, 2025

@JartX This is the fix #30789

@JartX
Contributor

JartX commented Dec 16, 2025

@tjtanaa many thanks :D

lkk12014402 pushed a commit to lkk12014402/vllm-gaudi that referenced this pull request Dec 17, 2025
…project#718)

Fix for upstream PR:
vllm-project/vllm#30125

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
Signed-off-by: lvkaokao <kaokao.lv@intel.com>
@AndreasKaratzas
Collaborator

AndreasKaratzas commented Dec 20, 2025

@shen-shanshan Hello, I guess I am late to this one, but I would appreciate it if we could address the following issue. First of all, I see that all the tests proposed here in tests/models/multimodal/generation/test_vit_backend_functionality.py are skipped, with the reason Broken test due to memory segmentation fault, so we should probably make those work. Furthermore, this PR breaks ROCm in several places:

FAILED models/multimodal/generation/test_pixtral.py::test_chat[bfloat16-8192-mistralai/Pixtral-12B-2409] - AssertionError: Test2:
FAILED models/multimodal/generation/test_pixtral.py::test_chat[bfloat16-65536-mistralai/Pixtral-12B-2409] - AssertionError: Test2:
FAILED models/multimodal/pooling/test_intern_vit.py::test_models[half-OpenGVLab/InternViT-300M-448px] - RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
FAILED models/multimodal/pooling/test_intern_vit.py::test_models[half-OpenGVLab/InternViT-6B-448px-V1-5] - RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
FAILED models/multimodal/pooling/test_radio.py::test_radio[half-nvidia/C-RADIOv2-H] - RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
FAILED models/multimodal/pooling/test_radio.py::test_radio[bfloat16-nvidia/C-RADIOv2-H] - RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
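The "view size is not compatible ... Use .reshape(...)" failures above are torch's complaint when .view() is called on a tensor whose memory layout is no longer contiguous (typically after a transpose). As a small illustration of the underlying layout issue, NumPy exposes the same contiguity flags; this sketch is only an analogy, since the actual fix in vLLM involves torch tensors:

```python
import numpy as np

# A transpose changes strides without moving any data, leaving the array
# non-contiguous in C order -- the state in which torch's zero-copy
# .view() refuses to operate.
a = np.arange(12).reshape(3, 4)
t = a.T

print(a.flags["C_CONTIGUOUS"])  # True
print(t.flags["C_CONTIGUOUS"])  # False

# reshape is allowed to copy, which mirrors the error message's advice
# to use .reshape(...) (or call .contiguous() first in torch).
flat = t.reshape(-1)
print(flat[:4])  # [0 4 8 1]
```

In torch code, either switching `.view(...)` to `.reshape(...)` or restoring the removed `.contiguous()` call resolves this class of error.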

I would appreciate some help here to get ROCm also functional wrt the recent SDPA changes.

cc @tjtanaa @JartX @Isotr0py

@AndreasKaratzas
Collaborator

Btw, regarding the test tests/models/multimodal/generation/test_vit_backend_functionality.py: if I remove that pytest.skip line, everything passes on ROCm. But the current SDPA implementation is still broken on ROCm, which means the proposed test does not properly cover the newly proposed changes.

@AndreasKaratzas
Collaborator

After debugging this a bit, I realized that it is flash attention that is broken on ROCm, not SDPA. But because Pixtral is probably skipped on NVIDIA due to its large memory requirements, it was not migrated to the new MMEncoderAttention. I'll continue debugging flash attention on ROCm.

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Dec 20, 2025

Edit: SDPA is also buggy on ROCm with bfloat16 or float16; the test (pytest -s -v tests/models/multimodal/generation/test_pixtral.py::test_chat[bfloat16-8192-mistralai/Pixtral-12B-2409]) only passed with float32.

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Dec 20, 2025

Final Update: After playing more with the attention backends, I can report the following:

  1. sdpa for vit with triton fails with float16 and bfloat16
  2. flash attention for vit with triton fails with float16 and bfloat16
  3. sdpa for vit with triton passes with float32
  4. aiter flash attention for vit with triton backend fails with float16 and bfloat16
  5. aiter flash attention for vit with aiter flash attention backend passes with both float16 and bfloat16
  6. flash attention for vit with aiter flash attention backend passes (I only tested bfloat16)
  7. sdpa for vit with aiter flash attention backend passes (I only tested bfloat16)

Personal verdict: something is going on with the Triton backend when handling vision embeds in 16-bit mode on ROCm. Since this PR had nothing to do with the above and only touched vision backends, my initial claim that this PR broke ROCm is probably invalid. I apologize for the mistake; I will investigate my final observation in a separate thread/issue.

wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Dec 22, 2025
…ted patch (#4750)

### What this PR does / why we need it?

Following vllm-project/vllm#30125, register
`AscendMMEncoderAttention` CustomOp and remove related patch.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

✅ Run Qwen2.5-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct \
--max_model_len 16384
```

Output:

```
{"id":"chatcmpl-b4e3053f30ab2442","object":"chat.completion","created":1764922950,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen2.5-VL-7B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the image is \"TONGYI Qwen.\" The word \"TONGYI\" is written in blue, and \"Qwen\" is written in gray. The font appears to be modern and clean, with \"TONGYI\" being slightly larger than \"Qwen.\" The design includes a geometric, abstract shape on the left side of the logo, which complements the text.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":78,"total_tokens":162,"completion_tokens":84,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

✅ Run Qwen3-VL:

```bash
vllm serve /root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct \
--max_model_len 16384
```

Output:

```
{"id":"chatcmpl-97571fbda8267bd1","object":"chat.completion","created":1764923306,"model":"/root/.cache/modelscope/hub/models/Qwen/Qwen3-VL-8B-Instruct","choices":[{"index":0,"message":{"role":"assistant","content":"The text in the illustration is **“TONGYI Qwen”**.\n\n### How it looks:\n- **“TONGYI”** is written in **uppercase letters** in a **bold, modern sans-serif font**, colored **blue**.\n- **“Qwen”** is written in **lowercase letters** in a **slightly thinner, elegant sans-serif font**, colored **dark gray**.\n- The two lines of text are stacked vertically, with “TONG","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null,"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":112,"total_tokens":212,"completion_tokens":100,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
```

- vLLM version: v0.12.0
- vLLM main:
vllm-project/vllm@ad32e3e

---------

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Majid-Taheri pushed a commit to Majid-Taheri/vllm that referenced this pull request Dec 23, 2025
… backend of QwenVisionAttention with it. (vllm-project#30125)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Ubuntu <mjtaheri68@gmail.com>
dsuhinin pushed a commit to dsuhinin/vllm that referenced this pull request Jan 21, 2026
… backend of QwenVisionAttention with it. (vllm-project#30125)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
rajanintel24 pushed a commit to rajanintel24/vllm-gaudi that referenced this pull request Feb 11, 2026
…project#718)

Fix for upstream PR:
vllm-project/vllm#30125

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
… backend of QwenVisionAttention with it. (vllm-project#30125)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…ted patch (vllm-project#4750)

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…ted patch (vllm-project#4750)

Signed-off-by: shen-shanshan <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>

Labels

  • multi-modality: Related to multi-modality (#4194)
  • nvidia
  • qwen: Related to Qwen models
  • ready: ONLY add when PR is ready to merge/full CI is needed
  • rocm: Related to AMD ROCm
  • tpu: Related to Google TPUs

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants