[v1] Add encoder-only/cross attention support to Triton Attention backend#31406
Isotr0py merged 30 commits into vllm-project:main
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Code Review
This pull request adds support for encoder-only and cross-attention to the Triton attention backend. This is achieved by introducing a new prefill attention Triton kernel that can handle non-causal attention, and a new execution path in TritonAttentionImpl for encoder attention types. The changes also refactor several multi-modal models, likely to align them with this new attention mechanism.
My review identified two critical issues in the new implementation:
- An incorrect masking logic for non-causal sliding window attention in the new Triton kernel, which results in a one-sided window instead of a bidirectional one.
- A type mismatch when passing the `sliding_window` parameter to the new attention function, which would lead to a runtime error.
I have provided suggestions to fix both issues. After these are addressed, the changes look good and provide a valuable extension to the Triton backend.
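To make the first issue concrete, here is a minimal pure-Python sketch of the difference between a one-sided (causal) and a bidirectional (non-causal) sliding window. The function name and the convention that `window` counts positions on each permitted side are assumptions for illustration; they are not taken from the PR's kernel.

```python
def sliding_window_mask(seq_len: int, window: int, causal: bool) -> list[list[bool]]:
    """Return mask[q][k] == True where query q may attend to key k.

    Assumed convention (hypothetical, not the PR's): `window` is the number
    of positions a query may look away from itself on each permitted side.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            if causal:
                # Decoder-style: only the past, at most `window` steps back.
                allowed = 0 <= q - k <= window
            else:
                # Encoder-style: both directions, at most `window` steps away.
                allowed = abs(q - k) <= window
            row.append(allowed)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for name, causal in (("causal", True), ("non-causal", False)):
        print(name)
        for row in sliding_window_mask(5, 1, causal):
            print("".join("x" if allowed else "." for allowed in row))
```

Applying the causal rule to an encoder model silently drops every key to the right of the query, which is the one-sided behaviour the review describes.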
Also cc @NickLucche and @noooop about Whisper/Encoder-only models respectively.
💡 Codex Review
https://github.com/vllm-project/vllm/blob/6011b0a5603ef118d07e28f1ff178e53a4611bc4/model_executor/models/ernie45_vl.py#L155-L159
Tensor-parallel vision attention reshapes oversized QKV
When tp_size > 1, QKVParallelLinear emits Q/K/V blocks of length equal to the full projection size, but split_qkv now reshapes them assuming only projection_size / tp_size elements (num_attention_heads_per_partition * hidden_size_per_attention_head). Without the previous gather-and-repartition step, x.view(*new_shape) will raise a size mismatch (or, if forced, misassign heads) on multi-GPU tensor-parallel runs of the vision encoder. The same pattern appears in Glm4vVisionAttention and SiglipAttention, so any vision model using tensor parallelism will fail at runtime.
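The shape arithmetic behind this comment can be sketched as follows. This is only an illustration of the scenario the review describes, not a statement about what QKVParallelLinear actually emits; the helper names are hypothetical.

```python
def per_partition_sizes(num_heads: int, head_dim: int, tp_size: int) -> tuple[int, int]:
    """Return (heads per partition, local projection size) for one TP rank."""
    if num_heads % tp_size != 0:
        raise ValueError("num_heads must be divisible by tp_size")
    heads_per_partition = num_heads // tp_size
    return heads_per_partition, heads_per_partition * head_dim


def view_is_valid(last_dim: int, heads: int, head_dim: int) -> bool:
    """A view of (..., last_dim) as (..., heads, head_dim) needs equal element counts."""
    return last_dim == heads * head_dim


if __name__ == "__main__":
    num_heads, head_dim, tp_size = 16, 64, 2
    heads, local_size = per_partition_sizes(num_heads, head_dim, tp_size)
    full_size = num_heads * head_dim
    # Reshaping a per-partition block works; reshaping a full-size block would not.
    print(heads, local_size, view_is_valid(local_size, heads, head_dim))
    print(view_is_valid(full_size, heads, head_dim))
```

If each Q/K/V block already has the per-partition size, the reshape is consistent; the failure mode in the comment only arises when a full-size block reaches the reshape.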
/gemini review
Code Review
This pull request introduces a new Triton kernel for memory-efficient prefill attention, which includes support for sliding window attention. The TritonAttention backend is updated to leverage this new kernel, enabling support for encoder-only and encoder attention types by adding a dedicated _forward_encoder_attention method. A review comment identifies a critical bug in the new Triton kernel's _fwd_kernel function, specifically in the calculation of start_n_limit for the backward sliding window, which is currently incorrect and could lead to skipped key blocks and erroneous attention outputs. A corrected formula for start_n_limit is provided in the review.
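The corrected formula itself is not reproduced on this page, so the sketch below only illustrates the general idea of bounding the key-block loop for a backward (causal) sliding window. The function name, the window convention (a query at position q sees keys in [q - window, q]), and the formula are all assumptions, not the kernel's actual `start_n_limit` computation.

```python
def key_block_range(start_m: int, block_m: int, block_n: int,
                    seq_len: int, window: int) -> tuple[int, int]:
    """Return (first key position to load, exclusive end) for one query block.

    Assumes start_m * block_m < seq_len and a causal backward window.
    """
    q_first = start_m * block_m
    q_last = min(q_first + block_m, seq_len) - 1
    # The earliest query in the block reaches furthest back.
    lowest_key = max(0, q_first - window)
    # Round down to a key-block boundary so no needed block is skipped.
    first_block_start = (lowest_key // block_n) * block_n
    # Causal attention never looks past the last query in the block.
    return first_block_start, q_last + 1


if __name__ == "__main__":
    print(key_block_range(start_m=2, block_m=16, block_n=16, seq_len=64, window=8))
```

Rounding the lower bound up instead of down, or deriving it from the last query in the block rather than the first, would skip key blocks that some queries still need, which is the class of error the review flags.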
NickLucche left a comment
can we add this backend to whisper-specific CI tests (test_transcription_validation_whisper.py)?
Sure, I think we can use FP32 to test Whisper; it should use the Triton backend by default now, since FA doesn't support FP32: vllm/tests/models/multimodal/generation/test_whisper.py, lines 114 to 144 in bf73a3e
The performance of TRITON_ATTN looks good. https://github.com/noooop/snippet/tree/main/benchmarks/triton_attention
[Figure: benchmark plot; x-axis: throughput (requests/s)]
How about the performance of
@Isotr0py This is LGTM from whisper-side, both from accuracy and latency at fp16 (don't really have a comparison to run at fp32 for enc-dec models).
Looking forward to the MMEncoderAttention backend to get a few more meaningful datapoints in benchmarks.
Thanks for your work!
The failure of the Language Models Test (Extended Pooling) in today's daily Full CI run was caused by this PR: https://buildkite.com/vllm/ci/builds/45546/steps/canvas
Running the Language Models Test (Extended Pooling) on this PR confirmed the issue: https://buildkite.com/vllm/ci/builds/45475/steps/canvas?sid=019b8e61-59ae-4d96-ad0a-b100f167c05c
Failure below is not related to this PR.
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | vllm-project#30550 | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | vllm-project#31406 | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | vllm-project#34243 | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | vllm-project#34454 | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | vllm-project#34507 | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | vllm-project#35085 | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | vllm-project#35456 | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | vllm-project#35510 | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary><b>#1 - llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary><b>#2 - Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.

</details>

<details>
<summary><b>#3 - Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block.

</details>

<details>
<summary><b>#4 - GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.

</details>

<details>
<summary><b>#5 - Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).

</details>

<details>
<summary><b>#6 - AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream.

</details>

<details>
<summary><b>#7 - response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to `test_completion_error.py`

</details>

<details>
<summary><b>#8 - response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #6. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch)

</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm

[INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
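
The int32 overflow that the fused MoE cherry-pick addresses can be illustrated without a GPU. The sketch below emulates 32-bit wraparound in plain Python; the helper names and the example stride and token index are hypothetical, chosen only so that the product exceeds the int32 range.

```python
INT32_MAX = 2**31 - 1


def wrap_int32(value: int) -> int:
    """Interpret an integer as a two's-complement 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value > INT32_MAX else value


def flat_offset_int32(token_index: int, stride: int) -> int:
    """Offset as computed when both operands and the result are int32."""
    return wrap_int32(token_index * stride)


def flat_offset_int64(token_index: int, stride: int) -> int:
    """Offset after casting the index to a 64-bit integer first."""
    return token_index * stride


if __name__ == "__main__":
    token_index, stride = 300_000, 8192
    print(flat_offset_int64(token_index, stride))  # exceeds INT32_MAX
    print(flat_offset_int32(token_index, stride))  # wraps to a negative offset
```

A negative or wrapped offset points outside the intended buffer, which is why widening the index before the multiplication is enough to fix the issue for large models.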



Purpose
Motivation
xformers#29262), we want to add a Triton MMEncoderAttention backend to give a balanced solution between FA and SDPA for incompatible `head_size`.
Introduction
Test Plan
Whisper:
Embedding models with sliding window (Test should use Triton backend by default now):
Test Result
Whisper:
Encoder-only models: Tests should still pass with Triton backend