
Use FlashAttention for multi_query_kv_attention #4

Merged
WoosukKwon merged 8 commits into main from flash-attn on Mar 2, 2023
Conversation

@WoosukKwon (Collaborator) commented on Mar 2, 2023

This PR uses FlashAttention kernels for multi_query_kv_attention, which performs masked (causal) attention over the prompt inputs.

Pros

  • FlashAttention is fast and memory-efficient.
  • FlashAttention accepts packed 1D inputs, so a single kernel launch handles multiple sequences with variable lengths (see the sketch below).

Cons

  • FlashAttention does not support cached KVs, which are required for iterative (token-by-token) generation.
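For illustration, here is a minimal sketch of the packed, variable-length call pattern this relies on. It assumes the flash-attn package's varlen entry point (`flash_attn_varlen_func` in flash-attn 2.x; the 1.x series used an equivalent `flash_attn_unpadded_func`), a CUDA device, and made-up shapes; it is not the PR's actual code.

```python
# Minimal sketch, not the PR's code: pack two variable-length prompts into one
# 1D token stream and run causal ("masked") attention for both with a single
# varlen FlashAttention call.
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2.x entry point

num_heads, head_dim = 12, 64
seq_lens = [5, 9]                       # two prompts of different lengths
total_tokens = sum(seq_lens)

# All tokens of all sequences are concatenated along a single dimension.
q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# cu_seqlens marks the sequence boundaries in the packed stream: [0, 5, 14].
cu = [0]
for n in seq_lens:
    cu.append(cu[-1] + n)
cu_seqlens = torch.tensor(cu, dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
    causal=True,                        # each prompt attends only to its own prefix
)
print(out.shape)                        # (total_tokens, num_heads, head_dim)
```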

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon merged commit 3e9f991 into main Mar 2, 2023
@WoosukKwon WoosukKwon deleted the flash-attn branch March 2, 2023 05:13
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…o-model-executor

Adapt OpenVINO CPU plugin implementation
mzusman pushed a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
BA-78760: Jamba

* Add support for n concat and splitting

* change naming

* input_metadata is now a list of dicts in order to pass "n"

* Clean up code from unnecessary changes and prints

* Remove kv cache allocation in case of mamba layer

* Take the Mamba layer cache into account in the num-of-blocks calculation

* Delete mamba cache after profile

* Remove prints

* Cleaning

* Use `-` and not `_` for requirements

Approved-by: Tomer Asida
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…ect#4

The magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new 2:4 sparsity config to neuralmagic-vllm, built on the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
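As a rough illustration of what the 2:4 pattern means (plain PyTorch only; this is not the magic_wand SparseTensor API or the new sparsity config), the following sketch prunes a weight matrix so that every group of four consecutive elements keeps only its two largest magnitudes, which is the layout sparse tensor cores accelerate.

```python
# Illustration only: 2:4 semi-structured pruning of a dense weight matrix.
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 pruning works on groups of 4 along the last dim"
    groups = weight.reshape(rows, cols // 4, 4)
    # Zero out the 2 smallest-magnitude values in each group of 4.
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop, False)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)
print((w_sparse == 0).float().mean())   # -> 0.5: exactly two zeros per group of four
```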
danisereb pushed a commit to danisereb/vllm that referenced this pull request Dec 24, 2025
hangy-amd added a commit to hangy-amd/vllm that referenced this pull request Jan 4, 2026
Anri-Lombard added a commit to Anri-Lombard/vllm that referenced this pull request Jan 10, 2026
Addresses RFC vllm-project#32028 Item vllm-project#4. Replaces 4 scattered state variables with a single TransferPhase enum for clearer producer-consumer coordination.

Signed-off-by: Anri Lombard <anri.m.lombard@gmail.com>
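As a rough sketch of the pattern described above (the phase names and the helper class below are hypothetical, not vLLM's actual code; only the TransferPhase name comes from the commit note):

```python
# Illustrative only: one explicit phase enum instead of several scattered booleans.
from enum import Enum, auto

class TransferPhase(Enum):
    IDLE = auto()
    PRODUCING = auto()
    READY = auto()
    CONSUMING = auto()
    DONE = auto()

class TransferState:
    def __init__(self) -> None:
        self.phase = TransferPhase.IDLE

    def start_produce(self) -> None:
        assert self.phase is TransferPhase.IDLE
        self.phase = TransferPhase.PRODUCING

    def mark_ready(self) -> None:
        assert self.phase is TransferPhase.PRODUCING
        self.phase = TransferPhase.READY
```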
@mgoin mgoin mentioned this pull request Feb 18, 2026
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 11, 2026
bigshanedogg pushed a commit to bigshanedogg/vllm that referenced this pull request Mar 19, 2026
…d_hyperclovax

[WIP] 250525 add hyperclovax
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>
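A hedged sketch of the fix shape described in this entry, with a standalone class and made-up token IDs standing in for the real parser (only `eom_token_id` and `is_reasoning_end` come from the text above):

```python
# Illustrative only: early-exit on the end-of-message token, as described above.
class ReasoningParserSketch:
    def __init__(self, eom_token_id: int, final_channel_token_id: int) -> None:
        self.eom_token_id = eom_token_id
        self.final_channel_token_id = final_channel_token_id  # hypothetical fallback signal

    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        if self.eom_token_id in input_ids:   # early-exit check added by the fix
            return True
        return self.final_channel_token_id in input_ids

parser = ReasoningParserSketch(eom_token_id=7, final_channel_token_id=5)
print(parser.is_reasoning_end([1, 2, 7]))    # True
```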

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
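The essence of that int64 cast, shown in plain PyTorch rather than the Triton kernel: an int32 token index multiplied by a large stride silently wraps past 2**31, which is exactly what `offs_token.to(tl.int64)` prevents.

```python
# Illustration only (plain PyTorch, not the fused_moe Triton kernel).
import torch

stride = 300_000                                  # e.g. a large row stride
token_idx = torch.tensor([10_000], dtype=torch.int32)

wrapped = token_idx * stride                      # stays int32 -> overflows
correct = token_idx.to(torch.int64) * stride      # cast first, as in the fix

print(wrapped.item())   # wrapped, wrong (negative here)
print(correct.item())   # 3_000_000_000, the intended offset
```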

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>
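The try/except guard described above, as a hedged sketch (the surrounding class and arguments are placeholders; only the behaviour of catching a RuntimeError that mentions "multicast", logging, and disabling the pass comes from the text):

```python
# Illustrative sketch, not vLLM's collective_fusion.py.
import logging

logger = logging.getLogger(__name__)

class AllReduceFusionPassSketch:
    def __init__(self, create_workspace) -> None:
        self.disabled = False
        try:
            # In the backport this is trtllm_create_ipc_workspace_for_all_reduce_fusion(...).
            self.workspace = create_workspace()
        except RuntimeError as err:
            if "multicast" in str(err).lower():
                logger.warning("No multicast support; disabling AllReduceFusionPass: %s", err)
                self.disabled = True
                return
            raise
```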

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>
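A hedged sketch of the validator shape described above (field names simplified, not vLLM's protocol.py); the point is raising ValueError, which the serving layer already maps to a 400 response:

```python
# Illustrative sketch, not vLLM's CompletionRequest.
from typing import Any, Optional
from pydantic import BaseModel, model_validator

class CompletionRequestSketch(BaseModel):
    prompt: str
    response_format: Optional[dict[str, Any]] = None

    @model_validator(mode="after")
    def validate_response_format(self):
        rf = self.response_format
        if rf and rf.get("type") == "json_schema" and not rf.get("json_schema"):
            raise ValueError(
                "response_format type 'json_schema' requires a 'json_schema' object"
            )
        return self
```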

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#7. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…n-files-check

Fix vLLM framework documentation to match actual source code structure
