
Use FlashAttention for multi_query_kv_attention #4

Merged
WoosukKwon merged 8 commits into main from flash-attn on Mar 2, 2023
Conversation

@WoosukKwon (Collaborator) commented on Mar 2, 2023

This PR uses FlashAttention kernels for multi_query_kv_attention, which performs masked (causal) attention over the prompt inputs.

Pros

  • FlashAttention is fast and memory-efficient.
  • FlashAttention accepts packed 1D inputs, so a single kernel launch handles multiple sequences with variable lengths (see the sketch below).

Cons

  • FlashAttention does not support cached KVs, which are required for iterative (token-by-token) generation.
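For illustration, here is a minimal sketch of the packed, variable-length call pattern this relies on. It assumes the flash-attn package's varlen entry point (`flash_attn_varlen_func` in flash-attn 2.x; the 1.x series used an equivalent `flash_attn_unpadded_func`), a CUDA device, and made-up shapes; it is not the PR's actual code.

```python
# Minimal sketch, not the PR's code: pack two variable-length prompts into one
# 1D token stream and run causal ("masked") attention for both with a single
# varlen FlashAttention call.
import torch
from flash_attn import flash_attn_varlen_func  # flash-attn 2.x entry point

num_heads, head_dim = 12, 64
seq_lens = [5, 9]                       # two prompts of different lengths
total_tokens = sum(seq_lens)

# All tokens of all sequences are concatenated along a single dimension.
q = torch.randn(total_tokens, num_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# cu_seqlens marks the sequence boundaries in the packed stream: [0, 5, 14].
cu = [0]
for n in seq_lens:
    cu.append(cu[-1] + n)
cu_seqlens = torch.tensor(cu, dtype=torch.int32, device="cuda")

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seq_lens), max_seqlen_k=max(seq_lens),
    causal=True,                        # each prompt attends only to its own prefix
)
print(out.shape)                        # (total_tokens, num_heads, head_dim)
```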

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon merged commit 3e9f991 into main Mar 2, 2023
@WoosukKwon WoosukKwon deleted the flash-attn branch March 2, 2023 05:13
xiangyuT added a commit to xiangyuT/vllm that referenced this pull request Oct 24, 2023
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 25, 2024
…o-model-executor

Adapt OpenVINO CPU plugin implementation
mzusman pushed a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
BA-78760: Jamba

* Add support for n concat and splitting

* change naming

* input_metadata is now a list of dicts in order to pass "n"

* Clean up code from unnecessary changes and prints

* Remove kv cache allocation in case of mamba layer

* Take the Mamba layer cache into account in the num-of-blocks calculation

* Delete mamba cache after profile

* Remove prints

* Cleaning

* Use `-` and not `_` for requirements

Approved-by: Tomer Asida
linxihui added a commit to linxihui/vllm that referenced this pull request May 14, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
…ect#4

The magic_wand semi_structured_sparse_tensor_linear branch integrates 2:4 semi-structured sparsity into SparseTensor. This PR adds a new 2:4 sparsity config to neuralmagic-vllm, built on the SparseTensor 2:4 support.

This PR also refactors the sparse linear method into a separate file, vllm/model_executor/layers/sparsity/sparse_w16a16_linear_method.py, which supports all sparsity formats.
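As a rough illustration of what the 2:4 pattern means (plain PyTorch only; this is not the magic_wand SparseTensor API or the new sparsity config), the following sketch prunes a weight matrix so that every group of four consecutive elements keeps only its two largest magnitudes, which is the layout sparse tensor cores accelerate.

```python
# Illustration only: 2:4 semi-structured pruning of a dense weight matrix.
import torch

def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
    rows, cols = weight.shape
    assert cols % 4 == 0, "2:4 pruning works on groups of 4 along the last dim"
    groups = weight.reshape(rows, cols // 4, 4)
    # Zero out the 2 smallest-magnitude values in each group of 4.
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop, False)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)
print((w_sparse == 0).float().mean())   # -> 0.5: exactly two zeros per group of four
```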
danisereb pushed a commit to danisereb/vllm that referenced this pull request Dec 24, 2025
hangy-amd added a commit to hangy-amd/vllm that referenced this pull request Jan 4, 2026
Anri-Lombard added a commit to Anri-Lombard/vllm that referenced this pull request Jan 10, 2026
Addresses RFC vllm-project#32028 Item vllm-project#4. Replaces 4 scattered state variables with a single TransferPhase enum for clearer producer-consumer coordination.

Signed-off-by: Anri Lombard <anri.m.lombard@gmail.com>
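As a rough sketch of the pattern described above (the phase names and the helper class below are hypothetical, not vLLM's actual code; only the TransferPhase name comes from the commit note):

```python
# Illustrative only: one explicit phase enum instead of several scattered booleans.
from enum import Enum, auto

class TransferPhase(Enum):
    IDLE = auto()
    PRODUCING = auto()
    READY = auto()
    CONSUMING = auto()
    DONE = auto()

class TransferState:
    def __init__(self) -> None:
        self.phase = TransferPhase.IDLE

    def start_produce(self) -> None:
        assert self.phase is TransferPhase.IDLE
        self.phase = TransferPhase.PRODUCING

    def mark_ready(self) -> None:
        assert self.phase is TransferPhase.PRODUCING
        self.phase = TransferPhase.READY
```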
@mgoin mgoin mentioned this pull request Feb 18, 2026
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 11, 2026
bigshanedogg pushed a commit to bigshanedogg/vllm that referenced this pull request Mar 19, 2026
…d_hyperclovax

[WIP] 250525 add hyperclovax
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) |
| 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>
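A hedged sketch of the fix shape described in this entry, with a standalone class and made-up token IDs standing in for the real parser (only `eom_token_id` and `is_reasoning_end` come from the text above):

```python
# Illustrative only: early-exit on the end-of-message token, as described above.
class ReasoningParserSketch:
    def __init__(self, eom_token_id: int, final_channel_token_id: int) -> None:
        self.eom_token_id = eom_token_id
        self.final_channel_token_id = final_channel_token_id  # hypothetical fallback signal

    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        if self.eom_token_id in input_ids:   # early-exit check added by the fix
            return True
        return self.final_channel_token_id in input_ids

parser = ReasoningParserSketch(eom_token_id=7, final_channel_token_id=5)
print(parser.is_reasoning_end([1, 2, 7]))    # True
```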

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>
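The essence of that int64 cast, shown in plain PyTorch rather than the Triton kernel: an int32 token index multiplied by a large stride silently wraps past 2**31, which is exactly what `offs_token.to(tl.int64)` prevents.

```python
# Illustration only (plain PyTorch, not the fused_moe Triton kernel).
import torch

stride = 300_000                                  # e.g. a large row stride
token_idx = torch.tensor([10_000], dtype=torch.int32)

wrapped = token_idx * stride                      # stays int32 -> overflows
correct = token_idx.to(torch.int64) * stride      # cast first, as in the fix

print(wrapped.item())   # wrapped, wrong (negative here)
print(correct.item())   # 3_000_000_000, the intended offset
```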

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>
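The try/except guard described above, as a hedged sketch (the surrounding class and arguments are placeholders; only the behaviour of catching a RuntimeError that mentions "multicast", logging, and disabling the pass comes from the text):

```python
# Illustrative sketch, not vLLM's collective_fusion.py.
import logging

logger = logging.getLogger(__name__)

class AllReduceFusionPassSketch:
    def __init__(self, create_workspace) -> None:
        self.disabled = False
        try:
            # In the backport this is trtllm_create_ipc_workspace_for_all_reduce_fusion(...).
            self.workspace = create_workspace()
        except RuntimeError as err:
            if "multicast" in str(err).lower():
                logger.warning("No multicast support; disabling AllReduceFusionPass: %s", err)
                self.disabled = True
                return
            raise
```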

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>
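A hedged sketch of the validator shape described above (field names simplified, not vLLM's protocol.py); the point is raising ValueError, which the serving layer already maps to a 400 response:

```python
# Illustrative sketch, not vLLM's CompletionRequest.
from typing import Any, Optional
from pydantic import BaseModel, model_validator

class CompletionRequestSketch(BaseModel):
    prompt: str
    response_format: Optional[dict[str, Any]] = None

    @model_validator(mode="after")
    def validate_response_format(self):
        rf = self.response_format
        if rf and rf.get("type") == "json_schema" and not rf.get("json_schema"):
            raise ValueError(
                "response_format type 'json_schema' requires a 'json_schema' object"
            )
        return self
```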

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#7. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…n-files-check

Fix vLLM framework documentation to match actual source code structure
