Implement `single_query_cached_kv_attention` kernel by WoosukKwon · Pull Request #3 · vllm-project/vllm

WoosukKwon · 2023-03-01T21:40:49Z

This PR adds the single_query_cached_kv_attention kernel.

Supported data types:

half
float

Tested models:

OPT-125M
OPT-350M
OPT-1.3B
OPT-2.7B
OPT-6.7B
OPT-13B

Tested GPUs:

A100

Merge linear layers

* add multiple requests test * fix

Passing alibi_slopes and sliding_window to PagedAttention extension

Added dockerfile with vLLM + openvino

Geon dev

* Remove assertion * adapting jamba vllm to changes after hf release, working on weight loading in modeling file * splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer * weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights * Add mamba from jamba modeling file * Remove slow forward * Modifications to mamba_mixer * Save changes, WIP * Fix cache placement * Debugging * Additions and logging * Jamba with mamba cache handling * Clean up * Another cleanup * Use vllm's RMSNorm instead of JambaRMSNorm, Thier implementation is with fused kernel * Clean up and orginization of the objects to handle the mamba cache * Shorten the code for kv cache mem * Move cache handling inside the Mixer * Add mamba to the wheel requirements * Add mamba to the requirements script * Add mamba_metadata * Add to __init__ __all__ * Revert 2 commits ad1a3db 'Add mamba to the requirements script' 75ed2c8 'Add mamba to the wheel requirements' * Clean up * Naming * Apply whitespace suggestions from code review * pass tie_word_embeddings to PretrainedConfig init * Replace repeat with expand as expand doesn't require more mem * Allocate really small cache if needed , don't use meta * Fix for expanded --------- Co-authored-by: Mor Zusman <morz@ai21.com> Co-authored-by: Erez Schwartz <erezs@ai21.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>

…xfp8 Stable tp for mxfp8

fea: support rfork

…ommit intial_commit: set base vllm (v0.9.0)

…n (yuezhu1) [RFC] Dynamic LoRA slot scaling via vLLM plugin

Fix vllm-project#1: Runtime instruction tensor fill - Compile-time tensor is now a TEMPLATE (opcodes set, dimensions zero) - Runtime code fills M/N/K and A/B/C pointers from function arguments - DAG analysis resolves which params feed each GEMM's A/B/C inputs - Intermediate buffers allocated/freed for inter-action data flow Fix vllm-project#2: bar.sync deadlock - GEMM pipeline emits bar.sync 0 expecting all 640 CTA threads - Only 128 threads (4 consumer warps) run the GEMM handler - Post-process replaces bar.sync 0 with bar.sync 1, 128 (scoped barrier) Fix vllm-project#3: Shared memory collision (verified non-issue) - Static smem (inst_state, mbarrier arrays) is separate from dynamic_smem - GEMM cp.async buffers use dynamic_smem offset 0 — no collision - Added compile-time assertions that GEMM smem fits in dynamic region - Documented Phase 3 TODO for when pages carry inter-op data 314 tests pass (84 ferrite-macros + 218 ferrite-ptx + 12 ferrite-runtime). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Nick Mitchell <nickm@us.ibm.com>

## Summary Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately. **Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743) ## Cherry-picked commits (chronological order) | # | Upstream PR | Jira | Summary | |---|------------|------|---------| | 1 | [vllm-project#30550](vllm-project#30550) | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models | | 2 | [vllm-project#31406](vllm-project#31406) | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend | | 3 | [vllm-project#34243](vllm-project#34243) | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) | | 4 | [vllm-project#34454](vllm-project#34454) | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (content:null with json_object) | | 5 | [vllm-project#34507](vllm-project#34507) | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in stride*offset for large models | | 6 | [vllm-project#35085](vllm-project#35085) | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support | | 7 | [vllm-project#35456](vllm-project#35456) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) | | 8 | [vllm-project#35510](vllm-project#35510) | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint | ## Conflict resolutions <details> <summary>#1 — llama-nemotron-embed / score-template support (vllm-project#30550): Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary>#2 — Triton Attention (vllm-project#31406): Clean cherry-pick, no conflicts</summary> Applied cleanly onto `rhai/0.13.0`. </details> <details> <summary>#3 — Llama-4 attn quant (vllm-project#34243): Clean cherry-pick, no conflicts</summary> Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but the fix targets a self-contained block. </details> <details> <summary>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454): Clean cherry-pick, no conflicts</summary> Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes. </details> <details> <summary>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507): Conflicts in 2 files</summary> **`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix: - `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()` result - `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask` **`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes). </details> <details> <summary>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085): Conflict due to file rename + API change</summary> Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error handling logic (catching RuntimeError with "multicast" in message, logging warning, returning early) is identical to upstream. </details> <details> <summary>vllm-project#7 — response_format validation for completions (vllm-project#35456): Conflict due to file restructuring</summary> Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py` - Manually adding `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py` - Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`) - Test additions from upstream applied cleanly to `test_completion_error.py` </details> <details> <summary>vllm-project#8 — response_format validation for chat completions (vllm-project#35510): Conflict due to file restructuring</summary> Same file restructuring issue as vllm-project#6. Resolved by: - Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py` - Manually adding `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py` - Only accepting the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference non-existent paths in our branch) </details> ## Test plan - [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support - [ ] Verify Llama-4 quantized model loads correctly with int8/fp8 attention quantization - [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content - [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow - [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back - [ ] Verify `response_format: {type: "json_schema"}` without `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions` - [ ] Verify encoder models (e.g. Whisper) work with Triton attention backend on ROCm [INFERENG-4743]: https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4800]: https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-4746]: https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5032]: https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5038]: https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [INFERENG-5106]: https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

…n-files Comprehensive expansion of vLLM framework documentation

WoosukKwon added 14 commits February 28, 2023 12:06

[Bugfix] int -> torch.int

6f5c391

[WIP] Add attention kernel

5726be8

Fix a bug in loading QK

1202c0a

Raise an error for invalid block sizes

b5f3c49

Add FP32 flags

645a49e

Change value cache layout

ad13de6

Add single_query_cached_kv_attention

3b416c0

Add TODO

723c9fc

Move

78e048a

Minor

b7c7a60

Minor

a0fbd6b

Minor

c89e5d3

Add a test for attention ops

7d30bdc

Minor fix in comment

40a2f7b

WoosukKwon merged commit 0deacbc into main Mar 1, 2023

WoosukKwon deleted the attention-kernel branch March 1, 2023 23:11

murongweibo mentioned this pull request Jul 11, 2023

NCCL Error 5: invalid usage #427

Closed

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023

Merge pull request vllm-project#3 from ri938/merge_linear_layers

a3ac858

Merge linear layers

xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023

Send multiple requests to underlying model (vllm-project#3)

e36fc39

* add multiple requests test * fix

shanshanpt mentioned this pull request Nov 17, 2023

Run long conetxt error : CUDA error: an illegal memory access was encountered #1700

Closed

junior-zsy mentioned this pull request Nov 20, 2023

Error with 32k Long Text in chatglm2-6b-32k Model #1725

Closed

hongxiayang referenced this pull request in hongxiayang/vllm Feb 13, 2024

Implement single_query_cached_kv_attention kernel (#3)

98a4895

Qubitium mentioned this pull request Mar 5, 2024

Generation with Prefix-cache are slower than the ones without it ? #3154

Closed

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024

Merge pull request vllm-project#3 from slyalin/window_and_alibi

ce80498

Passing alibi_slopes and sliding_window to PagedAttention extension

luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 20, 2024

Merge pull request vllm-project#3 from ilya-lavrenov/docker-file

354ca31

Added dockerfile with vLLM + openvino

daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024

Merge pull request vllm-project#3 from DeepAuto-AI/geon-dev

52ed876

Geon dev

sfc-gh-hazhang referenced this pull request in sfc-gh-hazhang/vllm May 7, 2024

Merge pull request #3 from Snowflake-Labs/arctic-linting

bfe5011

slwang-ustc mentioned this pull request Nov 27, 2025

[Bug]: RuntimeError "cancelled" when using pipeline parallelism with Qwen3-14B #29085

Closed

1 task

guyueh1 referenced this pull request in guyueh1/vllm Dec 9, 2025

Merge pull request #3 from TomerBN-Nvidia/linear-mxfp8-triton-support

6e4fe42

guyueh1 added a commit to amirkl94/vllm that referenced this pull request Dec 16, 2025

Merge pull request vllm-project#3 from TomerBN-Nvidia/stable-tp-for-m…

93c9f1f

…xfp8 Stable tp for mxfp8

BJWang-ant mentioned this pull request Dec 17, 2025

[Bug]: Qwen3-32B with MTP, run failed. #30766

Open

1 task

Lrcx mentioned this pull request Jan 29, 2026

[Bug]: Crash when using presence_penalty with Qwen3-VL in v0.11.0 #33338

Open

1 task

HervorTao mentioned this pull request Feb 3, 2026

[Bug]: [CPU Backend] AttributeError: '_OpNamespace' '_C_utils' object has no attribute 'init_cpu_threads_env' #33675

Closed

1 task

Isotr0py mentioned this pull request Feb 25, 2026

[Bug]: Large Video Request cause vLLM Progress Core Dump #35285

Open

1 task

voipmonitor mentioned this pull request Mar 4, 2026

fix: tool_choice="required" falls back to tool_parser for non-JSON formats (streaming + non-streaming) #35936

Open

JGSweets mentioned this pull request Mar 9, 2026

[Bug]: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. #28028

Open

1 task

bnellnm mentioned this pull request Mar 10, 2026

[Bugfix] Miscellaneous MoE bug fixes #36678

Open

5 tasks

Copilot AI mentioned this pull request Mar 10, 2026

fix(mooncake): address review comments on HBM leak fix machov/vllm#1

Merged

IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 11, 2026

Merge pull request vllm-project#3 from IWantFight/bf_ant_group

4b32ca9

fea: support rfork

LironKesem mentioned this pull request Mar 12, 2026

[Bug] DGX Spark (sm_121): CUTLASS can_implement() rejects sm_120f binaries #36835

Closed

1 task

lavanyabollepalli mentioned this pull request Mar 12, 2026

[Bug]: GPU failure during repeated model loading when using --enable-prefix-caching with KV transfer (LMCacheConnectorV1) #36852

Open

1 task

mahaocong90 mentioned this pull request Mar 17, 2026

[Bug]: QWEN 3.5-397B-A17B report "RPC call to sample_tokens timed out" #37250

Closed

1 task

watch-Ultra mentioned this pull request Mar 18, 2026

[Bug]:推理时报错，模型关闭了。部署的Qwen3.5-122B-A10B-FP8模型 #37392

Open

1 task

bigshanedogg pushed a commit to bigshanedogg/vllm that referenced this pull request Mar 19, 2026

Merge pull request vllm-project#3 from HYPERSCALE-AI-VISION/initial_c…

831c245

…ommit intial_commit: set base vllm (v0.9.0)

yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 20, 2026

Merge PR vllm-project#3: RFC dynamic LoRA slot scaling via vLLM plugi…

18bd1f0

…n (yuezhu1) [RFC] Dynamic LoRA slot scaling via vLLM plugin

RocketRider mentioned this pull request Mar 21, 2026

Mamba-2 Triton kernels crash with illegal instruction on SM121 (DGX Spark) without CUDA_LAUNCH_BLOCKING=1 #37431

Open

swap-debug mentioned this pull request Mar 23, 2026

[Usage]: Unable to run Qwen3-14B with vLLM (multiple issues) #37907

Open

1 task

Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026

Merge pull request vllm-project#3 from bcacdwk/copilot/update-markdow…

049e1c1

…n-files Comprehensive expansion of vLLM framework documentation

varjoranta mentioned this pull request Apr 4, 2026

[Attention Backend] TurboQuant: 2-bit KV cache compression with 4x capacity #38479

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Implement `single_query_cached_kv_attention` kernel#3

Implement `single_query_cached_kv_attention` kernel#3
WoosukKwon merged 14 commits intomainfrom
attention-kernel

WoosukKwon commented Mar 1, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

WoosukKwon commented Mar 1, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant