Skip to content

Implement single_query_cached_kv_attention kernel#3

Merged
WoosukKwon merged 14 commits intomainfrom
attention-kernel
Mar 1, 2023
Merged

Implement single_query_cached_kv_attention kernel#3
WoosukKwon merged 14 commits intomainfrom
attention-kernel

Conversation

@WoosukKwon
Copy link
Copy Markdown
Collaborator

This PR adds the single_query_cached_kv_attention kernel.

Supported data types:

  • half
  • float

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon merged commit 0deacbc into main Mar 1, 2023
@WoosukKwon WoosukKwon deleted the attention-kernel branch March 1, 2023 23:11
v1nc3nt27 pushed a commit to v1nc3nt27/vllm that referenced this pull request Sep 12, 2023
xiangyuT pushed a commit to xiangyuT/vllm that referenced this pull request Oct 18, 2023
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 12, 2024
Passing alibi_slopes and sliding_window to PagedAttention extension
luo-cheng2021 pushed a commit to luo-cheng2021/vllm that referenced this pull request Mar 20, 2024
daniel-geon-park added a commit to gmlwns2000/vllm-timber that referenced this pull request Apr 15, 2024
mzusman added a commit to mzusman/vllm that referenced this pull request Apr 16, 2024
* Remove assertion

* adapting jamba vllm to changes after hf release, working on weight loading in modeling file

* splitting the JambaDecoderLayer to JambaMambaDecoderLayer and JambaAttentionDecoderLayer

* weight loading from hf checkpoint supposedly works, might be a mixup in the MoE between the gated and non-gated weights

* Add mamba from jamba modeling file

* Remove slow forward

* Modifications to mamba_mixer

* Save changes, WIP

* Fix cache placement

* Debugging

* Additions and logging

* Jamba with mamba cache handling

* Clean up

* Another cleanup

* Use vllm's RMSNorm instead of JambaRMSNorm, Thier implementation is with
fused kernel

* Clean up and orginization of the objects to handle the mamba cache

* Shorten the code for kv cache mem

* Move cache handling inside the Mixer

* Add mamba to the wheel requirements

* Add mamba to the requirements script

* Add mamba_metadata

* Add to __init__ __all__

* Revert 2 commits

ad1a3db 'Add mamba to the requirements script'
75ed2c8 'Add mamba to the wheel requirements'

* Clean up

* Naming

* Apply whitespace suggestions from code review

* pass tie_word_embeddings to PretrainedConfig init

* Replace repeat with expand as expand doesn't require more mem

* Allocate really small cache if needed , don't use meta

* Fix for expanded

---------

Co-authored-by: Mor Zusman <morz@ai21.com>
Co-authored-by: Erez Schwartz <erezs@ai21.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
guyueh1 added a commit to amirkl94/vllm that referenced this pull request Dec 16, 2025
IWantFight pushed a commit to IWantFight/vllm that referenced this pull request Mar 11, 2026
bigshanedogg pushed a commit to bigshanedogg/vllm that referenced this pull request Mar 19, 2026
…ommit

intial_commit: set base vllm (v0.9.0)
yuezhu1 pushed a commit to yuezhu1/vllm that referenced this pull request Mar 20, 2026
…n (yuezhu1)

[RFC] Dynamic LoRA slot scaling via vLLM plugin
starpit added a commit to starpit/vllm that referenced this pull request Mar 22, 2026
Fix vllm-project#1: Runtime instruction tensor fill
- Compile-time tensor is now a TEMPLATE (opcodes set, dimensions zero)
- Runtime code fills M/N/K and A/B/C pointers from function arguments
- DAG analysis resolves which params feed each GEMM's A/B/C inputs
- Intermediate buffers allocated/freed for inter-action data flow

Fix vllm-project#2: bar.sync deadlock
- GEMM pipeline emits bar.sync 0 expecting all 640 CTA threads
- Only 128 threads (4 consumer warps) run the GEMM handler
- Post-process replaces bar.sync 0 with bar.sync 1, 128 (scoped barrier)

Fix vllm-project#3: Shared memory collision (verified non-issue)
- Static smem (inst_state, mbarrier arrays) is separate from dynamic_smem
- GEMM cp.async buffers use dynamic_smem offset 0 — no collision
- Added compile-time assertions that GEMM smem fits in dynamic region
- Documented Phase 3 TODO for when pages carry inter-op data

314 tests pass (84 ferrite-macros + 218 ferrite-ptx + 12 ferrite-runtime).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Nick Mitchell <nickm@us.ibm.com>
khairulkabir1661 referenced this pull request in khairulkabir1661/vllm Mar 26, 2026
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All
fixes are from upstream vLLM `main` and address critical bugs affecting
RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:**
[INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|------------|------|---------|
| 1 | [vllm-project#30550](vllm-project#30550) |
[INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) |
Support using chat template as custom score template for reranking
models |
| 2 | [vllm-project#31406](vllm-project#31406) |
[INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add
encoder-only/cross attention support to Triton Attention backend |
| 3 | [vllm-project#34243](vllm-project#34243) |
[INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix
Llama-4 attn quantization by correctly permuting scales for rope (int8,
fp8) |
| 4 | [vllm-project#34454](vllm-project#34454) |
[INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix
structured output in multi-turn GPT-OSS (content:null with json_object)
|
| 5 | [vllm-project#34507](vllm-project#34507) |
[INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix
fused MoE int32 overflow in stride*offset for large models |
| 6 | [vllm-project#35085](vllm-project#35085) |
[INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) |
Gracefully disable AllReduceFusionPass on GPUs without multicast support
|
| 7 | [vllm-project#35456](vllm-project#35456) |
[INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) |
Replace assert with ValueError for response_format validation
(completions) |
| 8 | [vllm-project#35510](vllm-project#35510) |
[INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add
response_format validation to chat completions endpoint |


## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support
(vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py` but
the fix targets a self-contained block.
</details>

<details>
<summary><b>vllm-project#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no
conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored
imports in `gptoss_reasoning_parser.py`. The fix logic (adding
`eom_token_id` early-exit check in `is_reasoning_end`) was independent
of the import changes.
</details>

<details>
<summary><b>vllm-project#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2
files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30
intermediate upstream commits refactored `fused_moe_kernel` with
conditional `naive_block_assignment` logic that doesn't exist in
`rhai/0.13.0`. Resolved by keeping our simpler code and applying only
the int64 cast fix:
- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to `tl.load()`
result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)`
before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on
`make_dummy_moe_config()` from intermediate refactors. Resolved by
keeping our existing test code (no test changes).
</details>

<details>
<summary><b>vllm-project#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict
due to file rename + API change</summary>

Upstream moved `collective_fusion.py` →
`compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API
from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to
`create_allreduce_fusion_workspace()`. Resolved by applying the
try/except wrapper around our existing
`trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in
`collective_fusion.py`. The error handling logic (catching RuntimeError
with "multicast" in message, logging warning, returning early) is
identical to upstream.
</details>

<details>
<summary><b>vllm-project#7 — response_format validation for completions
(vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and
`chat_completion/protocol.py`. Our branch still has the monolithic
`protocol.py`. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which
doesn't exist in our branch; `ValueError` is already handled as 400 Bad
Request in `serving_engine.py`)
- Test additions from upstream applied cleanly to
`test_completion_error.py`
</details>

<details>
<summary><b>vllm-project#8 — response_format validation for chat completions
(vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as vllm-project#6. Resolved by:
- Removing the non-existent
`vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding `validate_response_format` model_validator to
`ChatCompletionRequest` in our `protocol.py`
- Only accepting the `test_json_schema_response_format_missing_schema`
test from the conflict (discarding ~140 lines of intermediate upstream
tests that reference non-existent paths in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the
backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized model loads correctly with int8/fp8
attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format
returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32
overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast)
gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without
`json_schema` field returns 400 (not 500) for both `/v1/completions` and
`/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with Triton attention
backend on ROCm


[INFERENG-4743]:
https://redhat.atlassian.net/browse/INFERENG-4743?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4800]:
https://redhat.atlassian.net/browse/INFERENG-4800?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-4746]:
https://redhat.atlassian.net/browse/INFERENG-4746?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5032]:
https://redhat.atlassian.net/browse/INFERENG-5032?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[INFERENG-5038]:
https://redhat.atlassian.net/browse/INFERENG-5038?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

[INFERENG-5106]:
https://redhat.atlassian.net/browse/INFERENG-5106?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Damon-Salvetore pushed a commit to Damon-Salvetore/vllm that referenced this pull request Mar 31, 2026
…n-files

Comprehensive expansion of vLLM framework documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant