[Performance] Split FlashAttn attention and cache update#25954

Merged
vllm-bot merged 108 commits into
vllm-project:mainfrom
neuralmagic:split-attention-cache-update
Jan 24, 2026
@ElizaWszola ElizaWszola commented Sep 30, 2025

Contributor

This PR adds code paths that separate the KV cache update from the attention forward op, and implements this split for the FlashAttn backend. The separation facilitates future unwrapping.
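To illustrate the idea, here is a minimal pure-Python sketch, not vLLM code. The function names `kv_cache_update`, `attention_forward`, and `fused_forward` are hypothetical, plain lists stand in for the paged GPU cache tensors, and the negative-slot padding convention is an assumption. It only shows that writing the cache first and then running a read-only attention step gives the same result as doing both in one call.

```python
import math


def kv_cache_update(key_cache, value_cache, keys, values, slot_mapping):
    """Write each new token's key/value into its assigned cache slot."""
    for k, v, slot in zip(keys, values, slot_mapping):
        if slot < 0:  # assumed padding marker: padded tokens are skipped
            continue
        key_cache[slot] = k
        value_cache[slot] = v


def attention_forward(query, key_cache, value_cache, slots):
    """Single-query softmax attention that only reads the cache."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, key_cache[s])) for s in slots
    ]
    top = max(scores)
    weights = [math.exp(x - top) for x in scores]
    total = sum(weights)
    dim = len(value_cache[slots[0]])
    return [
        sum(w * value_cache[s][d] for w, s in zip(weights, slots)) / total
        for d in range(dim)
    ]


def fused_forward(query, keys, values, key_cache, value_cache, slot_mapping, slots):
    """Toy 'unsplit' path: cache write and attention in a single call."""
    kv_cache_update(key_cache, value_cache, keys, values, slot_mapping)
    return attention_forward(query, key_cache, value_cache, slots)


# Two new tokens are stored in slots 2 and 0 of a 4-slot cache.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
slot_mapping = [2, 0]
query = [1.0, 0.0]

# Split path: two separate calls.
key_cache, value_cache = [None] * 4, [None] * 4
kv_cache_update(key_cache, value_cache, keys, values, slot_mapping)
split_out = attention_forward(query, key_cache, value_cache, slot_mapping)

# Fused path: one call on a fresh cache.
fused_out = fused_forward(
    query, keys, values, [None] * 4, [None] * 4, slot_mapping, slot_mapping
)
```

Because the attention step in the split path never mutates the cache, the two steps can be scheduled or captured independently.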

E2E tests:

Ran inference on a Blackwell machine with

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)

and with both VLLM_MLA_DISABLE=1 (to test the split) and VLLM_MLA_DISABLE=0 (to check that this PR does not affect backends that don't do the split).

Some lm_eval results (FlashInfer was also tested to check how the PR affects non-splitting backends):

lm-eval --model vllm --model_args '{"pretrained": "meta-llama/Llama-3.1-8B-Instruct", "speculative_config": {"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}, "max_seq_len": 2048}' --tasks gsm8k --batch_size auto

main (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7020|±  |0.0126|

pr (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7741|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

lm-eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --batch_size auto

main (flash infer):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

pr (flash infer):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7020|±  |0.0126|

main (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

pr (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

Note

Separates KV-cache update from attention forward and plumbs this split across backends, layers, runner, and tests.

  • Introduces AttentionBackend.forward_includes_kv_cache (default True); sets it to False for FLASH_ATTN and implements do_kv_cache_update, removing KV update from forward.
  • Adds new custom op unified_kv_cache_update and updates Attention layer to invoke KV update separately when the backend’s forward excludes it.
  • GPU runner: adds ForceAttention enum; updates metadata building to carry per-group slot_mapping; adjusts padding/ubatching and cudagraph warmup/capture to include KV update when split from attention.
  • Tests: add try_backend_includes_kv_cache and conditionally call do_kv_cache_update in attention and tree-attention tests.
  • Resilience fixes: various Mamba/ShortConv/Qwen paths now safely access per-layer metadata via dict.get; minor MLA builder tweak for cudagraph capture.

Written by Cursor Bugbot for commit 7e9f893.
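The dispatch described in the note can be sketched as follows. This is an illustrative toy, not the vLLM implementation: only the names `forward_includes_kv_cache` and `do_kv_cache_update` come from the summary above, while the classes, the `run_layer` helper, and every signature are assumptions made for the example.

```python
class ToyBackend:
    # Mirrors the default described above: forward also updates the KV cache.
    forward_includes_kv_cache = True

    def __init__(self):
        self.calls = []  # records the order of operations for inspection

    def do_kv_cache_update(self, layer_name):
        self.calls.append(("kv_update", layer_name))

    def forward(self, layer_name):
        if self.forward_includes_kv_cache:
            # Fused path: the cache write happens inside forward.
            self.calls.append(("kv_update", layer_name))
        self.calls.append(("attention", layer_name))


class ToySplitBackend(ToyBackend):
    # A backend that opts into the split: forward no longer writes the cache.
    forward_includes_kv_cache = False


def run_layer(backend, layer_name):
    """Hypothetical layer logic: update the cache separately only if needed."""
    if not backend.forward_includes_kv_cache:
        backend.do_kv_cache_update(layer_name)
    backend.forward(layer_name)
    return backend.calls


fused_calls = run_layer(ToyBackend(), "layer0")
split_calls = run_layer(ToySplitBackend(), "layer0")
```

In both cases the cache is written exactly once before attention runs; the flag only decides which component is responsible for the write.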


Note

Separates KV-cache update from attention forward and plumbs the split across backends, layer, runner, and tests.

  • Introduces AttentionBackend.forward_includes_kv_cache; sets FLASH_ATTN to False and implements do_kv_cache_update, removing KV update from its forward
  • Adds unified_kv_cache_update custom op; Attention layer invokes it when the backend’s forward excludes KV update; uses unified attention ops for direct and custom-op paths
  • GPU runner: adds ForceAttention enum; carries per-group slot_mapping in metadata; ensures KV update is captured in cudagraphs; adjusts padding/ubatching and warmup logic
  • Tests: add try_backend_includes_kv_cache and conditionally call do_kv_cache_update in attention and tree-attention tests
  • Robustness tweaks: Mamba/ShortConv/Qwen layers access per-layer metadata via dict.get; minor MLA cudagraph builder adjustment

Written by Cursor Bugbot for commit fa5a30a.

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Comment thread vllm/attention/layer.py Outdated
mergify Bot commented Oct 3, 2025

Contributor

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Oct 3, 2025
Signed-off-by: ElizaWszola <ewszola@redhat.com>
@mergify mergify Bot removed the needs-rebase label Oct 3, 2025
ElizaWszola and others added 5 commits October 3, 2025 14:12
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
…attention

Signed-off-by: ElizaWszola <ewszola@redhat.com>
mergify Bot commented Oct 6, 2025

Contributor

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Oct 6, 2025
…on op backends

Signed-off-by: ElizaWszola <ewszola@redhat.com>
@mergify mergify Bot removed the needs-rebase label Oct 6, 2025
@ElizaWszola ElizaWszola marked this pull request as ready for review October 6, 2025 16:56
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Comment thread vllm/v1/spec_decode/eagle.py Outdated
self._slot_mapping_buffer[num_actual:num_tokens].fill_(PADDING_SLOT_ID)

if num_tokens not in self._cached_slot_mapping_views:
    self._cached_slot_mapping_views[num_tokens] = self._slot_mapping_buffer[

Collaborator

What is the point of self._cached_slot_mapping_views? It appears to be just a mirror of self._slot_mapping_buffer.

Member

It was just a mirror -- I became convinced that identical views were necessary, not just identical memory buffers. You're right, that wasn't necessary. Thanks for catching this. The actual fix was use_eagle_buffer. Removed the cached views in f831c3c
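The point about views versus buffers can be illustrated with a small pure-Python sketch. It is an analogy only: `memoryview` slices over an `array` stand in for tensor slices, the helper `padded_slot_view` is hypothetical, and the value of `PADDING_SLOT_ID` is assumed for the example. Any two slices of the same persistent buffer read the same memory, so keeping a separate dictionary of slices adds nothing.

```python
from array import array

PADDING_SLOT_ID = -1  # assumed value, for illustration only

# One persistent buffer, allocated once and reused across steps.
slot_buffer = array("q", [0] * 8)
slot_view = memoryview(slot_buffer)


def padded_slot_view(slots, num_tokens):
    """Copy real slots in, pad the tail, and return a slice of the buffer."""
    num_actual = len(slots)
    for i, slot in enumerate(slots):
        slot_view[i] = slot
    for i in range(num_actual, num_tokens):
        slot_view[i] = PADDING_SLOT_ID
    return slot_view[:num_tokens]  # a view: no data is copied


view_a = padded_slot_view([5, 6, 7], num_tokens=4)
view_b = slot_view[:4]  # a second, independently created view

# A later write to the buffer is visible through both views.
slot_buffer[0] = 9
contents_a = view_a.tolist()
contents_b = view_b.tolist()
```

Both views observe the updated buffer, which is why a fresh slice is as good as a cached one here.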

Comment thread vllm/v1/spec_decode/eagle.py Outdated
self.attn_layer_names
and slot_mappings is not None
and self.attn_layer_names[0] in slot_mappings
)

Collaborator

Can you please add a comment here explaining this?

Member

Done in de1e5dc

Comment thread tests/v1/attention/utils.py Outdated
raise AssertionError("unreachable") from None


def try_backend_includes_kv_cache(

Collaborator

nit: try_backend_includes_kv_cache -> try_backend_includes_kv_cache_update

Member

Done in 238ae7e

return [16, 32, 64]
return [MultipleOf(16)]

forward_includes_kv_cache: bool = False

Collaborator

nit: forward_includes_kv_cache -> forward_includes_kv_cache_update

Member

Done in 238ae7e

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
mergify Bot commented Jan 23, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@Rohan138 Rohan138 mentioned this pull request Jan 30, 2026
5 tasks
Labels

documentation, kv-connector, nvidia, qwen, ready, ready-run-all-tests, speculative-decoding, v1

Projects

Status: Done
