[Performance] Split FlashAttn attention and cache update by ElizaWszola · Pull Request #25954 · vllm-project/vllm

ElizaWszola · 2025-09-30T13:39:51Z

This PR adds code paths that separate the KV-cache update from the attention forward op, and implements this split for the FlashAttn backend. The separation facilitates future unwrapping.

E2E tests:

Ran inference on a Blackwell machine with

llm = LLM(model="deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)

and with both VLLM_MLA_DISABLE=1 (to test the split) and VLLM_MLA_DISABLE=0 (to check that this PR does not affect backends that don't do the split).

Some lm_eval results (FlashInfer was tested to check how the PR affects non-splitting backends):

lm-eval --model vllm --model_args '{"pretrained": "meta-llama/Llama-3.1-8B-Instruct", "speculative_config": {"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}, "max_seq_len": 2048}' --tasks gsm8k --batch_size auto

main (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7020|±  |0.0126|

pr (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7741|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

lm-eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --tasks gsm8k --batch_size auto

main (flash infer):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

pr (flash infer):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7726|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7020|±  |0.0126|

main (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|

pr (flash attn):
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7013|±  |0.0126|
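As a rough sanity check on the tables above, the sketch below compares each main/PR pair against the reported standard error. Treating the two runs as independent and combining their standard errors in quadrature is an assumption made here for illustration; the PR itself does not describe any statistical test. The labels are only descriptive names for the table pairs in the order they appear.

```python
import math

# (table pair, filter, main value, PR value, reported stderr), copied from the tables above.
RESULTS = [
    ("flash attn, 1st pair", "flexible-extract", 0.7748, 0.7741, 0.0115),
    ("flash attn, 1st pair", "strict-match", 0.7020, 0.7013, 0.0126),
    ("flash infer", "flexible-extract", 0.7726, 0.7726, 0.0115),
    ("flash infer", "strict-match", 0.7013, 0.7020, 0.0126),
    ("flash attn, 2nd pair", "flexible-extract", 0.7748, 0.7748, 0.0115),
    ("flash attn, 2nd pair", "strict-match", 0.7013, 0.7013, 0.0126),
]


def z_score(main: float, pr: float, stderr: float) -> float:
    """Absolute difference in units of the combined standard error.

    Assumes both runs are independent and share the same stderr.
    """
    combined = math.sqrt(2.0) * stderr
    return abs(main - pr) / combined


z_scores = [z_score(main, pr, se) for _, _, main, pr, se in RESULTS]
max_z = max(z_scores)

for (label, flt, main, pr, _), z in zip(RESULTS, z_scores):
    print(f"{label:22s} {flt:17s} main={main:.4f} pr={pr:.4f} z={z:.3f}")
```

Under these assumptions, every main/PR difference is far below one combined standard error, which is consistent with the PR not changing accuracy on gsm8k.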

Note

Separates KV-cache update from attention forward and plumbs this split across backends, layers, runner, and tests.

Introduces AttentionBackend.forward_includes_kv_cache (default True); sets it to False for FLASH_ATTN and implements do_kv_cache_update, removing KV update from forward.
Adds new custom op unified_kv_cache_update and updates Attention layer to invoke KV update separately when the backend’s forward excludes it.
GPU runner: adds ForceAttention enum; updates metadata building to carry per-group slot_mapping; adjusts padding/ubatching and cudagraph warmup/capture to include KV update when split from attention.
Tests: add try_backend_includes_kv_cache and conditionally call do_kv_cache_update in attention and tree-attention tests.
Resilience fixes: various Mamba/ShortConv/Qwen paths now safely access per-layer metadata via dict.get; minor MLA builder tweak for cudagraph capture.

Written by Cursor Bugbot for commit 7e9f893.
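The split described in this note can be illustrated with a toy, framework-free sketch. Only the names `forward_includes_kv_cache` and `do_kv_cache_update` come from the summary; the classes, the dict-based cache, and the stand-in "attention" are hypothetical and are not vLLM's implementation.

```python
class FusedBackend:
    """Toy backend whose forward also writes the KV cache (the default)."""

    forward_includes_kv_cache = True

    def do_kv_cache_update(self, key, value, kv_cache, slot_mapping):
        # Write each token's K/V pair into the slot assigned to it.
        for k, v, slot in zip(key, value, slot_mapping):
            kv_cache[slot] = (k, v)

    def forward(self, query, key, value, kv_cache, slot_mapping):
        if self.forward_includes_kv_cache:
            self.do_kv_cache_update(key, value, kv_cache, slot_mapping)
        # Stand-in for attention: each query "sees" every cached slot.
        return [sorted(kv_cache) for _ in query]


class SplitBackend(FusedBackend):
    """Toy backend whose forward leaves the cache update to the caller."""

    forward_includes_kv_cache = False


def attention_layer(backend, query, key, value, kv_cache, slot_mapping):
    # The layer performs the cache update itself when forward excludes it.
    if not backend.forward_includes_kv_cache:
        backend.do_kv_cache_update(key, value, kv_cache, slot_mapping)
    return backend.forward(query, key, value, kv_cache, slot_mapping)


q, k, v, slots = ["q0", "q1"], ["k0", "k1"], ["v0", "v1"], [3, 7]
fused_cache, split_cache = {}, {}
fused_out = attention_layer(FusedBackend(), q, k, v, fused_cache, slots)
split_out = attention_layer(SplitBackend(), q, k, v, split_cache, slots)
```

In this sketch both paths leave the cache in the same state and produce the same output; the only difference is which component performs the write.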

Note

Separates KV-cache update from attention forward and plumbs the split across backends, layer, runner, and tests.

Introduces AttentionBackend.forward_includes_kv_cache; sets FLASH_ATTN to False and implements do_kv_cache_update, removing KV update from its forward
Adds unified_kv_cache_update custom op; Attention layer invokes it when the backend’s forward excludes KV update; uses unified attention ops for direct and custom-op paths
GPU runner: adds ForceAttention enum; carries per-group slot_mapping in metadata; ensures KV update is captured in cudagraphs; adjusts padding/ubatching and warmup logic
Tests: add try_backend_includes_kv_cache and conditionally call do_kv_cache_update in attention and tree-attention tests
Robustness tweaks: Mamba/ShortConv/Qwen layers access per-layer metadata via dict.get; minor MLA cudagraph builder adjustment

Written by Cursor Bugbot for commit fa5a30a.
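The `dict.get` pattern mentioned in the last item can be sketched as follows. The helper name, layer names, and metadata shape are hypothetical; the point is only that a layer missing from the per-layer metadata mapping yields `None` rather than raising `KeyError`.

```python
from typing import Any, Optional


def get_layer_metadata(
    attn_metadata: Optional[dict[str, Any]], layer_name: str
) -> Optional[Any]:
    """Return the metadata for one layer, or None if it is unavailable.

    Hypothetical helper: tolerates both a missing mapping and a missing key.
    """
    if attn_metadata is None:
        return None
    return attn_metadata.get(layer_name)


# Hypothetical per-layer metadata with only one layer populated.
metadata = {"layers.0.attn": {"num_tokens": 4}}

present = get_layer_metadata(metadata, "layers.0.attn")
missing = get_layer_metadata(metadata, "layers.1.mamba")
no_metadata = get_layer_metadata(None, "layers.0.attn")
```

Callers can then branch on `None` instead of wrapping the lookup in a `try`/`except KeyError`.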


mergify · 2025-10-03T05:41:03Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify · 2025-10-06T12:27:08Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


LucasWilkinson · 2026-01-22T22:30:24Z

+                self._slot_mapping_buffer[num_actual:num_tokens].fill_(PADDING_SLOT_ID)
+
+        if num_tokens not in self._cached_slot_mapping_views:
+            self._cached_slot_mapping_views[num_tokens] = self._slot_mapping_buffer[


What is the point of self._cached_slot_mapping_views? It appears to be just a mirror of self._slot_mapping_buffer.

It was just a mirror. I had become convinced that identical views were necessary, not just identical memory buffers. You're right, that wasn't necessary. Thanks for catching this. The actual fix was use_eagle_buffer. Removed the cached views in f831c3c.
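The point of this exchange, that views sliced from one persistent buffer already share its memory, can be shown with a small stand-in. This sketch uses the standard library's `array` and `memoryview` instead of torch tensors, and the value of `PADDING_SLOT_ID`, the buffer size, and the function name are assumptions made for illustration.

```python
from array import array

PADDING_SLOT_ID = -1  # assumed sentinel for padded positions
MAX_TOKENS = 8        # hypothetical buffer capacity

# One persistent buffer, allocated once and reused for every step.
slot_mapping_buffer = array("q", [0] * MAX_TOKENS)


def padded_slot_mapping(slots, num_tokens):
    """Copy real slots into the buffer, pad the tail, return a view."""
    num_actual = len(slots)
    for i, slot in enumerate(slots):
        slot_mapping_buffer[i] = slot
    for i in range(num_actual, num_tokens):
        slot_mapping_buffer[i] = PADDING_SLOT_ID
    # A fresh view each call; it still aliases the same underlying memory.
    return memoryview(slot_mapping_buffer)[:num_tokens]


first = padded_slot_mapping([5, 6, 7], num_tokens=4)
second = memoryview(slot_mapping_buffer)[:4]  # independently created view

# A later write to the buffer is visible through both views.
slot_mapping_buffer[0] = 9
```

Because both views alias the same buffer, keeping a separate cache of view objects adds nothing in this model, which matches the reviewer's observation.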

LucasWilkinson · 2026-01-22T22:30:29Z

+                self.attn_layer_names
+                and slot_mappings is not None
+                and self.attn_layer_names[0] in slot_mappings
+            )


can you please add a comment here explaining this?

Done in de1e5dc

LucasWilkinson · 2026-01-22T22:32:08Z

        raise AssertionError("unreachable") from None


+def try_backend_includes_kv_cache(


nit: try_backend_includes_kv_cache -> try_backend_includes_kv_cache_update

Done in 238ae7e

LucasWilkinson · 2026-01-22T22:32:28Z

            return [16, 32, 64]
        return [MultipleOf(16)]

+    forward_includes_kv_cache: bool = False


nit: forward_includes_kv_cache -> forward_includes_kv_cache_update

Done in 238ae7e


mergify · 2026-01-23T17:59:12Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


ElizaWszola added 6 commits September 29, 2025 04:34

Split KV-Cache update and Attention op for FlashAttn

4101a38

Signed-off-by: ElizaWszola <ewszola@redhat.com>

cleanup

a38af45

Signed-off-by: ElizaWszola <ewszola@redhat.com>

run update kv cache conditionally inside flash_attn (temp for debugging)

b880e76

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Merge branch 'main' into split-attention-cache-update

a4e84b2

Signed-off-by: ElizaWszola <ewszola@redhat.com>

try calling through a different op

affee1a

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Pass right arguments, remove redundant block

9e3b1fc

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify Bot added the v1 label Sep 30, 2025

ProExpertProg mentioned this pull request Oct 1, 2025

Fuse RoPE and MLA KV-cache write #25774

Merged

ElizaWszola added 2 commits October 2, 2025 07:43

Make the split version run with piecewise cudagraphs

bc1ef6d

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Missing gpu model runner update

25e36c7

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg reviewed Oct 2, 2025


Comment thread vllm/attention/layer.py Outdated

Comment thread vllm/attention/layer.py Outdated

ProExpertProg reviewed Oct 2, 2025


Comment thread vllm/attention/layer.py Outdated

mergify Bot added the needs-rebase label Oct 3, 2025

ElizaWszola added 2 commits October 3, 2025 12:59

Merge branch 'main' into split-attention-cache-update

3a98457

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Cleanup

74a0ef6

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify Bot removed the needs-rebase label Oct 3, 2025

ElizaWszola and others added 5 commits October 3, 2025 14:12

format

df69b52

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Make sure TRTLLM attention is available for test_blackwell_moe

054f5bd

Signed-off-by: mgoin <mgoin64@gmail.com>

Merge branch 'vllm-project:main' into main

1106472

Merge branch 'main' into split-attention-cache-update

33b60ec

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Only force attention when there's a backend with split kv update and …

d3a2781

…attention Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify Bot added the needs-rebase label Oct 6, 2025

ElizaWszola added 2 commits October 6, 2025 12:23

Always force attention in dummy runs only for split kv update-attenti…

726f89f

…on op backends Signed-off-by: ElizaWszola <ewszola@redhat.com>

Merge branch 'main' into split-attention-cache-update

7f7ccaf

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify Bot removed the needs-rebase label Oct 6, 2025

ElizaWszola marked this pull request as ready for review October 6, 2025 16:56

ElizaWszola requested review from mgoin and robertgshaw2-redhat as code owners October 6, 2025 16:56

Fix pre-commit

48fed29

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

LucasWilkinson reviewed Jan 22, 2026


MatthewBonanni added 3 commits January 22, 2026 22:39

Remove cached views

f831c3c

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Add comment

de1e5dc

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

Rename

238ae7e

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

ProExpertProg approved these changes Jan 23, 2026


Merge branch 'main' into split-attention-cache-update

2f93ace

Merge branch 'main' into split-attention-cache-update

5c0247f

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

mgoin approved these changes Jan 24, 2026


This was referenced Jan 24, 2026

[Bug]: Model Runner V2 broken CUDA Graph after kvcache update split(#25954) #33003

Closed

[BugFix] fix model runner v2 error after kvcache update split #33004

Closed

WoosukKwon mentioned this pull request Jan 26, 2026

[Model Runner V2] Fix slot_mapping after #25954 #33046

Merged

NickLucche mentioned this pull request Jan 26, 2026

[Bugfix] Fix Voxtral streaming slot_mapping #33073

Merged

VedantMadane mentioned this pull request Jan 28, 2026

[Refactor] Extract KV-cache update logic for FlashAttentionDiffKV backend #32509

Closed

LucasWilkinson mentioned this pull request Jan 28, 2026

[BugFix] Fix IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1) for encoder models #33278

Closed

shen-shanshan mentioned this pull request Jan 29, 2026

[Main2Main] Upgrade vllm commit to v0.15.0rc0 vllm-project/vllm-ascend#6304

Closed

tomasruizt mentioned this pull request Jan 29, 2026

[Feature][Performance][Speculative Decoding]: Support Full CUDA Graph for the drafter #33341

Open

1 task

zou3519 mentioned this pull request Jan 29, 2026

[BugFix] Fix cold start compilation time #33357

Closed

ProExpertProg mentioned this pull request Jan 30, 2026

[fix][torch.compile] Fix cold-start compilation time increase by adding kv cache update to splitting ops #33441

Merged

Rohan138 mentioned this pull request Jan 30, 2026

[ROCm] AITER fused RoPE+KVCache #33443

Merged

5 tasks

wangxiyuan mentioned this pull request Feb 2, 2026

[Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 vllm-project/vllm-ascend#6470

Merged

This was referenced Feb 13, 2026

Monkey-patch Attention.forward vllm-project/vllm-gaudi#973

Merged

Monkey-patch of Attention.forward vllm-project/vllm-gaudi#975

Merged

This was referenced Feb 25, 2026

[Performance] Extract kv update ops from MLA attention backends #34627

Merged

[Performance] Extract KV cache update op from flashinfer forward #35422

Merged

elvircrn mentioned this pull request Apr 9, 2026

[Bugfix] Fix V1 dummy run writing NaN to KV cache null block #39444

Merged

3 tasks
