[Performance] Extract kv update ops from MLA attention backends by ElizaWszola · Pull Request #34627 · vllm-project/vllm

ElizaWszola · 2026-02-16T15:21:08Z

Extract the KV cache update from the MLA attention backends, similar to #25954.

This PR adapts some elements of #33658.

lm-eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite --tasks gsm8k --batch_size auto

this PR:

deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3821|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3798|±  |0.0134|

no deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3821|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3798|±  |0.0134|

ROCm:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3791|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3760|±  |0.0133|

main:

deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3836|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3813|±  |0.0134|

no deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3813|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3783|±  |0.0134|

ROCm:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3791|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3760|±  |0.0133|
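As a rough sanity check on the tables above, the gap between this PR and main can be compared against the reported standard errors. The snippet below is a minimal sketch, not part of the PR: it copies the values from the tables and treats the two runs as independent, which is a simplifying assumption since both use the same gsm8k test set.

```python
import math

# (PR value, main value, PR stderr, main stderr), copied from the tables above.
results = {
    "deepgemm, CUDA / flexible-extract": (0.3821, 0.3836, 0.0134, 0.0134),
    "deepgemm, CUDA / strict-match": (0.3798, 0.3813, 0.0134, 0.0134),
    "no deepgemm, CUDA / flexible-extract": (0.3821, 0.3813, 0.0134, 0.0134),
    "no deepgemm, CUDA / strict-match": (0.3798, 0.3783, 0.0134, 0.0134),
    "ROCm / flexible-extract": (0.3791, 0.3791, 0.0134, 0.0134),
    "ROCm / strict-match": (0.3760, 0.3760, 0.0133, 0.0133),
}


def z_score(pr: float, main: float, se_pr: float, se_main: float) -> float:
    """Difference in units of the combined standard error.

    Assumes independent runs (a simplification, see the note above).
    """
    return (pr - main) / math.hypot(se_pr, se_main)


z_scores = {name: z_score(*row) for name, row in results.items()}
for name, z in z_scores.items():
    print(f"{name}: z = {z:+.2f}")
```

Under that assumption, every difference is a small fraction of one combined standard error, which is consistent with the refactor leaving accuracy unchanged.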

Signed-off-by: ElizaWszola <ewszola@redhat.com>

gemini-code-assist

Code Review

This pull request refactors the Multi-head Latent Attention (MLA) backend to extract the KV cache update logic into a separate custom operator, unified_mla_kv_cache_update. The change aims to improve torch.compile compatibility by isolating the side effect of updating the cache. The logic is controlled by a new forward_includes_kv_cache_update flag in the attention backend. The implementation follows existing patterns in the codebase for creating data dependencies for torch.compile. The changes appear solid, but I have identified one potential issue regarding the registration of the new custom operator.
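The split described in this review can be illustrated with a framework-free sketch. Everything below is hypothetical: ToyLayer, toy_kv_cache_update, and toy_attention_forward are illustrative names, not vLLM's API, and the returned token only mimics the idea of threading a data dependency so that the read step is ordered after the write step.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    # Hypothetical stand-in for a paged KV cache: slot index -> cached value.
    kv_cache: dict = field(default_factory=dict)


def toy_kv_cache_update(layer: ToyLayer, kv: list, slot_mapping: list) -> int:
    """Side-effecting step: write the new entries into the cache.

    Returns a dummy token so the caller can pass it on to the forward step,
    making the ordering between the two steps explicit.
    """
    for slot, value in zip(slot_mapping, kv):
        layer.kv_cache[slot] = value
    return len(slot_mapping)


def toy_attention_forward(layer: ToyLayer, query_slots: list, dep_token: int) -> list:
    """Read-only step: it never writes to the cache.

    dep_token is unused apart from expressing that this call depends on the
    update having happened first.
    """
    del dep_token
    return [layer.kv_cache[slot] for slot in query_slots]


layer = ToyLayer()
token = toy_kv_cache_update(layer, ["k0", "k1"], [3, 7])
output = toy_attention_forward(layer, [7, 3], token)
```

Keeping the write in its own function means the forward step has no hidden side effects, which is the property the review attributes to the extracted operator.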

mergify · 2026-02-25T17:07:31Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ProExpertProg · 2026-02-25T17:20:51Z

            return "FlashAttention MLA not supported on this device"
        return None

+    forward_includes_kv_cache_update: bool = False


I don't think we even need this because we removed it completely from the layer, right?

Yes, it's cruft, thanks.

ProExpertProg

Just remove the boolean flag and test on ROCm with AITER if you can!

mergify · 2026-02-26T12:20:52Z

Hi @ElizaWszola, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: ElizaWszola <ewszola@redhat.com> Co-authored-by: Di Wu <dw2761@nyu.edu>

Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

ElizaWszola · 2026-03-02T14:27:10Z

> Just remove the boolean flag and test on ROCm with AITER if you can!

@ProExpertProg I've updated with ROCm eval numbers

Before vllm-project#34627, MLA wrote to the KV cache only inside forward_impl, after checking that attn_metadata is not None. With vllm-project#34627, unified_mla_kv_cache_update is called unconditionally, so warmup/profile runs can still write into KV pages. This breaks the prefix cache after an elastic EP reconfigure, because the reconfigure involves a dummy run that can now overwrite KV pages without invalidating the prefix-cache entries. Fix it by making unified_mla_kv_cache_update a no-op when attn_metadata is None, which restores the old logic of skipping the KV cache update on dummy runs.

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
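The guard described in this commit message can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the actual vLLM change: guarded_kv_cache_update and the dict-based cache are hypothetical, and only the "no-op when attn_metadata is None" behavior comes from the text above.

```python
def guarded_kv_cache_update(kv_cache: dict, kv: list, slot_mapping: list,
                            attn_metadata) -> bool:
    """Write kv into kv_cache unless this is a dummy (warmup/profile) run.

    A dummy run is modeled here as attn_metadata being None. In that case
    the function returns without touching the cache, so previously cached
    pages stay valid. Returns True if a write happened.
    """
    if attn_metadata is None:
        return False
    for slot, value in zip(slot_mapping, kv):
        kv_cache[slot] = value
    return True


cache = {0: "prefix-cached"}

# Dummy run: must leave the cached page untouched.
wrote_on_dummy = guarded_kv_cache_update(cache, ["junk"], [0], None)
cache_after_dummy = dict(cache)

# Real run: any non-None metadata object allows the write.
wrote_on_real = guarded_kv_cache_update(cache, ["fresh"], [0], object())
```

Without the early return, the dummy run would replace the page at slot 0 while the prefix-cache bookkeeping still pointed at it, which is the failure mode the commit message describes.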

…-project#34627) Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Di Wu <dw2761@nyu.edu> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

Extract kv update ops from MLA attention backends

2aa6db1

Signed-off-by: ElizaWszola <ewszola@redhat.com>

gemini-code-assist Bot reviewed Feb 16, 2026

Comment thread vllm/model_executor/layers/attention/mla_attention.py

ElizaWszola changed the title ~~Extract kv update ops from MLA attention backends~~ [Performance] Extract kv update ops from MLA attention backends Feb 16, 2026

Rohan138 mentioned this pull request Feb 24, 2026

[ROCm][WIP]: Fused aiter rope kvcache mla #35245

Closed

5 tasks

Do extraction

53b881f

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify Bot added the v1 label Feb 25, 2026

mergify Bot added the needs-rebase label Feb 25, 2026

ElizaWszola marked this pull request as ready for review February 25, 2026 17:12

ElizaWszola requested review from LucasWilkinson, MatthewBonanni, ProExpertProg, WoosukKwon, alexm-redhat, hmellor, houseroad, mgoin, njhill, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256, youkaichao and zhuohan123 as code owners February 25, 2026 17:12

Merge branch 'main' into split-kv-attention-mla

9ac0415

Signed-off-by: ElizaWszola <ewszola@redhat.com>

mergify Bot removed the needs-rebase label Feb 25, 2026

ProExpertProg reviewed Feb 25, 2026

ProExpertProg approved these changes Feb 25, 2026

Remove forward_includes_kv_cache_update

7430b90

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg reviewed Feb 25, 2026

Comment thread vllm/model_executor/layers/attention/mla_attention.py

Missing import

b2ceea9

Signed-off-by: ElizaWszola <ewszola@redhat.com> Co-authored-by: Di Wu <dw2761@nyu.edu>

ElizaWszola force-pushed the split-kv-attention-mla branch from 5693b2c to b2ceea9 Compare February 26, 2026 14:26

ProExpertProg reviewed Feb 26, 2026

Comment thread vllm/model_executor/layers/attention/mla_attention.py

ProExpertProg added the ready label (ONLY add when PR is ready to merge/full CI is needed) Feb 26, 2026

ElizaWszola added 2 commits February 27, 2026 13:34

Move kv cache update call to precede forwards with and without output

ab97840

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Merge branch 'main' into split-kv-attention-mla

6d7f201

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg approved these changes Feb 27, 2026

Comment thread vllm/model_executor/layers/attention/mla_attention.py Outdated

ProExpertProg and others added 2 commits February 27, 2026 16:12

remove default

7407991

Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>

Merge branch 'main' into split-kv-attention-mla

cc28e8a

Signed-off-by: ElizaWszola <ewszola@redhat.com>

ProExpertProg approved these changes Mar 2, 2026

ProExpertProg merged commit d9c7730 into vllm-project:main Mar 2, 2026
67 checks passed

itayalroy mentioned this pull request Mar 6, 2026

mla: don't update kv cache on dummy forwards #36282

Merged

xaguilar-amd mentioned this pull request May 3, 2026

[Performance][MLA] Lift decode Q-prep (q-absorb + cat + FP8 quant) out of forward_impl #41568

Open

ProExpertProg merged 12 commits into vllm-project:main from neuralmagic:split-kv-attention-mla

3 participants
