[Performance] Extract kv update ops from MLA attention backends#34627

Merged
ProExpertProg merged 12 commits into
vllm-project:mainfrom
neuralmagic:split-kv-attention-mla
Mar 2, 2026
@ElizaWszola ElizaWszola commented Feb 16, 2026

Contributor

Extract the KV cache update from the MLA attention backends, similar to #25954.

This PR adapts some elements of #33658.

lm-eval --model vllm --model_args pretrained=deepseek-ai/DeepSeek-V2-Lite --tasks gsm8k --batch_size auto

this PR:

deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3821|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3798|±  |0.0134|

no deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3821|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3798|±  |0.0134|

ROCm:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3791|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3760|±  |0.0133|

main:

deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3836|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3813|±  |0.0134|

no deepgemm, CUDA:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3813|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3783|±  |0.0134|

ROCm:
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3791|±  |0.0134|
|     |       |strict-match    |     5|exact_match|↑  |0.3760|±  |0.0133|
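
As a rough sanity check on the tables above, the sketch below compares each exact_match value from this PR against main and asks whether the gap is smaller than the combined reported stderr. The values are copied from the tables; the helper name `within_noise` and the choice to treat the two runs as independent are assumptions made for illustration, not a formal significance test.

```python
import math

# (flexible-extract, strict-match) exact_match values copied from the tables.
results = {
    "deepgemm, CUDA": {"pr": (0.3821, 0.3798), "main": (0.3836, 0.3813)},
    "no deepgemm, CUDA": {"pr": (0.3821, 0.3798), "main": (0.3813, 0.3783)},
    "ROCm": {"pr": (0.3791, 0.3760), "main": (0.3791, 0.3760)},
}

# Reported stderr for most rows (the ROCm strict-match rows report 0.0133).
STDERR = 0.0134


def within_noise(pr_value, main_value, stderr=STDERR):
    """Return True if the gap is below the combined stderr of two runs.

    Assumption: both runs are treated as independent with the same stderr,
    so the combined stderr is sqrt(2) * stderr.
    """
    combined = math.sqrt(2) * stderr
    return abs(pr_value - main_value) <= combined


summary = {
    config: all(within_noise(p, m) for p, m in zip(vals["pr"], vals["main"]))
    for config, vals in results.items()
}

for config, ok in summary.items():
    print(f"{config}: {'within noise' if ok else 'differs beyond noise'}")
```

Under these assumptions every configuration falls within noise, which is consistent with the PR being a refactor that should not change model outputs.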

Signed-off-by: ElizaWszola <ewszola@redhat.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request refactors the Multi-head Latent Attention (MLA) backend to extract the KV cache update logic into a separate custom operator, unified_mla_kv_cache_update. The change aims to improve torch.compile compatibility by isolating the side effect of updating the cache. The logic is controlled by a new forward_includes_kv_cache_update flag in the attention backend. The implementation follows existing patterns in the codebase for creating data dependencies for torch.compile. The changes appear solid, but I have identified one potential issue regarding the registration of the new custom operator.
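
The split described in this review can be illustrated with a small pure-Python sketch. It is not vLLM code: `ToyLayer`, `toy_kv_cache_update`, `toy_attention`, and `toy_forward` are hypothetical names, and plain lists stand in for tensors. The idea shown is that the side-effecting cache write lives in its own step, and that step returns a placeholder value which the attention step must receive, so the ordering between the two is expressed as an explicit data dependency.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    """Hypothetical stand-in for an attention layer that owns a KV cache."""
    kv_cache: list = field(default_factory=lambda: [None] * 8)


def toy_kv_cache_update(layer, kv, slot_mapping):
    """Side-effecting step: write each token's KV entry into its cache slot.

    Returns a placeholder value. Passing it to the attention step makes the
    "update before read" ordering an explicit data dependency.
    """
    for entry, slot in zip(kv, slot_mapping):
        layer.kv_cache[slot] = entry
    return object()  # hypothetical dependency token


def toy_attention(layer, query, slot_mapping, kv_cache_dummy_dep):
    """Read-only step: it never writes to the cache."""
    assert kv_cache_dummy_dep is not None, "cache update must run first"
    # Toy "attention": pair each query with the cached entry for its slot.
    return [(q, layer.kv_cache[s]) for q, s in zip(query, slot_mapping)]


def toy_forward(layer, query, kv, slot_mapping):
    """Forward pass with the cache update extracted from attention."""
    dep = toy_kv_cache_update(layer, kv, slot_mapping)
    return toy_attention(layer, query, slot_mapping, dep)


layer = ToyLayer()
out = toy_forward(layer, query=["q0", "q1"], kv=["kv0", "kv1"], slot_mapping=[3, 5])
print(out)
```

In the real backend the update is registered as a custom operator for torch.compile; the sketch only mirrors the control flow, not the registration or the tensor layout.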

Comment thread vllm/model_executor/layers/attention/mla_attention.py
@ElizaWszola ElizaWszola changed the title from "Extract kv update ops from MLA attention backends" to "[Performance] Extract kv update ops from MLA attention backends" Feb 16, 2026
Signed-off-by: ElizaWszola <ewszola@redhat.com>
@mergify mergify Bot added the v1 label Feb 25, 2026
mergify Bot commented Feb 25, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: ElizaWszola <ewszola@redhat.com>
@mergify mergify Bot removed the needs-rebase label Feb 25, 2026
return "FlashAttention MLA not supported on this device"
return None

forward_includes_kv_cache_update: bool = False

Collaborator

I don't think we even need this because we removed it completely from the layer, right?

Contributor Author

Yes, it's cruft, thanks.

Contributor Author

updated

@ProExpertProg ProExpertProg left a comment

Collaborator

Just remove the boolean flag and test on ROCm with AITER if you can!

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Comment thread vllm/model_executor/layers/attention/mla_attention.py
mergify Bot commented Feb 26, 2026

Contributor

Hi @ElizaWszola, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: ElizaWszola <ewszola@redhat.com>

Co-authored-by: Di Wu <dw2761@nyu.edu>
@ElizaWszola ElizaWszola force-pushed the split-kv-attention-mla branch from 5693b2c to b2ceea9 February 26, 2026 14:26
Comment thread vllm/model_executor/layers/attention/mla_attention.py
@ProExpertProg ProExpertProg added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Feb 26, 2026
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Comment thread vllm/model_executor/layers/attention/mla_attention.py Outdated
ProExpertProg and others added 2 commits February 27, 2026 16:12
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
@ElizaWszola

Contributor Author

> Just remove the boolean flag and test on ROCm with AITER if you can!

@ProExpertProg I've updated with ROCm eval numbers

@ProExpertProg ProExpertProg merged commit d9c7730 into vllm-project:main Mar 2, 2026
67 checks passed
itayalroy added a commit to itayalroy/vllm that referenced this pull request Mar 6, 2026
Before vllm-project#34627, MLA only wrote KV inside forward_impl,
after checking attn_metadata is not None.
With vllm-project#34627, we started calling unified_mla_kv_cache_update
unconditionally, so warmup/profile runs could still write
into KV pages.

This breaks the prefix cache after an elastic ep reconfigure, since
it involves a dummy run which can now overwrite KV pages
without invalidating the prefix-cache entries.

Fix it by making unified_mla_kv_cache_update a no-op when
attn_metadata is None (restoring the old logic that skips the
KV cache update on dummy runs).

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
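
The guard described in this commit message can be sketched as follows. This is a toy illustration, not the actual patch: `toy_kv_cache_update` is a hypothetical function, lists stand in for KV pages, and the only behavior taken from the message is that the update becomes a no-op when attn_metadata is None (as on warmup/profile runs).

```python
def toy_kv_cache_update(kv_cache, kv, slot_mapping, attn_metadata):
    """Write KV entries into the cache unless this is a dummy run.

    Assumption for this sketch: attn_metadata is None on warmup/profile
    runs, so returning early leaves already-cached pages untouched.
    Returns True if the cache was written, False if the call was a no-op.
    """
    if attn_metadata is None:
        return False  # dummy run: skip the write entirely
    for entry, slot in zip(kv, slot_mapping):
        kv_cache[slot] = entry
    return True


# Slot 0 holds an entry that a prefix cache may still reference.
cache = ["cached-prefix", None]

# Dummy run: without the guard this would overwrite slot 0.
wrote_dummy = toy_kv_cache_update(cache, ["garbage"], [0], attn_metadata=None)

# Real run: metadata is present, so the write goes through.
wrote_real = toy_kv_cache_update(cache, ["new"], [1], attn_metadata={"num_tokens": 1})

print(wrote_dummy, wrote_real, cache)
```

The early return keeps the prefix-cache entry in slot 0 valid after the dummy run, which is the failure mode the commit message describes.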
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…-project#34627)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Mar 12, 2026
…-project#34627)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…-project#34627)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…-project#34627)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…-project#34627)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026
…-project#34627)

Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Co-authored-by: Di Wu <dw2761@nyu.edu>
Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

3 participants