[Performance] Extract kv update ops from MLA attention backends #34627
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Code Review
This pull request refactors the Multi-head Latent Attention (MLA) backend to extract the KV cache update logic into a separate custom operator, unified_mla_kv_cache_update. The change aims to improve torch.compile compatibility by isolating the side effect of updating the cache. The logic is controlled by a new forward_includes_kv_cache_update flag in the attention backend. The implementation follows existing patterns in the codebase for creating data dependencies for torch.compile. The changes appear solid, but I have identified one potential issue regarding the registration of the new custom operator.
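To make the idea in this review concrete, here is a minimal sketch in plain Python, not actual vLLM or PyTorch code. It assumes a toy cache (a dict keyed by slot index) and uses hypothetical names (ToyLayer, toy_kv_cache_update, toy_attention_forward). The cache write is pulled out of the forward step into its own function, and that function returns a dummy token that the forward step accepts, which is one possible way to express an explicit ordering dependency between the write and the read.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLayer:
    # Hypothetical stand-in for a paged KV cache: slot index -> cached value.
    kv_cache: dict = field(default_factory=dict)


def toy_kv_cache_update(layer, slot_mapping, new_kv):
    """Side-effecting step: write new entries into the cache."""
    for slot, value in zip(slot_mapping, new_kv):
        layer.kv_cache[slot] = value
    # Return a dummy token so the caller can thread an explicit dependency.
    return len(slot_mapping)


def toy_attention_forward(layer, slots_to_read, dep_token):
    """Read-only step: dep_token is accepted only to order it after the update."""
    del dep_token  # not used for computation
    return [layer.kv_cache[s] for s in slots_to_read]


layer = ToyLayer()
token = toy_kv_cache_update(layer, [0, 1], ["kv_a", "kv_b"])
out = toy_attention_forward(layer, [1, 0], token)
```

In this toy version the forward step never mutates the cache, so the only side effect lives in toy_kv_cache_update.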
Signed-off-by: ElizaWszola <ewszola@redhat.com>
This pull request has merge conflicts that must be resolved before it can be
Signed-off-by: ElizaWszola <ewszola@redhat.com>
return "FlashAttention MLA not supported on this device"
return None

forward_includes_kv_cache_update: bool = False
I don't think we even need this because we removed it completely from the layer, right?
yes, it's cruft, thanks
ProExpertProg left a comment
Just remove the boolean flag and test on ROCm with AITER if you can!
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Hi @ElizaWszola, the pre-commit checks have failed. Please run:
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
Signed-off-by: ElizaWszola <ewszola@redhat.com> Co-authored-by: Di Wu <dw2761@nyu.edu>
5693b2c to b2ceea9
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Signed-off-by: ElizaWszola <ewszola@redhat.com>
@ProExpertProg I've updated with ROCm eval numbers
Before vllm-project#34627, MLA only wrote to the KV cache inside forward_impl, after checking that attn_metadata is not None. With vllm-project#34627, unified_mla_kv_cache_update is called unconditionally, so warmup and profile runs can still write into KV pages. This breaks the prefix cache after an elastic EP reconfigure, since that involves a dummy run which can now overwrite KV pages without invalidating the prefix-cache entries. Fix it by making unified_mla_kv_cache_update a no-op when attn_metadata is None, which restores the old logic of skipping the KV cache update on dummy runs.
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
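The guard described in this follow-up fix can be sketched as below. This is a plain-Python illustration under stated assumptions, not the real unified_mla_kv_cache_update: the function name toy_mla_kv_cache_update, the dict-based cache, and the boolean return value are all hypothetical. It only shows the shape of the check, where a missing attn_metadata (as on a dummy, warmup, or profile run) turns the update into a no-op.

```python
def toy_mla_kv_cache_update(kv_cache, slot_mapping, new_kv, attn_metadata):
    """Write new_kv into kv_cache unless this is a dummy run.

    Hypothetical stand-in: attn_metadata being None marks a dummy run,
    mirroring the check the fix describes.
    """
    if attn_metadata is None:
        # Dummy run: leave the cache untouched so cached prefixes stay valid.
        return False
    for slot, value in zip(slot_mapping, new_kv):
        kv_cache[slot] = value
    return True


cache = {0: "cached_prefix"}

# Dummy run targeting slot 0: must not overwrite the cached prefix.
wrote_dummy = toy_mla_kv_cache_update(cache, [0], ["garbage"], attn_metadata=None)
after_dummy = dict(cache)

# Real run with metadata present: the write goes through.
wrote_real = toy_mla_kv_cache_update(
    cache, [1], ["real"], attn_metadata={"num_tokens": 1}
)
```

Without the early return, the dummy run would replace the value in slot 0 while a prefix-cache entry still pointed at it, which is the failure mode the commit message describes.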
…-project#34627) Signed-off-by: ElizaWszola <ewszola@redhat.com> Signed-off-by: Luka Govedič <ProExpertProg@users.noreply.github.com> Co-authored-by: Di Wu <dw2761@nyu.edu> Co-authored-by: Luka Govedič <ProExpertProg@users.noreply.github.com>
Extract the KV cache update from MLA attention backends, similar to #25954.
This PR adapts some elements of #33658.