[Performance] Split FlashAttn attention and cache update #25954
Signed-off-by: ElizaWszola <ewszola@redhat.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
self._slot_mapping_buffer[num_actual:num_tokens].fill_(PADDING_SLOT_ID)

if num_tokens not in self._cached_slot_mapping_views:
    self._cached_slot_mapping_views[num_tokens] = self._slot_mapping_buffer[
What is the point of self._cached_slot_mapping_views? It appears to be just a mirror of self._slot_mapping_buffer.
It was just a mirror -- I became convinced that identical views were necessary, not just identical memory buffers. You're right, that wasn't necessary. Thanks for catching this. The actual fix was use_eagle_buffer. Removed the cached views in f831c3c
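The point conceded here, that sharing the same memory matters and keeping identical view objects does not, can be illustrated with a standard-library analogue. This is only a sketch: it uses array and memoryview in place of torch tensors, and PADDING_SLOT_ID = -1, the buffer size, and the padded_slot_mapping helper are assumptions made for illustration, not code from this PR.

```python
from array import array

PADDING_SLOT_ID = -1  # assumed sentinel value for this sketch

# Persistent buffer of signed 64-bit slot ids, allocated once.
slot_mapping_buffer = array("q", [0] * 8)
buffer_view = memoryview(slot_mapping_buffer)


def padded_slot_mapping(slots, num_tokens):
    """Copy real slots into the buffer, pad the tail, and return a view."""
    num_actual = len(slots)
    buffer_view[:num_actual] = array("q", slots)
    buffer_view[num_actual:num_tokens] = array(
        "q", [PADDING_SLOT_ID] * (num_tokens - num_actual)
    )
    # A fresh slice is a new view object over the same memory,
    # so there is nothing to gain from caching views per num_tokens.
    return buffer_view[:num_tokens]


first = padded_slot_mapping([5, 9], 4)
snapshot = first.tolist()  # contents right after the first call
second = padded_slot_mapping([7, 2, 4], 4)

# `first` was created before the second call, yet it sees the new contents,
# because both views alias slot_mapping_buffer.
same_contents = first.tolist() == second.tolist()
```

Anything that captured the buffer's memory earlier observes later writes regardless of which view object was handed out, which is why a per-size view cache adds nothing in this toy setting.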
    self.attn_layer_names
    and slot_mappings is not None
    and self.attn_layer_names[0] in slot_mappings
)
can you please add a comment here explaining this?
        raise AssertionError("unreachable") from None


def try_backend_includes_kv_cache(
nit: try_backend_includes_kv_cache -> try_backend_includes_kv_cache_update
return [16, 32, 64]
return [MultipleOf(16)]

forward_includes_kv_cache: bool = False
nit: forward_includes_kv_cache -> forward_includes_kv_cache_update
This PR adds code paths that separate the KV-cache update from the attention forward op, and implements the split for the FlashAttn backend. The separation facilitates future unwrapping.
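As a rough illustration of the split, here is a minimal, framework-free sketch. ToyBackend, ToySplitBackend, and attention_layer are hypothetical names, the dict-based cache and the placeholder forward are stand-ins, and the method signatures are assumptions rather than vLLM's actual API. Only the forward_includes_kv_cache and do_kv_cache_update names are taken from this PR.

```python
class ToyBackend:
    """Stand-in for an attention backend; not vLLM's real class."""

    # True: forward() writes the new K/V into the cache itself (fused path).
    forward_includes_kv_cache: bool = True

    def do_kv_cache_update(self, key, value, kv_cache, slot_mapping):
        # Write each token's (key, value) pair into its assigned cache slot.
        for k, v, slot in zip(key, value, slot_mapping):
            if slot >= 0:  # negative slots mark padding in this sketch
                kv_cache[slot] = (k, v)

    def forward(self, query, key, value, kv_cache, slot_mapping):
        if self.forward_includes_kv_cache:
            self.do_kv_cache_update(key, value, kv_cache, slot_mapping)
        # Placeholder "attention": report the cache size to every query.
        return [len(kv_cache) for _ in query]


class ToySplitBackend(ToyBackend):
    # forward() only reads the cache; the caller must update it first.
    forward_includes_kv_cache = False


def attention_layer(backend, query, key, value, kv_cache, slot_mapping):
    # Run the cache update as its own step when forward() excludes it.
    if not backend.forward_includes_kv_cache:
        backend.do_kv_cache_update(key, value, kv_cache, slot_mapping)
    return backend.forward(query, key, value, kv_cache, slot_mapping)


cache_fused, cache_split = {}, {}
args = (["q0", "q1"], ["k0", "k1"], ["v0", "v1"])
out_fused = attention_layer(ToyBackend(), *args, cache_fused, [3, -1])
out_split = attention_layer(ToySplitBackend(), *args, cache_split, [3, -1])
```

In this toy, both paths end with the same cache contents and the same output; the only difference is whether the update happens inside forward or as a separate call made by the layer.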
E2E tests:
Ran inference on a Blackwell machine with
and both
VLLM_MLA_DISABLE=1 (to test the split) and VLLM_MLA_DISABLE=0 (to check that this PR does not affect backends that don't do the split).
Some lm_eval results (FlashInfer was tested to check how the PR affects non-splitting backends)
Note
Separates KV-cache update from attention forward and plumbs this split across backends, layers, runner, and tests.
- AttentionBackend.forward_includes_kv_cache (default True); sets it to False for FLASH_ATTN and implements do_kv_cache_update, removing KV update from forward.
- unified_kv_cache_update and updates Attention layer to invoke KV update separately when the backend’s forward excludes it.
- ForceAttention enum; updates metadata building to carry per-group slot_mapping; adjusts padding/ubatching and cudagraph warmup/capture to include KV update when split from attention.
- try_backend_includes_kv_cache and conditionally call do_kv_cache_update in attention and tree-attention tests.
- dict.get; minor MLA builder tweak for cudagraph capture.

Written by Cursor Bugbot for commit 7e9f893.
Note
Separates KV-cache update from attention forward and plumbs the split across backends, layer, runner, and tests.
- AttentionBackend.forward_includes_kv_cache; sets FLASH_ATTN to False and implements do_kv_cache_update, removing KV update from its forward
- unified_kv_cache_update custom op; Attention layer invokes it when the backend’s forward excludes KV update; uses unified attention ops for direct and custom-op paths
- ForceAttention enum; carries per-group slot_mapping in metadata; ensures KV update is captured in cudagraphs; adjusts padding/ubatching and warmup logic
- try_backend_includes_kv_cache and conditionally call do_kv_cache_update in attention and tree-attention tests
- dict.get; minor MLA cudagraph builder adjustment

Written by Cursor Bugbot for commit fa5a30a.