Fix FlashAttention MLA prefill V unpadding by voipmonitor · Pull Request #42642 · vllm-project/vllm

voipmonitor · 2026-05-14T13:42:16Z

Purpose

Fix a regression in the FlashAttention MLA prefill path introduced when the prefill implementations were split out in #32623.

Before #32623, the FlashAttention MLA helper padded V when the selected FlashAttention implementation did not support different QK/V head dimensions. The padded output was kept through the context/suffix merge_attn_states path, and only then sliced back to v_head_dim in MLACommonImpl.forward_mha.
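To make the padding step concrete, here is a minimal pure-Python sketch. Plain lists stand in for tensors, and `pad_last_dim` and `attention` are illustrative helpers, not vLLM functions. It assumes V is zero-padded on the right up to the QK head dimension, so the extra output columns come out as zeros and slicing to v_head_dim recovers the unpadded result.

```python
import math

def pad_last_dim(rows, target_dim):
    """Right-pad each row with zeros so its length equals target_dim."""
    return [row + [0.0] * (target_dim - len(row)) for row in rows]

def attention(q, k, v):
    """Naive single-head softmax attention over lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * vj[c] for w, vj in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out

# Toy sizes chosen for illustration only, not the real model dimensions.
qk_head_dim, v_head_dim = 4, 2
q = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2]]
k = [[0.3, 0.1, -0.2, 0.0], [0.0, 0.4, 0.1, -0.3], [0.2, 0.2, 0.2, 0.2]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Attention over zero-padded V yields an output with qk_head_dim columns.
padded_out = attention(q, k, pad_last_dim(v, qk_head_dim))
# Unpadding: keep only the first v_head_dim columns.
unpadded_out = [row[:v_head_dim] for row in padded_out]
# Reference result computed directly on the unpadded V.
reference = attention(q, k, v)
```

The sketch only illustrates the shape contract (padded width in, padded width out, slice at the end). It says nothing about where in the pipeline the slice has to happen, which is the point this PR addresses.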

After #32623, FlashAttnPrefillBackend._flash_attn_varlen_diff_headdims() slices the output back to v_head_dim inside the backend. That changes the tensor shape contract seen by the chunked-context merge path. On a long-context Kimi/DeepSeek-style MLA setup using the FlashAttention prefill backend with requires_v_padding=True, this produced incorrect long-context generations: the model started continuing unrelated prompt padding text instead of answering the user question. Keeping the old late-unpad behavior restores the expected output.

This PR moves the unpad back to the caller:

FlashAttnPrefillBackend now returns the same padded output shape that the old in-file FlashAttention MLA prefill helper returned.
MLACommonImpl.forward_mha slices context_output, suffix_output, and no-context output_prefill back to v_head_dim immediately before writing/merging into the final output buffer.
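The caller-side flow described in the two points above can be sketched as follows. This is a simplified pure-Python reference under stated assumptions: `merge_attn_states_ref` and `unpad` are hypothetical helpers, the merge is the usual log-sum-exp weighted combination of two partial attention outputs, and the real vLLM kernel and its signature may differ.

```python
import math

def unpad(rows, v_head_dim):
    """Slice every row back to v_head_dim (a no-op if already that width)."""
    return [row[:v_head_dim] for row in rows]

def merge_attn_states_ref(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs using their log-sum-exp values."""
    merged = []
    for row_a, la, row_b, lb in zip(out_a, lse_a, out_b, lse_b):
        m = max(la, lb)  # shift by the max so the exponentials stay bounded
        wa, wb = math.exp(la - m), math.exp(lb - m)
        merged.append([(wa * a + wb * b) / (wa + wb)
                       for a, b in zip(row_a, row_b)])
    return merged

v_head_dim = 2

# Hypothetical padded outputs, as a backend with V padding would return them.
context_output = [[1.0, 2.0, 0.0, 0.0]]
suffix_output = [[3.0, 4.0, 0.0, 0.0]]
context_lse = [math.log(3.0)]  # the context chunk carries 3x the softmax mass
suffix_lse = [0.0]

# The caller normalizes both tensors to v_head_dim right before merging.
merged = merge_attn_states_ref(
    unpad(context_output, v_head_dim), context_lse,
    unpad(suffix_output, v_head_dim), suffix_lse,
)
```

Keeping the slice in the caller means every backend can hand back whatever width it produces, and the merge always sees two inputs of the same, final width.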

Notes

This only affects the FlashAttention MLA prefill backend when requires_v_padding=True. Backends that natively support different QK/V head dimensions keep returning v_head_dim already, so the added checks are no-ops for them.

I intentionally kept this as a small compatibility fix rather than changing backend selection or DCP behavior.

Test Plan

python3 -m py_compile vllm/model_executor/layers/attention/mla_attention.py vllm/v1/attention/backends/mla/prefill/flash_attn.py
Local long-context Kimi-K2.6 validation on Blackwell, TRITON_MLA, DCP=1, MTP disabled, 128k synthetic context. Before this change the model continued unrelated context padding; with this change it answers the requested Sieve of Eratosthenes prompt again.
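The first test-plan step only checks that the two files byte-compile. For reference, the same check can be scripted with the standard-library `py_compile` module. The sketch below uses throwaway temporary files rather than the vLLM sources, and `syntax_check` is a hypothetical helper name.

```python
import os
import py_compile
import tempfile

def syntax_check(path):
    """Return True if the file byte-compiles, False on a syntax error."""
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.py")
    bad = os.path.join(tmp, "bad.py")
    with open(good, "w") as f:
        f.write("x = [1, 2, 3][:2]\n")
    with open(bad, "w") as f:
        f.write("def broken(:\n")  # deliberately invalid syntax
    results = (syntax_check(good), syntax_check(bad))
```

A compile check like this catches syntax errors only. It does not exercise the attention path, which is why the second test-plan item relies on an end-to-end long-context run.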

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request refactors the unpadding logic for attention outputs by moving it from the backend-specific Flash Attention implementation to the core MLA attention layer. This change ensures that outputs are correctly sliced to the value head dimension when padding is applied. The reviewer recommended simplifying the newly added conditional check in mla_attention.py by removing a redundant attribute check, as the shape comparison is sufficient to identify when unpadding is required.

gemini-code-assist · 2026-05-14T13:44:52Z

+            prefill_backend = prefill_metadata.prefill_backend
+            if (
+                getattr(prefill_backend, "requires_v_padding", False)
+                and context_output.shape[-1] != self.v_head_dim
+            ):


The check for requires_v_padding via getattr is redundant here because the shape check context_output.shape[-1] != self.v_head_dim is sufficient to determine if unpadding is necessary. Simplifying this condition makes the code cleaner and consistent with the logic used in the else block (lines 2332-2333).

Suggested change

-            prefill_backend = prefill_metadata.prefill_backend
-            if (
-                getattr(prefill_backend, "requires_v_padding", False)
-                and context_output.shape[-1] != self.v_head_dim
-            ):
+            if context_output.shape[-1] != self.v_head_dim:

voipmonitor · May 14, 2026

Good point. I updated this to rely on the output shape instead of the backend-specific attribute, and made the context and suffix slices independent so each tensor is normalized to v_head_dim before the merge.

Signed-off-by: Martin Vit <martin@voipmonitor.org>

MatthewBonanni

Thanks for catching this! Just a few small comments

MatthewBonanni · 2026-05-14T14:57:51Z

+            if context_output.shape[-1] != self.v_head_dim:
+                context_output = context_output[..., : self.v_head_dim]
+            if suffix_output.shape[-1] != self.v_head_dim:
+                suffix_output = suffix_output[..., : self.v_head_dim]
+


The if statements aren't necessary because the slice will be a no-op when context_output.shape[-1] == self.v_head_dim.
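A small illustration of why the guard is redundant, using plain Python lists in place of tensors (`unpad_last_dim` is a hypothetical helper): slicing to v_head_dim leaves an already-unpadded row unchanged and trims a padded one, so the unconditional slice covers both cases.

```python
def unpad_last_dim(rows, v_head_dim):
    """Unconditionally slice each row to its first v_head_dim entries."""
    return [row[:v_head_dim] for row in rows]

v_head_dim = 3
already_unpadded = [[1.0, 2.0, 3.0]]
padded = [[1.0, 2.0, 3.0, 0.0, 0.0]]

same = unpad_last_dim(already_unpadded, v_head_dim)  # width already matches
trimmed = unpad_last_dim(padded, v_head_dim)         # padding is dropped
twice = unpad_last_dim(trimmed, v_head_dim)          # slicing again changes nothing
```

For tensors, the equivalent `x[..., :self.v_head_dim]` is basic slicing, so when the last dimension already equals v_head_dim the result has the same shape and values.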

MatthewBonanni · 2026-05-14T14:58:00Z

+            if output_prefill.shape[-1] != self.v_head_dim:
+                output_prefill = output_prefill[..., : self.v_head_dim]


MatthewBonanni · 2026-05-14T15:01:02Z

                )

+            if context_output.shape[-1] != self.v_head_dim:
+                context_output = context_output[..., : self.v_head_dim]


nit: stylistically would prefer context_output[..., :self.v_head_dim] (no space after colon)

ehfd · 2026-06-04T11:57:40Z

@voipmonitor @MatthewBonanni Can we revive this?

ehfd · 2026-06-04T11:59:03Z

I think this is related to #41623 or #42426.

bbartels · 2026-06-12T15:17:21Z

Would be curious about this as well :)

voipmonitor requested review from LucasWilkinson, MatthewBonanni and pavanimajety as code owners May 14, 2026 13:42

claude Bot reviewed May 14, 2026

voipmonitor mentioned this pull request May 14, 2026

[Attention] Abstract the MLA prefill backends and eliminate cuDNN #32623

Merged

5 tasks

mergify Bot added the v1 label May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

voipmonitor added 2 commits May 14, 2026 13:53

Fix FlashAttention MLA prefill V unpadding

21866f5

Signed-off-by: Martin Vit <martin@voipmonitor.org>

Simplify MLA prefill output unpadding check

a27fab7

Signed-off-by: Martin Vit <martin@voipmonitor.org>

voipmonitor force-pushed the codex/upstream-mla-lateunpad-pr branch from fb90f07 to a27fab7 Compare May 14, 2026 13:53

MatthewBonanni reviewed May 14, 2026

voipmonitor wants to merge 2 commits into vllm-project:main from voipmonitor:codex/upstream-mla-lateunpad-pr

4 participants
