[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible #36982
Merged
MatthewBonanni merged 3 commits into vllm-project:main from Mar 16, 2026
Conversation
Contributor
Code Review
This pull request enhances sparse MLA by leveraging native MTP support in the indexer when available, falling back to flattening otherwise. The changes are well-structured and the logic for determining when to use the native path versus flattening is sound. I have one suggestion to improve a comment for better clarity and to avoid potential confusion for future developers.
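The native-path-versus-flattening decision described here can be sketched roughly as follows. This is an illustrative sketch only; the names (`KERNEL_SUPPORTED_NEXT_N`, `plan_indexer_path`) are hypothetical, not vLLM's actual API.

```python
# Token counts (next_n = 1 + num_speculative_tokens) the indexer kernel
# handles natively. Per this PR, MTP = 1 (next_n = 2) is supported today;
# future kernel iterations may add others (e.g. MTP = 3).
KERNEL_SUPPORTED_NEXT_N = {1, 2}

def plan_indexer_path(num_speculative_tokens: int) -> dict:
    """Choose the native indexer path when possible, else fall back to flattening.

    Flattening turns each decode request into next_n single-token decodes;
    the native path keeps requests intact but requires a uniform token
    count across decode requests (require_uniform).
    """
    next_n = 1 + num_speculative_tokens
    if next_n in KERNEL_SUPPORTED_NEXT_N:
        # Native path: indexer sees next_n tokens per request.
        return {"flatten": False, "require_uniform": True, "next_n": next_n}
    # Fallback: flatten into single-token decodes, as PR #34552 always did.
    return {"flatten": True, "require_uniform": False, "next_n": 1}
```

For example, `plan_indexer_path(1)` selects the native path with `next_n = 2`, while `plan_indexer_path(3)` falls back to flattening until the kernel supports that token count.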
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
benchislett reviewed Mar 13, 2026
  int rowStart = 0;
  int seq_len = seqLens[rowIdx / next_n];
- int rowEnd = seq_len - next_n + (rowIdx % next_n) + 1;
+ int rowEnd = max(0, seq_len - next_n + (rowIdx % next_n) + 1);
Collaborator
Author
CUDA graph capture pads the batch with dummy requests that have seq_len == 0, which makes the unclamped end index negative. We could alternatively clamp seq_lens to a minimum of next_n on the Python side, but then the kernel would do dummy work.
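The clamping in the diff above can be illustrated with a small host-side sketch. This is not the actual kernel source; `row_end` and its signature are illustrative only, mirroring the expression in the diff.

```c
/* Illustrative sketch (not the actual kernel code): each row rowIdx in a
 * flattened [num_reqs * next_n] decode batch attends to a causal prefix
 * of its request's sequence. CUDA graph capture pads the batch with dummy
 * requests whose seq_len is 0; without the clamp the end index would go
 * negative, so max(0, ...) turns padded rows into empty (no-op) rows. */
int row_end(int seq_len, int next_n, int row_idx) {
    int end = seq_len - next_n + (row_idx % next_n) + 1;
    return end > 0 ? end : 0;  /* clamp, mirroring max(0, ...) in the diff */
}
```

For a real request with seq_len = 10 and next_n = 2, the two rows end at indices 9 and 10; for a padded dummy request with seq_len = 0, both rows clamp to 0 and the kernel does no work for them.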
LucasWilkinson approved these changes Mar 15, 2026
LucasWilkinson left a comment
LGTM; thanks for following up on this!
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…n possible (vllm-project#36982) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Purpose
PR #34552 added support for MTP > 1 with sparse MLA by unconditionally flattening all requests into single-token decodes. The indexer kernel does support MTP = 1, though, and future iterations of the kernel will support other token counts, e.g. MTP = 3. This PR takes advantage of that support by only flattening when the specified num_speculative_tokens is not natively supported by the kernel. It does this by turning on require_uniform when taking this pathway. When very short prefills (< 1 + num_speculative_tokens) are present in the batch, these requests and all requests after them (including decodes) will be treated as prefills. This is suboptimal but expected to be very rare. Addressing it would require reordering the batch into [speculative decodes, short prefills, all other prefills] rather than just [decodes, prefills], which would add complexity.
Test Plan
with
Test Result
Main
PR