
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible #36982

Merged

MatthewBonanni merged 3 commits into vllm-project:main from MatthewBonanni:sparse_mla_mtp_1 on Mar 16, 2026
Conversation

MatthewBonanni (Collaborator) commented Mar 13, 2026

Purpose

PR #34552 added support for MTP > 1 with sparse MLA by unconditionally flattening all requests into single-token decodes. The indexer kernel does, however, natively support MTP = 1, and future iterations of the kernel will support other token counts (e.g., MTP = 3). This PR takes advantage of that support by flattening only when the specified num_speculative_tokens is not natively supported by the kernel.

It does this by enabling require_uniform when taking this pathway. When very short prefills (< 1 + num_speculative_tokens tokens) are present in the batch, those requests and all requests below them (including decodes) will be treated as prefills. This is suboptimal but expected to be very rare. Addressing it would require reordering the batch into [speculative decodes, short prefills, all other prefills] rather than just [decodes, prefills], which would add complexity.
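The dispatch decision above can be sketched as follows. This is an illustrative sketch only: the function and constant names (plan_decode_path, SUPPORTED_NEXT_N) are assumptions for exposition, not vLLM's actual identifiers, and the supported set reflects the kernel as described in this PR (native MTP = 1, i.e., two query tokens per request).

```python
# Token counts (1 + num_speculative_tokens) the indexer kernel is assumed
# to handle natively: plain decode (1) and MTP = 1 (2).
SUPPORTED_NEXT_N = {1, 2}

def plan_decode_path(num_speculative_tokens: int) -> tuple[int, bool]:
    """Return (next_n, flatten) for the indexer decode path.

    next_n is the per-request query token count handed to the kernel;
    flatten=True means requests are split into single-token decodes.
    """
    next_n = 1 + num_speculative_tokens
    if next_n in SUPPORTED_NEXT_N:
        # Native path: keep each request's speculative tokens grouped.
        return next_n, False
    # Fallback: flatten every request into single-token decodes,
    # as PR #34552 did unconditionally.
    return 1, True
```

With num_speculative_tokens = 1 this keeps the batch grouped (next_n = 2); an unsupported count such as 3 falls back to the old flattening behavior.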

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --no-enable-prefix-caching

with

vllm bench serve \
    --dataset-name spec_bench \
    --dataset-path question.jsonl \
    --spec-bench-output-len 1024 \
    --seed 42 \
    --ignore-eos \
    --temperature 0 \
    --tokenizer deepseek-ai/DeepSeek-V3

Test Result

Main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  96.72     
Total input tokens:                      310052    
Total generated tokens:                  1024000   
Request throughput (req/s):              10.34     
Output token throughput (tok/s):         10587.74  
Peak output token throughput (tok/s):    9616.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          13793.54  
---------------Time to First Token----------------
Mean TTFT (ms):                          7737.62   
Median TTFT (ms):                        7051.60   
P99 TTFT (ms):                           15349.34  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.98     
Median TPOT (ms):                        72.92     
P99 TPOT (ms):                           81.29     
---------------Inter-token Latency----------------
Mean ITL (ms):                           139.33    
Median ITL (ms):                         113.82    
P99 ITL (ms):                            368.19    
---------------Speculative Decoding---------------
Acceptance rate (%):                     91.04     
Acceptance length:                       1.91      
Drafts:                                  535608    
Draft tokens:                            535608    
Accepted tokens:                         487602    
Per-position acceptance (%):
  Position 0:                            91.04     
==================================================

PR

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  95.90     
Total input tokens:                      310052    
Total generated tokens:                  1024000   
Request throughput (req/s):              10.43     
Output token throughput (tok/s):         10678.25  
Peak output token throughput (tok/s):    9786.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          13911.46  
---------------Time to First Token----------------
Mean TTFT (ms):                          7779.73   
Median TTFT (ms):                        7314.22   
P99 TTFT (ms):                           15318.20  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.71     
Median TPOT (ms):                        72.83     
P99 TPOT (ms):                           80.71     
---------------Inter-token Latency----------------
Mean ITL (ms):                           138.74    
Median ITL (ms):                         113.15    
P99 ITL (ms):                            366.24    
---------------Speculative Decoding---------------
Acceptance rate (%):                     90.96     
Acceptance length:                       1.91      
Drafts:                                  535831    
Draft tokens:                            535831    
Accepted tokens:                         487380    
Per-position acceptance (%):
  Position 0:                            90.96     
==================================================


Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify mergify bot added the v1 label Mar 13, 2026
gemini-code-assist bot (Contributor) left a comment:

This pull request enhances sparse MLA by leveraging native MTP support in the indexer when available, falling back to flattening otherwise. The changes are well-structured and the logic for determining when to use the native path versus flattening is sound. I have one suggestion to improve a comment for better clarity and to avoid potential confusion for future developers.

  int rowStart = 0;
  int seq_len = seqLens[rowIdx / next_n];
- int rowEnd = seq_len - next_n + (rowIdx % next_n) + 1;
+ int rowEnd = max(0, seq_len - next_n + (rowIdx % next_n) + 1);
A Collaborator commented:

What's this for?

MatthewBonanni (Collaborator, Author) replied Mar 13, 2026:
CUDA graph capture pads the batch with requests whose seq_len == 0. We could alternatively clamp seq_lens to a minimum of next_n on the Python side, but then the kernel would do dummy work.
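The need for the max(0, ...) clamp can be checked numerically: for a CUDA-graph padding row with seq_len == 0, the unclamped expression goes negative. The sketch below mirrors the kernel's arithmetic in Python; the specific values are illustrative, not from the PR.

```python
def row_end(seq_len: int, row_idx: int, next_n: int, clamp: bool) -> int:
    # Mirrors the kernel expression: the visible context length for the
    # (row_idx % next_n)-th query token of a request.
    end = seq_len - next_n + (row_idx % next_n) + 1
    return max(0, end) if clamp else end

# Normal decode row: seq_len=100, next_n=2, first of two query tokens.
row_end(100, 0, 2, clamp=True)   # 99
# Padding row: seq_len == 0 yields a negative bound without the clamp...
row_end(0, 0, 2, clamp=False)    # -1
# ...and the clamp turns it into an empty row that does no work.
row_end(0, 0, 2, clamp=True)     # 0
```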

LucasWilkinson (Collaborator) left a comment:
LGTM; thanks for following up on this!

@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 16, 2026
@MatthewBonanni MatthewBonanni merged commit c88ea83 into vllm-project:main Mar 16, 2026
120 of 121 checks passed
@MatthewBonanni MatthewBonanni deleted the sparse_mla_mtp_1 branch March 16, 2026 17:51
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…n possible (vllm-project#36982)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

andylolu2, wendyliu235, fxdawnn, khairulkabir1661, Monishver11, JiantaoXu, vrdn-23, EricccYang, and liuchenbing2026 pushed the same commit to their forks between Mar 18 and Apr 4, 2026.

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

3 participants