
[MTP][Sparse MLA] Take advantage of native MTP support in indexer when possible #36982

Merged

MatthewBonanni merged 3 commits into vllm-project:main from MatthewBonanni:sparse_mla_mtp_1 on Mar 16, 2026
Conversation

MatthewBonanni (Collaborator) commented Mar 13, 2026

Purpose

PR #34552 added support for MTP > 1 with sparse MLA by unconditionally flattening all requests into single-token decodes. The indexer kernel does, however, natively support MTP = 1, and future iterations of the kernel will support other token counts (e.g., MTP = 3). This PR takes advantage of that support by flattening only when the specified num_speculative_tokens is not natively supported by the kernel.

It does this by enabling require_uniform when taking this pathway. When very short prefills (< 1 + num_speculative_tokens tokens) are present in the batch, those requests and all requests below them (including decodes) will be treated as prefills. This is suboptimal but expected to be very rare. Addressing it would require reordering the batch into [speculative decodes, short prefills, all other prefills] rather than just [decodes, prefills], which would add complexity.
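The dispatch decision above can be sketched as follows. This is an illustrative sketch only: the function and constant names (plan_decode_path, SUPPORTED_NEXT_N) are assumptions for exposition, not vLLM's actual identifiers, and the supported set reflects the kernel as described in this PR (native MTP = 1, i.e., two query tokens per request).

```python
# Token counts (1 + num_speculative_tokens) the indexer kernel is assumed
# to handle natively: plain decode (1) and MTP = 1 (2).
SUPPORTED_NEXT_N = {1, 2}

def plan_decode_path(num_speculative_tokens: int) -> tuple[int, bool]:
    """Return (next_n, flatten) for the indexer decode path.

    next_n is the per-request query token count handed to the kernel;
    flatten=True means requests are split into single-token decodes.
    """
    next_n = 1 + num_speculative_tokens
    if next_n in SUPPORTED_NEXT_N:
        # Native path: keep each request's speculative tokens grouped.
        return next_n, False
    # Fallback: flatten every request into single-token decodes,
    # as PR #34552 did unconditionally.
    return 1, True
```

With num_speculative_tokens = 1 this keeps the batch grouped (next_n = 2); an unsupported count such as 3 falls back to the old flattening behavior.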

Test Plan

vllm serve deepseek-ai/DeepSeek-V3.2 \
    -tp 8 -ep \
    --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
    --no-enable-prefix-caching

with

vllm bench serve \
    --dataset-name spec_bench \
    --dataset-path question.jsonl \
    --spec-bench-output-len 1024 \
    --seed 42 \
    --ignore-eos \
    --temperature 0 \
    --tokenizer deepseek-ai/DeepSeek-V3

Test Result

Main

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  96.72     
Total input tokens:                      310052    
Total generated tokens:                  1024000   
Request throughput (req/s):              10.34     
Output token throughput (tok/s):         10587.74  
Peak output token throughput (tok/s):    9616.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          13793.54  
---------------Time to First Token----------------
Mean TTFT (ms):                          7737.62   
Median TTFT (ms):                        7051.60   
P99 TTFT (ms):                           15349.34  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.98     
Median TPOT (ms):                        72.92     
P99 TPOT (ms):                           81.29     
---------------Inter-token Latency----------------
Mean ITL (ms):                           139.33    
Median ITL (ms):                         113.82    
P99 ITL (ms):                            368.19    
---------------Speculative Decoding---------------
Acceptance rate (%):                     91.04     
Acceptance length:                       1.91      
Drafts:                                  535608    
Draft tokens:                            535608    
Accepted tokens:                         487602    
Per-position acceptance (%):
  Position 0:                            91.04     
==================================================

PR

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Benchmark duration (s):                  95.90     
Total input tokens:                      310052    
Total generated tokens:                  1024000   
Request throughput (req/s):              10.43     
Output token throughput (tok/s):         10678.25  
Peak output token throughput (tok/s):    9786.00   
Peak concurrent requests:                1000.00   
Total token throughput (tok/s):          13911.46  
---------------Time to First Token----------------
Mean TTFT (ms):                          7779.73   
Median TTFT (ms):                        7314.22   
P99 TTFT (ms):                           15318.20  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.71     
Median TPOT (ms):                        72.83     
P99 TPOT (ms):                           80.71     
---------------Inter-token Latency----------------
Mean ITL (ms):                           138.74    
Median ITL (ms):                         113.15    
P99 ITL (ms):                            366.24    
---------------Speculative Decoding---------------
Acceptance rate (%):                     90.96     
Acceptance length:                       1.91      
Drafts:                                  535831    
Draft tokens:                            535831    
Accepted tokens:                         487380    
Per-position acceptance (%):
  Position 0:                            90.96     
==================================================


Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify mergify bot added the v1 label Mar 13, 2026
gemini-code-assist bot (Contributor) left a comment:

This pull request enhances sparse MLA by leveraging native MTP support in the indexer when available, falling back to flattening otherwise. The changes are well-structured and the logic for determining when to use the native path versus flattening is sound. I have one suggestion to improve a comment for better clarity and to avoid potential confusion for future developers.

  int rowStart = 0;
  int seq_len = seqLens[rowIdx / next_n];
- int rowEnd = seq_len - next_n + (rowIdx % next_n) + 1;
+ int rowEnd = max(0, seq_len - next_n + (rowIdx % next_n) + 1);
A Collaborator commented:

What's this for?

MatthewBonanni (Collaborator, Author) replied Mar 13, 2026:
CUDA graph capture pads the batch with requests whose seq_len == 0. We could alternatively clamp seq_lens to a minimum of next_n on the Python side, but then the kernel would do dummy work.
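The need for the max(0, ...) clamp can be checked numerically: for a CUDA-graph padding row with seq_len == 0, the unclamped expression goes negative. The sketch below mirrors the kernel's arithmetic in Python; the specific values are illustrative, not from the PR.

```python
def row_end(seq_len: int, row_idx: int, next_n: int, clamp: bool) -> int:
    # Mirrors the kernel expression: the visible context length for the
    # (row_idx % next_n)-th query token of a request.
    end = seq_len - next_n + (row_idx % next_n) + 1
    return max(0, end) if clamp else end

# Normal decode row: seq_len=100, next_n=2, first of two query tokens.
row_end(100, 0, 2, clamp=True)   # 99
# Padding row: seq_len == 0 yields a negative bound without the clamp...
row_end(0, 0, 2, clamp=False)    # -1
# ...and the clamp turns it into an empty row that does no work.
row_end(0, 0, 2, clamp=True)     # 0
```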

LucasWilkinson (Collaborator) left a comment:
LGTM; thanks for following up on this!

@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 16, 2026
@MatthewBonanni MatthewBonanni merged commit c88ea83 into vllm-project:main Mar 16, 2026
120 of 121 checks passed
@MatthewBonanni MatthewBonanni deleted the sparse_mla_mtp_1 branch March 16, 2026 17:51
Lucaskabela pushed a commit to Lucaskabela/vllm that referenced this pull request Mar 17, 2026
…n possible (vllm-project#36982)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

andylolu2, wendyliu235, fxdawnn, khairulkabir1661, Monishver11, JiantaoXu, vrdn-23, EricccYang, and liuchenbing2026 pushed the same commit to their forks between Mar 18 and Apr 4, 2026.

Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

3 participants