
[Bugfix][Async] Fix async spec decoding with hybrid models#38556

Merged
MatthewBonanni merged 11 commits into vllm-project:main from MatthewBonanni:fix_async_mamba
Mar 31, 2026

Conversation

Collaborator

@MatthewBonanni MatthewBonanni commented Mar 30, 2026

Co-authored by @SandishKumarHN

FIX: #38098

Purpose

Incorporates 2 fixes:

Fix 1

Posted earlier as #38419, incorporated into this PR.

In async mode, seq_lens_cpu is inflated by optimistic draft token placeholders. When prepare_next_token_ids_padded uses this inflated value to call get_token_id(), it reads past the end of the committed tokens and returns -1. Use num_tokens_no_spec - 1 (the actual last committed token position) instead of seq_lens_cpu for computing backup token indices.
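A minimal, self-contained sketch of the indexing change (the `Request` class and token values here are simplified stand-ins for vLLM's internal request state; only the names `get_token_id` and `num_tokens_no_spec` come from the PR description):

```python
# Sketch of the Fix 1 indexing change. Request here is a simplified
# stand-in for vLLM's internal request state; only get_token_id and
# num_tokens_no_spec are names taken from the PR description.

class Request:
    def __init__(self, token_ids):
        self.token_ids = token_ids  # all committed tokens (prompt + output)

    def get_token_id(self, idx):
        # Out-of-range reads return the -1 placeholder, mimicking the
        # optimistic draft-token slots appended by _prepare_inputs.
        if 0 <= idx < len(self.token_ids):
            return self.token_ids[idx]
        return -1

req = Request([101, 7, 42, 99])
num_tokens_no_spec = len(req.token_ids)  # 4: set before the optimistic extend
seq_len = num_tokens_no_spec + 1         # inflated by a draft-token placeholder

buggy = req.get_token_id(seq_len)                 # past the end -> -1
fixed = req.get_token_id(num_tokens_no_spec - 1)  # last committed token -> 99
print(buggy, fixed)  # -1 99
```

With the buggy index, the drafter is fed `-1` as its next input token; with `num_tokens_no_spec - 1` it always reads the last committed token.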

Fix 2

In async mode, condense() copies num_accepted_tokens_cpu values while the GPU→CPU async copy from the previous batch is still in-flight. As a result, stale values are propagated to the reordered indices, corrupting Mamba hidden states. The fix maps num_accepted_tokens through prev_positions when async scheduling is enabled, accounting for the index reordering performed by condense().
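A CPU-only sketch of the index-remapping part of this fix (the `prev_positions` permutation and the slot layouts below are hypothetical illustrations, not vLLM's actual data structures; the real fix additionally has to account for the in-flight GPU→CPU copy):

```python
# Illustration of why num_accepted_tokens must be gathered through the
# permutation produced by condense(). prev_positions and the values are
# hypothetical; the real runner also handles the async GPU->CPU copy.

# Accepted-token counts laid out in the previous batch's slot order.
num_accepted_tokens_cpu = [2, 0, 3, 1]

# condense() removed the finished request in slot 1 and moved the request
# from slot 3 into the hole; new slot i now holds the request that used
# to live at prev_positions[i].
prev_positions = [0, 3, 2]

# Buggy: read positionally in the new layout, so slot 1 picks up the
# finished request's stale count.
buggy = [num_accepted_tokens_cpu[i] for i in range(len(prev_positions))]

# Fixed: gather through prev_positions so each surviving request keeps
# its own count.
fixed = [num_accepted_tokens_cpu[p] for p in prev_positions]

print(buggy)  # [2, 0, 3]
print(fixed)  # [2, 1, 3]
```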

Test Plan

LM Eval Large Models (H200)

pytest tests/evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt -k "Nemotron-3-Super-120B-A12B-BF16" -v -s --tb=long

Test Result

main: Fails
PR: Passes


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…re_next_token_ids_padded)

When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951),
optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens is
passed to prepare_next_token_ids_padded as seq_lens_cpu. This value is
inflated relative to the actual committed output_token_ids because
_prepare_inputs appends -1 placeholder slots optimistically.

The backup token lookup calls:
  request.get_token_id(seq_lens_cpu[i])

where seq_lens_cpu[i] points one past the end of the committed tokens,
causing get_token_id() to return -1 (placeholder). The drafter then
receives -1 as its next input token, which corrupts its hidden state and
degrades the draft acceptance rate — causing the Nemotron-3-Super-120B
BF16 GSM8K score to drop from ~0.93 to ~0.74.

Fix: use (num_tokens_no_spec[i] - 1) — the index of the last committed
output token — for the backup token lookup in both EagleProposer
(eagle.py) and ExtractHiddenStatesProposer (extract_hidden_states.py).
num_tokens_no_spec is set to request.num_tokens before the optimistic
extend, so it always points to a valid token slot.

Fixes: vllm-project#38098
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Per Gemini review, the original name was misleading — the buggy code was
always off-by-one, not just when async inflation was present. Rename to
test_buggy_code_was_always_off_by_one and update the docstring to clearly
explain that seq_len (= num_tokens) is always out of range for get_token_id().

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>

claude Bot left a comment
Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added speculative-decoding v1 bug Something isn't working labels Mar 30, 2026
Contributor

gemini-code-assist Bot left a comment

Code Review

This pull request addresses issues with async scheduling in speculative decoding. It updates eagle.py and extract_hidden_states.py to use num_tokens_no_spec - 1 instead of sequence lengths to correctly identify the last committed token, preventing errors caused by inflated sequence lengths from async-scheduling placeholders. Additionally, gpu_model_runner.py is updated to correctly map num_accepted_tokens using prev_positions when async scheduling is enabled, accounting for index reordering by condense(). I have no feedback to provide.

@MatthewBonanni MatthewBonanni changed the title [Bugfix] Fix async mamba [WIP][Bugfix] Fix async mamba Mar 30, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 30, 2026
@MatthewBonanni MatthewBonanni changed the title [WIP][Bugfix] Fix async mamba [Bugfix] Fix async mamba Mar 30, 2026
@MatthewBonanni MatthewBonanni changed the title [Bugfix] Fix async mamba [Bugfix] Fix async spec decoding with hybrid models Mar 30, 2026
@MatthewBonanni MatthewBonanni changed the title [Bugfix] Fix async spec decoding with hybrid models [Bugfix][Async] Fix async spec decoding with hybrid models Mar 30, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Collaborator

NickLucche left a comment

I think we want to test this @ZhanqiuHu

Collaborator

benchislett left a comment

LGTM

@MatthewBonanni MatthewBonanni merged commit 757068d into vllm-project:main Mar 31, 2026
59 checks passed
@MatthewBonanni MatthewBonanni deleted the fix_async_mamba branch March 31, 2026 15:09
@LucasWilkinson LucasWilkinson added this to the v0.19.0 cherry picks milestone Mar 31, 2026
khluu pushed a commit that referenced this pull request Apr 1, 2026
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
(cherry picked from commit 757068d)
UgaTheDev added a commit to UgaTheDev/vllm that referenced this pull request Apr 5, 2026
PR vllm-project#38116 relocated encoder_cudagraph.py and encoder_cudagraph_defs.py
into vllm/v1/worker/gpu/mm/, then PR vllm-project#38556 moved them back to
vllm/v1/worker/. However, 9 import statements across 5 files were not
updated and still reference the old gpu.mm path, causing a
ModuleNotFoundError when enabling cudagraph_mm_encoder.

Fixes vllm-project#38982

Signed-off-by: UgaTheDev <kushzingade@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…ect#38556)

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…ect#38556)

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
stecasta added a commit to stecasta/vllm that referenced this pull request Apr 21, 2026
Closes two empty cells in the MTP x (No-Sync, Correctness-on-hybrid)
spec-decode coverage matrix:

- test_async_spec_decode.py: add mtp-qwen3_5-hybrid case alongside the
  existing eagle3-llama and eagle-mla-deepseek entries, so the async
  spec-decode pipeline is exercised for MTP on a hybrid model.
- test_mtp_correctness: add a Qwen/Qwen3.5-0.8B entry so the ref-vs-spec
  output-match assertion covers a hybrid (linear-attn + attn) target.
  Today test_mtp_correctness has only dense (MiMo-7B) and MLA
  (DeepSeek-V3-4layers) cases; hybrid is uncovered.

Qwen/Qwen3.5-0.8B is the canonical Qwen3_5MTP example in
tests/models/registry.py:1301 and ships MTP weights inside the target
safetensors bundle (mtp.* keys), so self-draft works without a separate
download.

The correctness case would have caught vllm-project#38556
(async spec decoding with hybrid models) -- the condense() race that
propagated stale num_accepted_tokens_cpu values and corrupted
Mamba-descendant hidden state.

Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 23, 2026
### What this PR does / why we need it?
Removes the configuration code that prevented enabling speculative inference
and asynchronous scheduling at the same time. See vllm-project/vllm#38556
(Fix async spec decoding with hybrid models). By asynchronously and
non-blockingly synchronizing the correct seq_lens from the NPU to the CPU,
the attention layer sees correct values, and the performance impact is
almost negligible.

- vLLM version: v0.19.0
- vLLM main:
vllm-project/vllm@6f786f2

---------

Signed-off-by: HF-001 <1670186653@qq.com>
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 23, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request May 7, 2026

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI Failure]: LM Eval Large Models (H200)

5 participants