
[Bugfix][Async] Fix async spec decoding with hybrid models#38556

Merged
MatthewBonanni merged 11 commits into vllm-project:main from MatthewBonanni:fix_async_mamba
Mar 31, 2026

Conversation

Collaborator

@MatthewBonanni MatthewBonanni commented Mar 30, 2026

Co-authored by @SandishKumarHN

FIX: #38098

Purpose

Incorporates 2 fixes:

Fix 1

Posted earlier as #38419, incorporated into this PR.

In async mode, seq_lens_cpu is inflated by optimistic draft token placeholders. When prepare_next_token_ids_padded uses this inflated value to call get_token_id(), it reads past the end of the committed tokens and returns -1. Use num_tokens_no_spec - 1 (the actual last committed token position) instead of seq_lens_cpu for computing backup token indices.
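A minimal, self-contained sketch of the indexing change (the `Request` class and token values here are simplified stand-ins for vLLM's internal request state; only the names `get_token_id` and `num_tokens_no_spec` come from the PR description):

```python
# Sketch of the Fix 1 indexing change. Request here is a simplified
# stand-in for vLLM's internal request state; only get_token_id and
# num_tokens_no_spec are names taken from the PR description.

class Request:
    def __init__(self, token_ids):
        self.token_ids = token_ids  # all committed tokens (prompt + output)

    def get_token_id(self, idx):
        # Out-of-range reads return the -1 placeholder, mimicking the
        # optimistic draft-token slots appended by _prepare_inputs.
        if 0 <= idx < len(self.token_ids):
            return self.token_ids[idx]
        return -1

req = Request([101, 7, 42, 99])
num_tokens_no_spec = len(req.token_ids)  # 4: set before the optimistic extend
seq_len = num_tokens_no_spec + 1         # inflated by a draft-token placeholder

buggy = req.get_token_id(seq_len)                 # past the end -> -1
fixed = req.get_token_id(num_tokens_no_spec - 1)  # last committed token -> 99
print(buggy, fixed)  # -1 99
```

With the buggy index, the drafter is fed `-1` as its next input token; with `num_tokens_no_spec - 1` it always reads the last committed token.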

Fix 2

In async mode, condense() copies num_accepted_tokens_cpu values while the GPU→CPU async copy from the previous batch is still in-flight. As a result, stale values are propagated to the reordered indices, corrupting Mamba hidden states. The fix maps num_accepted_tokens through prev_positions when async scheduling is enabled, accounting for the index reordering performed by condense().
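A CPU-only sketch of the index-remapping part of this fix (the `prev_positions` permutation and the slot layouts below are hypothetical illustrations, not vLLM's actual data structures; the real fix additionally has to account for the in-flight GPU→CPU copy):

```python
# Illustration of why num_accepted_tokens must be gathered through the
# permutation produced by condense(). prev_positions and the values are
# hypothetical; the real runner also handles the async GPU->CPU copy.

# Accepted-token counts laid out in the previous batch's slot order.
num_accepted_tokens_cpu = [2, 0, 3, 1]

# condense() removed the finished request in slot 1 and moved the request
# from slot 3 into the hole; new slot i now holds the request that used
# to live at prev_positions[i].
prev_positions = [0, 3, 2]

# Buggy: read positionally in the new layout, so slot 1 picks up the
# finished request's stale count.
buggy = [num_accepted_tokens_cpu[i] for i in range(len(prev_positions))]

# Fixed: gather through prev_positions so each surviving request keeps
# its own count.
fixed = [num_accepted_tokens_cpu[p] for p in prev_positions]

print(buggy)  # [2, 0, 3]
print(fixed)  # [2, 1, 3]
```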

Test Plan

LM Eval Large Models (H200)

pytest tests/evals/gsm8k/test_gsm8k_correctness.py --config-list-file=configs/models-h200.txt -k "Nemotron-3-Super-120B-A12B-BF16" -v -s --tb=long

Test Result

main: Fails
PR: Passes


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…re_next_token_ids_padded)

When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951),
optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens is
passed to prepare_next_token_ids_padded as seq_lens_cpu. This value is
inflated relative to the actual committed output_token_ids because
_prepare_inputs appends -1 placeholder slots optimistically.

The backup token lookup calls:
  request.get_token_id(seq_lens_cpu[i])

where seq_lens_cpu[i] points one past the end of the committed tokens,
causing get_token_id() to return -1 (placeholder). The drafter then
receives -1 as its next input token, which corrupts its hidden state and
degrades the draft acceptance rate — causing the Nemotron-3-Super-120B
BF16 GSM8K score to drop from ~0.93 to ~0.74.

Fix: use (num_tokens_no_spec[i] - 1) — the index of the last committed
output token — for the backup token lookup in both EagleProposer
(eagle.py) and ExtractHiddenStatesProposer (extract_hidden_states.py).
num_tokens_no_spec is set to request.num_tokens before the optimistic
extend, so it always points to a valid token slot.

Fixes: vllm-project#38098
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Per Gemini review, the original name was misleading — the buggy code was
always off-by-one, not just when async inflation was present. Rename to
test_buggy_code_was_always_off_by_one and update the docstring to clearly
explain that seq_len (= num_tokens) is always out of range for get_token_id().

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>

claude Bot left a comment
Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added speculative-decoding v1 bug Something isn't working labels Mar 30, 2026
Contributor

gemini-code-assist Bot left a comment

Code Review

This pull request addresses issues with async scheduling in speculative decoding. It updates eagle.py and extract_hidden_states.py to use num_tokens_no_spec - 1 instead of sequence lengths to correctly identify the last committed token, preventing errors caused by inflated sequence lengths from async-scheduling placeholders. Additionally, gpu_model_runner.py is updated to correctly map num_accepted_tokens using prev_positions when async scheduling is enabled, accounting for index reordering by condense(). I have no feedback to provide.

@MatthewBonanni MatthewBonanni changed the title [Bugfix] Fix async mamba [WIP][Bugfix] Fix async mamba Mar 30, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@MatthewBonanni MatthewBonanni added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 30, 2026
@MatthewBonanni MatthewBonanni changed the title [WIP][Bugfix] Fix async mamba [Bugfix] Fix async mamba Mar 30, 2026
@MatthewBonanni MatthewBonanni changed the title [Bugfix] Fix async mamba [Bugfix] Fix async spec decoding with hybrid models Mar 30, 2026
@MatthewBonanni MatthewBonanni changed the title [Bugfix] Fix async spec decoding with hybrid models [Bugfix][Async] Fix async spec decoding with hybrid models Mar 30, 2026
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Collaborator

NickLucche left a comment

I think we want to test this @ZhanqiuHu

Collaborator

benchislett left a comment

LGTM

@MatthewBonanni MatthewBonanni merged commit 757068d into vllm-project:main Mar 31, 2026
59 checks passed
@MatthewBonanni MatthewBonanni deleted the fix_async_mamba branch March 31, 2026 15:09
@LucasWilkinson LucasWilkinson added this to the v0.19.0 cherry picks milestone Mar 31, 2026
khluu pushed a commit that referenced this pull request Apr 1, 2026
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
(cherry picked from commit 757068d)
UgaTheDev added a commit to UgaTheDev/vllm that referenced this pull request Apr 5, 2026
PR vllm-project#38116 relocated encoder_cudagraph.py and encoder_cudagraph_defs.py
into vllm/v1/worker/gpu/mm/, then PR vllm-project#38556 moved them back to
vllm/v1/worker/. However, 9 import statements across 5 files were not
updated and still reference the old gpu.mm path, causing a
ModuleNotFoundError when enabling cudagraph_mm_encoder.

Fixes vllm-project#38982

Signed-off-by: UgaTheDev <kushzingade@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…ect#38556)

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…ect#38556)

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
stecasta added a commit to stecasta/vllm that referenced this pull request Apr 21, 2026
Closes two empty cells in the MTP x (No-Sync, Correctness-on-hybrid)
spec-decode coverage matrix:

- test_async_spec_decode.py: add mtp-qwen3_5-hybrid case alongside the
  existing eagle3-llama and eagle-mla-deepseek entries, so the async
  spec-decode pipeline is exercised for MTP on a hybrid model.
- test_mtp_correctness: add a Qwen/Qwen3.5-0.8B entry so the ref-vs-spec
  output-match assertion covers a hybrid (linear-attn + attn) target.
  Today test_mtp_correctness has only dense (MiMo-7B) and MLA
  (DeepSeek-V3-4layers) cases; hybrid is uncovered.

Qwen/Qwen3.5-0.8B is the canonical Qwen3_5MTP example in
tests/models/registry.py:1301 and ships MTP weights inside the target
safetensors bundle (mtp.* keys), so self-draft works without a separate
download.

The correctness case would have caught vllm-project#38556
(async spec decoding with hybrid models) -- the condense() race that
propagated stale num_accepted_tokens_cpu values and corrupted
Mamba-descendant hidden state.

Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Apr 23, 2026
### What this PR does / why we need it?
Removes the configuration code that prevented enabling speculative inference
and asynchronous scheduling at the same time. See vllm-project/vllm#38556
(Fix async spec decoding with hybrid models). By asynchronously and
non-blockingly synchronizing the correct seq_lens from the NPU to the CPU,
the attention layer sees correct values, and the performance impact is
almost negligible.

- vLLM version: v0.19.0
- vLLM main:
vllm-project/vllm@6f786f2

---------

Signed-off-by: HF-001 <1670186653@qq.com>
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request Apr 23, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request Apr 24, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request May 6, 2026
PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request May 7, 2026

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI Failure]: LM Eval Large Models (H200)

5 participants