[Bugfix][Async] Fix async spec decoding with hybrid models #38556
Merged
MatthewBonanni merged 11 commits into vllm-project:main on Mar 31, 2026
Conversation
…re_next_token_ids_padded)

When async scheduling is enabled (zero-bubble spec decoding, PR vllm-project#32951), optimistic_seq_lens_cpu = num_computed_tokens + num_scheduled_tokens is passed to prepare_next_token_ids_padded as seq_lens_cpu. This value is inflated relative to the actual committed output_token_ids because _prepare_inputs appends -1 placeholder slots optimistically.

The backup token lookup calls request.get_token_id(seq_lens_cpu[i]), where seq_lens_cpu[i] points one past the end of the committed tokens, causing get_token_id() to return -1 (the placeholder). The drafter then receives -1 as its next input token, which corrupts its hidden state and degrades the draft acceptance rate, dropping the Nemotron-3-Super-120B BF16 GSM8K score from ~0.93 to ~0.74.

Fix: use (num_tokens_no_spec[i] - 1), the index of the last committed output token, for the backup token lookup in both EagleProposer (eagle.py) and ExtractHiddenStatesProposer (extract_hidden_states.py). num_tokens_no_spec is set to request.num_tokens before the optimistic extend, so it always points to a valid token slot.

Fixes: vllm-project#38098

Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
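For readers following along, a minimal sketch of the index arithmetic described in that commit, using plain Python lists as stand-ins for the real vLLM request state (the names committed, token_buffer, and seq_len are hypothetical, not vLLM API):

```python
# Toy stand-ins for the request state (hypothetical names, not the vLLM API).
committed = [101, 102, 103]            # output tokens actually produced so far
token_buffer = committed + [-1]        # _prepare_inputs appends -1 placeholder slots optimistically

num_tokens_no_spec = len(committed)    # recorded before the optimistic extend -> 3
seq_len = num_tokens_no_spec           # one past the last committed token (even larger under async inflation)

# Buggy backup-token lookup: reads the placeholder slot and hands -1 to the drafter.
buggy_backup = token_buffer[seq_len]                 # -> -1
# Fixed lookup: the last committed output token.
fixed_backup = token_buffer[num_tokens_no_spec - 1]  # -> 103
```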
Per Gemini review, the original name was misleading — the buggy code was always off-by-one, not just when async inflation was present. Rename to test_buggy_code_was_always_off_by_one and update the docstring to clearly explain that seq_len (= num_tokens) is always out of range for get_token_id(). Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Contributor
Code Review
This pull request addresses issues with async scheduling in speculative decoding. It updates eagle.py and extract_hidden_states.py to use num_tokens_no_spec - 1 instead of sequence lengths to correctly identify the last committed token, preventing errors caused by inflated sequence lengths from async-scheduling placeholders. Additionally, gpu_model_runner.py is updated to correctly map num_accepted_tokens using prev_positions when async scheduling is enabled, accounting for index reordering by condense(). I have no feedback to provide.
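To make the second part concrete, here is a hedged sketch of remapping accepted-token counts across a condense() reordering; the tensor names and the direction of the prev_positions mapping are assumptions, not the actual gpu_model_runner.py code:

```python
import torch

# Accepted-token counts recorded at the slots requests occupied *before* condense().
num_accepted_tokens_cpu = torch.tensor([2, 0, 3, 1])

# Hypothetical mapping: prev_positions[i] is the pre-condense slot of the request
# that now sits at slot i of the condensed batch.
prev_positions = torch.tensor([2, 0, 3, 1])

# Gather so each condensed slot receives the count recorded for the same request.
remapped = num_accepted_tokens_cpu[prev_positions]   # tensor([3, 2, 1, 0])
print(remapped)
```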
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
e1e41ab to 46fa724
NickLucche (Collaborator) reviewed on Mar 31, 2026 and left a comment:
I think we want to test this @ZhanqiuHu
khluu pushed a commit that referenced this pull request on Apr 1, 2026
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
(cherry picked from commit 757068d)
UgaTheDev added a commit to UgaTheDev/vllm that referenced this pull request on Apr 5, 2026
PR vllm-project#38116 relocated encoder_cudagraph.py and encoder_cudagraph_defs.py into vllm/v1/worker/gpu/mm/, then PR vllm-project#38556 moved them back to vllm/v1/worker/. However, 9 import statements across 5 files were not updated and still reference the old gpu.mm path, causing a ModuleNotFoundError when enabling cudagraph_mm_encoder. Fixes vllm-project#38982 Signed-off-by: UgaTheDev <kushzingade@gmail.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
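For illustration, a sketch of the path rewrite implied by the commit message above; only the two module paths come from the message, and the imported symbol in the example string is hypothetical:

```python
# Pre-move module path (broken) and the current one, per the commit message.
OLD_PATH = "vllm.v1.worker.gpu.mm.encoder_cudagraph"
NEW_PATH = "vllm.v1.worker.encoder_cudagraph"

def fix_import_line(line: str) -> str:
    """Rewrite an import statement that still uses the old gpu.mm path."""
    return line.replace(OLD_PATH, NEW_PATH)

# Hypothetical example import; the real symbol names differ.
print(fix_import_line(f"from {OLD_PATH} import EncoderCudagraphWrapper"))
# -> from vllm.v1.worker.encoder_cudagraph import EncoderCudagraphWrapper
```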
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request on Apr 7, 2026
…ect#38556)
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request on Apr 9, 2026
…ect#38556)
Signed-off-by: SandishKumarHN <sandishkumarhn@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com>
stecasta added a commit to stecasta/vllm that referenced this pull request on Apr 21, 2026
Closes two empty cells in the MTP x (No-Sync, Correctness-on-hybrid) spec-decode coverage matrix:

- test_async_spec_decode.py: add an mtp-qwen3_5-hybrid case alongside the existing eagle3-llama and eagle-mla-deepseek entries, so the async spec-decode pipeline is exercised for MTP on a hybrid model.
- test_mtp_correctness: add a Qwen/Qwen3.5-0.8B entry so the ref-vs-spec output-match assertion covers a hybrid (linear-attn + attn) target. Today test_mtp_correctness has only dense (MiMo-7B) and MLA (DeepSeek-V3-4layers) cases; hybrid is uncovered. Qwen/Qwen3.5-0.8B is the canonical Qwen3_5MTP example in tests/models/registry.py:1301 and ships MTP weights inside the target safetensors bundle (mtp.* keys), so self-draft works without a separate download.

The correctness case would have caught vllm-project#38556 (async spec decoding with hybrid models): the condense() race that propagated stale num_accepted_tokens_cpu values and corrupted Mamba-descendant hidden state.

Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>
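A rough sketch of what such a parametrized case could look like; the case ids mirror the names above, but the fixture structure and test body are assumptions, not the actual test file:

```python
import pytest

# Hypothetical parametrization; tests/v1/e2e/test_async_spec_decode.py may be structured differently.
ASYNC_SPEC_DECODE_CASES = [
    pytest.param("eagle3-llama", id="eagle3-llama"),
    pytest.param("eagle-mla-deepseek", id="eagle-mla-deepseek"),
    pytest.param("mtp-qwen3_5-hybrid", id="mtp-qwen3_5-hybrid"),  # new hybrid MTP cell
]

@pytest.mark.parametrize("case", ASYNC_SPEC_DECODE_CASES)
def test_async_spec_decode(case: str) -> None:
    # Placeholder body: run the async-scheduling + spec-decode pipeline for the
    # given (method, model) pair and compare outputs against the reference run.
    ...
```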
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request on Apr 23, 2026
### What this PR does / why we need it?
Delete the configuration code that blocked enabling speculative inference and asynchronous scheduling simultaneously; see vllm-project/vllm#38556 (Fix async spec decoding with hybrid models). By asynchronously and non-blockingly synchronizing the correct seq_lens from the NPU to the CPU, the attention layer sees the correct values, and the performance impact is almost negligible.

- vLLM version: v0.19.0
- vLLM main: vllm-project/vllm@6f786f2

Signed-off-by: HF-001 <1670186653@qq.com>
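A hedged sketch of the non-blocking device-to-host sync described above; torch.cuda stands in for the NPU device here, and the stream handling in the real vllm-ascend code is omitted:

```python
import torch

seq_lens_npu = torch.arange(1, 9, dtype=torch.int32, device="cuda")  # "cuda" stands in for the NPU

# Destination must be pinned host memory for the copy to be truly asynchronous.
seq_lens_cpu = torch.empty(seq_lens_npu.shape, dtype=seq_lens_npu.dtype,
                           device="cpu", pin_memory=True)

seq_lens_cpu.copy_(seq_lens_npu, non_blocking=True)  # queued; does not block the device stream
torch.cuda.synchronize()                             # wait before the CPU reads the values
```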
1kzk pushed a commit to 1kzk/vllm-ascend that referenced this pull request on Apr 23, 2026
guxin108 pushed a commit to guxin108/vllm-ascend that referenced this pull request on Apr 24, 2026
zouyida2052 pushed a commit to zouyida2052/vllm-ascend that referenced this pull request on Apr 28, 2026
yangzhe-2026 pushed a commit to yangzhe-2026/vllm-ascend that referenced this pull request on May 6, 2026
PiratePai pushed a commit to PiratePai/vllm-ascend that referenced this pull request on May 7, 2026
co-authored by @SandishKumarHN
FIX: #38098
Purpose
Incorporates 2 fixes:
Fix 1
Posted earlier as #38419, incorporated into this PR.
In async mode, seq_lens_cpu is inflated by optimistic draft token placeholders. When prepare_next_token_ids_padded uses this inflated value to call get_token_id(), it reads past the end of the committed tokens and returns -1. Use num_tokens_no_spec - 1 (the actual last committed token position) instead of seq_lens_cpu for computing backup token indices.
Fix 2
In async mode, condense() copies num_accepted_tokens_cpu values while the GPU→CPU async copy from the previous batch is still in-flight. This results in stale values being propagated to reordered indices, corrupting Mamba hidden states.
Test Plan
LM Eval Large Models (H200)
Test Result
main: Fails
PR: Passes