Skip to content

[Core] Avoid seq_lens_cpu GPU->CPU sync#40654

Merged
WoosukKwon merged 5 commits into
vllm-project:mainfrom
njhill:seq_lens_cpu_nosync
Apr 24, 2026
Merged

[Core] Avoid seq_lens_cpu GPU->CPU sync#40654
WoosukKwon merged 5 commits into
vllm-project:mainfrom
njhill:seq_lens_cpu_nosync

Conversation

@njhill

@njhill njhill commented Apr 22, 2026

Copy link
Copy Markdown
Member

With help from claude

Signed-off-by: Nick Hill <nickhill123@gmail.com>
@njhill njhill force-pushed the seq_lens_cpu_nosync branch from 6e33c79 to e4da3cc Compare April 22, 2026 22:05

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces seq_lens_cpu_upper_bound to CommonAttentionMetadata to provide a CPU-side upper bound for sequence lengths, aiming to eliminate blocking GPU-to-CPU synchronizations during model execution, particularly in speculative decoding scenarios. The field is integrated across various attention backends and model runners to facilitate kernel dispatch and workspace sizing using optimistic bounds. A review comment identifies a critical issue in vllm/v1/spec_decode/eagle.py, where subtracting num_rejected_tokens from the CPU upper bound could trigger a synchronization or lead to device mismatches, potentially undermining the performance benefits of the change.

Comment thread vllm/v1/spec_decode/eagle.py Outdated

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 22, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
@benchislett

Copy link
Copy Markdown
Member

@njhill we already have something similar here in V1 with optimistic_seq_lens_cpu. What is the goal of this PR?

@benchislett

Copy link
Copy Markdown
Member

Why is it modifying the vLLM V1 specdec implementation? Is this change not intended to be isolated to MRV2?

@njhill

njhill commented Apr 22, 2026

Copy link
Copy Markdown
Member Author

@benchislett the primary motivation is eliminating the current prefill cpu sync for DS3.2. This generalizes the v1-specific optimistic_seq_lens_cpu via explicit field in CommonAttentionMetadata that can be used more places in the various attention backends. And also extending it to MRV2.

@njhill

njhill commented Apr 22, 2026

Copy link
Copy Markdown
Member Author

Also aiming to avoid use of common_attn_metadata.seq_lens_cpu generally which is deprecated and can implicitly sync from GPU, i.e. preferring to use the new field consistently which is guaranteed to be cpu-only.

@benchislett

Copy link
Copy Markdown
Member

@LucasWilkinson @MatthewBonanni for review

njhill added 2 commits April 23, 2026 10:09
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
# Conflicts:
#	vllm/v1/spec_decode/eagle.py

Signed-off-by: Nick Hill <nickhill123@gmail.com>
@WoosukKwon WoosukKwon enabled auto-merge (squash) April 23, 2026 23:34
@WoosukKwon WoosukKwon merged commit fe85a92 into vllm-project:main Apr 24, 2026
76 checks passed
@njhill njhill deleted the seq_lens_cpu_nosync branch April 24, 2026 00:38
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Apr 27, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Lafunamor pushed a commit to Lafunamor/vllm that referenced this pull request May 1, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Adrian <info@zzit.ch>
Copilot AI pushed a commit to hongbolv/vllm that referenced this pull request May 7, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
brian-dellabetta pushed a commit to neuralmagic/vllm that referenced this pull request May 29, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants