
[wip][perf] fully overlap spec v2, remove wait_for_verify sync #23452

Open
Qiaolin-Yu wants to merge 5 commits into main from qiaolin/fully_overlap

Conversation

Collaborator

@Qiaolin-Yu Qiaolin-Yu commented Apr 22, 2026

Motivation

Use it with SGLANG_SPEC_V2_NO_VERIFY_SYNC=1.
Disclaimer: this is just a proof of concept and currently only supports the trtllm_mla backend. It may have data races; testing is still lacking.
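To make the idea concrete, here is a minimal, hypothetical sketch (not the PR's actual code; the class and field names are simplified stand-ins) of deferring the verify-time sync: instead of eagerly materializing `seq_lens_sum` on the CPU, which forces a CPU-GPU synchronization through `.item()`, the value stays as a device-side tensor and is only resolved lazily if a backend actually needs the Python int.

```python
import torch

class ModelWorkerBatch:
    """Hypothetical simplified batch: seq_lens_sum is resolved lazily."""

    def __init__(self, seq_lens: torch.Tensor):
        self.seq_lens = seq_lens   # per-request sequence lengths (device tensor)
        self._seq_lens_sum = None  # deferred: no .item() sync at construction time

    @property
    def seq_lens_sum(self) -> int:
        # .item() blocks the CPU until the GPU has produced the value,
        # so we only pay that cost on first access, not on every batch.
        if self._seq_lens_sum is None:
            self._seq_lens_sum = int(self.seq_lens.sum().item())
        return self._seq_lens_sum

batch = ModelWorkerBatch(torch.tensor([3, 5, 2]))
print(batch.seq_lens_sum)  # 10
```

A backend that can consume `seq_lens` directly on the GPU (as trtllm_mla does here) never touches the property, so the sync is skipped entirely.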

Modifications

Tried this with dpsk-fp4 on Blackwell.

Before:
(profiling screenshot)

After:
(profiling screenshot)

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces an optimization for speculative decoding (Spec V2) by allowing the skipping of explicit CPU-GPU synchronizations, controlled by the SGLANG_SPEC_V2_NO_VERIFY_SYNC environment variable. It updates the TRT-LLM MLA backend, batch scheduling, and Eagle workers to handle deferred CPU-side metadata updates such as sequence length sums. Feedback was provided regarding a potential performance bottleneck caused by a CPU-GPU synchronization point when calculating sequence length sums for backends other than MLA.

Comment on lines +361 to +364
    if model_worker_batch.seq_lens_sum is None and tree_mask_buf is None:
        model_worker_batch.seq_lens_sum = (
            model_worker_batch.seq_lens.sum().item()
        )
Contributor


Severity: medium

The call to .item() introduces a CPU-GPU synchronization point. While this is conditional on tree_mask_buf being None (which avoids the sync for the MLA backend), it will still cause a performance bottleneck for other backends (like FlashAttention) that don't provide a pre-allocated buffer. Consider if seq_lens.sum() can be handled entirely on the GPU or if the value can be passed from the scheduler when available.
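One way to act on this suggestion, sketched here as a hedged illustration (the function and buffer names are assumptions, not code from this PR), is to replace the blocking `.item()` with an asynchronous device-to-host copy into a pinned buffer, so the transfer overlaps with other GPU work and the CPU only reads the value once the stream is known to have completed:

```python
import torch

def async_seq_lens_sum(seq_lens: torch.Tensor, pinned_out: torch.Tensor) -> None:
    """Reduce on the device, then copy the scalar to pinned host memory
    without blocking the CPU. The caller must synchronize (e.g. via a
    CUDA event or stream sync) before reading pinned_out."""
    gpu_sum = seq_lens.sum()                      # stays on the device
    pinned_out.copy_(gpu_sum, non_blocking=True)  # async D2H copy

# Pinned (page-locked) host memory is what makes the copy truly async on CUDA;
# on a CPU-only build this degrades gracefully to a plain copy.
pinned = torch.zeros((), dtype=torch.int64,
                     pin_memory=torch.cuda.is_available())
seq_lens = torch.tensor([4, 6, 1])
async_seq_lens_sum(seq_lens, pinned)
if torch.cuda.is_available():
    torch.cuda.synchronize()
print(int(pinned))  # 11
```

This keeps the reduction on the GPU, as the comment proposes, while still making the Python int available to code paths that need it.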


Labels

blackwell SM100/SM120


1 participant