[wip][perf] fully overlap spec v2, remove wait_for_verify sync by Qiaolin-Yu · Pull Request #23452 · sgl-project/sglang

Qiaolin-Yu · 2026-04-22T07:35:27Z

Motivation

Enable it with SGLANG_SPEC_V2_NO_VERIFY_SYNC=1.
Disclaimer: this is just a PoC and only supports trtllm_mla for now. It is not yet clear whether it has a data race, and it lacks testing.

Modifications

Tried this with dpsk-fp4 on Blackwell.

Before,

After,

Accuracy Tests

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request introduces an optimization for speculative decoding (Spec V2) by allowing the skipping of explicit CPU-GPU synchronizations, controlled by the SGLANG_SPEC_V2_NO_VERIFY_SYNC environment variable. It updates the TRT-LLM MLA backend, batch scheduling, and Eagle workers to handle deferred CPU-side metadata updates such as sequence length sums. Feedback was provided regarding a potential performance bottleneck caused by a CPU-GPU synchronization point when calculating sequence length sums for backends other than MLA.
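As a rough illustration of the gating described above, the sketch below reads the flag from the environment and picks between a synchronous path and a deferred one. Only the variable name SGLANG_SPEC_V2_NO_VERIFY_SYNC comes from this PR; the helper names (no_verify_sync_enabled, plan_metadata_update) and the set of accepted truthy values are assumptions for illustration, not SGLang's actual API.

```python
import os

FLAG = "SGLANG_SPEC_V2_NO_VERIFY_SYNC"  # name taken from the PR description


def no_verify_sync_enabled(environ=None):
    """Return True when the flag holds a truthy value (hypothetical parsing rule)."""
    env = os.environ if environ is None else environ
    return env.get(FLAG, "0").strip().lower() in {"1", "true"}


def plan_metadata_update(environ=None):
    """Hypothetical helper: decide how CPU-side metadata gets refreshed."""
    if no_verify_sync_enabled(environ):
        # Skip the explicit wait; CPU-side values are filled in later.
        return "deferred"
    # Default path: wait for verification before touching CPU metadata.
    return "sync"
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise in isolation, for example plan_metadata_update({"SGLANG_SPEC_V2_NO_VERIFY_SYNC": "1"}).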

gemini-code-assist · 2026-04-22T07:40:04Z

+        if model_worker_batch.seq_lens_sum is None and tree_mask_buf is None:
+            model_worker_batch.seq_lens_sum = (
+                model_worker_batch.seq_lens.sum().item()
+            )


The call to .item() introduces a CPU-GPU synchronization point. While this is conditional on tree_mask_buf being None (which avoids the sync for the MLA backend), it will still cause a performance bottleneck for other backends (like FlashAttention) that don't provide a pre-allocated buffer. Consider if seq_lens.sum() can be handled entirely on the GPU or if the value can be passed from the scheduler when available.
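One way to read the second suggestion is sketched below: prefer a sum the scheduler can already compute on the CPU, and fall back to the blocking read only when nothing else is available. This is a minimal pure-Python sketch, not the PR's code. Batch, seq_lens_cpu, resolve_seq_lens_sum, and the blocking_sum callback (standing in for seq_lens.sum().item()) are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Batch:
    # Hypothetical stand-in for the model worker batch.
    seq_lens_sum: Optional[int] = None        # cached CPU-side sum, if known
    seq_lens_cpu: Optional[List[int]] = None  # CPU copy of the lengths, if any


def resolve_seq_lens_sum(batch: Batch, blocking_sum: Callable[[], int]) -> int:
    """Fill in batch.seq_lens_sum, calling blocking_sum only as a last resort."""
    if batch.seq_lens_sum is None:
        if batch.seq_lens_cpu is not None:
            # CPU-side lengths are available: no device synchronization needed.
            batch.seq_lens_sum = sum(batch.seq_lens_cpu)
        else:
            # Stand-in for seq_lens.sum().item(), which blocks on the GPU.
            batch.seq_lens_sum = blocking_sum()
    return batch.seq_lens_sum
```

Whether a CPU-side copy is actually up to date at this point in the overlapped schedule is not established by this thread, so the sketch only shows the order of preference, not a guarantee that the sync can be avoided.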

Qiaolin-Yu added 3 commits April 3, 2026 05:31

temp

fbdd6fa

upd

1373853

upd

25fc77c

Qiaolin-Yu requested review from Fridge003, HaiShaw, Ying1123, hebiao064, hnyls2002, ispobock, merrymercy and xiezhq-hermann as code owners April 22, 2026 07:35

github-actions Bot added the blackwell SM100/SM120 label Apr 22, 2026

Qiaolin-Yu mentioned this pull request Apr 22, 2026

Speculative Decoding Development Roadmap (2026 Q2) #23005

Open

11 tasks

gemini-code-assist Bot reviewed Apr 22, 2026

Qiaolin-Yu added 2 commits May 7, 2026 01:45

Merge remote-tracking branch 'origin/main' into pr-23452

ef7bdd5

lint

b6ddd53

Qiaolin-Yu wants to merge 5 commits into main from qiaolin/fully_overlap
