[wip][perf] fully overlap spec v2, remove wait_for_verify sync #23452
Qiaolin-Yu wants to merge 5 commits into main
Code Review
This pull request introduces an optimization for speculative decoding (Spec V2) by allowing the skipping of explicit CPU-GPU synchronizations, controlled by the SGLANG_SPEC_V2_NO_VERIFY_SYNC environment variable. It updates the TRT-LLM MLA backend, batch scheduling, and Eagle workers to handle deferred CPU-side metadata updates such as sequence length sums. Feedback was provided regarding a potential performance bottleneck caused by a CPU-GPU synchronization point when calculating sequence length sums for backends other than MLA.
    if model_worker_batch.seq_lens_sum is None and tree_mask_buf is None:
        model_worker_batch.seq_lens_sum = (
            model_worker_batch.seq_lens.sum().item()
        )
The call to .item() introduces a CPU-GPU synchronization point. While this is conditional on tree_mask_buf being None (which avoids the sync for the MLA backend), it will still cause a performance bottleneck for other backends (like FlashAttention) that don't provide a pre-allocated buffer. Consider if seq_lens.sum() can be handled entirely on the GPU or if the value can be passed from the scheduler when available.
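One way to read the suggestion above, as a rough sketch only: prefer a sum the scheduler already knows on the CPU and fall back to the blocking device read only when nothing else is available. Everything here is hypothetical and not taken from the PR. `FakeDeviceTensor` is a pure-Python stand-in for a GPU tensor that counts blocking reads, `Batch` is not SGLang's `ModelWorkerBatch`, and `resolve_seq_lens_sum` is an illustrative helper name. Whether the CPU-side lengths are actually up to date at this point in Spec V2 is exactly the open question of this PR.

```python
from dataclasses import dataclass
from typing import List, Optional


class FakeDeviceTensor:
    """Stand-in for a GPU tensor; counts blocking host reads (assumption)."""

    sync_count = 0  # how many times .item() would have blocked the host

    def __init__(self, values):
        self._values = list(values)

    def sum(self):
        # On a real device this reduction is queued asynchronously.
        return FakeDeviceTensor([sum(self._values)])

    def item(self):
        # On a real GPU tensor this is where the CPU waits for the GPU.
        FakeDeviceTensor.sync_count += 1
        return self._values[0]


@dataclass
class Batch:
    """Hypothetical batch container, not SGLang's ModelWorkerBatch."""

    seq_lens: FakeDeviceTensor
    seq_lens_cpu: Optional[List[int]] = None  # scheduler-side copy, if any
    seq_lens_sum: Optional[int] = None


def resolve_seq_lens_sum(batch: Batch) -> int:
    """Fill in seq_lens_sum, avoiding the device read when possible."""
    if batch.seq_lens_sum is not None:
        return batch.seq_lens_sum
    if batch.seq_lens_cpu is not None:
        # Scheduler already has the lengths on the host: no sync needed.
        batch.seq_lens_sum = sum(batch.seq_lens_cpu)
    else:
        # Last resort: the blocking read the review comment warns about.
        batch.seq_lens_sum = batch.seq_lens.sum().item()
    return batch.seq_lens_sum
```

With a host-side copy present, the helper never touches `.item()`; without one, it pays for exactly one blocking read and caches the result on the batch.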
Motivation
Use it with
SGLANG_SPEC_V2_NO_VERIFY_SYNC=1
Disclaimer: this is just a PoC and only supports trtllm_mla for now. Not sure whether it has a data race; it lacks testing.
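For readers unfamiliar with this kind of opt-in switch, here is a minimal sketch of reading such a flag in Python. Only the variable name SGLANG_SPEC_V2_NO_VERIFY_SYNC comes from this PR. The helper name `no_verify_sync_enabled` and the set of accepted truthy values are assumptions for illustration, and SGLang's own environment handling may parse the value differently.

```python
import os

# Accepted truthy spellings are an assumption, not SGLang's actual rule.
_TRUTHY = {"1", "true", "yes", "on"}


def no_verify_sync_enabled(environ=None) -> bool:
    """Return True when the opt-in flag is set to a truthy value."""
    env = os.environ if environ is None else environ
    raw = env.get("SGLANG_SPEC_V2_NO_VERIFY_SYNC", "0")
    return raw.strip().lower() in _TRUTHY
```

Defaulting to off keeps the existing synchronized path as the behavior whenever the variable is unset, which suits an experimental change.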
Modifications
Tried this with dpsk-fp4 (DeepSeek FP4) on Blackwell.
Before: [image not included]
After: [image not included]
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci