feat: Full Support for Overlapped Constrained Decoding + Spec V2 #15623
Ubospica wants to merge 4 commits into sgl-project:main from
Conversation
Summary of Changes
Hello @Ubospica, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the efficiency of constrained decoding, particularly when using speculative decoding (Spec V2) with grammars. By enabling full overlap between CPU-bound grammar processing and GPU computations, it minimizes idle time and boosts overall throughput. This is achieved through asynchronous data transfers, a new native C++ implementation for grammar tree traversal, and a refactored scheduling mechanism that processes grammar state updates from previous batches concurrently with the current batch's GPU forward pass.
Highlights
Code Review
This pull request introduces full support for overlapped constrained decoding with Speculative Decoding V2, which is a significant performance enhancement. The changes are well-structured, primarily affecting the scheduler and the speculative worker to enable overlapping CPU-bound grammar processing with GPU computations. Key changes include the introduction of pending_accept_info for carrying over grammar work, asynchronous data transfers using pinned memory, and the integration of a faster C++ implementation for draft tree traversal. The addition of new benchmarks and unit tests is commendable and ensures the correctness and performance of the new features. The implementation appears solid, and my feedback focuses on improving type hint accuracy and reducing some code duplication for better long-term maintainability.
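To make the overlap concrete for reviewers, here is a minimal, self-contained sketch of the pattern the summary describes: the CPU advances the previous batch's grammar matchers while the current batch's forward pass runs concurrently. All names and the thread-based framing are illustrative assumptions, not the PR's actual API; the PR itself appears to get the same effect by launching the GPU work first and processing the deferred CPU-side grammar accepts while the GPU is busy.

```python
import threading
from typing import Callable, List, Sequence


def run_step_overlapped(
    prev_grammars: Sequence[object],            # objects exposing accept_token(int)
    prev_accepted_tokens: Sequence[List[int]],  # per-request tokens from batch N-1
    launch_forward: Callable[[], object],       # launches batch N's forward pass
):
    """Overlap CPU grammar updates for batch N-1 with batch N's forward pass."""

    def accept_prev_batch():
        # CPU-bound: advance each grammar matcher over its accepted tokens.
        for grammar, tokens in zip(prev_grammars, prev_accepted_tokens):
            for token_id in tokens:
                grammar.accept_token(token_id)

    worker = threading.Thread(target=accept_prev_batch)
    worker.start()             # grammar work proceeds on the CPU in the background
    result = launch_forward()  # forward pass runs concurrently
    worker.join()              # both must finish before grammar masks are applied
    return result
```

The gain is exactly the idle time removed: grammar accept_token calls that previously serialized with the forward pass now hide behind it.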
if last_batch.forward_mode.is_extend():
    # Prefill case: single token per request (next_token_ids shape: [bs])
    for i, req in enumerate(last_batch.reqs):
        if (
            req.grammar is not None
            and not req.finished()
            and not req.is_retracted
        ):
            try:
                req.grammar.accept_token(next_token_ids[i])
            except ValueError as e:
                logger.error(
                    f"Grammar accept_token failed for req {req.rid} "
                    f"with token {next_token_ids[i]}: {e}"
                )
else:
    # Decode case: multiple accepted tokens per request
    # next_token_ids shape: [bs * speculative_num_draft_tokens]
    accept_lens = last_result.accept_lens.tolist()
    stride = self.speculative_num_draft_tokens

    for i, req in enumerate(last_batch.reqs):
        if (
            req.grammar is not None
            and not req.finished()
            and not req.is_retracted
        ):
            # Get the accepted tokens for this request
            accepted_tokens = next_token_ids[
                i * stride : i * stride + accept_lens[i]
            ]
            try:
                for token_id in accepted_tokens:
                    req.grammar.accept_token(token_id)
            except ValueError as e:
                logger.error(
                    f"Grammar accept_token failed for req {req.rid} "
                    f"with tokens {accepted_tokens}: {e}"
                )
The logic for handling prefill and decode cases in this method has some code duplication, particularly the checks for req.grammar, req.finished(), req.is_retracted, and the try...except block. This could be refactored to improve maintainability by first preparing a unified list of accepted tokens for each request and then iterating through them in a single loop.
if last_batch.forward_mode.is_extend():
    # Prefill case: single token per request (next_token_ids shape: [bs])
    accepted_tokens_per_req = [[t] for t in next_token_ids]
else:
    # Decode case: multiple accepted tokens per request
    accept_lens = last_result.accept_lens.tolist()
    stride = self.speculative_num_draft_tokens
    accepted_tokens_per_req = [
        next_token_ids[i * stride : i * stride + accept_lens[i]]
        for i in range(len(last_batch.reqs))
    ]
for i, req in enumerate(last_batch.reqs):
    if (
        req.grammar is not None
        and not req.finished()
        and not req.is_retracted
    ):
        accepted_tokens = accepted_tokens_per_req[i]
        try:
            for token_id in accepted_tokens:
                req.grammar.accept_token(token_id)
        except ValueError as e:
            logger.error(
                f"Grammar accept_token failed for req {req.rid} "
                f"with tokens {accepted_tokens}: {e}"
            )
the suggestion looks not bad
Force-pushed from 5d71d30 to 26354d2
Signed-off-by: Ubospica <ubospica@gmail.com>
# the current batch's verify forward to overlap CPU and GPU operations.
if (
    batch
    and batch.is_eagle_v2
AttributeError: 'ScheduleBatch' object has no attribute 'is_eagle_v2'
Also consider adapting python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py?
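One way to avoid the AttributeError above, shown here only as an illustrative guard and not necessarily the author's fix (the real fix may instead define is_eagle_v2 on ScheduleBatch), is to make the attribute access defensive:

```python
# Hypothetical guard; the actual resolution in the PR may differ.
if batch is not None and getattr(batch, "is_eagle_v2", False):
    ...  # defer the grammar accepts into this batch's verify step
```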
@@ -0,0 +1,7 @@
#!/bin/bash
maybe we don't need this file?
def draft_worker(self):
    return self._draft_worker

def _init_grammar_pinned_buffers(self, bs: int, num_draft_tokens: int):
what's the reason we want to lazy init here? seems like non-lazy init is fine?
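For context on the lazy-vs-eager question, here is a rough sketch of what eager pinned-buffer allocation could look like; the buffer names, dtypes, and shapes are assumptions for illustration, not the PR's actual code:

```python
import torch


def init_grammar_pinned_buffers(bs: int, num_draft_tokens: int):
    # Page-locked (pinned) host buffers enable non_blocking GPU->CPU copies.
    accepted_tokens_cpu = torch.empty(
        bs * num_draft_tokens, dtype=torch.int64, pin_memory=True
    )
    accept_lens_cpu = torch.empty(bs, dtype=torch.int32, pin_memory=True)
    return accepted_tokens_cpu, accept_lens_cpu
```

Lazy initialization mainly helps when the maximum batch size or draft length is not known at construction time, or to avoid pinning host memory for runs that never use grammars; otherwise eager allocation looks equally fine, which seems to be the reviewer's point.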
)

# Record event to synchronize before using CPU tensors
copy_event = torch.cuda.Event()
Consider naming this grammar_copy_done for clarity and consistency, since we already have copy_done and verify_done.
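For readers unfamiliar with the event idiom in the snippet above, a minimal sketch of how a recorded CUDA event gates the later CPU read of a pinned buffer; it uses the suggested grammar_copy_done name, and all other variable names are assumptions:

```python
import torch


def async_copy_tokens_to_cpu(tokens_gpu: torch.Tensor, pinned_buf: torch.Tensor):
    # Assumes tokens_gpu is a 1-D CUDA tensor of token ids and pinned_buf is a
    # 1-D pinned (page-locked) CPU tensor at least as long.
    n = tokens_gpu.numel()
    pinned_buf[:n].copy_(tokens_gpu, non_blocking=True)  # async D2H copy
    grammar_copy_done = torch.cuda.Event()
    grammar_copy_done.record()  # records on the current CUDA stream
    return grammar_copy_done

# Later, right before grammar.accept_token() consumes the tokens:
#   grammar_copy_done.synchronize()
#   token_ids = pinned_buf[:n].tolist()
```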
if last_batch.has_grammar:
    batch.pending_accept_info = (last_batch, last_result)
    # Mark that grammar accept will be processed in the next batch's verify
    last_result.grammar_accept_processed = True
I am trying to understand why we need this grammar_accept_processed flag. For spec v2 + decode + not disable_overlap_for_batch, all batches will have grammar_accept_processed = True, and in all other cases grammar_accept_processed = False.
In the output processing, can we just check the conditions above instead?
Also, IIUC, grammar_accept_processed only applies to decode batches. How do you confirm that for last_batch?
I see you check batch.forward_mode.is_decode(), but that checks whether batch N is a decode batch; last_batch is batch N-1, so how can we infer batch N-1's mode from batch N?
I think I got the reason now: batch N's verify handles batch N-1's grammar_accept_processed.
This PR supports fully overlapped constrained decoding with Spec V2.
See #13425 for the non-overlapped version. See #11762, #13019 for more background. This PR is on top of #15465.
Overlapping Pattern
Benchmark
E2E Latency:
Key Findings
test_mix_json_and_other): ~30% faster
Signed-off-by: Ubospica <ubospica@gmail.com>
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist