feat: Full Support for Overlapped Constrained Decoding + Spec V2 #15623

Open

Ubospica wants to merge 4 commits into sgl-project:main from Ubospica:main-dev/2025-12-22-overlap

Conversation

@Ubospica (Collaborator) commented Dec 22, 2025

This PR supports fully overlapped constrained decoding with Spec V2.

See #13425 for the non-overlapped version. See #11762, #13019 for more background. This PR is on top of #15465.

Overlapping Pattern

[image: overlapping pattern diagram]
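A minimal, runnable sketch of the pattern shown above (illustrative only; `gpu_verify` and `cpu_grammar_accept` are stand-ins, not SGLang APIs): while the GPU executes batch N's verify forward, the CPU applies batch N-1's accepted tokens to its grammar matchers, and batch N's own accepts are deferred to the next iteration.

```python
# Toy overlap loop (illustrative; not SGLang's actual scheduler code).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

def gpu_verify(tokens: torch.Tensor) -> torch.Tensor:
    # Stand-in for the target model's verify forward; enqueued
    # asynchronously on the GPU stream when `tokens` is on CUDA.
    weight = torch.randn(tokens.shape[-1], 8, device=tokens.device)
    return tokens.float() @ weight

def cpu_grammar_accept(accepted: list, state: set) -> None:
    # Stand-in for the CPU-bound xgrammar matcher updates.
    state.update(accepted)

grammar_state: set = set()
pending = None  # batch N-1's accepted tokens, deferred to batch N's step
for step in range(3):
    tokens = torch.randint(0, 100, (4, 16), device=device)
    out = gpu_verify(tokens)                        # (1) launch batch N's GPU work
    if pending is not None:
        cpu_grammar_accept(pending, grammar_state)  # (2) overlap: batch N-1's accepts
    if device == "cuda":
        torch.cuda.synchronize()                    # (3) join before reading results
    pending = out.argmax(dim=-1).tolist()           # (4) defer batch N's accepts
```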

Benchmark

E2E Latency:

| Test Case | Baseline | Optimized | Speedup |
|---|---|---|---|
| test_json_generate | 0.644s | 0.609s | 1.06x |
| test_mix_json_and_other (batch JSON) | 0.17-0.21s | 0.12-0.16s | ~1.3x |
| test_regex_generate_complex_json | 0.175s | 0.164s | 1.07x |
| test_regex_generate_custom_log_format | 0.19-0.26s | 0.18-0.23s | ~1.1x |
| test_regex_generate_email | 0.07s | 0.04-0.06s | ~1.2x |
| test_regex_generate_phone | 0.08-0.11s | 0.07-0.08s | ~1.2x |

Key Findings

  1. Overall Speedup: 5%-30% reduction in e2e_latency across most test cases
  2. Best improvement observed in batch JSON constrained generation (test_mix_json_and_other): ~30% faster
  3. Correctness verified: All 12 tests pass with identical output formats

Signed-off-by: Ubospica <ubospica@gmail.com>


@github-actions (Bot) added the dependencies label (Pull requests that update a dependency file) on Dec 22, 2025
@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @Ubospica, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the efficiency of constrained decoding, particularly when using speculative decoding (Spec V2) with grammars. By enabling full overlap between CPU-bound grammar processing and GPU computations, it minimizes idle time and boosts overall throughput. This is achieved through asynchronous data transfers, a new native C++ implementation for grammar tree traversal, and a refactored scheduling mechanism that processes grammar state updates from previous batches concurrently with the current batch's GPU forward pass.

Highlights

  • Overlapped Grammar Processing: Implemented full support for overlapped constrained decoding with Spec V2, allowing CPU-bound grammar state updates to run concurrently with GPU computations.
  • Performance Optimization: Introduced a C++ native implementation for grammar tree traversal and integrated asynchronous GPU-to-CPU data transfers for grammar-related tensors, significantly improving performance.
  • XGrammar Update: Updated the xgrammar dependency to version 0.1.29 to leverage the new native traversal capabilities.
  • Benchmarking and Testing: Added new benchmark scripts and unit tests to validate the performance and correctness of the native grammar traversal and the overall overlapped decoding mechanism.


@Ubospica (Collaborator, Author) commented Dec 22, 2025

Overlap pattern:

[image: overlap pattern diagram]

cc @hnyls2002 @merrymercy

@gemini-code-assist (Contributor) left a comment:


Code Review

This pull request introduces full support for overlapped constrained decoding with Speculative Decoding V2, which is a significant performance enhancement. The changes are well-structured, primarily affecting the scheduler and the speculative worker to enable overlapping CPU-bound grammar processing with GPU computations. Key changes include the introduction of pending_accept_info for carrying over grammar work, asynchronous data transfers using pinned memory, and the integration of a faster C++ implementation for draft tree traversal. The addition of new benchmarks and unit tests is commendable and ensures the correctness and performance of the new features. The implementation appears solid, and my feedback focuses on improving type hint accuracy and reducing some code duplication for better long-term maintainability.

Comment thread: python/sglang/srt/managers/schedule_batch.py (Outdated)
Comment on lines +853 to +891
        if last_batch.forward_mode.is_extend():
            # Prefill case: single token per request (next_token_ids shape: [bs])
            for i, req in enumerate(last_batch.reqs):
                if (
                    req.grammar is not None
                    and not req.finished()
                    and not req.is_retracted
                ):
                    try:
                        req.grammar.accept_token(next_token_ids[i])
                    except ValueError as e:
                        logger.error(
                            f"Grammar accept_token failed for req {req.rid} "
                            f"with token {next_token_ids[i]}: {e}"
                        )
        else:
            # Decode case: multiple accepted tokens per request
            # next_token_ids shape: [bs * speculative_num_draft_tokens]
            accept_lens = last_result.accept_lens.tolist()
            stride = self.speculative_num_draft_tokens

            for i, req in enumerate(last_batch.reqs):
                if (
                    req.grammar is not None
                    and not req.finished()
                    and not req.is_retracted
                ):
                    # Get the accepted tokens for this request
                    accepted_tokens = next_token_ids[
                        i * stride : i * stride + accept_lens[i]
                    ]
                    try:
                        for token_id in accepted_tokens:
                            req.grammar.accept_token(token_id)
                    except ValueError as e:
                        logger.error(
                            f"Grammar accept_token failed for req {req.rid} "
                            f"with tokens {accepted_tokens}: {e}"
                        )
@gemini-code-assist (Contributor) commented:

Severity: medium

The logic for handling prefill and decode cases in this method has some code duplication, particularly the checks for req.grammar, req.finished(), req.is_retracted, and the try...except block. This could be refactored to improve maintainability by first preparing a unified list of accepted tokens for each request and then iterating through them in a single loop.

        if last_batch.forward_mode.is_extend():
            # Prefill case: single token per request (next_token_ids shape: [bs])
            accepted_tokens_per_req = [[t] for t in next_token_ids]
        else:
            # Decode case: multiple accepted tokens per request
            accept_lens = last_result.accept_lens.tolist()
            stride = self.speculative_num_draft_tokens
            accepted_tokens_per_req = [
                next_token_ids[i * stride : i * stride + accept_lens[i]]
                for i in range(len(last_batch.reqs))
            ]

        for i, req in enumerate(last_batch.reqs):
            if (
                req.grammar is not None
                and not req.finished()
                and not req.is_retracted
            ):
                accepted_tokens = accepted_tokens_per_req[i]
                try:
                    for token_id in accepted_tokens:
                        req.grammar.accept_token(token_id)
                except ValueError as e:
                    logger.error(
                        f"Grammar accept_token failed for req {req.rid} "
                        f"with tokens {accepted_tokens}: {e}"
                    )

A collaborator replied:

The suggestion looks reasonable.

Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Ubospica <ubospica@gmail.com>
Signed-off-by: Ubospica <ubospica@gmail.com>
@Ubospica force-pushed the main-dev/2025-12-22-overlap branch from 5d71d30 to 26354d2 on January 1, 2026 06:46
Signed-off-by: Ubospica <ubospica@gmail.com>
# the current batch's verify forward to overlap CPU and GPU operations.
if (
    batch
    and batch.is_eagle_v2
A collaborator commented:

AttributeError: 'ScheduleBatch' object has no attribute 'is_eagle_v2'
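One possible fix, assuming the attribute only exists on spec-v2 batches (a sketch, not a confirmed patch):

```python
# Guard against ScheduleBatch objects that lack the attribute entirely.
if batch and getattr(batch, "is_eagle_v2", False):
    ...
```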

@acelyc111 (Collaborator) commented:

Also consider adapting this to python/sglang/srt/speculative/multi_layer_eagle_worker_v2.py?

@@ -0,0 +1,7 @@
#!/bin/bash
A collaborator commented:

maybe we don't need this file?

def draft_worker(self):
    return self._draft_worker

def _init_grammar_pinned_buffers(self, bs: int, num_draft_tokens: int):
@hanming-lu (Collaborator) commented Feb 24, 2026:

what's the reason we want to lazy init here? seems like non-lazy init is fine?
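For context, a sketch of the lazy pinned-buffer pattern under discussion (the field name `_accept_lens_cpu` is a hypothetical stand-in, not the PR's exact code). One common reason to lazy-init is that the required size depends on the runtime batch size, so allocation is deferred and the buffer regrown only when a larger batch arrives:

```python
# Illustrative sketch; field and method names are assumptions.
import torch

class GrammarBuffers:
    def __init__(self):
        self._accept_lens_cpu = None  # allocated lazily on first use

    def _init_grammar_pinned_buffers(self, bs: int, num_draft_tokens: int) -> torch.Tensor:
        needed = bs * num_draft_tokens
        # Pinned (page-locked) host memory is what makes GPU->CPU copies
        # via Tensor.copy_(..., non_blocking=True) truly asynchronous.
        if self._accept_lens_cpu is None or self._accept_lens_cpu.numel() < needed:
            self._accept_lens_cpu = torch.empty(
                needed, dtype=torch.int64, pin_memory=True
            )
        return self._accept_lens_cpu[:needed]
```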

)

# Record event to synchronize before using CPU tensors
copy_event = torch.cuda.Event()
A collaborator commented:

Rename copy_event to grammar_copy_done for clarity and consistency, since we already have a copy_done and verify_done.
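For reference, a minimal sketch of the event-based handoff this thread is about (variable names are illustrative; `grammar_copy_done` follows the naming suggestion above):

```python
# Illustrative sketch; not the PR's exact code.
import torch

def start_async_copy(gpu_tensor: torch.Tensor, pinned_cpu: torch.Tensor) -> torch.cuda.Event:
    # Enqueue a non-blocking device->host copy into pinned memory...
    pinned_cpu.copy_(gpu_tensor, non_blocking=True)
    # ...then record an event on the current stream so the consumer can
    # wait for exactly this copy instead of a full device synchronize.
    grammar_copy_done = torch.cuda.Event()
    grammar_copy_done.record()
    return grammar_copy_done

# Consumer side, just before reading `pinned_cpu` on the CPU:
#     grammar_copy_done.synchronize()
```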

if last_batch.has_grammar:
    batch.pending_accept_info = (last_batch, last_result)
    # Mark that grammar accept will be processed in the next batch's verify
    last_result.grammar_accept_processed = True
@hanming-lu (Collaborator) commented Feb 24, 2026:

I am trying to understand why we need this grammar_accept_processed flag. For spec v2 + decode + not disable_overlap_for_batch, all batches will have grammar_accept_processed = True, and in all other cases grammar_accept_processed = False.

In the output processing, can we just check those conditions directly instead?

@hanming-lu (Collaborator) added Feb 24, 2026:

Also, IIUC, grammar_accept_processed only applies to decode batches. How do you confirm that for the last_batch?

I see you check batch.forward_mode.is_decode(), but that checks whether batch N is a decode batch; last_batch is batch N-1, so how can we infer batch N-1's mode from batch N?

A collaborator replied:

I think I get it now: batch N's verify handles batch N-1's grammar accept (hence grammar_accept_processed is set on batch N-1's result).
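A toy timeline of that handoff (runnable, but illustrative only; field names mirror the diff while the control flow is paraphrased). Batch N's verify step drains batch N-1's deferred accepts, which is why the flag is recorded on `last_result` rather than on the batch currently being scheduled:

```python
# Runnable toy; not SGLang code.
class Result:
    def __init__(self, tokens):
        self.tokens = tokens
        self.grammar_accept_processed = False

pending_accept_info = None  # (batch_id, result) carried over from batch N-1
applied = []
for n in range(3):
    result = Result(tokens=[n])             # batch n's verify output
    if pending_accept_info is not None:     # overlap window of batch n
        last_id, last_result = pending_accept_info
        applied.append((n, last_id, last_result.tokens))  # accept N-1's tokens here
    result.grammar_accept_processed = True  # batch n+1 will drain this
    pending_accept_info = (n, result)
print(applied)  # [(1, 0, [0]), (2, 1, [1])]
```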

Comment thread python/sglang/srt/managers/schedule_batch.py

Labels

dependencies (Pull requests that update a dependency file), run-ci
