
feat: Naive support Spec V2 + Constrained Decoding #13425

Merged
hnyls2002 merged 10 commits into sgl-project:main from Ubospica:main-dev/2025-11-17-xgrammar-spec-v2 on Nov 27, 2025

Conversation

@Ubospica (Collaborator) commented Nov 17, 2025

This PR enables constrained decoding and speculative decoding v2 (Spec V2) at the same time, resolving #13019.

When the batch contains grammars to handle, overlap scheduling is temporarily disabled.

Further TODOs:

  • Split the forward launch into two phases: before verify sampling and after verify sampling
  • Apply the delayed launch of phase 2, which requires:
    • Launching phase 2 after the last batch's result is processed (so the grammar is synced)
    • Syncing the draft token ids to the CPU before launching phase 2, or moving the mask preparation to the GPU (see the sketch below)
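A minimal sketch of how this two-phase split could look. All helper names here (launch_draft_and_target_forward, wait_until_processed, build_grammar_mask_and_sample) are hypothetical, not SGLang's actual APIs:

# Hypothetical sketch of the planned two-phase launch; helper names are
# illustrative and do not exist in SGLang.

def forward_with_grammar(batch, last_batch):
    # Phase 1: draft proposal + target forward. This part does not depend
    # on the grammar state, so it can be launched immediately and overlap
    # with processing of the previous batch.
    verify_input, target_logits = launch_draft_and_target_forward(batch)

    # Phase 2 must wait until the previous batch's result is processed,
    # because only then have the grammar matchers advanced past the last
    # accepted tokens.
    if last_batch is not None:
        last_batch.wait_until_processed()

    # Mask preparation happens on the CPU, so the draft token ids must be
    # synced to the host first (alternatively, build the mask on the GPU).
    draft_tokens_cpu = verify_input.draft_token.cpu()

    # Build the vocab mask from the now up-to-date grammar and run verify
    # sampling under the constraint.
    return build_grammar_mask_and_sample(
        target_logits, batch.grammars, draft_tokens_cpu
    )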

Signed-off-by: Ubospica <ubospica@gmail.com>

cc @merrymercy @hnyls2002 @jiapingW

Output:

result {'text': '{"name": "John", "age": 30}', 'output_ids': [6377, 978, 1115, 376, 11639, 613, 376, 482, 1115, 29871, 29941, 29900, 29913, 2], 'meta_info': {'id': '05c466dca22d4397b91b31c8f7422aaf', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 58, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 14, 'cached_tokens': 0, 'spec_accept_rate': 0.45, 'spec_accept_length': 3.5, 'spec_verify_ct': 4, 'spec_accept_token_num': 9, 'spec_draft_token_num': 20, 'e2e_latency': 0.6285512447357178, 'response_sent_to_client_ts': 1764062075.5853202}}
result {'text': '{"items": [{"id": 1, "name": "Item 1"}, {"id": 2, "name": "Item 2"}, {"id": 3, "name": "Item 3"}]}', 'output_ids': [6377, 7076, 1115, 518, 6377, 333, 1115, 29871, 29896, 29892, 376, 978, 1115, 376, 2001, 29871, 29896, 10758, 8853, 333, 1115, 29871, 29906, 29892, 376, 978, 1115, 376, 2001, 29871, 29906, 10758, 8853, 333, 1115, 29871, 29941, 29892, 376, 978, 1115, 376, 2001, 29871, 29941, 29908, 6525, 29913, 2], 'meta_info': {'id': 'b81f85e345d0434e877dcfd23f134d6b', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 88, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 49, 'cached_tokens': 3, 'spec_accept_rate': 0.44, 'spec_accept_length': 3.2666666666666666, 'spec_verify_ct': 15, 'spec_accept_token_num': 33, 'spec_draft_token_num': 75, 'e2e_latency': 0.15197324752807617, 'response_sent_to_client_ts': 1764062075.739491}}
result {'text': 'user@example.com', 'output_ids': [1792, 29992, 4773, 29889, 510, 2], 'meta_info': {'id': '2a645c62e26c4419864049c70fb0ca62', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 8, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 6, 'cached_tokens': 2, 'spec_accept_rate': 0.13333333333333333, 'spec_accept_length': 2.0, 'spec_verify_ct': 3, 'spec_accept_token_num': 2, 'spec_draft_token_num': 15, 'e2e_latency': 0.04137468338012695, 'response_sent_to_client_ts': 1764062075.7826078}}

Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @Ubospica, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the SGLang system by integrating support for constrained decoding with Speculative Decoding V2 (EAGLE). The core purpose is to enable the simultaneous use of both advanced features, allowing for more controlled and structured text generation while retaining the performance benefits of speculative decoding. The changes involve modifying the token sampling process to incorporate grammar-based vocabulary masks and adding comprehensive tests to ensure the correctness and compatibility of these combined functionalities across different speculative decoding versions and constraint types.

Highlights

  • Constrained Decoding Integration: Implemented the ability to apply vocabulary masks during speculative decoding (Spec V2) to enforce grammar constraints, ensuring generated tokens adhere to specified rules.
  • Grammar Data Preparation: Added logic within the verification process to prepare grammar-related data on the CPU and dynamically generate a vocab_mask based on active grammar rules for constrained decoding (see the sketch below).
  • Comprehensive Test Coverage: Introduced new unit tests to validate the combined functionality of speculative decoding (both EAGLE v1 and v2) with JSON schema and regex-based constrained decoding, ensuring robustness.
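For context, the essential masking step behind these highlights can be sketched as follows (generic PyTorch, not the PR's actual code):

import torch

def apply_vocab_mask(logits: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Set the logits of tokens the grammar disallows to -inf.

    logits:  [batch, vocab] raw scores from the target model.
    allowed: [batch, vocab] boolean mask, True where the grammar's current
             state permits the token; rebuilt every step from the matcher.
    """
    return logits.masked_fill(~allowed, float("-inf"))

# Sampling from the masked logits can then only ever pick legal tokens:
logits = torch.randn(2, 32000)
allowed = torch.zeros(2, 32000, dtype=torch.bool)
allowed[:, :100] = True  # toy grammar: only the first 100 token ids are legal
next_token = torch.argmax(apply_vocab_mask(logits, allowed), dim=-1)
assert (next_token < 100).all()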

Ubospica marked this pull request as draft on November 17, 2025 at 11:12
@gemini-code-assist (Bot) left a comment:

Code Review

This pull request successfully adds support for constrained decoding with speculative decoding v2. The changes correctly plumb the vocabulary mask through the speculative verification path, enabling grammar-based constraints. The addition of a new integration test file with cases for JSON schema and regex constraints is a great way to ensure this new functionality is well-tested and robust. I have one suggestion for a potential performance improvement in the data transfer logic.

Comment on lines +679 to +683
retrieve_next_token_cpu = verify_input.retrive_next_token.cpu()
retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.cpu()
draft_tokens_cpu = verify_input.draft_token.view(
verify_input.retrive_next_token.shape
).cpu()
@gemini-code-assist (Bot), severity: medium

For a potential performance improvement, consider making the device-to-host transfers non-blocking. This allows the data transfer to overlap with the GPU computations in forward_batch_generation, which could reduce overall latency. You can achieve this by using .to('cpu', non_blocking=True) instead of .cpu().

Suggested change:

-retrieve_next_token_cpu = verify_input.retrive_next_token.cpu()
-retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.cpu()
-draft_tokens_cpu = verify_input.draft_token.view(
-    verify_input.retrive_next_token.shape
-).cpu()
+retrieve_next_token_cpu = verify_input.retrive_next_token.to("cpu", non_blocking=True)
+retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.to("cpu", non_blocking=True)
+draft_tokens_cpu = verify_input.draft_token.view(
+    verify_input.retrive_next_token.shape
+).to("cpu", non_blocking=True)

Ubospica marked this pull request as ready for review on November 17, 2025 at 20:56
Ubospica changed the title from "Support Spec V2 + Constrained Decoding" to "feat: Support Spec V2 + Constrained Decoding" on Nov 17, 2025
@jiapingW (Contributor) commented Nov 18, 2025

Thanks. I tested your implementation with Llama3.1-8b-Instruct and an EAGLE draft model. With export SGLANG_ENABLE_SPEC_V2=0, the response satisfies r"^user@example\.com$". With export SGLANG_ENABLE_SPEC_V2=1, the response is "use the following information to generate an email address". I tested with the following prompt and sampling parameters.

    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.5,
            "max_new_tokens": 128,
            "n": 3,
            "regex": regex,
        },
        "stream": False,
        "return_logprob": False,
        "top_logprobs_num": 0,
        "logprob_start_len": 0,
    }

@merrymercy (Contributor) commented:

/tag-run-ci-label try again

@Ubospica (Collaborator, Author) commented:

> Thanks. I tested your implementation with Llama3.1-8b-Instruct and an EAGLE draft model. With export SGLANG_ENABLE_SPEC_V2=0, the response satisfies r"^user@example\.com$". With export SGLANG_ENABLE_SPEC_V2=1, the response is "use the following information to generate an email address". [...]

@jiapingW Thanks for testing! I will further fix this.

@jiapingW (Contributor) commented:

> @jiapingW Thanks for testing! I will further fix this.
I'm also fixing it. You can check here first: the EAGLEWorkerV2 verify function's input batch is a ModelWorkerBatch class, which doesn't have a has_grammar attribute. Once you have fixed it, I'll help test it.

@jiapingW (Contributor) commented:

My main finding is that, because of the Spec V2 overlap, the grammar is not updated immediately after prefill but only after the first decode. As a result, the grammar state is still None during the first decode, leading to duplicated token generation, which also corrupts all subsequent results.
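A toy illustration of this stale-grammar lag (not SGLang code; the matcher here tracks how much of a fixed target string has been emitted):

TARGET = "user@example.com"

class ToyMatcher:
    def __init__(self):
        self.pos = 0

    def legal_next(self):
        # Only the remaining suffix of TARGET is legal from this state.
        return TARGET[self.pos:]

    def advance(self, text):
        assert self.legal_next().startswith(text)
        self.pos += len(text)

# Correct ordering: advance with the prefill output before the first decode.
m = ToyMatcher()
m.advance("use")
print(m.legal_next())  # "r@example.com" -> the first decode mask is right

# Overlap bug: the first decode's mask is built before advance() runs, so
# the matcher still allows the full string from scratch and "use" can be
# generated again, giving outputs like "useuseruser@eexamplele.com".
stale = ToyMatcher()
print(stale.legal_next())  # "user@example.com"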

@jiapingW (Contributor) commented:

I implemented a runnable (unpolished) version in https://github.com/sgl-project/sglang/pull/13441/files. I haven't done thorough testing or performance analysis yet, but I tested it with the following code and its result is OK.

import json
import re
import sys
import requests
import argparse

def regex_match(text, pattern):
    return re.fullmatch(pattern, text) is not None

def send_sglang_request(
    base_url,
    prompt,
    regex,
    n=1,
    temperature=0,
    max_new_tokens=128,
):
    # Force a non-zero temperature so that n > 1 yields diverse outputs.
    temperature = 0.5
    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "n": n,
            "regex": regex,
        },
        "stream": False,
        "return_logprob": False,
        "top_logprobs_num": 0,
        "logprob_start_len": 0,
    }

    print("--- Sending Request ---")
    print("URL:", base_url + "/generate")
    print("Payload:")
    print(json.dumps(payload, indent=2))
    print("-" * 25)

    try:
        response = requests.post(base_url + "/generate", json=payload)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error: Request failed: {e}")
        sys.exit(1)

    ret = response.json()

    print("\n--- Received Response ---")
    print("Raw JSON Response:")
    # print(ret)
    # for i,item in enumerate(ret):
    #     print(f"response_{i}:",item["text"])
    print("-" * 25)

    if isinstance(ret, dict):
        results_list = [ret]
    elif isinstance(ret, list):
        results_list = ret
    else:
        print(f"Error: Expected response to be a list or dict, but got {type(ret)}")
        sys.exit(1)

    print("\n--- Validating Results ---")
    all_passed = True
    if not results_list:
        print("Error: Response list is empty.")
        all_passed = False

    for i, item in enumerate(results_list):
        text = item.get("text", "").strip()
        print(f"Result {i+1}/{len(results_list)}: '{text}'")

        if not text:
            print("  -> FAIL: Generated text is empty.")
            all_passed = False
            continue

        if not regex_match(text, regex):
            print(f"  -> FAIL: Text does not match regex pattern '{regex}'.")
            all_passed = False
        else:
            print("  -> PASS: Text matches regex.")
    
    print("-" * 25)
    return all_passed


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Test SGLang server with regex-constrained generation.")
    parser.add_argument("--port", type=int, default=21000, help="Port of the SGLang server (default: 30000)")
    args = parser.parse_args()
    test_prompt = "Generate an email address:"
    test_regex = r"^user@example\.com$"
    num_generations = 3

    print(f"Running test: Generate {num_generations} email(s) with regex '{test_regex}'")
    print("=" * 100)

    SGLANG_BASE_URL = f"http://127.0.0.1:{args.port}"

    success = send_sglang_request(
        base_url=SGLANG_BASE_URL,
        prompt=test_prompt,
        regex=test_regex,
        n=num_generations,
    )

    print("\n--- Final Result ---")
    if success:
        print("✅ All generated texts passed validation.")
        sys.exit(0)
    else:
        print("❌ One or more generated texts failed validation.")
        sys.exit(1)
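(The script above can be saved to a file and run with --port pointing at the running SGLang server; its argparse default is 21000.)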

@jiapingW (Contributor) commented:

My design is as follows:

Take the question "Generate an email address:" and the grammar "^user@example\.com$" as an example.

The original Spec V2 overlap design handles the process as follows:

  1. After the prefill run_batch finishes, it produces the "use" token. At this point, the grammar is not advanced because last_batch is None.
  2. After the first decode-phase run_batch finishes, it produces the "user" token. Since the grammar has not been updated, the vocab mask is still built from the initial grammar state, so already-matched tokens remain legal and are generated again, producing outputs like "useuseruser@eexamplele.com".

Therefore, my design idea is: place the grammar update operation right after run_batch, and decouple process_batch_result from grammar processing as much as possible. This way, process_batch_result can handle the output normally, while the grammar state is always up to date and does not corrupt subsequent token generation.

The revised processing steps are (see the sketch after this list):

  1. After the prefill run_batch finishes, it produces the "use" token. At this point, advance the grammar to "use".
  2. After the first decode-phase run_batch finishes, it produces the "r" token. Advance the grammar to "user", and simultaneously execute process_batch_result to handle the previous batch's result and emit the token "use".
  ...
  n. After the n-th decode-phase run_batch finishes, it produces the ".com<|im_end|>" tokens. Advance the grammar to "user@example.com<|im_end|>", and execute process_batch_result to emit the token "ple".
  n+1. In the (n+1)-th decode phase, run_batch takes "<|im_end|>" as input and generates token id 151935; the grammar is not updated. Simultaneously, process_batch_result emits the token ".com".

Need to fix

The final round should not need this extra run_batch; it can be skipped directly by checking whether the grammar has terminated.
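A minimal sketch of this reordering (hypothetical helper names throughout; PR #13441 contains the actual implementation):

# Hypothetical event-loop sketch of the proposed ordering. All helper
# names (scheduler_has_work, get_next_batch, run_batch, update_grammars,
# process_batch_result, all_grammars_finished) are illustrative and do
# not match SGLang's real internals.

def scheduler_loop():
    last_batch, last_result = None, None
    while scheduler_has_work():
        batch = get_next_batch()

        # "Need to fix": skip the extra trailing step once every grammar
        # in the batch has terminated.
        if batch is not None and all_grammars_finished(batch):
            batch = None

        result = run_batch(batch) if batch is not None else None

        # Advance the grammar matchers immediately with this step's
        # accepted tokens, so the next step's vocab mask is built from
        # fresh state...
        if result is not None:
            update_grammars(batch, result.accepted_tokens)

        # ...while output handling for the previous batch still overlaps
        # with the current forward pass.
        if last_batch is not None:
            process_batch_result(last_batch, last_result)

        last_batch, last_result = batch, result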

@jiapingW (Contributor) commented:

@Ubospica Can you help review the impl?

@Ubospica (Collaborator, Author) commented:

> @Ubospica Can you help review the impl?

Certainly. I will check that out. Thanks for the reference!

Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch 3 times, most recently from 3bc3427 to 0d29bae, on November 25, 2025 at 09:10
@hnyls2002 (Collaborator) commented:

/tag-and-rerun-ci

Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch from 3d576cf to bd68835 on November 27, 2025 at 01:54
Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch 2 times, most recently from f24b69f to bb0383b, on November 27, 2025 at 03:40
Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch from bb0383b to e510ee6 on November 27, 2025 at 03:56
@Ubospica (Collaborator, Author) commented:

Now all the tests have passed. cc @hnyls2002

Ubospica and others added 6 commits on November 26, 2025 at 23:07
@hnyls2002 (Collaborator) commented:

/tag-and-rerun-ci

hnyls2002 changed the title from "feat: Support Spec V2 + Constrained Decoding" to "feat: Naive support Spec V2 + Constrained Decoding" on Nov 27, 2025
hnyls2002 merged commit 6350042 into sgl-project:main on Nov 27, 2025 (75 of 80 checks passed)
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request on Dec 4, 2025