
feat: Naive support Spec V2 + Constrained Decoding #13425

Merged
hnyls2002 merged 10 commits into sgl-project:main from Ubospica:main-dev/2025-11-17-xgrammar-spec-v2 on Nov 27, 2025

Conversation

@Ubospica (Collaborator) commented Nov 17, 2025

This PR enables constrained decoding and speculative decoding v2 (Spec V2) at the same time, resolving #13019.

When the batch contains grammars to handle, overlap scheduling is temporarily disabled.

Further TODOs:

  • Split the forward launch into two phases: before verify sampling and after verify sampling
  • Apply the delayed launch of phase 2, which requires:
    • Launching phase 2 after the last batch's result is processed (so the grammar is synced)
    • Syncing the draft token ids to the CPU before launching phase 2, or moving the mask preparation to the GPU (see the sketch below)
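A minimal sketch of how this two-phase split could look. All helper names here (launch_draft_and_target_forward, wait_until_processed, build_grammar_mask_and_sample) are hypothetical, not SGLang's actual APIs:

# Hypothetical sketch of the planned two-phase launch; helper names are
# illustrative and do not exist in SGLang.

def forward_with_grammar(batch, last_batch):
    # Phase 1: draft proposal + target forward. This part does not depend
    # on the grammar state, so it can be launched immediately and overlap
    # with processing of the previous batch.
    verify_input, target_logits = launch_draft_and_target_forward(batch)

    # Phase 2 must wait until the previous batch's result is processed,
    # because only then have the grammar matchers advanced past the last
    # accepted tokens.
    if last_batch is not None:
        last_batch.wait_until_processed()

    # Mask preparation happens on the CPU, so the draft token ids must be
    # synced to the host first (alternatively, build the mask on the GPU).
    draft_tokens_cpu = verify_input.draft_token.cpu()

    # Build the vocab mask from the now up-to-date grammar and run verify
    # sampling under the constraint.
    return build_grammar_mask_and_sample(
        target_logits, batch.grammars, draft_tokens_cpu
    )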

Signed-off-by: Ubospica <ubospica@gmail.com>

cc @merrymercy @hnyls2002 @jiapingW

Output:

result {'text': '{"name": "John", "age": 30}', 'output_ids': [6377, 978, 1115, 376, 11639, 613, 376, 482, 1115, 29871, 29941, 29900, 29913, 2], 'meta_info': {'id': '05c466dca22d4397b91b31c8f7422aaf', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 58, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 14, 'cached_tokens': 0, 'spec_accept_rate': 0.45, 'spec_accept_length': 3.5, 'spec_verify_ct': 4, 'spec_accept_token_num': 9, 'spec_draft_token_num': 20, 'e2e_latency': 0.6285512447357178, 'response_sent_to_client_ts': 1764062075.5853202}}
result {'text': '{"items": [{"id": 1, "name": "Item 1"}, {"id": 2, "name": "Item 2"}, {"id": 3, "name": "Item 3"}]}', 'output_ids': [6377, 7076, 1115, 518, 6377, 333, 1115, 29871, 29896, 29892, 376, 978, 1115, 376, 2001, 29871, 29896, 10758, 8853, 333, 1115, 29871, 29906, 29892, 376, 978, 1115, 376, 2001, 29871, 29906, 10758, 8853, 333, 1115, 29871, 29941, 29892, 376, 978, 1115, 376, 2001, 29871, 29941, 29908, 6525, 29913, 2], 'meta_info': {'id': 'b81f85e345d0434e877dcfd23f134d6b', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 88, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 49, 'cached_tokens': 3, 'spec_accept_rate': 0.44, 'spec_accept_length': 3.2666666666666666, 'spec_verify_ct': 15, 'spec_accept_token_num': 33, 'spec_draft_token_num': 75, 'e2e_latency': 0.15197324752807617, 'response_sent_to_client_ts': 1764062075.739491}}
result {'text': 'user@example.com', 'output_ids': [1792, 29992, 4773, 29889, 510, 2], 'meta_info': {'id': '2a645c62e26c4419864049c70fb0ca62', 'finish_reason': {'type': 'stop', 'matched': 2}, 'prompt_tokens': 8, 'weight_version': 'default', 'total_retractions': 0, 'completion_tokens': 6, 'cached_tokens': 2, 'spec_accept_rate': 0.13333333333333333, 'spec_accept_length': 2.0, 'spec_verify_ct': 3, 'spec_accept_token_num': 2, 'spec_draft_token_num': 15, 'e2e_latency': 0.04137468338012695, 'response_sent_to_client_ts': 1764062075.7826078}}

Co-authored-by: Liangsheng Yin <lsyincs@gmail.com>

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello @Ubospica, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request significantly enhances the SGLang system by integrating support for constrained decoding with Speculative Decoding V2 (EAGLE). The core purpose is to enable the simultaneous use of both advanced features, allowing for more controlled and structured text generation while retaining the performance benefits of speculative decoding. The changes involve modifying the token sampling process to incorporate grammar-based vocabulary masks and adding comprehensive tests to ensure the correctness and compatibility of these combined functionalities across different speculative decoding versions and constraint types.

Highlights

  • Constrained Decoding Integration: Implemented the ability to apply vocabulary masks during speculative decoding (Spec V2) to enforce grammar constraints, ensuring generated tokens adhere to specified rules.
  • Grammar Data Preparation: Added logic within the verification process to prepare grammar-related data on the CPU and dynamically generate a vocab_mask based on active grammar rules for constrained decoding (see the sketch below).
  • Comprehensive Test Coverage: Introduced new unit tests to validate the combined functionality of speculative decoding (both EAGLE v1 and v2) with JSON schema and regex-based constrained decoding, ensuring robustness.
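For context, the essential masking step behind these highlights can be sketched as follows (generic PyTorch, not the PR's actual code):

import torch

def apply_vocab_mask(logits: torch.Tensor, allowed: torch.Tensor) -> torch.Tensor:
    """Set the logits of tokens the grammar disallows to -inf.

    logits:  [batch, vocab] raw scores from the target model.
    allowed: [batch, vocab] boolean mask, True where the grammar's current
             state permits the token; rebuilt every step from the matcher.
    """
    return logits.masked_fill(~allowed, float("-inf"))

# Sampling from the masked logits can then only ever pick legal tokens:
logits = torch.randn(2, 32000)
allowed = torch.zeros(2, 32000, dtype=torch.bool)
allowed[:, :100] = True  # toy grammar: only the first 100 token ids are legal
next_token = torch.argmax(apply_vocab_mask(logits, allowed), dim=-1)
assert (next_token < 100).all()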

Ubospica marked this pull request as draft on November 17, 2025 at 11:12
@gemini-code-assist (Bot) left a comment:

Code Review

This pull request successfully adds support for constrained decoding with speculative decoding v2. The changes correctly plumb the vocabulary mask through the speculative verification path, enabling grammar-based constraints. The addition of a new integration test file with cases for JSON schema and regex constraints is a great way to ensure this new functionality is well-tested and robust. I have one suggestion for a potential performance improvement in the data transfer logic.

Comment on lines +679 to +683
retrieve_next_token_cpu = verify_input.retrive_next_token.cpu()
retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.cpu()
draft_tokens_cpu = verify_input.draft_token.view(
verify_input.retrive_next_token.shape
).cpu()
@gemini-code-assist (Bot), severity: medium

For a potential performance improvement, consider making the device-to-host transfers non-blocking. This allows the data transfer to overlap with the GPU computations in forward_batch_generation, which could reduce overall latency. You can achieve this by using .to('cpu', non_blocking=True) instead of .cpu().

Suggested change:

-retrieve_next_token_cpu = verify_input.retrive_next_token.cpu()
-retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.cpu()
-draft_tokens_cpu = verify_input.draft_token.view(
-    verify_input.retrive_next_token.shape
-).cpu()
+retrieve_next_token_cpu = verify_input.retrive_next_token.to("cpu", non_blocking=True)
+retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.to("cpu", non_blocking=True)
+draft_tokens_cpu = verify_input.draft_token.view(
+    verify_input.retrive_next_token.shape
+).to("cpu", non_blocking=True)

Ubospica marked this pull request as ready for review on November 17, 2025 at 20:56
Ubospica changed the title from "Support Spec V2 + Constrained Decoding" to "feat: Support Spec V2 + Constrained Decoding" on Nov 17, 2025
@jiapingW (Contributor) commented Nov 18, 2025

Thanks. I tested your implementation with Llama3.1-8b-Instruct and an EAGLE draft model. With export SGLANG_ENABLE_SPEC_V2=0, the response satisfies r"^user@example\.com$". With export SGLANG_ENABLE_SPEC_V2=1, the response is "use the following information to generate an email address". I tested with the following prompt and sampling parameters.

    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": 0.5,
            "max_new_tokens": 128,
            "n": 3,
            "regex": regex,
        },
        "stream": False,
        "return_logprob": False,
        "top_logprobs_num": 0,
        "logprob_start_len": 0,
    }

@merrymercy (Contributor) commented:

/tag-run-ci-label try again

@Ubospica (Collaborator, Author) commented:

> Thanks. I tested your implementation with Llama3.1-8b-Instruct and an EAGLE draft model. With export SGLANG_ENABLE_SPEC_V2=0, the response satisfies r"^user@example\.com$". With export SGLANG_ENABLE_SPEC_V2=1, the response is "use the following information to generate an email address". [...]

@jiapingW Thanks for testing! I will further fix this.

@jiapingW (Contributor) commented:

> @jiapingW Thanks for testing! I will further fix this.
I'm also fixing it. You can check here first: the EAGLEWorkerV2 verify function's input batch is a ModelWorkerBatch class, which doesn't have a has_grammar attribute. Once you have fixed it, I'll help test it.

@jiapingW (Contributor) commented:

My main finding is that, because of the Spec V2 overlap, the grammar is not updated immediately after prefill but only after the first decode. As a result, the grammar state is still None during the first decode, leading to duplicated token generation, which also corrupts all subsequent results.
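A toy illustration of this stale-grammar lag (not SGLang code; the matcher here tracks how much of a fixed target string has been emitted):

TARGET = "user@example.com"

class ToyMatcher:
    def __init__(self):
        self.pos = 0

    def legal_next(self):
        # Only the remaining suffix of TARGET is legal from this state.
        return TARGET[self.pos:]

    def advance(self, text):
        assert self.legal_next().startswith(text)
        self.pos += len(text)

# Correct ordering: advance with the prefill output before the first decode.
m = ToyMatcher()
m.advance("use")
print(m.legal_next())  # "r@example.com" -> the first decode mask is right

# Overlap bug: the first decode's mask is built before advance() runs, so
# the matcher still allows the full string from scratch and "use" can be
# generated again, giving outputs like "useuseruser@eexamplele.com".
stale = ToyMatcher()
print(stale.legal_next())  # "user@example.com"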

@jiapingW (Contributor) commented:

I implemented a runnable (unpolished) version in https://github.com/sgl-project/sglang/pull/13441/files. I haven't done thorough testing or performance analysis yet, but I tested it with the following code and its result is OK.

import json
import re
import sys
import requests
import argparse

def regex_match(text, pattern):
    return re.fullmatch(pattern, text) is not None

def send_sglang_request(
    base_url,
    prompt,
    regex,
    n=1,
    temperature=0,
    max_new_tokens=128,
):
    # Force a non-zero temperature so that n > 1 yields diverse outputs.
    temperature = 0.5
    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "n": n,
            "regex": regex,
        },
        "stream": False,
        "return_logprob": False,
        "top_logprobs_num": 0,
        "logprob_start_len": 0,
    }

    print("--- Sending Request ---")
    print("URL:", base_url + "/generate")
    print("Payload:")
    print(json.dumps(payload, indent=2))
    print("-" * 25)

    try:
        response = requests.post(base_url + "/generate", json=payload)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error: Request failed: {e}")
        sys.exit(1)

    ret = response.json()

    print("\n--- Received Response ---")
    print("Raw JSON Response:")
    # print(ret)
    # for i,item in enumerate(ret):
    #     print(f"response_{i}:",item["text"])
    print("-" * 25)

    if isinstance(ret, dict):
        results_list = [ret]
    elif isinstance(ret, list):
        results_list = ret
    else:
        print(f"Error: Expected response to be a list or dict, but got {type(ret)}")
        sys.exit(1)

    print("\n--- Validating Results ---")
    all_passed = True
    if not results_list:
        print("Error: Response list is empty.")
        all_passed = False

    for i, item in enumerate(results_list):
        text = item.get("text", "").strip()
        print(f"Result {i+1}/{len(results_list)}: '{text}'")

        if not text:
            print("  -> FAIL: Generated text is empty.")
            all_passed = False
            continue

        if not regex_match(text, regex):
            print(f"  -> FAIL: Text does not match regex pattern '{regex}'.")
            all_passed = False
        else:
            print("  -> PASS: Text matches regex.")
    
    print("-" * 25)
    return all_passed


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Test SGLang server with regex-constrained generation.")
    parser.add_argument("--port", type=int, default=21000, help="Port of the SGLang server (default: 30000)")
    args = parser.parse_args()
    test_prompt = "Generate an email address:"
    test_regex = r"^user@example\.com$"
    num_generations = 3

    print(f"Running test: Generate {num_generations} email(s) with regex '{test_regex}'")
    print("=" * 100)

    SGLANG_BASE_URL = f"http://127.0.0.1:{args.port}"

    success = send_sglang_request(
        base_url=SGLANG_BASE_URL,
        prompt=test_prompt,
        regex=test_regex,
        n=num_generations,
    )

    print("\n--- Final Result ---")
    if success:
        print("✅ All generated texts passed validation.")
        sys.exit(0)
    else:
        print("❌ One or more generated texts failed validation.")
        sys.exit(1)
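(The script above can be saved to a file and run with --port pointing at the running SGLang server; its argparse default is 21000.)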

@jiapingW (Contributor) commented:

My design is as follows:

Take the question "Generate an email address:" and the grammar "^user@example\.com$" as an example.

The original Spec V2 overlap design handles the process as follows:

  1. After the prefill run_batch finishes, it produces the "use" token. At this point, the grammar is not advanced because last_batch is None.
  2. After the first decode-phase run_batch finishes, it produces the "user" token. Since the grammar has not been updated, the vocab mask is still built from the initial grammar state, so already-matched tokens remain legal and are generated again, producing outputs like "useuseruser@eexamplele.com".

Therefore, my design idea is: place the grammar update operation right after run_batch, and decouple process_batch_result from grammar processing as much as possible. This way, process_batch_result can handle the output normally, while the grammar state is always up to date and does not corrupt subsequent token generation.

The revised processing steps are (see the sketch after this list):

  1. After the prefill run_batch finishes, it produces the "use" token. At this point, advance the grammar to "use".
  2. After the first decode-phase run_batch finishes, it produces the "r" token. Advance the grammar to "user", and simultaneously execute process_batch_result to handle the previous batch's result and emit the token "use".
  ...
  n. After the n-th decode-phase run_batch finishes, it produces the ".com<|im_end|>" tokens. Advance the grammar to "user@example.com<|im_end|>", and execute process_batch_result to emit the token "ple".
  n+1. In the (n+1)-th decode phase, run_batch takes "<|im_end|>" as input and generates token id 151935; the grammar is not updated. Simultaneously, process_batch_result emits the token ".com".

Need to fix

The final round should not need this extra run_batch; it can be skipped directly by checking whether the grammar has terminated.
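A minimal sketch of this reordering (hypothetical helper names throughout; PR #13441 contains the actual implementation):

# Hypothetical event-loop sketch of the proposed ordering. All helper
# names (scheduler_has_work, get_next_batch, run_batch, update_grammars,
# process_batch_result, all_grammars_finished) are illustrative and do
# not match SGLang's real internals.

def scheduler_loop():
    last_batch, last_result = None, None
    while scheduler_has_work():
        batch = get_next_batch()

        # "Need to fix": skip the extra trailing step once every grammar
        # in the batch has terminated.
        if batch is not None and all_grammars_finished(batch):
            batch = None

        result = run_batch(batch) if batch is not None else None

        # Advance the grammar matchers immediately with this step's
        # accepted tokens, so the next step's vocab mask is built from
        # fresh state...
        if result is not None:
            update_grammars(batch, result.accepted_tokens)

        # ...while output handling for the previous batch still overlaps
        # with the current forward pass.
        if last_batch is not None:
            process_batch_result(last_batch, last_result)

        last_batch, last_result = batch, result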

@jiapingW (Contributor) commented:

@Ubospica Can you help review the impl?

@Ubospica (Collaborator, Author) commented:

> @Ubospica Can you help review the impl?

Certainly. I will check that out. Thanks for the reference!

Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch 3 times, most recently from 3bc3427 to 0d29bae, on November 25, 2025 at 09:10
@hnyls2002 (Collaborator) commented:

/tag-and-rerun-ci

Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch from 3d576cf to bd68835 on November 27, 2025 at 01:54
Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch 2 times, most recently from f24b69f to bb0383b, on November 27, 2025 at 03:40
Ubospica force-pushed the main-dev/2025-11-17-xgrammar-spec-v2 branch from bb0383b to e510ee6 on November 27, 2025 at 03:56
@Ubospica (Collaborator, Author) commented:

Now all the tests have passed. cc @hnyls2002

Ubospica and others added 6 commits on November 26, 2025 at 23:07
@hnyls2002 (Collaborator) commented:

/tag-and-rerun-ci

hnyls2002 changed the title from "feat: Support Spec V2 + Constrained Decoding" to "feat: Naive support Spec V2 + Constrained Decoding" on Nov 27, 2025
hnyls2002 merged commit 6350042 into sgl-project:main on Nov 27, 2025 (75 of 80 checks passed)
harvenstar pushed a commit to harvenstar/sglang that referenced this pull request on Dec 4, 2025