feat: Naive support Spec V2 + Constrained Decoding#13425
hnyls2002 merged 10 commits into sgl-project:main
Conversation
Summary of Changes
This pull request enhances SGLang by integrating support for constrained decoding with Speculative Decoding V2 (EAGLE). The core purpose is to enable the simultaneous use of both features, allowing for more controlled and structured text generation while retaining the performance benefits of speculative decoding. The changes modify the token sampling process to incorporate grammar-based vocabulary masks and add tests to ensure the correctness and compatibility of the combined functionality across different speculative decoding versions and constraint types.
Code Review
This pull request successfully adds support for constrained decoding with speculative decoding v2. The changes correctly plumb the vocabulary mask through the speculative verification path, enabling grammar-based constraints. The addition of a new integration test file with cases for JSON schema and regex constraints is a great way to ensure this new functionality is well-tested and robust. I have one suggestion for a potential performance improvement in the data transfer logic.
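As background for how a grammar-based vocabulary mask constrains sampling, here is a minimal, self-contained sketch (toy logits and token ids, not the actual SGLang code): tokens the grammar forbids get their logits set to -inf before the selection step, so they can never be chosen.

```python
# Toy illustration of grammar-based vocabulary masking (not SGLang's
# real implementation): forbidden tokens are masked to -inf.
import math

def apply_vocab_mask(logits, allowed_ids):
    # Keep logits of allowed token ids; everything else becomes -inf.
    return [
        logit if i in allowed_ids else -math.inf
        for i, logit in enumerate(logits)
    ]

logits = [0.1, 2.0, 0.3, 1.5]            # token 1 has the highest raw logit
masked = apply_vocab_mask(logits, allowed_ids={0, 2})
best = masked.index(max(masked))

print(best)  # 2 -- the best *allowed* token wins, not token 1
```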
retrieve_next_token_cpu = verify_input.retrive_next_token.cpu()
retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.cpu()
draft_tokens_cpu = verify_input.draft_token.view(
    verify_input.retrive_next_token.shape
).cpu()
For a potential performance improvement, consider making the device-to-host transfers non-blocking. This allows the data transfer to overlap with the GPU computations in forward_batch_generation, which could reduce overall latency. You can achieve this by using .to('cpu', non_blocking=True) instead of .cpu().
Suggested change:

# Before:
retrieve_next_token_cpu = verify_input.retrive_next_token.cpu()
retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.cpu()
draft_tokens_cpu = verify_input.draft_token.view(
    verify_input.retrive_next_token.shape
).cpu()

# After:
retrieve_next_token_cpu = verify_input.retrive_next_token.to("cpu", non_blocking=True)
retrieve_next_sibling_cpu = verify_input.retrive_next_sibling.to("cpu", non_blocking=True)
draft_tokens_cpu = verify_input.draft_token.view(
    verify_input.retrive_next_token.shape
).to("cpu", non_blocking=True)
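One caveat worth noting about this `non_blocking=True` suggestion (general PyTorch behavior, not specific to this PR): a device-to-host copy is only truly asynchronous when it lands in pinned memory, and the resulting CPU tensor must not be read before the copy has completed, e.g. after a `torch.cuda.synchronize()`. A minimal sketch (variable names are illustrative, not from the PR):

```python
import torch

def async_to_cpu(t: torch.Tensor) -> torch.Tensor:
    # Queue a (potentially) asynchronous device-to-host copy. On a CUDA
    # tensor this can overlap with subsequent kernel launches; on a CPU
    # tensor non_blocking is simply a no-op.
    return t.to("cpu", non_blocking=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(12, device=device).view(3, 4)

x_cpu = async_to_cpu(x)
# ... other GPU work (e.g. forward_batch_generation) could overlap here ...
if device == "cuda":
    torch.cuda.synchronize()  # ensure the copy finished before reading

print(x_cpu.sum().item())  # 66
```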
Thanks, I tested your implementation with Llama3.1-8B-Instruct and an EAGLE model, using the following payload:
payload = {
    "text": prompt,
    "sampling_params": {
        "temperature": 0.5,
        "max_new_tokens": 128,
        "n": 3,
        "regex": regex,
    },
    "stream": False,
    "return_logprob": False,
    "top_logprobs_num": 0,
    "logprob_start_len": 0,
}
/tag-run-ci-label try again
@jiapingW Thanks for testing! I will fix this further.
My main finding is that, because of the Spec V2 overlap, the grammar is not updated immediately after prefill but only after the first decode. As a result, the grammar is None during the first decode, which leads to duplicated output being generated and also affects subsequent results.
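To make this timing issue concrete, here is a toy simulation (all names invented, not SGLang code) of a decode loop in which the grammar is committed one step late, so the first decode step samples unconstrained:

```python
# Toy model of the overlap timing bug: under overlap scheduling, the
# grammar produced after prefill only becomes visible AFTER the first
# decode step, so step 0 samples with grammar=None (unconstrained).

ALLOWED = {"a", "b"}  # tokens the (toy) grammar permits

def sample(grammar):
    # Constrained sampling picks an allowed token; unconstrained
    # sampling may pick anything, here "z".
    return "a" if grammar is not None else "z"

def decode(overlap: bool, steps: int = 3):
    pending = ALLOWED
    # Without overlap, the grammar is attached before decoding starts.
    grammar = None if overlap else pending
    out = []
    for _ in range(steps):
        out.append(sample(grammar))
        if grammar is None:
            grammar = pending  # overlap: grammar lands one step late
    return out

print(decode(overlap=True))   # ['z', 'a', 'a'] -> first token escapes the grammar
print(decode(overlap=False))  # ['a', 'a', 'a'] -> fully constrained
```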
I implemented a runnable (but unpolished) version in https://github.com/sgl-project/sglang/pull/13441/files. I haven't done thorough testing or performance analysis yet, but I verified it with the following script and the result is OK:
import json
import re
import sys
import requests
import argparse


def regex_match(text, pattern):
    return re.fullmatch(pattern, text) is not None


def send_sglang_request(
    base_url,
    prompt,
    regex,
    n=1,
    temperature=0,
    max_new_tokens=128,
):
    # if n > 1 and temperature == 0:
    #     temperature = 0.5
    #     print(f"Warning: n > 1 but temperature is 0. Automatically setting temperature to {temperature} for diverse outputs.")
    temperature = 0.5
    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
            "n": n,
            "regex": regex,
        },
        "stream": False,
        "return_logprob": False,
        "top_logprobs_num": 0,
        "logprob_start_len": 0,
    }
    print("--- Sending Request ---")
    print("URL:", base_url + "/generate")
    print("Payload:")
    print(json.dumps(payload, indent=2))
    print("-" * 25)
    try:
        response = requests.post(base_url + "/generate", json=payload)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error: Request failed: {e}")
        sys.exit(1)
    ret = response.json()
    print("\n--- Received Response ---")
    print("Raw JSON Response:")
    # print(ret)
    # for i, item in enumerate(ret):
    #     print(f"response_{i}:", item["text"])
    print("-" * 25)
    if isinstance(ret, dict):
        results_list = [ret]
    elif isinstance(ret, list):
        results_list = ret
    else:
        print(f"Error: Expected response to be a list or dict, but got {type(ret)}")
        sys.exit(1)
    print("\n--- Validating Results ---")
    all_passed = True
    if not results_list:
        print("Error: Response list is empty.")
        all_passed = False
    for i, item in enumerate(results_list):
        text = item.get("text", "").strip()
        print(f"Result {i+1}/{len(results_list)}: '{text}'")
        if not text:
            print("  -> FAIL: Generated text is empty.")
            all_passed = False
            continue
        if not regex_match(text, regex):
            print(f"  -> FAIL: Text does not match regex pattern '{regex}'.")
            all_passed = False
        else:
            print("  -> PASS: Text matches regex.")
    print("-" * 25)
    return all_passed


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Test SGLang server with regex-constrained generation.")
    parser.add_argument("--port", type=int, default=21000, help="Port of the SGLang server (default: 21000)")
    args = parser.parse_args()
    test_prompt = "Generate an email address:"
    test_regex = r"^user@example\.com$"
    num_generations = 3
    print(f"Running test: Generate {num_generations} email(s) with regex '{test_regex}'")
    print("=" * 100)
    SGLANG_BASE_URL = f"http://127.0.0.1:{args.port}"
    success = send_sglang_request(
        base_url=SGLANG_BASE_URL,
        prompt=test_prompt,
        regex=test_regex,
        n=num_generations,
    )
    print("\n--- Final Result ---")
    if success:
        print("✅ All generated texts passed validation.")
        sys.exit(0)
    else:
        print("❌ One or more generated texts failed validation.")
        sys.exit(1)
My design is as follows. The original Spec V2 overlap design handles the process as follows:
Therefore, my design idea is to place the grammar update operation after that point in the pipeline.
Need to fix: the final round should not need to execute this special run_batch; it can be skipped directly by checking whether the grammar has terminated.
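The proposed scheduling fix can be sketched as follows. This is a heavily simplified toy (all class and function names are invented; the real hooks live in SGLang's scheduler): the grammar is updated right after the batch launch point, and the loop skips the extra special run_batch once the grammar has terminated.

```python
# Hedged sketch of the proposed design: update the grammar after the
# launch point, and skip the final special run_batch by checking
# grammar termination. FakeGrammar stands in for a real matcher.

class FakeGrammar:
    def __init__(self, budget):
        self.budget = budget          # steps until the grammar terminates

    def accept_tokens(self, toks):
        self.budget -= 1              # pretend each step consumes budget

    def is_terminated(self):
        return self.budget <= 0

def run_decode_loop(grammar, max_steps=10):
    steps_run = 0
    for _ in range(max_steps):
        # 1) launch the (overlapped) batch
        steps_run += 1
        # 2) update the grammar right after the launch point, so the
        #    next step's vocab mask reflects up-to-date grammar state
        grammar.accept_tokens(["tok"])
        # 3) skip the extra special run_batch once the grammar terminated
        if grammar.is_terminated():
            break
    return steps_run

print(run_decode_loop(FakeGrammar(budget=3)))  # stops after 3 steps, not 10
```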
@Ubospica Can you help review the impl?
Certainly. I will check that out. Thanks for the reference! |
/tag-and-rerun-ci
Now all the tests have passed. cc @hnyls2002
/tag-and-rerun-ci
This PR supports enabling constrained decoding and Speculative decoding v2 at the same time, resolving #13019.
When the batch contains grammars to handle, the overlap is temporarily disabled.
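The batch-level switch described above can be sketched like this (hypothetical names, not SGLang's actual data structures): overlap scheduling stays enabled only while no request in the batch carries a grammar.

```python
# Hedged sketch of "disable overlap when the batch contains grammars":
# a single request with a grammar turns overlap off for the whole batch.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Req:
    grammar: Optional[str] = None     # e.g. a compiled regex/JSON-schema grammar

def use_overlap(batch):
    # Overlap is safe only when no request needs constrained decoding.
    return all(req.grammar is None for req in batch)

print(use_overlap([Req(), Req()]))                        # True  -> overlap on
print(use_overlap([Req(), Req(grammar=r"user@example\.com")]))  # False -> overlap off
```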
Further TODO:
Signed-off-by: Ubospica ubospica@gmail.com
cc @merrymercy @hnyls2002 @jiapingW
Output:
Co-authored-by: Liangsheng Yin lsyincs@gmail.com