Checklist
Describe the bug
The sglang results with and without the JSON schema are different (with the schema, use_case comes back empty and number_of_parameters degrades to "7–billion"), while the results generated from the vllm server (v0.6.4.post1) remain the same in both cases.
Reproduction
How to start sglang server
services:
  llm-sglang-dev:
    image: lmsysorg/sglang:latest
    container_name: llm-sglang-dev
    restart: unless-stopped
    environment:
      HUGGING_FACE_HUB_TOKEN: <my-hf-token>
    ports:
      - "8007:8007"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    env_file:
      - .env
    command: >
      python3 -m sglang.launch_server
      --model Qwen/Qwen2.5-7B-Instruct-AWQ
      --host 0.0.0.0
      --port 8007
      --api-key <my-api-key>
      --served-model-name gpt-4o
      --tensor-parallel-size 1
      --mem-fraction-static 0.4
      --random-seed 42
      --enable-p2p-check
      --show-time-cost
      --quantization awq_marlin
      --grammar-backend xgrammar
      --enable-cache-report
      --context-length 2048
How to start vllm server
services:
  llm-vllm:
    image: vllm/vllm-openai:latest
    container_name: llm-vllm
    restart: unless-stopped
    environment:
      HUGGING_FACE_HUB_TOKEN: <my-hf-token>
    ports:
      - "8007:8007"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0']
              capabilities: [gpu]
    ipc: host
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --host 0.0.0.0
      --port 8007
      --api-key <my-api-key>
      --max-model-len 16382
      --tensor-parallel-size 1
      --gpu-memory-utilization 0.8
      --served-model-name gpt-4o
      --seed 42
      --disable-log-requests
      --enable-prefix-caching
      --model Qwen/Qwen2.5-7B-Instruct-AWQ
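Both compose files expose the same OpenAI-compatible API on port 8007, so whichever server is running can be sanity-checked before sending the extraction request. A minimal sketch, assuming the same base URL and API key used in the script below:

import openai

# Point the client at whichever server (sglang or vllm) is currently listening on port 8007.
client = openai.OpenAI(base_url="http://localhost:8007/v1", api_key="Lizai@54321")

# Both servers alias the model as "gpt-4o" via --served-model-name, so it should appear here.
print([m.id for m in client.models.list().data])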
Python script
import json

import openai
from pydantic import BaseModel

client = openai.OpenAI(
    base_url="http://localhost:8007/v1",
    api_key="Lizai@54321"
)

class Players(BaseModel):
    names: list[str]

class Model(BaseModel):
    name: str
    number_of_parameters: str
    number_of_max_tokens: str
    architecture: list[str]

class Usage(BaseModel):
    use_case: list[str]
    license: str

class Schema(BaseModel):
    model: Model
    usage: Usage

document = """We introduce Mistral 7B, a 7–billion-parameter language model engineered for
superior performance and efficiency. Mistral 7B outperforms the best open 13B
model (Llama 2) across all evaluated benchmarks, and the best released 34B
model (Llama 1) in reasoning, mathematics, and code generation. Our model
leverages grouped-query attention (GQA) for faster inference, coupled with sliding
window attention (SWA) to effectively handle sequences of arbitrary length with a
reduced inference cost. We also provide a model fine-tuned to follow instructions,
Mistral 7B – Instruct, that surpasses Llama 2 13B – chat model both on human and
automated benchmarks. Our models are released under the Apache 2.0 license.
Code: <https://github.com/mistralai/mistral-src>
Webpage: <https://mistral.ai/news/announcing-mistral-7b/>"""
template = """{
"model": {
"name": "",
"number_of_parameters": "",
"number_of_max_tokens": "",
"architecture": []
},
"usage": {
"use_case": [],
"licence": ""
}
}"""
schema = json.dumps(json.loads(template))
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Task: Extract precise values for each field in the provided output schema from the given document.\nInstructions:\n-Do not hallucinate, paraphrase, or modify the extracted values.\n- If a field has no corresponding value in the document, leave it as an empty string (\"\").\n\nDocument:\n{document}\nSchema:\n{schema}"}
    ],
    temperature=0.0,
    max_tokens=256,
    extra_body={
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": Schema.__name__,
                "schema": Schema.model_json_schema()
            }
        }
    }
)
print(completion.choices[0].message.content)
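The "without json_schema" results below were presumably obtained with the same request minus the extra_body block. It is also worth noting that the Pydantic schema requires the key "license" while the prompt template spells it "licence"; printing the schema that the grammar backend actually receives makes that mismatch visible. A sketch, reusing client, document, and schema from the script above:

# Unconstrained baseline: the identical request, just without response_format in extra_body.
baseline = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": f"Task: Extract precise values for each field in the provided output schema from the given document.\nInstructions:\n-Do not hallucinate, paraphrase, or modify the extracted values.\n- If a field has no corresponding value in the document, leave it as an empty string (\"\").\n\nDocument:\n{document}\nSchema:\n{schema}"}
    ],
    temperature=0.0,
    max_tokens=256,
)
print(baseline.choices[0].message.content)

# The JSON schema handed to the grammar backend; note it uses "license",
# while the prompt template above uses "licence".
print(json.dumps(Schema.model_json_schema(), indent=2))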
Results without json_schema
vllm
{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "licence": "Apache 2.0"
  }
}
sglang
{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "licence": "Apache 2.0"
  }
}
Results with json_schema
vllm
{
  "model": {
    "name": "Mistral 7B",
    "number_of_parameters": "7 billion",
    "number_of_max_tokens": "",
    "architecture": ["grouped-query attention (GQA) for faster inference", "sliding window attention (SWA)"]
  },
  "usage": {
    "use_case": ["superior performance and efficiency", "reasoning", "mathematics", "code generation", "following instructions"],
    "license": "Apache 2.0"
  }
}
sglang (xgrammar backend)
{"model": {"name": "Mistral 7B", "number_of_parameters": "7–billion", "number_of_max_tokens": "", "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]}, "usage": {"use_case": [], "license": "Apache 2.0"}}
sglang (outlines backend)
{"model": {"name": "Mistral 7B", "number_of_parameters": "7–billion", "number_of_max_tokens": "", "architecture": ["grouped-query attention (GQA)", "sliding window attention (SWA)"]}, "usage": {"use_case": [], "license": "Apache 2.0"}}
Environment
Python: 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA A10G
GPU 0 Compute Capability: 8.6
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.1, V12.1.105
CUDA Driver Version: 550.120
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.46.3
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.7
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.26.2
interegular: 0.3.3
psutil: 6.1.0
pydantic: 2.10.1
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.55.1
anthropic: 0.39.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-47 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576