Describe the bug

XGrammarGrammarBackend does not respect the eos_token_id field in the HF model's config.json, so when build_ebnf is enabled for some tool_call parsers, the model generates an endless tool_call response instead of stopping the sequence correctly.

From my investigation, this happens because XGrammarGrammarBackend loads the EOS token from the HF tokenizer.json, which only allows a single EOS token, whereas the eos_token_id field in the model's config.json can list multiple EOS tokens. XGrammarGrammarBackend therefore never learns about the additional EOS tokens defined in config.json, so even when the model tries to end the sequence with one of them, the token is masked out by XGrammarGrammarBackend's vocab_mask tensor.
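For illustration, here is a minimal sketch of the mismatch (the checkpoint name is only an example; any model whose config.json declares a list-valued eos_token_id, such as the Llama 3.1 instruct family, shows the same discrepancy):

    from transformers import AutoConfig, AutoTokenizer

    # Example checkpoint only; substitute any model whose config.json
    # declares a list of EOS ids.
    model = "meta-llama/Llama-3.1-8B-Instruct"

    config = AutoConfig.from_pretrained(model)
    tokenizer = AutoTokenizer.from_pretrained(model)

    # config.json may declare several EOS ids, e.g. [128001, 128008, 128009]
    print("config.json eos_token_id:", config.eos_token_id)

    # the tokenizer exposes only a single EOS id, e.g. 128009
    print("tokenizer eos_token_id:", tokenizer.eos_token_id)

One possible direction for a fix, assuming recent xgrammar versions let the caller override the stop tokens when building the TokenizerInfo (the stop_token_ids keyword below is my assumption about the xgrammar API, not verified against 0.1.21): feed the full EOS set from config.json into the grammar backend instead of relying on tokenizer.json alone:

    import xgrammar as xgr

    # config.json's eos_token_id may be a single int or a list of ints;
    # normalize it to a list before handing it to the grammar backend.
    eos = config.eos_token_id
    stop_token_ids = [eos] if isinstance(eos, int) else list(eos)

    # stop_token_ids is an assumed keyword argument here; the point is that
    # the backend's vocab_mask must allow every EOS id, not just the one
    # taken from tokenizer.json.
    tokenizer_info = xgr.TokenizerInfo.from_huggingface(
        tokenizer, stop_token_ids=stop_token_ids
    )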
Reproduction

n/a

Environment

(venv) jobuser [ ~/sglang ]$ python3 -m sglang.check_env
Python: 3.10.14 (main, Jul 14 2024, 22:24:12) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.6, V12.6.77
CUDA Driver Version: 550.163.01
PyTorch: 2.7.1+cu126
sglang: 0.4.9.post3
sgl_kernel: 0.2.7
flashinfer_python: 0.2.9rc1
triton: 3.3.1
transformers: 4.55.0.dev0
torchao: 0.9.0
numpy: 2.2.6
aiohttp: 3.12.14
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.1
interegular: 0.3.3
modelscope: 1.28.1
orjson: 3.11.1
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.0
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.21
openai: 1.97.1
tiktoken: 0.9.0
anthropic: 0.59.0
litellm: 1.74.8
decord: 0.6.0
NVIDIA Topology:
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   NIC0   NIC1   NIC2   NIC3   CPU Affinity     NUMA Affinity   GPU NUMA ID
GPU0     X     NV18   NV18   NV18   NV18   NV18   NV18   NV18   SYS    SYS    SYS    SYS    0-63,128-191     0               N/A
GPU1    NV18    X     NV18   NV18   NV18   NV18   NV18   NV18   SYS    SYS    SYS    SYS    0-63,128-191     0               N/A
GPU2    NV18   NV18    X     NV18   NV18   NV18   NV18   NV18   SYS    SYS    SYS    SYS    0-63,128-191     0               N/A
GPU3    NV18   NV18   NV18    X     NV18   NV18   NV18   NV18   SYS    SYS    SYS    SYS    0-63,128-191     0               N/A
GPU4    NV18   NV18   NV18   NV18    X     NV18   NV18   NV18   NODE   NODE   PIX    NODE   64-127,192-255   1               N/A
GPU5    NV18   NV18   NV18   NV18   NV18    X     NV18   NV18   NODE   NODE   NODE   PIX    64-127,192-255   1               N/A
GPU6    NV18   NV18   NV18   NV18   NV18   NV18    X     NV18   NODE   PIX    NODE   NODE   64-127,192-255   1               N/A
GPU7    NV18   NV18   NV18   NV18   NV18   NV18   NV18    X     PIX    NODE   NODE   NODE   64-127,192-255   1               N/A
NIC0    SYS    SYS    SYS    SYS    NODE   NODE   NODE   PIX     X     NODE   NODE   NODE
NIC1    SYS    SYS    SYS    SYS    NODE   NODE   PIX    NODE   NODE    X     NODE   NODE
NIC2    SYS    SYS    SYS    SYS    PIX    NODE   NODE   NODE   NODE   NODE    X     NODE
NIC3    SYS    SYS    SYS    SYS    NODE   PIX    NODE   NODE   NODE   NODE   NODE    X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
ulimit soft: 10000000