Skip to content

[Frontend] Support strict mode for tool calling#45003

Merged
chaunceyjiang merged 19 commits into
vllm-project:mainfrom
chaunceyjiang:xgrammar_builtin
Jun 12, 2026
Merged

[Frontend] Support strict mode for tool calling#45003
chaunceyjiang merged 19 commits into
vllm-project:mainfrom
chaunceyjiang:xgrammar_builtin

Conversation

@chaunceyjiang

@chaunceyjiang chaunceyjiang commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Co-authored-by: @cjackal 44624812+cjackal@users.noreply.github.com

Purpose

[Frontend] Support strict mode for tool calling

Test Plan

I tested it locally with Minimax 2.5,Qwen2.5, Qwen3.5, Qwen3.6, Qwen3, and DeepSeek V3.2.

I also tested the tool_choice modes required, auto, and named tool selection, and all of them worked correctly.

Test Result


vllm serve /mnt/data3/models/MiniMax/MiniMax-M2.5 --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2  -tp 4 --port 8001
vllm bench serve --port 8001 --model /mnt/data3/models/MiniMax/MiniMax-M2.5\
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path gorilla-llm/Berkeley-Function-Calling-Leaderboard \
  --bfcl-categories simple,live_simple,multiple \
  --num-warmups 5   --temperature 0   --percentile-metrics ttft,tpot,itl,e2el   \
  --max-concurrency 8 --num-prompts 500

main:

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  128.23    
Total input tokens:                      185991    
Total generated tokens:                  86375     
Request throughput (req/s):              3.90      
Output token throughput (tok/s):         673.62    
Peak output token throughput (tok/s):    671.00    
Peak concurrent requests:                17.00     
Total token throughput (tok/s):          2124.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          37.61     
Median TTFT (ms):                        37.76     
P99 TTFT (ms):                           52.79     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.52     
Median TPOT (ms):                        11.53     
P99 TPOT (ms):                           11.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.41     
Median ITL (ms):                         11.40     
P99 ITL (ms):                            23.12     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2013.90   
Median E2EL (ms):                        1576.51   
P99 E2EL (ms):                           5956.04 

this pr

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  130.70    
Total input tokens:                      185991    
Total generated tokens:                  88166     
Request throughput (req/s):              3.83      
Output token throughput (tok/s):         674.57    
Peak output token throughput (tok/s):    643.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          2097.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          37.79     
Median TTFT (ms):                        37.38     
P99 TTFT (ms):                           69.77     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.52     
Median TPOT (ms):                        11.54     
P99 TPOT (ms):                           11.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.34     
Median ITL (ms):                         11.42     
P99 ITL (ms):                            23.14     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2056.44   
Median E2EL (ms):                        1646.51   
P99 E2EL (ms):                           5965.42   
==================================================


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

@mergify mergify Bot added deepseek Related to DeepSeek models frontend llama Related to Llama models qwen Related to Qwen models tool-calling labels Jun 9, 2026
@cjackal

cjackal commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

this PR supercedes #43678

@mergify mergify Bot added the ci/build label Jun 10, 2026
@chaunceyjiang chaunceyjiang marked this pull request as ready for review June 10, 2026 09:44
@cjackal

cjackal commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

glm_4_7 (GLM-4.7 / GLM-5) also works well in our internal test.

Comment thread vllm/envs.py
@mergify

mergify Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 10, 2026
Comment thread vllm/tool_parsers/abstract_tool_parser.py Outdated
@mergify

mergify Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Documentation preview: https://vllm--45003.org.readthedocs.build/en/45003/

@mergify mergify Bot added the documentation Improvements or additions to documentation label Jun 11, 2026
@chaunceyjiang chaunceyjiang requested review from cjackal and sfeng33 June 11, 2026 03:19
@chaunceyjiang chaunceyjiang changed the title [Frontend] Integrate xgrammar builtin structural tags for strict tool calling [Frontend] Support strict mode for tool calling Jun 11, 2026
@mergify mergify Bot removed the needs-rebase label Jun 11, 2026
Comment thread vllm/parser/abstract_parser.py
@ankrovv

ankrovv commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

I tested this with GPT-OSS 120B using:

VLLM_ENFORCE_STRICT_TOOL_CALLING=true
--enable-auto-tool-choice
--tool-call-parser openai
--reasoning-parser openai_gptoss

I noticed the structural tag is not applied in the live GPT-OSS chat render path for tool_choice="required". The request for gpt goes through the _make_request_with_harmony() path which bypasses the generic preprocess_chat(... parser=self.parser ...) path where the structural_tag gets applied

And, if I force the generated Harmony structural tag, it leaks raw Harmony markers into assistant content (reproducible when using ambiguous prompts).
For prompt "Hi, how are you?" with tool_choice="required":

message.content = <|channel|>commentary to=functions.get_weather<|constrain|>json<|message|>{...}
tool_calls = []
finish_reason = stop

Also, the generated Harmony tag for tool_choice="required" contains: "at_least_one": false so even if wired into the live path, it does not appear to enforce required-tool semantics. Could you look into this? Personally, I think using json constraints for gpt oss is a lot simpler since it avoids the overall harmony leak but feel free to do your own deep-dive @chaunceyjiang @cjackal

@chaunceyjiang

Copy link
Copy Markdown
Collaborator Author

@ankrovv Yeah, GPT-OSS isn't supported yet. As you mentioned, it currently goes through a separate code path.

We're in the process of unifying the Harmony and non-Harmony paths, and once that's done, we'll integrate it with structural_tags harmony tags as well.

Overall, using structural tags has been quite effective at improving tool-calling accuracy and reliability.

… calling

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@mergify

mergify Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 11, 2026
@ywang96

ywang96 commented Jun 11, 2026

Copy link
Copy Markdown
Member

Can we resolve the merge conflict so that we can get this one in? Thanks for the great work from everyone!

@yzong-rh

This comment was marked as resolved.

@yzong-rh

This comment was marked as resolved.

… calling

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
… calling

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@mergify mergify Bot removed the needs-rebase label Jun 12, 2026
@chaunceyjiang

Copy link
Copy Markdown
Collaborator Author

@yzong-rh Thanks for catching this so carefully.

Regarding the MiniMax issue, I believe the root cause is in xgrammar's built-in structural_tag implementation (https://github.com/mlc-ai/xgrammar/blob/main/python/xgrammar/builtin_structural_tag.py#L1455-L1456). I'll discuss it with their team later. In the meantime, I've reimplemented that part on our side, and the issue has been resolved.

As for the OpenAI/Harmony issue, the previous Harmony path bypassed structural_tag entirely. Since I hadn't tested that path yet, and because of your refactoring there, I haven't had a chance to verify the behavior with the new implementation. For now, I've temporarily disabled the OpenAI parser handling until I can properly test it.

… calling

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@vllm-project vllm-project deleted a comment from mergify Bot Jun 12, 2026
@yzong-rh

yzong-rh commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

I believe the root cause is in xgrammar's built-in structural_tag implementation (https://github.com/mlc-ai/xgrammar/blob/main/python/xgrammar/builtin_structural_tag.py#L1455-L1456). I'll discuss it with their team later.

Great, thank you. I went through the structural tags in XGRAMMAR_BUILTIN_STRUCTURAL_TAG_MODELS. minimax was indeed the only implementation with that behavior.

A remaining risk is kimi, whose is_reasoning_end allows "implicit reasoning end via tool call section".

def is_reasoning_end(self, input_ids: Sequence[int]) -> bool:
"""
Check if the reasoning content ends in the input_ids.
Reasoning ends when we see either:
1. The end token (</think>)
2. The tool section start token (<|tool_calls_section_begin|>)
"""

When that triggers, the start of a tool section is already generated. xgrammar's tags would force the start of tool section to be generated again. But this only occurs if kimi "forgot" to close </think>. So not an xgrammar problem, and not a blocker.

For now, I've temporarily disabled the OpenAI parser handling until I can properly test it.

Sg. There might be some challenges due to the harmony channel format. Happy to look into it if you don't have the bandwidth.

@chaunceyjiang chaunceyjiang enabled auto-merge (squash) June 12, 2026 06:55
@chaunceyjiang chaunceyjiang merged commit 2043258 into vllm-project:main Jun 12, 2026
191 checks passed
bbrowning added a commit to bbrowning/vllm that referenced this pull request Jun 12, 2026
The qwen3_xml parser was deleted upstream in vllm-project#45003. The old_xml
pairing now resolves to the engine parser, making those tests
redundant duplicates of the engine pairing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Ben Browning <bbrownin@redhat.com>
@bbrowning

Copy link
Copy Markdown
Collaborator

Since this merged I'm seeing Qwen 3.6 models getting stuck in infinite whitespace generation loops regularly in relatively simple tool calling scenarios with our out of the box setup. It's easily triggered with things like BFCL multi_turn_base, resulting in extremely long generation times and timeouts while the model generates thousands of newline tokens per turn.

We've flip this on by default for a lot of models now in tool calling scenarios, and at least in some cases this has made things worse. We may need a more selective testing and case-by-case enabling of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build deepseek Related to DeepSeek models documentation Improvements or additions to documentation frontend llama Related to Llama models qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed tool-calling

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

7 participants