[Frontend] Support strict mode for tool calling by chaunceyjiang · Pull Request #45003 · vllm-project/vllm

chaunceyjiang · 2026-06-09T10:24:07Z

Co-authored-by: @cjackal 44624812+cjackal@users.noreply.github.com

Purpose

[Frontend] Support strict mode for tool calling

Test Plan

I tested it locally with Minimax 2.5,Qwen2.5, Qwen3.5, Qwen3.6, Qwen3, and DeepSeek V3.2.

I also tested the tool_choice modes required, auto, and named tool selection, and all of them worked correctly.

Test Result


vllm serve /mnt/data3/models/MiniMax/MiniMax-M2.5 --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2  -tp 4 --port 8001
vllm bench serve --port 8001 --model /mnt/data3/models/MiniMax/MiniMax-M2.5\
  --backend openai-chat --endpoint /v1/chat/completions \
  --dataset-name hf \
  --dataset-path gorilla-llm/Berkeley-Function-Calling-Leaderboard \
  --bfcl-categories simple,live_simple,multiple \
  --num-warmups 5   --temperature 0   --percentile-metrics ttft,tpot,itl,e2el   \
  --max-concurrency 8 --num-prompts 500

main:

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  128.23    
Total input tokens:                      185991    
Total generated tokens:                  86375     
Request throughput (req/s):              3.90      
Output token throughput (tok/s):         673.62    
Peak output token throughput (tok/s):    671.00    
Peak concurrent requests:                17.00     
Total token throughput (tok/s):          2124.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          37.61     
Median TTFT (ms):                        37.76     
P99 TTFT (ms):                           52.79     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.52     
Median TPOT (ms):                        11.53     
P99 TPOT (ms):                           11.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.41     
Median ITL (ms):                         11.40     
P99 ITL (ms):                            23.12     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2013.90   
Median E2EL (ms):                        1576.51   
P99 E2EL (ms):                           5956.04

this pr

============ Serving Benchmark Result ============
Successful requests:                     500       
Failed requests:                         0         
Maximum request concurrency:             8         
Benchmark duration (s):                  130.70    
Total input tokens:                      185991    
Total generated tokens:                  88166     
Request throughput (req/s):              3.83      
Output token throughput (tok/s):         674.57    
Peak output token throughput (tok/s):    643.00    
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          2097.60   
---------------Time to First Token----------------
Mean TTFT (ms):                          37.79     
Median TTFT (ms):                        37.38     
P99 TTFT (ms):                           69.77     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.52     
Median TPOT (ms):                        11.54     
P99 TPOT (ms):                           11.65     
---------------Inter-token Latency----------------
Mean ITL (ms):                           15.34     
Median ITL (ms):                         11.42     
P99 ITL (ms):                            23.14     
----------------End-to-end Latency----------------
Mean E2EL (ms):                          2056.44   
Median E2EL (ms):                        1646.51   
P99 E2EL (ms):                           5965.42   
==================================================

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

cjackal · 2026-06-09T14:01:18Z

this PR supercedes #43678

cjackal · 2026-06-10T11:31:42Z

glm_4_7 (GLM-4.7 / GLM-5) also works well in our internal test.

mergify · 2026-06-10T14:50:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-06-11T03:14:22Z

Documentation preview: https://vllm--45003.org.readthedocs.build/en/45003/

ankrovv · 2026-06-11T07:25:52Z

I tested this with GPT-OSS 120B using:

VLLM_ENFORCE_STRICT_TOOL_CALLING=true
--enable-auto-tool-choice
--tool-call-parser openai
--reasoning-parser openai_gptoss

I noticed the structural tag is not applied in the live GPT-OSS chat render path for tool_choice="required". The request for gpt goes through the _make_request_with_harmony() path which bypasses the generic preprocess_chat(... parser=self.parser ...) path where the structural_tag gets applied

And, if I force the generated Harmony structural tag, it leaks raw Harmony markers into assistant content (reproducible when using ambiguous prompts).
For prompt "Hi, how are you?" with tool_choice="required":

message.content = <|channel|>commentary to=functions.get_weather<|constrain|>json<|message|>{...}
tool_calls = []
finish_reason = stop

Also, the generated Harmony tag for tool_choice="required" contains: "at_least_one": false so even if wired into the live path, it does not appear to enforce required-tool semantics. Could you look into this? Personally, I think using json constraints for gpt oss is a lot simpler since it avoids the overall harmony leak but feel free to do your own deep-dive @chaunceyjiang @cjackal

chaunceyjiang · 2026-06-11T07:32:03Z

@ankrovv Yeah, GPT-OSS isn't supported yet. As you mentioned, it currently goes through a separate code path.

We're in the process of unifying the Harmony and non-Harmony paths, and once that's done, we'll integrate it with structural_tags harmony tags as well.

Overall, using structural tags has been quite effective at improving tool-calling accuracy and reliability.

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

mergify · 2026-06-11T21:43:45Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @chaunceyjiang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ywang96 · 2026-06-11T22:58:12Z

Can we resolve the merge conflict so that we can get this one in? Thanks for the great work from everyone!

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang · 2026-06-12T03:15:53Z

@yzong-rh Thanks for catching this so carefully.

Regarding the MiniMax issue, I believe the root cause is in xgrammar's built-in structural_tag implementation (https://github.com/mlc-ai/xgrammar/blob/main/python/xgrammar/builtin_structural_tag.py#L1455-L1456). I'll discuss it with their team later. In the meantime, I've reimplemented that part on our side, and the issue has been resolved.

As for the OpenAI/Harmony issue, the previous Harmony path bypassed structural_tag entirely. Since I hadn't tested that path yet, and because of your refactoring there, I haven't had a chance to verify the behavior with the new implementation. For now, I've temporarily disabled the OpenAI parser handling until I can properly test it.

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

yzong-rh · 2026-06-12T04:44:23Z

I believe the root cause is in xgrammar's built-in structural_tag implementation (https://github.com/mlc-ai/xgrammar/blob/main/python/xgrammar/builtin_structural_tag.py#L1455-L1456). I'll discuss it with their team later.

Great, thank you. I went through the structural tags in XGRAMMAR_BUILTIN_STRUCTURAL_TAG_MODELS. minimax was indeed the only implementation with that behavior.

A remaining risk is kimi, whose is_reasoning_end allows "implicit reasoning end via tool call section".

vllm/vllm/reasoning/kimi_k2_reasoning_parser.py

Lines 76 to 83 in 7021be6

    
               def is_reasoning_end(self, input_ids: Sequence[int]) -> bool: 
        
                   """ 
        
                   Check if the reasoning content ends in the input_ids. 
        
                   Reasoning ends when we see either: 
        
                   1. The end token (</think>) 
        
                   2. The tool section start token (<|tool_calls_section_begin|>) 
        
                   """

When that triggers, the start of a tool section is already generated. xgrammar's tags would force the start of tool section to be generated again. But this only occurs if kimi "forgot" to close </think>. So not an xgrammar problem, and not a blocker.

For now, I've temporarily disabled the OpenAI parser handling until I can properly test it.

Sg. There might be some challenges due to the harmony channel format. Happy to look into it if you don't have the bandwidth.

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

The qwen3_xml parser was deleted upstream in vllm-project#45003. The old_xml pairing now resolves to the engine parser, making those tests redundant duplicates of the engine pairing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben Browning <bbrownin@redhat.com>

bbrowning · 2026-06-13T13:05:30Z

Since this merged I'm seeing Qwen 3.6 models getting stuck in infinite whitespace generation loops regularly in relatively simple tool calling scenarios with our out of the box setup. It's easily triggered with things like BFCL multi_turn_base, resulting in extremely long generation times and timeouts while the model generates thousands of newline tokens per turn.

We've flip this on by default for a lot of models now in tool calling scenarios, and at least in some cases this has made things worse. We may need a more selective testing and case-by-case enabling of this.

mergify Bot added deepseek Related to DeepSeek models frontend llama Related to Llama models qwen Related to Qwen models tool-calling labels Jun 9, 2026

github-project-automation Bot added this to Tool Calling Jun 9, 2026

mergify Bot added the ci/build label Jun 10, 2026

chaunceyjiang force-pushed the xgrammar_builtin branch from bb47907 to d14e537 Compare June 10, 2026 09:04

chaunceyjiang marked this pull request as ready for review June 10, 2026 09:44

chaunceyjiang requested review from DarkLight1337, aarnphm, bbrowning, russellb and sfeng33 as code owners June 10, 2026 09:44

cjackal reviewed Jun 10, 2026

View reviewed changes

Comment thread vllm/envs.py

mergify Bot added the needs-rebase label Jun 10, 2026

sfeng33 reviewed Jun 10, 2026

View reviewed changes

Comment thread vllm/tool_parsers/abstract_tool_parser.py Outdated

sfeng33 mentioned this pull request Jun 11, 2026

[Responses] Support required function tools for GPT-OSS Harmony #44664

Draft

chaunceyjiang force-pushed the xgrammar_builtin branch from fcc6eae to 0185fc7 Compare June 11, 2026 03:13

chaunceyjiang requested a review from njhill as a code owner June 11, 2026 03:13

mergify Bot added the documentation Improvements or additions to documentation label Jun 11, 2026

chaunceyjiang requested review from cjackal and sfeng33 June 11, 2026 03:19

chaunceyjiang changed the title ~~[Frontend] Integrate xgrammar builtin structural tags for strict tool calling~~ [Frontend] Support strict mode for tool calling Jun 11, 2026

mergify Bot removed the needs-rebase label Jun 11, 2026

sfeng33 reviewed Jun 11, 2026

View reviewed changes

Comment thread vllm/parser/abstract_parser.py

chaunceyjiang force-pushed the xgrammar_builtin branch from 3777089 to 946f4a0 Compare June 11, 2026 07:05

[Frontend] Integrate xgrammar builtin structural tags for strict tool…

6c42b67

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

chaunceyjiang requested review from AndreasKaratzas, NickLucche and robertgshaw2-redhat as code owners June 11, 2026 09:08

mergify Bot added the needs-rebase label Jun 11, 2026

This comment was marked as resolved.

Sign in to view

chaunceyjiang added 2 commits June 12, 2026 10:41

[Frontend] Integrate xgrammar builtin structural tags for strict tool…

c960335

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

[Frontend] Integrate xgrammar builtin structural tags for strict tool…

0d24dec

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

mergify Bot removed the needs-rebase label Jun 12, 2026

[Frontend] Integrate xgrammar builtin structural tags for strict tool…

7da4a95

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

vllm-project deleted a comment from mergify Bot Jun 12, 2026

chaunceyjiang and others added 2 commits June 12, 2026 13:29

[Frontend] Integrate xgrammar builtin structural tags for strict tool…

6bab3ad

… calling Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

Merge branch 'main' into xgrammar_builtin

4d50f7d

chaunceyjiang enabled auto-merge (squash) June 12, 2026 06:55

chaunceyjiang merged commit 2043258 into vllm-project:main Jun 12, 2026
191 checks passed

github-project-automation Bot moved this to Done in Tool Calling Jun 12, 2026

chaunceyjiang mentioned this pull request Jun 12, 2026

[Frontend] Support strict mode for tool calling with ResponsesAPI #45396

Merged

4 tasks

This was referenced Jun 12, 2026

fix: route Kimi forced tools through native parser #43155

Closed

[Bugfix] Clear conflicting structured outputs in strict tool calling #44134

Closed

yzong-rh mentioned this pull request Jun 13, 2026

[Bugfix] Chat Completions Harmony Refactor Clean up #45464

Open

4 tasks

Uh oh!

Conversation

chaunceyjiang commented Jun 9, 2026 • edited by sfeng33 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

cjackal commented Jun 9, 2026

Uh oh!

cjackal commented Jun 10, 2026

Uh oh!

Uh oh!

mergify Bot commented Jun 10, 2026

Uh oh!

Uh oh!

mergify Bot commented Jun 11, 2026

Uh oh!

Uh oh!

ankrovv commented Jun 11, 2026

Uh oh!

chaunceyjiang commented Jun 11, 2026

Uh oh!

mergify Bot commented Jun 11, 2026

Uh oh!

ywang96 commented Jun 11, 2026

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

chaunceyjiang commented Jun 12, 2026

Uh oh!

yzong-rh commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

bbrowning commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

chaunceyjiang commented Jun 9, 2026 •

edited by sfeng33

Loading

yzong-rh commented Jun 12, 2026 •

edited

Loading