fix: tool_choice="required" falls back to tool_parser for non-JSON formats (streaming + non-streaming) by voipmonitor · Pull Request #35936 · vllm-project/vllm

voipmonitor · 2026-03-04T00:57:25Z

Summary

tool_choice="required" fails when the model produces non-JSON tool calls (e.g. Qwen3 XML format), even though a compatible --tool-call-parser (e.g. qwen3_coder) is configured. This affects both the non-streaming and streaming code paths.

Root cause

Both tool_choice="required" code paths bypass the configured tool_parser and use JSON-only parsing, while the tool_choice="auto" branches correctly use the tool parser.

Non-streaming path (`engine/serving.py`)

In _parse_tool_calls_from_content(), the code uses pydantic TypeAdapter JSON validation only. When the model produces XML-formatted tool calls (e.g. <tool_call>{"name": ...}</tool_call>), parsing fails with ValidationError. #36841 changed this block to silently suppress ValidationError (fixing the crash when content is None), but this means non-JSON tool calls are now silently dropped instead of being routed to the tool parser.

Streaming path (`chat_completion/serving.py`)

Three issues in the streaming required branch:

tool_parsers not initialized -- tool_parsers array is only created when tool_choice_auto is true. When tool_choice="required", it stays empty, so even if a parser is configured, it is never used.
if/else prevents tool parsing when reasoning ends -- With MTP (multi-token prediction), the model can send the </think> reasoning-end token and the beginning of tool call content in the same chunk. The original if/else structure (reasoning processing vs. tool call processing) meant that the iteration that detected reasoning-end would skip tool call extraction entirely.
No tool_parser fallback for non-JSON formats -- The required branch only has extract_tool_call_required_streaming() which expects JSON. Models using XML-style tool calls (like Qwen3 with qwen3_coder parser) fail silently.

Fix

Non-streaming (`engine/serving.py`)

Replace the contextlib.suppress(ValidationError) from #36841 with a try/except that preserves the crash-safety (content = content or "") while adding a fallback: when JSON parsing fails with ValidationError or JSONDecodeError, fall back to the configured tool_parser.extract_tool_calls() (when enable_auto_tools is true and a parser is available). Uses the new tool_parser_cls(tokenizer, request.tools) constructor from #38029.

Streaming (`chat_completion/serving.py`)

Initialize tool_parsers for required too -- Create parser instances when tool_choice is "auto" OR "required".
Separate if statements -- Change the if reasoning / else tool_parsing to two independent if blocks so both can execute in the same iteration when reasoning ends.
Dual parser approach -- When a tool_parser is available, try it first (for XML/custom formats). If it has not detected tool calls yet, also try the JSON extract_tool_call_required_streaming() as fallback. This handles the non-deterministic output from MTP, where the same model may produce JSON in one request and XML in another.

Note: content alongside tool_calls

When using the tool_parser fallback for required, the parser may emit partial text as delta.content before it detects the tool call pattern (e.g., qwen3_coder looks for XML <tool_call> tags but the model outputs JSON [{"name":...). This means some SSE chunks may contain both content and tool_calls. Clients consuming tool_choice="required" responses should prioritize tool_calls over content when both are present in the stream -- the content in that case is leaked parser buffer text, not a real text response.

Discussion point: Dual parser approach

Fix #3 is somewhat opinionated. We use it because with MTP (speculative decoding), the same model (Qwen3) non-deterministically produces either JSON or XML tool calls across different requests. The dual approach handles both formats reliably. However, if the vLLM team prefers a different strategy (e.g., always routing through the tool_parser for required), we are happy to adjust.

Testing

Unit tests added in tests/tool_use/test_tool_choice_required_fallback.py
Tested with Qwen3-Coder 32B + MTP (--tool-call-parser qwen3_coder --enable-auto-tool-choice):
- 30/30 streaming requests with tool_choice="required" succeed
- 10/10 agentic RAG pipeline queries (multi-turn tool calling) succeed
Zero impact on JSON-based models (GLM, Llama, Hermes) -- JSON path succeeds immediately, fallback never triggers

Files changed

vllm/entrypoints/openai/engine/serving.py -- Non-streaming fallback (replaces silent suppress from [Bugfix] Fix crash when tool_choice=required exceeds max_tokens #36841 with try/except + tool_parser fallback)
vllm/entrypoints/openai/chat_completion/serving.py -- Streaming fixes (3 changes)
tests/tool_use/test_tool_choice_required_fallback.py -- Unit tests (new)

gemini-code-assist

Code Review

This pull request effectively addresses a bug where tool_choice="required" would fail for non-JSON tool call formats. The fix, which involves adding a try...except block to fall back to the configured tool_parser, is logical and mirrors the behavior of tool_choice="auto". This is a good, low-risk solution. I have one suggestion to make the exception handling more specific and robust.

gemini-code-assist · 2026-03-04T00:59:27Z

+                        for tool_call in tool_calls
+                    ]
+                )
+            except Exception:


Using except Exception: is too broad and can mask unexpected errors (e.g., KeyboardInterrupt). It's a best practice to catch more specific exceptions. In this case, TypeAdapter(...).validate_json() raises pydantic.ValidationError on failure. Catching this specific exception makes the code safer and the intent clearer. You will need to add from pydantic import ValidationError at the top of the file.

Suggested change

except Exception:

except ValidationError:

mergify · 2026-03-04T01:02:14Z

Hi @voipmonitor, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

mergify · 2026-03-04T12:45:03Z

Hi @voipmonitor, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

mergify · 2026-03-04T15:27:33Z

Hi @voipmonitor, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Cherry-pick upstream fixes for GB10 Spark (SM121): - PR vllm-project#35568: Recognize SM121 as SM120 family for Marlin/CUTLASS FP8 kernels (generate_kernels.py, ops.cu, scaled_mm*.cuh, marlin_utils.py) - PR vllm-project#35675: Fix Qwen3.5 MTP fc layer weight shape mismatch with NVFP4 by using ReplicatedLinear with quant_config=None - PR vllm-project#35833: FP8 KV cache for Triton MLA decode on Blackwell — adds on-the-fly FP8 dequantization in Triton kernels - PR vllm-project#35936: tool_choice="required" falls back to tool_parser for non-JSON (XML) tool calls from Qwen3 models Local patches: - Patch FlashInfer TRTLLM JIT to compile for SM12x (supported_major_versions=[10] → [10, 12]) - Skip VLLM_TEST_FORCE_FP8_MARLIN for NVFP4 MoE (not SM121-ready)

mergify · 2026-03-13T00:46:35Z

Hi @voipmonitor, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Cherry-pick upstream fixes for GB10 Spark (SM121): - PR vllm-project#35568: Recognize SM121 as SM120 family for Marlin/CUTLASS FP8 kernels (generate_kernels.py, ops.cu, scaled_mm*.cuh, marlin_utils.py) - PR vllm-project#35675: Fix Qwen3.5 MTP fc layer weight shape mismatch with NVFP4 by using ReplicatedLinear with quant_config=None - PR vllm-project#35833: FP8 KV cache for Triton MLA decode on Blackwell — adds on-the-fly FP8 dequantization in Triton kernels - PR vllm-project#35936: tool_choice="required" falls back to tool_parser for non-JSON (XML) tool calls from Qwen3 models Local patches: - Patch FlashInfer TRTLLM JIT to compile for SM12x (supported_major_versions=[10] → [10, 12]) - Skip VLLM_TEST_FORCE_FP8_MARLIN for NVFP4 MoE (not SM121-ready)

PR vllm-project#35675 equivalent (MTP fc layer fix) Updated qwen3_5_mtp.py Switched import from ColumnParallelLinear to ReplicatedLinear Changed FC construction from self.fc = ColumnParallelLinear(...) to self.fc = ReplicatedLinear(...) Removed TP-only args (gather_output, return_bias) Set quant_config=None for this layer Updated call site to unpack tuple: hidden_states, _ = self.fc(hidden_states) PR vllm-project#35936 equivalent (tool_choice="required" fallback) Updated engine/serving.py Replaced JSON parse suppress-block at elif request.tool_choice == "required": New flow: First try TypeAdapter(...).validate_json(content) On ValidationError or JSON decode error, fallback to configured tool parser when available Convert parsed tool calls into FunctionCall(...) entries Removed now-unused contextlib import Signed-off-by: ec-jt <james.trappett@elementalcompute.com>

…rmats When tool_choice="required" and the model produces non-JSON tool calls (e.g. XML from Qwen3 with qwen3_coder parser), both non-streaming and streaming paths now fall back to the configured tool_parser instead of silently dropping tool calls or failing. Non-streaming (engine/serving.py): - Replace contextlib.suppress(ValidationError) from vllm-project#36841 with try/except that preserves crash-safety (content or "") while adding fallback to tool_parser.extract_tool_calls() for non-JSON formats. Streaming (chat_completion/serving.py): - Initialize tool_parsers for "required" (not just "auto"). - Use separate if blocks (not if/else) so tool parsing runs in the same iteration when reasoning ends (critical for MTP/speculative decoding where </think> and tool call arrive in one chunk). - Dual parser: try tool_parser first (XML), fall back to JSON-only extract_tool_call_required_streaming() for non-deterministic MTP. Signed-off-by: voipmonitor <festr@voipmonitor.org>

andrea-tomassi · 2026-04-13T06:58:12Z

Reproduction confirmed: Qwen3.5-27B AWQ-4bit + vLLM v0.19.0 + qwen3_xml

We hit this independently while debugging with cyankiwi/Qwen3.5-27B-AWQ-4bit on vllm/vllm-openai:v0.19.0.

Key observation: the bug affects qwen3_xml as well as qwen3_coder. After applying the reasoning parser fix (nleve fork / PR #35687), tool_choice: "required" still returned tool_calls: [] with finish_reason: "tool_calls" — even though the same request with tool_choice: "auto" worked reliably every time.

Traced to the same place: the required branch in _parse_tool_calls_from_content() uses TypeAdapter(list[FunctionDefinition]).validate_json(content) wrapped in contextlib.suppress(ValidationError). The model's <tool_call>...</tool_call> XML output always raises ValidationError → silently returns empty. finish_reason: "tool_calls" is set regardless of whether any calls were extracted, which makes the symptom very misleading.

The auto branch correctly calls tool_parser.extract_tool_calls(content, request) and works reliably.

Confirmed the streaming issue too — tool_parsers is never initialized for the required path.

Current workaround: tool_choice: "auto" from the client. The fix here looks correct — hope the pre-commit issues get resolved soon, this is a real production blocker.

fishingpvalues · 2026-04-16T11:02:52Z

I wired this PR into a custom patched vllm version and it fixed the tool calling. The model now reliably calls all tools flawlessly.

chaunceyjiang · 2026-04-16T11:05:33Z

How can I reproduce this issue? @voipmonitor @fishingpvalues

mergify · 2026-04-16T11:06:21Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @voipmonitor.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

fishingpvalues · 2026-04-16T11:33:49Z

How can I reproduce this issue? @voipmonitor @fishingpvalues

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-8B \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="x",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                },
                "required": ["city"],
            },
        },
    }
]

messages = [
    {"role": "user", "content": "What's the weather in Berlin?"},
]

# --- Non-streaming repro ---
resp = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=messages,
    tools=tools,
    tool_choice="required",  # buggy
)

choice = resp.choices[0]
print("Non-streaming:")
print(f"  finish_reason : {choice.finish_reason}")
print(f"  tool_calls    : {choice.message.tool_calls}")  # should print: None/[]
print(f"  content       : {choice.message.content!r}")   # should return raw XML here

# --- Streaming repro ---
print("\nStreaming:")
tool_calls_seen = False

with client.chat.completions.create(
    model="Qwen/Qwen3.5-8B",
    messages=messages,
    tools=tools,
    tool_choice="required",
    stream=True,
) as stream:
    for chunk in stream:
        delta = chunk.choices[0].delta if chunk.choices else None
        if delta and delta.tool_calls:
            tool_calls_seen = True
            print(f"  tool_call delta: {delta.tool_calls[0]}")

if not tool_calls_seen:
    print("  BUG: no tool_call deltas received")

noonghunna · 2026-05-12T16:58:54Z

Independent cross-rig reproduction confirming the diagnosis — adding the data in case it helps push this forward.

Repro: Qwen 3.6-27B AutoRound INT4 + qwen3_coder tool parser + tool_choice="required"

vLLM image: nightly-1acd67a795ebccdf9b9db7697ae9082058301657 (0.20.2rc1.dev129+g1acd67a79, post-[Perf][3/n] Eliminate GPU<->CPU syncs in attention impls #41434)
Compose: dual RTX 3090 PCIe (TP=2, INT8 PTH KV, MTP n=4)
Flags: --enable-auto-tool-choice --tool-call-parser qwen3_coder --default-chat-template-kwargs '{"enable_thinking": false}'
Agent harness: MLS-Bench (sends tool_choice="required" from its OpenAIClient)

Behavior matches the report exactly — same prompt, same tools, two requests:

tool_choice="auto" → tool_calls=[edit(...)], content="I'll start by reading..." ✓
tool_choice="required" → tool_calls=[], content="", finish_reason="tool_calls" ✗

The qwen3_coder parser emits XML-style <tool_call>{...}</tool_call> blocks — under auto they get parsed and surfaced; under required the JSON-only fallback bypasses the parser and silently drops them, matching the root-cause analysis here.

Local workaround for anyone bitten: keep tool_choice="auto" (the parser path runs and tool calls land reliably).

Cross-rig confirmations now stacked on this issue:

@andrea-tomassi (2026-04-13): Qwen3.5-27B AWQ-4bit + qwen3_xml on v0.19.0
@fishingpvalues (2026-04-16): patched fork validates the fix
this comment (2026-05-12): Qwen 3.6-27B AutoRound INT4 + qwen3_coder on post-[Perf][3/n] Eliminate GPU<->CPU syncs in attention impls #41434 nightly

Happy to test a rebased version on Qwen 3.6-27B + the MLS-Bench harness if/when the merge conflicts clear. Thanks @voipmonitor for the careful diagnosis.

Vendored local overlay for vllm-project/vllm#35936 — fixes empty tool_calls=[] response when tool_choice="required" is used with the qwen3_coder parser. Under "required", vLLM forces structured JSON output regardless of which parser is configured; the qwen3_coder parser then scans the JSON for its XML <tool_call> sentinel, finds none, and returns tools_called=False. Overlay reshapes _parse_tool_calls_from_content() in vllm/entrypoints/openai/engine/serving.py to try JSON-validate first when tool_choice="required", and fall back to the configured parser only when validation fails — covering both the supports_required_and_named True and False parser paths. End-to-end validated 2026-05-12: - curl tool_choice="required" + get_weather tool: tool_calls populated - curl tool_choice="auto" regression: still works - MLS-Bench ml-ensemble-boosting with thinking.enabled=false: agent reaches Step 1 (edit) with no "No action returned" stall Mounted into the 23 Qwen 3.6-27B composes pinning the post-#41434 nightly-1acd67a79 image. The single compose on the older 01d4d1ad3 pin (dual/tq3-mtp-genesis.yml) is intentionally excluded — its source tree diverges from the rebase base. Added UPSTREAM.md row tracking the upstream PR + drop trigger. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mergify · 2026-05-18T10:29:07Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @voipmonitor.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

voipmonitor requested review from DarkLight1337, aarnphm, chaunceyjiang and russellb as code owners March 4, 2026 00:57

mergify Bot added the frontend label Mar 4, 2026

gemini-code-assist Bot reviewed Mar 4, 2026

View reviewed changes

voipmonitor force-pushed the fix/tool-choice-required-xml-parser branch from 2817ac6 to ac8975e Compare March 4, 2026 01:04

voipmonitor changed the title ~~fix: tool_choice="required" falls back to tool_parser for non-JSON formats~~ fix: tool_choice="required" falls back to tool_parser for non-JSON formats (streaming + non-streaming) Mar 4, 2026

mergify Bot added the tool-calling label Mar 4, 2026

github-project-automation Bot added this to Tool Calling Mar 4, 2026

voipmonitor force-pushed the fix/tool-choice-required-xml-parser branch from 4fb8b2c to 50662ad Compare March 4, 2026 14:10

voipmonitor force-pushed the fix/tool-choice-required-xml-parser branch from e57e4f9 to eed0ae7 Compare March 13, 2026 00:42

will-deines mentioned this pull request Mar 17, 2026

[Responses API] Unified tool_choice + structured output via triggered tags will-deines/vllm#1

Closed

7 tasks

will-deines mentioned this pull request Mar 18, 2026

[Responses API] tool_choice support (auto / required / none) for GPT-OSS #37433

Open

12 tasks

voipmonitor force-pushed the fix/tool-choice-required-xml-parser branch from eed0ae7 to a269629 Compare March 31, 2026 13:35

This was referenced Apr 13, 2026

vLLM 0.19 may lose tool calls for Qwen/Qwen3.5-35B-A3B-FP8 when XML tool_call is emitted inside <think> #39056

Open

[Bugfix] Treat <tool_call> as implicit reasoning end in Qwen3 parser #35687

Merged

chaunceyjiang self-assigned this Apr 16, 2026

mergify Bot added the needs-rebase label Apr 16, 2026

hoobnn mentioned this pull request May 15, 2026

[Bugfix] Honor tool_choice="none" in Chat Completions streaming #42752

Merged

4 tasks

mergify Bot removed the needs-rebase label May 18, 2026

mergify Bot added the needs-rebase label May 18, 2026

alexeldeib mentioned this pull request May 19, 2026

fix: route Kimi forced tools through native parser #43155

Closed

4 tasks

NYU-ZYJCS mentioned this pull request May 25, 2026

Closed #43613

Closed

pst2154 mentioned this pull request Jun 5, 2026

Fix Qwen3-Coder required tool parsing #44447

Open

Uh oh!

Conversation

voipmonitor commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Non-streaming path (engine/serving.py)

Streaming path (chat_completion/serving.py)

Fix

Non-streaming (engine/serving.py)

Streaming (chat_completion/serving.py)

Note: content alongside tool_calls

Discussion point: Dual parser approach

Testing

Files changed

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

mergify Bot commented Mar 4, 2026

Uh oh!

mergify Bot commented Mar 4, 2026

Uh oh!

mergify Bot commented Mar 4, 2026

Uh oh!

mergify Bot commented Mar 13, 2026

Uh oh!

andrea-tomassi commented Apr 13, 2026

Uh oh!

fishingpvalues commented Apr 16, 2026

Uh oh!

chaunceyjiang commented Apr 16, 2026

Uh oh!

mergify Bot commented Apr 16, 2026

Uh oh!

fishingpvalues commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

noonghunna commented May 12, 2026

Uh oh!

mergify Bot commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

voipmonitor commented Mar 4, 2026 •

edited

Loading

Non-streaming path (`engine/serving.py`)

Streaming path (`chat_completion/serving.py`)

Non-streaming (`engine/serving.py`)

Streaming (`chat_completion/serving.py`)

fishingpvalues commented Apr 16, 2026 •

edited

Loading