[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode by voipmonitor · Pull Request #35615 · vllm-project/vllm

voipmonitor · 2026-02-28T15:59:58Z

Summary

Fixes a critical bug where the Qwen3Coder streaming tool parser silently drops parameters when speculative decoding is enabled (e.g. num_speculative_tokens=5). This causes downstream clients to receive tool calls with missing fields — for example a write tool call arrives as {"filePath": "/path/file"} with no content field, which the client sees as content: undefined.

Supersedes #35421, which partially addressed the issue but left the two root causes unfixed.

Why #35421 doesn't work

PR #35421 correctly identified and fixed several real issues:

streamed_args_for_tool tracking (append "" on header, accumulate "{"/"}")
The json_started condition that skipped "{" when parameter_prefix was in delta_text
The serving.py double-serialization of already-JSON-string arguments

However, it did not fix the two structural bugs that actually cause parameter loss under speculative decoding:

Bug 1: Close-before-params ordering (the root cause)

In the current code on main (and unchanged by #35421), the </function> close check runs before the parameter extraction loop:

# Line 531 — runs FIRST
if not self.json_closed and self.function_end_token in tool_text:
    self.json_closed = True
    ...
    return DeltaMessage(arguments="}")  # ← emits "}" and returns

# Line 591 — runs SECOND (but never reached if </function> fired above)
if not self.in_param and self.param_count < len(param_starts):
    # process parameter...

With speculative decoding, a single burst can deliver the final parameter value together with </function>. The close check sees </function> in tool_text, emits "}", sets in_function = False, and returns. The parameter loop never executes. The parameter is silently lost.

This PR fixes it by moving parameter processing before the close check.

Bug 2: Single-pass `if` + `return None` early exit

The parameter extraction uses a single if block that processes exactly one parameter per call. When a parameter is incomplete (still streaming), it executes return None:

else:
    # Still streaming, wait for more content
    return None  # ← bails out entirely

With speculative decoding, a burst can contain multiple parameters where the first N are complete but the last is still streaming. The return None discards all the already-complete parameters that were never emitted.

This PR fixes it by using a while loop that accumulates all complete parameter fragments, with break (instead of return None) when one is incomplete. Already-complete fragments are emitted before yielding control.

Bug 3: `already_added` dedup prevents same-name tool calls

The current code deduplicates prev_tool_call_arr entries by function name:

already_added = any(tool.get("name") == self.current_function_name ...)
if not already_added:
    self.prev_tool_call_arr.append(...)

Two calls to the same function (e.g. two read calls) share a single entry. Then current_tool_index=1 but prev_tool_call_arr has length 1 → IndexError.

This PR fixes it by always appending (each tool call is a separate invocation) and using index-based lookups.

Changes

vllm/tool_parsers/qwen3coder_tool_parser.py:
- Reorder: process parameters FIRST (while loop), check </function> AFTER
- Replace single-pass if with while loop + break + fragment accumulation
- Remove already_added name-based dedup; always append to prev_tool_call_arr
- Use index-based prev_tool_call_arr update (companion to always-append)
- Track streamed_args_for_tool throughout (header, {, fragments, })
- Make json_started check unconditional (remove parameter_prefix not in delta_text)
- Replace silent pass with logger.debug on parse errors
- Remove dead in_param branch (never activated)
vllm/entrypoints/openai/chat_completion/serving.py:
- Guard against double-serialization when arguments is already a JSON string

Test plan

Verify with single-token streaming (no speculative decode) — behavior should be unchanged
Verify with num_speculative_tokens=5 — multi-parameter tool calls (e.g. write with filePath + content) should have all parameters present
Verify parallel tool calls with the same function name (e.g. two read calls) work correctly
Run existing test_qwen3coder_tool_parser.py tests

🤖 Generated with Claude Code

gemini-code-assist

Code Review

This pull request provides a comprehensive fix for a critical bug causing parameter loss in Qwen3Coder's streaming tool parser, especially with speculative decoding. The changes are well-reasoned and address the identified root causes effectively by reordering parameter parsing, using a while loop for multiple parameters, and correcting the logic for repeated tool calls. The related fixes in serving.py and improved logging also enhance robustness. I have one suggestion to further improve the streaming parser's resilience to malformed tool call structures.

gemini-code-assist · 2026-02-28T16:02:16Z

+                    else:
+                        if self.tool_call_end_token in tool_text:
+                            param_end_idx = len(value_text)
+                        else:
+                            # Parameter incomplete — break instead of
+                            # return so we still emit any fragments
+                            # accumulated by earlier loop iterations.
+                            break


The fallback logic for determining the end of a parameter value is not fully robust. If </function> is missing and </tool_call> is present anywhere in tool_text, param_end_idx is set to len(value_text). This can incorrectly include the </tool_call> tag in the parameter's value if the input is malformed, leading to invalid JSON fragments being streamed.

A safer approach is to find the index of </tool_call> specifically within value_text and use that as the boundary. If it's not found, it's better to break and treat the parameter as incomplete. This improves robustness against malformed inputs during streaming.

Suggested change

else:

if self.tool_call_end_token in tool_text:

param_end_idx = len(value_text)

else:

# Parameter incomplete — break instead of

# return so we still emit any fragments

# accumulated by earlier loop iterations.

break

else:

# Fallback for malformed XML where </function> is missing.

# Use </tool_call> as a delimiter if present in the value.

tool_end_idx_in_value = value_text.find(self.tool_call_end_token)

if tool_end_idx_in_value != -1:

param_end_idx = tool_end_idx_in_value

else:

# Parameter incomplete — break so we still emit any

# fragments accumulated by earlier loop iterations.

break

…loss With speculative decoding (e.g. num_speculative_tokens=5), token bursts can deliver multiple complete parameters and </function> in a single delta. The existing streaming parser had two ordering/control-flow bugs that caused parameters to be silently dropped, leading to tool calls with missing fields (e.g. `content: undefined` on a file-write tool). Root cause 1 — close-before-params ordering: The </function> check ran BEFORE the parameter extraction loop. When a burst delivered the final parameter together with </function>, the close handler fired first, emitted "}", and set in_function=False — the parameter was never processed. Root cause 2 — single-pass + early return: The parameter extraction used a single `if` block that processed exactly one parameter per call. When the current parameter was incomplete it executed `return None`, discarding any already-complete parameters from the same burst that hadn't been emitted yet. Root cause 3 — already_added dedup: prev_tool_call_arr entries were deduplicated by function name, so two calls to the same function (e.g. two consecutive `read` calls) would share a single entry, causing IndexError or wrong argument updates. Fixes: - Reorder: process parameters FIRST (while-loop), check </function> AFTER, so no parameter can be skipped by an early close. - Loop + break: replace the single-pass `if` with a `while` loop that accumulates all complete parameter fragments, using `break` instead of `return None` when one is incomplete, ensuring earlier complete params are still emitted. - Always-append: remove the already_added check; each tool call gets its own entry indexed by current_tool_index. - Index-based update: update prev_tool_call_arr by current_tool_index instead of name-based search (companion fix for always-append). - streamed_args_for_tool tracking: append "" on header, accumulate fragments and "}" so the serving layer can compute remaining args. - serving.py: guard against double-serialization when prev_tool_call_arr stores arguments as a JSON string (isinstance check). - Remove dead in_param branch (never activated in practice). Supersedes vllm-project#35421 which addressed streamed_args_for_tool tracking and the json_started condition but did not fix the critical close-before- params ordering or the single-pass early-return bugs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>

… XML When </function> is missing but </tool_call> is present, the previous fallback set param_end_idx = len(value_text) which would include the </tool_call> tag as part of the parameter value. Find its position in value_text and use that as the boundary instead. Addresses review feedback from gemini-code-assist. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>

voipmonitor · 2026-02-28T16:07:06Z

@chaunceyjiang please review this new, the old one #35421 is still wrong.

mergify · 2026-02-28T16:10:03Z

Hi @voipmonitor, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>

MadeBy561 · 2026-02-28T17:02:43Z

This fix has the following working perfectly for me currently:

Test	Spec tokens	Result
Single-token streaming	1	Pass
Standard speculative decoding	3	Pass
Multi-parameter tool calls (write with filePath + content)	5	Pass
Parallel same-name tool calls (two read calls in one response)	5	Pass

Cherry-pick vllm-project#35615: - Fix double-serialize bug in serving.py for tool call arguments - Fix multi-param handling in qwen3coder_tool_parser.py for spec decode - Fix streamed_args_for_tool tracking (prevents IndexError) - Handle speculative decode deltas containing multiple complete params

rstanislav · 2026-03-02T19:18:37Z

I can confirm that this PR fixes the tool calls. Without it, about 70% of them fail - most often when reading project files or fetching the file list (for example, when using RooCode). With this PR, I went through nearly an hour long coding session without a single tool-call failure. 👍🥳

chaunceyjiang

LGTM

…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Audited recent tool parser bug-fix PRs and found that several landed without corresponding test coverage. Added unit tests for each fix to prevent regressions. - Mistral: fast detokenization text detection (PR vllm-project#37209) - Qwen3Coder: malformed XML crash, anyOf double-encoding, speculative decode streaming (PRs vllm-project#36774, vllm-project#36032, vllm-project#35615) - DeepSeekV32: delimiter preservation with fast detokenization, skip_special_tokens adjustment (PR vllm-project#33964) - GLM-4 MoE: zero-argument tool calls, transformers 5.x delimiter handling, Unicode character preservation (PRs vllm-project#32321, vllm-project#31622, vllm-project#30920) - MiniMax M2: anyOf nullable parameter handling for non-null and null values (PR vllm-project#32342) - Step3p5: MTP-style variable-chunk and multi-token streaming (PR vllm-project#33690) - Kimi K2: native tool call ID extraction and multi-turn ID continuity (PR vllm-project#32768) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben Browning <bbrownin@redhat.com>

…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

voipmonitor requested review from DarkLight1337, aarnphm, chaunceyjiang and russellb as code owners February 28, 2026 15:59

mergify Bot added frontend qwen Related to Qwen models labels Feb 28, 2026

gemini-code-assist Bot reviewed Feb 28, 2026

View reviewed changes

voipmonitor and others added 2 commits February 28, 2026 17:04

voipmonitor force-pushed the fix/qwen3coder-speculative-decode-streaming branch from d6d1430 to 4916ca4 Compare February 28, 2026 16:05

voipmonitor mentioned this pull request Feb 28, 2026

[Bugfix] Fix Qwen3Coder tool call streaming with speculative decoding #35421

Closed

4 tasks

voipmonitor and others added 2 commits February 28, 2026 17:58

Apply ruff format/check fixes (SIM102, E501, line formatting)

82f7a23

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>

Merge branch 'main' into fix/qwen3coder-speculative-decode-streaming

9359241

Merge branch 'main' into fix/qwen3coder-speculative-decode-streaming

108ee0a

JohnTheNerd added a commit to JohnTheNerd/vllm that referenced this pull request Mar 1, 2026

Cherry-picked vllm-project#35615 to fix qwen3.5 tool calling with MTP

f54d528

chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 2, 2026

chaunceyjiang self-assigned this Mar 2, 2026

chaunceyjiang mentioned this pull request Mar 2, 2026

[Bugfix] Qwen3Coder streaming tool call JSON missing opening brace in arguments #35716

Closed

chaunceyjiang approved these changes Mar 3, 2026

View reviewed changes

chaunceyjiang merged commit 8ebd872 into vllm-project:main Mar 3, 2026
51 checks passed

This was referenced Mar 3, 2026

[Bug]: Qwen3.5 Can not close thinking by "enable_thinking": false #35574

Closed

[Bug]: Enabling speculative coding causes malformed Tool Calls in Qwen 122B MXFP4 #35800

Open

bbrowning mentioned this pull request Mar 26, 2026

[Misc] Add 20 regression tests for 11 tool parser bug fixes #38172

Merged

gtempo mentioned this pull request Jun 12, 2026

qwen3-coder parser drops tool calls when model omits opening <tool_call> tag — tool call leaks into content as plain text ollama/ollama#16686

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode#35615

[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode#35615
chaunceyjiang merged 5 commits into
vllm-project:mainfrom
voipmonitor:fix/qwen3coder-speculative-decode-streaming

voipmonitor commented Feb 28, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Feb 28, 2026

Uh oh!

voipmonitor commented Feb 28, 2026 •

edited

Loading

Uh oh!

mergify Bot commented Feb 28, 2026

Uh oh!

MadeBy561 commented Feb 28, 2026

Uh oh!

rstanislav commented Mar 2, 2026

Uh oh!

chaunceyjiang left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

-                    else:
-                        if self.tool_call_end_token in tool_text:
-                            param_end_idx = len(value_text)
-                        else:
-                            # Parameter incomplete — break instead of
-                            # return so we still emit any fragments
-                            # accumulated by earlier loop iterations.
-                            break
+                    else:
+                        # Fallback for malformed XML where </function> is missing.
+                        # Use </tool_call> as a delimiter if present in the value.
+                        tool_end_idx_in_value = value_text.find(self.tool_call_end_token)
+                        if tool_end_idx_in_value != -1:
+                            param_end_idx = tool_end_idx_in_value
+                        else:
+                            # Parameter incomplete — break so we still emit any
+                            # fragments accumulated by earlier loop iterations.
+                            break

Uh oh!

Conversation

voipmonitor commented Feb 28, 2026

Summary

Why #35421 doesn't work

Bug 1: Close-before-params ordering (the root cause)

Bug 2: Single-pass if + return None early exit

Bug 3: already_added dedup prevents same-name tool calls

Changes

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

voipmonitor commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mergify Bot commented Feb 28, 2026

Uh oh!

MadeBy561 commented Feb 28, 2026

Uh oh!

rstanislav commented Mar 2, 2026

Uh oh!

chaunceyjiang left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Bug 2: Single-pass `if` + `return None` early exit

Bug 3: `already_added` dedup prevents same-name tool calls

voipmonitor commented Feb 28, 2026 •

edited

Loading