Skip to content

[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode#35615

Merged
chaunceyjiang merged 5 commits into
vllm-project:mainfrom
voipmonitor:fix/qwen3coder-speculative-decode-streaming
Mar 3, 2026
Merged

[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode#35615
chaunceyjiang merged 5 commits into
vllm-project:mainfrom
voipmonitor:fix/qwen3coder-speculative-decode-streaming

Conversation

@voipmonitor

Copy link
Copy Markdown
Contributor

Summary

Fixes a critical bug where the Qwen3Coder streaming tool parser silently drops parameters when speculative decoding is enabled (e.g. num_speculative_tokens=5). This causes downstream clients to receive tool calls with missing fields — for example a write tool call arrives as {"filePath": "/path/file"} with no content field, which the client sees as content: undefined.

Supersedes #35421, which partially addressed the issue but left the two root causes unfixed.

Why #35421 doesn't work

PR #35421 correctly identified and fixed several real issues:

  • streamed_args_for_tool tracking (append "" on header, accumulate "{"/"}")
  • The json_started condition that skipped "{" when parameter_prefix was in delta_text
  • The serving.py double-serialization of already-JSON-string arguments

However, it did not fix the two structural bugs that actually cause parameter loss under speculative decoding:

Bug 1: Close-before-params ordering (the root cause)

In the current code on main (and unchanged by #35421), the </function> close check runs before the parameter extraction loop:

# Line 531 — runs FIRST
if not self.json_closed and self.function_end_token in tool_text:
    self.json_closed = True
    ...
    return DeltaMessage(arguments="}")  # ← emits "}" and returns

# Line 591 — runs SECOND (but never reached if </function> fired above)
if not self.in_param and self.param_count < len(param_starts):
    # process parameter...

With speculative decoding, a single burst can deliver the final parameter value together with </function>. The close check sees </function> in tool_text, emits "}", sets in_function = False, and returns. The parameter loop never executes. The parameter is silently lost.

This PR fixes it by moving parameter processing before the close check.

Bug 2: Single-pass if + return None early exit

The parameter extraction uses a single if block that processes exactly one parameter per call. When a parameter is incomplete (still streaming), it executes return None:

else:
    # Still streaming, wait for more content
    return None  # ← bails out entirely

With speculative decoding, a burst can contain multiple parameters where the first N are complete but the last is still streaming. The return None discards all the already-complete parameters that were never emitted.

This PR fixes it by using a while loop that accumulates all complete parameter fragments, with break (instead of return None) when one is incomplete. Already-complete fragments are emitted before yielding control.

Bug 3: already_added dedup prevents same-name tool calls

The current code deduplicates prev_tool_call_arr entries by function name:

already_added = any(tool.get("name") == self.current_function_name ...)
if not already_added:
    self.prev_tool_call_arr.append(...)

Two calls to the same function (e.g. two read calls) share a single entry. Then current_tool_index=1 but prev_tool_call_arr has length 1 → IndexError.

This PR fixes it by always appending (each tool call is a separate invocation) and using index-based lookups.

Changes

  • vllm/tool_parsers/qwen3coder_tool_parser.py:
    • Reorder: process parameters FIRST (while loop), check </function> AFTER
    • Replace single-pass if with while loop + break + fragment accumulation
    • Remove already_added name-based dedup; always append to prev_tool_call_arr
    • Use index-based prev_tool_call_arr update (companion to always-append)
    • Track streamed_args_for_tool throughout (header, {, fragments, })
    • Make json_started check unconditional (remove parameter_prefix not in delta_text)
    • Replace silent pass with logger.debug on parse errors
    • Remove dead in_param branch (never activated)
  • vllm/entrypoints/openai/chat_completion/serving.py:
    • Guard against double-serialization when arguments is already a JSON string

Test plan

  • Verify with single-token streaming (no speculative decode) — behavior should be unchanged
  • Verify with num_speculative_tokens=5 — multi-parameter tool calls (e.g. write with filePath + content) should have all parameters present
  • Verify parallel tool calls with the same function name (e.g. two read calls) work correctly
  • Run existing test_qwen3coder_tool_parser.py tests

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request provides a comprehensive fix for a critical bug causing parameter loss in Qwen3Coder's streaming tool parser, especially with speculative decoding. The changes are well-reasoned and address the identified root causes effectively by reordering parameter parsing, using a while loop for multiple parameters, and correcting the logic for repeated tool calls. The related fixes in serving.py and improved logging also enhance robustness. I have one suggestion to further improve the streaming parser's resilience to malformed tool call structures.

Comment on lines +581 to +588
else:
if self.tool_call_end_token in tool_text:
param_end_idx = len(value_text)
else:
# Parameter incomplete — break instead of
# return so we still emit any fragments
# accumulated by earlier loop iterations.
break

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The fallback logic for determining the end of a parameter value is not fully robust. If </function> is missing and </tool_call> is present anywhere in tool_text, param_end_idx is set to len(value_text). This can incorrectly include the </tool_call> tag in the parameter's value if the input is malformed, leading to invalid JSON fragments being streamed.

A safer approach is to find the index of </tool_call> specifically within value_text and use that as the boundary. If it's not found, it's better to break and treat the parameter as incomplete. This improves robustness against malformed inputs during streaming.

Suggested change
else:
if self.tool_call_end_token in tool_text:
param_end_idx = len(value_text)
else:
# Parameter incomplete — break instead of
# return so we still emit any fragments
# accumulated by earlier loop iterations.
break
else:
# Fallback for malformed XML where </function> is missing.
# Use </tool_call> as a delimiter if present in the value.
tool_end_idx_in_value = value_text.find(self.tool_call_end_token)
if tool_end_idx_in_value != -1:
param_end_idx = tool_end_idx_in_value
else:
# Parameter incomplete — break so we still emit any
# fragments accumulated by earlier loop iterations.
break

voipmonitor and others added 2 commits February 28, 2026 17:04
…loss

With speculative decoding (e.g. num_speculative_tokens=5), token bursts
can deliver multiple complete parameters and </function> in a single
delta.  The existing streaming parser had two ordering/control-flow bugs
that caused parameters to be silently dropped, leading to tool calls
with missing fields (e.g. `content: undefined` on a file-write tool).

Root cause 1 — close-before-params ordering:
The </function> check ran BEFORE the parameter extraction loop.  When a
burst delivered the final parameter together with </function>, the close
handler fired first, emitted "}", and set in_function=False — the
parameter was never processed.

Root cause 2 — single-pass + early return:
The parameter extraction used a single `if` block that processed exactly
one parameter per call. When the current parameter was incomplete it
executed `return None`, discarding any already-complete parameters from
the same burst that hadn't been emitted yet.

Root cause 3 — already_added dedup:
prev_tool_call_arr entries were deduplicated by function name, so two
calls to the same function (e.g. two consecutive `read` calls) would
share a single entry, causing IndexError or wrong argument updates.

Fixes:
- Reorder: process parameters FIRST (while-loop), check </function>
  AFTER, so no parameter can be skipped by an early close.
- Loop + break: replace the single-pass `if` with a `while` loop that
  accumulates all complete parameter fragments, using `break` instead
  of `return None` when one is incomplete, ensuring earlier complete
  params are still emitted.
- Always-append: remove the already_added check; each tool call gets
  its own entry indexed by current_tool_index.
- Index-based update: update prev_tool_call_arr by current_tool_index
  instead of name-based search (companion fix for always-append).
- streamed_args_for_tool tracking: append "" on header, accumulate
  fragments and "}" so the serving layer can compute remaining args.
- serving.py: guard against double-serialization when prev_tool_call_arr
  stores arguments as a JSON string (isinstance check).
- Remove dead in_param branch (never activated in practice).

Supersedes vllm-project#35421 which addressed streamed_args_for_tool tracking and
the json_started condition but did not fix the critical close-before-
params ordering or the single-pass early-return bugs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Martin Vit <martin@voipmonitor.org>
… XML

When </function> is missing but </tool_call> is present, the previous
fallback set param_end_idx = len(value_text) which would include the
</tool_call> tag as part of the parameter value. Find its position in
value_text and use that as the boundary instead.

Addresses review feedback from gemini-code-assist.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Martin Vit <martin@voipmonitor.org>
@voipmonitor

voipmonitor commented Feb 28, 2026

Copy link
Copy Markdown
Contributor Author

@chaunceyjiang please review this new, the old one #35421 is still wrong.

@mergify

mergify Bot commented Feb 28, 2026

Copy link
Copy Markdown
Contributor

Hi @voipmonitor, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

voipmonitor and others added 2 commits February 28, 2026 17:58
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Martin Vit <martin@voipmonitor.org>
@MadeBy561

Copy link
Copy Markdown

This fix has the following working perfectly for me currently:

Test Spec tokens Result
Single-token streaming 1 Pass
Standard speculative decoding 3 Pass
Multi-parameter tool calls (write with filePath + content) 5 Pass
Parallel same-name tool calls (two read calls in one response) 5 Pass

AlexGS74 added a commit to AlexGS74/vllm-patches that referenced this pull request Mar 1, 2026
Cherry-pick vllm-project#35615:
- Fix double-serialize bug in serving.py for tool call arguments
- Fix multi-param handling in qwen3coder_tool_parser.py for spec decode
- Fix streamed_args_for_tool tracking (prevents IndexError)
- Handle speculative decode deltas containing multiple complete params
JohnTheNerd added a commit to JohnTheNerd/vllm that referenced this pull request Mar 1, 2026
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 2, 2026
@chaunceyjiang chaunceyjiang self-assigned this Mar 2, 2026
@rstanislav

Copy link
Copy Markdown

I can confirm that this PR fixes the tool calls. Without it, about 70% of them fail - most often when reading project files or fetching the file list (for example, when using RooCode). With this PR, I went through nearly an hour long coding session without a single tool-call failure. 👍🥳

@chaunceyjiang chaunceyjiang left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@chaunceyjiang chaunceyjiang merged commit 8ebd872 into vllm-project:main Mar 3, 2026
51 checks passed
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Mar 12, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
bbrowning added a commit to bbrowning/vllm that referenced this pull request Mar 31, 2026
Audited recent tool parser bug-fix PRs and found that several
landed without corresponding test coverage. Added unit tests
for each fix to prevent regressions.

- Mistral: fast detokenization text detection (PR vllm-project#37209)
- Qwen3Coder: malformed XML crash, anyOf double-encoding,
  speculative decode streaming (PRs vllm-project#36774, vllm-project#36032, vllm-project#35615)
- DeepSeekV32: delimiter preservation with fast detokenization,
  skip_special_tokens adjustment (PR vllm-project#33964)
- GLM-4 MoE: zero-argument tool calls, transformers 5.x delimiter
  handling, Unicode character preservation (PRs vllm-project#32321, vllm-project#31622, vllm-project#30920)
- MiniMax M2: anyOf nullable parameter handling for non-null and
  null values (PR vllm-project#32342)
- Step3p5: MTP-style variable-chunk and multi-token streaming
  (PR vllm-project#33690)
- Kimi K2: native tool call ID extraction and multi-turn ID
  continuity (PR vllm-project#32768)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Ben Browning <bbrownin@redhat.com>
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
0826joyce pushed a commit to 0826joyce/vllm-serving-optimization that referenced this pull request May 19, 2026
…e decode (vllm-project#35615)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

frontend qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants