[Tool Parser] Fix Qwen3Coder streaming parameter loss with speculative decode#35615
Conversation
There was a problem hiding this comment.
Code Review
This pull request provides a comprehensive fix for a critical bug causing parameter loss in Qwen3Coder's streaming tool parser, especially with speculative decoding. The changes are well-reasoned and address the identified root causes effectively by reordering parameter parsing, using a while loop for multiple parameters, and correcting the logic for repeated tool calls. The related fixes in serving.py and improved logging also enhance robustness. I have one suggestion to further improve the streaming parser's resilience to malformed tool call structures.
| else: | ||
| if self.tool_call_end_token in tool_text: | ||
| param_end_idx = len(value_text) | ||
| else: | ||
| # Parameter incomplete — break instead of | ||
| # return so we still emit any fragments | ||
| # accumulated by earlier loop iterations. | ||
| break |
There was a problem hiding this comment.
The fallback logic for determining the end of a parameter value is not fully robust. If </function> is missing and </tool_call> is present anywhere in tool_text, param_end_idx is set to len(value_text). This can incorrectly include the </tool_call> tag in the parameter's value if the input is malformed, leading to invalid JSON fragments being streamed.
A safer approach is to find the index of </tool_call> specifically within value_text and use that as the boundary. If it's not found, it's better to break and treat the parameter as incomplete. This improves robustness against malformed inputs during streaming.
| else: | |
| if self.tool_call_end_token in tool_text: | |
| param_end_idx = len(value_text) | |
| else: | |
| # Parameter incomplete — break instead of | |
| # return so we still emit any fragments | |
| # accumulated by earlier loop iterations. | |
| break | |
| else: | |
| # Fallback for malformed XML where </function> is missing. | |
| # Use </tool_call> as a delimiter if present in the value. | |
| tool_end_idx_in_value = value_text.find(self.tool_call_end_token) | |
| if tool_end_idx_in_value != -1: | |
| param_end_idx = tool_end_idx_in_value | |
| else: | |
| # Parameter incomplete — break so we still emit any | |
| # fragments accumulated by earlier loop iterations. | |
| break |
…loss With speculative decoding (e.g. num_speculative_tokens=5), token bursts can deliver multiple complete parameters and </function> in a single delta. The existing streaming parser had two ordering/control-flow bugs that caused parameters to be silently dropped, leading to tool calls with missing fields (e.g. `content: undefined` on a file-write tool). Root cause 1 — close-before-params ordering: The </function> check ran BEFORE the parameter extraction loop. When a burst delivered the final parameter together with </function>, the close handler fired first, emitted "}", and set in_function=False — the parameter was never processed. Root cause 2 — single-pass + early return: The parameter extraction used a single `if` block that processed exactly one parameter per call. When the current parameter was incomplete it executed `return None`, discarding any already-complete parameters from the same burst that hadn't been emitted yet. Root cause 3 — already_added dedup: prev_tool_call_arr entries were deduplicated by function name, so two calls to the same function (e.g. two consecutive `read` calls) would share a single entry, causing IndexError or wrong argument updates. Fixes: - Reorder: process parameters FIRST (while-loop), check </function> AFTER, so no parameter can be skipped by an early close. - Loop + break: replace the single-pass `if` with a `while` loop that accumulates all complete parameter fragments, using `break` instead of `return None` when one is incomplete, ensuring earlier complete params are still emitted. - Always-append: remove the already_added check; each tool call gets its own entry indexed by current_tool_index. - Index-based update: update prev_tool_call_arr by current_tool_index instead of name-based search (companion fix for always-append). - streamed_args_for_tool tracking: append "" on header, accumulate fragments and "}" so the serving layer can compute remaining args. - serving.py: guard against double-serialization when prev_tool_call_arr stores arguments as a JSON string (isinstance check). - Remove dead in_param branch (never activated in practice). Supersedes vllm-project#35421 which addressed streamed_args_for_tool tracking and the json_started condition but did not fix the critical close-before- params ordering or the single-pass early-return bugs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>
… XML When </function> is missing but </tool_call> is present, the previous fallback set param_end_idx = len(value_text) which would include the </tool_call> tag as part of the parameter value. Find its position in value_text and use that as the boundary instead. Addresses review feedback from gemini-code-assist. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>
d6d1430 to
4916ca4
Compare
|
@chaunceyjiang please review this new, the old one #35421 is still wrong. |
|
Hi @voipmonitor, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Martin Vit <martin@voipmonitor.org>
|
This fix has the following working perfectly for me currently:
|
Cherry-pick vllm-project#35615: - Fix double-serialize bug in serving.py for tool call arguments - Fix multi-param handling in qwen3coder_tool_parser.py for spec decode - Fix streamed_args_for_tool tracking (prevents IndexError) - Handle speculative decode deltas containing multiple complete params
|
I can confirm that this PR fixes the tool calls. Without it, about 70% of them fail - most often when reading project files or fetching the file list (for example, when using RooCode). With this PR, I went through nearly an hour long coding session without a single tool-call failure. 👍🥳 |
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Audited recent tool parser bug-fix PRs and found that several landed without corresponding test coverage. Added unit tests for each fix to prevent regressions. - Mistral: fast detokenization text detection (PR vllm-project#37209) - Qwen3Coder: malformed XML crash, anyOf double-encoding, speculative decode streaming (PRs vllm-project#36774, vllm-project#36032, vllm-project#35615) - DeepSeekV32: delimiter preservation with fast detokenization, skip_special_tokens adjustment (PR vllm-project#33964) - GLM-4 MoE: zero-argument tool calls, transformers 5.x delimiter handling, Unicode character preservation (PRs vllm-project#32321, vllm-project#31622, vllm-project#30920) - MiniMax M2: anyOf nullable parameter handling for non-null and null values (PR vllm-project#32342) - Step3p5: MTP-style variable-chunk and multi-token streaming (PR vllm-project#33690) - Kimi K2: native tool call ID extraction and multi-turn ID continuity (PR vllm-project#32768) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Ben Browning <bbrownin@redhat.com>
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…e decode (vllm-project#35615) Signed-off-by: Martin Vit <martin@voipmonitor.org> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fixes a critical bug where the Qwen3Coder streaming tool parser silently drops parameters when speculative decoding is enabled (e.g.
num_speculative_tokens=5). This causes downstream clients to receive tool calls with missing fields — for example awritetool call arrives as{"filePath": "/path/file"}with nocontentfield, which the client sees ascontent: undefined.Supersedes #35421, which partially addressed the issue but left the two root causes unfixed.
Why #35421 doesn't work
PR #35421 correctly identified and fixed several real issues:
streamed_args_for_tooltracking (append""on header, accumulate"{"/"}")json_startedcondition that skipped"{"whenparameter_prefixwas indelta_textserving.pydouble-serialization of already-JSON-string argumentsHowever, it did not fix the two structural bugs that actually cause parameter loss under speculative decoding:
Bug 1: Close-before-params ordering (the root cause)
In the current code on
main(and unchanged by #35421), the</function>close check runs before the parameter extraction loop:With speculative decoding, a single burst can deliver the final parameter value together with
</function>. The close check sees</function>intool_text, emits"}", setsin_function = False, and returns. The parameter loop never executes. The parameter is silently lost.This PR fixes it by moving parameter processing before the close check.
Bug 2: Single-pass
if+return Noneearly exitThe parameter extraction uses a single
ifblock that processes exactly one parameter per call. When a parameter is incomplete (still streaming), it executesreturn None:With speculative decoding, a burst can contain multiple parameters where the first N are complete but the last is still streaming. The
return Nonediscards all the already-complete parameters that were never emitted.This PR fixes it by using a
whileloop that accumulates all complete parameter fragments, withbreak(instead ofreturn None) when one is incomplete. Already-complete fragments are emitted before yielding control.Bug 3:
already_addeddedup prevents same-name tool callsThe current code deduplicates
prev_tool_call_arrentries by function name:Two calls to the same function (e.g. two
readcalls) share a single entry. Thencurrent_tool_index=1butprev_tool_call_arrhas length 1 → IndexError.This PR fixes it by always appending (each tool call is a separate invocation) and using index-based lookups.
Changes
vllm/tool_parsers/qwen3coder_tool_parser.py:whileloop), check</function>AFTERifwithwhileloop +break+ fragment accumulationalready_addedname-based dedup; always append toprev_tool_call_arrprev_tool_call_arrupdate (companion to always-append)streamed_args_for_toolthroughout (header,{, fragments,})json_startedcheck unconditional (removeparameter_prefix not in delta_text)passwithlogger.debugon parse errorsin_parambranch (never activated)vllm/entrypoints/openai/chat_completion/serving.py:argumentsis already a JSON stringTest plan
num_speculative_tokens=5— multi-parameter tool calls (e.g.writewithfilePath+content) should have all parameters presentreadcalls) work correctlytest_qwen3coder_tool_parser.pytests🤖 Generated with Claude Code