[Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery#40783
Conversation
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
There was a problem hiding this comment.
Code Review
This pull request updates the Qwen3ReasoningParser to correctly handle cases where the <tool_call> tag is fragmented across multiple streaming deltas. It introduces logic to detect the completion of the tag by comparing current_text and previous_text, and adds a corresponding test case. A review comment pointed out a redundant and unreachable elif block in the streaming extraction logic, suggesting a simplified implementation that uses current_text.find() to consistently handle both single-token and fragmented tag scenarios.
|
cc : @qmx |
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
|
I pushed more fixes and the corresponding tests with them, PR name should probably change
all tests covers thoses issues |
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
|
Two more fixes pushed linked to fragmented tokens & reasoning ignoring multiple tool calls (keeping only the last one) |
… was detected to close reasoning Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
<tool_call> tags in Qwen3 streaming reasoning parser… being entirely into content Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
…happen again Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Replace the misleading "to satisfy tests" comment with an explanation of the actual invariant: returning DeltaMessage() (empty) signals "processing ongoing, nothing to emit yet", whereas returning None would signal "stop processing". Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
1. count_reasoning_tokens override for Qwen3.5+ template The chat template places <think> in the prompt, so the model OUTPUT begins with reasoning tokens followed by </think>. The inherited depth-counter starts at depth=0 and is never incremented (no <think> ever appears in the output), so it returned 0 reasoning tokens for every Qwen3.5+ generation. The override treats the start of the output as the reasoning region by default and stops counting at the first </think> or — for the Qwen3.5 implicit-end case — the first <tool_call>. If <think> is present in the output (older 2507 template) reasoning begins right after it. 2. Re-emit withheld <tool_call> prefix when the continuation is not actually a tool call The streaming path withholds the tail of current_text whenever it looks like the prefix of <tool_call>. If the model then emits an unrelated continuation (e.g. <tool_use>, or just <too followed by non-tool-call text), the withheld bytes were silently dropped. The fix combines the withheld bytes from previous_text with this delta's stripped text and re-runs the overlap check on the cumulative text — emitting whatever portion is now safe instead of only the new delta minus the current overlap. 3. Honour enable_thinking=False in the streaming path extract_reasoning already honoured self.thinking_enabled. The streaming counterpart relied entirely on the serving layer to bypass the parser via prompt_is_reasoning_end detection. Direct callers (or any future regression of that bypass) silently routed model output to the reasoning channel. The streaming path now matches extract_reasoning: when thinking is disabled and no </think> is present in the delta, every byte is content. Each fix has a focused regression test in tests/reasoning/test_qwen3_reasoning_parser.py. Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
d3be225 to
36cb5a1
Compare
|
@ExtReMLapin — first, thank you for this PR. The fragmented-tag analysis in your PR description ("Qwen3 often outputs tool call tags as raw text fragments — What we backported (split into two slices for blast-radius safety)Our existing patch tree already has substantial Qwen3 reasoning-parser modifications (P12 mirroring an earlier fix + P27 BEFORE-THINK preservation + P59 mirroring #39055). Layering #40783's full streaming restructure on top of those would have required re-anchoring against post-P12 layout, which felt risky. So we split your PR into two minimal slices:
The major streaming restructure (the Empirical impactP61 alone gave no measurable delta on our SINGLE-tool reproducer (n=20) — expected, since the bug fires on multi-tool only. P61b gave no measurable delta on non-streaming (also expected). For STREAMING multi-tool requests via aggregator (LibreChat-class clients), the delta should appear on multi-step agentic flows; we don't have an automated test for that yet. One architectural questionThe "removed incorrect paired-token guard in Backport referenceP61 + P61b — both opt-in text patches with anchor validation + drift markers. Credit to you in the docstrings + CREDITS.md. Hope the data point helps. Thanks for the streaming work in particular — the whole |
|
@Sandermage you are right, the description is a leftover from commit ExtReMLapin@c047919 that later got rollbacked because of bad interpretation of what it was doing. I updated the description, the fix is in |
|
@ErykCh you're right, my bad. I tried to reproduce the error using OCR in a test, on main and on my branch, no error, can you reproduce it, either write a test or send me the whole raw output text ? |
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
|
Hi maintainers — chiming in as a downstream user. This PR addresses the streaming side of #39056, which together with #39055 (non-streaming) covers the tool-call-inside-think breakage on Qwen3/Qwen3.5 — one of the open items blocking our planned migration to Qwen3.5-9B/27B for production agentic workloads with native streaming enabled. The fragmented-tag analysis ( Could a code-owner for Thanks @ExtReMLapin for the deep analysis! |
|
@ExtReMLapin fix the merge conflicts |
Resolve conflict in vllm/parser/abstract_parser.py: keep both adjacent blocks added after the tool_call streaming extraction. The reasoning preservation block (HEAD) and the history_tool_call_cnt increment block (main, PR vllm-project#41876) operate on orthogonal fields of delta_message/state and are safe to apply in sequence. Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
|
This pull request has merge conflicts that must be resolved before it can be |
|
Hi @ExtReMLapin can you try to split this PR into multiple PRs, each solving a separate issue? (I can help if needed) |
I agree this would be probably much better and easier to review. Will also ask for guidlines on #40861 |




To be used with #40861
Purpose
This PR fixes several critical edge cases in the Qwen3 reasoning parser where reasoning text was lost or incorrectly classified.
A key issue addressed is that Qwen3, while following system prompt instructions during reasoning, often outputs tool call tags as raw text fragments (e.g.,
<+tool+_call+>) instead of a single special token. Since these fragments arrive as multiple regular text tokens across different streaming deltas, the previous ID-based detection would fail or corrupt the output.Key Changes
<tool_call>across deltas. It correctly identifies the reasoning-to-content transition even when the tag is "built" over several steps from regular text tokens, preventing the parser from missing the start of a tool call.vllm/parser/abstract_parser.pywhere the final fragment of reasoning was silently overwritten by tool call content when both occurred in the same delta.is_reasoning_end_streaming, a delta-only variant used byparse_deltainstead ofis_reasoning_end. The paired-token guard inis_reasoning_endis intentionally preserved (it prevents system-prompt tool-call examples from triggering an early reasoning end); only its comment was clarified. The new streaming check inspectsdelta_idswithout the guard, so a complete<tool_call>…</tool_call>delivered in a single delta (MTP / speculative decoding) correctly signals the end of reasoning.extract_content_idsto search for the first occurrence of a tool call instead of the last, ensuring no tool calls are dropped when reasoning ends implicitly.</think>tag is never swallowed back into the reasoning field, even if a tool call follows shortly after.Additional fixes (commit d3be225)
count_reasoning_tokensoverride for the Qwen3.5+ template: theinherited depth-counter starts at depth=0 and is never incremented
(because
<think>lives in the prompt, not the model output), so itreturned 0 reasoning tokens for every Qwen3.5+ generation. This
silently broke
usage.completion_tokens_details.reasoning_tokensreporting in API responses. The override treats the start of the
output as the reasoning region and stops at the first
</think>(or,for the Qwen3.5 implicit-end case, the first
<tool_call>).Recover withheld
<toolprefix when the continuation is afalse positive: the partial-overlap guard withholds the trailing
bytes of
current_textwhen they look like the start of<tool_call>.Previously, if the model emitted an unrelated continuation (e.g.
<tool_use>, or just<tool_belt), those bytes were silently dropped.The parser now re-emits them as reasoning.
Test Result
All new tests pass, and existing reasoning tests remain green.