[Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery by ExtReMLapin · Pull Request #40783 · vllm-project/vllm

ExtReMLapin · 2026-04-24T07:23:44Z

To be used with #40861

Purpose

This PR fixes several critical edge cases in the Qwen3 reasoning parser where reasoning text was lost or incorrectly classified.

A key issue addressed is that Qwen3, while following system prompt instructions during reasoning, often outputs tool call tags as raw text fragments (e.g., < + tool + _call + >) instead of a single special token. Since these fragments arrive as multiple regular text tokens across different streaming deltas, the previous ID-based detection would fail or corrupt the output.

Key Changes

Raw Text & Fragmented Tag Support: The parser now tracks the literal string <tool_call> across deltas. It correctly identifies the reasoning-to-content transition even when the tag is "built" over several steps from regular text tokens, preventing the parser from missing the start of a tool call.
Transition Data Integrity: Fixed a bug in vllm/parser/abstract_parser.py where the final fragment of reasoning was silently overwritten by tool call content when both occurred in the same delta.
Speculative Decoding Fix: Added is_reasoning_end_streaming, a delta-only variant used by parse_delta instead of is_reasoning_end. The paired-token guard in is_reasoning_end is intentionally preserved (it prevents system-prompt tool-call examples from triggering an early reasoning end); only its comment was clarified. The new streaming check inspects delta_ids without the guard, so a complete <tool_call>…</tool_call> delivered in a single delta (MTP / speculative decoding) correctly signals the end of reasoning.
Preserve Multiple Tool Calls: Fixed extract_content_ids to search for the first occurrence of a tool call instead of the last, ensuring no tool calls are dropped when reasoning ends implicitly.
Reasoning/Content Boundary: Ensured that any content emitted after an explicit </think> tag is never swallowed back into the reasoning field, even if a tool call follows shortly after.

Additional fixes (commit `d3be225`)

count_reasoning_tokens override for the Qwen3.5+ template: the
inherited depth-counter starts at depth=0 and is never incremented
(because <think> lives in the prompt, not the model output), so it
returned 0 reasoning tokens for every Qwen3.5+ generation. This
silently broke usage.completion_tokens_details.reasoning_tokens
reporting in API responses. The override treats the start of the
output as the reasoning region and stops at the first </think> (or,
for the Qwen3.5 implicit-end case, the first <tool_call>).
Recover withheld <tool prefix when the continuation is a
false positive: the partial-overlap guard withholds the trailing
bytes of current_text when they look like the start of <tool_call>.
Previously, if the model emitted an unrelated continuation (e.g.
<tool_use>, or just <tool_belt), those bytes were silently dropped.
The parser now re-emits them as reasoning.

Test Result

All new tests pass, and existing reasoning tests remain green.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request updates the Qwen3ReasoningParser to correctly handle cases where the <tool_call> tag is fragmented across multiple streaming deltas. It introduces logic to detect the completion of the tag by comparing current_text and previous_text, and adds a corresponding test case. A review comment pointed out a redundant and unreachable elif block in the streaming extraction logic, suggesting a simplified implementation that uses current_text.find() to consistently handle both single-token and fragmented tag scenarios.

ExtReMLapin · 2026-04-24T07:49:35Z

cc : @qmx
As I commented in your PR

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

ExtReMLapin · 2026-04-24T11:17:46Z

I pushed more fixes and the corresponding tests with them, PR name should probably change

is_reasoning_end was returning False for a completed tool call because the tool call could be something else than outputed as a tool_call token (but as fragmented one)
last reasoning fragment silently dropped at the reasoning -> tool-call transition

all tests covers thoses issues

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

ExtReMLapin · 2026-04-24T14:59:06Z

Two more fixes pushed linked to fragmented tokens & reasoning ignoring multiple tool calls (keeping only the last one)

… was detected to close reasoning Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

… being entirely into content Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

…happen again Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

Replace the misleading "to satisfy tests" comment with an explanation of the actual invariant: returning DeltaMessage() (empty) signals "processing ongoing, nothing to emit yet", whereas returning None would signal "stop processing". Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

1. count_reasoning_tokens override for Qwen3.5+ template The chat template places <think> in the prompt, so the model OUTPUT begins with reasoning tokens followed by </think>. The inherited depth-counter starts at depth=0 and is never incremented (no <think> ever appears in the output), so it returned 0 reasoning tokens for every Qwen3.5+ generation. The override treats the start of the output as the reasoning region by default and stops counting at the first </think> or — for the Qwen3.5 implicit-end case — the first <tool_call>. If <think> is present in the output (older 2507 template) reasoning begins right after it. 2. Re-emit withheld <tool_call> prefix when the continuation is not actually a tool call The streaming path withholds the tail of current_text whenever it looks like the prefix of <tool_call>. If the model then emits an unrelated continuation (e.g. <tool_use>, or just <too followed by non-tool-call text), the withheld bytes were silently dropped. The fix combines the withheld bytes from previous_text with this delta's stripped text and re-runs the overlap check on the cumulative text — emitting whatever portion is now safe instead of only the new delta minus the current overlap. 3. Honour enable_thinking=False in the streaming path extract_reasoning already honoured self.thinking_enabled. The streaming counterpart relied entirely on the serving layer to bypass the parser via prompt_is_reasoning_end detection. Direct callers (or any future regression of that bypass) silently routed model output to the reasoning channel. The streaming path now matches extract_reasoning: when thinking is disabled and no </think> is present in the delta, every byte is content. Each fix has a focused regression test in tests/reasoning/test_qwen3_reasoning_parser.py. Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>

Sandermage · 2026-04-25T14:24:36Z

@ExtReMLapin — first, thank you for this PR. The fragmented-tag analysis in your PR description ("Qwen3 often outputs tool call tags as raw text fragments — < + tool + _call + >") immediately reframed how we thought about the streaming path. Even for tokenizers where the tag is a single special token, knowing the fragmented case exists changed our debugging approach + helped us understand that the bug class spans both streaming and non-streaming paths. Backported parts of this PR on a Qwen3.6-35B-A3B-FP8 production rig (2× A5000, vLLM 0.19.2rc1.dev205+g07351e088) as part of the v7.13 multi-PR investigation. Sharing the data + one architectural question.

What we backported (split into two slices for blast-radius safety)

Our existing patch tree already has substantial Qwen3 reasoning-parser modifications (P12 mirroring an earlier fix + P27 BEFORE-THINK preservation + P59 mirroring #39055). Layering #40783's full streaming restructure on top of those would have required re-anchoring against post-P12 layout, which felt risky. So we split your PR into two minimal slices:

Non-streaming extract_content_ids first-occurrence fix (P61) — single anchor, just changes len(input_ids) - 1 - input_ids[::-1].index(...) to input_ids.index(...). Multi-tool agentic flows now preserve all tool calls instead of dropping all but the last.
Streaming partial_tag_overlap defensive guard (P61b) — adds the partial_tag_overlap import + uses it before the final reasoning emission to hold back partial <tool_call> fragments. For Qwen3 with proper special-token handling this is mostly a no-op but useful defense for tokenizers that emit the tag character-fragmented.

The major streaming restructure (the just_completed_tool_call_tag + position-aware split logic) we deferred — it's correct and probably needed for some configurations, but for our setup (Qwen3 special-token tokenizer + non-streaming non-reasoning workload) the two slices above were sufficient.

Empirical impact

P61 alone gave no measurable delta on our SINGLE-tool reproducer (n=20) — expected, since the bug fires on multi-tool only. P61b gave no measurable delta on non-streaming (also expected). For STREAMING multi-tool requests via aggregator (LibreChat-class clients), the delta should appear on multi-step agentic flows; we don't have an automated test for that yet.

One architectural question

The "removed incorrect paired-token guard in is_reasoning_end" change in your PR — is that a real correctness fix, or just a comment refresh? Looking at your diff for is_reasoning_end, the LOGIC stays the same (the if tool_call_end_token_id ... continue block is preserved); only the comment text changed. The behavioral fix for "complete <tool_call>...</tool_call> in single delta" is in extract_reasoning_streaming via just_completed_tool_call_tag, not in is_reasoning_end. Is that intentional? Worth clarifying in the PR body in case reviewers also wonder.

Backport reference

P61 + P61b — both opt-in text patches with anchor validation + drift markers. Credit to you in the docstrings + CREDITS.md.

Hope the data point helps. Thanks for the streaming work in particular — the whole partial_tag_overlap approach is nicely defensive.

ExtReMLapin · 2026-04-25T16:13:45Z

@Sandermage you are right, the description is a leftover from commit ExtReMLapin@c047919 that later got rollbacked because of bad interpretation of what it was doing. I updated the description, the fix is in is_reasoning_end_streaming

ErykCh · 2026-04-27T18:10:04Z

Could you fix also cutting reasoning content, when there is a tool call?

so this is chunk with reasoning:

and this is next with function call, with logprobs from reasoning part:

ExtReMLapin · 2026-04-27T18:37:44Z

The issue is #39055 this PR @ErykCh

ErykCh · 2026-04-28T09:24:27Z

The issue is #39055 this PR @ErykCh

I'm not so sure.

In my case, tool call is after think tag but still reasoning part is incorrectly parsed.

ExtReMLapin · 2026-04-28T12:28:38Z

@ErykCh you're right, my bad.

I tried to reproduce the error using OCR in a test, on main and on my branch, no error, can you reproduce it, either write a test or send me the whole raw output text ?

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

naroam1 · 2026-05-07T13:51:43Z

Hi maintainers — chiming in as a downstream user.

This PR addresses the streaming side of #39056, which together with #39055 (non-streaming) covers the tool-call-inside-think breakage on Qwen3/Qwen3.5 — one of the open items blocking our planned migration to Qwen3.5-9B/27B for production agentic workloads with native streaming enabled.

The fragmented-tag analysis (< + tool + _call + > arriving as multiple deltas) is exactly the failure mode we expect to hit downstream, since most agentic clients (LiteLLM, OpenCode, Claude Code per upstream issue) consume the OpenAI streaming API. The PR has community validation (@Sandermage Apr 25) and the author has been responsive.

Could a code-owner for vllm/reasoning/qwen3_reasoning_parser.py / vllm/parser/abstract_parser.py take the review pass? Happy to validate against our production stack post-merge.

Thanks @ExtReMLapin for the deep analysis!

gaby · 2026-05-08T01:37:35Z

@ExtReMLapin fix the merge conflicts

Resolve conflict in vllm/parser/abstract_parser.py: keep both adjacent blocks added after the tool_call streaming extraction. The reasoning preservation block (HEAD) and the history_tool_call_cnt increment block (main, PR vllm-project#41876) operate on orthogonal fields of delta_message/state and are safe to apply in sequence. Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

mergify · 2026-05-24T02:03:19Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

ElizaWszola · 2026-06-04T15:18:44Z

Hi @ExtReMLapin can you try to split this PR into multiple PRs, each solving a separate issue? (I can help if needed)

ExtReMLapin · 2026-06-04T18:58:00Z

Hi @ExtReMLapin can you try to split this PR into multiple PRs, each solving a separate issue? (I can help if needed)

I agree this would be probably much better and easier to review.

Will also ask for guidlines on #40861

Support fragmented tags in Qwen3 streaming reasoning parser

9a9f330

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

ExtReMLapin requested review from aarnphm, bbrowning, chaunceyjiang and sfeng33 as code owners April 24, 2026 07:23

claude Bot reviewed Apr 24, 2026

View reviewed changes

ExtReMLapin mentioned this pull request Apr 24, 2026

[Bugfix] Treat <tool_call> as implicit reasoning end in Qwen3 parser #35687

Merged

mergify Bot added qwen Related to Qwen models bug Something isn't working labels Apr 24, 2026

gemini-code-assist Bot reviewed Apr 24, 2026

View reviewed changes

Comment thread vllm/reasoning/qwen3_reasoning_parser.py Outdated

ExtReMLapin mentioned this pull request Apr 24, 2026

[Bugfix] Robust Qwen3 Coder streaming: fragmented tags, speculative decoding fixes, and content tracking #40785

Closed

CNE Pierre FICHEPOIL added 3 commits April 24, 2026 10:27

gemini is right, removed leftover code

c62995d

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

added two more failed tests

c4dea4d

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

fixed the two more failed tests, no regression detected

c047919

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

CNE Pierre FICHEPOIL added 2 commits April 24, 2026 16:25

prevent qwen3 reasoning parser from swallowing tool calls content

642c344

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

prevent Qwen 3 tool call truncation and multi-tool deletion

d50b106

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

ensure tool call integrity for Qwen 3 during streaming when tool call…

a14471d

… was detected to close reasoning Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

ExtReMLapin changed the title ~~[Bugfix] Support fragmented <tool_call> tags in Qwen3 streaming reasoning parser~~ [Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, and incorrect end detection Apr 24, 2026

CNE Pierre FICHEPOIL added 2 commits April 24, 2026 19:30

fixed tool_call leaking(duplicated) into reasoning (start of it) then…

2c024f2

… being entirely into content Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

reverted error from commit c047919 + added test to ensure it doesn't …

177f609

…happen again Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

ExtReMLapin marked this pull request as draft April 24, 2026 22:16

ExtReMLapin marked this pull request as ready for review April 25, 2026 05:02

ExtReMLapin mentioned this pull request Apr 25, 2026

[Bugfix][ToolParser] Fix Qwen3 XML and Coder streaming tool call parser regressions #40861

Open

ExtReMLapin changed the title ~~[Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, and incorrect end detection~~ [Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery Apr 25, 2026

CNE Pierre FICHEPOIL and others added 2 commits April 25, 2026 13:32

ExtReMLapin force-pushed the partial_tool_call_qwen3_reasoning branch from d3be225 to 36cb5a1 Compare April 25, 2026 11:33

Sandermage mentioned this pull request Apr 25, 2026

[Bug]: TurboQuant KV × any speculative decoding (MTP or ngram) produces degenerate token loops — confirmed across dense and hybrid attention #40831

Closed

Sandermage mentioned this pull request Apr 25, 2026

[Bug]: ngram speculative decoding default prompt_lookup_min=2 causes tool-call output corruption on Qwen3-class models with structured output (config-only fix: prompt_lookup_min=8) #40875

Open

ExtReMLapin mentioned this pull request Apr 25, 2026

Fix Qwen3 reasoning tool calls embedded inside think #39055

Open

noonghunna mentioned this pull request Apr 25, 2026

[Bug]: MTP × TurboQuant × CUDA graph capture produces degenerate output on Qwen3-Next hybrid (not closed by v7.13 ngram fix tree) #40880

Closed

drrros mentioned this pull request Apr 26, 2026

Eval bug: Answer in think tags. Qwen 3.6 27B ggml-org/llama.cpp#22398

Open

Merge branch 'main' into partial_tool_call_qwen3_reasoning

6078bf2

Merge branch 'main' into partial_tool_call_qwen3_reasoning

f615d0a

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

This was referenced May 1, 2026

[Bugfix] Fix Qwen3Coder prev_tool_call_arr double-emission on parse failure #41466

Draft

[Bugfix] Detect MTP truncation at reasoning-to-tool-call boundary #41467

Draft

fix(spec decode): suppress EOS at draft positions in rejection sampler #41493

Draft

CNE Pierre FICHEPOIL and others added 2 commits May 11, 2026 09:09

Merge branch 'main' into partial_tool_call_qwen3_reasoning

bf87744

mergify Bot added the needs-rebase label May 24, 2026

Uh oh!

Conversation

ExtReMLapin commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Key Changes

Additional fixes (commit d3be225)

Test Result

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

ExtReMLapin commented Apr 24, 2026

Uh oh!

ExtReMLapin commented Apr 24, 2026

Uh oh!

ExtReMLapin commented Apr 24, 2026

Uh oh!

Sandermage commented Apr 25, 2026

What we backported (split into two slices for blast-radius safety)

Empirical impact

One architectural question

Backport reference

Uh oh!

ExtReMLapin commented Apr 25, 2026

Uh oh!

ErykCh commented Apr 27, 2026

Uh oh!

ExtReMLapin commented Apr 27, 2026

Uh oh!

ErykCh commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ExtReMLapin commented Apr 28, 2026

Uh oh!

naroam1 commented May 7, 2026

Uh oh!

gaby commented May 8, 2026

Uh oh!

mergify Bot commented May 24, 2026

Uh oh!

ElizaWszola commented Jun 4, 2026

Uh oh!

ExtReMLapin commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

ExtReMLapin commented Apr 24, 2026 •

edited

Loading

Additional fixes (commit `d3be225`)

ErykCh commented Apr 28, 2026 •

edited

Loading