Skip to content

[Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery#40783

Open
ExtReMLapin wants to merge 15 commits into
vllm-project:mainfrom
ExtReMLapin:partial_tool_call_qwen3_reasoning
Open

[Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery#40783
ExtReMLapin wants to merge 15 commits into
vllm-project:mainfrom
ExtReMLapin:partial_tool_call_qwen3_reasoning

Conversation

@ExtReMLapin

@ExtReMLapin ExtReMLapin commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

To be used with #40861

Purpose

This PR fixes several critical edge cases in the Qwen3 reasoning parser where reasoning text was lost or incorrectly classified.

A key issue addressed is that Qwen3, while following system prompt instructions during reasoning, often outputs tool call tags as raw text fragments (e.g., < + tool + _call + >) instead of a single special token. Since these fragments arrive as multiple regular text tokens across different streaming deltas, the previous ID-based detection would fail or corrupt the output.

Key Changes

  • Raw Text & Fragmented Tag Support: The parser now tracks the literal string <tool_call> across deltas. It correctly identifies the reasoning-to-content transition even when the tag is "built" over several steps from regular text tokens, preventing the parser from missing the start of a tool call.
  • Transition Data Integrity: Fixed a bug in vllm/parser/abstract_parser.py where the final fragment of reasoning was silently overwritten by tool call content when both occurred in the same delta.
  • Speculative Decoding Fix: Added is_reasoning_end_streaming, a delta-only variant used by parse_delta instead of is_reasoning_end. The paired-token guard in is_reasoning_end is intentionally preserved (it prevents system-prompt tool-call examples from triggering an early reasoning end); only its comment was clarified. The new streaming check inspects delta_ids without the guard, so a complete <tool_call>…</tool_call> delivered in a single delta (MTP / speculative decoding) correctly signals the end of reasoning.
  • Preserve Multiple Tool Calls: Fixed extract_content_ids to search for the first occurrence of a tool call instead of the last, ensuring no tool calls are dropped when reasoning ends implicitly.
  • Reasoning/Content Boundary: Ensured that any content emitted after an explicit </think> tag is never swallowed back into the reasoning field, even if a tool call follows shortly after.

Additional fixes (commit d3be225)

  • count_reasoning_tokens override for the Qwen3.5+ template: the
    inherited depth-counter starts at depth=0 and is never incremented
    (because <think> lives in the prompt, not the model output), so it
    returned 0 reasoning tokens for every Qwen3.5+ generation. This
    silently broke usage.completion_tokens_details.reasoning_tokens
    reporting in API responses. The override treats the start of the
    output as the reasoning region and stops at the first </think> (or,
    for the Qwen3.5 implicit-end case, the first <tool_call>).

  • Recover withheld <tool prefix when the continuation is a
    false positive
    : the partial-overlap guard withholds the trailing
    bytes of current_text when they look like the start of <tool_call>.
    Previously, if the model emitted an unrelated continuation (e.g.
    <tool_use>, or just <tool_belt), those bytes were silently dropped.
    The parser now re-emits them as reasoning.

Test Result

All new tests pass, and existing reasoning tests remain green.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added qwen Related to Qwen models bug Something isn't working labels Apr 24, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Qwen3ReasoningParser to correctly handle cases where the <tool_call> tag is fragmented across multiple streaming deltas. It introduces logic to detect the completion of the tag by comparing current_text and previous_text, and adds a corresponding test case. A review comment pointed out a redundant and unreachable elif block in the streaming extraction logic, suggesting a simplified implementation that uses current_text.find() to consistently handle both single-token and fragmented tag scenarios.

Comment thread vllm/reasoning/qwen3_reasoning_parser.py Outdated
@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

cc : @qmx
As I commented in your PR

CNE Pierre FICHEPOIL added 3 commits April 24, 2026 10:27
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

I pushed more fixes and the corresponding tests with them, PR name should probably change

  • is_reasoning_end was returning False for a completed tool call because the tool call could be something else than outputed as a tool_call token (but as fragmented one)
  • last reasoning fragment silently dropped at the reasoning -> tool-call transition

all tests covers thoses issues

CNE Pierre FICHEPOIL added 2 commits April 24, 2026 16:25
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

Two more fixes pushed linked to fragmented tokens & reasoning ignoring multiple tool calls (keeping only the last one)

… was detected to close reasoning

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@ExtReMLapin ExtReMLapin changed the title [Bugfix] Support fragmented <tool_call> tags in Qwen3 streaming reasoning parser [Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, and incorrect end detection Apr 24, 2026
CNE Pierre FICHEPOIL added 2 commits April 24, 2026 19:30
… being entirely into content

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
…happen again

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@ExtReMLapin ExtReMLapin marked this pull request as draft April 24, 2026 22:16
@ExtReMLapin ExtReMLapin marked this pull request as ready for review April 25, 2026 05:02
@ExtReMLapin ExtReMLapin changed the title [Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, and incorrect end detection [Bugfix] Fix Qwen3 reasoning parser: raw text tags, transition loss, end detection, token counting, withhold recovery Apr 25, 2026
CNE Pierre FICHEPOIL and others added 2 commits April 25, 2026 13:32
Replace the misleading "to satisfy tests" comment with an explanation
of the actual invariant: returning DeltaMessage() (empty) signals
"processing ongoing, nothing to emit yet", whereas returning None
would signal "stop processing".

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
1. count_reasoning_tokens override for Qwen3.5+ template

   The chat template places <think> in the prompt, so the model OUTPUT
   begins with reasoning tokens followed by </think>.  The inherited
   depth-counter starts at depth=0 and is never incremented (no <think>
   ever appears in the output), so it returned 0 reasoning tokens for
   every Qwen3.5+ generation.

   The override treats the start of the output as the reasoning region
   by default and stops counting at the first </think> or — for the
   Qwen3.5 implicit-end case — the first <tool_call>.  If <think> is
   present in the output (older 2507 template) reasoning begins right
   after it.

2. Re-emit withheld <tool_call> prefix when the continuation is not
   actually a tool call

   The streaming path withholds the tail of current_text whenever it
   looks like the prefix of <tool_call>.  If the model then emits an
   unrelated continuation (e.g. <tool_use>, or just <too followed by
   non-tool-call text), the withheld bytes were silently dropped.
   The fix combines the withheld bytes from previous_text with this
   delta's stripped text and re-runs the overlap check on the
   cumulative text — emitting whatever portion is now safe instead of
   only the new delta minus the current overlap.

3. Honour enable_thinking=False in the streaming path

   extract_reasoning already honoured self.thinking_enabled.  The
   streaming counterpart relied entirely on the serving layer to
   bypass the parser via prompt_is_reasoning_end detection.  Direct
   callers (or any future regression of that bypass) silently routed
   model output to the reasoning channel.  The streaming path now
   matches extract_reasoning: when thinking is disabled and no
   </think> is present in the delta, every byte is content.

Each fix has a focused regression test in
tests/reasoning/test_qwen3_reasoning_parser.py.

Signed-off-by: ExtReMLapin <3909752+ExtReMLapin@users.noreply.github.com>
@ExtReMLapin ExtReMLapin force-pushed the partial_tool_call_qwen3_reasoning branch from d3be225 to 36cb5a1 Compare April 25, 2026 11:33
@Sandermage

Copy link
Copy Markdown
Contributor

@ExtReMLapin — first, thank you for this PR. The fragmented-tag analysis in your PR description ("Qwen3 often outputs tool call tags as raw text fragments — < + tool + _call + >") immediately reframed how we thought about the streaming path. Even for tokenizers where the tag is a single special token, knowing the fragmented case exists changed our debugging approach + helped us understand that the bug class spans both streaming and non-streaming paths. Backported parts of this PR on a Qwen3.6-35B-A3B-FP8 production rig (2× A5000, vLLM 0.19.2rc1.dev205+g07351e088) as part of the v7.13 multi-PR investigation. Sharing the data + one architectural question.

What we backported (split into two slices for blast-radius safety)

Our existing patch tree already has substantial Qwen3 reasoning-parser modifications (P12 mirroring an earlier fix + P27 BEFORE-THINK preservation + P59 mirroring #39055). Layering #40783's full streaming restructure on top of those would have required re-anchoring against post-P12 layout, which felt risky. So we split your PR into two minimal slices:

  1. Non-streaming extract_content_ids first-occurrence fix (P61) — single anchor, just changes len(input_ids) - 1 - input_ids[::-1].index(...) to input_ids.index(...). Multi-tool agentic flows now preserve all tool calls instead of dropping all but the last.

  2. Streaming partial_tag_overlap defensive guard (P61b) — adds the partial_tag_overlap import + uses it before the final reasoning emission to hold back partial <tool_call> fragments. For Qwen3 with proper special-token handling this is mostly a no-op but useful defense for tokenizers that emit the tag character-fragmented.

The major streaming restructure (the just_completed_tool_call_tag + position-aware split logic) we deferred — it's correct and probably needed for some configurations, but for our setup (Qwen3 special-token tokenizer + non-streaming non-reasoning workload) the two slices above were sufficient.

Empirical impact

P61 alone gave no measurable delta on our SINGLE-tool reproducer (n=20) — expected, since the bug fires on multi-tool only. P61b gave no measurable delta on non-streaming (also expected). For STREAMING multi-tool requests via aggregator (LibreChat-class clients), the delta should appear on multi-step agentic flows; we don't have an automated test for that yet.

One architectural question

The "removed incorrect paired-token guard in is_reasoning_end" change in your PR — is that a real correctness fix, or just a comment refresh? Looking at your diff for is_reasoning_end, the LOGIC stays the same (the if tool_call_end_token_id ... continue block is preserved); only the comment text changed. The behavioral fix for "complete <tool_call>...</tool_call> in single delta" is in extract_reasoning_streaming via just_completed_tool_call_tag, not in is_reasoning_end. Is that intentional? Worth clarifying in the PR body in case reviewers also wonder.

Backport reference

P61 + P61b — both opt-in text patches with anchor validation + drift markers. Credit to you in the docstrings + CREDITS.md.

Hope the data point helps. Thanks for the streaming work in particular — the whole partial_tag_overlap approach is nicely defensive.

@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

@Sandermage you are right, the description is a leftover from commit ExtReMLapin@c047919 that later got rollbacked because of bad interpretation of what it was doing. I updated the description, the fix is in is_reasoning_end_streaming

@ErykCh

ErykCh commented Apr 27, 2026

Copy link
Copy Markdown

Could you fix also cutting reasoning content, when there is a tool call?

image

so this is chunk with reasoning:

image

and this is next with function call, with logprobs from reasoning part:

image

@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

The issue is #39055 this PR @ErykCh

@ErykCh

ErykCh commented Apr 28, 2026

Copy link
Copy Markdown

The issue is #39055 this PR @ErykCh

I'm not so sure.

In my case, tool call is after think tag but still reasoning part is incorrectly parsed.

image

@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

@ErykCh you're right, my bad.

I tried to reproduce the error using OCR in a test, on main and on my branch, no error, can you reproduce it, either write a test or send me the whole raw output text ?

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@naroam1

naroam1 commented May 7, 2026

Copy link
Copy Markdown

Hi maintainers — chiming in as a downstream user.

This PR addresses the streaming side of #39056, which together with #39055 (non-streaming) covers the tool-call-inside-think breakage on Qwen3/Qwen3.5 — one of the open items blocking our planned migration to Qwen3.5-9B/27B for production agentic workloads with native streaming enabled.

The fragmented-tag analysis (< + tool + _call + > arriving as multiple deltas) is exactly the failure mode we expect to hit downstream, since most agentic clients (LiteLLM, OpenCode, Claude Code per upstream issue) consume the OpenAI streaming API. The PR has community validation (@Sandermage Apr 25) and the author has been responsive.

Could a code-owner for vllm/reasoning/qwen3_reasoning_parser.py / vllm/parser/abstract_parser.py take the review pass? Happy to validate against our production stack post-merge.

Thanks @ExtReMLapin for the deep analysis!

@gaby

gaby commented May 8, 2026

Copy link
Copy Markdown

@ExtReMLapin fix the merge conflicts

CNE Pierre FICHEPOIL and others added 2 commits May 11, 2026 09:09
Resolve conflict in vllm/parser/abstract_parser.py: keep both adjacent
blocks added after the tool_call streaming extraction. The reasoning
preservation block (HEAD) and the history_tool_call_cnt increment block
(main, PR vllm-project#41876) operate on orthogonal fields of delta_message/state
and are safe to apply in sequence.

Signed-off-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
@mergify

mergify Bot commented May 24, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ExtReMLapin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 24, 2026
@ElizaWszola

Copy link
Copy Markdown
Contributor

Hi @ExtReMLapin can you try to split this PR into multiple PRs, each solving a separate issue? (I can help if needed)

@ExtReMLapin

Copy link
Copy Markdown
Contributor Author

Hi @ExtReMLapin can you try to split this PR into multiple PRs, each solving a separate issue? (I can help if needed)

I agree this would be probably much better and easier to review.

Will also ask for guidlines on #40861

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working needs-rebase qwen Related to Qwen models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants