Skip to content

Improve tagged tool parsing with reasoning#23773

Closed
bartdeboer wants to merge 2 commits into
ggml-org:masterfrom
bartdeboer:qwen-tools-reasoning-fixes
Closed

Improve tagged tool parsing with reasoning#23773
bartdeboer wants to merge 2 commits into
ggml-org:masterfrom
bartdeboer:qwen-tools-reasoning-fixes

Conversation

@bartdeboer

@bartdeboer bartdeboer commented May 27, 2026

Copy link
Copy Markdown

Overview

This PR fixes two tagged-tool parsing issues found while testing Qwen3.5 with reasoning enabled and XML-style tool calls.

Commit 1: Fix tagged tool calls inside reasoning

The current llama.cpp parser splits content parsing and reasoning parsing into separate paths. This means a <tool_call> emitted inside a <think>...</think> block can be consumed as reasoning_content instead of being parsed as a tool call.

This commit combines those paths for the TAG_WITH_TAGGED parser. Reasoning tags now act as text-mode switches:

  • text inside reasoning tags is emitted as reasoning_content
  • text outside reasoning tags is emitted as normal content
  • tagged tool calls are parsed as tool_calls in either mode

The lazy tagged-tool grammar was also adjusted so a triggered <tool_call>...</tool_call> does not have to be the final suffix of the completion. This allows the full parser to handle following segments such as </think> or later content.

Commit 2: Accept required tagged tool args in any order

Tagged parameters include their names in-band, for example <parameter=old_string>. Requiring these parameters to appear in schema/property iteration order is stricter than necessary and caused valid Qwen-style tool calls to fail parsing.

This commit keeps the existing grammar-based parser structure, but generates permutations for required tagged arguments so they can be accepted in any order. Optional arguments can still appear flexibly around them.

This is intended as a narrow, low-impact fix. A more scalable long-term design would be to parse tagged parameters as an unordered list, collect them by name, reject duplicates, check required parameters after parsing, and emit normalized arguments. That would avoid permutation growth, but would require a larger parser/mapper refactor.

To avoid pathological grammar growth, this commit caps the permutation path at six required parameters. Schemas with more than six required parameters keep the previous fixed-order behavior.

This fixes Qwen-style outputs such as:

<think>
Reasoning...

<tool_call>
<function=edit>
<parameter=file>...</parameter>
<parameter=old_string>...</parameter>
<parameter=new_string>...</parameter>
</function>
</tool_call>
</think>

Additional information

This was tested downstream with ctgbot and Qwen3.5 9B using reasoning enabled and XML-style tool calls.

A patched CUDA image used for downstream validation is available here:

ghcr.io/bartdeboer/llama-cpp:server-cuda-ctgbot-patches-a03afef

Digest:

sha256:a952fea7e739130000e9bca0f4f886a76ba0bb95a932fe23d56b929ef8f1255e

Tested locally with:

./build/bin/test-chat --suppress-debug

Related issues:

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES - AI assistance was used for investigation and prototyping. I have a software engineering background, but I am not primarily a C++/llama.cpp contributor. I designed the code shape together with Codex, reviewed the changes, tested them locally, and am responsible for the submitted code.

@github-actions github-actions Bot added the testing Everything test related label May 27, 2026
@bartdeboer bartdeboer force-pushed the qwen-tools-reasoning-fixes branch from d0fdb87 to ed0384d Compare May 29, 2026 21:03
@bartdeboer bartdeboer force-pushed the qwen-tools-reasoning-fixes branch from 2784e8b to ce979f0 Compare June 2, 2026 08:42
@bartdeboer bartdeboer marked this pull request as ready for review June 2, 2026 08:55
@bartdeboer bartdeboer requested review from a team and pwilkin as code owners June 2, 2026 08:55
@Atomic-Germ

Copy link
Copy Markdown
Contributor

Wouldn't a tool call in the middle of a reasoning block be a bad thing? The model isn't expecting it to actually do anything.

@bartowski1182

Copy link
Copy Markdown
Contributor

Some models do this, Kimi K2 I believe for example will reason, tool call, reason more

I do agree it feels weird, but since they do it we should support it

@bartdeboer

Copy link
Copy Markdown
Author

Wouldn't a tool call in the middle of a reasoning block be a bad thing? The model isn't expecting it to actually do anything.

For Qwen3.5-style outputs, the important case is not that tools should execute “inside” reasoning. Rather, the model may start emitting a valid <tool_call> before it emits the closing </think> tag.

vLLM handles this in its Qwen3 reasoning parser by treating <tool_call> as an implicit end of reasoning:

https://docs.vllm.ai/en/stable/api/vllm/reasoning/qwen3_reasoning_parser/

This PR accepts tagged <tool_call> blocks as valid in either reasoning blocks or normal content blocks.

@aldehir

aldehir commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Since this behavior is specific to Qwen, at least for now, I would rather implement a specialized parser than baking this into the autoparser. That said, we have discussed masking the <tool_call> token within reasoning as an alternative.

cc @pwilkin for discussion.

@bartdeboer

Copy link
Copy Markdown
Author

I can open a second PR that implements this as a Qwen-specific parser. That could also address the tool args order issue, although that seems like a more generic tagged-parameter parsing issue.

The main thing I am unsure about is the detection strategy. The existing specialized parsers seem to mostly detect distinctive template signatures, but the Qwen markers (<think>, <tool_call>, <function=, <parameter=) are somewhat generic XML-style markers.

Would you prefer template-signature detection, or would it be acceptable to also use model metadata such as general.architecture == qwen35 when available?

@pwilkin

pwilkin commented Jun 4, 2026

Copy link
Copy Markdown
Member

The approach @aldehir mentioned is in #23478, I'll rebase it on current master.

@bartdeboer

Copy link
Copy Markdown
Author

I tested PR #23478 with Qwen3.5-9B in an agent/toolloop setup.

Test image:

ghcr.io/bartdeboer/llama-cpp:server-cuda-pr23478-44a3568
digest: sha256:d520c1773f665e47053c65bece1d2d243da6b10b5b5b8f2e0e307cfbfdfceb31

I enabled the new option with:

LLAMA_ARG_THINK_PREVENT_TOOL_CALL=1

The option was active: the server started cleanly and logs showed reasoning-budget activation on requests.

Results were mixed.

In one test, the tool loop completed 40 iterations with parsed tool calls successfully.

However, in a later wordcount/test follow-up, it failed. The model appeared to be blocked from emitting the opening <tool_call> while still in reasoning, but then leaked dangling closing fragments into user-visible progress/reasoning text:

The file already has the correct import. Let me run the tests again.

</parameter>
</function>
</tool_call>

So this PR appears to help, and can work for some toolloop paths, but in my testing it did not recover when Qwen had already decided to emit a tagged tool call inside reasoning. Blocking the opening token can leave the rest of the tagged structure to leak as text.

@bartdeboer

Copy link
Copy Markdown
Author

Opened #24202 that implements the specialized tagged thinking/tool parser approach.

@bartdeboer

Copy link
Copy Markdown
Author

Closing this PR in favor of #24202 since I can have only one open PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

testing Everything test related

Projects

None yet

5 participants