common/reasoning-budget: force tool call immediately after budget ends, prevent tool call token in reasoning section by pwilkin · Pull Request #23478 · ggml-org/llama.cpp

pwilkin · 2026-05-21T13:38:00Z

Overview

As the title says. This is a proof of concept for testing for now.

Additional information

Activated via --reasoning-budget-force-tool-call

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

aldehir · 2026-05-22T21:11:31Z

We can probably use the new tool call start param to initialize an 'exclude_tokens' array within the reasoning budget sampler to prevent certain tokens during reasoning.
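The idea of an exclusion list can be sketched as a logit mask. This is a minimal illustration only: `reasoning_mask`, its fields, and `apply` are hypothetical names and not the actual llama.cpp sampler interface. It assumes the sampler sees a plain vector of logits indexed by token id and knows whether generation is still inside the reasoning section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the reasoning budget sampler state (not the llama.cpp type).
struct reasoning_mask {
    std::unordered_set<int32_t> exclude_tokens; // e.g. token id(s) of the tool call start tag
    bool in_reasoning = true;                   // set to false once the reasoning end tag is seen

    // While reasoning is active, set the logit of every excluded token to -inf
    // so that it cannot be sampled. Outside of reasoning the logits are untouched.
    void apply(std::vector<float> & logits) const {
        if (!in_reasoning) {
            return;
        }
        for (int32_t id : exclude_tokens) {
            if (id >= 0 && static_cast<size_t>(id) < logits.size()) {
                logits[static_cast<size_t>(id)] = -std::numeric_limits<float>::infinity();
            }
        }
    }
};
```

A mask like this only blocks the listed token ids. It does not stop the model from producing the rest of a tool call through other tokens.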

pwilkin · 2026-05-23T17:03:34Z

@aldehir you want it, you got it :)

ggerganov · 2026-05-24T06:07:30Z

I think I noticed a problem with the budget reasoning end logic (this is on master):

Using the following parameters:

reasoning-budget     = 4096
reasoning-budget-message = "... I am thinking for tool long and cannot make a decision. I will now explain the problem to the user and ask them for advice."
chat-template-kwargs = {"preserve_thinking": true}

After the model hit the reasoning limit during a message with long thinking and then responded, I asked a follow-up question and saw the following logs:

[54288] 5.12.568.305 I slot launch_slot_: id  3 | task 1681 | processing task, is_child = 0
[54288] 5.12.568.314 I slot update_slots: id  3 | task 1681 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 5572
[54288] 5.12.568.330 W slot update_slots: id  3 | task 1681 | old: ...  them for advice." | </think>
[54288] 
[54288] To support both GitHub
[54288] 5.12.568.333 W slot update_slots: id  3 | task 1681 | new: ...  them for advice." | 
[54288] </think>
[54288] 
[54288] To support both
[54288] 5.12.568.334 W slot update_slots: id  3 | task 1681 |     1070     364    9183    1149  248069     271    1206    1761    2107   31038
[54288] 5.12.568.334 W slot update_slots: id  3 | task 1681 |     1070     364    9183    1149     198  248069     271    1206    1761    2107
[54288] 5.12.568.335 W slot update_slots: id  3 | task 1681 | n_past = 4483, slot.prompt.tokens.size() = 5504, seq_id = 3, pos_min = 5503, n_swa = 0
[54288] 5.12.568.336 I slot update_slots: id  3 | task 1681 | Checking checkpoint with [355, 355] against 4482...
[54288] 5.12.572.427 W slot update_slots: id  3 | task 1681 | restored context checkpoint (pos_min = 355, pos_max = 355, n_tokens = 356, n_past = 356, size = 151.024 MiB)
[54288] 5.12.572.433 I slot update_slots: id  3 | task 1681 | cached n_tokens = 356, memory_seq_rm [356, end)

So instead of continuing the generation, we have to go back to an old checkpoint because there is a newline in the newly formatted prompt. That newline, right before the </think> token, was not emitted during the first assistant turn.

Is this the expected behavior?
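To make the cache miss concrete, here is a small sketch that compares the two token windows printed in the log. `common_prefix_len` is a hypothetical helper written for this illustration, not the function the server uses. The extra token 198 in the new sequence is presumably the newline before </think>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of leading tokens shared by the cached sequence and the new prompt.
// Only this shared prefix can be reused; everything after it has to be re-evaluated.
static size_t common_prefix_len(const std::vector<int32_t> & cached,
                                const std::vector<int32_t> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        ++n;
    }
    return n;
}

// Token windows copied from the two log lines above (old sequence, then new sequence).
static const std::vector<int32_t> old_tokens = {1070, 364, 9183, 1149, 248069, 271, 1206, 1761, 2107, 31038};
static const std::vector<int32_t> new_tokens = {1070, 364, 9183, 1149, 198, 248069, 271, 1206, 1761, 2107};
```

For these two windows, `common_prefix_len(old_tokens, new_tokens)` returns 4: the sequences diverge at the inserted token 198, so nothing after that position in the cached sequence can be reused.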

aldehir · 2026-05-24T06:21:41Z

> Is this the expected behavior?

No, it's a consequence of trimming whitespace. This change in this PR will probably solve it:

@@ -2396,7 +2405,8 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
         auto_params.supports_thinking = autoparser.reasoning.mode != autoparser::reasoning_mode::NONE;
         if (auto_params.supports_thinking) {
             auto_params.thinking_start_tag = trim_whitespace(autoparser.reasoning.start);
-            auto_params.thinking_end_tag   = trim_whitespace(autoparser.reasoning.end);
+            auto_params.thinking_end_tag   = autoparser.reasoning.end;
+            auto_params.tool_start_tag     = autoparser.tools.format.section_start.empty() ? autoparser.tools.format.per_call_start : autoparser.tools.format.section_start;
         }
         common_peg_arena arena;
         arena.load(auto_params.parser);

But then we risk not matching the end sequence. Might need to improve this.
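One possible direction, sketched here purely as an assumption and not as what this PR does, is to keep two forms of the end tag: the untrimmed one for injection when the budget runs out, and a trimmed one for detecting that the model closed the reasoning section on its own. `end_tag`, `trim_ws`, and the example tag string are hypothetical, and the sketch works on strings rather than tokens.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return a copy of s without leading and trailing whitespace.
static std::string trim_ws(const std::string & s) {
    const char * ws = " \t\r\n";
    const size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return "";
    }
    const size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical holder for both forms of the reasoning end tag.
struct end_tag {
    std::string inject; // untrimmed: inserted verbatim so the text matches the template
    std::string match;  // trimmed: used to detect the tag regardless of surrounding whitespace

    explicit end_tag(const std::string & raw) : inject(raw), match(trim_ws(raw)) {}

    // True if the generated text contains the trimmed end tag anywhere.
    bool found_in(const std::string & generated) const {
        return !match.empty() && generated.find(match) != std::string::npos;
    }
};
```

With a raw tag such as `"\n</think>\n\n"` (an assumed value), `inject` keeps the newlines while `match` is just `</think>`, so detection does not depend on the exact whitespace the model emitted.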

ggerganov · 2026-05-24T06:22:10Z

Btw, I also noticed that the quotation marks " around the reasoning-budget-message are also injected in the context - why is that?

aldehir · 2026-05-24T06:24:32Z

> Btw, I also noticed that the quotation marks " around the reasoning-budget-message are also injected in the context - why is that?

I don't believe the ini file uses quoted strings; it accepts everything after the = as the value, with whitespace trimmed.
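The behavior described above can be illustrated with a small sketch. `parse_ini_line` and `trim_copy` are hypothetical helpers written for this example and are not the actual llama.cpp ini parser. The point is only that splitting at the first = and trimming whitespace leaves any quotation marks in the value.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Return a copy of s without leading and trailing whitespace.
static std::string trim_copy(const std::string & s) {
    const char * ws = " \t\r\n";
    const size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return "";
    }
    const size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Split "key = value" at the first '=' and trim both sides.
// Quotation marks get no special treatment, so they stay part of the value.
// Returns false if there is no '=' or the key is empty.
static bool parse_ini_line(const std::string & line, std::string & key, std::string & value) {
    const size_t eq = line.find('=');
    if (eq == std::string::npos) {
        return false;
    }
    key   = trim_copy(line.substr(0, eq));
    value = trim_copy(line.substr(eq + 1));
    return !key.empty();
}
```

Under this scheme, a line such as `msg = "hello"` yields the value `"hello"` including both quotation marks, which matches the quotes showing up in the context.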

bartdeboer · 2026-06-06T07:46:35Z

I tested this with Qwen3.5-9B in an agent/tool-loop setup using LLAMA_ARG_THINK_PREVENT_TOOL_CALL=1.

Some runs worked. Tool calls were parsed correctly and the loop succeeded through multiple iterations.

But then it failed, with closing-tag fragments appearing inside the reasoning output:

Reasoning output...

</parameter>
</function>
</tool_call>

It looks like the <tool_call> start tag was blocked successfully. But by that point in its output, Qwen had already finished reasoning and moved into the tool call phase. Blocking the start tag did not make it change course.

I have opened #24202 with a specialized parser approach (and closed #23773 because of the one-open-PR guideline).

aldehir reviewed May 21, 2026

Comment thread common/chat.cpp Outdated

github-actions Bot added examples server labels May 21, 2026

pwilkin added 2 commits May 23, 2026 17:37

common/reasoning-budget: force tool call immediately after budget ends

f2e526a

Fix gemma reasoning opener

8fad92b

pwilkin force-pushed the res-budget-force-call branch from 7c2903e to 8fad92b May 23, 2026 15:37

Add option to prevent the generation of the tool call starting token in reasoning (`--reasoning-block-tool-call-start`)

44a3568

github-actions Bot added the testing Everything test related label May 23, 2026

pwilkin changed the title ~~common/reasoning-budget: force tool call immediately after budget ends~~ common/reasoning-budget: force tool call immediately after budget ends, prevent tool call token in reasoning section May 23, 2026

pwilkin mentioned this pull request Jun 4, 2026

Improve tagged tool parsing with reasoning #23773

Closed

pwilkin wants to merge 3 commits into ggml-org:master from pwilkin:res-budget-force-call
