
common/reasoning-budget: force tool call immediately after budget ends, prevent tool call token in reasoning section#23478

Draft
pwilkin wants to merge 3 commits into ggml-org:master from pwilkin:res-budget-force-call

Conversation

@pwilkin

@pwilkin pwilkin commented May 21, 2026

Member

Overview

As the title says. This is a proof of concept, intended only for testing for now.

Additional information

Activated via --reasoning-budget-force-tool-call

Requirements

Comment thread common/chat.cpp Outdated
@aldehir

aldehir commented May 22, 2026

Contributor

We can probably use the new tool call start param to initialize an 'exclude_tokens' array within the reasoning budget sampler to prevent certain tokens during reasoning.
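The suggestion above could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `token_id`, `reasoning_budget_state`, and `apply_exclusions` are hypothetical names, and the real sampler in llama.cpp works on its own token and logit types. The idea is only that tokens listed in `exclude_tokens` get a logit of negative infinity while the model is still inside the reasoning section.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a token id (llama.cpp has its own token type).
using token_id = std::int32_t;

// Hypothetical sampler state, assumed for this sketch only.
struct reasoning_budget_state {
    bool in_reasoning = true;                     // true until the end-of-thinking tag is seen
    std::unordered_set<token_id> exclude_tokens;  // e.g. the first token of the tool-call start tag
};

// While reasoning is active, make every excluded token impossible to sample
// by setting its logit to -infinity. Outside reasoning, logits are untouched.
void apply_exclusions(const reasoning_budget_state & st, std::vector<float> & logits) {
    if (!st.in_reasoning) {
        return;
    }
    for (token_id id : st.exclude_tokens) {
        // Ignore ids that fall outside the vocabulary covered by `logits`.
        if (id >= 0 && static_cast<std::size_t>(id) < logits.size()) {
            logits[static_cast<std::size_t>(id)] = -std::numeric_limits<float>::infinity();
        }
    }
}
```

In this sketch the set would be filled once from the tool-call start tag when the sampler is created, and `in_reasoning` would be cleared when the thinking end tag is emitted.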

@pwilkin pwilkin force-pushed the res-budget-force-call branch from 7c2903e to 8fad92b Compare May 23, 2026 15:37
…in reasoning (`--reasoning-block-tool-call-start`)
@github-actions github-actions Bot added the testing Everything test related label May 23, 2026
@pwilkin

pwilkin commented May 23, 2026

Member Author

@aldehir you want it, you got it :)

@pwilkin pwilkin changed the title common/reasoning-budget: force tool call immediately after budget ends common/reasoning-budget: force tool call immediately after budget ends, prevent tool call token in reasoning section May 23, 2026
@ggerganov

Member

I think I noticed a problem with the budget reasoning end logic (this is on master):

Using the following parameters:

reasoning-budget     = 4096
reasoning-budget-message = "... I am thinking for tool long and cannot make a decision. I will now explain the problem to the user and ask them for advice."
chat-template-kwargs = {"preserve_thinking": true}

After I hit the reasoning limit during a message with a long thinking phase, got a response, and then asked a follow-up question, I saw the following logs:

[54288] 5.12.568.305 I slot launch_slot_: id  3 | task 1681 | processing task, is_child = 0
[54288] 5.12.568.314 I slot update_slots: id  3 | task 1681 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 5572
[54288] 5.12.568.330 W slot update_slots: id  3 | task 1681 | old: ...  them for advice." | </think>
[54288] 
[54288] To support both GitHub
[54288] 5.12.568.333 W slot update_slots: id  3 | task 1681 | new: ...  them for advice." | 
[54288] </think>
[54288] 
[54288] To support both
[54288] 5.12.568.334 W slot update_slots: id  3 | task 1681 |     1070     364    9183    1149  248069     271    1206    1761    2107   31038
[54288] 5.12.568.334 W slot update_slots: id  3 | task 1681 |     1070     364    9183    1149     198  248069     271    1206    1761    2107
[54288] 5.12.568.335 W slot update_slots: id  3 | task 1681 | n_past = 4483, slot.prompt.tokens.size() = 5504, seq_id = 3, pos_min = 5503, n_swa = 0
[54288] 5.12.568.336 I slot update_slots: id  3 | task 1681 | Checking checkpoint with [355, 355] against 4482...
[54288] 5.12.572.427 W slot update_slots: id  3 | task 1681 | restored context checkpoint (pos_min = 355, pos_max = 355, n_tokens = 356, n_past = 356, size = 151.024 MiB)
[54288] 5.12.572.433 I slot update_slots: id  3 | task 1681 | cached n_tokens = 356, memory_seq_rm [356, end)

So instead of continuing the generation, we have to go back to an old checkpoint because there is a newline in the newly formatted prompt. That newline, right before the </think> token, was not emitted during the first assistant turn.

Is this the expected behavior?
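To make the mismatch in the two token rows of the log concrete, here is a small sketch of a common-prefix comparison. It is an illustration only and not the server's cache-reuse code; `common_prefix_len` is a hypothetical helper. Applied to the two ten-token windows printed above, the sequences diverge at the fifth position, where the second row contains an extra token 198 (apparently the newline) before 248069.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Length of the longest shared prefix of two token sequences.
// Cached state can only be reused up to this point; everything after the
// first differing token has to be re-evaluated.
std::size_t common_prefix_len(const std::vector<std::int32_t> & cached,
                              const std::vector<std::int32_t> & prompt) {
    std::size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) {
        n++;
    }
    return n;
}

// The two token windows copied from the log lines above.
const std::vector<std::int32_t> old_window = {1070, 364, 9183, 1149, 248069, 271, 1206, 1761, 2107, 31038};
const std::vector<std::int32_t> new_window = {1070, 364, 9183, 1149, 198, 248069, 271, 1206, 1761, 2107};
```

Calling `common_prefix_len(old_window, new_window)` yields 4: a single inserted token shifts every later position, so nothing after it matches the cached sequence.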

@aldehir

aldehir commented May 24, 2026

Contributor

> Is this the expected behavior?

No, it's a consequence of trimming whitespace. This change in this PR will probably solve it:

@@ -2396,7 +2405,8 @@ static common_chat_params common_chat_templates_apply_jinja(const struct common_
         auto_params.supports_thinking = autoparser.reasoning.mode != autoparser::reasoning_mode::NONE;
         if (auto_params.supports_thinking) {
             auto_params.thinking_start_tag = trim_whitespace(autoparser.reasoning.start);
-            auto_params.thinking_end_tag   = trim_whitespace(autoparser.reasoning.end);
+            auto_params.thinking_end_tag   = autoparser.reasoning.end;
+            auto_params.tool_start_tag     = autoparser.tools.format.section_start.empty() ? autoparser.tools.format.per_call_start : autoparser.tools.format.section_start;
         }
         common_peg_arena arena;
         arena.load(auto_params.parser);

But then we risk not matching the end sequence. Might need to improve this.
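One possible direction, sketched under assumptions and not taken from the PR: keep the untrimmed end tag for anything that is injected into or rendered in the prompt, and use a whitespace-trimmed copy only for detecting the end of reasoning. The names `end_tag`, `make_end_tag`, and `closes_reasoning` are hypothetical, and the raw tag value used in the usage note is assumed from the log output rather than read from a template.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Strip leading and trailing ASCII whitespace.
static std::string strip_ws(const std::string & s) {
    const char * ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return "";
    }
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Hypothetical pair of forms for the thinking end tag.
struct end_tag {
    std::string raw;      // exactly as the template renders it, whitespace included
    std::string trimmed;  // whitespace-stripped form, used only for matching
};

end_tag make_end_tag(const std::string & raw) {
    return { raw, strip_ws(raw) };
}

// True if the generated text ends with the tag, ignoring surrounding whitespace.
bool closes_reasoning(const std::string & generated, const end_tag & tag) {
    assert(!tag.trimmed.empty());  // a tag made only of whitespace cannot be matched
    const std::string g = strip_ws(generated);
    const std::string & t = tag.trimmed;
    return g.size() >= t.size() && g.compare(g.size() - t.size(), t.size(), t) == 0;
}
```

With a tag built from an assumed raw value such as `"\n</think>\n\n"`, both `...advice."</think>` and `...advice."\n</think>\n\n` are detected as closing the reasoning section, while the raw form stays available for re-rendering the prompt consistently.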

@ggerganov

ggerganov commented May 24, 2026

Member

Btw, I also noticed that the quotation marks " around the reasoning-budget-message are also injected in the context - why is that?

@aldehir

aldehir commented May 24, 2026

Contributor

> Btw, I also noticed that the quotation marks " around the reasoning-budget-message are also injected in the context - why is that?

I don't believe the ini file uses quoted strings; it accepts everything after the = as the value, with whitespace trimmed.
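The behavior described here can be illustrated with a small sketch. This is not llama.cpp's preset parser; `trim_ws` and `parse_ini_line` are hypothetical helpers that only mimic the stated rule: split at the first `=`, trim whitespace on both sides, and treat quotation marks as ordinary characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Strip leading and trailing ASCII whitespace.
static std::string trim_ws(const std::string & s) {
    const char * ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) {
        return "";
    }
    return s.substr(b, s.find_last_not_of(ws) - b + 1);
}

// Split "key = value" at the first '='. Quotes get no special treatment,
// so they remain part of the value. Returns an empty key if there is no '='.
std::pair<std::string, std::string> parse_ini_line(const std::string & line) {
    const std::size_t eq = line.find('=');
    if (eq == std::string::npos) {
        return { "", "" };
    }
    return { trim_ws(line.substr(0, eq)), trim_ws(line.substr(eq + 1)) };
}
```

Under this rule a line such as `reasoning-budget-message = "..."` produces a value that still begins and ends with `"`, which would explain why the quotation marks end up in the context. Writing the message without surrounding quotes would avoid that.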

@bartdeboer


I tested this with Qwen3.5-9B in an agent/tool-loop setup using LLAMA_ARG_THINK_PREVENT_TOOL_CALL=1.

Some runs worked. Tool calls were parsed correctly and the loop succeeded through multiple iterations.

But then a run failed with closing-tag fragments inside the reasoning output:

Reasoning output...

</parameter>
</function>
</tool_call>

It looks like the <tool_call> start tag was blocked successfully. But in Qwen's output sequence, the model had already finished reasoning and moved into the tool-call phase, and blocking the start tag did not make it change course.

I have opened #24202 with a specialized parser approach (and closed #23773 because of the one-open-PR guideline).


Labels

examples, server, testing (Everything test related)

4 participants