common : fix Step-3.5-Flash format detection and thinking support #19635
pwilkin merged 3 commits into ggml-org:master
Conversation
Force-pushed from 14738f5 to bd08b11
Force-pushed from bd08b11 to e26fa44
Just tested this with codex and pi; it seems to be working perfectly.
pwilkin left a comment:
Please add tool calling and reasoning tests to the testcase.
I think you can just use the Nemotron parser here. It's the same format: qwen3-coder + thinking. The difference lies in how it enumerates tools, which is why the original chat detection doesn't work.
Force-pushed from 405ae40 to 093314f
common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>), but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have; otherwise reasoning_content is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten the Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to the Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls
- Add a Step-3.5-Flash chat template and format detection test

Builds on: ggml-org#19283
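For illustration, a minimal sketch of the root-rule change this commit describes, with made-up function and rule names rather than the actual llama.cpp code (note that the third commit below removes this path again once Step-3.5-Flash moves to the PEG parser):

```cpp
// Illustrative only: how a grammar root rule can admit a closing </think>
// before the first tool call when the generation prompt has already forced
// the <think> block open. "tool-call" and "space" are assumed to be rules
// defined elsewhere in the generated grammar.
#include <string>

static std::string build_root_rule(bool thinking_forced_open) {
    std::string prefix;
    if (thinking_forced_open) {
        // The model may close the already-open think block before calling
        // tools; the reasoning text itself is handled by lazy grammar
        // triggering, so it is not constrained here.
        prefix = "( \"</think>\" space )? ";
    }
    return "root ::= " + prefix + "tool-call+";
}
```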
chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with unconditional <think> output. Route it to the Nemotron v3 PEG parser for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use the Nemotron v3 PEG parser; templates without <think> (Qwen3-Coder) use the GBNF grammar.

Tests cover: basic messages, tool calls with and without thinking content, parallel tool calls, code string parameters, optional </parameter> closing tags, and JSON schema response format.
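As a concrete illustration of the separation this buys, here is a self-contained toy, not the real PEG parser (which additionally streams and validates parameters against the tool schema); the sample output and helper name are fabricated:

```cpp
// Toy sketch: with <think> forced open by the prompt, raw model output is
// reasoning text, then </think>, then any <tool_call> blocks. The API
// response must report the two parts as reasoning_content and tool calls.
#include <cassert>
#include <string>

struct parsed {
    std::string reasoning;   // maps to reasoning_content in API responses
    std::string tool_call;   // raw <tool_call>...</tool_call> payload
};

static parsed split_forced_think(const std::string & out) {
    parsed p;
    const std::string close = "</think>";
    const size_t end = out.find(close);
    if (end == std::string::npos) {
        p.reasoning = out;   // model never closed the block
        return p;
    }
    p.reasoning = out.substr(0, end);
    p.tool_call = out.substr(end + close.size());
    return p;
}

int main() {
    const parsed p = split_forced_think(
        "Need the weather.</think>"
        "<tool_call><function=get_weather>"
        "<parameter=city>Paris</parameter></function></tool_call>");
    assert(p.reasoning == "Need the weather.");
    assert(p.tool_call.rfind("<tool_call>", 0) == 0);
    return 0;
}
```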
chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think> in its template, so the thinking_forced_open logic, preserved tokens, and grammar prefix were dead paths.
Force-pushed from 093314f to bdc1dda
I did a quick test and couldn't see any special tokens in the frontend. One thing I noticed in the llama-server output was this message: […]
I am more interested in whether it tool calls. That warning is problematic; it might be a bug in the Jinja code.
That message is a known bug in the Jinja code; I've got it fixed on the autoparser branch (but @ngxson says he wants to fix it properly). The thing is, it happens when the template just dumps tools by using […]. It should not affect parsing, though.
Tool calling seems fine on the latest version.
* Fix tool call for Qwen3.5

  Loosely based on mainline changes from:
  * ggml-org/llama.cpp#19635
  * ggml-org/llama.cpp#19765

  Also need to change the grammar to allow the model to make multiple tool calls in a row. This was likely broken for Qwen3 Coder prior to this commit.

* Fix the grammar for the subsequent parameters after the first one
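The grammar change described there is roughly of this shape — illustrative GBNF with made-up rule names, not the actual rules from either patch:

```cpp
// Illustrative GBNF embedded in a C++ raw string. The key change is the "+"
// on the root rule: with "root ::= tool-call", the sampler rejected any
// second <tool_call> block; "tool-call+" permits consecutive calls.
static const char * xml_tool_call_grammar = R"GBNF(
root      ::= tool-call+                  # was: root ::= tool-call
tool-call ::= "<tool_call>" func "</tool_call>" "\n"?
func      ::= "<function=" ident ">" param* "</function>"
param     ::= "<parameter=" ident ">" value "</parameter>"
ident     ::= [a-zA-Z_] [a-zA-Z0-9_]*
value     ::= [^<]*
)GBNF";
```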

Summary
Step-3.5-Flash (196B MoE) uses the same XML tool call output format as Qwen3-Coder and Nemotron 3 Nano (`<tool_call><function=...><parameter=...>`), but its template lacks the bare `<function>` and plural `<parameters>` markers in the tool enumeration section. The previous detection logic required all five XML markers, so Step-3.5-Flash fell through to Hermes 2 Pro, which doesn't call `func_args_not_string()`. Tool arguments stayed as JSON strings, and templates using `arguments|items` crashed.
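To make the failure mode concrete, here is a minimal sketch of what `func_args_not_string()` guards against, using nlohmann::json (which llama.cpp vendors); the helper name `args_to_object` is invented for this example:

```cpp
#include <nlohmann/json.hpp>
using json = nlohmann::json;

// A tool call whose "arguments" field is the *string* "{\"city\":\"Paris\"}"
// gives a Jinja template nothing to iterate with `arguments|items`; parsing
// the string into an object restores the key/value pairs it expects.
static void args_to_object(json & tool_call) {
    json & args = tool_call.at("arguments");
    if (args.is_string()) {
        args = json::parse(args.get<std::string>());
    }
}
```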
Reported by multiple users in #19283.
Approach
Per @aldehir's suggestion, Step-3.5-Flash is routed to the Nemotron v3 PEG parser rather than the simpler Qwen3-Coder GBNF grammar. The Nemotron v3 handler already provides `thinking_forced_open` handling, streaming support, and schema-aware parameter parsing.
The `<think>` tag in the template source distinguishes which parser to use: Step-3.5-Flash unconditionally emits `<think>`, Nemotron conditionally emits it (based on `enable_thinking`), and Qwen3-Coder has no `<think>` at all. Since both Step-3.5-Flash and Nemotron need thinking support, the presence of `<think>` routes to the PEG parser path.
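Putting the two signals together, the detection rule reads roughly like this hedged sketch (the enum and function names are illustrative, not the real llama.cpp symbols):

```cpp
#include <string>

enum class xml_toolcall_parser { PEG_NEMOTRON_V3, GBNF_QWEN3_CODER, NONE };

static xml_toolcall_parser detect_xml_toolcall(const std::string & tmpl_src) {
    // The 3 markers shared by Step-3.5-Flash, Nemotron 3 Nano, and Qwen3-Coder.
    const bool has_xml_tags =
        tmpl_src.find("<tool_call>") != std::string::npos &&
        tmpl_src.find("<function=") != std::string::npos &&
        tmpl_src.find("<parameter=") != std::string::npos;
    if (!has_xml_tags) {
        return xml_toolcall_parser::NONE;
    }
    // Templates that emit <think> (Step-3.5-Flash unconditionally, Nemotron
    // 3 Nano conditionally) need thinking support -> streaming PEG parser.
    if (tmpl_src.find("<think>") != std::string::npos) {
        return xml_toolcall_parser::PEG_NEMOTRON_V3;
    }
    // Qwen3-Coder templates have no <think> -> GBNF grammar path.
    return xml_toolcall_parser::GBNF_QWEN3_CODER;
}
```

With a rule of this shape, relaxing the shared-marker requirement is safe: the `<think>` check alone decides between the PEG and GBNF paths.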
Changes
Testing
Related
AI Disclosure
Claude was used for codebase exploration, pattern identification, and drafting. All changes follow established patterns from the existing Nemotron v3 format handler. Fully tested locally.