
common : fix Step-3.5-Flash format detection and thinking support #19635

Merged
pwilkin merged 3 commits into ggml-org:master from jesseposner:fix-step35-tool-call-detection
Feb 19, 2026

Conversation

@jesseposner
Contributor

@jesseposner jesseposner commented Feb 15, 2026

Summary

Step-3.5-Flash (196B MoE) uses the same XML tool call output format as Qwen3-Coder and Nemotron 3 Nano (`<tool_call><function=...><parameter=...>`), but its template lacks the bare `<function>` and plural `<parameters>` markers in the tool enumeration section. The previous detection logic required all five XML markers, so Step-3.5-Flash fell through to Hermes 2 Pro, which doesn't call `func_args_not_string()`. Tool arguments therefore stayed as JSON strings, and templates using `arguments|items` crashed.
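For context, a call in this shared output format has roughly the following shape (the function and parameter names are invented for illustration):

```
<tool_call>
<function=get_weather>
<parameter=location>
Paris
</parameter>
</function>
</tool_call>
```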

Reported by multiple users in #19283.

Approach

Per @aldehir's suggestion, Step-3.5-Flash is routed to the Nemotron v3 PEG parser rather than the simpler Qwen3-Coder GBNF grammar. The Nemotron v3 handler already provides `thinking_forced_open` handling, streaming support, and schema-aware parameter parsing.

The `<think>` tag in the template source distinguishes which parser to use: Step-3.5-Flash unconditionally emits `<think>`, Nemotron conditionally emits it (based on `enable_thinking`), and Qwen3-Coder has no `<think>` at all. Since both Step-3.5-Flash and Nemotron need thinking support, the presence of `<think>` routes to the PEG parser path.
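A minimal sketch of that routing logic (the enum and helper function below are invented for illustration; the real branches end in `common_chat_params_init_nemotron_v3` and `common_chat_params_init_qwen3_coder_xml`):

```cpp
#include <string>

// Illustrative sketch only, not the actual llama.cpp detection code.
// `src` is the raw Jinja chat-template source.
enum class xml_tool_call_route { none, nemotron_v3_peg, qwen3_coder_gbnf };

static xml_tool_call_route detect_xml_tool_call_route(const std::string & src) {
    const bool has_output_markers =
        src.find("<tool_call>") != std::string::npos &&
        src.find("<function=")  != std::string::npos &&
        src.find("<parameter=") != std::string::npos;
    if (!has_output_markers) {
        return xml_tool_call_route::none; // fall through to other format checks
    }
    if (src.find("<think>") != std::string::npos) {
        // Step-3.5-Flash (unconditional <think>) and Nemotron 3 Nano
        // (conditional <think>): both need thinking support, so use the PEG parser.
        return xml_tool_call_route::nemotron_v3_peg;
    }
    // Qwen3-Coder: no <think> in the template, so keep the GBNF grammar path.
    return xml_tool_call_route::qwen3_coder_gbnf;
}
```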

Changes

  • Relax detection to require only the 3 output format markers (`<tool_call>`, `<function=`, `<parameter=`), dropping the tool enumeration markers (`<function>`, `<parameters>`) that Qwen3-Coder and Nemotron have but Step-3.5-Flash lacks
  • Route models with `<think>` in the template source to `common_chat_params_init_nemotron_v3`; models without `<think>` continue to `common_chat_params_init_qwen3_coder_xml`
  • Add unmodified HuggingFace chat template

Testing

  • Format detection: verified `COMMON_CHAT_FORMAT_PEG_CONSTRUCTED` with `thinking_forced_open=true`
  • 9 PEG parser tests covering: basic messages with/without thinking, tool calls with/without thinking, parallel tool calls, code string parameters, optional `</parameter>` closing tags, JSON schema response format
  • All tests follow Nemotron v3 patterns, adapted for Step-3.5-Flash's unconditional `<think>` (all inputs include the `</think>` delimiter since `enable_thinking` defaults to true)
  • `test-chat` passes
  • Validated against live server (IQ3_XXS on M4 Max) via `/v1/chat/completions`:
    • Tool calls: `reasoning_content` properly separated, `tool_calls` parsed with correct arguments, `finish_reason: "tool_calls"` (see the sketch after this list)
    • Streaming: tool call argument deltas arrive incrementally (PEG parser)
    • Code string params: multi-line Python code round-trips correctly through JSON
    • Plain messages: thinking/content separation works
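For reference, a trimmed `/v1/chat/completions` response with this separation might look like the following (field layout follows the OpenAI-compatible schema; all values are invented):

```json
{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "message": {
        "role": "assistant",
        "content": null,
        "reasoning_content": "The user wants the weather, so call get_weather.",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\": \"Paris\"}"
            }
          }
        ]
      }
    }
  ]
}
```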

Related

AI Disclosure

Claude was used for codebase exploration, pattern identification, and drafting. All changes follow established patterns from the existing Nemotron v3 format handler. Fully tested locally.

@github-actions github-actions bot added the testing (Everything test related) label Feb 15, 2026
@jesseposner jesseposner force-pushed the fix-step35-tool-call-detection branch 4 times, most recently from 14738f5 to bd08b11 on February 15, 2026 08:51
@jesseposner jesseposner changed the title common : fix Step-3.5-Flash tool call format detection common : fix Step-3.5-Flash format detection and thinking support Feb 15, 2026
@jesseposner jesseposner force-pushed the fix-step35-tool-call-detection branch from bd08b11 to e26fa44 on February 15, 2026 09:12
@tarruda

tarruda commented Feb 15, 2026

Just tested this with codex and pi, seems to be working perfectly.

Collaborator

@pwilkin pwilkin left a comment


Please add tool calling and reasoning tests to the testcase.

@aldehir
Collaborator

aldehir commented Feb 15, 2026

I think you can just use the Nemotron parser here. It's the same format: qwen3-coder + thinking.

The difference lies in how it enumerates tools, which is why the original chat detection doesn't work.

@tarruda

tarruda commented Feb 16, 2026

While using this with the pi CLI agent, I saw the following artifact in the UI:

[screenshot of the UI artifact]

common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test

Builds on: ggml-org#19283

chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
@jesseposner jesseposner force-pushed the fix-step35-tool-call-detection branch from 093314f to bdc1dda on February 16, 2026 07:12
Collaborator

@aldehir aldehir left a comment


@tarruda if you have a chance, can you do a quick test to see if tool calling works as intended with the latest changes?

@tarruda

tarruda commented Feb 17, 2026

I did a quick test and couldn't see any special tokens in the frontend. One thing I noticed in the llama-server output was this message:

Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.

@aldehir
Collaborator

aldehir commented Feb 17, 2026

I am more interested in whether it makes tool calls. That warning is problematic; it might be a bug in the Jinja code.

@pwilkin
Collaborator

pwilkin commented Feb 17, 2026

That message is a known bug in the Jinja code; I've got it fixed on the autoparser branch (but @ngxson says he wants to fix it properly). The thing is, if the template just dumps tools by using `tools | tojson`, the usage detection mechanism doesn't trigger for `tools[0].function.name`, which underlies this capability check.

It should not affect parsing though.
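To make that concrete, the contrast is roughly between these two (invented) ways a template can render its tool section:

```jinja
{# Style the capability check can detect: individual tool fields are accessed,
   so tools[0].function.name is referenced during detection. #}
{% for tool in tools %}
{{ tool.function.name }}: {{ tool.function.description }}
{% endfor %}

{# Style that triggers the warning: tools are dumped wholesale, so the
   detection mechanism never sees tools[0].function.name. #}
{{ tools | tojson }}
```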

@tarruda

tarruda commented Feb 18, 2026

I am more interested in whether it makes tool calls. That warning is problematic; it might be a bug in the Jinja code.

Tool calling seems fine on the latest version.

@aldehir
Collaborator

aldehir commented Feb 19, 2026

@pwilkin

@pwilkin pwilkin merged commit 3dadc88 into ggml-org:master Feb 19, 2026
78 checks passed
ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request Feb 23, 2026
* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
common : fix Step-3.5-Flash format detection and thinking support (ggml-org#19635)
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Fix tool call for Qwen3.5
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
common : fix Step-3.5-Flash format detection and thinking support (ggml-org#19635)
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
common : fix Step-3.5-Flash format detection and thinking support (ggml-org#19635)

Labels

testing Everything test related


4 participants