fix: tool-call JSON leaks into content with stream+tools on tokenizer-template models (#10052) by localai-bot · Pull Request #10057 · mudler/LocalAI

localai-bot · 2026-05-29T07:53:36Z

What

Fixes #10052: with stream: true + tools on a tokenizer-template model (e.g. qwen3), the chat-completions endpoint streamed the tool-call JSON object out through the content field (in addition to emitting the real tool_calls).

Root cause

When use_tokenizer_template delegates templating to the backend (llama.cpp), the backend also owns tool-call grammar generation and parsing. LocalAI was still generating its own GBNF grammar and sending it down. With a Go grammar present, llama.cpp does not pass the tools to its template (has_grammar_from_go=1), so its native peg/json tool parser never engages — it streams the grammar-constrained tool-call JSON back as plain content rather than tool_calls.

Two contributing defects:

- Grammar/template inconsistency. The GGUF auto-import path already pairs use_tokenizer_template with grammar.disable, but that block is skipped when a template is already configured, so gallery and hand-written configs (qwen3) that set the tokenizer template directly never got the paired grammar.disable.
- Property-order sentinel bug. The JSON-schema→GBNF sort used aOrder != 0 && bOrder != 0 as its guard, treating index 0 (the first key in properties_order) as "unset". So even properties_order: name,arguments fell back to alphabetical order and emitted arguments before name, which prevented the streaming detector from ever gating content.
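
The sentinel defect can be reproduced in isolation. The sketch below is a minimal reconstruction, not the code from pkg/functions/grammars: the names `orderIndex`, `sortBuggy`, and `sortFixed`, and the map-based index lookup, are assumptions made for illustration. It only shows why a `!= 0` guard cannot distinguish "listed first" from "not listed", and how a presence check avoids that.

```go
package main

import (
	"fmt"
	"sort"
)

// orderIndex maps each key listed in properties_order to its position.
// Hypothetical helper; the real code may build this lookup differently.
func orderIndex(order []string) map[string]int {
	idx := make(map[string]int, len(order))
	for i, k := range order {
		idx[k] = i
	}
	return idx
}

// sortBuggy mirrors the old guard. A missing map key reads as 0, so the
// key at index 0 looks identical to a key that was never listed.
func sortBuggy(keys, order []string) []string {
	idx := orderIndex(order)
	out := append([]string(nil), keys...)
	sort.SliceStable(out, func(i, j int) bool {
		aOrder, bOrder := idx[out[i]], idx[out[j]]
		if aOrder != 0 && bOrder != 0 {
			return aOrder < bOrder
		}
		return out[i] < out[j] // alphabetical fallback
	})
	return out
}

// sortFixed checks presence in the map instead of comparing against 0.
// Listed keys sort by index and ahead of unlisted keys; unlisted keys
// stay alphabetical.
func sortFixed(keys, order []string) []string {
	idx := orderIndex(order)
	out := append([]string(nil), keys...)
	sort.SliceStable(out, func(i, j int) bool {
		aOrder, aOK := idx[out[i]]
		bOrder, bOK := idx[out[j]]
		switch {
		case aOK && bOK:
			return aOrder < bOrder
		case aOK != bOK:
			return aOK // the listed key wins
		default:
			return out[i] < out[j]
		}
	})
	return out
}

func main() {
	keys := []string{"arguments", "name"}
	order := []string{"name", "arguments"}
	fmt.Println(sortBuggy(keys, order)) // [arguments name]
	fmt.Println(sortFixed(keys, order)) // [name arguments]
}
```

With the buggy guard, `name` has index 0, the guard fails for every comparison involving it, and the sort silently degrades to alphabetical order.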

Changes

- SetDefaults now couples use_tokenizer_template ⇒ grammar.disable for every config, so gallery, hand-written, and GGUF-import configs agree. Tools flow to the backend's native (name-first) pipeline. This also fixes already-installed models without editing each config.
- gallery/qwen3.yaml (the shared base referenced by every qwen3 entry) sets function.grammar.disable: true explicitly.
- Fix the properties_order sentinel so a listed key at index 0 is honored (presence-in-map instead of != 0).
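
The coupling in the first bullet amounts to a one-way implication applied at defaults time. The following is a sketch under stated assumptions: the struct types, field names, and the helper `applyTemplateGrammarCoupling` are simplified stand-ins invented for this example, not LocalAI's actual `ModelConfig` or `SetDefaults`. The placement of `use_tokenizer_template` under a template section is likewise assumed.

```go
package main

import "fmt"

// The types below are simplified, hypothetical stand-ins for the real
// model configuration; only the two fields relevant here are modelled.
type GrammarConfig struct {
	Disable bool // mirrors function.grammar.disable
}

type FunctionsConfig struct {
	Grammar GrammarConfig
}

type TemplateConfig struct {
	UseTokenizerTemplate bool // mirrors use_tokenizer_template (location assumed)
}

type ModelConfig struct {
	Template TemplateConfig
	Function FunctionsConfig
}

// applyTemplateGrammarCoupling is a hypothetical helper showing the rule:
// when the backend owns templating, Go-side grammar generation is turned
// off. It never re-enables a grammar that was disabled explicitly.
func applyTemplateGrammarCoupling(c *ModelConfig) {
	if c.Template.UseTokenizerTemplate {
		c.Function.Grammar.Disable = true
	}
}

func main() {
	cfg := ModelConfig{Template: TemplateConfig{UseTokenizerTemplate: true}}
	applyTemplateGrammarCoupling(&cfg)
	fmt.Println(cfg.Function.Grammar.Disable) // true
}
```

Because the rule runs for every config rather than only on the import path, a config that sets the tokenizer template by hand ends up in the same state as an auto-imported one.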

Verification

End-to-end against qwen3-4b (stream:true + tools), built from this branch:

- Plain answer: content now carries the actual text (Hello! How can I assist you today?), reasoning is classified separately, and no JSON leaks. Previously content dribbled out {"arguments":...,"name":"answer"}.
- Tool call: streams proper name-first tool_calls deltas ({"name":"exec",...} then the arguments) with zero content leak.
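
A check like the one described above can be automated against collected stream chunks. The sketch below is illustrative only: the `Delta` type and `contentLeaksToolCall` are hypothetical, and the heuristic (the concatenated content parses as a JSON object carrying both `name` and `arguments`) is an assumption about what a leak looks like, based on the payload quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Delta is a simplified, hypothetical stand-in for one streamed
// chat-completion chunk; only the content field is modelled.
type Delta struct {
	Content string
}

// contentLeaksToolCall reports whether the concatenated content of all
// deltas is a JSON object with both "name" and "arguments" keys, which
// is the shape of the leaked tool-call payload.
func contentLeaksToolCall(deltas []Delta) bool {
	var sb strings.Builder
	for _, d := range deltas {
		sb.WriteString(d.Content)
	}
	text := strings.TrimSpace(sb.String())
	if !strings.HasPrefix(text, "{") {
		return false
	}
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(text), &obj); err != nil {
		return false
	}
	_, hasName := obj["name"]
	_, hasArgs := obj["arguments"]
	return hasName && hasArgs
}

func main() {
	leaked := []Delta{{`{"arguments":`}, {`{"q":"x"},`}, {`"name":"answer"}`}}
	clean := []Delta{{"Hello! "}, {"How can I assist you today?"}}
	fmt.Println(contentLeaksToolCall(leaked)) // true
	fmt.Println(contentLeaksToolCall(clean))  // false
}
```

The heuristic only recognises a content stream that is a single JSON object; a leak mixed with prose would need a more tolerant scan.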

New tests (TDD, Ginkgo):

- core/config: SetDefaults disables Go grammar when the tokenizer template is used (and leaves it enabled otherwise).
- pkg/functions/grammars: properties_order: name,arguments yields name-first; empty order stays alphabetical; listed keys sort ahead of unlisted ones.

golangci-lint run --new-from-merge-base=master → 0 issues.

🤖 Generated with Claude Code

The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as its "is this key ordered?" guard. That treats index 0, the first key listed in properties_order, as unset, so `properties_order: name,arguments` fell back to alphabetical ordering and still emitted "arguments" before "name".

Use presence in the order map instead: listed keys sort by their index and ahead of unlisted keys, which keep a stable alphabetical order. This makes the documented `properties_order: name,arguments` actually produce name-first tool-call JSON.

Relates to #10052.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

…template owns templating (#10052)

When use_tokenizer_template delegates templating to the backend (llama.cpp), the backend also owns tool-call grammar generation and parsing. LocalAI was still generating its own GBNF grammar and sending it down. With a grammar present, llama.cpp does not hand the tools to its template, so its native peg/json tool parser never engages: it streams the grammar-constrained tool-call JSON back as plain content instead of emitting tool_calls.

In streaming mode the JSON object leaked into the content field, and the Go-side incremental detector never gated content because the LocalAI-generated grammar emitted "arguments" before "name".

The GGUF auto-import path already couples use_tokenizer_template with grammar.disable, but that block is skipped when a template is already configured, so gallery and hand-written configs (e.g. qwen3) that set the tokenizer template directly never got the paired grammar.disable.

- SetDefaults now enforces the coupling for every config: when use_tokenizer_template is set, grammar generation is disabled and tools flow to the backend's native (name-first) pipeline. This also fixes already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is the base config referenced by every qwen3 gallery entry.

Verified end to end against qwen3-4b with stream:true + tools: content no longer carries the tool-call JSON, reasoning is classified separately, and tool calls stream as proper name-first tool_calls deltas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

localai-bot · 2026-05-29T08:47:07Z

Relationship to #9722 / #10055 / #9940

While investigating I confirmed this PR shares a root cause with the two open PRs against #9722. All three are facets of the same tension: LocalAI's Go-side tool-call parsing/streaming runs in parallel with llama.cpp's C++ chat-template autoparser, and the two collide. Each addresses a different collision symptom, at a different layer:

| PR | Issue | Symptom | Layer |
| --- | --- | --- | --- |
| #9940 | #9722 | glm-4.5 XML auto-detect mis-extracts a Hermes `<tool_call>{json}</tool_call>` blob as the function name | parse (`ParseXMLIterative`) |
| #10055 | #9722 | Go parser and autoparser both emit → the same call appears at multiple `index` values | stream (`processStreamWithTools`) |
| this PR (#10057) | #10052 | Go-generated grammar suppresses the autoparser → the tool-call JSON streams back as `content` | config (decides which path runs) |

There is no file overlap between the three branches, so they don't conflict:

- #9940 (fix(streaming): prevent glm-4.5 XML auto-detect from emitting Hermes tool_call JSON blob as function name (#9722)) → pkg/functions/parse.go
- #10055 (fix(openai): stop streaming tool-call double-emission when autoparser is active) → core/http/endpoints/openai/chat_stream_workers.go
- this PR → core/config/model_config.go, pkg/functions/grammars/json_schema.go, gallery/qwen3.yaml

Interaction worth a reviewer's attention

This PR makes every use_tokenizer_template model defer to the autoparser path (tool calls arrive via ChatDeltas). That's exactly the path #10055 hardens against double-emission, so the two are complementary and ideally land together — this PR widens the surface #10055 protects. Neither subsumes the other: this PR doesn't fix #9722's Go-side double-emit/misparse, and #10055/#9940 don't fix the #10052 content leak.

Verification (this branch, qwen3-4b, llama.cpp)

- Exact repro from #10052 ("Regression: Model responds with tool call JSON object string within the content field when stream+tools enabled"), using stream:true + tools and the prompt "Say Hello": content is now the actual answer text, no JSON object leaks, reasoning is classified separately, and the model no longer force-calls a phantom answer tool.
- Tool-triggering prompt: streams proper name-first tool_calls deltas ({"name":"exec",…} then arguments) at a single index, with no content leak and no double-emission.
- Regression suites green: core/http/endpoints/openai, pkg/functions/..., core/config. golangci-lint --new-from-merge-base=master → 0 issues.

Note for #9940 (FYI, not blocking this PR)

The filterMalformedXMLToolCalls heuristic (drop results whose Name starts with {) only catches the one canonical Hermes shape. Probing ParseXMLIterative(input, nil, false) on this branch, the glm-4.5 auto-detect still mis-extracts a garbage name for these variants, which the {-prefix check does not catch:

| input inside `<tool_call>…</tool_call>` | mis-extracted `Name` | caught by `{`-filter |
| --- | --- | --- |
| `{"name":"bash",…}` (canonical) | `{"name":"bash",…}` | ✅ |
| `Sure: {"name":…}` (leading prose) | `Sure: {"name":…}` | ❌ |
| `[{"name":…}]` (parallel calls) | `[{"name":…}]` | ❌ |
| `name: bash, …` (brace-less) | `name: bash, …` | ❌ |

A name-shape validation (e.g. reject anything not matching ^[A-Za-z_][A-Za-z0-9_.-]*$) or, at the root, requiring actual glm-4.5 markers before the glm format claims a <tool_call> block, would be more robust than the leading-{ test.
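
The name-shape validation suggested above can be sketched in a few lines of Go. This is an illustration, not a proposed patch for #9940: `isPlausibleToolName` is a hypothetical helper, the pattern is the one quoted in the paragraph above, and the sample inputs are made-up stand-ins for the elided variants in the table.

```go
package main

import (
	"fmt"
	"regexp"
)

// toolNamePattern is the name shape quoted above: an identifier-like
// string that may also contain dots and hyphens after the first character.
var toolNamePattern = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_.-]*$`)

// isPlausibleToolName is a hypothetical helper that rejects any extracted
// name that does not look like a function identifier.
func isPlausibleToolName(name string) bool {
	return toolNamePattern.MatchString(name)
}

func main() {
	// Illustrative inputs only; they stand in for the four shapes in the table.
	samples := []string{
		`{"name":"bash"}`,       // canonical Hermes blob
		`Sure: {"name":"bash"}`, // leading prose
		`[{"name":"bash"}]`,     // parallel calls
		`name: bash, x`,         // brace-less
		`bash`,                  // a well-formed name
	}
	for _, s := range samples {
		fmt.Printf("%-24q %v\n", s, isPlausibleToolName(s))
	}
}
```

Unlike the leading-`{` test, the pattern rejects all four malformed shapes, because each of them contains at least one character (a brace, bracket, quote, colon, or space) that cannot appear in a matching name.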

mudler added 2 commits May 29, 2026 07:52

mudler merged commit 73cfedc into master May 29, 2026
57 checks passed

mudler deleted the fix/10052-tokenizer-template-grammar-leak branch May 29, 2026 08:12

mudler added the bug Something isn't working label May 29, 2026

BrewTestBot mentioned this pull request May 29, 2026

localai 4.3.5 Homebrew/homebrew-core#285377

Merged
