
fix: tool-call JSON leaks into content with stream+tools on tokenizer-template models (#10052) #10057

Merged
mudler merged 2 commits into master from fix/10052-tokenizer-template-grammar-leak
May 29, 2026

Conversation

@localai-bot

Collaborator

What

Fixes #10052: with stream: true + tools on a tokenizer-template model (e.g. qwen3), the chat-completions endpoint streamed the tool-call JSON object out through the content field (in addition to emitting the real tool_calls).

Root cause

When use_tokenizer_template delegates templating to the backend (llama.cpp), the backend also owns tool-call grammar generation and parsing. LocalAI was still generating its own GBNF grammar and sending it down. With a Go grammar present, llama.cpp does not pass the tools to its template (has_grammar_from_go=1), so its native peg/json tool parser never engages — it streams the grammar-constrained tool-call JSON back as plain content rather than tool_calls.

Two contributing defects:

  1. Grammar/template inconsistency. The GGUF auto-import path already pairs use_tokenizer_template with grammar.disable, but that block is skipped when a template is already configured — so gallery and hand-written configs (qwen3) that set the tokenizer template directly never got the paired grammar.disable.
  2. Property-order sentinel bug. The JSON-schema→GBNF sort used aOrder != 0 && bOrder != 0 as its guard, treating index 0 (the first key in properties_order) as "unset". So even properties_order: name,arguments fell back to alphabetical order and emitted arguments before name — which prevented the streaming detector from ever gating content.
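
The sentinel bug in defect 2 can be reproduced in isolation. The following is a minimal sketch, not the code from `pkg/functions/grammars`: `orderIndex` and `sortBuggy` are hypothetical names, and only the `aOrder != 0 && bOrder != 0` guard is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// orderIndex maps each key of a comma-separated properties_order value to
// its position. Hypothetical helper, used only for this illustration.
func orderIndex(propertiesOrder string) map[string]int {
	idx := map[string]int{}
	if propertiesOrder == "" {
		return idx
	}
	for i, k := range strings.Split(propertiesOrder, ",") {
		idx[strings.TrimSpace(k)] = i
	}
	return idx
}

// sortBuggy reproduces the old guard. A missing key and the key at index 0
// both read as 0 from the map, so the first listed key looks "unset" and
// the comparison falls back to alphabetical order.
func sortBuggy(keys []string, order map[string]int) []string {
	out := append([]string(nil), keys...)
	sort.SliceStable(out, func(i, j int) bool {
		aOrder, bOrder := order[out[i]], order[out[j]]
		if aOrder != 0 && bOrder != 0 {
			return aOrder < bOrder
		}
		return out[i] < out[j]
	})
	return out
}

func main() {
	order := orderIndex("name,arguments")
	// "name" sits at index 0, so the guard never fires for this pair.
	fmt.Println(sortBuggy([]string{"name", "arguments"}, order)) // prints [arguments name]
}
```

With only two listed keys the guard can never be true, because one of them always has index 0. That is why the documented `name,arguments` order had no effect.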

Changes

  • SetDefaults now couples use_tokenizer_template ⇒ grammar.disable for every config, so gallery, hand-written, and GGUF-import configs agree. Tools flow to the backend's native (name-first) pipeline. This also fixes already-installed models without editing each config.
  • gallery/qwen3.yaml (the shared base referenced by every qwen3 entry) sets function.grammar.disable: true explicitly.
  • Fix the properties_order sentinel so a listed key at index 0 is honored (presence-in-map instead of != 0).
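
The coupling in the first bullet amounts to a one-way implication applied when defaults are set. A minimal sketch follows, assuming simplified stand-in types: `ModelConfig`, its nested structs, and `applyTokenizerTemplateCoupling` are hypothetical and do not mirror LocalAI's real config structs or the body of `SetDefaults`.

```go
package main

import "fmt"

// Hypothetical stand-ins for the relevant parts of a model config. The
// field names loosely follow the YAML keys use_tokenizer_template and
// function.grammar.disable mentioned in this PR.
type GrammarConfig struct{ Disable bool }

type FunctionsConfig struct{ Grammar GrammarConfig }

type TemplateConfig struct{ UseTokenizerTemplate bool }

type ModelConfig struct {
	Name      string
	Template  TemplateConfig
	Functions FunctionsConfig
}

// applyTokenizerTemplateCoupling enforces
// use_tokenizer_template => grammar.disable. It never re-enables a grammar
// and leaves configs without the tokenizer template untouched.
func (c *ModelConfig) applyTokenizerTemplateCoupling() {
	if c.Template.UseTokenizerTemplate {
		c.Functions.Grammar.Disable = true
	}
}

func main() {
	withTemplate := ModelConfig{Name: "qwen3", Template: TemplateConfig{UseTokenizerTemplate: true}}
	withoutTemplate := ModelConfig{Name: "other"}

	for _, c := range []*ModelConfig{&withTemplate, &withoutTemplate} {
		c.applyTokenizerTemplateCoupling()
		fmt.Printf("%s: grammar disabled = %v\n", c.Name, c.Functions.Grammar.Disable)
	}
}
```

Because the rule runs for every loaded config rather than only at import time, configs written before the fix get the same behavior without being edited.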

Verification

End-to-end against qwen3-4b (stream:true + tools), built from this branch:

  • Plain answer: content now carries the actual text (Hello! How can I assist you today?), reasoning is classified separately, and no JSON leaks. Previously content dribbled out {"arguments":...,"name":"answer"}.
  • Tool call: streams proper name-first tool_calls deltas ({"name":"exec",...} then the arguments) with zero content leak.
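
A check like the one above can be scripted by folding the streamed chunks and inspecting the result. The sketch below assumes OpenAI-style `chat.completion.chunk` payloads (`choices[].delta.content` and `choices[].delta.tool_calls[]`). `accumulate` and `looksLikeToolCallJSON` are hypothetical helpers, not part of LocalAI, and the sample payloads are hand-written rather than captured output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// delta mirrors only the chunk fields this sketch needs.
type delta struct {
	Content   string `json:"content"`
	ToolCalls []struct {
		Index    int `json:"index"`
		Function struct {
			Name      string `json:"name"`
			Arguments string `json:"arguments"`
		} `json:"function"`
	} `json:"tool_calls"`
}

type chunk struct {
	Choices []struct {
		Delta delta `json:"delta"`
	} `json:"choices"`
}

// toolCall accumulates one streamed call, keyed by its index.
type toolCall struct{ Name, Arguments string }

// accumulate folds the JSON payloads of a stream into the final content
// string and the tool calls keyed by index.
func accumulate(payloads []string) (string, map[int]*toolCall, error) {
	content := ""
	calls := map[int]*toolCall{}
	for _, p := range payloads {
		var c chunk
		if err := json.Unmarshal([]byte(p), &c); err != nil {
			return "", nil, err
		}
		for _, ch := range c.Choices {
			content += ch.Delta.Content
			for _, tc := range ch.Delta.ToolCalls {
				if calls[tc.Index] == nil {
					calls[tc.Index] = &toolCall{}
				}
				calls[tc.Index].Name += tc.Function.Name
				calls[tc.Index].Arguments += tc.Function.Arguments
			}
		}
	}
	return content, calls, nil
}

// looksLikeToolCallJSON is a rough heuristic for the leak: the content
// parses as a JSON object that has both "name" and "arguments" keys.
func looksLikeToolCallJSON(content string) bool {
	var obj map[string]json.RawMessage
	if err := json.Unmarshal([]byte(content), &obj); err != nil {
		return false
	}
	_, hasName := obj["name"]
	_, hasArgs := obj["arguments"]
	return hasName && hasArgs
}

func main() {
	// Hand-written chunks imitating the pre-fix behavior.
	leaky := []string{
		`{"choices":[{"delta":{"content":"{\"arguments\":{},"}}]}`,
		`{"choices":[{"delta":{"content":"\"name\":\"answer\"}"}}]}`,
	}
	content, _, err := accumulate(leaky)
	if err != nil {
		panic(err)
	}
	fmt.Println("leak detected:", looksLikeToolCallJSON(content))
}
```

The heuristic is deliberately narrow: it flags only content that is a complete tool-call-shaped object, so ordinary answer text passes.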

New tests (TDD, Ginkgo):

  • core/config: SetDefaults disables Go grammar when the tokenizer template is used (and leaves it enabled otherwise).
  • pkg/functions/grammars: properties_order: name,arguments yields name-first; empty order stays alphabetical; listed keys sort ahead of unlisted ones.
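
The three ordering cases covered by the grammar tests can be sketched with a presence-in-map comparator. This is an illustration under assumptions, not the merged implementation: `sortFixed` is a hypothetical name and the order map is built by hand.

```go
package main

import (
	"fmt"
	"sort"
)

// sortFixed orders keys by presence in the order map: listed keys sort by
// their index and ahead of unlisted keys, which stay alphabetical.
func sortFixed(keys []string, order map[string]int) []string {
	out := append([]string(nil), keys...)
	sort.SliceStable(out, func(i, j int) bool {
		aOrder, aListed := order[out[i]]
		bOrder, bListed := order[out[j]]
		switch {
		case aListed && bListed:
			return aOrder < bOrder // both listed: compare by index, 0 included
		case aListed != bListed:
			return aListed // the listed key goes first
		default:
			return out[i] < out[j] // neither listed: alphabetical
		}
	})
	return out
}

func main() {
	order := map[string]int{"name": 0, "arguments": 1} // properties_order: name,arguments

	fmt.Println(sortFixed([]string{"arguments", "name"}, order))
	fmt.Println(sortFixed([]string{"name", "arguments"}, map[string]int{}))
	fmt.Println(sortFixed([]string{"zeta", "arguments", "alpha", "name"}, order))
}
```

Using the two-value map lookup distinguishes "listed at index 0" from "not listed", which a comparison against zero cannot do.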

golangci-lint run --new-from-merge-base=master → 0 issues.

🤖 Generated with Claude Code

mudler added 2 commits May 29, 2026 07:52
The JSON-schema-to-GBNF property sort used `aOrder != 0 && bOrder != 0` as
its "is this key ordered?" guard. That treats index 0 — the first key listed
in properties_order — as unset, so `properties_order: name,arguments` fell
back to alphabetical ordering and still emitted "arguments" before "name".

Use presence in the order map instead: listed keys sort by their index and
ahead of unlisted keys, which keep a stable alphabetical order. This makes
the documented `properties_order: name,arguments` actually produce
name-first tool-call JSON. Relates to #10052.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…template owns templating (#10052)

When use_tokenizer_template delegates templating to the backend (llama.cpp),
the backend also owns tool-call grammar generation and parsing. LocalAI was
still generating its own GBNF grammar and sending it down. With a grammar
present, llama.cpp does not hand the tools to its template, so its native
peg/json tool parser never engages: it streams the grammar-constrained
tool-call JSON back as plain content instead of emitting tool_calls. In
streaming mode the JSON object leaked into the content field, and the
Go-side incremental detector never gated content because the
LocalAI-generated grammar emitted "arguments" before "name".

The GGUF auto-import path already couples use_tokenizer_template with
grammar.disable, but that block is skipped when a template is already
configured, so gallery and hand-written configs (e.g. qwen3) that set the
tokenizer template directly never got the paired grammar.disable.

- SetDefaults now enforces the coupling for every config: when
  use_tokenizer_template is set, grammar generation is disabled and tools
  flow to the backend's native (name-first) pipeline. This also fixes
  already-installed models without editing each config.
- Set function.grammar.disable in the shared gallery/qwen3.yaml, which is
  the base config referenced by every qwen3 gallery entry.

Verified end to end against qwen3-4b with stream:true + tools: content no
longer carries the tool-call JSON, reasoning is classified separately, and
tool calls stream as proper name-first tool_calls deltas.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler mudler merged commit 73cfedc into master May 29, 2026
57 checks passed
@mudler mudler deleted the fix/10052-tokenizer-template-grammar-leak branch May 29, 2026 08:12
@localai-bot

Collaborator Author

Relationship to #9722 / #10055 / #9940

While investigating I confirmed this PR shares a root cause with the two open PRs against #9722. All three are facets of the same tension: LocalAI's Go-side tool-call parsing/streaming runs in parallel with llama.cpp's C++ chat-template autoparser, and the two collide. Each addresses a different collision symptom, at a different layer:

| PR | Issue | Symptom | Layer |
| --- | --- | --- | --- |
| #9940 | #9722 | glm-4.5 XML auto-detect mis-extracts a Hermes `<tool_call>{json}</tool_call>` blob as the function name | parse (ParseXMLIterative) |
| #10055 | #9722 | Go parser and autoparser both emit → the same call appears at multiple index values | stream (processStreamWithTools) |
| this PR (#10057) | #10052 | Go-generated grammar suppresses the autoparser → the tool-call JSON streams back as content | config (decides which path runs) |

There is no file overlap between the three branches, so they don't conflict.

Interaction worth a reviewer's attention

This PR makes every use_tokenizer_template model defer to the autoparser path (tool calls arrive via ChatDeltas). That's exactly the path #10055 hardens against double-emission, so the two are complementary and ideally land together — this PR widens the surface #10055 protects. Neither subsumes the other: this PR doesn't fix #9722's Go-side double-emit/misparse, and #10055/#9940 don't fix the #10052 content leak.

Verification (this branch, qwen3-4b, llama.cpp)

  • #10052 exact repro (stream:true + tools, "Say Hello"): content is now the actual answer text, no JSON object leaks, reasoning is classified separately, and the model no longer force-calls a phantom answer tool.
  • Tool-triggering prompt: streams proper name-first tool_calls deltas ({"name":"exec",…} then arguments) at a single index, no content leak and no double-emission.
  • Regression suites green: core/http/endpoints/openai, pkg/functions/..., core/config. golangci-lint --new-from-merge-base=master → 0 issues.

Note for #9940 (FYI, not blocking this PR)

The filterMalformedXMLToolCalls heuristic (drop results whose Name starts with {) only catches the one canonical Hermes shape. Probing ParseXMLIterative(input, nil, false) on this branch, the glm-4.5 auto-detect still mis-extracts a garbage name for these variants, which the {-prefix check does not catch:

| input inside `<tool_call>…</tool_call>` | mis-extracted Name | caught by `{`-filter |
| --- | --- | --- |
| `{"name":"bash",…}` (canonical) | `{"name":"bash",…}` | yes |
| `Sure: {"name":…}` (leading prose) | `Sure: {"name":…}` | no |
| `[{"name":…}]` (parallel calls) | `[{"name":…}]` | no |
| `name: bash, …` (brace-less) | `name: bash, …` | no |

A name-shape validation (e.g. reject anything not matching ^[A-Za-z_][A-Za-z0-9_.-]*$) or, at the root, requiring actual glm-4.5 markers before the glm format claims a <tool_call> block, would be more robust than the leading-{ test.
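
As a sketch of the name-shape validation suggested above: the regular expression is the one quoted in the comment, while `validToolName` and the sample inputs are hypothetical (the elided `…` parts of the probed inputs are replaced with made-up values for illustration).

```go
package main

import (
	"fmt"
	"regexp"
)

// toolNameRe is the shape proposed above: an identifier-like name that may
// also contain dots and hyphens after the first character.
var toolNameRe = regexp.MustCompile(`^[A-Za-z_][A-Za-z0-9_.-]*$`)

// validToolName reports whether an extracted Name looks like a function
// name rather than a mis-extracted blob.
func validToolName(name string) bool {
	return toolNameRe.MatchString(name)
}

func main() {
	candidates := []string{
		"bash",                  // a plausible real name
		`{"name":"bash"}`,       // canonical Hermes blob
		`Sure: {"name":"bash"}`, // leading prose
		`[{"name":"bash"}]`,     // parallel calls
		"name: bash, args",      // brace-less
	}
	for _, c := range candidates {
		fmt.Printf("%q valid=%v\n", c, validToolName(c))
	}
}
```

Unlike the leading-`{` test, this rejects all four malformed variants because each contains a character (quote, colon, space, or bracket) that cannot appear in the allowed shape.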

@mudler mudler added the bug Something isn't working label May 29, 2026
