fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk by localai-bot · Pull Request #10000 · mudler/LocalAI

localai-bot · 2026-05-25T21:26:33Z

Summary

Follow-up to #9991 (#9985) and #9999 (#9988). With both of those plus the post-#9985 gallery shape (qwen3-4b on use_tokenizer_template: true + use_jinja: true), the streaming-with-tools SSE stream still ended with a spurious chunk:

data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":{...}}"}}

— the same tool-call JSON the client already received as delta.tool_calls, masquerading as reasoning.

Root cause

processStreamWithTools builds a ReasoningExtractor from DetectThinkingStartToken(template, …). After #9985, template is cfg.GetModelTemplate() — qwen3's jinja chat template, which contains <think> inside an {% if enable_thinking %} block. DetectThinkingStartToken doesn't evaluate jinja conditionals, so it returns "<think>" unconditionally. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a leading <think> and makes ExtractReasoning treat the rest of the output as reasoning. For a non-thinking tool-call (the model just emits {"name":"exec",…}), extractor.Reasoning() ends up holding the tool-call JSON.
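The failure mode can be reproduced in isolation. The following is a minimal sketch, not LocalAI's code: `detectStartTokenNaive`, `prependIfNeeded`, and `extractReasoning` are hypothetical stand-ins for the behaviour described above, and the template string is a simplified fragment rather than the real qwen3 template.

```go
package main

import (
	"fmt"
	"strings"
)

// detectStartTokenNaive is a hypothetical stand-in for a detector that scans
// the template text verbatim: it reports "<think>" whenever the literal
// appears, even inside a jinja conditional that may never render.
func detectStartTokenNaive(template string) string {
	if strings.Contains(template, "<think>") {
		return "<think>"
	}
	return ""
}

// prependIfNeeded (hypothetical) synthesizes the start token when the output
// does not already begin with it.
func prependIfNeeded(output, startToken string) string {
	if startToken == "" || strings.HasPrefix(output, startToken) {
		return output
	}
	return startToken + output
}

// extractReasoning (hypothetical) treats the text between <think> and
// </think> as reasoning; an unclosed <think> swallows the rest of the output.
func extractReasoning(s string) (reasoning, content string) {
	const openTag, closeTag = "<think>", "</think>"
	if !strings.HasPrefix(s, openTag) {
		return "", s
	}
	rest := s[len(openTag):]
	if i := strings.Index(rest, closeTag); i >= 0 {
		return rest[:i], rest[i+len(closeTag):]
	}
	return rest, ""
}

func main() {
	// Simplified, illustrative template fragment (assumption).
	template := "{% if enable_thinking %}<think>\n{% endif %}"
	// A non-thinking tool call: the model emits only JSON.
	output := `{"name":"exec","arguments":{"cmd":"echo hello"}}`

	token := detectStartTokenNaive(template)
	reasoning, content := extractReasoning(prependIfNeeded(output, token))
	fmt.Printf("token=%q reasoning=%q content=%q\n", token, reasoning, content)
}
```

In this sketch the whole tool-call JSON ends up in `reasoning` and `content` is empty, which is the pollution the PR describes.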

The autoparser correctly classifies zero reasoning — qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content. But processStreamWithTools's end-of-stream flush preferred extractor.Reasoning() (polluted) over functions.ReasoningFromChatDeltas(chatDeltas) (empty), and buildDeferredToolCallChunks emitted the polluted state.

Fix shape

Sticky preferAutoparser flag in processStreamWithTools, mirroring the one #9985 added to processStream. Once any ChatDelta carries content or reasoning, the flag stays on; the worker stops falling back to the Go-side extractor mid-stream and stops trusting extractor.Reasoning() at end-of-stream.
chooseDeferredReasoning helper that selects the end-of-stream source. With preferAutoparser=true, return functions.ReasoningFromChatDeltas(chatDeltas). Otherwise fall back to extractor.Reasoning() (correct for vLLM and other autoparser-less backends).
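The two changes above can be sketched as follows. This is an illustration under stated assumptions, not the merged code: the `chatDelta` type, the `updatePreferAutoparser` function, and the parameter list of `chooseDeferredReasoning` are hypothetical, and `reasoningFromChatDeltas` only approximates what `functions.ReasoningFromChatDeltas` is described as doing.

```go
package main

import (
	"fmt"
	"strings"
)

// chatDelta is a hypothetical, trimmed-down stand-in for the autoparser's
// per-chunk output; the real type carries more fields.
type chatDelta struct {
	Content   string
	Reasoning string
}

// reasoningFromChatDeltas concatenates whatever the autoparser classified
// as reasoning (assumed behaviour of functions.ReasoningFromChatDeltas).
func reasoningFromChatDeltas(deltas []chatDelta) string {
	var b strings.Builder
	for _, d := range deltas {
		b.WriteString(d.Reasoning)
	}
	return b.String()
}

// updatePreferAutoparser models the sticky flag: once any delta carries
// content or reasoning the flag turns on and never resets.
func updatePreferAutoparser(prefer bool, d chatDelta) bool {
	return prefer || d.Content != "" || d.Reasoning != ""
}

// chooseDeferredReasoning selects the end-of-stream reasoning source.
// The signature is an assumption made for this sketch.
func chooseDeferredReasoning(preferAutoparser bool, deltas []chatDelta, extractorReasoning string) string {
	if preferAutoparser {
		return reasoningFromChatDeltas(deltas)
	}
	return extractorReasoning
}

func main() {
	// qwen3-style case: the autoparser puts everything in Content.
	deltas := []chatDelta{{Content: `{"name":"exec",`}, {Content: `"arguments":{}}`}}
	polluted := `{"name":"exec","arguments":{}}`

	prefer := false
	for _, d := range deltas {
		prefer = updatePreferAutoparser(prefer, d)
	}
	fmt.Printf("autoparser active: %q\n", chooseDeferredReasoning(prefer, deltas, polluted))

	// Autoparser-less backend: no usable deltas, so the extractor wins.
	fmt.Printf("autoparser absent: %q\n", chooseDeferredReasoning(false, nil, "plan the call"))
}
```

Keeping the selection in a small pure function is what makes the contract testable without driving a full stream.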

Tests

Four Ginkgo specs lock down the helper's contract:

autoparser active, no reasoning_content classified → polluted Go-side state is ignored, returns empty
autoparser active with real reasoning_content → autoparser data passes through verbatim (no-regression for jinja-with-recognized-format models)
autoparser not active, genuine <think>…</think> content → falls back to Go-side extractor (vLLM-style backends)
autoparser not active even when vestigial ChatDeltas are present → still falls back to Go-side
full go test ./core/http/endpoints/openai/... green

End-to-end (combined with #9999 layered locally)

Streaming + tools against qwen3-4b on the post-#9985 jinja gallery shape:

| Field | Before this PR | After this PR |
| --- | --- | --- |
| content chunks | 18 (`{"arguments": {"cmd": "echo hello"}, "name": "exec"}` token-by-token) | 18 (unchanged) |
| tool_calls | 1 (`name='exec'`, args=`{"cmd":"echo hello"}`) | 1 (unchanged) |
| reasoning chunks | 1, carrying the same JSON as the tool_call | 0 |
| finish_reason | `tool_calls` | `tool_calls` |
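Counts like the ones in this table can be gathered by tallying the SSE stream on the client side. The sketch below is one way to do that and is not part of the PR: it assumes OpenAI-style chunks with a `choices[].delta` object holding `content`, `reasoning`, and `tool_calls`, and the `chunk`, `tally`, and `tallyStream` names are hypothetical.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// chunk mirrors only the fields this sketch needs from a streamed
// chat-completion chunk (assumed OpenAI-compatible shape).
type chunk struct {
	Choices []struct {
		Delta struct {
			Content   string            `json:"content"`
			Reasoning string            `json:"reasoning"`
			ToolCalls []json.RawMessage `json:"tool_calls"`
		} `json:"delta"`
		FinishReason *string `json:"finish_reason"`
	} `json:"choices"`
}

// tally counts how many chunks carried each kind of delta.
type tally struct {
	Content, Reasoning, ToolCalls int
	FinishReason                  string
}

// tallyStream walks "data: ..." lines and counts delta kinds per chunk.
func tallyStream(stream string) (tally, error) {
	var t tally
	sc := bufio.NewScanner(strings.NewReader(stream))
	for sc.Scan() {
		payload, ok := strings.CutPrefix(strings.TrimSpace(sc.Text()), "data:")
		if !ok {
			continue // blank separators, comments, other SSE fields
		}
		payload = strings.TrimSpace(payload)
		if payload == "" || payload == "[DONE]" {
			continue
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil {
			return t, fmt.Errorf("bad chunk: %w", err)
		}
		for _, ch := range c.Choices {
			if ch.Delta.Content != "" {
				t.Content++
			}
			if ch.Delta.Reasoning != "" {
				t.Reasoning++
			}
			if len(ch.Delta.ToolCalls) > 0 {
				t.ToolCalls++
			}
			if ch.FinishReason != nil {
				t.FinishReason = *ch.FinishReason
			}
		}
	}
	return t, sc.Err()
}

func main() {
	// Tiny hand-written stream for illustration, not captured output.
	stream := `data: {"choices":[{"delta":{"content":"{"}}]}

data: {"choices":[{"delta":{"tool_calls":[{"index":0}]},"finish_reason":"tool_calls"}]}

data: [DONE]
`
	t, err := tallyStream(stream)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", t)
}
```

A non-zero `Reasoning` count on a non-thinking tool call is the symptom this PR removes.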

Dependencies

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end — without #9999, the stream is clamped at {" by the healing-marker stub bug and the trailing reasoning chunk never even gets a chance to fire. The helper unit tests are independent of #9999 and pass on master alone.

Will rebase cleanly once #9999 merges.

🤖 Generated with Claude Code

…iling reasoning chunk

When the C++ autoparser is in pure-content fallback mode (qwen3-4b after the model emits a tool-call JSON in non-thinking mode), the streaming worker ended the SSE stream with a spurious data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}} chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from DetectThinkingStartToken, which scans the model's jinja chat template verbatim and finds <think> inside an {% if enable_thinking %} block without evaluating the conditional. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and makes ExtractReasoning treat everything after it as reasoning.

The autoparser correctly classifies zero reasoning (qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content), but processStreamWithTools then preferred extractor.Reasoning() over functions.ReasoningFromChatDeltas at the end-of-stream flush, handing the polluted Go-side state to buildDeferredToolCallChunks, which emitted it as a trailing reasoning chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring the analogous flag in processStream from #9985. Once any ChatDelta carries content or reasoning, the flag stays on for the rest of the stream and the worker stops falling back to the Go-side extractor for per-token deltas. This avoids the per-chunk leak path and the cumulative pollution.
* Extract chooseDeferredReasoning, a small helper that selects the end-of-stream reasoning source. When preferAutoparser is set, return functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to extractor.Reasoning() (the correct source for vLLM and other backends with no autoparser).

The helper has a focused test suite covering both sides of the contract: autoparser-active with empty reasoning (the qwen3 case, which is the purpose of the fix), autoparser-active with real reasoning_content (jinja-with-recognized-format models), and autoparser-not-active with genuine Go-side reasoning (vLLM-style backends).

E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with name='exec' and the right arguments, finish_reason=tool_calls, and zero reasoning chunks, down from one polluted reasoning chunk before this fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end; the helper unit tests are independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler force-pushed the fix/streaming-tools-prefill-reasoning-leak branch from ef5acda to d4ac597 Compare May 25, 2026 22:03

mudler merged commit e4c70fc into master May 26, 2026
57 checks passed

mudler deleted the fix/streaming-tools-prefill-reasoning-leak branch May 26, 2026 06:34

rsclafani mentioned this pull request May 27, 2026

/v1/responses silently returns empty output: convertORInputToMessages leaves Content nil; array items without type:"message" are dropped #10039

Closed

BrewTestBot mentioned this pull request May 27, 2026

localai 4.3.2 Homebrew/homebrew-core#285003

Merged

localai-bot added the bug Something isn't working label Jun 10, 2026

localai-bot mentioned this pull request Jun 12, 2026

Agent always ever answers {"{" #9419

Closed

mudler merged 1 commit into master from fix/streaming-tools-prefill-reasoning-leak
