fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk #10000
Merged
When the C++ autoparser is in pure-content fallback mode (qwen3-4b after
#9985) and the model emits a tool-call JSON in non-thinking mode, the
streaming worker
ended the SSE stream with a spurious
data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}
chunk carrying the same JSON that was already in delta.tool_calls.
The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.
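
The leak path described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `detectThinkingStartToken`, `prependIfNeeded`, and `extractReasoning` are simplified, hypothetical stand-ins for `DetectThinkingStartToken`, `PrependThinkingTokenIfNeeded`, and `ExtractReasoning`, and the template string is a made-up fragment rather than qwen3's real template.

```go
package main

import (
	"fmt"
	"strings"
)

// detectThinkingStartToken is a hypothetical stand-in: it scans the raw
// template text and never evaluates jinja control flow, so a <think> inside
// an {% if %} block is found regardless of the condition.
func detectThinkingStartToken(template string) string {
	if strings.Contains(template, "<think>") {
		return "<think>"
	}
	return ""
}

// prependIfNeeded is a hypothetical stand-in: it synthesizes the start token
// in front of the output when the output does not already begin with it.
func prependIfNeeded(output, startToken string) string {
	if startToken == "" || strings.HasPrefix(output, startToken) {
		return output
	}
	return startToken + output
}

// extractReasoning is a hypothetical stand-in: text after an opening <think>
// is reasoning until </think>, or until the end of input if the tag never closes.
func extractReasoning(s string) (reasoning, content string) {
	const openTag, closeTag = "<think>", "</think>"
	if !strings.HasPrefix(s, openTag) {
		return "", s
	}
	rest := strings.TrimPrefix(s, openTag)
	if i := strings.Index(rest, closeTag); i >= 0 {
		return rest[:i], rest[i+len(closeTag):]
	}
	return rest, ""
}

func main() {
	// Made-up template fragment; only the shape matters here.
	template := "{% if enable_thinking %}<think>{% endif %}"
	token := detectThinkingStartToken(template)

	// Non-thinking output: the model emits only the tool-call JSON.
	output := `{"name":"exec","arguments":{"cmd":"echo hello"}}`
	reasoning, content := extractReasoning(prependIfNeeded(output, token))

	// The whole tool-call JSON lands in reasoning; content is empty.
	fmt.Printf("reasoning=%q\ncontent=%q\n", reasoning, content)
}
```

The point of the sketch is that no single step is wrong on its own; the misclassification comes from combining an unconditional template scan with a synthesized, never-closed start token.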
Two changes:
* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
the analogous flag in processStream from #9985. Once any ChatDelta
carries content or reasoning, the flag stays on for the rest of the
stream and the worker stops falling back to the Go-side extractor for
per-token deltas. This avoids the per-chunk leak path and the cumulative
pollution.
* Extract chooseDeferredReasoning, a small helper that selects the
end-of-stream reasoning source. When preferAutoparser is set, return
functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
extractor.Reasoning() (the correct source for vLLM and other backends
with no autoparser).
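
A rough sketch of how the two changes fit together, under stated assumptions: the `chatDelta` type, the `updatePreferAutoparser` helper, and the exact signature of `chooseDeferredReasoning` are hypothetical simplifications for illustration; only the selection rule itself is taken from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// chatDelta is a simplified, hypothetical stand-in for the autoparser's ChatDelta.
type chatDelta struct {
	Content   string
	Reasoning string
}

// reasoningFromChatDeltas is a hypothetical stand-in for
// functions.ReasoningFromChatDeltas: it concatenates the reasoning parts.
func reasoningFromChatDeltas(deltas []chatDelta) string {
	var b strings.Builder
	for _, d := range deltas {
		b.WriteString(d.Reasoning)
	}
	return b.String()
}

// updatePreferAutoparser models the sticky flag: once any delta carries
// content or reasoning, the flag stays on for the rest of the stream.
func updatePreferAutoparser(prefer bool, d chatDelta) bool {
	return prefer || d.Content != "" || d.Reasoning != ""
}

// chooseDeferredReasoning selects the end-of-stream reasoning source.
// The signature is an assumption; the real helper may differ.
func chooseDeferredReasoning(prefer bool, deltas []chatDelta, extractorReasoning string) string {
	if prefer {
		return reasoningFromChatDeltas(deltas)
	}
	return extractorReasoning
}

func main() {
	// qwen3-like case: the autoparser puts everything in Content,
	// while the Go-side extractor state holds the tool-call JSON.
	deltas := []chatDelta{{Content: `{"name":`}, {Content: `"exec"}`}}
	polluted := `{"name":"exec"}`

	prefer := false
	for _, d := range deltas {
		prefer = updatePreferAutoparser(prefer, d)
	}
	fmt.Printf("prefer=%v reasoning=%q\n", prefer, chooseDeferredReasoning(prefer, deltas, polluted))
}
```

Keeping the choice in a small pure function is what makes the contract testable without spinning up a stream: the inputs are the flag, the collected deltas, and the extractor's accumulated state.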
The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).
E2E with combined #9988 and this fix on qwen3-4b post-#9985 gallery
shape: 18 content chunks of the tool-call JSON, 1 tool_call chunk with
name='exec' and the right arguments, finish_reason=tool_calls, and zero
reasoning chunks — down from one polluted reasoning chunk before this
fix.
Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.
Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Summary
Follow-up to #9991 (#9985) and #9999 (#9988). With both of those plus the post-#9985 gallery shape (qwen3-4b on use_tokenizer_template: true + use_jinja: true), the streaming-with-tools SSE stream still ended with a spurious chunk: the same tool-call JSON the client already received as delta.tool_calls, masquerading as reasoning.
Root cause
processStreamWithTools builds a ReasoningExtractor from DetectThinkingStartToken(template, …). After #9985, template is cfg.GetModelTemplate(): qwen3's jinja chat template, which contains <think> inside an {% if enable_thinking %} block. DetectThinkingStartToken doesn't evaluate jinja conditionals, so it returns "<think>" unconditionally. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a leading <think> and makes ExtractReasoning treat the rest of the output as reasoning. For a non-thinking tool-call (the model just emits {"name":"exec",…}), extractor.Reasoning() ends up holding the tool-call JSON.
The autoparser correctly classifies zero reasoning: qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content. But processStreamWithTools's end-of-stream flush preferred extractor.Reasoning() (polluted) over functions.ReasoningFromChatDeltas(chatDeltas) (empty), and buildDeferredToolCallChunks emitted the polluted state.
Fix shape
* preferAutoparser flag in processStreamWithTools, mirroring the one #9985 (Regression: Reasoning/thinking output provided as regular output) added to processStream. Once any ChatDelta carries content or reasoning, the flag stays on; the worker stops falling back to the Go-side extractor mid-stream and stops trusting extractor.Reasoning() at end-of-stream.
* chooseDeferredReasoning helper that selects the end-of-stream source. With preferAutoparser=true, return functions.ReasoningFromChatDeltas(chatDeltas). Otherwise fall back to extractor.Reasoning() (correct for vLLM and other autoparser-less backends).
Tests
Four Ginkgo specs lock down the helper's contract:
* autoparser active with real reasoning_content → autoparser data passes through verbatim (no-regression for jinja-with-recognized-format models)
* autoparser not active with <think>…</think> content → falls back to Go-side extractor (vLLM-style backends)
go test ./core/http/endpoints/openai/... green
End-to-end (combined with #9999 layered locally)
Streaming + tools against qwen3-4b on the post-#9985 jinja gallery shape:
* content chunks: 18 ({"arguments": {"cmd": "echo hello"}, "name": "exec"} token-by-token)
* tool_call chunks: 1 (name='exec', args={"cmd":"echo hello"})
* finish_reason: tool_calls
* reasoning chunks: 0, down from one polluted reasoning chunk before this fix
Dependencies
Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end: without #9999, the stream is clamped at {" by the healing-marker stub bug and the trailing reasoning chunk never even gets a chance to fire. The helper unit tests are independent of #9999 and pass on master alone.
Will rebase cleanly once #9999 merges.
🤖 Generated with Claude Code