
fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk #10000

Merged
mudler merged 1 commit into master from fix/streaming-tools-prefill-reasoning-leak on May 26, 2026


@localai-bot

Collaborator

Summary

Follow-up to #9991 (#9985) and #9999 (#9988). With both of those plus the post-#9985 gallery shape (qwen3-4b on use_tokenizer_template: true + use_jinja: true), the streaming-with-tools SSE stream still ended with a spurious chunk:

data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":{...}}"}}

— the same tool-call JSON the client already received as delta.tool_calls, masquerading as reasoning.

Root cause

processStreamWithTools builds a ReasoningExtractor from DetectThinkingStartToken(template, …). After #9985, template is cfg.GetModelTemplate() — qwen3's jinja chat template, which contains <think> inside an {% if enable_thinking %} block. DetectThinkingStartToken doesn't evaluate jinja conditionals, so it returns "<think>" unconditionally. Every output chunk then runs through PrependThinkingTokenIfNeeded, which synthesizes a leading <think> and makes ExtractReasoning treat the rest of the output as reasoning. For a non-thinking tool-call (the model just emits {"name":"exec",…}), extractor.Reasoning() ends up holding the tool-call JSON.
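
The chain described above can be reproduced in miniature. The following is a minimal sketch, not LocalAI's actual code: the three functions are simplified, hypothetical stand-ins named after the helpers mentioned in this PR, and the template string is a schematic fragment rather than the real qwen3 template. It only illustrates how a textual scan that ignores the jinja conditional ends up classifying a plain tool-call JSON as reasoning.

```go
package main

import "strings"

// detectThinkingStartToken is a hypothetical, simplified stand-in for the
// detection step: it scans the raw template text and never evaluates the
// jinja conditional that guards the token.
func detectThinkingStartToken(template string) string {
	if strings.Contains(template, "<think>") {
		return "<think>"
	}
	return ""
}

// prependThinkingTokenIfNeeded mimics the prefill assumption (hypothetical):
// if a start token was detected but the output does not begin with it, the
// token is synthesized in front of the output.
func prependThinkingTokenIfNeeded(output, startToken string) string {
	if startToken == "" || strings.HasPrefix(output, startToken) {
		return output
	}
	return startToken + output
}

// extractReasoning splits output into reasoning and content (hypothetical).
// An opening <think> with no closing tag swallows everything after it.
func extractReasoning(output string) (reasoning, content string) {
	const open, end = "<think>", "</think>"
	if !strings.HasPrefix(output, open) {
		return "", output
	}
	rest := strings.TrimPrefix(output, open)
	if i := strings.Index(rest, end); i >= 0 {
		return rest[:i], rest[i+len(end):]
	}
	return rest, ""
}
```

Feeding a schematic template such as `{% if enable_thinking %}<think>{% endif %}` through `detectThinkingStartToken` yields `<think>` whether or not thinking is enabled. A non-thinking output like `{"name":"exec",...}` then comes back from `extractReasoning` entirely as reasoning, with empty content.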

The autoparser correctly classifies zero reasoning — qwen3's tool format isn't on llama.cpp's recognized-tool list, so all tokens land in ChatDelta.Content. But processStreamWithTools's end-of-stream flush preferred extractor.Reasoning() (polluted) over functions.ReasoningFromChatDeltas(chatDeltas) (empty), and buildDeferredToolCallChunks emitted the polluted state.

Fix shape

  • Sticky preferAutoparser flag in processStreamWithTools, mirroring the one that the fix for #9985 added to processStream. Once any ChatDelta carries content or reasoning, the flag stays on; the worker stops falling back to the Go-side extractor mid-stream and stops trusting extractor.Reasoning() at end-of-stream.
  • chooseDeferredReasoning helper that selects the end-of-stream source. With preferAutoparser=true, return functions.ReasoningFromChatDeltas(chatDeltas). Otherwise fall back to extractor.Reasoning() (correct for vLLM and other autoparser-less backends).
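
The two pieces above can be sketched as follows. This is an illustration under stated assumptions, not the merged implementation: `chatDelta` is a pared-down hypothetical type, `stickyPreferAutoparser` is a hypothetical name for the flag update, and the real `chooseDeferredReasoning` signature may differ (here the Go-side extractor state is passed in as a plain string).

```go
package main

import "strings"

// chatDelta is a hypothetical, pared-down stand-in for the autoparser's
// per-chunk output.
type chatDelta struct {
	Content   string
	Reasoning string
}

// stickyPreferAutoparser returns the updated flag (hypothetical helper).
// Once it is true it stays true, even if later deltas are empty.
func stickyPreferAutoparser(prefer bool, d chatDelta) bool {
	return prefer || d.Content != "" || d.Reasoning != ""
}

// reasoningFromChatDeltas concatenates the reasoning the autoparser
// classified (assumed behavior of functions.ReasoningFromChatDeltas).
func reasoningFromChatDeltas(deltas []chatDelta) string {
	var b strings.Builder
	for _, d := range deltas {
		b.WriteString(d.Reasoning)
	}
	return b.String()
}

// chooseDeferredReasoning selects the end-of-stream reasoning source.
// extractorReasoning stands in for the value of extractor.Reasoning().
func chooseDeferredReasoning(preferAutoparser bool, deltas []chatDelta, extractorReasoning string) string {
	if preferAutoparser {
		return reasoningFromChatDeltas(deltas)
	}
	return extractorReasoning
}
```

With the flag set and only content deltas seen, the helper returns an empty string regardless of what the Go-side extractor accumulated, which is the qwen3 case. With the flag unset, the extractor value is returned unchanged, so autoparser-less backends keep their existing behavior.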

Tests

Four Ginkgo specs lock down the helper's contract:

  • autoparser active, no reasoning_content classified → polluted Go-side state is ignored, returns empty
  • autoparser active with real reasoning_content → autoparser data passes through verbatim (no-regression for jinja-with-recognized-format models)
  • autoparser not active, genuine <think>…</think> content → falls back to Go-side extractor (vLLM-style backends)
  • autoparser not active even when vestigial ChatDeltas are present → still falls back to Go-side

Beyond the four specs, the full go test ./core/http/endpoints/openai/... run is green.

End-to-end (combined with #9999 layered locally)

Streaming + tools against qwen3-4b on the post-#9985 jinja gallery shape:

| Field | Before this PR | After this PR |
| --- | --- | --- |
| content chunks | 18 ({"arguments": {"cmd": "echo hello"}, "name": "exec"} token-by-token) | 18 (unchanged) |
| tool_calls | 1 (name='exec', args={"cmd":"echo hello"}) | 1 (unchanged) |
| reasoning chunks | 1, carrying the same JSON as the tool_call | 0 |
| finish_reason | tool_calls | tool_calls |

Dependencies

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to make the trailing chunk observable end-to-end — without #9999, the stream is clamped at {" by the healing-marker stub bug and the trailing reasoning chunk never even gets a chance to fire. The helper unit tests are independent of #9999 and pass on master alone.

Will rebase cleanly once #9999 merges.

🤖 Generated with Claude Code

…iling reasoning chunk

When the C++ autoparser is in pure-content fallback mode (qwen3-4b on the
post-#9985 gallery shape) and the model emits a tool-call JSON in
non-thinking mode, the streaming worker ended the SSE stream with a spurious

    data: {...,"delta":{"reasoning":"{\"name\":\"exec\",\"arguments\":...}"}}

chunk carrying the same JSON that was already in delta.tool_calls.

The Go-side ReasoningExtractor is configured from
DetectThinkingStartToken, which scans the model's jinja chat template
verbatim and finds <think> inside an {% if enable_thinking %} block
without evaluating the conditional. Every output chunk then runs through
PrependThinkingTokenIfNeeded, which synthesizes a <think> in front and
makes ExtractReasoning treat everything after as reasoning. The autoparser
correctly classifies zero reasoning (qwen3's tool format isn't on
llama.cpp's recognized-tool list, so all tokens land in
ChatDelta.Content), but processStreamWithTools then preferred
extractor.Reasoning() over functions.ReasoningFromChatDeltas at the
end-of-stream flush — handing the polluted Go-side state to
buildDeferredToolCallChunks, which emitted it as a trailing reasoning
chunk.

Two changes:

* Add a sticky preferAutoparser flag to processStreamWithTools, mirroring
  the analogous flag in processStream from #9985. Once any ChatDelta
  carries content or reasoning, the flag stays on for the rest of the
  stream and the worker stops falling back to the Go-side extractor for
  per-token deltas. This avoids the per-chunk leak path and the cumulative
  pollution.

* Extract chooseDeferredReasoning, a small helper that selects the
  end-of-stream reasoning source. When preferAutoparser is set, return
  functions.ReasoningFromChatDeltas(chatDeltas); otherwise fall back to
  extractor.Reasoning() (the correct source for vLLM and other backends
  with no autoparser).

The helper has a focused test suite covering both sides of the contract:
autoparser-active with empty reasoning (the qwen3 case — the fix's
purpose), autoparser-active with real reasoning_content
(jinja-with-recognized-format models), and autoparser-not-active with
genuine Go-side reasoning (vLLM-style backends).

E2E with the fix for #9988 (#9999) combined with this fix on qwen3-4b,
post-#9985 gallery shape: 18 content chunks of the tool-call JSON, 1
tool_call chunk with name='exec' and the right arguments,
finish_reason=tool_calls, and zero reasoning chunks, down from one
polluted reasoning chunk before this fix.

Depends on #9999 (the streaming JSON tool-call gating bug for qwen3) to
make the trailing chunk observable end-to-end; the helper unit tests are
independent.

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
mudler force-pushed the fix/streaming-tools-prefill-reasoning-leak branch from ef5acda to d4ac597 on May 25, 2026 22:03
mudler merged commit e4c70fc into master on May 26, 2026
57 checks passed
mudler deleted the fix/streaming-tools-prefill-reasoning-leak branch on May 26, 2026 06:34
localai-bot added the bug (Something isn't working) label on Jun 10, 2026