fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode by localai-bot · Pull Request #9991 · mudler/LocalAI

localai-bot · 2026-05-25T20:00:04Z

Summary

Fixes #9985 — qwen3-4b (and the rest of the qwen3 family) was returning the <think>...</think> block inside the OpenAI content field instead of in a separate reasoning field. Regression from v4.0.0, introduced by the C++ autoparser ChatDeltas path (#9224).

Root cause

When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery), llama.cpp's chat parser falls back to a "pure content" PEG parser. It dumps the entire raw response — <think> tags and all — into ChatDelta.Content and leaves ChatDelta.ReasoningContent empty. The Go side in chat.go then preferred the autoparser's content over tokenCallback's correctly-split result, so the tags leaked through.

Debug log showing the bug:

[ChatDeltas] non-streaming Predict received deltas from C++ autoparser total_deltas=1
[ChatDeltas] non-SSE no-tools: overriding result with C++ autoparser deltas content_len=376 reasoning_len=0

Fix shape

Conditional fallback. applyAutoparserOverride (extracted from chat.go's inline override) now runs Go-side ExtractReasoningWithConfig when the autoparser delivered content but no reasoning. When the autoparser DID populate ReasoningContent, we trust it untouched — jinja-enabled installs are not regressed.
Streaming gets a sticky preferAutoparser flag. It flips on the first chunk where the autoparser classified reasoning_content; until then the streaming worker uses the Go-side extractor's deltas.
Realtime mirrors the non-streaming fallback.
gallery/qwen3.yaml now enables use_jinja:true so the autoparser classifies <think> natively for the 20+ qwen3 family entries sharing this template. The Go-side fallback still covers older on-disk installs and any future imported models without jinja.
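
The conditional fallback in the first item above can be sketched roughly as follows. This is a minimal illustration, not the code from chat.go: `chatDelta`, `splitThink`, and `applyOverride` are hypothetical stand-ins for the real types and for ExtractReasoningWithConfig, and only the `<think>` tag pair is handled.

```go
package main

import (
	"fmt"
	"strings"
)

// chatDelta is a hypothetical stand-in for one autoparser delta.
type chatDelta struct {
	Content          string
	ReasoningContent string
}

// splitThink is a simplified, hypothetical extractor. It removes the first
// closed <think>...</think> block from s and returns (reasoning, content).
// Text without a closed block is returned unchanged as content.
func splitThink(s string) (string, string) {
	const openTag, closeTag = "<think>", "</think>"
	start := strings.Index(s, openTag)
	if start < 0 {
		return "", s
	}
	rest := s[start+len(openTag):]
	end := strings.Index(rest, closeTag)
	if end < 0 {
		return "", s
	}
	reasoning := strings.TrimSpace(rest[:end])
	content := strings.TrimSpace(s[:start] + rest[end+len(closeTag):])
	return reasoning, content
}

// applyOverride returns (content, reasoning). It trusts the autoparser when
// it classified any reasoning, and falls back to splitThink only when the
// autoparser delivered content but no reasoning.
func applyOverride(deltas []chatDelta, curContent, curReasoning string) (string, string) {
	if len(deltas) == 0 {
		return curContent, curReasoning // nothing to override with
	}
	var c, r strings.Builder
	for _, d := range deltas {
		c.WriteString(d.Content)
		r.WriteString(d.ReasoningContent)
	}
	if r.Len() > 0 {
		return c.String(), r.String() // jinja path: pass through untouched
	}
	reasoning, content := splitThink(c.String())
	return content, reasoning
}

func main() {
	// Pure-content mode: the tags arrive inside Content, reasoning is empty.
	c, r := applyOverride([]chatDelta{{Content: "<think>plan</think>Hello"}}, "", "")
	fmt.Printf("content=%q reasoning=%q\n", c, r)
}
```

Checking for autoparser reasoning first keeps the jinja path a pure passthrough, so the extra extraction only runs on the pure-content path.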

Test plan

go test ./core/http/endpoints/openai/ ./core/http/endpoints/openresponses/ ./pkg/reasoning/ ./pkg/functions/ — green
New Ginkgo specs in chat_test.go covering:
- autoparser delivered <think> in content + empty reasoning → split correctly (red without fix, green with fix)
- autoparser already populated reasoning → passthrough untouched (no-regression on jinja path)
- plain content, no reasoning tags → passthrough
- empty <think></think> block from qwen3 /no_think → tags stripped, no spurious reasoning field
- empty chatDeltas → returns existing result
golangci-lint run --new-from-merge-base=master — 0 new issues
End-to-end against running qwen3-4b (Q4_K_M):
- Default thinking mode: content clean, reasoning in its own field
- /no_think mode: empty think block stripped cleanly
- Streaming: reasoning chunks delivered in delta.reasoning, content chunks clean
- use_jinja:true variant (working-autoparser baseline): content_len=39 reasoning_len=376 from autoparser — Go-side fallback bypassed as expected

🤖 Generated with Claude Code

…in pure-content mode

When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery family), llama.cpp's chat parser falls back to a "pure content" PEG parser that dumps the entire raw response into ChatDelta.Content with an empty ReasoningContent. The Go side then trusted that content verbatim and overrode tokenCallback's correctly-split reasoning, so <think>...</think> blocks ended up in the OpenAI `content` field. Regression from v4.0.0 introduced when the autoparser ChatDeltas path was added (#9224).

The override now runs Go-side reasoning extraction defensively when the autoparser delivered content but no reasoning. The streaming worker gains a sticky preferAutoparser flag that flips on the first chunk where the autoparser classified reasoning_content; until then we use the streaming Go-side extractor. Realtime mirrors the non-streaming fallback. When the autoparser already populated ReasoningContent we trust it untouched, so jinja-enabled installs are not regressed.

gallery/qwen3.yaml now enables use_jinja, letting the autoparser classify <think> natively for all 20+ qwen3 family entries that share this template.

Fixes #9985

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
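
The sticky preferAutoparser behavior described in this commit message can be illustrated with a small sketch. The types and the `pick` method below are hypothetical and only model the selection rule (use Go-side deltas until the autoparser first classifies reasoning, then stay on the autoparser); they are not the streaming worker's actual code.

```go
package main

import "fmt"

// delta is a hypothetical per-chunk classification result.
type delta struct {
	Content   string
	Reasoning string
}

// chunk pairs the two candidate classifications for one streamed chunk.
type chunk struct {
	Autoparser delta // what the C++ autoparser reported
	GoSide     delta // what the Go-side streaming extractor produced
}

// streamState carries the sticky flag across chunks of one response.
type streamState struct {
	preferAutoparser bool
}

// pick flips the flag on the first chunk where the autoparser classified
// reasoning and never flips it back. Before that point it returns the
// Go-side delta; afterwards it returns the autoparser delta.
func (s *streamState) pick(c chunk) delta {
	if !s.preferAutoparser && c.Autoparser.Reasoning != "" {
		s.preferAutoparser = true
	}
	if s.preferAutoparser {
		return c.Autoparser
	}
	return c.GoSide
}

func main() {
	// Pure-content mode: the autoparser never reports reasoning, so every
	// chunk is served from the Go-side extractor.
	st := &streamState{}
	stream := []chunk{
		{Autoparser: delta{Content: "<think>plan"}, GoSide: delta{Reasoning: "plan"}},
		{Autoparser: delta{Content: "</think>Hi"}, GoSide: delta{Content: "Hi"}},
	}
	for _, c := range stream {
		d := st.pick(c)
		fmt.Printf("content=%q reasoning=%q\n", d.Content, d.Reasoning)
	}
}
```

Making the flag sticky avoids switching sources mid-response once the autoparser has shown it is classifying reasoning itself.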

…rs (#10225)

* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so DetectThinkingStartToken returns e.g. "<think>"), the model's output begins inside a reasoning block and carries only the closing tag. The non-jinja autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered directly with no reasoning at all. Prepending the start token there manufactures an unclosed block that swallows the entire answer into reasoning, leaving the OpenAI `content` field empty.

This breaks short/direct answers (session names, JSON summaries, any terse completion where the model skips the think block), which come back with empty content. Regression surfaced by #9991, which added the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token when the response actually contains the matching closing tag (proof a reasoning block exists). Genuine reasoning tags already in the content still extract; tag-less content stays content. Apply it at every complete-response site (applyAutoparserOverride, realtime, openresponses). The streaming per-token extractor is intentionally left on ExtractReasoningWithConfig: mid-stream an as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles and was not covered before: a request with enable_thinking=false against a <think>-capable model. In non-thinking mode the model emits no reasoning block, so llama.cpp's autoparser returns ChatDeltas with content set and reasoning_content empty (verified against stock llama-server: same model with chat_template_kwargs.enable_thinking=false returns reasoning_content=null, content="hello"). thinkingStartToken is still "<think>" because it is detected per-model from the enable_thinking=true render, so the old code prepended it and swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
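
The closing-tag gate described in #10225 can be sketched as below. This is an illustrative approximation under stated assumptions: `extractComplete`, `extractPaired`, and `tagPairs` are hypothetical names, the signatures of the real reasoning.ExtractReasoningComplete and reasoning.ClosingTokenForStart are not shown in this thread, and only the `<think>` pair is modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// tagPairs maps a start token to its closing token (illustrative subset).
var tagPairs = map[string]string{"<think>": "</think>"}

// extractPaired removes the first closed openTag...closeTag block from s and
// returns (reasoning, content). Without a closed block, s stays content.
func extractPaired(s, openTag, closeTag string) (string, string) {
	start := strings.Index(s, openTag)
	if start < 0 {
		return "", s
	}
	rest := s[start+len(openTag):]
	end := strings.Index(rest, closeTag)
	if end < 0 {
		return "", s
	}
	return strings.TrimSpace(rest[:end]),
		strings.TrimSpace(s[:start] + rest[end+len(closeTag):])
}

// extractComplete handles a COMPLETE response. prefilledStart is the start
// token the chat template injected into the prompt ("" if none). The start
// token is prepended only when the response contains the matching closing
// tag, which is the evidence that a reasoning block exists.
func extractComplete(content, prefilledStart string) (reasoning, cleaned string) {
	closeTag, ok := tagPairs[prefilledStart]
	if !ok {
		// No known prefilled token: only explicit tag pairs are extracted.
		return extractPaired(content, "<think>", "</think>")
	}
	hasOpen := strings.Contains(content, prefilledStart)
	hasClose := strings.Contains(content, closeTag)
	if !hasOpen && hasClose {
		content = prefilledStart + content
	}
	return extractPaired(content, prefilledStart, closeTag)
}

func main() {
	// enable_thinking=false style answer: no tags at all, stays content.
	r, c := extractComplete("hello", "<think>")
	fmt.Printf("reasoning=%q content=%q\n", r, c)

	// Prefilled start token, model emitted only the closing tag.
	r, c = extractComplete("plan</think>hello", "<think>")
	fmt.Printf("reasoning=%q content=%q\n", r, c)
}
```

The gate applies only to complete responses; mid-stream, an unclosed block is still legitimate, which is why the commit leaves the streaming extractor unchanged.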

mudler mentioned this pull request May 25, 2026

Regression: Reasoning/thinking output provided as regular output #9985

Closed

mudler merged commit 1c6c3ad into master May 25, 2026
57 checks passed

mudler deleted the fix/9985-autoparser-reasoning-leak branch May 25, 2026 20:39

This was referenced May 25, 2026

fix(streaming/tools): stop healing-marker stubs from gating off content #9999

Merged

fix(streaming/tools): don't leak prefill-misclassified content as trailing reasoning chunk #10000

Merged

BrewTestBot mentioned this pull request May 27, 2026

localai 4.3.2 Homebrew/homebrew-core#285003

Merged

localai-bot mentioned this pull request Jun 8, 2026

fix(reasoning): stop prefilled <think> from swallowing tag-less answers #10225

Merged

localai-bot added the bug Something isn't working label Jun 10, 2026

localai-bot mentioned this pull request Jun 12, 2026

Agent always ever answers {"{" #9419

Closed
