fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode #9991
Merged
…in pure-content mode

When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery family), llama.cpp's chat parser falls back to a "pure content" PEG parser that dumps the entire raw response into ChatDelta.Content with an empty ReasoningContent. The Go side then trusted that content verbatim and overrode tokenCallback's correctly-split reasoning, so <think>...</think> blocks ended up in the OpenAI `content` field. Regression from v4.0.0 introduced when the autoparser ChatDeltas path was added (#9224).

The override now runs Go-side reasoning extraction defensively when the autoparser delivered content but no reasoning. The streaming worker gains a sticky preferAutoparser flag that flips on the first chunk where the autoparser classified reasoning_content; until then we use the streaming Go-side extractor. Realtime mirrors the non-streaming fallback. When the autoparser already populated ReasoningContent we trust it untouched, so jinja-enabled installs are not regressed.

gallery/qwen3.yaml now enables use_jinja, letting the autoparser classify <think> natively for all 20+ qwen3 family entries that share this template.

Fixes #9985

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
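The override rule in the commit message above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `chatDelta`, `splitThink`, and `applyOverride` are hypothetical names, and the real `applyAutoparserOverride` and `ExtractReasoningWithConfig` signatures are not shown in this text.

```go
package main

import (
	"fmt"
	"strings"
)

// chatDelta is a hypothetical stand-in for the autoparser result.
type chatDelta struct {
	Content          string
	ReasoningContent string
}

// splitThink is an illustrative extractor (assumption, not the real one).
// It pulls the first closed <think>...</think> block out of s and returns
// (reasoning, content). Input without a closed block is returned unchanged.
func splitThink(s string) (string, string) {
	const openTag, closeTag = "<think>", "</think>"
	start := strings.Index(s, openTag)
	if start < 0 {
		return "", s
	}
	rest := s[start+len(openTag):]
	end := strings.Index(rest, closeTag)
	if end < 0 {
		return "", s
	}
	reasoning := strings.TrimSpace(rest[:end])
	content := strings.TrimSpace(s[:start] + rest[end+len(closeTag):])
	return reasoning, content
}

// applyOverride trusts the autoparser when it classified reasoning, and
// otherwise runs the Go-side extraction over the raw content.
func applyOverride(d chatDelta) (content, reasoning string) {
	if d.ReasoningContent != "" {
		// Working autoparser (e.g. jinja enabled): leave it untouched.
		return d.Content, d.ReasoningContent
	}
	// Pure-content mode: tags may still be embedded in Content.
	reasoning, content = splitThink(d.Content)
	return content, reasoning
}

func main() {
	c, r := applyOverride(chatDelta{Content: "<think>plan the reply</think>hello"})
	fmt.Printf("content=%q reasoning=%q\n", c, r)
}
```

The key design point is the asymmetry: the fallback only runs when `ReasoningContent` is empty, so a correctly classifying autoparser is never second-guessed.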
This was referenced May 25, 2026
mudler added a commit that referenced this pull request Jun 9, 2026
…rs (#10225)

* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so DetectThinkingStartToken returns e.g. "<think>"), the model's output begins inside a reasoning block and carries only the closing tag. The non-jinja autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the start token so the extractor can pair it with the model's </think>.

But on a COMPLETE response that contains no closing tag, the model answered directly with no reasoning at all. Prepending the start token there manufactures an unclosed block that swallows the entire answer into reasoning, leaving the OpenAI `content` field empty. This breaks short/direct answers (session names, JSON summaries, any terse completion where the model skips the think block), which come back with empty content. Regression surfaced by #9991, which added the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token when the response actually contains the matching closing tag (proof a reasoning block exists). Genuine reasoning tags already in the content still extract; tag-less content stays content. Apply it at every complete-response site (applyAutoparserOverride, realtime, openresponses). The streaming per-token extractor is intentionally left on ExtractReasoningWithConfig: mid-stream, an as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag pairs to package scope so both helpers share one source of truth.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles and was not covered before: a request with enable_thinking=false against a <think>-capable model. In non-thinking mode the model emits no reasoning block, so llama.cpp's autoparser returns ChatDeltas with content set and reasoning_content empty (verified against stock llama-server: same model with chat_template_kwargs.enable_thinking=false returns reasoning_content=null, content="hello"). thinkingStartToken is still "<think>" because it is detected per-model from the enable_thinking=true render, so the old code prepended it and swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
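The closing-tag gate described in the #10225 commit can be sketched as below. This is a hedged illustration only: `extractComplete`, `closingFor`, `cutBlock`, and the one-entry `tagPairs` table are hypothetical stand-ins, and the actual signatures of `reasoning.ExtractReasoningComplete` and `reasoning.ClosingTokenForStart` are not given in this text.

```go
package main

import (
	"fmt"
	"strings"
)

// tagPairs is an assumed, simplified table of reasoning tag pairs.
var tagPairs = [][2]string{{"<think>", "</think>"}}

// closingFor returns the closing tag paired with start, if it is known.
func closingFor(start string) (string, bool) {
	for _, p := range tagPairs {
		if p[0] == start {
			return p[1], true
		}
	}
	return "", false
}

// cutBlock removes the first closed openTag...closeTag block from s.
// ok is false when s holds no closed block, and s is returned unchanged.
func cutBlock(s, openTag, closeTag string) (reasoning, content string, ok bool) {
	start := strings.Index(s, openTag)
	if start < 0 {
		return "", s, false
	}
	rest := s[start+len(openTag):]
	end := strings.Index(rest, closeTag)
	if end < 0 {
		return "", s, false
	}
	return strings.TrimSpace(rest[:end]),
		strings.TrimSpace(s[:start] + rest[end+len(closeTag):]), true
}

// extractComplete handles a COMPLETE response. The prefilled start token is
// honored only when the matching closing tag is present in the response.
func extractComplete(content, prefilledStart string) (reasoning, cleaned string) {
	if closeTag, known := closingFor(prefilledStart); known &&
		!strings.Contains(content, prefilledStart) &&
		strings.Contains(content, closeTag) {
		// A closing tag proves a reasoning block exists: restore its opener.
		content = prefilledStart + content
	}
	for _, p := range tagPairs {
		if r, c, ok := cutBlock(content, p[0], p[1]); ok {
			return r, c
		}
	}
	// Tag-less content stays content.
	return "", content
}

func main() {
	r, c := extractComplete("hello", "<think>")
	fmt.Printf("reasoning=%q content=%q\n", r, c)
	r, c = extractComplete("weigh options</think>hello", "<think>")
	fmt.Printf("reasoning=%q content=%q\n", r, c)
}
```

Note that this gate is only appropriate for complete responses; mid-stream, a missing closing tag may simply mean the block has not finished yet, which is why the commit leaves the streaming extractor alone.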
Summary
Fixes #9985. qwen3-4b (and the rest of the qwen3 family) was returning the `<think>...</think>` block inside the OpenAI `content` field instead of in a separate `reasoning` field. Regression from v4.0.0, introduced by the C++ autoparser ChatDeltas path (#9224).
Root cause
When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery), llama.cpp's chat parser falls back to a "pure content" PEG parser. It dumps the entire raw response, `<think>` tags and all, into `ChatDelta.Content` and leaves `ChatDelta.ReasoningContent` empty. The Go side in `chat.go` then preferred the autoparser's content over `tokenCallback`'s correctly-split result, so the tags leaked through.
Debug log showing the bug:
Fix shape
- `applyAutoparserOverride` (extracted from `chat.go`'s inline override) now runs Go-side `ExtractReasoningWithConfig` when the autoparser delivered content but no reasoning. When the autoparser DID populate `ReasoningContent`, we trust it untouched, so jinja-enabled installs are not regressed.
- The streaming worker gains a sticky `preferAutoparser` flag. It flips on the first chunk where the autoparser classified `reasoning_content`; until then the streaming worker uses the Go-side extractor's deltas.
- `gallery/qwen3.yaml` now enables `use_jinja:true` so the autoparser classifies `<think>` natively for the 20+ qwen3 family entries sharing this template. The Go-side fallback still covers older on-disk installs and any future imported models without jinja.
Test plan
- `go test ./core/http/endpoints/openai/ ./core/http/endpoints/openresponses/ ./pkg/reasoning/ ./pkg/functions/`: green
- `chat_test.go` covering:
  - `<think>` in content + empty reasoning → split correctly (red without fix, green with fix)
  - `<think></think>` block from qwen3 `/no_think` → tags stripped, no spurious reasoning field
- `golangci-lint run --new-from-merge-base=master`: 0 new issues
- `/no_think` mode: empty think block stripped cleanly
- `delta.reasoning`, content chunks clean
- `use_jinja:true` variant (working-autoparser baseline): `content_len=39 reasoning_len=376` from autoparser, Go-side fallback bypassed as expected

🤖 Generated with Claude Code
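For the streaming path, the sticky `preferAutoparser` switch described under Fix shape might look roughly like the sketch below. The `delta` and `streamWorker` types and the `choose` method are hypothetical names for illustration; the real worker also has to deal with buffering and tool calls, which this sketch ignores.

```go
package main

import "fmt"

// delta is a hypothetical per-chunk result with content and reasoning parts.
type delta struct {
	Content   string
	Reasoning string
}

// streamWorker carries the sticky flag across chunks of one response.
type streamWorker struct {
	preferAutoparser bool
}

// choose picks which source feeds the client for this chunk. The flag flips
// the first time the autoparser classifies reasoning and never flips back.
func (w *streamWorker) choose(auto, goSide delta) delta {
	if auto.Reasoning != "" {
		w.preferAutoparser = true
	}
	if w.preferAutoparser {
		return auto
	}
	// Autoparser has shown no reasoning so far: use the Go-side extractor.
	return goSide
}

func main() {
	w := &streamWorker{}
	// Pure-content mode: the autoparser reports raw tags as content.
	out := w.choose(delta{Content: "<think>plan"}, delta{Reasoning: "plan"})
	fmt.Printf("%+v prefer=%v\n", out, w.preferAutoparser)
}
```

Making the flag sticky means a single response never alternates between the two sources once the autoparser has proven it can classify reasoning.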