
fix(reasoning): stop <think> leaking into content when autoparser is in pure-content mode #9991

Merged
mudler merged 1 commit into master from fix/9985-autoparser-reasoning-leak
May 25, 2026

Conversation

@localai-bot

Collaborator

Summary

Fixes #9985 — qwen3-4b (and the rest of the qwen3 family) was returning the <think>...</think> block inside the OpenAI content field instead of in a separate reasoning field. Regression from v4.0.0, introduced by the C++ autoparser ChatDeltas path (#9224).

Root cause

When LocalAI templates a thinking model outside of jinja (the default for the qwen3 gallery), llama.cpp's chat parser falls back to a "pure content" PEG parser. It dumps the entire raw response — <think> tags and all — into ChatDelta.Content and leaves ChatDelta.ReasoningContent empty. The Go side in chat.go then preferred the autoparser's content over tokenCallback's correctly-split result, so the tags leaked through.

Debug log showing the bug:

[ChatDeltas] non-streaming Predict received deltas from C++ autoparser total_deltas=1
[ChatDeltas] non-SSE no-tools: overriding result with C++ autoparser deltas content_len=376 reasoning_len=0

Fix shape

  • Conditional fallback. applyAutoparserOverride (extracted from chat.go's inline override) now runs Go-side ExtractReasoningWithConfig when the autoparser delivered content but no reasoning. When the autoparser DID populate ReasoningContent, we trust it untouched — jinja-enabled installs are not regressed.
  • Streaming gets a sticky preferAutoparser flag. It flips on the first chunk where the autoparser classified reasoning_content; until then the streaming worker uses the Go-side extractor's deltas.
  • Realtime mirrors the non-streaming fallback.
  • gallery/qwen3.yaml now enables use_jinja:true so the autoparser classifies <think> natively for the 20+ qwen3 family entries sharing this template. The Go-side fallback still covers older on-disk installs and any future imported models without jinja.
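The conditional fallback described above can be sketched as follows. This is a minimal illustration, not the code from chat.go: the `ChatDelta` struct, `splitThink`, and `applyOverride` are hypothetical stand-ins for the real delta type, `ExtractReasoningWithConfig`, and `applyAutoparserOverride`, and the extractor handles only a single `<think>...</think>` pair.

```go
package main

import (
	"fmt"
	"strings"
)

// ChatDelta is a hypothetical stand-in for the delta the C++ autoparser returns.
type ChatDelta struct {
	Content          string
	ReasoningContent string
}

// splitThink is a simplified stand-in for the Go-side extractor.
// Assumption of this sketch: only one <think>...</think> pair is handled.
func splitThink(raw string) (reasoning, content string) {
	const openTag, closeTag = "<think>", "</think>"
	start := strings.Index(raw, openTag)
	if start < 0 {
		return "", raw // no tags: everything is content
	}
	rest := raw[start+len(openTag):]
	end := strings.Index(rest, closeTag)
	if end < 0 {
		// Unclosed block: treat the remainder as reasoning.
		return strings.TrimSpace(rest), strings.TrimSpace(raw[:start])
	}
	return strings.TrimSpace(rest[:end]),
		strings.TrimSpace(raw[:start] + rest[end+len(closeTag):])
}

// applyOverride trusts the autoparser when it classified reasoning, and
// falls back to Go-side extraction when it delivered content only.
func applyOverride(deltas []ChatDelta, curContent, curReasoning string) (content, reasoning string) {
	if len(deltas) == 0 {
		return curContent, curReasoning // nothing to override with
	}
	var c, r strings.Builder
	for _, d := range deltas {
		c.WriteString(d.Content)
		r.WriteString(d.ReasoningContent)
	}
	if r.Len() > 0 {
		return c.String(), r.String() // jinja path: pass through untouched
	}
	reasoning, content = splitThink(c.String())
	return content, reasoning
}

func main() {
	deltas := []ChatDelta{{Content: "<think>plan</think>Hello"}}
	content, reasoning := applyOverride(deltas, "", "")
	fmt.Printf("content=%q reasoning=%q\n", content, reasoning)
}
```

The key design choice is that the fallback is gated on an empty reasoning field, so a working autoparser is never second-guessed.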

Test plan

  • go test ./core/http/endpoints/openai/ ./core/http/endpoints/openresponses/ ./pkg/reasoning/ ./pkg/functions/ — green
  • New Ginkgo specs in chat_test.go covering:
    • autoparser delivered <think> in content + empty reasoning → split correctly (red without fix, green with fix)
    • autoparser already populated reasoning → passthrough untouched (no-regression on jinja path)
    • plain content, no reasoning tags → passthrough
    • empty <think></think> block from qwen3 /no_think → tags stripped, no spurious reasoning field
    • empty chatDeltas → returns existing result
  • golangci-lint run --new-from-merge-base=master — 0 new issues
  • End-to-end against running qwen3-4b (Q4_K_M):
    • Default thinking mode: content clean, reasoning in its own field
    • /no_think mode: empty think block stripped cleanly
    • Streaming: reasoning chunks delivered in delta.reasoning, content chunks clean
    • use_jinja:true variant (working-autoparser baseline): content_len=39 reasoning_len=376 from autoparser — Go-side fallback bypassed as expected

🤖 Generated with Claude Code

…in pure-content mode

When LocalAI templates a thinking model outside of jinja (the default for
the qwen3 gallery family), llama.cpp's chat parser falls back to a
"pure content" PEG parser that dumps the entire raw response into
ChatDelta.Content with an empty ReasoningContent. The Go side then
trusted that content verbatim and overrode tokenCallback's
correctly-split reasoning, so <think>...</think> blocks ended up in the
OpenAI `content` field. Regression from v4.0.0 introduced when the
autoparser ChatDeltas path was added (#9224).

The override now runs Go-side reasoning extraction defensively when the
autoparser delivered content but no reasoning. The streaming worker
gains a sticky preferAutoparser flag that flips on the first chunk
where the autoparser classified reasoning_content; until then we use
the streaming Go-side extractor. Realtime mirrors the non-streaming
fallback. When the autoparser already populated ReasoningContent we
trust it untouched, so jinja-enabled installs are not regressed.
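A rough sketch of the sticky flag in the streaming worker follows. The `chunk`, `delta`, and `selector` types are hypothetical and exist only for illustration; the real worker presumably carries this state alongside its other per-request fields.

```go
package main

import "fmt"

// chunk is a hypothetical per-chunk view of the same tokens as seen by
// the C++ autoparser and by the Go-side streaming extractor.
type chunk struct {
	AutoContent, AutoReasoning string
	GoContent, GoReasoning     string
}

// delta is what gets emitted to the client for one chunk.
type delta struct {
	Content, Reasoning string
}

// selector holds the sticky flag for one streaming request.
type selector struct {
	preferAutoparser bool
}

// pick flips the flag the first time the autoparser classifies reasoning
// and never flips it back; until then the Go-side deltas are used.
func (s *selector) pick(c chunk) delta {
	if c.AutoReasoning != "" {
		s.preferAutoparser = true
	}
	if s.preferAutoparser {
		return delta{Content: c.AutoContent, Reasoning: c.AutoReasoning}
	}
	return delta{Content: c.GoContent, Reasoning: c.GoReasoning}
}

func main() {
	s := &selector{}
	chunks := []chunk{
		{AutoContent: "<think>pl", GoReasoning: "pl"}, // autoparser has not classified yet
		{AutoReasoning: "an", GoReasoning: "an"},      // first classified chunk: flag flips
		{AutoContent: "Hi", GoContent: "Hi"},          // flag stays on
	}
	for _, c := range chunks {
		fmt.Printf("%+v prefer=%v\n", s.pick(c), s.preferAutoparser)
	}
}
```

In pure-content mode the autoparser never reports reasoning, so the flag never flips and the Go-side extractor drives the whole stream.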

gallery/qwen3.yaml now enables use_jinja, letting the autoparser
classify <think> natively for all 20+ qwen3 family entries that share
this template.

Fixes #9985

Assisted-by: Claude:opus-4-7 [Read] [Edit] [Bash] [Write]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit 1c6c3ad into master May 25, 2026
57 checks passed
@mudler mudler deleted the fix/9985-autoparser-reasoning-leak branch May 25, 2026 20:39
mudler added a commit that referenced this pull request Jun 9, 2026
…rs (#10225)

* fix(reasoning): stop prefilled <think> from swallowing tag-less answers

When a chat template injects the thinking start token into the prompt (so
DetectThinkingStartToken returns e.g. "<think>"), the model's output begins
inside a reasoning block and carries only the closing tag. The non-jinja
autoparser fallback (peg-native "pure content" mode, issue #9985) prepends the
start token so the extractor can pair it with the model's </think>.

But a COMPLETE response that contains no closing tag means the model answered
directly with no reasoning at all. Prepending the start token there manufactures
an unclosed block that swallows the entire answer into reasoning, leaving the
OpenAI `content` field empty. This breaks short/direct answers — session names,
JSON summaries, any terse completion where the model skips the think block —
which come back with empty content. Regression surfaced by #9991, which added
the defensive prefill extraction to the complete-response paths.

Add reasoning.ExtractReasoningComplete: it only honors a prefilled start token
when the response actually contains the matching closing tag (proof a reasoning
block exists). Genuine reasoning tags already in the content still extract;
tag-less content stays content. Apply it at every complete-response site
(applyAutoparserOverride, realtime, openresponses). The streaming per-token
extractor is intentionally left on ExtractReasoningWithConfig — mid-stream an
as-yet-unclosed block is legitimate and must surface as reasoning deltas.

Also adds reasoning.ClosingTokenForStart and hoists the default reasoning tag
pairs to package scope so both helpers share one source of truth.
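The gate can be illustrated with a small sketch. The names below (`tagPairs`, `closingTokenForStart`, `extractComplete`) are illustrative stand-ins for the package-level tag pairs, `reasoning.ClosingTokenForStart`, and `reasoning.ExtractReasoningComplete`; their real signatures are not shown in this commit message, and the single tag pair and the handling of an explicit unclosed tag are assumptions of the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// tagPairs is an illustrative subset of the default reasoning tag pairs.
var tagPairs = map[string]string{"<think>": "</think>"}

// closingTokenForStart returns the closing tag for a start token, or "".
func closingTokenForStart(start string) string {
	return tagPairs[start]
}

// extractComplete splits a COMPLETE response. A prefilled start token is
// only prepended when the matching closing tag is present in the response.
func extractComplete(raw, prefilledStart string) (reasoning, content string) {
	openTag := prefilledStart
	if openTag == "" {
		openTag = "<think>" // assumption: default pair when nothing was prefilled
	}
	closeTag := closingTokenForStart(openTag)
	if closeTag == "" {
		return "", raw // unknown pair: leave content untouched
	}
	// The gate: no closing tag means no reasoning block, so do not prepend.
	if prefilledStart != "" && !strings.Contains(raw, openTag) && strings.Contains(raw, closeTag) {
		raw = openTag + raw
	}
	start := strings.Index(raw, openTag)
	if start < 0 {
		return "", raw // tag-less content stays content
	}
	rest := raw[start+len(openTag):]
	end := strings.Index(rest, closeTag)
	if end < 0 {
		// Assumption of this sketch: an explicit unclosed tag is reasoning.
		return strings.TrimSpace(rest), strings.TrimSpace(raw[:start])
	}
	return strings.TrimSpace(rest[:end]),
		strings.TrimSpace(raw[:start] + rest[end+len(closeTag):])
}

func main() {
	for _, raw := range []string{"hello", "plan</think>hello"} {
		r, c := extractComplete(raw, "<think>")
		fmt.Printf("raw=%q reasoning=%q content=%q\n", raw, r, c)
	}
}
```

With the gate in place, a terse answer such as `hello` stays in `content` even though the start token was prefilled, while a response carrying only `</think>` is still paired with the prefilled `<think>`.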

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(reasoning): cover the enable_thinking=false non-thinking-mode regression

Adds the end-to-end case that actually broke session summaries / auto-titles
and was not covered before: a request with enable_thinking=false against a
<think>-capable model. In non-thinking mode the model emits no reasoning block,
so llama.cpp's autoparser returns ChatDeltas with content set and
reasoning_content empty (verified against stock llama-server: same model with
chat_template_kwargs.enable_thinking=false returns reasoning_content=null,
content="hello"). thinkingStartToken is still "<think>" because it is detected
per-model from the enable_thinking=true render, so the old code prepended it and
swallowed the answer. The test fails without the ExtractReasoningComplete gate.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@localai-bot localai-bot added the bug Something isn't working label Jun 10, 2026

Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Regression: Reasoning/thinking output provided as regular output

2 participants