-
-
Notifications
You must be signed in to change notification settings - Fork 79.1k
[Bug]: In-turn reasoning dropped on multi-turn tool replay for non-400 openai models (gemma4/vLLM) — silent agentic-quality regression #91645
Copy link
Copy link
Closed
Labels
P2Normal backlog priority with limited blast radius.Normal backlog priority with limited blast radius.clawsweeper:needs-live-reproClawSweeper needs live local, crabbox, or manual validation to confirm this issue.ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.clawsweeper:needs-maintainer-reviewClawSweeper marked this issue as needing maintainer review before automation.ClawSweeper marked this issue as needing maintainer review before automation.clawsweeper:needs-product-decisionClawSweeper marked this issue as needing a product or behavior decision.ClawSweeper marked this issue as needing a product or behavior decision.clawsweeper:no-new-fix-prClawSweeper does not recommend queueing a new automated fix PR for this issue.ClawSweeper does not recommend queueing a new automated fix PR for this issue.impact:auth-providerAuth, provider routing, model choice, or SecretRef resolution may break.Auth, provider routing, model choice, or SecretRef resolution may break.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.Session, memory, transcript, context, or agent state can drift or corrupt.issue-rating: 🐚 platinum hermitGood issue quality with a plausible reproduction path needing some confirmation.Good issue quality with a plausible reproduction path needing some confirmation.
Metadata
Metadata
Assignees
Labels
P2Normal backlog priority with limited blast radius.Normal backlog priority with limited blast radius.clawsweeper:needs-live-reproClawSweeper needs live local, crabbox, or manual validation to confirm this issue.ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.clawsweeper:needs-maintainer-reviewClawSweeper marked this issue as needing maintainer review before automation.ClawSweeper marked this issue as needing maintainer review before automation.clawsweeper:needs-product-decisionClawSweeper marked this issue as needing a product or behavior decision.ClawSweeper marked this issue as needing a product or behavior decision.clawsweeper:no-new-fix-prClawSweeper does not recommend queueing a new automated fix PR for this issue.ClawSweeper does not recommend queueing a new automated fix PR for this issue.impact:auth-providerAuth, provider routing, model choice, or SecretRef resolution may break.Auth, provider routing, model choice, or SecretRef resolution may break.impact:session-stateSession, memory, transcript, context, or agent state can drift or corrupt.Session, memory, transcript, context, or agent state can drift or corrupt.issue-rating: 🐚 platinum hermitGood issue quality with a plausible reproduction path needing some confirmation.Good issue quality with a plausible reproduction path needing some confirmation.
Type
Fields
Give feedbackNo fields configured for issues without a type.
Bug type
Behavior bug (incorrect output/state without crash)
Summary
For openai-completions reasoning models whose API does not require
reasoning_contenton assistant messages (e.g. gemma4 / Gemma-4-12B on vLLM), OpenClaw drops the model's in-turn reasoning when replaying multi-turn tool conversations. Since the provider doesn't 400 on the missing field (unlike DeepSeek/Moonshot/Xiaomi), the failure is silent: no error, just degraded multi-step tool behavior — tool-callargumentsintermittently collapse to{}, and the model re-issues identical tool calls because it can't see its own prior reasoning.This is distinct from the known API-compat issues (#70392, #81419, #91558, #89660, #91106): those inject an empty
reasoning_content: ""to satisfy strict provider schemas and avoid a 400. Here the provider is fine without the field — what's missing for quality is the real in-turn reasoning text, so the model keeps continuity across tool calls.Evidence (captured off the wire)
TCP tee between OpenClaw and vLLM; one multi-turn tool exchange via
openclaw agent --local --thinking high. Deepest request OpenClaw sent (system, user, 3× assistant tool-call, 3× tool):tool_callspreservedreasoning_contentpresent on 0 / 3 in-turn assistant tool-call messagesThe Gemma 4 chat template re-injects in-turn reasoning (
reasoning/reasoning_content→<|channel>thought…<channel|>, gated to messages after the last user turn — it correctly keeps in-turn reasoning and drops completed-turn reasoning). OpenClaw sends nothing, so re-injection never fires. Across stored trajectories on this install: 13execcalls with emptyarguments(exec requires acommand), plus repeated identical tool calls within sessions.Suspected root cause
reasoningContentappears empty at replay for these models, so the existing populate path never runs:requiresReasoningContentOnAssistantMessagesonly governs theelse(empty-field API-compat shim) and so does not restore real reasoning. The gap looks upstream: in-turn reasoning isn't carried from the session store intoreasoningContentfor openai-format models that aren't on the DeepSeek/Xiaomi detect list. (Same populate code in 2026.6.1 and 2026.6.5-beta — not a regression.)Proposed direction
Preserve real in-turn reasoning (reasoning generated since the last user message) on replayed assistant tool-call messages for openai reasoning models, independent of the empty-field API-compat path. This matches what the Gemma model card + chat template assume, and the community reports of the same symptom (Gemma-4-12B tool-calling PSA on r/LocalLLaMA; Qwen3.6 analogue: earendil-works/pi#3325 — "after 2-3 turns every tool call collapses to
arguments: {}").Steps to reproduce
api: openai-completionson vLLM,compat.thinkingFormat: "openai",reasoning: true.--thinking high.tool_callsbut noreasoning_content; over several turns, arguments degrade / tool calls repeat.Environment
api: openai-completions,compat.thinkingFormat: "openai",reasoning: true,--thinking highWorkaround
Make each tool action self-contained (a single call returns the final value — e.g. server-computed totals, or a
gog | jq | dateone-liner) so turns don't depend on cross-turn reasoning. Restores correctness for those flows but doesn't help genuinely multi-step reasoning.