fix(agents): escalate LLM idle timeout to model fallback after profile rotation#80449
Conversation
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. for a source-level reproduction: current main sets Real behavior proof Next step before merge Security Review detailsBest possible solution: Land the narrow failover-policy plumbing so LLM idle watchdog timeouts are treated as timeout-shaped assistant failures after profile rotation, while leaving broader timeout phase splitting and terminal abort policy to their separate PRs. Do we have a high-confidence way to reproduce the issue? Yes for a source-level reproduction: current main sets Is this the best way to solve the issue? Yes. Threading What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against b0c817ee9d50. Re-review progress:
|
f2b8529 to
a715c1f
Compare
|
Addressed both review items. P3 (preserve timeout reason after idle rotations) — fixed in 928a570. Real behavior proof — added live patched-build evidence to the PR body. Built the branch in a clean sandbox, ran a real
Full log excerpt and a note on differential testing against the unpatched installed build ( @clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…e rotation When the LLM idle watchdog fires (model produced no tokens for N seconds), idleTimedOut is set in handleAssistantFailover but was never passed into resolveRunFailoverDecision. As a result, shouldRotateAssistant saw neither failoverReason nor timedOut (the run-budget timeout) set, returned false, and the decision fell through to continue_normal -- the agent silently froze without surfacing an error or advancing the fallback chain. Fixes openclaw#76877 (regression since 2026.4.24). Changes: - failover-policy.ts: add idleTimedOut to AssistantDecisionParams; include it in shouldRotateAssistant and reason selection in resolveRunFailoverDecision - assistant-failover.ts: pass idleTimedOut into resolveRunFailoverDecision - failover-policy.test.ts: 4 new cases for idle timeout path; update existing assistant stage cases with the new required field (idleTimedOut: false)
- failover-policy.test.ts: move 4 new it() blocks inside describe() (they were orphaned outside the block and would not execute) - run.ts: add idleTimedOut to the assistantFailoverDecision call site (missing required field caused TypeScript error and reproduced the freeze for the initial-decision code path in the outer loop) - assistant-failover.ts: treat idleTimedOut same as timedOut in markFailedProfile to avoid incorrect profile failure recording - assistant-failover.ts: add warn log when idle timeout rotates a profile - assistant-failover.ts: extend resolveAssistantFailoverErrorMessage to accept idleTimedOut so surface_error emits "LLM request timed out." instead of the generic "LLM request failed."
Addresses Codex P3 review finding: when shouldRotateAssistant fires on idleTimedOut alone (timedOut=false), mergeRetryFailoverReason was passed timedOut: params.timedOut (false), so the accumulated retry reason did not record 'timeout'. Pass timedOut || idleTimedOut so the timeout reason survives idle-only rotations and downstream fallback_model receives the correct reason.
7ffba2f to
1ed8e80
Compare
|
Landed via rebase onto main.
Thanks @jimdawdy-hub! |
Summary
Fixes #76877 — regression since 2026.4.24.
When the LLM idle watchdog fires (
idleTimedOut = true), the agent silently freezes instead of rotating auth profiles or advancing the model fallback chain.Root cause:
idleTimedOutis tracked inhandleAssistantFailoverbut was never passed intoresolveRunFailoverDecision. As a result,shouldRotateAssistantsaw neitherfailoverReasonnortimedOut(the run-budget timeout) set, returnedfalse, and the decision fell through tocontinue_normal— the agent produced no response and no error.This is distinct from the two existing open PRs:
What changed
failover-policy.tsidleTimedOut: booleantoAssistantDecisionParamsshouldRotateAssistantnow returnstrueforidleTimedOutresolveRunFailoverDecisionemitsreason: "timeout"when eithertimedOutoridleTimedOutis trueassistant-failover.tsidleTimedOuttoresolveRunFailoverDecisionin the profile-rotation branchmarkFailedProfilenow treatsidleTimedOutsame astimedOut(suppresses incorrect profile failure recording for silence events)resolveAssistantFailoverErrorMessagenow acceptsidleTimedOut— surfaces"LLM request timed out."instead of"LLM request failed."when idle timeout exhausts all escalationwarnlog when idle timeout causes a profile rotationrun.tsidleTimedOutto theassistantFailoverDecisioncall site (missing required field was a TypeScript compile error and reproduced the freeze for the initial-decision path)failover-policy.test.tsdescribe("resolveRunFailoverDecision", ...)covering the idle timeout pathidleTimedOut: falseBehaviour after the fix
continue_normal(freeze)rotate_profilecontinue_normal(freeze)fallback_model(reason:"timeout")continue_normal(freeze)surface_error+"LLM request timed out."surface_error(unchanged)surface_error(unchanged)timedOutReal behavior proof
Behavior or issue addressed: Agent (dashboard session
f5f5f669) silently froze mid-turn withprocessing, q=1, age=20s. No error surfaced to the user. The configured fallback model was never invoked despite being present in config. The freeze was caused by the LLM idle watchdog firing after a 16-second event-loop stall (Telegram channel startup blocked the Node event loop), leaving the reply runner waiting indefinitely with no response and no escalation.Real environment tested: Desktop Linux (kernel 7.0.3-zen1-2-zen), OpenClaw v2026.5.2 and v2026.5.10-beta.2. Gateway running via npm install on localhost. Telegram channel active. Model config:
primary: minimax/MiniMax-M2.7,fallbacks: ["minimax/MiniMax-M2.5"]. Tailscale mesh active (gateway/ws connections confirmed in log).Exact steps or command run after this patch:
openclaw gateway startopenclaw.jsonwrite (sha256 change logged)channels.telegram.start-accountstarves the reply runneropenclaw statusand gateway diagnostic log forprompt-errorevent and agent stuck inprocessingstateresolveRunFailoverDecisionreceivesidleTimedOut: true, returnsfallback_model, M2.5 fallback engagesEvidence after fix: Gateway runtime log captured 2026-05-10 14:01:54 (pre-patch, confirming freeze pattern). Two independent reproductions:
Reproduction 1 — gateway diagnostic log (redacted session IDs):
Reproduction 2 — Najef's session JSONL (comment #4416225566), line 86, showing
stopReason: toolUsewith notoolCallobject in the content array, followed by cascading malformed tool results at L87–L91, ending in a 120s idle timeout. Fallback to MiniMax-M2.5 (configured infallbacks) was never triggered.Both cases traced through
handleAssistantFailover→resolveRunFailoverDecisionwithidleTimedOut: true, timedOut: false, failoverReason: null→shouldRotateAssistantreturnsfalse→continue_normal→ agent freeze.Observed result after fix: After the patch, the
idleTimedOutflag reachesresolveRunFailoverDecision.shouldRotateAssistantnow returnstrueforidleTimedOut. With profiles exhausted and fallback configured, the decision isfallback_modelwithreason: "timeout"— the fallback chain advances instead of freezing. Validated via tsx execution of 11 decision-tree assertions (4 new idle-timeout paths + 7 regression cases, all passing).Live patched-build proof (added 2026-05-10): Built the PR branch locally (
pnpm buildon commit928a570ain a clean sandbox) and ran a realopenclaw agent --localturn against a hanging HTTPS endpoint (Cloudflare Quick Tunnel → local Node server that accepts the OpenAI-compatible POST and never streams a chunk). Configured the moonshot provider withtimeoutSeconds: 5and two auth profiles to exercise rotation. Gateway diagnostic log from the run (timestamps, runId redacted to last 6 chars):Reading the trace: the 5-second idle watchdog fired (
durationMs=5028),handleAssistantFailoverentered therotate_profilebranch (newidle timeout (model silent)warn line confirms the patched code path executed), rotation exhausted, the secondresolveRunFailoverDecisioncall returnedfallback_modelwithreason: "timeout", and aFailoverErrorsurfaced to the caller — no silent freeze.Note on differential testing: re-running the same scenario against the unpatched installed build (
2026.5.10-beta.2, sha8287402da7) produced an identical surface outcome — aFailoverErrorwithreason=timeout. That is expected and is evidence the patch is safe: when an idle timeout fires without an active tool execution,attempt.tssets bothidleTimedOut=trueandtimedOut=true, and the pre-existingtimedOutclause inshouldRotateAssistantalready covered this path. The patch's behavioral effect is in the harder-to-fabricate case where the watchdog fires while a tool is mid-execution (timedOutDuringToolExecution=truegates thetimedOutclause off); the new|| params.idleTimedOutclause is what unfreezes that path. The decision-tree tests atfailover-policy.test.ts:303-388cover those branches exhaustively, and the live run above confirms the new log + decision wiring is exercised end-to-end in a real openclaw process.What was not tested: A live post-patch reproduction with
timedOutDuringToolExecution=truesimultaneous withidleTimedOut=true(requires the LLM stream to stall while a tool is actively running, which cannot be reliably forced from outside the runtime). That branch is covered by the decision-tree tests but not exercised in the live run above. Compaction-timeout end-to-end flows were not retested; that code path uses thetimedOutflag which is unchanged.AI-assisted disclosure
Built with Claude Code (claude-sonnet-4-6) with human review and sign-off from Jim Dawdy. Logic traced through
failover-policy.ts,assistant-failover.ts,llm-idle-timeout.ts,run.ts, andfailover-matches.ts. A code review pass caught two additional issues (orphaned tests outsidedescribe, missingrun.tscall site) that were fixed before pushing. I reviewed and understand every changed line.