fix(agents): escalate LLM idle timeout to model fallback after profile rotation by jimdawdy-hub · Pull Request #80449 · openclaw/openclaw

jimdawdy-hub · 2026-05-10T23:21:21Z

Summary

Fixes #76877 — regression since 2026.4.24.

When the LLM idle watchdog fires (idleTimedOut = true), the agent silently freezes instead of rotating auth profiles or advancing the model fallback chain.

Root cause: idleTimedOut is tracked in handleAssistantFailover but was never passed into resolveRunFailoverDecision. As a result, shouldRotateAssistant saw neither failoverReason nor timedOut (the run-budget timeout) set, returned false, and the decision fell through to continue_normal — the agent produced no response and no error.

This is distinct from the two existing open PRs:

feat(agents): split LLM timeout into first-token and idle phases #65898 splits LLM timeout into first-token / idle phases — improves detection, doesn't fix fallback
fix(agents): distinguish terminal aborts from retryable failures (#60388) #62682 prevents fallback on terminal run-budget aborts — orthogonal issue

What changed

failover-policy.ts

Added idleTimedOut: boolean to AssistantDecisionParams
shouldRotateAssistant now returns true for idleTimedOut
resolveRunFailoverDecision emits reason: "timeout" when either timedOut or idleTimedOut is true

assistant-failover.ts

Passes idleTimedOut to resolveRunFailoverDecision in the profile-rotation branch
markFailedProfile now treats idleTimedOut same as timedOut (suppresses incorrect profile failure recording for silence events)
resolveAssistantFailoverErrorMessage now accepts idleTimedOut — surfaces "LLM request timed out." instead of "LLM request failed." when idle timeout exhausts all escalation
Adds operator-visible warn log when idle timeout causes a profile rotation

run.ts

Adds idleTimedOut to the assistantFailoverDecision call site (missing required field was a TypeScript compile error and reproduced the freeze for the initial-decision path)

failover-policy.test.ts

4 new test cases inside describe("resolveRunFailoverDecision", ...) covering the idle timeout path
Existing assistant-stage tests updated with idleTimedOut: false

Behaviour after the fix

Idle timeout state	Before	After
Profiles remain	`continue_normal` (freeze)	`rotate_profile`
All profiles exhausted, fallback configured	`continue_normal` (freeze)	`fallback_model` (reason: `"timeout"`)
All profiles exhausted, no fallback	`continue_normal` (freeze)	`surface_error` + `"LLM request timed out."`
External abort	`surface_error` (unchanged)	`surface_error` (unchanged)
Run-budget `timedOut`	unchanged	unchanged

Real behavior proof

Behavior or issue addressed: Agent (dashboard session f5f5f669) silently froze mid-turn with processing, q=1, age=20s. No error surfaced to the user. The configured fallback model was never invoked despite being present in config. The freeze was caused by the LLM idle watchdog firing after a 16-second event-loop stall (Telegram channel startup blocked the Node event loop), leaving the reply runner waiting indefinitely with no response and no escalation.

Real environment tested: Desktop Linux (kernel 7.0.3-zen1-2-zen), OpenClaw v2026.5.2 and v2026.5.10-beta.2. Gateway running via npm install on localhost. Telegram channel active. Model config: primary: minimax/MiniMax-M2.7, fallbacks: ["minimax/MiniMax-M2.5"]. Tailscale mesh active (gateway/ws connections confirmed in log).

Exact steps or command run after this patch:

Start gateway: openclaw gateway start
Config reload triggered by openclaw.json write (sha256 change logged)
Gateway evaluates reload — Telegram channel start-account phase executes
16-second event-loop stall during channels.telegram.start-account starves the reply runner
Idle watchdog fires — check openclaw status and gateway diagnostic log for prompt-error event and agent stuck in processing state
Pre-patch: agent remains frozen, no fallback triggered
Post-patch: resolveRunFailoverDecision receives idleTimedOut: true, returns fallback_model, M2.5 fallback engages

Evidence after fix: Gateway runtime log captured 2026-05-10 14:01:54 (pre-patch, confirming freeze pattern). Two independent reproductions:

Reproduction 1 — gateway diagnostic log (redacted session IDs):

warn  diagnostic  liveness warning: reasons=event_loop_delay
  interval=31s
  eventLoopDelayP99Ms=21.2
  eventLoopDelayMaxMs=15896.4
  eventLoopUtilization=0.614
  cpuCoreRatio=0.67
  active=1 waiting=0 queued=1
  phase=channels.telegram.start-account
  recentPhases=post-attach.update-sentinel:0ms,sidecars.session-locks:47ms,
    sidecars.model-prewarm:317ms,post-attach.tailscale:494ms,
    gateway.ready:3150ms,post-ready.maintenance:48ms
  work=[active=agent:main:dashboard:f5f5f669-...(processing,q=1,age=20s)
        queued=agent:main:dashboard:f5f5f669-...(processing,q=1,age=20s)]

Reproduction 2 — Najef's session JSONL (comment #4416225566), line 86, showing stopReason: toolUse with no toolCall object in the content array, followed by cascading malformed tool results at L87–L91, ending in a 120s idle timeout. Fallback to MiniMax-M2.5 (configured in fallbacks) was never triggered.

Both cases traced through handleAssistantFailover → resolveRunFailoverDecision with idleTimedOut: true, timedOut: false, failoverReason: null → shouldRotateAssistant returns false → continue_normal → agent freeze.

Observed result after fix: After the patch, the idleTimedOut flag reaches resolveRunFailoverDecision. shouldRotateAssistant now returns true for idleTimedOut. With profiles exhausted and fallback configured, the decision is fallback_model with reason: "timeout" — the fallback chain advances instead of freezing. Validated via tsx execution of 11 decision-tree assertions (4 new idle-timeout paths + 7 regression cases, all passing).

Live patched-build proof (added 2026-05-10): Built the PR branch locally (pnpm build on commit 928a570a in a clean sandbox) and ran a real openclaw agent --local turn against a hanging HTTPS endpoint (Cloudflare Quick Tunnel → local Node server that accepts the OpenAI-compatible POST and never streams a chunk). Configured the moonshot provider with timeoutSeconds: 5 and two auth profiles to exercise rotation. Gateway diagnostic log from the run (timestamps, runId redacted to last 6 chars):

[agent/embedded] embedded run prompt end: runId=...249f9 durationMs=5028
[agent/embedded] Profile moonshot:default timed out. Trying next account...
[agent/embedded] Profile moonshot:default idle timeout (model silent). Trying next account...   ← new log added by this patch
[agent/embedded] embedded run failover decision: stage=assistant decision=fallback_model reason=timeout from=moonshot/kimi-k2.5
[diagnostic]     lane task error: lane=main durationMs=7977 error="FailoverError: LLM request timed out."
[model-fallback/decision] decision=candidate_failed requested=moonshot/kimi-k2.5 candidate=moonshot/kimi-k2.5 reason=timeout next=none detail=LLM request timed out.
FailoverError: LLM request timed out.

Reading the trace: the 5-second idle watchdog fired (durationMs=5028), handleAssistantFailover entered the rotate_profile branch (new idle timeout (model silent) warn line confirms the patched code path executed), rotation exhausted, the second resolveRunFailoverDecision call returned fallback_model with reason: "timeout", and a FailoverError surfaced to the caller — no silent freeze.

Note on differential testing: re-running the same scenario against the unpatched installed build (2026.5.10-beta.2, sha 8287402da7) produced an identical surface outcome — a FailoverError with reason=timeout. That is expected and is evidence the patch is safe: when an idle timeout fires without an active tool execution, attempt.ts sets both idleTimedOut=true and timedOut=true, and the pre-existing timedOut clause in shouldRotateAssistant already covered this path. The patch's behavioral effect is in the harder-to-fabricate case where the watchdog fires while a tool is mid-execution (timedOutDuringToolExecution=true gates the timedOut clause off); the new || params.idleTimedOut clause is what unfreezes that path. The decision-tree tests at failover-policy.test.ts:303-388 cover those branches exhaustively, and the live run above confirms the new log + decision wiring is exercised end-to-end in a real openclaw process.

What was not tested: A live post-patch reproduction with timedOutDuringToolExecution=true simultaneous with idleTimedOut=true (requires the LLM stream to stall while a tool is actively running, which cannot be reliably forced from outside the runtime). That branch is covered by the decision-tree tests but not exercised in the live run above. Compaction-timeout end-to-end flows were not retested; that code path uses the timedOut flag which is unchanged.

AI-assisted disclosure

Built with Claude Code (claude-sonnet-4-6) with human review and sign-off from Jim Dawdy. Logic traced through failover-policy.ts, assistant-failover.ts, llm-idle-timeout.ts, run.ts, and failover-matches.ts. A code review pass caught two additional issues (orphaned tests outside describe, missing run.ts call site) that were fixed before pushing. I reviewed and understand every changed line.

clawsweeper · 2026-05-10T23:24:28Z

Codex review: needs maintainer review before merge.

Summary
The branch threads idleTimedOut through assistant failover decisions, profile-rotation bookkeeping, timeout messaging/logging, and failover-policy tests.

Reproducibility: yes. for a source-level reproduction: current main sets idleTimedOut when the LLM idle watchdog fires, but both assistant failover decision calls omit that flag, so the policy can return continue_normal. I did not live-reproduce the MiniMax/Telegram freeze in this read-only review.

Real behavior proof
Sufficient (logs): Redacted patched-build gateway logs in the PR body show after-fix runtime behavior: idle-timeout rotation logging, fallback-model decision with reason=timeout, and a surfaced timeout error.

Next step before merge
The open PR already contains the narrow fix and has no blocking review finding, so the remaining action is normal maintainer review and landing rather than a repair job.

Security
Cleared: The diff is limited to agent failover TypeScript and tests, with no dependency, workflow, secret, packaging, or downloaded-code surface added.

Review details

Best possible solution:

Land the narrow failover-policy plumbing so LLM idle watchdog timeouts are treated as timeout-shaped assistant failures after profile rotation, while leaving broader timeout phase splitting and terminal abort policy to their separate PRs.

Do we have a high-confidence way to reproduce the issue?

Yes for a source-level reproduction: current main sets idleTimedOut when the LLM idle watchdog fires, but both assistant failover decision calls omit that flag, so the policy can return continue_normal. I did not live-reproduce the MiniMax/Telegram freeze in this read-only review.

Is this the best way to solve the issue?

Yes. Threading idleTimedOut through the existing assistant failover policy is the narrowest maintainable fix for the documented timeout fallback contract, and the PR keeps the existing external-abort and plain tool-execution timeout protections intact.

What I checked:

Current main policy gap: At current main, assistant decision params include timedOut but not idleTimedOut, and shouldRotateAssistant only treats generic timeouts as retryable when they did not occur during compaction or tool execution; an idle-only signal can therefore fall through as continue_normal. (src/agents/pi-embedded-runner/run/failover-policy.ts:51, b0c817ee9d50)
Idle timeout signal exists upstream: The attempt layer already sets idleTimedOut = true and aborts the run through the timeout path when the LLM idle watchdog fires, so the missing behavior is signal plumbing into failover policy rather than timeout detection. (src/agents/pi-embedded-runner/run/attempt.ts:2637, b0c817ee9d50)
Current main call sites omit idle timeout: The initial assistant failover decision and the post-rotation re-decision pass timedOut and the tool/compaction timeout flags but not idleTimedOut, while handleAssistantFailover already receives idleTimedOut separately. (src/agents/pi-embedded-runner/run.ts:2349, b0c817ee9d50)
Documented fallback contract: The model-failover docs state that timeouts which exhaust profile rotation advance to configured model fallbacks, supporting treatment of LLM silence as a timeout-shaped failover reason. Public docs: docs/concepts/model-failover.md. (docs/concepts/model-failover.md:240, 27af4f618eb2)
PR head implements the narrow plumbing: The PR head adds idleTimedOut to assistant decision params, classifies it via isAssistantTimeoutFailure, maps fallback reason to timeout, and passes the flag from run.ts and assistant-failover.ts. (src/agents/pi-embedded-runner/run/failover-policy.ts:60, 7ffba2f3928a)
Focused regression tests: The PR adds policy tests for idle watchdog timeouts during active tool execution, post-rotation fallback, idle-only rotation, fallback, surface-error, and external-abort paths. (src/agents/pi-embedded-runner/run/failover-policy.test.ts:282, 7ffba2f3928a)

Likely related people:

simonusa: Authored the merged tool-execution timeout exemption that this PR must preserve while changing idle-timeout fallback behavior. (role: introduced adjacent behavior; confidence: high; commits: 2605490dbd30; files: src/agents/pi-embedded-runner/run/failover-policy.ts, src/agents/pi-embedded-runner/run/attempt.ts)
steipete: Recent commits and API history show adjacent work in assistant failover, embedded runner maintenance, and release validation touching these files. (role: recent area contributor; confidence: medium; commits: 15cf49222f92, 195e72121182, 42d73fd955af; files: src/agents/pi-embedded-runner/run/assistant-failover.ts, src/agents/pi-embedded-runner/run.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
hclsys: Recently changed the same failover-policy and assistant-failover surfaces for assistant-prefill format failure handling. (role: recent failover-policy contributor; confidence: medium; commits: 398dd6e0b091; files: src/agents/pi-embedded-runner/run/failover-policy.ts, src/agents/pi-embedded-runner/run/assistant-failover.ts)
shakkernerd: Recent history on assistant failover includes nonblocking auth-profile rotation behavior adjacent to this PR's profile-rotation branch. (role: adjacent auth-profile contributor; confidence: medium; commits: efa8c83200ab; files: src/agents/pi-embedded-runner/run/assistant-failover.ts)

Remaining risk / open question:

The live proof does not force the hardest simultaneous idleTimedOut plus active tool-execution branch; that branch is covered by the PR's decision-tree tests and source-level signal path.

Codex review notes: model gpt-5.5, reasoning high; reviewed against b0c817ee9d50.

Re-review progress:

State: Failed
Detail: The targeted re-review did not finish cleanly. Check the workflow run for details.
Run: https://github.com/openclaw/clawsweeper/actions/runs/25786925128
Updated: 2026-05-13T08:15:41.661Z

jimdawdy-hub · 2026-05-11T00:12:59Z

Addressed both review items.

P3 (preserve timeout reason after idle rotations) — fixed in 928a570. mergeRetryFailoverReason now receives timedOut: params.timedOut || params.idleTimedOut, so the accumulated retry reason records "timeout" even when shouldRotateAssistant fired on idleTimedOut alone. Downstream fallback_model receives the correct reason.

Real behavior proof — added live patched-build evidence to the PR body. Built the branch in a clean sandbox, ran a real openclaw agent --local turn against a hanging HTTPS endpoint (Cloudflare tunnel → Node server that accepts the OpenAI-compatible POST and never streams a chunk), with moonshot.timeoutSeconds: 5 and two auth profiles. Captured gateway diagnostic log showing:

5-second idle watchdog fired (durationMs=5028)
New warn line from this patch fired: Profile moonshot:default idle timeout (model silent). Trying next account...
embedded run failover decision: stage=assistant decision=fallback_model reason=timeout
FailoverError: LLM request timed out. surfaced to the caller (no silent freeze)

Full log excerpt and a note on differential testing against the unpatched installed build (2026.5.10-beta.2) are in the "Live patched-build proof" subsection of the PR body.

@clawsweeper re-review

clawsweeper · 2026-05-13T07:39:27Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Failed
Detail: The targeted re-review did not finish cleanly. Check the workflow run for details.
Run: https://github.com/openclaw/clawsweeper/actions/runs/25785381645
Updated: 2026-05-13T07:47:15.459Z

…e rotation When the LLM idle watchdog fires (model produced no tokens for N seconds), idleTimedOut is set in handleAssistantFailover but was never passed into resolveRunFailoverDecision. As a result, shouldRotateAssistant saw neither failoverReason nor timedOut (the run-budget timeout) set, returned false, and the decision fell through to continue_normal -- the agent silently froze without surfacing an error or advancing the fallback chain. Fixes openclaw#76877 (regression since 2026.4.24). Changes: - failover-policy.ts: add idleTimedOut to AssistantDecisionParams; include it in shouldRotateAssistant and reason selection in resolveRunFailoverDecision - assistant-failover.ts: pass idleTimedOut into resolveRunFailoverDecision - failover-policy.test.ts: 4 new cases for idle timeout path; update existing assistant stage cases with the new required field (idleTimedOut: false)

- failover-policy.test.ts: move 4 new it() blocks inside describe() (they were orphaned outside the block and would not execute) - run.ts: add idleTimedOut to the assistantFailoverDecision call site (missing required field caused TypeScript error and reproduced the freeze for the initial-decision code path in the outer loop) - assistant-failover.ts: treat idleTimedOut same as timedOut in markFailedProfile to avoid incorrect profile failure recording - assistant-failover.ts: add warn log when idle timeout rotates a profile - assistant-failover.ts: extend resolveAssistantFailoverErrorMessage to accept idleTimedOut so surface_error emits "LLM request timed out." instead of the generic "LLM request failed."

Addresses Codex P3 review finding: when shouldRotateAssistant fires on idleTimedOut alone (timedOut=false), mergeRetryFailoverReason was passed timedOut: params.timedOut (false), so the accumulated retry reason did not record 'timeout'. Pass timedOut || idleTimedOut so the timeout reason survives idle-only rotations and downstream fallback_model receives the correct reason.

@jimdawdy-hub

…hanks @jimdawdy-hub)

obviyus · 2026-05-13T08:22:48Z

Landed via rebase onto main.

Exact PR head before merge: 1ed8e8022b6db6d1c80b9193af92206ad25fa0ab
GitHub checks: 114 total, 0 pending, 0 failing, merge state CLEAN
Local validation: git diff --check
Local validation: OPENCLAW_LOCAL_CHECK_MODE=throttled pnpm test src/agents/pi-embedded-runner/run/failover-policy.test.ts src/agents/pi-embedded-runner/run/assistant-failover.test.ts
Changelog: CHANGELOG.md updated
Main commit: f7e9d80536adc618b40e6e9fde9414df2835c07e

Thanks @jimdawdy-hub!

@jimdawdy-hub

…hanks @jimdawdy-hub)

@jimdawdy-hub

…hanks @jimdawdy-hub)

@jimdawdy-hub

…hanks @jimdawdy-hub)

@jimdawdy-hub

…hanks @jimdawdy-hub)

openclaw-barnacle Bot added the agents Agent runtime and tooling label May 10, 2026

jimdawdy-hub mentioned this pull request May 10, 2026

[Bug]: 2026.5.2 Agents stop responding mid work, they use tooling and then suddenly nothing.. #76877

Closed

openclaw-barnacle Bot added size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 10, 2026

openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 10, 2026

jimdawdy-hub force-pushed the fix/idle-timeout-fallback-model branch from f2b8529 to a715c1f Compare May 10, 2026 23:41

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026

clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026

jimdawdy-hub and others added 5 commits May 13, 2026 13:33

test(agents): cover idle timeout fallback during tools

a0e792e

docs(changelog): credit idle timeout fallback fix (openclaw#80449) (t…

1ed8e80

…hanks @jimdawdy-hub)

obviyus force-pushed the fix/idle-timeout-fallback-model branch from 7ffba2f to 1ed8e80 Compare May 13, 2026 08:13

openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026

obviyus merged commit f7e9d80 into openclaw:main May 13, 2026
114 checks passed

github-actions Bot mentioned this pull request May 13, 2026

📡 Upstream Digest — 2026-05-13 10:14 UTC curtismercier/openclaw-mods#845

Open

l3ocifer pushed a commit to l3ocifer/frack-openclaw that referenced this pull request May 13, 2026

docs(changelog): credit idle timeout fallback fix (openclaw#80449) (t…

b7735e1

…hanks @jimdawdy-hub)

clawsweeper Bot mentioned this pull request May 14, 2026

[Bug]: Embedded runtime hangs after embedded run start, agent never produces reply (WhatsApp channel) #73496

Open

jimdawdy-hub deleted the fix/idle-timeout-fallback-model branch May 14, 2026 20:02

clawsweeper Bot mentioned this pull request May 15, 2026

[Bug] MiniMax API timeout causes complete request failure with no retry #62464

Closed

This was referenced May 17, 2026

Don't trigger model fallback when abort reason is the run's own timeout budget #60388

Closed

[Bug]: openai-codex SSE stream begins, but embedded run aborts locally and is surfaced as timeout (408) #66561

Closed

aksheyw mentioned this pull request May 23, 2026

[Bug]: Embedded-run bootstrap-context stage variance causes 20-37s heartbeat fires on 5.20 (regression from 5.6; not fixed by 5.12 PR #80449 or 5.20 PRs #83477/#83858) #85783

Open

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026

docs(changelog): credit idle timeout fallback fix (openclaw#80449) (t…

1050840

…hanks @jimdawdy-hub)

jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026

docs(changelog): credit idle timeout fallback fix (openclaw#80449) (t…

4a3b975

…hanks @jimdawdy-hub)

sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026

docs(changelog): credit idle timeout fallback fix (openclaw#80449) (t…

0fdd53d

…hanks @jimdawdy-hub)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(agents): escalate LLM idle timeout to model fallback after profile rotation#80449

fix(agents): escalate LLM idle timeout to model fallback after profile rotation#80449
obviyus merged 5 commits into
openclaw:mainfrom
jimdawdy-hub:fix/idle-timeout-fallback-model

jimdawdy-hub commented May 10, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented May 10, 2026 •

edited

Loading

Uh oh!

jimdawdy-hub commented May 11, 2026

Uh oh!

clawsweeper Bot commented May 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

obviyus commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jimdawdy-hub commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changed

Behaviour after the fix

Real behavior proof

AI-assisted disclosure

Uh oh!

clawsweeper Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jimdawdy-hub commented May 11, 2026

Uh oh!

clawsweeper Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

obviyus commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jimdawdy-hub commented May 10, 2026 •

edited

Loading

clawsweeper Bot commented May 10, 2026 •

edited

Loading

clawsweeper Bot commented May 13, 2026 •

edited

Loading