Skip to content

fix(agents): escalate LLM idle timeout to model fallback after profile rotation#80449

Merged
obviyus merged 5 commits into
openclaw:mainfrom
jimdawdy-hub:fix/idle-timeout-fallback-model
May 13, 2026
Merged

fix(agents): escalate LLM idle timeout to model fallback after profile rotation#80449
obviyus merged 5 commits into
openclaw:mainfrom
jimdawdy-hub:fix/idle-timeout-fallback-model

Conversation

@jimdawdy-hub

@jimdawdy-hub jimdawdy-hub commented May 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes #76877 — regression since 2026.4.24.

When the LLM idle watchdog fires (idleTimedOut = true), the agent silently freezes instead of rotating auth profiles or advancing the model fallback chain.

Root cause: idleTimedOut is tracked in handleAssistantFailover but was never passed into resolveRunFailoverDecision. As a result, shouldRotateAssistant saw neither failoverReason nor timedOut (the run-budget timeout) set, returned false, and the decision fell through to continue_normal — the agent produced no response and no error.

This is distinct from the two existing open PRs:

What changed

failover-policy.ts

  • Added idleTimedOut: boolean to AssistantDecisionParams
  • shouldRotateAssistant now returns true for idleTimedOut
  • resolveRunFailoverDecision emits reason: "timeout" when either timedOut or idleTimedOut is true

assistant-failover.ts

  • Passes idleTimedOut to resolveRunFailoverDecision in the profile-rotation branch
  • markFailedProfile now treats idleTimedOut same as timedOut (suppresses incorrect profile failure recording for silence events)
  • resolveAssistantFailoverErrorMessage now accepts idleTimedOut — surfaces "LLM request timed out." instead of "LLM request failed." when idle timeout exhausts all escalation
  • Adds operator-visible warn log when idle timeout causes a profile rotation

run.ts

  • Adds idleTimedOut to the assistantFailoverDecision call site (missing required field was a TypeScript compile error and reproduced the freeze for the initial-decision path)

failover-policy.test.ts

  • 4 new test cases inside describe("resolveRunFailoverDecision", ...) covering the idle timeout path
  • Existing assistant-stage tests updated with idleTimedOut: false

Behaviour after the fix

Idle timeout state Before After
Profiles remain continue_normal (freeze) rotate_profile
All profiles exhausted, fallback configured continue_normal (freeze) fallback_model (reason: "timeout")
All profiles exhausted, no fallback continue_normal (freeze) surface_error + "LLM request timed out."
External abort surface_error (unchanged) surface_error (unchanged)
Run-budget timedOut unchanged unchanged

Real behavior proof

Behavior or issue addressed: Agent (dashboard session f5f5f669) silently froze mid-turn with processing, q=1, age=20s. No error surfaced to the user. The configured fallback model was never invoked despite being present in config. The freeze was caused by the LLM idle watchdog firing after a 16-second event-loop stall (Telegram channel startup blocked the Node event loop), leaving the reply runner waiting indefinitely with no response and no escalation.

Real environment tested: Desktop Linux (kernel 7.0.3-zen1-2-zen), OpenClaw v2026.5.2 and v2026.5.10-beta.2. Gateway running via npm install on localhost. Telegram channel active. Model config: primary: minimax/MiniMax-M2.7, fallbacks: ["minimax/MiniMax-M2.5"]. Tailscale mesh active (gateway/ws connections confirmed in log).

Exact steps or command run after this patch:

  1. Start gateway: openclaw gateway start
  2. Config reload triggered by openclaw.json write (sha256 change logged)
  3. Gateway evaluates reload — Telegram channel start-account phase executes
  4. 16-second event-loop stall during channels.telegram.start-account starves the reply runner
  5. Idle watchdog fires — check openclaw status and gateway diagnostic log for prompt-error event and agent stuck in processing state
  6. Pre-patch: agent remains frozen, no fallback triggered
  7. Post-patch: resolveRunFailoverDecision receives idleTimedOut: true, returns fallback_model, M2.5 fallback engages

Evidence after fix: Gateway runtime log captured 2026-05-10 14:01:54 (pre-patch, confirming freeze pattern). Two independent reproductions:

Reproduction 1 — gateway diagnostic log (redacted session IDs):

warn  diagnostic  liveness warning: reasons=event_loop_delay
  interval=31s
  eventLoopDelayP99Ms=21.2
  eventLoopDelayMaxMs=15896.4
  eventLoopUtilization=0.614
  cpuCoreRatio=0.67
  active=1 waiting=0 queued=1
  phase=channels.telegram.start-account
  recentPhases=post-attach.update-sentinel:0ms,sidecars.session-locks:47ms,
    sidecars.model-prewarm:317ms,post-attach.tailscale:494ms,
    gateway.ready:3150ms,post-ready.maintenance:48ms
  work=[active=agent:main:dashboard:f5f5f669-...(processing,q=1,age=20s)
        queued=agent:main:dashboard:f5f5f669-...(processing,q=1,age=20s)]

Reproduction 2 — Najef's session JSONL (comment #4416225566), line 86, showing stopReason: toolUse with no toolCall object in the content array, followed by cascading malformed tool results at L87–L91, ending in a 120s idle timeout. Fallback to MiniMax-M2.5 (configured in fallbacks) was never triggered.

Both cases traced through handleAssistantFailoverresolveRunFailoverDecision with idleTimedOut: true, timedOut: false, failoverReason: nullshouldRotateAssistant returns falsecontinue_normal → agent freeze.

Observed result after fix: After the patch, the idleTimedOut flag reaches resolveRunFailoverDecision. shouldRotateAssistant now returns true for idleTimedOut. With profiles exhausted and fallback configured, the decision is fallback_model with reason: "timeout" — the fallback chain advances instead of freezing. Validated via tsx execution of 11 decision-tree assertions (4 new idle-timeout paths + 7 regression cases, all passing).

Live patched-build proof (added 2026-05-10): Built the PR branch locally (pnpm build on commit 928a570a in a clean sandbox) and ran a real openclaw agent --local turn against a hanging HTTPS endpoint (Cloudflare Quick Tunnel → local Node server that accepts the OpenAI-compatible POST and never streams a chunk). Configured the moonshot provider with timeoutSeconds: 5 and two auth profiles to exercise rotation. Gateway diagnostic log from the run (timestamps, runId redacted to last 6 chars):

[agent/embedded] embedded run prompt end: runId=...249f9 durationMs=5028
[agent/embedded] Profile moonshot:default timed out. Trying next account...
[agent/embedded] Profile moonshot:default idle timeout (model silent). Trying next account...   ← new log added by this patch
[agent/embedded] embedded run failover decision: stage=assistant decision=fallback_model reason=timeout from=moonshot/kimi-k2.5
[diagnostic]     lane task error: lane=main durationMs=7977 error="FailoverError: LLM request timed out."
[model-fallback/decision] decision=candidate_failed requested=moonshot/kimi-k2.5 candidate=moonshot/kimi-k2.5 reason=timeout next=none detail=LLM request timed out.
FailoverError: LLM request timed out.

Reading the trace: the 5-second idle watchdog fired (durationMs=5028), handleAssistantFailover entered the rotate_profile branch (new idle timeout (model silent) warn line confirms the patched code path executed), rotation exhausted, the second resolveRunFailoverDecision call returned fallback_model with reason: "timeout", and a FailoverError surfaced to the caller — no silent freeze.

Note on differential testing: re-running the same scenario against the unpatched installed build (2026.5.10-beta.2, sha 8287402da7) produced an identical surface outcome — a FailoverError with reason=timeout. That is expected and is evidence the patch is safe: when an idle timeout fires without an active tool execution, attempt.ts sets both idleTimedOut=true and timedOut=true, and the pre-existing timedOut clause in shouldRotateAssistant already covered this path. The patch's behavioral effect is in the harder-to-fabricate case where the watchdog fires while a tool is mid-execution (timedOutDuringToolExecution=true gates the timedOut clause off); the new || params.idleTimedOut clause is what unfreezes that path. The decision-tree tests at failover-policy.test.ts:303-388 cover those branches exhaustively, and the live run above confirms the new log + decision wiring is exercised end-to-end in a real openclaw process.

What was not tested: A live post-patch reproduction with timedOutDuringToolExecution=true simultaneous with idleTimedOut=true (requires the LLM stream to stall while a tool is actively running, which cannot be reliably forced from outside the runtime). That branch is covered by the decision-tree tests but not exercised in the live run above. Compaction-timeout end-to-end flows were not retested; that code path uses the timedOut flag which is unchanged.

AI-assisted disclosure

Built with Claude Code (claude-sonnet-4-6) with human review and sign-off from Jim Dawdy. Logic traced through failover-policy.ts, assistant-failover.ts, llm-idle-timeout.ts, run.ts, and failover-matches.ts. A code review pass caught two additional issues (orphaned tests outside describe, missing run.ts call site) that were fixed before pushing. I reviewed and understand every changed line.

@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label May 10, 2026
@openclaw-barnacle openclaw-barnacle Bot added size: S triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 10, 2026
@clawsweeper

clawsweeper Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge.

Summary
The branch threads idleTimedOut through assistant failover decisions, profile-rotation bookkeeping, timeout messaging/logging, and failover-policy tests.

Reproducibility: yes. for a source-level reproduction: current main sets idleTimedOut when the LLM idle watchdog fires, but both assistant failover decision calls omit that flag, so the policy can return continue_normal. I did not live-reproduce the MiniMax/Telegram freeze in this read-only review.

Real behavior proof
Sufficient (logs): Redacted patched-build gateway logs in the PR body show after-fix runtime behavior: idle-timeout rotation logging, fallback-model decision with reason=timeout, and a surfaced timeout error.

Next step before merge
The open PR already contains the narrow fix and has no blocking review finding, so the remaining action is normal maintainer review and landing rather than a repair job.

Security
Cleared: The diff is limited to agent failover TypeScript and tests, with no dependency, workflow, secret, packaging, or downloaded-code surface added.

Review details

Best possible solution:

Land the narrow failover-policy plumbing so LLM idle watchdog timeouts are treated as timeout-shaped assistant failures after profile rotation, while leaving broader timeout phase splitting and terminal abort policy to their separate PRs.

Do we have a high-confidence way to reproduce the issue?

Yes for a source-level reproduction: current main sets idleTimedOut when the LLM idle watchdog fires, but both assistant failover decision calls omit that flag, so the policy can return continue_normal. I did not live-reproduce the MiniMax/Telegram freeze in this read-only review.

Is this the best way to solve the issue?

Yes. Threading idleTimedOut through the existing assistant failover policy is the narrowest maintainable fix for the documented timeout fallback contract, and the PR keeps the existing external-abort and plain tool-execution timeout protections intact.

What I checked:

Likely related people:

  • simonusa: Authored the merged tool-execution timeout exemption that this PR must preserve while changing idle-timeout fallback behavior. (role: introduced adjacent behavior; confidence: high; commits: 2605490dbd30; files: src/agents/pi-embedded-runner/run/failover-policy.ts, src/agents/pi-embedded-runner/run/attempt.ts)
  • steipete: Recent commits and API history show adjacent work in assistant failover, embedded runner maintenance, and release validation touching these files. (role: recent area contributor; confidence: medium; commits: 15cf49222f92, 195e72121182, 42d73fd955af; files: src/agents/pi-embedded-runner/run/assistant-failover.ts, src/agents/pi-embedded-runner/run.ts, src/agents/pi-embedded-runner/run/failover-policy.ts)
  • hclsys: Recently changed the same failover-policy and assistant-failover surfaces for assistant-prefill format failure handling. (role: recent failover-policy contributor; confidence: medium; commits: 398dd6e0b091; files: src/agents/pi-embedded-runner/run/failover-policy.ts, src/agents/pi-embedded-runner/run/assistant-failover.ts)
  • shakkernerd: Recent history on assistant failover includes nonblocking auth-profile rotation behavior adjacent to this PR's profile-rotation branch. (role: adjacent auth-profile contributor; confidence: medium; commits: efa8c83200ab; files: src/agents/pi-embedded-runner/run/assistant-failover.ts)

Remaining risk / open question:

  • The live proof does not force the hardest simultaneous idleTimedOut plus active tool-execution branch; that branch is covered by the PR's decision-tree tests and source-level signal path.

Codex review notes: model gpt-5.5, reasoning high; reviewed against b0c817ee9d50.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 10, 2026
@jimdawdy-hub jimdawdy-hub force-pushed the fix/idle-timeout-fallback-model branch from f2b8529 to a715c1f Compare May 10, 2026 23:41
@jimdawdy-hub

Copy link
Copy Markdown
Contributor Author

Addressed both review items.

P3 (preserve timeout reason after idle rotations) — fixed in 928a570. mergeRetryFailoverReason now receives timedOut: params.timedOut || params.idleTimedOut, so the accumulated retry reason records "timeout" even when shouldRotateAssistant fired on idleTimedOut alone. Downstream fallback_model receives the correct reason.

Real behavior proof — added live patched-build evidence to the PR body. Built the branch in a clean sandbox, ran a real openclaw agent --local turn against a hanging HTTPS endpoint (Cloudflare tunnel → Node server that accepts the OpenAI-compatible POST and never streams a chunk), with moonshot.timeoutSeconds: 5 and two auth profiles. Captured gateway diagnostic log showing:

  • 5-second idle watchdog fired (durationMs=5028)
  • New warn line from this patch fired: Profile moonshot:default idle timeout (model silent). Trying next account...
  • embedded run failover decision: stage=assistant decision=fallback_model reason=timeout
  • FailoverError: LLM request timed out. surfaced to the caller (no silent freeze)

Full log excerpt and a note on differential testing against the unpatched installed build (2026.5.10-beta.2) are in the "Live patched-build proof" subsection of the PR body.

@clawsweeper re-review

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 11, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026
@clawsweeper

clawsweeper Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

jimdawdy-hub and others added 5 commits May 13, 2026 13:33
…e rotation

When the LLM idle watchdog fires (model produced no tokens for N seconds),
idleTimedOut is set in handleAssistantFailover but was never passed into
resolveRunFailoverDecision. As a result, shouldRotateAssistant saw neither
failoverReason nor timedOut (the run-budget timeout) set, returned false,
and the decision fell through to continue_normal -- the agent silently froze
without surfacing an error or advancing the fallback chain.

Fixes openclaw#76877 (regression since 2026.4.24).

Changes:
- failover-policy.ts: add idleTimedOut to AssistantDecisionParams; include it
  in shouldRotateAssistant and reason selection in resolveRunFailoverDecision
- assistant-failover.ts: pass idleTimedOut into resolveRunFailoverDecision
- failover-policy.test.ts: 4 new cases for idle timeout path; update existing
  assistant stage cases with the new required field (idleTimedOut: false)
- failover-policy.test.ts: move 4 new it() blocks inside describe()
  (they were orphaned outside the block and would not execute)
- run.ts: add idleTimedOut to the assistantFailoverDecision call site
  (missing required field caused TypeScript error and reproduced the freeze
  for the initial-decision code path in the outer loop)
- assistant-failover.ts: treat idleTimedOut same as timedOut in
  markFailedProfile to avoid incorrect profile failure recording
- assistant-failover.ts: add warn log when idle timeout rotates a profile
- assistant-failover.ts: extend resolveAssistantFailoverErrorMessage to
  accept idleTimedOut so surface_error emits "LLM request timed out."
  instead of the generic "LLM request failed."
Addresses Codex P3 review finding: when shouldRotateAssistant fires on
idleTimedOut alone (timedOut=false), mergeRetryFailoverReason was passed
timedOut: params.timedOut (false), so the accumulated retry reason did
not record 'timeout'. Pass timedOut || idleTimedOut so the timeout reason
survives idle-only rotations and downstream fallback_model receives the
correct reason.
@obviyus obviyus force-pushed the fix/idle-timeout-fallback-model branch from 7ffba2f to 1ed8e80 Compare May 13, 2026 08:13
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 13, 2026
@obviyus obviyus merged commit f7e9d80 into openclaw:main May 13, 2026
114 checks passed
@obviyus

obviyus commented May 13, 2026

Copy link
Copy Markdown
Contributor

Landed via rebase onto main.

  • Exact PR head before merge: 1ed8e8022b6db6d1c80b9193af92206ad25fa0ab
  • GitHub checks: 114 total, 0 pending, 0 failing, merge state CLEAN
  • Local validation: git diff --check
  • Local validation: OPENCLAW_LOCAL_CHECK_MODE=throttled pnpm test src/agents/pi-embedded-runner/run/failover-policy.test.ts src/agents/pi-embedded-runner/run/assistant-failover.test.ts
  • Changelog: CHANGELOG.md updated
  • Main commit: f7e9d80536adc618b40e6e9fde9414df2835c07e

Thanks @jimdawdy-hub!

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling proof: supplied External PR includes structured after-fix real behavior proof. size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: 2026.5.2 Agents stop responding mid work, they use tooling and then suddenly nothing..

2 participants