Bug type
Agent behavior
Summary
Agent run timeout during long tool execution (e.g. process(poll)) is misclassified as "LLM request timed out", triggering unnecessary model fallback — even though the primary model responded correctly.
Steps to reproduce
- Configure an agent with claude-opus-4-6 as primary model and a fallback model chain
- In a session, have the agent spawn a background process via exec (background: true)
- Have the agent monitor the process using process(poll) with 2-minute poll intervals
- Wait for the total run time to exceed DEFAULT_AGENT_TIMEOUT_SECONDS (600s / 10 minutes)
- Observe: the run is aborted with "FailoverError: LLM request timed out." and falls back to the next model in the chain
Alternatively: any tool call that takes a long time (browser automation, long exec, repeated process polling) can trigger the same behavior.
Expected behavior
Tool execution time should not count toward the LLM request timeout. The primary model had already responded successfully and the agent was executing tool calls — no LLM request was in flight when the timeout fired.
Precedent: PR #46889 added timedOutDuringCompaction to exempt compaction operations from triggering failover. Long-running tool execution should receive the same treatment.
Actual behavior
The run-level timer fires after 600s regardless of what the agent is doing. When it fires during tool execution:
- abortRun(true) sets timedOut = true
- The failover decision checks timedOut && !timedOutDuringCompaction → enters fallback branch
- A FailoverError: LLM request timed out. is thrown
- The system fails over to the next configured model
- The fallback model receives the full context (~61K tokens) but has no useful work to do (the original task was already being handled)
Gateway log sequence (redacted):
WARN embedded run timeout: runId=<redacted> sessionId=<redacted> timeoutMs=600000
DEBUG run cleanup: runId=<redacted> sessionId=<redacted> aborted=true timedOut=true
WARN embedded_run_failover_decision: stage=assistant decision=fallback_model failoverReason=timeout provider=anthropic model=claude-opus-4-6 timedOut=true status=408
ERROR lane task error: lane=main durationMs=630119 error="FailoverError: LLM request timed out."
The primary model (Opus) had completed its response. The agent was in a process(poll) loop monitoring a background worker:
- Poll 1: 2 min wait → got output
- Poll 2: 2 min wait → got output
- Poll 3: 2 min wait → got output
- Poll 4: 2 min wait → got output
- Poll 5: started, aborted at ~17s by run timeout
Total tool execution time: ~10 minutes. No LLM request was pending.
OpenClaw version
2026.3.13 (61d171a)
Operating system
Ubuntu 24.04.4 LTS, Linux 6.17.0-19-generic x86_64
Install method
npm (global)
Model
claude-opus-4-6 (Anthropic) — primary model that was "timed out"
Provider / routing chain
anthropic/claude-opus-4-6 → openrouter/google/gemini-3.1-pro-preview (fallback #1) → openrouter/openai/gpt-5.4 (fallback #2) → ollama/qwen3.5:9b (fallback #3)
Additional provider/model setup details
Default agent timeout is the built-in DEFAULT_AGENT_TIMEOUT_SECONDS = 600. No custom agents.defaults.timeoutSeconds override was set.
The fallback model (gemini-3.1-pro-preview via OpenRouter) consumed ~61K input tokens at $0.145 cost but produced 0 completion tokens — confirming it had no useful work to do.
Logs, screenshots, and evidence
Source code analysis (minified bundle auth-profiles-DDVivXkv.js)
Timeout declaration and abort:
// Line 109601-109602
let timedOut = false;
let timedOutDuringCompaction = false;
Failover condition (line 111112):
if (!aborted && failoverFailure || timedOut && !timedOutDuringCompaction) {
// → enters fallback branch even when timeout was caused by tool execution
}
User-facing error (line 111176):
if (timedOut && !timedOutDuringCompaction && payloads.length === 0) return {
payloads: [{ text: "Request timed out before a response was generated...", isError: true }],
// ...
};
Note: timedOutDuringCompaction is the only exemption. There is no timedOutDuringToolExecution or equivalent.
Compaction precedent (PR #46889)
The timedOutDuringCompaction mechanism proves the design intent: certain non-LLM operations should not trigger failover. Tool execution is a missing case.
Impact and severity
- Affected: Any agent using long-running tools (process polling, browser automation, multi-minute exec tasks) with a fallback model chain configured
- Severity: Blocks workflow + unnecessary cost — the fallback model is invoked with the full conversation context but produces no useful output
- Frequency: Deterministic — any tool execution exceeding DEFAULT_AGENT_TIMEOUT_SECONDS (600s) will trigger this
- Consequence: (1) Wasted API cost on the fallback model, (2) original task is interrupted mid-execution, (3) misleading "LLM request timed out" error when the LLM was not involved
Additional information
Suggested fix
Add a timedOutDuringToolExecution flag (or refactor to a general timeoutCause enum) so tool execution time is exempt from the failover path, consistent with the existing compaction exemption:
// Current: only compaction is exempt
if (timedOut && !timedOutDuringCompaction) { /* failover */ }
// Proposed: tool execution also exempt
if (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution) { /* failover */ }
// Or better: general timeout cause
if (timedOut && timeoutCause === 'llm_request') { /* failover */ }
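To make the timeoutCause variant more concrete, here is a minimal sketch of how the run could record what it was doing when the timer fired. Every name in it (RunPhase, createRunState, onRunTimeout, shouldFailover) is hypothetical and chosen for illustration; none of them are taken from the OpenClaw source, and the real run state is certainly more involved.

```javascript
// Hypothetical sketch: these names are illustrative, not OpenClaw internals.
const RunPhase = Object.freeze({
  LLM_REQUEST: "llm_request",
  TOOL_EXECUTION: "tool_execution",
  COMPACTION: "compaction",
});

// Minimal run state. The run loop is assumed to update `phase` whenever it
// switches between waiting on the model, running a tool, and compacting.
function createRunState() {
  return { phase: RunPhase.LLM_REQUEST, timedOut: false, timeoutCause: null };
}

// Called by the run-level timer. Records the phase that was active at expiry.
function onRunTimeout(state) {
  state.timedOut = true;
  state.timeoutCause = state.phase;
}

// Mirrors the shape of the existing condition, but only a timeout that hit
// an in-flight LLM request counts as a reason to fail over.
function shouldFailover(state, { aborted = false, failoverFailure = false } = {}) {
  if (!aborted && failoverFailure) return true;
  return state.timedOut && state.timeoutCause === RunPhase.LLM_REQUEST;
}
```

With this shape, a run that times out while phase is "tool_execution" still ends with timedOut = true, but shouldFailover returns false, so the fallback model is never invoked. The user-facing message could branch on the same timeoutCause value instead of always reporting an LLM request timeout.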
Alternatively, the run deadline could be extended while tool execution is actively in flight (same approach as PR #46889 does for compaction).
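One possible shape for the deadline-extension alternative is sketched below. It is an assumption about how such a mechanism could look, not a description of what PR #46889 actually does: RunDeadline and its methods are hypothetical names, and the clock is injectable only so the behavior can be checked without waiting.

```javascript
// Hypothetical sketch: a run deadline that does not count tool execution time.
class RunDeadline {
  // budgetMs: time allowed for non-tool work; now: injectable clock for tests.
  constructor(budgetMs, now = Date.now) {
    this.now = now;
    this.deadline = now() + budgetMs;
    this.toolStartedAt = null;
  }

  // Mark the start of a tool call. Nested or repeated calls are ignored.
  toolStarted() {
    if (this.toolStartedAt === null) this.toolStartedAt = this.now();
  }

  // Mark the end of a tool call and push the deadline out by its duration.
  toolFinished() {
    if (this.toolStartedAt === null) return;
    this.deadline += this.now() - this.toolStartedAt;
    this.toolStartedAt = null;
  }

  // The run is never considered expired while a tool call is in flight.
  isExpired() {
    if (this.toolStartedAt !== null) return false;
    return this.now() >= this.deadline;
  }
}
```

In the scenario from this report, the run loop would call toolStarted() before each process(poll) and toolFinished() after it, so the four 2-minute polls would not consume the 600s budget. A tool that hangs forever would never expire under this sketch, so it would need to be paired with a separate per-tool timeout.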