Agent run timeout during tool execution misclassified as LLM timeout, triggers unnecessary model fallback #52147

@andychu666

Bug type

Agent behavior

Summary

Agent run timeout during long tool execution (e.g. process(poll)) is misclassified as "LLM request timed out", triggering unnecessary model fallback — even though the primary model responded correctly.

Steps to reproduce

  1. Configure an agent with claude-opus-4-6 as primary model and a fallback model chain
  2. In a session, have the agent spawn a background process via exec (background: true)
  3. Have the agent monitor the process using process(poll) with 2-minute poll intervals
  4. Wait for the total run time to exceed DEFAULT_AGENT_TIMEOUT_SECONDS (600s / 10 minutes)
  5. Observe: the run is aborted with "FailoverError: LLM request timed out." and falls back to the next model in the chain

Alternatively: any tool call that takes a long time (browser automation, long exec, repeated process polling) can trigger the same behavior.

Expected behavior

Tool execution time should not count toward the LLM request timeout. The primary model had already responded successfully and the agent was executing tool calls — no LLM request was in flight when the timeout fired.

Precedent: PR #46889 added timedOutDuringCompaction to exempt compaction operations from triggering failover. Long-running tool execution should receive the same treatment.

Actual behavior

The run-level timer fires after 600s regardless of what the agent is doing. When it fires during tool execution:

  1. abortRun(true) sets timedOut = true
  2. The failover decision checks timedOut && !timedOutDuringCompaction → enters fallback branch
  3. A FailoverError: LLM request timed out. is thrown
  4. The system fails over to the next configured model
  5. The fallback model receives the full context (~61K tokens) but has no useful work to do (the original task was already being handled)
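
The steps above can be condensed into a small model of the decision. This is a sketch for illustration, not the project's code: makeRun, onRunTimeout, shouldFailover and the phase field are hypothetical names, and only timedOut and timedOutDuringCompaction mirror identifiers seen in the bundle.

```javascript
// Hypothetical model of the run state at the moment the run-level timer fires.
// Only `timedOut` and `timedOutDuringCompaction` mirror names from the bundle.
function makeRun(phase) {
  return { phase, aborted: false, timedOut: false, timedOutDuringCompaction: false };
}

// Models the reported behavior: the timer marks the run as timed out no matter
// what it is doing, and only compaction is recorded as a special case.
function onRunTimeout(run) {
  run.aborted = true;
  run.timedOut = true;
  if (run.phase === "compaction") run.timedOutDuringCompaction = true;
}

// The timeout half of the failover condition as quoted in this issue.
function shouldFailover(run) {
  return run.timedOut && !run.timedOutDuringCompaction;
}

const duringTool = makeRun("tool_execution");
onRunTimeout(duringTool);
console.log(shouldFailover(duringTool)); // true: failover although no LLM request was in flight

const duringCompaction = makeRun("compaction");
onRunTimeout(duringCompaction);
console.log(shouldFailover(duringCompaction)); // false: compaction is exempt
```

Because the condition never looks at what the run was doing, a timeout during process(poll) is indistinguishable from a timeout while waiting on the model.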

Gateway log sequence (redacted):

WARN  embedded run timeout: runId=<redacted> sessionId=<redacted> timeoutMs=600000
DEBUG run cleanup: runId=<redacted> sessionId=<redacted> aborted=true timedOut=true
WARN  embedded_run_failover_decision: stage=assistant decision=fallback_model failoverReason=timeout provider=anthropic model=claude-opus-4-6 timedOut=true status=408
ERROR lane task error: lane=main durationMs=630119 error="FailoverError: LLM request timed out."

The primary model (Opus) had completed its response. The agent was in a process(poll) loop monitoring a background worker:

  • Poll 1: 2 min wait → got output
  • Poll 2: 2 min wait → got output
  • Poll 3: 2 min wait → got output
  • Poll 4: 2 min wait → got output
  • Poll 5: started, aborted at ~17s by run timeout

Total tool execution time: ~10 minutes. No LLM request was pending.

OpenClaw version

2026.3.13 (61d171a)

Operating system

Ubuntu 24.04.4 LTS, Linux 6.17.0-19-generic x86_64

Install method

npm (global)

Model

claude-opus-4-6 (Anthropic) — primary model that was "timed out"

Provider / routing chain

anthropic/claude-opus-4-6 → openrouter/google/gemini-3.1-pro-preview (fallback #1) → openrouter/openai/gpt-5.4 (fallback #2) → ollama/qwen3.5:9b (fallback #3)

Additional provider/model setup details

Default agent timeout is the built-in DEFAULT_AGENT_TIMEOUT_SECONDS = 600. No custom agents.defaults.timeoutSeconds override was set.

The fallback model (gemini-3.1-pro-preview via OpenRouter) consumed ~61K input tokens at a cost of $0.145 but produced 0 completion tokens, which indicates it had no useful work to do.

Logs, screenshots, and evidence

Source code analysis (minified bundle auth-profiles-DDVivXkv.js)

Timeout flag declarations:

// Line 109601-109602
let timedOut = false;
let timedOutDuringCompaction = false;

Failover condition (line 111112):

if (!aborted && failoverFailure || timedOut && !timedOutDuringCompaction) {
    // → enters fallback branch even when timeout was caused by tool execution
}

User-facing error (line 111176):

if (timedOut && !timedOutDuringCompaction && payloads.length === 0) return {
    payloads: [{ text: "Request timed out before a response was generated...", isError: true }],
    // ...
};

Note: timedOutDuringCompaction is the only exemption. There is no timedOutDuringToolExecution or equivalent.

Compaction precedent (PR #46889)

The timedOutDuringCompaction mechanism shows the design intent: certain non-LLM operations should not trigger failover. Tool execution is a missing case.

Impact and severity

  • Affected: Any agent using long-running tools (process polling, browser automation, multi-minute exec tasks) with a fallback model chain configured
  • Severity: Blocks workflow + unnecessary cost — the fallback model is invoked with the full conversation context but produces no useful output
  • Frequency: Deterministic. Any run whose total time, tool execution included, exceeds DEFAULT_AGENT_TIMEOUT_SECONDS (600s) triggers this when the timer fires during a tool call
  • Consequence: (1) Wasted API cost on the fallback model, (2) original task is interrupted mid-execution, (3) misleading "LLM request timed out" error when the LLM was not involved

Additional information

Suggested fix

Add a timedOutDuringToolExecution flag (or refactor to a general timeoutCause enum) so tool execution time is exempt from the failover path, consistent with the existing compaction exemption:

// Current: only compaction is exempt
if (timedOut && !timedOutDuringCompaction) { /* failover */ }

// Proposed: tool execution also exempt
if (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution) { /* failover */ }

// Or better: general timeout cause
if (timedOut && timeoutCause === 'llm_request') { /* failover */ }
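
A minimal sketch of the timeoutCause variant, assuming the runner knows which activity is in progress when the timer fires. TimeoutCause, RunTimeoutState and their methods are hypothetical names that do not exist in the bundle, and treating an LLM request as the initial activity is also an assumption.

```javascript
// Hypothetical cause values; none of these names exist in the bundle.
const TimeoutCause = Object.freeze({
  LLM_REQUEST: "llm_request",
  TOOL_EXECUTION: "tool_execution",
  COMPACTION: "compaction",
});

// Tracks the current activity so the timeout handler can record why it fired.
class RunTimeoutState {
  constructor() {
    this.activity = TimeoutCause.LLM_REQUEST; // assumed initial activity
    this.timedOut = false;
    this.timeoutCause = null;
  }

  // Called by the (hypothetical) runner whenever the run switches activity.
  enter(activity) {
    this.activity = activity;
  }

  // Called when the run-level timer fires.
  onTimeout() {
    this.timedOut = true;
    this.timeoutCause = this.activity;
  }

  // Only a timeout while waiting on the model justifies a model fallback.
  shouldFailover() {
    return this.timedOut && this.timeoutCause === TimeoutCause.LLM_REQUEST;
  }
}

const toolRun = new RunTimeoutState();
toolRun.enter(TimeoutCause.TOOL_EXECUTION);
toolRun.onTimeout();
console.log(toolRun.shouldFailover()); // false

const llmRun = new RunTimeoutState();
llmRun.onTimeout();
console.log(llmRun.shouldFailover()); // true
```

Recording the cause also lets the user-facing error distinguish "tool execution exceeded the run timeout" from "LLM request timed out".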

Alternatively, the run deadline could be extended while tool execution is in flight (the same approach PR #46889 takes for compaction).
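
One way to picture the deadline-extension alternative is a budget that only counts time spent outside tool execution. The sketch below is an assumption about how that could look, not a description of PR #46889: RunBudget, toolStarted, toolFinished and isExpired are hypothetical names, and the clock is injected so the example runs deterministically.

```javascript
// Hypothetical helper: charges only non-tool time against the run budget.
class RunBudget {
  constructor(budgetMs, now = Date.now) {
    this.budgetMs = budgetMs;
    this.now = now;
    this.consumedMs = 0;
    this.resumedAt = now(); // null while a tool call is in flight
  }

  // Stop the clock when a tool call begins.
  toolStarted() {
    if (this.resumedAt === null) return;
    this.consumedMs += this.now() - this.resumedAt;
    this.resumedAt = null;
  }

  // Restart the clock when the tool call returns.
  toolFinished() {
    if (this.resumedAt === null) this.resumedAt = this.now();
  }

  isExpired() {
    const running = this.resumedAt === null ? 0 : this.now() - this.resumedAt;
    return this.consumedMs + running >= this.budgetMs;
  }
}

let t = 0; // fake clock in milliseconds
const budget = new RunBudget(600_000, () => t);

t = 30_000;
budget.toolStarted(); // 30 s of non-tool time consumed so far
t = 630_000;
budget.toolFinished(); // 10 min of polling is not charged to the budget
console.log(budget.isExpired()); // false

t = 1_200_000; // a further 570 s of non-tool time
console.log(budget.isExpired()); // true
```

A design like this would probably still need a separate cap on individual tool calls so that a hung tool cannot keep a run alive indefinitely.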
