Agent run timeout during tool execution misclassified as LLM timeout, triggers unnecessary model fallback #52147


## Bug type
Agent behavior

## Summary
Agent run timeout during long tool execution (e.g. `process(poll)`) is misclassified as "LLM request timed out", triggering unnecessary model fallback — even though the primary model responded correctly.

## Steps to reproduce

1. Configure an agent with `claude-opus-4-6` as primary model and a fallback model chain
2. In a session, have the agent spawn a background process via `exec` (background: true)
3. Have the agent monitor the process using `process(poll)` with 2-minute poll intervals
4. Wait for the total run time to exceed `DEFAULT_AGENT_TIMEOUT_SECONDS` (600s / 10 minutes)
5. Observe: the run is aborted with `"FailoverError: LLM request timed out."` and falls back to the next model in the chain

Alternatively: any tool call that takes a long time (browser automation, long exec, repeated process polling) can trigger the same behavior.

## Expected behavior

Tool execution time should not count toward the LLM request timeout. The primary model had already responded successfully and the agent was executing tool calls — no LLM request was in flight when the timeout fired.

Precedent: PR #46889 added `timedOutDuringCompaction` to exempt compaction operations from triggering failover. Long-running tool execution should receive the same treatment.

## Actual behavior

The run-level timer fires after 600s regardless of what the agent is doing. When it fires during tool execution:

1. `abortRun(true)` sets `timedOut = true`
2. The failover decision checks `timedOut && !timedOutDuringCompaction` → enters fallback branch
3. A `FailoverError: LLM request timed out.` is thrown
4. The system fails over to the next configured model
5. The fallback model receives the full context (~61K tokens) but has no useful work to do (the original task was already being handled)

Gateway log sequence (redacted):
```
WARN  embedded run timeout: runId=<redacted> sessionId=<redacted> timeoutMs=600000
DEBUG run cleanup: runId=<redacted> sessionId=<redacted> aborted=true timedOut=true
WARN  embedded_run_failover_decision: stage=assistant decision=fallback_model failoverReason=timeout provider=anthropic model=claude-opus-4-6 timedOut=true status=408
ERROR lane task error: lane=main durationMs=630119 error="FailoverError: LLM request timed out."
```

The primary model (Opus) had completed its response. The agent was in a `process(poll)` loop monitoring a background worker:
- Poll 1: 2 min wait → got output
- Poll 2: 2 min wait → got output
- Poll 3: 2 min wait → got output
- Poll 4: 2 min wait → got output
- Poll 5: started, aborted at ~17s by run timeout

Total tool execution time: ~10 minutes. No LLM request was pending.

## OpenClaw version
2026.3.13 (61d171a)

## Operating system
Ubuntu 24.04.4 LTS, Linux 6.17.0-19-generic x86_64

## Install method
npm (global)

## Model
claude-opus-4-6 (Anthropic) — primary model that was "timed out"

## Provider / routing chain
anthropic/claude-opus-4-6 → openrouter/google/gemini-3.1-pro-preview (fallback #1) → openrouter/openai/gpt-5.4 (fallback #2) → ollama/qwen3.5:9b (fallback #3)

## Additional provider/model setup details
Default agent timeout is the built-in `DEFAULT_AGENT_TIMEOUT_SECONDS = 600`. No custom `agents.defaults.timeoutSeconds` override was set.

The fallback model (gemini-3.1-pro-preview via OpenRouter) consumed ~61K input tokens at $0.145 cost but produced 0 completion tokens — confirming it had no useful work to do.

## Logs, screenshots, and evidence

### Source code analysis (minified bundle `auth-profiles-DDVivXkv.js`)

**Timeout flag declarations:**
```js
// Lines 109601-109602
let timedOut = false;
let timedOutDuringCompaction = false;
```

**Failover condition (line 111112):**
```js
if (!aborted && failoverFailure || timedOut && !timedOutDuringCompaction) {
    // → enters fallback branch even when timeout was caused by tool execution
}
```

**User-facing error (line 111176):**
```js
if (timedOut && !timedOutDuringCompaction && payloads.length === 0) return {
    payloads: [{ text: "Request timed out before a response was generated...", isError: true }],
    // ...
};
```

Note: `timedOutDuringCompaction` is the **only** exemption. There is no `timedOutDuringToolExecution` or equivalent.

### Compaction precedent (PR #46889)
The `timedOutDuringCompaction` mechanism shows the design intent: certain non-LLM operations should not trigger failover. Tool execution is a missing case.

## Impact and severity

- **Affected:** Any agent using long-running tools (process polling, browser automation, multi-minute exec tasks) with a fallback model chain configured
- **Severity:** Blocks workflow and incurs unnecessary cost: the fallback model is invoked with the full conversation context but produces no useful output
- **Frequency:** Deterministic: any run whose total duration exceeds `DEFAULT_AGENT_TIMEOUT_SECONDS` (600s) while a tool call is executing will trigger this
- **Consequence:** (1) Wasted API cost on the fallback model, (2) original task is interrupted mid-execution, (3) misleading "LLM request timed out" error when the LLM was not involved

## Additional information

### Suggested fix

Add a `timedOutDuringToolExecution` flag (or refactor to a general `timeoutCause` enum) so tool execution time is exempt from the failover path, consistent with the existing compaction exemption:

```js
// Current: only compaction is exempt
if (timedOut && !timedOutDuringCompaction) { /* failover */ }

// Proposed: tool execution also exempt
if (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution) { /* failover */ }

// Or better: general timeout cause
if (timedOut && timeoutCause === 'llm_request') { /* failover */ }
```
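A minimal sketch of how the `timeoutCause` variant might be tracked, assuming the run loop can mark which phase is active. Every name here (`TimeoutCause`, `createRunState`, `enterPhase`, `onRunTimeout`, `shouldFailoverOnTimeout`) is hypothetical and not taken from the OpenClaw source; the default phase of `llm_request` is also an assumption.

```js
// Hypothetical sketch: none of these names exist in the bundle.
const TimeoutCause = Object.freeze({
  LLM_REQUEST: "llm_request",
  TOOL_EXECUTION: "tool_execution",
  COMPACTION: "compaction",
});

function createRunState() {
  // Assumption: a run is waiting on the LLM unless another phase is entered.
  return { phase: TimeoutCause.LLM_REQUEST, timedOut: false, timeoutCause: null };
}

// Mark a phase as active; returns a function that restores the previous phase.
function enterPhase(state, phase) {
  const previous = state.phase;
  state.phase = phase;
  return () => {
    state.phase = previous;
  };
}

// Called by the run-level timer: record which phase was active when it fired.
function onRunTimeout(state) {
  state.timedOut = true;
  state.timeoutCause = state.phase;
}

// Only a timeout that interrupted an LLM request should trigger failover.
function shouldFailoverOnTimeout(state) {
  return state.timedOut && state.timeoutCause === TimeoutCause.LLM_REQUEST;
}

// Usage: the timer fires while a poll-style tool call is in flight.
const state = createRunState();
const leaveTool = enterPhase(state, TimeoutCause.TOOL_EXECUTION);
onRunTimeout(state);
leaveTool();

console.log(state.timeoutCause, shouldFailoverOnTimeout(state));
// → tool_execution false
```

Recording the cause at the moment the timer fires keeps the failover check a single comparison and lets the user-facing error message name the real cause instead of "LLM request timed out".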

Alternatively, the run deadline could be extended while tool execution is actively in flight (the same approach PR #46889 takes for compaction).
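One possible reading of the deadline-extension alternative is a run timer that stops counting while a tool call is in flight. The sketch below is an assumption about how that could look, not a description of what PR #46889 does; `createPausableDeadline` and its methods are hypothetical names, and the injectable `now` clock exists only so the example can be followed without waiting ten minutes.

```js
// Hypothetical sketch of a run deadline that is not charged for tool time.
function createPausableDeadline(budgetMs, onExpire, now = Date.now) {
  let remainingMs = budgetMs;
  let startedAt = 0;
  let timer = null;
  let finished = false;

  // Start (or restart) counting down from the remaining budget.
  function resume() {
    if (finished || timer !== null) return;
    startedAt = now();
    timer = setTimeout(() => {
      timer = null;
      finished = true;
      remainingMs = 0;
      onExpire();
    }, remainingMs);
  }

  // Stop counting and bank whatever budget is left.
  function pause() {
    if (timer === null) return;
    clearTimeout(timer);
    timer = null;
    remainingMs = Math.max(0, remainingMs - (now() - startedAt));
  }

  // Stop the deadline permanently, e.g. when the run completes.
  function cancel() {
    if (timer !== null) clearTimeout(timer);
    timer = null;
    finished = true;
  }

  // Budget left, accounting for time elapsed since the last resume.
  function remaining() {
    if (timer === null) return remainingMs;
    return Math.max(0, remainingMs - (now() - startedAt));
  }

  return { resume, pause, cancel, remaining };
}

// Usage with a fake clock: 100s of LLM time, then 600s of tool time.
let fakeNow = 0;
const deadline = createPausableDeadline(600000, () => console.log("run timed out"), () => fakeNow);
deadline.resume();   // LLM request in flight
fakeNow = 100000;
deadline.pause();    // tool call starts
fakeNow = 700000;    // tool runs for 10 minutes
deadline.resume();   // tool call finished, next LLM request starts
const remainingAfterTool = deadline.remaining();
console.log(remainingAfterTool); // → 500000
deadline.cancel();
```

With this approach a tool call can run indefinitely unless tools carry their own timeout, and concurrent tool calls would need an in-flight counter rather than a single pause/resume pair, so the `timeoutCause` variant may be the safer first step.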
