[Bug]: Agent hangs indefinitely on failed tool calls — no timeout, no recovery, no fallback

# OpenClaw Bug Report & Feature Request: Agent Hangs Indefinitely on Failed Tool Calls

**Reporter:** Tim (Architect)  
**Date:** 2026-02-03  
**Severity:** Critical — caused 8+ hours of total agent downtime in a single day  
**Version:** OpenClaw 2026.2.1 (ed4529e)  
**Platform:** macOS (Mac Mini, Apple Silicon), Node 25.5.0  
**Channel:** Telegram (long-polling mode)

---

## Summary

When a tool call fails to return a result (RPC wedge, provider timeout, or malformed response), the agent hangs silently for up to 600 seconds (`DEFAULT_AGENT_TIMEOUT_SECONDS` in `agents/timeout.js`) with no recovery mechanism. During this window, the agent is completely unresponsive on all channels — Telegram messages queue but are never processed, and the web UI chat shows no activity. This happened **three separate times** in a single day, each requiring manual intervention (session file deletion + gateway restart) to recover.

## Environment

- **Primary model:** `openai-codex/gpt-5.2`
- **Fallbacks:** `anthropic/claude-opus-4-5`, `anthropic/claude-sonnet-4-5`, `moonshot/kimi-k2.5`, `ollama/qwen3:8b`
- **Channel:** Telegram (`@SSFdelta_bot`), long-polling, `streamMode: "partial"`
- **Gateway:** local, loopback, port 18789

## Reproduction Steps

1. Agent receives a message via Telegram
2. Agent attempts tool calls (e.g., `session_status`, `exec`, `edit`)
3. Tool call returns "No result provided" or hangs indefinitely
4. Agent enters `processing` state and stays there for up to 600 seconds
5. All subsequent messages queue but are not processed
6. No error message is sent to the user
7. No automatic recovery occurs
8. The only fix is manual: `rm -f ~/.openclaw/agents/main/sessions/sessions.json && openclaw gateway restart`

## Root Causes Identified

### 1. No tool call timeout enforcement
The exec host (`infra/exec-host.js`) has a 20-second default timeout, but the agent-level timeout (`agents/timeout.js`) defaults to **600 seconds**. When a tool RPC wedges, the agent waits the full 10 minutes before timing out. Nothing enforces a separate, shorter timeout on individual tool calls, as distinct from the overall agent run.
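
As a rough illustration of the missing piece, here is a minimal sketch of a per-tool-call deadline that is independent of the agent run timeout. `withToolTimeout` and `ToolOutcome` are hypothetical names, not OpenClaw APIs, and the sketch assumes tool calls are promise-based and can accept an `AbortSignal`.

```typescript
// Hypothetical helper, not an OpenClaw API.
// A tool call either yields a value or an error string the model can read.
type ToolOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

function withToolTimeout<T>(
  toolName: string,
  timeoutMs: number,
  call: (signal: AbortSignal) => Promise<T>,
): Promise<ToolOutcome<T>> {
  const controller = new AbortController();
  return new Promise<ToolOutcome<T>>((resolve) => {
    const timer = setTimeout(() => {
      // Ask the tool to stop; this only helps if the tool honors the signal.
      controller.abort();
      resolve({ ok: false, error: `tool "${toolName}" timed out after ${timeoutMs} ms` });
    }, timeoutMs);

    // Start the call inside a promise chain so a synchronous throw is
    // reported as a failed outcome too.
    Promise.resolve()
      .then(() => call(controller.signal))
      .then(
        (value) => {
          clearTimeout(timer);
          resolve({ ok: true, value });
        },
        (err: unknown) => {
          clearTimeout(timer);
          resolve({ ok: false, error: String(err) });
        },
      );
  });
}
```

With something like this in place, a wedged call resolves to `{ ok: false, error: ... }` after the deadline, and that error could be handed back to the model as the tool result so the turn can continue.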

### 2. No circuit breaker on repeated tool failures
During one incident, the agent attempted the same failing `edit` operation on `questionnaire-flow.spec.ts` **four consecutive times**, each returning "Found 2 occurrences of the text." The model (GPT-5.2) did not adapt or bail out — it retried the identical failing call. There is no mechanism to detect repeated identical failures and halt tool usage.
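
One way to detect this pattern is sketched below, under the assumption that each tool call can be reduced to a name plus JSON-serializable arguments. `RepeatFailureGuard` and `ToolCall` are hypothetical names used for illustration only.

```typescript
// Hypothetical guard, not an OpenClaw API.
interface ToolCall {
  name: string;
  args: unknown;
}

class RepeatFailureGuard {
  private readonly maxIdenticalFailures: number;
  private lastKey: string | null = null;
  private count = 0;

  constructor(maxIdenticalFailures = 2) {
    this.maxIdenticalFailures = maxIdenticalFailures;
  }

  // Record a failed call. Returns true once the same call (same name and
  // same serialized args) has failed `maxIdenticalFailures` times in a row.
  // Note: args that differ only in key order serialize differently and
  // are treated as different calls.
  recordFailure(call: ToolCall): boolean {
    const key = `${call.name}:${JSON.stringify(call.args)}`;
    this.count = key === this.lastKey ? this.count + 1 : 1;
    this.lastKey = key;
    return this.count >= this.maxIdenticalFailures;
  }

  // Any successful call resets the streak.
  recordSuccess(): void {
    this.lastKey = null;
    this.count = 0;
  }
}

const guard = new RepeatFailureGuard(2);
const edit: ToolCall = { name: "edit", args: { path: "questionnaire-flow.spec.ts" } };
guard.recordFailure(edit); // false: first failure
guard.recordFailure(edit); // true: identical call failed twice in a row
```

When `recordFailure` returns `true`, the runtime could refuse the next identical call and tell the model to try a different approach.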

### 3. No graceful degradation to text-only mode
When all tool calls are failing (due to provider auth issues, network failures, or wedged RPCs), the agent should be able to fall back to conversational responses. Currently, it simply hangs.

### 4. Silent failure — no user notification
When the agent enters a hung state, the user receives no notification. Messages appear delivered in Telegram but no response ever arrives. There is no "I'm experiencing issues, please stand by" fallback message.

### 5. Context loss on recovery
The only reliable recovery method is deleting `sessions.json` and restarting the gateway. This destroys all session context, meaning the agent loses the entire conversation history and any work-in-progress. There is no way to abort a stuck run while preserving session state. Commands like `openclaw run abort --all`, `openclaw session flush`, and `openclaw gateway restart --flush-pending` do not exist.

## Compounding Factors

During the incidents, multiple provider failures amplified the problem:

- **Anthropic:** `Invalid bearer token` (401), `credit balance too low`, `Provider in cooldown`
- **Network:** Repeated `TypeError: fetch failed` on outbound calls
- **Missing dependency:** `docker not found` when agent attempted `exec` with Docker
- **LLM timeouts:** Multiple `LLM request timed out` errors after 600 seconds

With the primary model failing and fallback providers also experiencing issues, the agent had no reliable path to complete a turn.

## Workaround Applied

Set `agents.defaults.timeoutSeconds: 60` in `openclaw.json` to reduce hang time from 10 minutes to 1 minute. This mitigates but does not fix the underlying issues.

## Requested Features

### P0 — Critical

1. **Per-tool-call timeout** (separate from the agent run timeout): Hard-kill individual tool calls after 30 to 60 seconds, return an error to the model, and let it continue the turn without that tool's result.

2. **Stuck run abort without context loss**: A command like `openclaw run abort` that kills the active run but preserves session history, so the agent can resume on the next message without amnesia.

3. **Automatic hang detection + recovery**: A built-in watchdog that detects when a session has been in `processing` state for longer than `timeoutSeconds`, automatically aborts the run, and sends a notification to the user on the active channel.
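
The watchdog in item 3 could look roughly like the sketch below. The session shape, `findStuckSessions`, `startWatchdog`, and the injected `abortRun` and `notifyUser` callbacks are all assumptions made for illustration; OpenClaw's real session records and gateway hooks may differ.

```typescript
// Hypothetical shapes, not OpenClaw's actual session format.
interface SessionSnapshot {
  id: string;
  state: "idle" | "processing";
  processingSinceMs: number | null; // epoch ms when the current run started
}

// Returns ids of sessions that have been `processing` longer than the limit.
function findStuckSessions(
  sessions: SessionSnapshot[],
  timeoutSeconds: number,
  nowMs: number,
): string[] {
  const limitMs = timeoutSeconds * 1000;
  return sessions
    .filter(
      (s) =>
        s.state === "processing" &&
        s.processingSinceMs !== null &&
        nowMs - s.processingSinceMs > limitMs,
    )
    .map((s) => s.id);
}

// Periodic check. `abortRun` and `notifyUser` stand in for whatever the
// gateway would actually call. Returns a function that stops the watchdog.
function startWatchdog(
  listSessions: () => SessionSnapshot[],
  timeoutSeconds: number,
  abortRun: (sessionId: string) => void,
  notifyUser: (sessionId: string, text: string) => void,
  intervalMs = 5000,
): () => void {
  const timer = setInterval(() => {
    for (const id of findStuckSessions(listSessions(), timeoutSeconds, Date.now())) {
      abortRun(id);
      notifyUser(id, "⚠️ I'm experiencing issues processing your request. I'll be back shortly.");
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Keeping the detection logic in a pure function with the clock passed in makes it testable without waiting on real timers.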

### P1 — High

4. **Circuit breaker on tool failures**: After N consecutive tool call failures (configurable, default 3), automatically disable tool calls for the session and continue in text-only mode. Notify the user that tools are temporarily unavailable.

5. **Fallback message on hang**: If the agent cannot complete a turn within `timeoutSeconds`, send a predefined message to the user (e.g., "⚠️ I'm experiencing issues processing your request. I'll be back shortly.") rather than going completely silent.

6. **Model-level retry guardrail**: Detect when the model is retrying the same failing tool call with identical parameters and force a different action (skip, rephrase, or bail to text response).
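
A minimal sketch of the circuit breaker from item 4, using the suggested default of 3 consecutive failures. `ToolCircuitBreaker` is a hypothetical name, and how the agent would actually switch to text-only mode is left open.

```typescript
// Hypothetical per-session breaker, not an OpenClaw API.
class ToolCircuitBreaker {
  private readonly threshold: number;
  private consecutiveFailures = 0;
  private open = false;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // A successful tool call clears the failure streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  // Returns true only on the failure that trips the breaker, so the
  // caller can notify the user exactly once.
  recordFailure(): boolean {
    if (this.open) return false;
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.open = true;
      return true;
    }
    return false;
  }

  // While this returns false, the agent would answer in text-only mode.
  toolsEnabled(): boolean {
    return !this.open;
  }

  // Explicit reset, for example on a new session or after a cooldown.
  reset(): void {
    this.open = false;
    this.consecutiveFailures = 0;
  }
}
```

The breaker stays open until `reset()` is called; whether that happens on a timer, on the next session, or by user command is a design choice this sketch does not settle.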

### P2 — Important

7. **Persistent conversation memory**: Session context should survive gateway restarts and session file resets. Chat history from channels (Telegram, web UI) should be reloadable into a new session so context recovery doesn't require manual intervention.

8. **Health monitoring endpoint**: An API or CLI command that reports whether the agent is currently stuck, how long it's been processing, and what tool call it's waiting on — enabling external monitoring and alerting.
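
For item 8, the report might carry the three fields named above. The sketch below is only a guess at a useful shape; `RunState`, `HealthReport`, and `buildHealthReport` are hypothetical names, and the real endpoint or CLI output would be up to the maintainers.

```typescript
// Hypothetical shapes, field names are illustrative only.
interface RunState {
  state: "idle" | "processing";
  startedAtMs: number | null; // epoch ms when the current run started
  pendingTool: string | null; // tool call the run is waiting on, if any
}

interface HealthReport {
  stuck: boolean;
  processingSeconds: number;
  waitingOnTool: string | null;
}

// Builds a report from the current run state. The clock is passed in so
// the function stays pure and easy to test.
function buildHealthReport(
  run: RunState,
  timeoutSeconds: number,
  nowMs: number,
): HealthReport {
  if (run.state !== "processing" || run.startedAtMs === null) {
    return { stuck: false, processingSeconds: 0, waitingOnTool: null };
  }
  const processingSeconds = Math.floor((nowMs - run.startedAtMs) / 1000);
  return {
    stuck: processingSeconds > timeoutSeconds,
    processingSeconds: processingSeconds,
    waitingOnTool: run.pendingTool,
  };
}
```

An external monitor could poll this and alert when `stuck` is `true`, rather than relying on the user noticing that Telegram has gone quiet.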

## Impact

- **8+ hours of total downtime** across 3 incidents in one day
- **Complete context loss** on each recovery (3 times)
- **Zero productive work** accomplished through the agent
- **Manual babysitting required** — defeats the purpose of an autonomous agent

This is a fundamental reliability gap. An always-on agent that can't recover from a failed tool call without human intervention and context destruction is not production-ready for autonomous operation.

---

*Filed by the Architect, with diagnostic assistance from Claude (Anthropic).*

