-
-
Notifications
You must be signed in to change notification settings - Fork 67.3k
Description
OpenClaw Bug Report & Feature Request: Agent Hangs Indefinitely on Failed Tool Calls
Reporter: Tim (Architect)
Date: 2026-02-03
Severity: Critical — caused 8+ hours of total agent downtime in a single day
Version: OpenClaw 2026.2.1 (ed4529e)
Platform: macOS (Mac Mini, Apple Silicon), Node 25.5.0
Channel: Telegram (long-polling mode)
Summary
When a tool call fails to return a result (RPC wedge, provider timeout, or malformed response), the agent hangs silently for up to 600 seconds (DEFAULT_AGENT_TIMEOUT_SECONDS in agents/timeout.js) with no recovery mechanism. During this window, the agent is completely unresponsive on all channels — Telegram messages queue but are never processed, and the web UI chat shows no activity. This happened three separate times in a single day, each requiring manual intervention (session file deletion + gateway restart) to recover.
Environment
- Primary model:
openai-codex/gpt-5.2 - Fallbacks:
anthropic/claude-opus-4-5,anthropic/claude-sonnet-4-5,moonshot/kimi-k2.5,ollama/qwen3:8b - Channel: Telegram (
@SSFdelta_bot), long-polling,streamMode: "partial" - Gateway: local, loopback, port 18789
Reproduction Steps
- Agent receives a message via Telegram
- Agent attempts tool calls (e.g.,
session_status,exec,edit) - Tool call returns "No result provided" or hangs indefinitely
- Agent enters
processingstate and stays there for up to 600 seconds - All subsequent messages queue but are not processed
- No error message is sent to the user
- No automatic recovery occurs
- Only fix: manually
rm -f ~/.openclaw/agents/main/sessions/sessions.json && openclaw gateway restart
Root Causes Identified
1. No tool call timeout enforcement
The exec host (infra/exec-host.js) has a 20-second default timeout, but the agent-level timeout (agents/timeout.js) defaults to 600 seconds. When a tool RPC wedges, the agent waits the full 10 minutes before timing out. There is no separate, shorter timeout for individual tool calls vs. the overall agent run.
2. No circuit breaker on repeated tool failures
During one incident, the agent attempted the same failing edit operation on questionnaire-flow.spec.ts four consecutive times, each returning "Found 2 occurrences of the text." The model (GPT-5.2) did not adapt or bail out — it retried the identical failing call. There is no mechanism to detect repeated identical failures and halt tool usage.
3. No graceful degradation to text-only mode
When all tool calls are failing (due to provider auth issues, network failures, or wedged RPCs), the agent should be able to fall back to conversational responses. Currently, it simply hangs.
4. Silent failure — no user notification
When the agent enters a hung state, the user receives no notification. Messages appear delivered in Telegram but no response ever arrives. There is no "I'm experiencing issues, please stand by" fallback message.
5. Context loss on recovery
The only reliable recovery method is deleting sessions.json and restarting the gateway. This destroys all session context, meaning the agent loses the entire conversation history and any work-in-progress. There is no way to abort a stuck run while preserving session state. Commands like openclaw run abort --all, openclaw session flush, and openclaw gateway restart --flush-pending do not exist.
Compounding Factors
During the incidents, multiple provider failures amplified the problem:
- Anthropic:
Invalid bearer token(401),credit balance too low,Provider in cooldown - Network: Repeated
TypeError: fetch failedon outbound calls - Missing dependency:
docker not foundwhen agent attemptedexecwith Docker - LLM timeouts: Multiple
LLM request timed outafter 600 seconds
With the primary model failing and fallback providers also experiencing issues, the agent had no reliable path to complete a turn.
Workaround Applied
Set agents.defaults.timeoutSeconds: 60 in openclaw.json to reduce hang time from 10 minutes to 1 minute. This mitigates but does not fix the underlying issues.
Requested Features
P0 — Critical
-
Per-tool-call timeout (separate from agent run timeout): Hard kill individual tool calls after 30-60 seconds, return error to the model, and let it continue the turn without that tool's result.
-
Stuck run abort without context loss: A command like
openclaw run abortthat kills the active run but preserves session history, so the agent can resume on the next message without amnesia. -
Automatic hang detection + recovery: A built-in watchdog that detects when a session has been in
processingstate for longer thantimeoutSeconds, automatically aborts the run, and sends a notification to the user on the active channel.
P1 — High
-
Circuit breaker on tool failures: After N consecutive tool call failures (configurable, default 3), automatically disable tool calls for the session and continue in text-only mode. Notify the user that tools are temporarily unavailable.
-
Fallback message on hang: If the agent cannot complete a turn within
timeoutSeconds, send a predefined message to the user (e.g., "⚠️ I'm experiencing issues processing your request. I'll be back shortly.") rather than going completely silent. -
Model-level retry guardrail: Detect when the model is retrying the same failing tool call with identical parameters and force a different action (skip, rephrase, or bail to text response).
P2 — Important
-
Persistent conversation memory: Session context should survive gateway restarts and session file resets. Chat history from channels (Telegram, web UI) should be reloadable into a new session so context recovery doesn't require manual intervention.
-
Health monitoring endpoint: An API or CLI command that reports whether the agent is currently stuck, how long it's been processing, and what tool call it's waiting on — enabling external monitoring and alerting.
Impact
- 8+ hours of total downtime across 3 incidents in one day
- Complete context loss on each recovery (3 times)
- Zero productive work accomplished through the agent
- Manual babysitting required — defeats the purpose of an autonomous agent
This is a fundamental reliability gap. An always-on agent that can't recover from a failed tool call without human intervention and context destruction is not production-ready for autonomous operation.
Filed by the Architect, with diagnostic assistance from Claude (Anthropic).