[Bug]: Agent hangs indefinitely on failed tool calls — no timeout, no recovery, no fallback #8288

@bluepointglobal-afk

OpenClaw Bug Report & Feature Request: Agent Hangs Indefinitely on Failed Tool Calls

Reporter: Tim (Architect)
Date: 2026-02-03
Severity: Critical — caused 8+ hours of total agent downtime in a single day
Version: OpenClaw 2026.2.1 (ed4529e)
Platform: macOS (Mac Mini, Apple Silicon), Node 25.5.0
Channel: Telegram (long-polling mode)


Summary

When a tool call fails to return a result (RPC wedge, provider timeout, or malformed response), the agent hangs silently for up to 600 seconds (DEFAULT_AGENT_TIMEOUT_SECONDS in agents/timeout.js) with no recovery mechanism. During this window, the agent is completely unresponsive on all channels — Telegram messages queue but are never processed, and the web UI chat shows no activity. This happened three separate times in a single day, each requiring manual intervention (session file deletion + gateway restart) to recover.

Environment

  • Primary model: openai-codex/gpt-5.2
  • Fallbacks: anthropic/claude-opus-4-5, anthropic/claude-sonnet-4-5, moonshot/kimi-k2.5, ollama/qwen3:8b
  • Channel: Telegram (@SSFdelta_bot), long-polling, streamMode: "partial"
  • Gateway: local, loopback, port 18789

Reproduction Steps

  1. Agent receives a message via Telegram
  2. Agent attempts tool calls (e.g., session_status, exec, edit)
  3. Tool call returns "No result provided" or never returns at all
  4. Agent enters processing state and stays there for up to 600 seconds
  5. All subsequent messages queue but are not processed
  6. No error message is sent to the user
  7. No automatic recovery occurs
  8. Only fix: manually run rm -f ~/.openclaw/agents/main/sessions/sessions.json && openclaw gateway restart

Root Causes Identified

1. No tool call timeout enforcement

The exec host (infra/exec-host.js) has a 20-second default timeout, but the agent-level timeout (agents/timeout.js) defaults to 600 seconds. When a tool RPC wedges, the agent waits the full 10 minutes before timing out. Individual tool calls have no shorter timeout of their own, separate from the overall agent run.
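
To illustrate what a per-call limit could look like, here is a minimal TypeScript sketch that races a tool call against a timer. All names (callToolWithTimeout, ToolOutcome, the onTimeout hook) are hypothetical; OpenClaw's actual tool runner interface is not shown in this report, so treat this as an assumption-laden sketch rather than a patch.

```typescript
// Hypothetical result shape; not OpenClaw's real tool-result type.
type ToolOutcome<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Race a tool call against a timer so a wedged call costs at most `timeoutMs`.
async function callToolWithTimeout<T>(
  run: () => Promise<T>,
  timeoutMs: number,
  onTimeout?: () => void, // assumed hook for killing the underlying process or RPC
): Promise<ToolOutcome<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timedOut = new Promise<ToolOutcome<T>>((resolve) => {
    timer = setTimeout(() => {
      if (onTimeout) onTimeout();
      resolve({ ok: false, error: `tool call timed out after ${timeoutMs} ms` });
    }, timeoutMs);
  });

  // Convert both success and rejection into a value the model can be shown.
  const finished = run().then(
    (value): ToolOutcome<T> => ({ ok: true, value }),
    (err: unknown): ToolOutcome<T> => ({ ok: false, error: String(err) }),
  );

  try {
    return await Promise.race([finished, timedOut]);
  } finally {
    clearTimeout(timer); // avoid leaving a live timer behind after a fast call
  }
}
```

Racing alone only stops the agent from waiting. Whatever is wedged keeps running unless the onTimeout hook actually terminates it, which is why the request below asks for a hard kill.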

2. No circuit breaker on repeated tool failures

During one incident, the agent attempted the same failing edit operation on questionnaire-flow.spec.ts four consecutive times, each returning "Found 2 occurrences of the text." The model (GPT-5.2) did not adapt or bail out — it retried the identical failing call. There is no mechanism to detect repeated identical failures and halt tool usage.
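
A guard for this case does not need to understand the error, only to notice that the same call keeps failing. The TypeScript sketch below counts the trailing run of failed attempts with an identical tool name and arguments. ToolAttempt, identicalFailureStreak, and shouldForceDifferentAction are hypothetical names chosen for illustration, and the default limit of 2 is an assumption.

```typescript
// Hypothetical record of one tool invocation; field names are illustrative.
interface ToolAttempt {
  tool: string;
  args: unknown;
  failed: boolean;
}

// Key for "same tool, same parameters". JSON.stringify is sensitive to
// object key order, which is acceptable for a sketch.
function attemptKey(attempt: ToolAttempt): string {
  return `${attempt.tool}:${JSON.stringify(attempt.args)}`;
}

// Length of the trailing run of failed attempts that share one key.
function identicalFailureStreak(history: readonly ToolAttempt[]): number {
  let key: string | undefined;
  let streak = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    const attempt = history[i];
    if (attempt === undefined || !attempt.failed) break;
    const current = attemptKey(attempt);
    if (key === undefined) {
      key = current; // the most recent failure defines what "identical" means
    } else if (current !== key) {
      break;
    }
    streak++;
  }
  return streak;
}

// True once the same failing call has been repeated `maxIdentical` times.
function shouldForceDifferentAction(
  history: readonly ToolAttempt[],
  maxIdentical = 2,
): boolean {
  return identicalFailureStreak(history) >= maxIdentical;
}
```

With a limit of 2, the four consecutive edit attempts described above would have been stopped after the second failure.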

3. No graceful degradation to text-only mode

When all tool calls are failing (due to provider auth issues, network failures, or wedged RPCs), the agent should be able to fall back to conversational responses. Currently, it simply hangs.

4. Silent failure — no user notification

When the agent enters a hung state, the user receives no notification. Messages appear delivered in Telegram but no response ever arrives. There is no "I'm experiencing issues, please stand by" fallback message.

5. Context loss on recovery

The only reliable recovery method is deleting sessions.json and restarting the gateway. This destroys all session context, meaning the agent loses the entire conversation history and any work-in-progress. There is no way to abort a stuck run while preserving session state. Commands like openclaw run abort --all, openclaw session flush, and openclaw gateway restart --flush-pending do not exist.

Compounding Factors

During the incidents, multiple provider failures amplified the problem:

  • Anthropic: Invalid bearer token (401), credit balance too low, Provider in cooldown
  • Network: Repeated TypeError: fetch failed on outbound calls
  • Missing dependency: docker not found when agent attempted exec with Docker
  • LLM timeouts: Multiple LLM requests timed out after 600 seconds

With the primary model failing and fallback providers also experiencing issues, the agent had no reliable path to complete a turn.

Workaround Applied

Set agents.defaults.timeoutSeconds: 60 in openclaw.json to reduce hang time from 10 minutes to 1 minute. This mitigates but does not fix the underlying issues.

Requested Features

P0 — Critical

  1. Per-tool-call timeout (separate from agent run timeout): Hard-kill individual tool calls after 30-60 seconds, return an error to the model, and let it continue the turn without that tool's result.

  2. Stuck run abort without context loss: A command like openclaw run abort that kills the active run but preserves session history, so the agent can resume on the next message without amnesia.

  3. Automatic hang detection + recovery: A built-in watchdog that detects when a session has been in processing state for longer than timeoutSeconds, automatically aborts the run, and sends a notification to the user on the active channel.
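
The watchdog in item 3 could be as small as a periodic sweep over active runs. The TypeScript sketch below is one possible shape, assuming an in-memory list of runs. ActiveRun, Notify, and sweepStuckRuns are hypothetical names, and clearing startedAtMs stands in for whatever a real abort would involve.

```typescript
// Hypothetical view of a session's active run; not OpenClaw's session schema.
interface ActiveRun {
  sessionId: string;
  channel: string;
  startedAtMs: number | null; // null when the session is idle
}

type Notify = (channel: string, sessionId: string, message: string) => void;

const HANG_NOTICE =
  "⚠️ I'm experiencing issues processing your request. I'll be back shortly.";

// Abort every run that has been processing longer than `timeoutSeconds`
// and tell the user on the channel the run belongs to.
function sweepStuckRuns(
  runs: ActiveRun[],
  nowMs: number,
  timeoutSeconds: number,
  notify: Notify,
): string[] {
  const aborted: string[] = [];
  for (const run of runs) {
    if (run.startedAtMs === null) continue; // idle, nothing to check
    if (nowMs - run.startedAtMs <= timeoutSeconds * 1000) continue;
    run.startedAtMs = null; // clear only the run marker; history is untouched
    notify(run.channel, run.sessionId, HANG_NOTICE);
    aborted.push(run.sessionId);
  }
  return aborted;
}
```

In practice this would be driven by a timer, for example calling sweepStuckRuns every few seconds with Date.now(). The key design point is that the sweep resets the run state and leaves the session history alone.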

P1 — High

  1. Circuit breaker on tool failures: After N consecutive tool call failures (configurable, default 3), automatically disable tool calls for the session and continue in text-only mode. Notify the user that tools are temporarily unavailable.

  2. Fallback message on hang: If the agent cannot complete a turn within timeoutSeconds, send a predefined message to the user (e.g., "⚠️ I'm experiencing issues processing your request. I'll be back shortly.") rather than going completely silent.

  3. Model-level retry guardrail: Detect when the model is retrying the same failing tool call with identical parameters and force a different action (skip, rephrase, or bail to text response).
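
As a rough sketch of the circuit breaker in item 1, the TypeScript class below trips after N consecutive failures and stays open until it is explicitly reset. ToolCircuitBreaker and its members are hypothetical names; the default threshold of 3 comes from the request above, while the rest is an assumption about how such a breaker might behave.

```typescript
// Minimal consecutive-failure circuit breaker for tool calls (illustrative only).
class ToolCircuitBreaker {
  private consecutiveFailures = 0;
  private tripped = false;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // Record the outcome of one tool call.
  record(success: boolean): void {
    if (success) {
      this.consecutiveFailures = 0; // any success ends the failure streak
      return;
    }
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.tripped = true; // from here on the session runs text-only
    }
  }

  // When false, the agent should answer conversationally and skip tools.
  get toolsEnabled(): boolean {
    return !this.tripped;
  }

  // Re-enable tools, e.g. after a cooldown or an operator command.
  reset(): void {
    this.consecutiveFailures = 0;
    this.tripped = false;
  }
}
```

The agent loop would check toolsEnabled before offering tools to the model, and send the "tools are temporarily unavailable" notice at the moment the breaker trips.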

P2 — Important

  1. Persistent conversation memory: Session context should survive gateway restarts and session file resets. Chat history from channels (Telegram, web UI) should be reloadable into a new session so context recovery doesn't require manual intervention.

  2. Health monitoring endpoint: An API or CLI command that reports whether the agent is currently stuck, how long it's been processing, and what tool call it's waiting on — enabling external monitoring and alerting.
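
For the health endpoint in item 2, the payload could be derived from two pieces of run state. The TypeScript sketch below shows one possible shape. RunProbe, AgentHealth, and describeHealth are hypothetical names, and how the gateway would expose the result (HTTP route, CLI command) is left open.

```typescript
// Hypothetical inputs: when the current run started and which tool it awaits.
interface RunProbe {
  startedAtMs: number | null; // null when no run is active
  pendingTool: string | null;
}

// Hypothetical report shape for external monitoring.
interface AgentHealth {
  stuck: boolean;
  processingForSeconds: number;
  waitingOnTool: string | null;
}

// Summarize whether the agent is stuck, for how long, and on what.
function describeHealth(
  run: RunProbe,
  nowMs: number,
  timeoutSeconds: number,
): AgentHealth {
  if (run.startedAtMs === null) {
    return { stuck: false, processingForSeconds: 0, waitingOnTool: null };
  }
  const processingForSeconds = Math.floor((nowMs - run.startedAtMs) / 1000);
  return {
    stuck: processingForSeconds > timeoutSeconds,
    processingForSeconds,
    waitingOnTool: run.pendingTool,
  };
}
```

An external monitor could poll this and alert as soon as stuck turns true, instead of waiting for a user to notice the silence.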

Impact

  • 8+ hours of total downtime across 3 incidents in one day
  • Complete context loss on each recovery (3 times)
  • Zero productive work accomplished through the agent
  • Manual babysitting required — defeats the purpose of an autonomous agent

This is a fundamental reliability gap. An always-on agent that can't recover from a failed tool call without human intervention and context destruction is not production-ready for autonomous operation.


Filed by the Architect, with diagnostic assistance from Claude (Anthropic).

Labels: bug
