Session deadlock: dangling toolCall with no toolResult blocks agent loop permanently #53889

@zijunl

Bug: Session deadlock on lost toolResult

Summary

When a tool execution result is lost (network/IPC failure, process killed, timeout), the JSONL session log ends with a toolCall message but no corresponding toolResult. On the next message, the agent loop blocks forever, waiting for a result that will never arrive.

Reproduction

  1. Agent issues an exec tool call that launches a background process with &
  2. Before the toolResult is written to JSONL, the connection is lost / process is interrupted
  3. Session JSONL ends with a role: assistant message containing type: toolCall
  4. Any subsequent message sent to the session is silently dropped; the agent never processes it
  5. The log shows only "typing TTL reached (2m); stopping typing indicator"

Observed in logs

2026-03-24T11:37:41-07:00: typing TTL reached (2m); stopping typing indicator

The session file had 61 lines; the last line was an assistant toolCall for exec with no toolResult following it. The session status showed "running" indefinitely.

Root Cause

The agent loop likely reads pending toolCalls from JSONL and waits for their results before processing new inbound messages. When a result is lost (IPC drop, timeout), there is no recovery path — the session deadlocks silently.

Proposed Fixes

1. Session startup guard (highest priority)
On session load/resume, scan JSONL for any toolCall entries with no matching toolResult. Auto-inject a synthetic error toolResult before allowing new messages in:

{
  "role": "toolResult",
  "toolCallId": "<dangling-id>",
  "toolName": "<name>",
  "content": [{"type": "text", "text": "[tool execution lost — session was interrupted]"}],
  "isError": true
}
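A minimal sketch of the scan described above, assuming each JSONL line wraps a message as in the manual workaround further down. The `id` and `name` fields on a toolCall content part are hypothetical, as are the function names; only `role`, `toolCallId`, `toolName`, `content`, `isError`, and `type: toolCall` come from this report.

```typescript
// Assumed shapes; the real session schema may differ.
interface ContentPart { type: string; text?: string; id?: string; name?: string }
interface Message {
  role: string;
  content?: ContentPart[];
  toolCallId?: string;
  toolName?: string;
  isError?: boolean;
}
interface Entry { type: string; id?: string; message?: Message }
interface Dangling { toolCallId: string; toolName: string }

// Return every toolCall that has no toolResult with a matching toolCallId.
function findDanglingToolCalls(jsonl: string): Dangling[] {
  const pending = new Map<string, string>(); // toolCallId -> toolName
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    let entry: Entry;
    try {
      entry = JSON.parse(line) as Entry;
    } catch {
      continue; // tolerate a truncated or corrupt line
    }
    const msg = entry.message;
    if (!msg) continue;
    if (msg.role === "assistant") {
      for (const part of msg.content ?? []) {
        if (part.type === "toolCall" && part.id) {
          pending.set(part.id, part.name ?? "unknown");
        }
      }
    } else if (msg.role === "toolResult" && msg.toolCallId) {
      pending.delete(msg.toolCallId);
    }
  }
  return Array.from(pending, ([toolCallId, toolName]) => ({ toolCallId, toolName }));
}

// Build a synthetic error result, in the shape proposed above, for one dangling call.
function syntheticToolResult(d: Dangling): Message {
  return {
    role: "toolResult",
    toolCallId: d.toolCallId,
    toolName: d.toolName,
    content: [{ type: "text", text: "[tool execution lost]" }],
    isError: true,
  };
}
```

On session load, each entry returned by `findDanglingToolCalls` would be turned into a synthetic result and appended to the log before any inbound message is processed.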

2. exec tool hard timeout + guaranteed write-back
The exec tool should:

  • Have a hard deadline (suggested: 30s default, configurable)
  • Write the toolResult to JSONL before returning to caller, even on failure/timeout
  • Never silently drop results on network/IPC failure
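The three requirements above could be combined in one wrapper. This is a sketch under assumptions, not OpenClaw's actual code: `runWithDeadline`, `ToolResult`, and the `persist` callback (standing in for "append a toolResult line to the session JSONL") are all hypothetical names.

```typescript
interface ToolResult { toolCallId: string; toolName: string; text: string; isError: boolean }

// Run `work` under a hard deadline and always persist exactly one result,
// whether the work succeeds, throws, or times out.
async function runWithDeadline(
  toolCallId: string,
  toolName: string,
  work: () => Promise<string>,
  persist: (r: ToolResult) => Promise<void> | void,
  deadlineMs: number = 30000, // default suggested in this report
): Promise<ToolResult> {
  // The executor runs synchronously, so rejectTimeout is set before the timer fires.
  let rejectTimeout: (e: Error) => void = () => {};
  const timeout = new Promise<never>((_, reject) => { rejectTimeout = reject; });
  const timer = setTimeout(
    () => rejectTimeout(new Error(`timed out after ${deadlineMs}ms`)),
    deadlineMs,
  );

  let result: ToolResult;
  try {
    const text = await Promise.race([work(), timeout]);
    result = { toolCallId, toolName, text, isError: false };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    result = { toolCallId, toolName, text: `[tool failed: ${reason}]`, isError: true };
  } finally {
    clearTimeout(timer);
  }

  await persist(result); // write-back completes before the caller sees the result
  return result;
}
```

Note that the timeout here only stops waiting; it does not cancel the underlying work. A real exec tool would also need to kill the child process when the deadline passes.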

3. Background commands (&) should return immediately
If a command is backgrounded, the tool should return immediately with the PID instead of waiting indefinitely for the process to exit.
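One way to get this behavior in Node is to spawn the process detached and unreferenced. The sketch below uses the standard `child_process.spawn` options; the function name `startBackground` is hypothetical, and how the exec tool actually launches commands is not known from this report.

```typescript
import { spawn } from "node:child_process";

// Start `command` in the background via the shell and return its PID right away.
// Output is discarded here; a real tool would likely redirect it to a log file.
function startBackground(command: string): number | undefined {
  const child = spawn(command, {
    shell: true,     // let the shell parse the command string
    detached: true,  // separate process group, so it can outlive the parent
    stdio: "ignore", // no pipes that would keep the parent attached
  });
  child.on("error", () => {
    // Spawn failure: swallow the event so it does not crash the caller.
  });
  child.unref();     // do not keep the parent's event loop alive for the child
  return child.pid;  // undefined if the process could not be spawned
}
```

The returned PID can then be written into the toolResult straight away, so the session log never ends on an unanswered toolCall.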

Workaround (manual)

Append a synthetic toolResult line to the JSONL matching the dangling toolCallId:

{"type":"message","id":"<uuid>","parentId":"<dangling-msg-id>","timestamp":"...","message":{"role":"toolResult","toolCallId":"<dangling-id>","toolName":"exec","content":[{"type":"text","text":"[tool execution lost]"}],"isError":true,"timestamp":...}}

This unblocks the session immediately without a full reset.
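For repeated use, the line can be generated rather than typed by hand. This is a sketch only: `buildRepairLine` is a hypothetical helper, and the timestamp formats (ISO 8601 for the outer field, epoch milliseconds for the inner one) are guesses, since the example above elides them.

```typescript
import { randomUUID } from "node:crypto";

// Build one JSONL line in the envelope shown above.
// Assumption: outer timestamp is an ISO 8601 string, inner is epoch milliseconds.
function buildRepairLine(
  danglingMsgId: string,
  toolCallId: string,
  toolName: string = "exec",
  now: Date = new Date(),
): string {
  return JSON.stringify({
    type: "message",
    id: randomUUID(),
    parentId: danglingMsgId,
    timestamp: now.toISOString(),
    message: {
      role: "toolResult",
      toolCallId,
      toolName,
      content: [{ type: "text", text: "[tool execution lost]" }],
      isError: true,
      timestamp: now.getTime(),
    },
  });
}
```

The result would be appended to the session file with a trailing newline, for example via `fs.appendFileSync(path, line + "\n")`, ideally after backing up the file and while the agent is not writing to it.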

Impact

  • Session silently stops responding to all messages
  • No error surfaced to user — looks like the bot is just slow
  • Short of the manual workaround above, the only recovery is a full session reset (loses context)
  • Reproducible whenever exec tool result is lost mid-flight

Environment

  • OpenClaw version: current (March 2026)
  • macOS Darwin 25.3.0 arm64
  • Node v25.6.1
