pdf tool can hang indefinitely, blocking session and preventing /new /restart recovery #68649

@crisandrews

Bug Description

The pdf tool can hang indefinitely when fetching/processing a large remote PDF, causing the entire agent session to become a zombie. While the session is stuck:

  • No new messages from the user are processed (they queue behind the stuck tool call)
  • Slash commands (/new, /restart) are also enqueued and cannot interrupt the stuck run
  • /tasks reports "All clear" even though the session is effectively dead
  • The only recovery is a full gateway restart

Steps to Reproduce

  1. In a direct WhatsApp chat, ask the agent to research a topic
  2. Agent does multiple web_search calls (all succeed)
  3. Agent calls the pdf tool on a large remote PDF (in this case, a 244-page PDF from www-cdn.anthropic.com)
  4. The pdf tool call never returns — no result, no error, no timeout
  5. Session becomes a zombie
  6. All subsequent messages (including /new, /restart) queue behind the stuck tool call and are never processed

Evidence from Logs

# Last entry in session transcript — pdf tool call with no response
timestamp: 2026-04-18T16:41:04.934Z
role: assistant
toolCall: pdf
  pdf: "https://www-cdn.anthropic.com/08ab9158070959f88f296514c21b7facce6f52bc.pdf"
  pages: "1-20"
  prompt: "Find the exact section(s) that mention..."
# NO toolResult follows — session ends here

# Gateway diagnostic warning
lane wait exceeded: lane=nested waitedMs=164407 queueAhead=0

Expected Behavior

  1. The pdf tool should have a timeout (e.g., 60-120 seconds) after which it returns an error
  2. /new and /restart should be able to interrupt a stuck tool execution, not queue behind it
  3. If a tool call hangs, the session should eventually recover on its own rather than requiring a gateway restart
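
A minimal sketch of the timeout in item 1, assuming the tool body can be wrapped in a function that accepts an AbortSignal. `runWithTimeout` and `ToolTimeoutError` are hypothetical names for illustration, not OpenClaw's actual API:

```typescript
// Hypothetical helper: not part of OpenClaw's actual API.
class ToolTimeoutError extends Error {
  constructor(toolName: string, timeoutMs: number) {
    super(`${toolName} tool timed out after ${timeoutMs} ms`);
    this.name = "ToolTimeoutError";
  }
}

// Runs `task` and rejects with ToolTimeoutError if it has not settled within
// `timeoutMs`. The AbortSignal lets the task cancel in-flight work, for
// example by passing it to fetch().
async function runWithTimeout<T>(
  toolName: string,
  timeoutMs: number,
  task: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new ToolTimeoutError(toolName, timeoutMs));
    }, timeoutMs);
  });
  try {
    // Whichever settles first wins: the tool result or the timeout error.
    return await Promise.race([task(controller.signal), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```

With a wrapper like this, a fetch that never completes would surface as a tool error the agent can report, instead of leaving the transcript without a toolResult.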

Environment

  • OpenClaw version: 2026.4.14 → 2026.4.15 (issue present in both)
  • OS: macOS 26.3 (arm64)
  • Channel: WhatsApp
  • Model: openai-codex/gpt-5.4 (session fell back from claude-opus-4-6 due to a separate auth issue)

Impact

  • User-facing: complete loss of the agent with no way to recover except restarting the gateway (which affects all agents)
  • The session appears alive (gateway shows it as active, updated recently) but is actually dead
  • Users send multiple messages thinking the agent is slow, which only deepens the queue

Suggested Fix

  1. Add a configurable timeout to the pdf tool (default ~120s)
  2. Allow /new and /restart to bypass the run queue and force-reset the session
  3. Consider a session-level watchdog that detects tool calls that have been pending beyond a threshold
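
A rough sketch of items 2 and 3, under the assumption that each session processes work through a serial lane. `SessionLane` and its methods are hypothetical and do not reflect OpenClaw's actual internals. Dropping the backlog on reset is also an assumption about what "force-reset" should mean:

```typescript
// Hypothetical sketch: class and method names are not OpenClaw's actual API.
type Task = (signal: AbortSignal) => Promise<void>;

class SessionLane {
  private queue: Task[] = [];
  private current: { controller: AbortController; startedAt: number } | null = null;

  // Normal messages run one at a time, in arrival order.
  enqueue(task: Task): void {
    this.queue.push(task);
    this.pump();
  }

  // Control commands such as /new or /restart would call this directly
  // instead of enqueueing: abort the running task, drop the backlog, and
  // free the lane. Returns the number of queued tasks that were dropped.
  forceReset(): number {
    const dropped = this.queue.length;
    this.queue = [];
    this.current?.controller.abort();
    this.current = null;
    return dropped;
  }

  // How long the running task has been pending, or 0 if the lane is idle.
  pendingMs(now: number = Date.now()): number {
    return this.current ? now - this.current.startedAt : 0;
  }

  // Watchdog check: reset the lane if a task has been pending too long.
  checkWatchdog(thresholdMs: number, now: number = Date.now()): boolean {
    if (this.pendingMs(now) <= thresholdMs) return false;
    this.forceReset();
    return true;
  }

  private pump(): void {
    if (this.current || this.queue.length === 0) return;
    const task = this.queue.shift()!;
    const run = { controller: new AbortController(), startedAt: Date.now() };
    this.current = run;
    const done = () => {
      // Ignore completions from a task that was already reset away.
      if (this.current !== run) return;
      this.current = null;
      this.pump();
    };
    task(run.controller.signal).then(done, done);
  }
}
```

A periodic timer calling `checkWatchdog` would cover the case where nobody sends /new or /restart, and `pendingMs` is the kind of value /tasks could report instead of "All clear".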

Labels

  • P2 - Normal backlog priority with limited blast radius.
  • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:crash-loop - Crash, hang, restart loop, or process-level availability failure.
  • impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
