Problem
When a session or tool becomes stuck despite the abort signal (e.g., the abort handler fails, the shell hard-stop does not kill the process, or a provider ignores cancellation), there is no safety net to detect and recover.
Orphaned tool parts on crash
If the process exits (crash or clean shutdown) while tool executions are in-flight, their database state remains "running" forever. On restart, these orphaned parts are never cleaned up.
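A startup sweep for these orphans could look roughly like the sketch below. It assumes parts can be loaded into memory as plain objects. The `ToolPart` shape, the `recoverOrphanedParts` name, and the error message are hypothetical and not taken from the codebase.

```typescript
// Hypothetical part shape; the real schema may differ.
type ToolStatus = "pending" | "running" | "completed" | "error";

interface ToolPart {
  id: string;
  sessionID: string;
  status: ToolStatus;
  error?: string;
}

// Intended to run once at startup, before any new tool executes.
// At that point nothing in this process has started a tool yet, so any part
// still marked "running" must have been left behind by a previous process.
function recoverOrphanedParts(parts: ToolPart[]): ToolPart[] {
  const recovered: ToolPart[] = [];
  for (const part of parts) {
    if (part.status !== "running") continue;
    part.status = "error";
    part.error = "Tool execution was interrupted by process exit";
    recovered.push(part);
  }
  return recovered;
}
```

The function returns the parts it changed so that the caller can persist them and log how many were recovered.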
Stuck tools during runtime
A tool part that enters "running" but never transitions to "completed" or "error" — due to a missed abort signal, a deadlocked child process, or a provider that never responds — is invisible to the system. No periodic check detects or recovers it.
Idle sessions
Sessions where no activity occurs for an extended period (e.g., a subagent waiting on a prompt that will never come) are never detected or cancelled.
Expected behavior
- On startup, orphaned "running" tool parts from the previous process should be marked as errored
- A periodic watchdog should detect tool parts stuck beyond a configurable timeout and cancel their sessions
- Leaf-level filtering: only force-error actual stuck tools, not task tools that are waiting on child sessions (let normal error propagation handle those)
- Idle sessions with no activity beyond a configurable threshold should be cancelled
- Configurable via experimental.tool_timeout, experimental.task_timeout, experimental.idle_timeout
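One pass of the periodic watchdog described above might be sketched as follows. All type and function names are hypothetical. Two assumptions are made for illustration: timeouts are in milliseconds, and a task tool is identified by `tool === "task"`. The sketch leaves out experimental.task_timeout because its exact semantics are not specified here.

```typescript
// Hypothetical config; the fields stand in for experimental.tool_timeout
// and experimental.idle_timeout, assumed to be in milliseconds.
interface WatchdogConfig {
  toolTimeoutMs: number;
  idleTimeoutMs: number;
}

interface RunningPart {
  id: string;
  sessionID: string;
  tool: string; // assumption: "task" marks a tool waiting on a child session
  startedAt: number; // epoch milliseconds
}

interface SessionInfo {
  id: string;
  lastActivityAt: number; // epoch milliseconds
}

interface ScanResult {
  stuckPartIDs: string[];
  sessionsToCancel: string[];
}

// One watchdog pass over the currently running parts and known sessions.
function scan(
  parts: RunningPart[],
  sessions: SessionInfo[],
  config: WatchdogConfig,
  now: number,
): ScanResult {
  const stuckPartIDs: string[] = [];
  const toCancel = new Set<string>();

  for (const part of parts) {
    // Leaf-level filtering: task tools are skipped so that the child's
    // failure reaches the parent through normal error propagation.
    if (part.tool === "task") continue;
    if (now - part.startedAt > config.toolTimeoutMs) {
      stuckPartIDs.push(part.id);
      toCancel.add(part.sessionID);
    }
  }

  for (const session of sessions) {
    if (now - session.lastActivityAt > config.idleTimeoutMs) {
      toCancel.add(session.id);
    }
  }

  return { stuckPartIDs, sessionsToCancel: [...toCancel] };
}
```

Passing `now` as an argument keeps the pass deterministic and easy to test. A caller would typically invoke it from a timer with `Date.now()`, then mark the returned parts as errored and cancel the returned sessions.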
Relationship
This is a safety-net complement to #20096 (tool timeout). While #20096 prevents new hangs, this catches cases where the timeout mechanism itself is bypassed.