Summary
A live Telegram agent turn wedged when a bash tool call ran:
openclaw sessions --agent dex --limit 10 --json
The command appears to query session state through the same gateway/session system that is executing the active tool call. The result was a blocked tool call that stalled the topic, survived normal restart drain attempts, and required systemd SIGKILL after the drain timeout.
Version
OpenClaw 2026.5.14-beta.1 (cef4145)
Gateway service:
openclaw-gateway.service - OpenClaw Gateway (v2026.5.14-beta.1)
node /home/clawadmin/.npm-global/lib/node_modules/openclaw/dist/index.js gateway --port 18789
Evidence
Diagnostics repeatedly detected the stuck state but had no recovery path:
2026-05-14T19:01:03.584-06:00 [diagnostic] stalled session: sessionId=991d536d-8ea4-4058-b02a-a0cf45ed9f14 sessionKey=agent:main:telegram:group:-1003821464158:topic:4836 state=processing age=142s queueDepth=1 reason=blocked_tool_call classification=blocked_tool_call activeWorkKind=tool_call lastProgress=codex_app_server:notification:rawResponseItem/completed lastProgressAge=142s activeTool=bash activeToolCallId=exec-7c0d240d-fc1a-44c7-b98e-0c09f0aa9061 activeToolAge=147s terminalProgressStale=true recovery=none
Restart attempted to drain active work instead of killing/reaping the stale tool call quickly:
2026-05-14T19:03:57.796-06:00 [gateway] draining 4 active task(s) and 2 active embedded run(s) before restart with timeout 300000ms
2026-05-14T19:04:27.798-06:00 [gateway] still draining 4 active task(s) and 2 active embedded run(s) before restart
openclaw-gateway.service: State 'stop-sigterm' timed out. Killing.
openclaw-gateway.service: Killing process 260558 (node) with signal SIGKILL.
The same pattern appeared again after restart:
2026-05-14T19:07:31.965-06:00 [diagnostic] stalled session: sessionId=991d536d-8ea4-4058-b02a-a0cf45ed9f14 sessionKey=agent:main:telegram:group:-1003821464158:topic:4836 state=processing age=134s queueDepth=1 reason=blocked_tool_call classification=blocked_tool_call activeWorkKind=tool_call lastProgress=codex_app_server:notification:item/completed lastProgressAge=134s activeTool=bash activeToolCallId=exec-7b8662e0-1940-4afd-93f4-bca1a71ca8bf activeToolAge=138s recovery=none
Expected behavior
- Tool calls should have bounded timeout/cancel behavior.
- Gateway restart should force-kill stale active tool calls after a short drain window.
- If openclaw sessions is unsafe from inside an active agent turn, it should fail fast with a clear diagnostic.
- If diagnostics can classify blocked_tool_call, recovery should not be none.
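The first expectation, bounded timeout/cancel behavior for tool calls, could look roughly like the Python sketch below. This is not OpenClaw code: run_tool_call, ToolResult, and the timeout values are hypothetical names chosen for illustration, and the sketch assumes a POSIX host where each tool call runs as a child process in its own process group.

```python
import os
import signal
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToolResult:
    timed_out: bool
    returncode: Optional[int]
    output: str


def run_tool_call(argv: List[str], timeout_s: float) -> ToolResult:
    """Run a tool command; SIGKILL its whole process group after timeout_s."""
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        # New session => new process group, so grandchildren started by the
        # tool (e.g. a CLI launched from bash) are killed together.
        start_new_session=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return ToolResult(False, proc.returncode, out)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        out, _ = proc.communicate()  # reap the child and collect partial output
        return ToolResult(True, proc.returncode, out)
```

With a wrapper of this shape, a call such as run_tool_call(["sleep", "30"], 0.5) returns with timed_out set instead of leaving the session in state=processing indefinitely.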
Actual behavior
- The topic remained stuck waiting for bash.
- Bash was waiting on openclaw sessions.
- Gateway restart tried to drain the stuck work and only recovered after systemd killed the process.
- The diagnostic correctly identified blocked_tool_call but had no recovery path.
Local mitigation
We added a local wrapper guard that blocks only openclaw sessions ... invocations coming from live agent/tool-call ancestry, and logs each block. This is a temporary circuit breaker, not a product fix. Normal openclaw --version still works, and openclaw sessions --help works when the local guard is disabled.
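For readers who want a similar stopgap, the sketch below shows one way such an ancestry check could be structured. It is not the wrapper described above. The process table is passed in as plain data (on Linux it could be built from /proc), and the marker string, function names, and PIDs are all hypothetical; the marker is only a guess based on the gateway command line shown under Version.

```python
from typing import Dict, List, Sequence, Tuple

# pid -> (parent pid, command line). Assumed to be collected by the caller.
ProcTable = Dict[int, Tuple[int, str]]

# Hypothetical marker for "this ancestor is the live gateway/agent process".
AGENT_MARKERS = ("openclaw/dist/index.js gateway",)


def has_agent_ancestor(pid: int, table: ProcTable,
                       markers: Sequence[str] = AGENT_MARKERS) -> bool:
    """Walk parent links from pid and report whether any ancestor matches."""
    seen = set()
    current = table.get(pid, (0, ""))[0]  # start at the parent, not pid itself
    while current in table and current not in seen:
        seen.add(current)  # guards against malformed, cyclic tables
        parent, cmdline = table[current]
        if any(marker in cmdline for marker in markers):
            return True
        current = parent
    return False


def should_block(argv: List[str], pid: int, table: ProcTable) -> bool:
    """Block only the sessions subcommand, and only under agent ancestry."""
    is_sessions = len(argv) >= 2 and argv[1] == "sessions"
    return is_sessions and has_agent_ancestor(pid, table)
```

Restricting the check to the sessions subcommand keeps unrelated invocations such as openclaw --version unaffected, which matches the scope of the mitigation described above.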
Suggested fixes
- Add hard timeout/cancellation around tool calls.
- Make openclaw sessions safe from active turns or explicitly reject it in that context.
- On restart, force-reap stale tool calls after a short drain period.
- Add recovery behavior for classification=blocked_tool_call instead of recovery=none.
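The restart suggestion (a short drain period, then force-reap) can be illustrated with a small asyncio sketch in Python. It is an assumption-laden illustration rather than the gateway's actual shutdown path: drain_then_reap, demo, and the 0.2 s window are hypothetical, and in-process cancellation is cooperative, so a task blocked on a child process would also need that process killed.

```python
import asyncio
from typing import Set, Tuple


async def drain_then_reap(
    tasks: Set[asyncio.Task], drain_s: float
) -> Tuple[Set[asyncio.Task], Set[asyncio.Task]]:
    """Give tasks drain_s seconds to finish, then cancel the stragglers."""
    if not tasks:
        return set(), set()
    done, pending = await asyncio.wait(tasks, timeout=drain_s)
    for task in pending:
        task.cancel()  # force-reap instead of waiting out a long drain timeout
    if pending:
        # Wait for the cancellations to land; swallow the CancelledErrors.
        await asyncio.gather(*pending, return_exceptions=True)
    return done, pending


async def demo() -> Tuple[bool, bool]:
    quick = asyncio.create_task(asyncio.sleep(0.01))  # finishes during drain
    stuck = asyncio.create_task(asyncio.sleep(60))    # stands in for a wedged tool call
    done, reaped = await drain_then_reap({quick, stuck}, drain_s=0.2)
    return quick in done, stuck in reaped and stuck.cancelled()
```

The point of the design is that restart latency is bounded by the drain window rather than by the slowest stuck task, so the service manager's stop timeout never has to fire.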