Summary
OpenClaw 2026.5.22 can deadlock an embedded Pi agent turn at the tool-call boundary. In production this made the Gateway appear active/running while Feishu replies, /health, /status, and openclaw gateway status became intermittently or fully unresponsive. This is a catastrophic Gateway availability bug because one agent turn can effectively starve the main Gateway process.
Impact
- Affected runtime: observed on 2026.5.22.
- Last known stable runtime in this environment: 2026.5.19.
- Channel impact: Feishu WebSocket remained connected at first, but replies and probes stalled.
- Operational impact: systemd still reported active/running, so normal service checks were misleading.
Observed diagnostics
Representative liveness diagnostics from the affected runtime:
liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
eventLoopDelayP99Ms=43050.3
eventLoopUtilization=1
active=... processing/model_call ... last=model_call:started | ... processing/embedded_run ...
liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
eventLoopDelayP99Ms=53888.4
eventLoopUtilization=1
active=... processing/tool_call ... last=tool:exec:started
Additional symptoms:
- per-chat task exceeded 300000ms cap
- Feishu WebSocket reconnects
- Gateway HTTP probes time out despite the process still listening on 127.0.0.1:18789
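For readers who want to reproduce comparable numbers outside the Gateway, the sketch below samples event loop delay and utilization with Node's perf_hooks module. It assumes the eventLoopDelayP99Ms and eventLoopUtilization fields are derived from these Node metrics, which this report does not confirm. startLoopSampler and LoopSample are hypothetical names, not OpenClaw APIs.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

// Hypothetical helper for illustration; this is not OpenClaw's liveness monitor.
export interface LoopSample {
  delayP99Ms: number;
  utilization: number;
}

export function startLoopSampler(resolutionMs = 20): { sample(): LoopSample; stop(): void } {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let lastElu = performance.eventLoopUtilization();
  return {
    sample(): LoopSample {
      // Passing the previous reading yields utilization for the interval since then.
      const delta = performance.eventLoopUtilization(lastElu);
      lastElu = performance.eventLoopUtilization();
      const result: LoopSample = {
        delayP99Ms: histogram.percentile(99) / 1e6, // the histogram reports nanoseconds
        utilization: delta.utilization,
      };
      histogram.reset();
      return result;
    },
    stop(): void {
      histogram.disable(); // stops the internal timer so the process can exit
    },
  };
}
```

A p99 delay in the tens of seconds together with utilization of 1, as in the logs above, would mean timers and I/O callbacks were waiting that long for a turn on the loop.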
Root cause analysis
The issue appears to be in the 2026.5.22-era Pi embedded session write-lock changes around:
src/agents/pi-embedded-runner/run/attempt.session-lock.ts
The problematic interaction is:
- installAwaitableSessionEventQueue() wraps _handleAgentEvent() so that event handling is tracked by, and awaited through, session._agentEventQueue.
- installSessionExternalHookWriteLock() wraps agent.beforeToolCall and calls waitForSessionEventQueue(session) before taking the write lock.
- During normal event processing, the current _agentEventQueue entry handles a tool_call event and invokes beforeToolCall.
- The wrapped beforeToolCall then waits for session._agentEventQueue, which is the same queue promise that contains the current event handler.
That creates a self-wait/deadlock:
_handleAgentEvent
  -> _agentEventQueue current entry
    -> _processAgentEvent(tool_call)
      -> agent.beforeToolCall()
        -> waitForSessionEventQueue(session)
          -> waits for current _agentEventQueue entry to complete
The current entry cannot complete because it is waiting inside itself.
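The self-wait can be reproduced in isolation with a plain promise chain. The sketch below is a simplified model, not the OpenClaw code: DemoQueueSession, enqueueEntry, hookThatDrainsQueue, and selfWaitHangs are hypothetical names, and the queue is assumed to be a single chained promise as described above.

```typescript
// Hypothetical stand-ins for the session queue and the wrapped hook.
type DemoQueueSession = { queue: Promise<void> };

function enqueueEntry(session: DemoQueueSession, task: () => Promise<void>): Promise<void> {
  // Each entry runs after the previous one; the queue promise now includes this entry.
  const entry = session.queue.then(task);
  session.queue = entry.catch(() => undefined);
  return entry;
}

async function hookThatDrainsQueue(session: DemoQueueSession): Promise<void> {
  // When called from inside an entry, this awaits a promise that includes that same entry.
  await session.queue;
}

export async function selfWaitHangs(timeoutMs = 50): Promise<boolean> {
  const session: DemoQueueSession = { queue: Promise.resolve() };
  const entry = enqueueEntry(session, () => hookThatDrainsQueue(session));
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs);
  });
  const outcome = await Promise.race([entry.then(() => "done" as const), timedOut]);
  clearTimeout(timer);
  return outcome === "timeout"; // true means the entry never completed
}
```

In this model the entry stays pending forever, so the timeout always wins the race. No error is thrown, which matches the silent stall described above.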
Candidate fix
Track the currently executing session event queue entry with AsyncLocalStorage, and make waitForSessionEventQueue(session) return immediately only when it is called from inside that same active session event processing context.
External hook/cleanup/provider paths should still drain pending session events normally.
Minimal patch shape:
import { AsyncLocalStorage } from "node:async_hooks";

// Holds the session whose queue entry is currently executing in this async context.
const activeSessionEventProcessing = new AsyncLocalStorage<unknown>();

async function waitForSessionEventQueue(session: unknown): Promise<void> {
  // Hooks invoked by the queue entry itself must not wait for that same entry to finish.
  if (activeSessionEventProcessing.getStore() === session) {
    return;
  }
  // existing queue-drain logic...
}

// session, original, and params come from the enclosing installer function.
session["_processAgentEvent"] = async function lockedProcessAgentEvent(this: unknown, event: unknown) {
  return await activeSessionEventProcessing.run(session, async () => {
    if (!eventMayReachTranscriptWriters(session, event)) {
      return await original.call(this, event);
    }
    return await params.withSessionWriteLock(async () => await original.call(this, event));
  });
};
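The patch shape above depends on surrounding code, so here is a self-contained sketch of the same re-entrancy guard on a toy queue. It is a model under stated assumptions, not the actual fix: DemoSession, enqueueEvent, waitForQueue, and demoGuard are hypothetical names. Only AsyncLocalStorage, with its run() and getStore() methods, is a real Node API.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Illustrative stand-ins; none of these names are OpenClaw APIs.
type DemoSession = { queue: Promise<void> };
const activeEntry = new AsyncLocalStorage<DemoSession>();

function enqueueEvent(session: DemoSession, task: () => Promise<void>): Promise<void> {
  // Run the task inside a context that records which session's entry is executing.
  const entry = session.queue.then(() => activeEntry.run(session, task));
  session.queue = entry.catch(() => undefined);
  return entry;
}

async function waitForQueue(session: DemoSession): Promise<void> {
  if (activeEntry.getStore() === session) return; // re-entrant call: do not self-wait
  await session.queue; // external caller: drain pending entries
}

export async function demoGuard(): Promise<string[]> {
  const log: string[] = [];
  const session: DemoSession = { queue: Promise.resolve() };
  const entry = enqueueEvent(session, async () => {
    await waitForQueue(session); // would hang without the guard
    log.push("hook ran inside entry");
  });
  await waitForQueue(session); // external path still waits for the entry above
  log.push("external caller drained queue");
  await entry;
  return log;
}
```

The guard compares the stored value against the specific session rather than checking only that a store exists. That way a hook running inside one session's entry would still drain another session's queue normally.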
Regression test
Add a test that simulates the actual self-wait path:
_handleAgentEvent -> _agentEventQueue -> _processAgentEvent -> beforeToolCall
The test should prove the hook completes and does not time out when invoked from inside the active queue entry, while existing tests should continue to prove external hooks still drain queued session events.
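A regression test for a hang needs a bounded wait, so that a deadlock fails quickly rather than stalling the suite. The helper below is one possible shape. settlesWithin and demoSettles are hypothetical names, and the real test may rely on the test runner's own timeout instead.

```typescript
// Hypothetical test helper; rejects if the awaited work does not settle in time.
export async function settlesWithin<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`did not settle within ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    clearTimeout(timer); // avoid keeping the process alive after the race
  }
}

// Example: a hook that completes passes, and a promise that never settles fails fast.
export async function demoSettles(): Promise<[string, string]> {
  const ok = await settlesWithin(Promise.resolve("hook completed"), 100);
  const hung = await settlesWithin(new Promise<string>(() => {}), 20).catch(
    (error: Error) => error.message,
  );
  return [ok, hung];
}
```

In the real test, the promise passed to the helper would be the result of driving a tool_call event through _handleAgentEvent with the wrapped beforeToolCall installed.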
A local candidate fix passed:
node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
# 2 files passed, 66 tests passed
git diff --check -- src/agents/pi-embedded-runner/run/attempt.session-lock.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
# OK
XDG_DATA_HOME=/home/liao/.hermes/tmp/pnpm-data pnpm build
# OK
Candidate fix commit in fork:
e1765a6dcd fix(agents): avoid session event queue self-wait
Workaround
Roll back the Gateway runtime to 2026.5.19 and restart it externally. In this environment the 2026.5.19 rollback restored:
openclaw gateway status: OK
/health: 200
/status: 200
Feishu WS: ready
If the config was written by 2026.5.22, 2026.5.19 may refuse to start and exit with code 78. For an intentional rollback/recovery window, the service needs:
OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1
This should be treated as a recovery-only workaround, not a long-term fix.