bug: Pi session event queue self-wait can hang Gateway at tool calls #86093

@de1tydev

Summary

OpenClaw 2026.5.22 can deadlock an embedded Pi agent turn at the tool-call boundary. In production this made the Gateway appear active/running while Feishu replies, /health, /status, and openclaw gateway status became intermittently or fully unresponsive. This is a catastrophic Gateway availability bug because one agent turn can effectively starve the main Gateway process.

Impact

  • Affected runtime: observed on 2026.5.22.
  • Last known stable runtime in this environment: 2026.5.19.
  • Channel impact: Feishu WebSocket remained connected at first, but replies and probes stalled.
  • Operational impact: systemd still reported active/running, so normal service checks were misleading.

Observed diagnostics

Representative liveness diagnostics from the affected runtime:

liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayP99Ms=43050.3
  eventLoopUtilization=1
  active=... processing/model_call ... last=model_call:started | ... processing/embedded_run ...

liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu
  eventLoopDelayP99Ms=53888.4
  eventLoopUtilization=1
  active=... processing/tool_call ... last=tool:exec:started

Additional symptoms:

  • per-chat task exceeded 300000ms cap
  • Feishu WebSocket reconnects
  • Gateway HTTP probes time out despite the process still listening on 127.0.0.1:18789

Root cause analysis

The issue appears to be in the 2026.5.22-era Pi embedded session write-lock changes around:

src/agents/pi-embedded-runner/run/attempt.session-lock.ts

The problematic interaction is:

  1. installAwaitableSessionEventQueue() wraps _handleAgentEvent() so that event handling is tracked by, and can be awaited through, session._agentEventQueue.
  2. installSessionExternalHookWriteLock() wraps agent.beforeToolCall and calls waitForSessionEventQueue(session) before taking the write lock.
  3. During normal event processing, the current _agentEventQueue entry handles a tool_call event and invokes beforeToolCall.
  4. The wrapped beforeToolCall then waits for session._agentEventQueue, which is the same queue promise that contains the current event handler.

That creates a self-wait/deadlock:

_handleAgentEvent
  -> _agentEventQueue current entry
    -> _processAgentEvent(tool_call)
      -> agent.beforeToolCall()
        -> waitForSessionEventQueue(session)
          -> waits for current _agentEventQueue entry to complete

The current entry cannot complete because it is waiting inside itself.
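
The self-wait can be reproduced in isolation with a plain promise queue. The following is a minimal sketch, not the OpenClaw code: Session, enqueue, waitForQueue, settlesWithin, and demoSelfWait are hypothetical stand-ins for the session, the wrapped _handleAgentEvent(), waitForSessionEventQueue(), and a test helper. It only demonstrates that the queue entry never settles; it does not attempt to reproduce the liveness metrics shown above.

```typescript
// Hypothetical stand-in for the session; only the queue promise matters here.
type Session = { queue: Promise<void> };

// Chain a task onto the session queue (stand-in for the wrapped _handleAgentEvent).
function enqueue(session: Session, task: () => Promise<void>): Promise<void> {
  const entry = session.queue.then(task);
  // Keep the chain usable even if a task rejects.
  session.queue = entry.catch(() => undefined);
  return entry;
}

// Stand-in for waitForSessionEventQueue: wait for whatever is queued right now.
function waitForQueue(session: Session): Promise<void> {
  return session.queue;
}

// Report whether a promise settles within `ms` milliseconds.
function settlesWithin(p: Promise<unknown>, ms: number): Promise<"settled" | "timeout"> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), ms);
    const done = () => {
      clearTimeout(timer);
      resolve("settled");
    };
    p.then(done, done);
  });
}

async function demoSelfWait(): Promise<"settled" | "timeout"> {
  const session: Session = { queue: Promise.resolve() };
  const entry = enqueue(session, async () => {
    // Stand-in for beforeToolCall: waits on the queue that contains this very task.
    await waitForQueue(session);
  });
  return settlesWithin(entry, 50);
}
```

In this sketch demoSelfWait() resolves to "timeout": the entry waits on a promise that can only settle after the entry itself has finished.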

Candidate fix

Track the currently executing session event queue entry with AsyncLocalStorage, and make waitForSessionEventQueue(session) return immediately only when it is called from inside that same active session event processing context.

External hook/cleanup/provider paths should still drain pending session events normally.

Minimal patch shape:

import { AsyncLocalStorage } from "node:async_hooks";

const activeSessionEventProcessing = new AsyncLocalStorage<unknown>();

async function waitForSessionEventQueue(session: unknown): Promise<void> {
  // Hooks invoked by the queue entry itself must not wait for that same entry to finish.
  if (activeSessionEventProcessing.getStore() === session) {
    return;
  }
  // existing queue-drain logic...
}

session["_processAgentEvent"] = async function lockedProcessAgentEvent(this: unknown, event: unknown) {
  return await activeSessionEventProcessing.run(session, async () => {
    if (!eventMayReachTranscriptWriters(session, event)) {
      return await original.call(this, event);
    }
    return await params.withSessionWriteLock(async () => await original.call(this, event));
  });
};
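
The same guard can be exercised outside the codebase. The sketch below is an illustration under assumptions, not the candidate patch itself: Session, activeSession, enqueue, waitForQueue, and demoGuardedWait are hypothetical names, and the queue is a simplified promise chain. It shows a hook called from inside the active entry returning immediately, while a caller outside that context still drains the queue.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-in for the session; only the queue promise matters here.
type Session = { queue: Promise<void> };

// Holds the session whose queue entry is currently executing, if any.
const activeSession = new AsyncLocalStorage<Session>();

// Chain a task onto the queue and run it inside the session's async context.
function enqueue(session: Session, task: () => Promise<void>): Promise<void> {
  const entry = session.queue.then(() => activeSession.run(session, task));
  session.queue = entry.catch(() => undefined);
  return entry;
}

// Stand-in for waitForSessionEventQueue with the re-entrancy guard.
async function waitForQueue(session: Session): Promise<void> {
  if (activeSession.getStore() === session) {
    return; // called from inside the active entry: do not wait on ourselves
  }
  await session.queue; // external callers still drain pending events
}

async function demoGuardedWait(): Promise<string[]> {
  const log: string[] = [];
  const session: Session = { queue: Promise.resolve() };

  const entry = enqueue(session, async () => {
    await waitForQueue(session); // guarded: returns immediately
    log.push("hook finished inside entry");
  });

  // External caller, outside any activeSession.run() context.
  await waitForQueue(session);
  log.push("external caller drained queue");

  await entry;
  return log;
}
```

Here the external caller logs only after the in-entry hook has finished, which is the ordering the existing drain behavior is meant to preserve. Keying the store on the session object means a hook running for one session would still wait on another session's queue.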

Regression test

Add a test that simulates the actual self-wait path:

_handleAgentEvent -> _agentEventQueue -> _processAgentEvent -> beforeToolCall

The test should prove the hook completes and does not time out when invoked from inside the active queue entry, while existing tests should continue to prove external hooks still drain queued session events.

A local candidate fix passed:

node scripts/run-vitest.mjs src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
# 2 files passed, 66 tests passed

git diff --check -- src/agents/pi-embedded-runner/run/attempt.session-lock.ts src/agents/pi-embedded-runner/run/attempt.session-lock.test.ts
# OK

XDG_DATA_HOME=/home/liao/.hermes/tmp/pnpm-data pnpm build
# OK

Candidate fix commit in fork:

e1765a6dcd fix(agents): avoid session event queue self-wait

Workaround

Roll back the Gateway runtime to 2026.5.19 and restart it externally. In this environment the rollback to 2026.5.19 restored:

openclaw gateway status: OK
/health: 200
/status: 200
Feishu WS: ready

If the config was written by 2026.5.22, 2026.5.19 may refuse to start and exit with code 78. For an intentional rollback or recovery window, the service needs this environment variable set:

OPENCLAW_ALLOW_OLDER_BINARY_DESTRUCTIVE_ACTIONS=1

This should be treated as a recovery-only workaround, not a long-term fix.
