[Bug]: Stale replyRunRegistry lock causes indefinite inbound dispatch hang — no timeout on waitForIdle() for visible messages #90535

@Jerry-Xin

Description

Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

No

Summary

replyRunRegistry in-memory lock leaks after a prior agent turn completes or fails abnormally, causing all subsequent inbound messages for the affected session to hang indefinitely in admitReplyTurn() → waitForIdle() with no timeout. The gateway logs the inbound message receipt and read-receipt acknowledgment, then produces zero further output — no model call, no error, no outbound delivery. Only a full gateway restart clears the stale lock.

This is the same class of bug as #84710 (Telegram channel) but observed on the Octo (custom WebSocket) channel, with a complete code-level root cause trace.

Environment

  • OpenClaw: 2026.5.28 (e932160)
  • OS: macOS (Darwin 25.2.0, Apple Silicon)
  • Node: v22.22.1
  • Channel: Octo (WebSocket-based IM, via openclaw-channel-octo plugin)
  • Gateway: LaunchAgent, embedded mode
  • Model: Anthropic Claude (via proxy), model-independent bug

Observed behavior

Timeline (all timestamps UTC+8)

| Time | Event | Outcome |
| --- | --- | --- |
| Jun 4 16:36 | Agent completes a normal turn for <user-A> on <bot-account> | ✅ Response delivered, session status → done |
| Jun 4 18:18 | <user-A> sends new DM to <bot-account> | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:18:16 | <user-A> sends another DM (quotes a reply, 187 chars) | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:18:28 | <user-A> sends follow-up DM | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:19:57 | <user-B> sends DM to same <bot-account> | ✅ recv → readReceipt → [deliver-buffer] fallback text sent in 3s |
| Jun 5 11:32:52 | <user-A> sends another DM | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:38:29 | Gateway restart | 🔄 In-memory state cleared |
| Jun 5 11:39:31 | <user-A> sends DM | ✅ recv → readReceipt → [deliver-buffer] fallback text sent (18 chars) |

Key observations

  1. User-specific: Only <user-A>'s session is stuck. <user-B> on the same bot account works fine (different sessionKey).
  2. Session shows done: The session store reports status: done — this is purely an in-memory lock leak, not a persisted state issue.
  3. No error logged: Between readReceipt sent OK and the next unrelated log entry, there is zero output — no error, no warning, no dispatch log. The code silently hangs.
  4. Programmatic delivery works: Sending a message via the message tool (which bypasses the inbound dispatch pipeline) succeeds, confirming the session store and outbound path are healthy.
  5. Gateway restart fixes it: Clears replyRunRegistry in-memory singleton → lock gone → messages dispatch normally.

Gateway log signature (redacted)

# Stuck user — recv logged, then silence:
[octo] [<bot>] recv message from=<user-A> channel=<user-A> type=1
[octo] sending readReceipt+typing to channel=<user-A> type=1
[octo] typing sent OK
[octo] readReceipt sent OK
<nothing — no dispatch, no deliver-buffer, no error>

# Working user — full pipeline:
[octo] [<bot>] recv message from=<user-B> channel=<user-B> type=1
[octo] sending readReceipt+typing to channel=<user-B> type=1
[octo] readReceipt sent OK
[octo] typing sent OK
[octo] [deliver-buffer] fallback text sent (12 chars)

Root cause analysis

Traced through the compiled source. The hang occurs in the core dispatch pipeline, not in the channel plugin.

Call chain

Channel plugin (octo inbound.js)
  → core.channel.reply.dispatchReplyWithBufferedBlockDispatcher()
    → dispatchInboundMessageWithBufferedDispatcher()  [dispatch-*.js]
      → ensureDispatchReplyOperation("dispatch")
        → admitReplyTurn()  [reply-turn-admission-*.js]
          → createReplyOperation()  [reply-run-registry-*.js]
            → THROWS ReplyRunAlreadyActiveError (stale lock exists)
          → waitForIdle(sessionKey, undefined, ...)
            → HANGS FOREVER (no timeoutMs for "visible" kind)

Code-level detail

reply-turn-admission-*.js, admitReplyTurn() (line ~2001):

async function admitReplyTurn(params) {
  while (true) {
    try {
      return { status: "owned", operation: createReplyOperation({...}) };
    } catch (error) {
      if (!(error instanceof ReplyRunAlreadyActiveError)) throw error;
      // For "visible" kind: waitForActive=true, waitTimeoutMs=undefined
      const waitTimeoutMs = params.waitTimeoutMs
        ?? (params.kind === "queued_followup" ? 15e3 : void 0);
      //                                              ^^^^^^^^
      // undefined for "visible" messages — no timeout!
      if (!await replyRunRegistry.waitForIdle(
        params.sessionKey, waitTimeoutMs, { signal: params.upstreamAbortSignal }
      )) return { status: "skipped", reason: "active-run" };
    }
  }
}

reply-run-registry-*.js, waitForIdle() (line ~248):

waitForIdle(sessionKey, timeoutMs, opts) {
  // ...
  return new Promise((resolve) => {
    const waiter = { finish: (ended) => { /* ... */ resolve(ended); } };
    // Only sets timeout if timeoutMs is a finite number:
    if (typeof timeoutMs === "number" && Number.isFinite(timeoutMs))
      waiter.timer = setTimeout(() => waiter.finish(false), Math.max(100, timeoutMs));
    // When timeoutMs is undefined → no timer → waits forever
    // ...
  });
}
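The interaction between a leaked entry and an undefined timeout can be reproduced in isolation. The following is a minimal sketch, assuming a toy registry (`ToyReplyRunRegistry`, `begin`, and `end` are hypothetical names, not the real OpenClaw API) that mirrors only the timer guard shown in the excerpt above:

```javascript
// Toy stand-in for the registry; not the real OpenClaw implementation.
class ToyReplyRunRegistry {
  constructor() {
    this.activeRunsByKey = new Map(); // sessionKey -> { startedAt }
    this.waitersByKey = new Map(); // sessionKey -> Set of pending waiters
  }

  begin(sessionKey) {
    if (this.activeRunsByKey.has(sessionKey)) throw new Error("run already active");
    this.activeRunsByKey.set(sessionKey, { startedAt: Date.now() });
  }

  // The cleanup step that a leaked run never reaches.
  end(sessionKey) {
    this.activeRunsByKey.delete(sessionKey);
    for (const waiter of [...(this.waitersByKey.get(sessionKey) ?? [])]) waiter.finish(true);
  }

  waitForIdle(sessionKey, timeoutMs) {
    if (!this.activeRunsByKey.has(sessionKey)) return Promise.resolve(true);
    return new Promise((resolve) => {
      const waiters = this.waitersByKey.get(sessionKey) ?? new Set();
      this.waitersByKey.set(sessionKey, waiters);
      const waiter = {
        timer: undefined,
        finish: (ended) => {
          clearTimeout(waiter.timer);
          waiters.delete(waiter);
          resolve(ended);
        },
      };
      waiters.add(waiter);
      // Same guard as the excerpt: a timer exists only for a finite number.
      if (typeof timeoutMs === "number" && Number.isFinite(timeoutMs)) {
        waiter.timer = setTimeout(() => waiter.finish(false), Math.max(100, timeoutMs));
      }
    });
  }
}

async function demo() {
  const registry = new ToyReplyRunRegistry();
  registry.begin("session-A"); // end() is never called: the simulated leak

  // With a finite timeout the wait gives up and resolves false.
  const bounded = await registry.waitForIdle("session-A", 200);

  // With undefined there is no timer, so the promise stays pending.
  let unboundedSettled = false;
  registry.waitForIdle("session-A", undefined).then(() => { unboundedSettled = true; });
  await new Promise((resolve) => setTimeout(resolve, 300));

  return { bounded, unboundedSettled };
}

demo().then((result) => console.log(result)); // { bounded: false, unboundedSettled: false }
```

In this toy model the only way out of the unbounded wait is a call to `end()`, which matches the observation that nothing short of clearing the in-memory state unblocks the session.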

Why the lock leaks

The stale entry in replyRunState.activeRunsByKey persists because a prior reply operation was created (createReplyOperation added it to the map) but never completed its lifecycle (the clearState() callback was never invoked). Possible triggers:

  1. Unhandled promise rejection during the model API call that bypasses the finally block
  2. A heartbeat-driven run that set pendingFinalDelivery without clearing it (see #83184, "Heartbeat-driven agent replies leave pendingFinalDelivery stuck, blocking subsequent heartbeats")
  3. An embedded run (e.g. Codex app-server) that emitted notification:turn/started then went silent (see #85251, "Codex app-server emits notification:turn/started then goes silent; embedded run wedges for the full stuck-session recovery window")
  4. A native tool call that never emitted a completion event (see #87310, "Stale diagnostic tool_call activity can survive recovery/reset and re-block sessions as blocked_tool_call")
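Whichever trigger applies, the common requirement is that the registry entry is released on every exit path, including a rejection and a run that never settles. A minimal sketch of that pattern, assuming a plain Map in place of activeRunsByKey and a hypothetical helper `runWithReleasedLock` (neither the name nor the `maxRunMs` deadline comes from the OpenClaw source):

```javascript
// Hypothetical helper: releases the entry however `work` ends.
async function runWithReleasedLock(activeRuns, sessionKey, work, maxRunMs) {
  if (activeRuns.has(sessionKey)) throw new Error(`run already active for ${sessionKey}`);
  activeRuns.set(sessionKey, { startedAt: Date.now() });

  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`run exceeded ${maxRunMs}ms`)), maxRunMs);
  });

  try {
    // Awaiting inside `try` is what routes a rejection through `finally`.
    return await Promise.race([work(), deadline]);
  } finally {
    clearTimeout(timer);
    activeRuns.delete(sessionKey); // stands in for the clearState() callback
  }
}

async function demo() {
  const activeRuns = new Map();
  const outcomes = [];
  const cases = {
    ok: async () => "done",
    rejects: async () => { throw new Error("model call failed"); },
    silent: () => new Promise(() => {}), // never settles, like a wedged embedded run
  };
  for (const [name, work] of Object.entries(cases)) {
    try {
      await runWithReleasedLock(activeRuns, "session-A", work, 100);
      outcomes.push(`${name}: completed`);
    } catch (error) {
      outcomes.push(`${name}: ${error.message}`);
    }
    outcomes.push(`${name}: lock held = ${activeRuns.has("session-A")}`);
  }
  return outcomes;
}

demo().then((outcomes) => console.log(outcomes.join("\n")));
```

The deadline only releases the registry entry; it does not cancel the underlying work, so a real implementation would presumably also abort the run through its own cancellation path.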

Why there is no log output

The logVerbose call at the dispatch rejection site only fires when verbose mode is enabled:

logVerbose(`dispatch-from-config: skipped reply operation admission for ${key}; reason=${reason}`);

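At default log level, a skipped admission is therefore invisible: no warning, no error, no structured event. For visible messages the situation is worse: waitForIdle() has no timer, so admitReplyTurn() does not return "skipped" on its own, and this line appears not to be reached even in verbose mode.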
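To make the gap concrete, the sketch below shows an admission loop in which every message kind gets a finite wait and a timed-out wait is logged at warn level. It is an illustration under stated assumptions, not the OpenClaw implementation: `admitWithBoundedWait`, `resolveWaitTimeoutMs`, the injected `deps` object (`tryCreate`, `waitForIdle`, `logger`), and the 90-second default for visible messages are all hypothetical; only the 15-second queued_followup value comes from the excerpt above.

```javascript
// Assumed defaults: 15 s is from the excerpt, 90 s for "visible" is illustrative.
const DEFAULT_WAIT_MS = { queued_followup: 15_000, visible: 90_000 };

function resolveWaitTimeoutMs(params) {
  return params.waitTimeoutMs ?? DEFAULT_WAIT_MS[params.kind] ?? DEFAULT_WAIT_MS.visible;
}

// deps.tryCreate(sessionKey) returns an operation, or null while a run is active (toy contract).
async function admitWithBoundedWait(params, deps) {
  while (true) {
    const operation = deps.tryCreate(params.sessionKey);
    if (operation) return { status: "owned", operation };

    const timeoutMs = resolveWaitTimeoutMs(params);
    const idle = await deps.waitForIdle(params.sessionKey, timeoutMs);
    if (!idle) {
      // Logged at warn so the skip is visible at the default log level.
      deps.logger.warn(
        `reply admission skipped for ${params.sessionKey}; reason=active-run; waitedMs=${timeoutMs}`,
      );
      return { status: "skipped", reason: "active-run" };
    }
  }
}

async function demo() {
  const warnings = [];
  const deps = {
    tryCreate: () => null, // stale lock: admission can never succeed
    waitForIdle: (_sessionKey, timeoutMs) =>
      new Promise((resolve) => setTimeout(() => resolve(false), timeoutMs)),
    logger: { warn: (message) => warnings.push(message) },
  };
  const result = await admitWithBoundedWait(
    { sessionKey: "session-A", kind: "visible", waitTimeoutMs: 50 },
    deps,
  );
  return { result, warnings };
}

demo().then(({ result, warnings }) => console.log(result.status, warnings.length)); // skipped 1
```

With a stale lock this still drops the message, but it does so within a bounded time and leaves a log line that points at the cause.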

Suggested fixes

  1. Add a TTL / max-wait timeout to waitForIdle() for visible messages: Even 60–120s would prevent permanent hangs. The current code only sets a timeout for queued_followup (15s) — visible messages get undefined (infinite wait).

  2. Promote the dispatch-skip log to log.warn: Silent hangs are the worst failure mode. At minimum, log a warning when admitReplyTurn returns skipped with reason active-run.

  3. Add a stale-lock reaper: Periodically scan replyRunState.activeRunsByKey for entries older than N minutes and force-clear them (the registry already exports forceClearReplyRunBySessionId).

  4. Stuck-session recovery should clear replyRunRegistry: The existing health-monitor / stuck-session recovery path should also check and clear stale entries in the in-memory reply run registry, not just persisted session state.
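A stale-lock reaper along the lines of fixes 3 and 4 could be sketched as follows. This assumes a Map shaped like activeRunsByKey whose entries carry `startedAt` and an optional `lastActivityAt`; both field names and the functions `reapStaleRuns` and `startReaper` are hypothetical. A real implementation would presumably clear entries through the registry's own forceClearReplyRunBySessionId rather than deleting from the map directly; its signature is not shown in this report, so it is not called here.

```javascript
// Removes entries with no activity for longer than maxAgeMs and returns their keys.
function reapStaleRuns(activeRunsByKey, maxAgeMs, now = Date.now(), onReap = () => {}) {
  const reaped = [];
  for (const [sessionKey, run] of activeRunsByKey) {
    // Prefer last activity over start time so long-but-active runs are not reaped.
    const idleMs = now - (run.lastActivityAt ?? run.startedAt);
    if (idleMs > maxAgeMs) {
      activeRunsByKey.delete(sessionKey); // deleting during Map iteration is safe in JS
      onReap(sessionKey, idleMs);
      reaped.push(sessionKey);
    }
  }
  return reaped;
}

// Runs the reaper periodically; returns a function that stops it.
function startReaper(activeRunsByKey, { maxAgeMs, intervalMs, onReap }) {
  const timer = setInterval(
    () => reapStaleRuns(activeRunsByKey, maxAgeMs, Date.now(), onReap),
    intervalMs,
  );
  timer.unref?.(); // in Node, do not keep the process alive just for the reaper
  return () => clearInterval(timer);
}

// Usage with a fixed clock so the result is deterministic.
const now = 10_000_000;
const activeRunsByKey = new Map([
  ["session-A", { startedAt: now - 3_600_000, lastActivityAt: now - 3_600_000 }],
  ["session-B", { startedAt: now - 3_600_000, lastActivityAt: now - 5_000 }],
  ["session-C", { startedAt: now - 1_000 }],
]);
const reaped = reapStaleRuns(activeRunsByKey, 10 * 60_000, now);
console.log(reaped); // [ 'session-A' ]
```

Keying the decision on last activity rather than start time is a deliberate choice here: #88870 describes recovery aborting long-but-active runs, and an age-only threshold would risk the same misfire.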

Related issues

| Issue | Title | Relevance |
| --- | --- | --- |
| #84710 | Telegram inbound dispatch hangs after "Inbound message" log | Same bug, different channel: identical symptoms (recv logged → silence → restart fixes) |
| #77485 | ReplyRunAlreadyActiveError fires every other gateway-WS chat call (50% reply failure) | Same root cause: ReplyRunAlreadyActiveError blocking dispatch; partial fix in 5.4 didn't cover all paths |
| #83184 | Heartbeat-driven agent replies leave pendingFinalDelivery stuck, blocking subsequent heartbeats | Potential trigger: heartbeat runs not clearing state, which may cause the initial lock leak |
| #87310 | Stale diagnostic tool_call activity survives recovery/reset and re-blocks sessions | Same class: in-memory state outliving its source, blocking future dispatch |
| #85251 | Codex app-server emits notification:turn/started then goes silent; embedded run wedges | Potential trigger: embedded run never completing could leave reply operation active |
| #86538 | Session write-lock timeouts block subagent delivery lanes | Same class: lock-based state management without adequate timeout/recovery |
| #88870 | Stuck-session recovery aborts long-but-active agent runs with misleading reason | Related recovery gap: recovery mechanism itself has edge cases |
| #86963 | Orphaned native Codex thread wedges session permanently, silently dropping messages | Same symptom: session permanently stuck, messages silently dropped |

Repro notes

  • Intermittent but sticky: Once the lock leaks, it persists until restart. The initial leak trigger is not deterministic — we observed it after a normal-looking completed turn with ~1h42m gap before the next message.
  • Multi-agent amplifier: Environments with many agents/bot-accounts sharing maxConcurrent limits may increase the chance of lock contention and leak.
  • Channel-independent: This is a core dispatch issue. The channel plugin (Octo, Telegram, etc.) correctly delivers the message to the core — the core's admitReplyTurn is where it hangs.

OpenClaw version

2026.5.28 (e932160)

Operating system

macOS (Darwin 25.2.0, arm64)

Install method

npm (global)

Model

Anthropic Claude (model-independent — bug is in core dispatch, not model path)

Provider / routing chain

Anthropic via proxy (provider-independent)

Labels

  • P1 - High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:needs-maintainer-review - ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:message-loss - Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
