[Bug]: Stale replyRunRegistry lock causes indefinite inbound dispatch hang — no timeout on waitForIdle() for visible messages

## Bug type

Behavior bug (incorrect output/state without crash)

## Beta release blocker

No

## Summary

`replyRunRegistry` in-memory lock leaks after a prior agent turn completes or fails abnormally, causing all subsequent inbound messages for the affected session to hang indefinitely in `admitReplyTurn() → waitForIdle()` with no timeout. The gateway logs the inbound message receipt and read-receipt acknowledgment, then produces zero further output — no model call, no error, no outbound delivery. Only a full gateway restart clears the stale lock.

This is the same class of bug as #84710 (Telegram channel) but observed on the Octo (custom WebSocket) channel, with a complete code-level root cause trace.

## Environment

- **OpenClaw**: `2026.5.28 (e932160)`
- **OS**: macOS (Darwin 25.2.0, Apple Silicon)
- **Node**: v22.22.1
- **Channel**: Octo (WebSocket-based IM, via `openclaw-channel-octo` plugin)
- **Gateway**: LaunchAgent, embedded mode
- **Model**: Anthropic Claude (via proxy), model-independent bug

## Observed behavior

### Timeline (all timestamps UTC+8)

| Time | Event | Outcome |
|------|-------|---------|
| Jun 4 16:36 | Agent completes a normal turn for `<user-A>` on `<bot-account>` | ✅ Response delivered, session status → `done` |
| Jun 4 18:18 | `<user-A>` sends new DM to `<bot-account>` | ❌ recv + readReceipt logged, **no dispatch** |
| Jun 5 11:18:16 | `<user-A>` sends another DM (quotes a reply, 187 chars) | ❌ recv + readReceipt logged, **no dispatch** |
| Jun 5 11:18:28 | `<user-A>` sends follow-up DM | ❌ recv + readReceipt logged, **no dispatch** |
| Jun 5 11:19:57 | `<user-B>` sends DM to same `<bot-account>` | ✅ recv → readReceipt → `[deliver-buffer] fallback text sent` in 3s |
| Jun 5 11:32:52 | `<user-A>` sends another DM | ❌ recv + readReceipt logged, **no dispatch** |
| Jun 5 11:38:29 | Gateway restart | 🔄 In-memory state cleared |
| Jun 5 11:39:31 | `<user-A>` sends DM | ✅ recv → readReceipt → `[deliver-buffer] fallback text sent (18 chars)` |

### Key observations

1. **User-specific**: Only `<user-A>`'s session is stuck. `<user-B>` on the same bot account works fine (different `sessionKey`).
2. **Session shows `done`**: The session store reports `status: done` — this is purely an in-memory lock leak, not a persisted state issue.
3. **No error logged**: Between `readReceipt sent OK` and the next unrelated log entry, there is **zero output** — no error, no warning, no dispatch log. The code silently hangs.
4. **Programmatic delivery works**: Sending a message via the `message` tool (which bypasses the inbound dispatch pipeline) succeeds, confirming the session store and outbound path are healthy.
5. **Gateway restart fixes it**: Clears `replyRunRegistry` in-memory singleton → lock gone → messages dispatch normally.

### Gateway log signature (redacted)

```
# Stuck user — recv logged, then silence:
[octo] [<bot>] recv message from=<user-A> channel=<user-A> type=1
[octo] sending readReceipt+typing to channel=<user-A> type=1
[octo] typing sent OK
[octo] readReceipt sent OK
<nothing — no dispatch, no deliver-buffer, no error>

# Working user — full pipeline:
[octo] [<bot>] recv message from=<user-B> channel=<user-B> type=1
[octo] sending readReceipt+typing to channel=<user-B> type=1
[octo] readReceipt sent OK
[octo] typing sent OK
[octo] [deliver-buffer] fallback text sent (12 chars)
```

## Root cause analysis

Traced through the compiled source. The hang occurs in the core dispatch pipeline, not in the channel plugin.

### Call chain

```
Channel plugin (octo inbound.js)
  → core.channel.reply.dispatchReplyWithBufferedBlockDispatcher()
    → dispatchInboundMessageWithBufferedDispatcher()  [dispatch-*.js]
      → ensureDispatchReplyOperation("dispatch")
        → admitReplyTurn()  [reply-turn-admission-*.js]
          → createReplyOperation()  [reply-run-registry-*.js]
            → THROWS ReplyRunAlreadyActiveError (stale lock exists)
          → waitForIdle(sessionKey, undefined, ...)
            → HANGS FOREVER (no timeoutMs for "visible" kind)
```

### Code-level detail

**`reply-turn-admission-*.js` → `admitReplyTurn()`** (line ~2001):

```js
async function admitReplyTurn(params) {
  while (true) {
    try {
      return { status: "owned", operation: createReplyOperation({...}) };
    } catch (error) {
      if (!(error instanceof ReplyRunAlreadyActiveError)) throw error;
      // For "visible" kind: waitForActive=true, waitTimeoutMs=undefined
      const waitTimeoutMs = params.waitTimeoutMs
        ?? (params.kind === "queued_followup" ? 15e3 : void 0);
      //                                              ^^^^^^^^
      // undefined for "visible" messages — no timeout!
      if (!await replyRunRegistry.waitForIdle(
        params.sessionKey, waitTimeoutMs, { signal: params.upstreamAbortSignal }
      )) return { status: "skipped", reason: "active-run" };
    }
  }
}
```

**`reply-run-registry-*.js` → `waitForIdle()`** (line ~248):

```js
waitForIdle(sessionKey, timeoutMs, opts) {
  // ...
  return new Promise((resolve) => {
    const waiter = { finish: (ended) => { /* ... */ resolve(ended); } };
    // Only sets timeout if timeoutMs is a finite number:
    if (typeof timeoutMs === "number" && Number.isFinite(timeoutMs))
      waiter.timer = setTimeout(() => waiter.finish(false), Math.max(100, timeoutMs));
    // When timeoutMs is undefined → no timer → waits forever
    // ...
  });
}
```
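The difference between the two timeout cases can be reproduced with a minimal stand-in for the registry. This is a sketch only: `createRegistryModel`, `begin`, `end`, and `outcomeWithin` are hypothetical names for illustration and do not reflect the actual OpenClaw internals beyond the timer condition quoted above.

```js
// Hypothetical minimal model of the registry, NOT the OpenClaw implementation.
// It copies only the timer condition from the excerpt above.
function createRegistryModel() {
  const active = new Set();   // session keys with an active reply run
  const waiters = new Map();  // sessionKey -> Set of pending waiters

  return {
    begin(sessionKey) {
      active.add(sessionKey);
    },
    end(sessionKey) {
      active.delete(sessionKey);
      // Wake every waiter for this session with "idle reached".
      for (const waiter of waiters.get(sessionKey) ?? []) waiter.finish(true);
      waiters.delete(sessionKey);
    },
    waitForIdle(sessionKey, timeoutMs) {
      if (!active.has(sessionKey)) return Promise.resolve(true);
      return new Promise((resolve) => {
        const waiter = {
          timer: undefined,
          finish(ended) {
            clearTimeout(waiter.timer);
            waiters.get(sessionKey)?.delete(waiter);
            resolve(ended);
          },
        };
        if (!waiters.has(sessionKey)) waiters.set(sessionKey, new Set());
        waiters.get(sessionKey).add(waiter);
        // Same guard as the excerpt: a timer exists only for finite numbers.
        if (typeof timeoutMs === "number" && Number.isFinite(timeoutMs)) {
          waiter.timer = setTimeout(() => waiter.finish(false), Math.max(100, timeoutMs));
        }
      });
    },
  };
}

// Test helper: report "pending" if `promise` has not settled within `ms`.
function outcomeWithin(promise, ms) {
  let timer;
  const pending = new Promise((resolve) => {
    timer = setTimeout(() => resolve("pending"), ms);
  });
  return Promise.race([promise, pending]).finally(() => clearTimeout(timer));
}
```

In this model, if `begin(key)` is called and `end(key)` never is, `waitForIdle(key, 100)` resolves to `false` once the timer fires, while `waitForIdle(key, undefined)` stays pending until `end(key)` runs. That matches the stuck-until-restart behavior described above.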

### Why the lock leaks

The stale entry in `replyRunState.activeRunsByKey` persists because a prior reply operation was created (`createReplyOperation` added it to the map) but never completed its lifecycle (the `clearState()` callback was never invoked). Possible triggers:

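1. A rejected or never-settling promise during the model API call on a path the `finally` block does not cover (for example, a promise that is not awaited inside the `try`, or one that never settles so `finally` never runs)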
2. A heartbeat-driven run that set `pendingFinalDelivery` without clearing it (see #83184)
3. An embedded run (e.g. Codex app-server) that emitted `notification:turn/started` then went silent (see #85251)
4. A native tool call that never emitted a completion event (see #87310)
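Several of these triggers share one shape: the work guarded by the lock never settles, so cleanup that depends on settlement never runs. The sketch below illustrates one possible guard, a lease that is released on success, on failure, or after a deadline. `runWithLease` and the `locks` set are hypothetical and only stand in for whatever holds entries in `activeRunsByKey`.

```js
// Hypothetical helper, for illustration only: hold `key` in `locks` while
// `task` runs, and release it on success, on failure, or after `maxMs` if the
// task never settles.
async function runWithLease(locks, key, task, maxMs) {
  locks.add(key);
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`lease expired for ${key}`)), maxMs);
  });
  try {
    // Whichever settles first wins. A task that never settles loses to the
    // deadline, so the finally block below is guaranteed to run eventually.
    return await Promise.race([task(), deadline]);
  } finally {
    clearTimeout(timer);
    locks.delete(key);
  }
}
```

Note that the deadline only releases the lease. It does not cancel the underlying task, so a real implementation would probably also need to abort the run (for example through an `AbortSignal`).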

### Why there is no log output

The `logVerbose` call at the dispatch rejection site only fires when verbose mode is enabled:

```js
logVerbose(`dispatch-from-config: skipped reply operation admission for ${key}; reason=${reason}`);
```

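At default log level, the hang is **completely invisible**: no warning, no error, no structured event. Based on the `admitReplyTurn()` excerpt above, visible messages would not reach this log line even in verbose mode, because `waitForIdle()` stays pending while the stale lock exists and `admitReplyTurn()` therefore never returns `skipped`.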
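One way to make a long wait visible without changing its semantics is to wrap the wait in a watchdog that warns once a threshold is exceeded. This is a sketch under assumptions: `warnIfSlow`, its parameters, and the message format are hypothetical, and `warn` stands in for whatever warning-level logger the gateway exposes.

```js
// Hypothetical wrapper: call `warn` once if `waitPromise` is still pending
// after `warnAfterMs`. The wrapped promise settles exactly as the original.
function warnIfSlow(waitPromise, warnAfterMs, warn, label) {
  const timer = setTimeout(
    () => warn(`${label}: still waiting for idle after ${warnAfterMs}ms`),
    warnAfterMs,
  );
  // Cancel the warning if the wait finishes first.
  return waitPromise.finally(() => clearTimeout(timer));
}
```

Wrapping the `waitForIdle()` call this way, with the session key as `label`, would leave a trace at default log level even when the wait itself never ends.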

## Suggested fixes

1. **Add a TTL / max-wait timeout to `waitForIdle()` for visible messages**: Even 60–120s would prevent permanent hangs. The current code only sets a timeout for `queued_followup` (15s) — visible messages get `undefined` (infinite wait).

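2. **Promote the dispatch-skip log to `log.warn`**: Silent hangs are the worst failure mode. At minimum, log a warning when `admitReplyTurn` returns `skipped` with reason `active-run`, and ideally also when a `waitForIdle()` call stays pending beyond a threshold, since the hanging path never reaches the skip log.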

3. **Add a stale-lock reaper**: Periodically scan `replyRunState.activeRunsByKey` for entries older than N minutes and force-clear them (the registry already exports `forceClearReplyRunBySessionId`).

4. **Stuck-session recovery should clear `replyRunRegistry`**: The existing health-monitor / stuck-session recovery path should also check and clear stale entries in the in-memory reply run registry, not just persisted session state.
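Fixes 1 and 3 could look roughly like the sketch below. Everything here is an assumption for illustration: `VISIBLE_WAIT_TIMEOUT_MS`, `resolveWaitTimeoutMs`, and `reapStaleRuns` are hypothetical names, the 90s value is simply a point inside the 60–120s range suggested above, and the `{ startedAt }` entry shape is a stand-in because the real shape of `activeRunsByKey` entries is not shown in this issue.

```js
// Assumed default for visible messages (the issue suggests 60-120s).
const VISIBLE_WAIT_TIMEOUT_MS = 90_000;

// Fix 1 (sketch): always resolve to a finite timeout, so waitForIdle() arms a timer.
function resolveWaitTimeoutMs(params) {
  return params.waitTimeoutMs
    ?? (params.kind === "queued_followup" ? 15_000 : VISIBLE_WAIT_TIMEOUT_MS);
}

// Fix 3 (sketch): scan a Map of sessionKey -> { startedAt } and force-clear
// entries older than maxAgeMs. `forceClear` is a caller-supplied callback; in
// the real code it might delegate to the registry's existing force-clear export.
function reapStaleRuns(activeRuns, maxAgeMs, now, forceClear) {
  const reaped = [];
  for (const [sessionKey, run] of activeRuns) {
    if (now - run.startedAt > maxAgeMs) {
      forceClear(sessionKey);
      reaped.push(sessionKey);
    }
  }
  return reaped;
}
```

A reaper based purely on start time risks killing long-but-active runs, which is the failure described in #88870. Keying the age check on last observed activity instead of `startedAt` may be the safer choice.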

## Related issues

| Issue | Title | Relevance |
|-------|-------|-----------|
| #84710 | Telegram inbound dispatch hangs after "Inbound message" log | **Same bug, different channel** — identical symptoms (recv logged → silence → restart fixes) |
| #77485 | `ReplyRunAlreadyActiveError` fires every other gateway-WS chat call (50% reply failure) | **Same root cause** — `ReplyRunAlreadyActiveError` blocking dispatch; partial fix in 5.4 didn't cover all paths |
| #83184 | Heartbeat-driven agent replies leave `pendingFinalDelivery` stuck, blocking subsequent heartbeats | **Potential trigger** — heartbeat runs not clearing state, which may cause the initial lock leak |
| #87310 | Stale diagnostic `tool_call` activity survives recovery/reset and re-blocks sessions | **Same class** — in-memory state outliving its source, blocking future dispatch |
| #85251 | Codex app-server emits `notification:turn/started` then goes silent; embedded run wedges | **Potential trigger** — embedded run never completing could leave reply operation active |
| #86538 | Session write-lock timeouts block subagent delivery lanes | **Same class** — lock-based state management without adequate timeout/recovery |
| #88870 | Stuck-session recovery aborts long-but-active agent runs with misleading reason | **Related recovery gap** — recovery mechanism itself has edge cases |
| #86963 | Orphaned native Codex thread wedges session permanently, silently dropping messages | **Same symptom** — session permanently stuck, messages silently dropped |

## Repro notes

- **Intermittent but sticky**: Once the lock leaks, it persists until restart. The initial leak trigger is not deterministic — we observed it after a normal-looking completed turn with ~1h42m gap before the next message.
- **Multi-agent amplifier**: Environments with many agents/bot-accounts sharing `maxConcurrent` limits may increase the chance of lock contention and leak.
- **Channel-independent**: This is a core dispatch issue. The channel plugin (Octo, Telegram, etc.) correctly delivers the message to the core — the core's `admitReplyTurn` is where it hangs.

## OpenClaw version

2026.5.28 (e932160)

## Operating system

macOS (Darwin 25.2.0, arm64)

## Install method

npm (global)

## Model

Anthropic Claude (model-independent — bug is in core dispatch, not model path)

## Provider / routing chain

Anthropic via proxy (provider-independent)

