Bug type
Behavior bug (incorrect output/state without crash)
Beta release blocker
No
Summary
replyRunRegistry in-memory lock leaks after a prior agent turn completes or fails abnormally, causing all subsequent inbound messages for the affected session to hang indefinitely in admitReplyTurn() → waitForIdle() with no timeout. The gateway logs the inbound message receipt and read-receipt acknowledgment, then produces zero further output — no model call, no error, no outbound delivery. Only a full gateway restart clears the stale lock.
This is the same class of bug as #84710 (Telegram channel) but observed on the Octo (custom WebSocket) channel, with a complete code-level root cause trace.
Environment
OpenClaw: 2026.5.28 (e932160)
OS: macOS 25.2.0 (Apple Silicon)
Node: v22.22.1
Channel: Octo (WebSocket-based IM, via openclaw-channel-octo plugin)
Gateway: LaunchAgent, embedded mode
Model: Anthropic Claude (via proxy), model-independent bug
Observed behavior
Timeline (all timestamps UTC+8)
| Time | Event | Outcome |
|---|---|---|
| Jun 4 16:36 | Agent completes a normal turn for <user-A> on <bot-account> | ✅ Response delivered, session status → done |
| Jun 4 18:18 | <user-A> sends new DM to <bot-account> | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:18:16 | <user-A> sends another DM (quotes a reply, 187 chars) | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:18:28 | <user-A> sends follow-up DM | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:19:57 | <user-B> sends DM to same <bot-account> | ✅ recv → readReceipt → [deliver-buffer] fallback text sent in 3s |
| Jun 5 11:32:52 | <user-A> sends another DM | ❌ recv + readReceipt logged, no dispatch |
| Jun 5 11:38:29 | Gateway restart | 🔄 In-memory state cleared |
| Jun 5 11:39:31 | <user-A> sends DM | ✅ recv → readReceipt → [deliver-buffer] fallback text sent (18 chars) |
Key observations
User-specific: Only <user-A>'s session is stuck. <user-B> on the same bot account works fine (different sessionKey).
Session shows done: The session store reports status: done — this is purely an in-memory lock leak, not a persisted state issue.
No error logged: Between readReceipt sent OK and the next unrelated log entry, there is zero output — no error, no warning, no dispatch log. The code silently hangs.
Programmatic delivery works: Sending a message via the message tool (which bypasses the inbound dispatch pipeline) succeeds, confirming the session store and outbound path are healthy.
Gateway log signature (redacted)
# Stuck user: recv logged, then silence
[octo] [<bot>] recv message from=<user-A> channel=<user-A> type=1
[octo] sending readReceipt+typing to channel=<user-A> type=1
[octo] typing sent OK
[octo] readReceipt sent OK
<nothing — no dispatch, no deliver-buffer, no error>
# Working user — full pipeline:
[octo] [<bot>] recv message from=<user-B> channel=<user-B> type=1
[octo] sending readReceipt+typing to channel=<user-B> type=1
[octo] readReceipt sent OK
[octo] typing sent OK
[octo] [deliver-buffer] fallback text sent (12 chars)
Root cause analysis
Traced through the compiled source. The hang occurs in the core dispatch pipeline, not in the channel plugin.
    waitForIdle(sessionKey, timeoutMs, opts) {
      // ...
      return new Promise((resolve) => {
        const waiter = {
          finish: (ended) => { /* ... */ resolve(ended); }
        };
        // Only sets timeout if timeoutMs is a finite number:
        if (typeof timeoutMs === "number" && Number.isFinite(timeoutMs))
          waiter.timer = setTimeout(() => waiter.finish(false), Math.max(100, timeoutMs));
        // When timeoutMs is undefined → no timer → waits forever
        // ...
      });
    }
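To make the failure mode concrete, here is a minimal standalone model of the waiter logic quoted above. It is a sketch under assumptions, not the OpenClaw source: `createWaiter`, the `applyDefault` option, and `DEFAULT_MAX_WAIT_MS` are hypothetical names introduced for illustration. With `applyDefault` off it mirrors the reported behavior (no timer when `timeoutMs` is `undefined`). With it on, a non-finite timeout falls back to an assumed cap.

```javascript
// Hypothetical model of the waiter inside waitForIdle(); not OpenClaw's API.
const DEFAULT_MAX_WAIT_MS = 90_000; // assumed cap, inside the 60-120s range proposed in this issue

function createWaiter(timeoutMs, { applyDefault = false } = {}) {
  const waiter = { timer: undefined, settled: false, result: undefined };
  waiter.promise = new Promise((resolve) => {
    // finish() is idempotent and always releases the timer.
    waiter.finish = (ended) => {
      if (waiter.settled) return;
      waiter.settled = true;
      waiter.result = ended;
      if (waiter.timer !== undefined) clearTimeout(waiter.timer);
      resolve(ended);
    };
  });

  const isFinite = (v) => typeof v === "number" && Number.isFinite(v);
  // Assumed change: substitute a default when the caller passes no finite timeout.
  const effective = applyDefault && !isFinite(timeoutMs) ? DEFAULT_MAX_WAIT_MS : timeoutMs;

  // Same guard as the quoted code: a timer exists only for finite numbers.
  if (isFinite(effective)) {
    waiter.timer = setTimeout(() => waiter.finish(false), Math.max(100, effective));
  }
  return waiter;
}
```

Calling `createWaiter(undefined)` yields a promise that nothing will ever settle unless some other code path calls `finish()`, which matches the hang described in this report. Calling `createWaiter(undefined, { applyDefault: true })` resolves to `false` after the cap at the latest.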
Why the lock leaks
The stale entry in replyRunState.activeRunsByKey persists because a prior reply operation was created (createReplyOperation added it to the map) but never completed its lifecycle (the clearState() callback was never invoked). Possible triggers:
Unhandled promise rejection during the model API call that bypasses the finally block
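One way to rule out this class of leak is to tie the registry entry's removal to a `finally` block around an awaited run, so that success, rejection, and synchronous throws all release the lock. The sketch below assumes a plain `Map` standing in for `replyRunState.activeRunsByKey`. `runWithReplyLock` and the result shapes are hypothetical and do not describe how `createReplyOperation` or `clearState()` are actually implemented.

```javascript
// Hypothetical stand-in for replyRunState.activeRunsByKey.
const activeRunsByKey = new Map();

// Runs one reply turn while holding the per-session lock.
// runTurn is any function returning a value or a promise.
async function runWithReplyLock(sessionKey, runTurn) {
  if (activeRunsByKey.has(sessionKey)) {
    return { status: "skipped", reason: "active-run" };
  }
  const entry = { startedAt: Date.now() };
  activeRunsByKey.set(sessionKey, entry);
  try {
    // Awaiting here is what keeps a rejection inside the try/finally.
    return { status: "done", value: await runTurn() };
  } catch (error) {
    return { status: "failed", error };
  } finally {
    // Remove only our own entry, in case it was force-cleared and replaced.
    if (activeRunsByKey.get(sessionKey) === entry) {
      activeRunsByKey.delete(sessionKey);
    }
  }
}
```

The guarantee only holds if every promise the turn depends on is awaited inside the `try`. A promise that is started but not awaited can reject outside this scope, which is one plausible reading of the "bypasses the finally block" trigger above.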
Why there is no log output
The logVerbose call at the dispatch rejection site only fires when verbose mode is enabled:
logVerbose(`dispatch-from-config: skipped reply operation admission for ${key}; reason=${reason}`);
At default log level, the hang is completely invisible — no warning, no error, no structured event.
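A small sketch of what an always-visible version of that log line could look like. `reportAdmissionSkip` is a hypothetical helper, and `logger` is assumed to be any object exposing a `warn(message)` method. The message text is copied from the `logVerbose` call above.

```javascript
// Hypothetical helper; `logger` is any object with a warn(message) method.
function reportAdmissionSkip(logger, key, reason) {
  const message =
    `dispatch-from-config: skipped reply operation admission for ${key}; reason=${reason}`;
  // Emitted at warn level so it shows up without verbose mode enabled.
  logger.warn(message);
  return message;
}
```

For example, `reportAdmissionSkip(console, sessionKey, "active-run")` would print the skip at default log level.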
Suggested fixes
Add a TTL / max-wait timeout to waitForIdle() for visible messages: Even 60–120s would prevent permanent hangs. The current code only sets a timeout for queued_followup (15s) — visible messages get undefined (infinite wait).
Promote the dispatch-skip log to log.warn: Silent hangs are the worst failure mode. At minimum, log a warning when admitReplyTurn returns skipped with reason active-run.
Add a stale-lock reaper: Periodically scan replyRunState.activeRunsByKey for entries older than N minutes and force-clear them (the registry already exports forceClearReplyRunBySessionId).
Stuck-session recovery should clear replyRunRegistry: The existing health-monitor / stuck-session recovery path should also check and clear stale entries in the in-memory reply run registry, not just persisted session state.
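The stale-lock reaper could be sketched as follows. This assumes each registry entry records a `startedAt` timestamp in milliseconds, which the issue does not confirm. `reapStaleRuns` and `onReap` are hypothetical names. The sketch deletes entries from a plain `Map` directly rather than calling `forceClearReplyRunBySessionId`, whose signature is not shown here.

```javascript
// Hypothetical registry shape: Map<sessionKey, { startedAt: number }>.
// Removes entries older than maxAgeMs and returns the reaped keys.
function reapStaleRuns(activeRunsByKey, maxAgeMs, now = Date.now(), onReap = () => {}) {
  const reaped = [];
  for (const [key, entry] of activeRunsByKey) {
    if (now - entry.startedAt > maxAgeMs) {
      // Deleting the current entry while iterating a Map is safe in JavaScript.
      activeRunsByKey.delete(key);
      onReap(key, entry); // e.g. log a warning so the recovery is visible
      reaped.push(key);
    }
  }
  return reaped;
}
```

It could be driven by something like `setInterval(() => reapStaleRuns(registry, 10 * 60_000), 60_000).unref()`. The `maxAgeMs` value needs to exceed the longest legitimate turn, otherwise the reaper would release a lock that is still in use.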
Same symptom — session permanently stuck, messages silently dropped
Repro notes
Intermittent but sticky: Once the lock leaks, it persists until restart. The initial leak trigger is not deterministic — we observed it after a normal-looking completed turn with ~1h42m gap before the next message.
Multi-agent amplifier: Environments with many agents/bot-accounts sharing maxConcurrent limits may increase the chance of lock contention and leak.
Channel-independent: This is a core dispatch issue. The channel plugin (Octo, Telegram, etc.) correctly delivers the message to the core — the core's admitReplyTurn is where it hangs.
Code-level detail
reply-turn-admission-*.js → admitReplyTurn() (line ~2001)
reply-run-registry-*.js → waitForIdle() (line ~248)
Related issues
#83184: [Bug]: Heartbeat-driven agent replies leave pendingFinalDelivery stuck, blocking subsequent heartbeats (listed as a possible trigger of the leak)
#85251: Codex app-server emits notification:turn/started then goes silent; embedded run wedges for the full stuck-session recovery window (listed as a possible trigger of the leak)
ReplyRunAlreadyActiveError fires every other gateway-WS chat call (50% reply failure)
ReplyRunAlreadyActiveError blocking dispatch; partial fix in 5.4 didn't cover all paths
tool_call activity survives recovery/reset and re-blocks sessions
OpenClaw version
2026.5.28 (e932160)
Operating system
macOS (Darwin 25.2.0, arm64)
Install method
npm (global)
Model
Anthropic Claude (model-independent — bug is in core dispatch, not model path)
Provider / routing chain
Anthropic via proxy (provider-independent)