[Bug]: Gateway event-loop stalls cause cross-channel latency, missed replies, and channel disconnects #75882

@mgonto

Bug type

Performance / reliability regression

Summary

The gateway's Node event loop intermittently stalls for tens to hundreds of seconds (about 12 s to 172 s in the samples below), causing latency and failures across all channels. The problem is not limited to WhatsApp: during the same periods, Telegram polling and send actions stall or fail, Slack socket-mode pings time out, and WhatsApp Web repeatedly disconnects or exits. WhatsApp additionally hits recurring 408/428 reconnect/session-expiry failures, tracked separately in #75736.

User-visible impact

  • WhatsApp messages sometimes get an automatic reaction, but no assistant reply is sent or the reply is delayed by minutes.
  • Telegram is also occasionally slow and has send/dispatch failures.
  • Slack socket mode repeatedly disconnects/restarts.
  • Gateway status probes can show channels as connected briefly, then stopped/disconnected shortly after.

Why the WhatsApp reaction happens but no answer follows

Logs show WhatsApp inbound/reaction handling can complete before the assistant run/delivery path finishes. After the reaction, the gateway/agent path can stall on event-loop delay, session/lane waits, file lock timeouts, LLM timeout, or WhatsApp listener flapping. This creates the visible pattern: ✅ reaction arrives, but no final answer is delivered.

Environment

  • OpenClaw: 2026.4.29 (a448042)
  • OS: Linux 6.8.0-100-generic x64
  • Node: 22.22.0
  • Gateway: systemd user service
  • Install: npm/pnpm global CLI
  • Channels enabled: WhatsApp, Telegram, Slack
  • Host resources during investigation: RAM available ~1.5–1.6GiB, disk ~88% full, gateway process RSS at roughly 35–40% of memory, with CPU spikes during stalls.

Current channel state example

Gateway reachable.
- Slack default: enabled, configured, stopped, disconnected, error: channel stop timed out after 5000ms
- Telegram default: enabled, configured, running, connected, mode: polling, works
- WhatsApp default: enabled, configured, linked, stopped, disconnected, error: channel exited without an error

Sanitized evidence

Event-loop / liveness stalls

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=173s eventLoopDelayP99Ms=171798.7 eventLoopDelayMaxMs=171798.7 eventLoopUtilization=1 cpuCoreRatio=1.055 active=0 waiting=0 queued=0
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=47s eventLoopDelayP99Ms=22666 eventLoopDelayMaxMs=22666 eventLoopUtilization=0.997 cpuCoreRatio=1.038 active=2 waiting=0 queued=1
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=34s eventLoopDelayP99Ms=12146.7 eventLoopDelayMaxMs=12146.7 eventLoopUtilization=1 cpuCoreRatio=1.031 active=0 waiting=0 queued=2
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=181s eventLoopDelayP99Ms=171530.3 eventLoopDelayMaxMs=171530.3 eventLoopUtilization=1 cpuCoreRatio=1.054 active=1 waiting=0 queued=2
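
To compare stalls across a longer log, lines like the ones above can be turned into structured records. The following is a minimal sketch, assuming the `key=value` layout shown in these excerpts stays stable; `parseLivenessWarning` and the `LivenessWarning` type are hypothetical names and are not part of OpenClaw.

```typescript
// Hypothetical helper: field names are copied from the log excerpts above.
interface LivenessWarning {
  reasons: string[];
  intervalSec: number;
  eventLoopDelayP99Ms: number;
  eventLoopDelayMaxMs: number;
  eventLoopUtilization: number;
  cpuCoreRatio: number;
  active: number;
  waiting: number;
  queued: number;
}

function parseLivenessWarning(line: string): LivenessWarning | null {
  const marker = "liveness warning:";
  const at = line.indexOf(marker);
  if (at === -1) return null; // not a liveness line

  // Collect every whitespace-separated key=value token after the marker.
  const fields = new Map<string, string>();
  for (const token of line.slice(at + marker.length).trim().split(/\s+/)) {
    const eq = token.indexOf("=");
    if (eq > 0) fields.set(token.slice(0, eq), token.slice(eq + 1));
  }

  // parseFloat ignores a trailing unit, so "173s" becomes 173.
  const num = (key: string): number => Number.parseFloat(fields.get(key) ?? "NaN");

  return {
    reasons: (fields.get("reasons") ?? "").split(",").filter((r) => r.length > 0),
    intervalSec: num("interval"),
    eventLoopDelayP99Ms: num("eventLoopDelayP99Ms"),
    eventLoopDelayMaxMs: num("eventLoopDelayMaxMs"),
    eventLoopUtilization: num("eventLoopUtilization"),
    cpuCoreRatio: num("cpuCoreRatio"),
    active: num("active"),
    waiting: num("waiting"),
    queued: num("queued"),
  };
}
```

With records in this shape, the stalls can be sorted by `eventLoopDelayMaxMs` and lined up against the channel errors that occurred in the same interval.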

Telegram affected too

[telegram] Polling stall detected (active getUpdates stuck for 172.51s); forcing restart.
[telegram] [diag] polling cycle finished reason=polling stall detected durationMs=172510 error=Network request for 'getUpdates' failed!
[telegram] polling runner stopped (polling stall detected); restarting in 2.49s.
[telegram] sendChatAction failed: Network request for 'sendChatAction' failed!
[telegram] dispatch failed: SessionWriteLockTimeoutError: session file locked (timeout 10000ms): .../sessions.json.lock

Slack affected too

[WARN] socket-mode:SlackWebSocket A pong wasn't received from the server before the timeout of 15000ms!
[slack] socket disconnected (disconnect). retry 1/12 in 2s
[health-monitor] [slack:default] health-monitor: restarting (reason: disconnected)
[slack] [default] channel stop exceeded 5000ms after abort; continuing shutdown

WhatsApp flapping / delivery failures

[whatsapp] Web connection closed (status 408). Retry 1/12 in 2.2s… (status=408 Request Time-out Connection was lost)
[whatsapp] Web connection closed (status 428: session expired or precondition required). Relink with `openclaw channels login --channel whatsapp`. Stopping web monitoring.
[whatsapp] [default] channel exited without an error
[whatsapp] [default] auto-restart attempt 3/10 in 22s
[tools] message failed: Error: No active WhatsApp Web listener (account: default).

Reaction without timely answer pattern

[whatsapp] Sending reaction "✅" -> message ...
[whatsapp] Inbound message ... (direct, 106 chars)
[whatsapp] Sent reaction "✅" -> message ...
[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=47s eventLoopDelayP99Ms=22666 eventLoopDelayMaxMs=22666 eventLoopUtilization=0.997 cpuCoreRatio=1.038 active=2 waiting=0 queued=1
[diagnostic] lane wait exceeded: lane=session:agent:main:hook:ingress waitedMs=403418 queueAhead=4

Agent/session/lane symptoms

[diagnostic] lane wait exceeded: lane=session:agent:main:hook:ingress waitedMs=594650 queueAhead=4
[diagnostic] lane task error: lane=cron-nested durationMs=508892 error="FailoverError: LLM request timed out."
[diagnostic] lane task error: lane=session:agent:main:cron:... durationMs=636726 error="FailoverError: LLM request timed out."
[agent/embedded] agent cleanup timed out: ... step=pi-trajectory-flush timeoutMs=10000
[agent/embedded] [context-overflow-diag] sessionKey=agent:main:whatsapp:default:direct:... source=assistantError ... error=Context overflow: estimated context size exceeds safe threshold during tool loop.

Counts from a 12h log sample

event_loop_liveness: 173
telegram_polling_stall: 9
whatsapp_428: 8
wa_no_listener: 8
wa_channel_exited: 11
stuck_session: 84
gateway_timeout: 113
slack_disconnect: 64
assistant_error/context-overflow: 3
send_fail: 7
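
A tally like this can be reproduced from raw logs with a small script. The sketch below is not how the counts above were produced; the patterns are assumptions derived from the log excerpts in this report, and only labels with a matching excerpt are covered. `tallyLogLines` and `PATTERNS` are hypothetical names.

```typescript
// Assumed label -> pattern mapping, based only on lines quoted in this report.
const PATTERNS: ReadonlyArray<readonly [string, RegExp]> = [
  ["event_loop_liveness", /liveness warning:/],
  ["telegram_polling_stall", /\[telegram\] Polling stall detected/],
  ["whatsapp_428", /\[whatsapp\] Web connection closed \(status 428/],
  ["wa_no_listener", /No active WhatsApp Web listener/],
  ["wa_channel_exited", /\[whatsapp\].*channel exited without an error/],
];

function tallyLogLines(lines: readonly string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const [label] of PATTERNS) counts[label] = 0;

  for (const line of lines) {
    for (const [label, pattern] of PATTERNS) {
      // A line may match more than one label; each match is counted.
      if (pattern.test(line)) counts[label] = (counts[label] ?? 0) + 1;
    }
  }
  return counts;
}
```

Running it over successive time windows (for example hourly slices of the same log) would show whether channel failures cluster around the liveness warnings.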

Hypotheses

  1. Gateway event loop is being blocked by one or more synchronous/CPU-heavy or file-lock-heavy operations, causing all channel transports to miss heartbeats/timeouts.
  2. Session persistence / trajectory flushing may be contributing: sessions.json.lock timeout and pi-trajectory-flush cleanup timeout appear near stalls.
  3. LLM/tool-loop timeouts and context-overflow diagnostics may be leaving sessions in long processing_without_queue states, causing lane waits and downstream delivery delays.
  4. WhatsApp has an additional channel-specific reconnect/session-expiry bug (#75736, "channel exited with HTTP 428 Precondition Required"), which becomes more visible under event-loop stalls.
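
To test hypothesis 1 independently of the gateway's own diagnostics, event-loop delay and utilization can be sampled with Node's built-in `perf_hooks` module. This is a minimal sketch and not the gateway's implementation; `startLoopProbe` and `LoopSample` are hypothetical names, while `monitorEventLoopDelay` and `performance.eventLoopUtilization` are standard Node APIs.

```typescript
import { monitorEventLoopDelay, performance } from "node:perf_hooks";

interface LoopSample {
  delayP99Ms: number;
  delayMaxMs: number;
  utilization: number; // 0..1 over the window since the previous sample
}

// Hypothetical probe: call sample() periodically, stop() on shutdown.
function startLoopProbe(resolutionMs = 20) {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  let previous = performance.eventLoopUtilization();

  return {
    sample(): LoopSample {
      const current = performance.eventLoopUtilization();
      // Passing two readings yields the delta between them.
      const delta = performance.eventLoopUtilization(current, previous);
      previous = current;

      const result: LoopSample = {
        // The histogram reports nanoseconds; convert to milliseconds.
        delayP99Ms: histogram.percentile(99) / 1e6,
        delayMaxMs: histogram.max / 1e6,
        utilization: Number.isFinite(delta.utilization) ? delta.utilization : 0,
      };
      histogram.reset(); // start a fresh window for the next sample
      return result;
    },
    stop(): void {
      histogram.disable();
    },
  };
}
```

Because the probe runs on the same loop it measures, a long synchronous block only shows up as a large `delayMaxMs` once the loop resumes. Pairing the samples with a CPU profile taken during a stall would be one way to identify which operation is blocking.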

Expected behavior

  • A stuck agent run or trajectory flush should not block channel polling/websocket heartbeats for 10–170s.
  • Inbound ack/reaction and assistant reply delivery should not diverge silently; if a reply cannot be delivered, the failure should be recoverable/observable.
  • Telegram/Slack/WhatsApp transports should remain responsive even when one session or cron is stuck.
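
One way to make a stuck reply observable rather than silent is to bound each delivery step with a deadline that rejects with a typed error. The sketch below is illustrative only; `withDeadline` and `DeadlineError` are hypothetical names, not OpenClaw APIs. A timer-based deadline only fires while the event loop is free, so it addresses stuck asynchronous work (such as an LLM request that never returns) and does not help while the loop itself is blocked.

```typescript
// Hypothetical error type so callers can tell a deadline from other failures.
class DeadlineError extends Error {
  constructor(readonly label: string, readonly timeoutMs: number) {
    super(`${label} exceeded ${timeoutMs}ms`);
    this.name = "DeadlineError";
  }
}

// Rejects with DeadlineError if `work` has not settled within timeoutMs.
// The underlying work is not cancelled; only the caller stops waiting.
function withDeadline<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new DeadlineError(label, timeoutMs)), timeoutMs);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A caller could catch `DeadlineError` after the reaction has been sent and log or surface the missed reply, so the reaction and the answer no longer diverge without a trace.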

Actual behavior

Gateway event-loop stalls correlate with Telegram polling stalls, Slack pings timing out, WhatsApp disconnects/exits, lane waits, session lock failures, and missed/delayed user replies.
