Problem
Discord connections successfully complete the handshake (HELLO → READY/RESUMED), run for a period, then become zombie/unstable. The health monitor detects this ("stuck" / "stale-socket") and restarts the provider, but the cycle repeats indefinitely — creating an effective infinite restart loop where messages sent during unstable periods are silently dropped.
This is distinct from #13688 (HELLO never received / infinite resume loop), which appears fixed in v2026.3.2.
Environment
- OpenClaw: v2026.3.2 (stable)
- OS: Ubuntu 24.04
- Discord: Single private server, one user
- Other channels: Telegram and WhatsApp remain operational throughout
Production evidence (Mar 7, 2026)
Health monitor restarted Discord provider every 5–15 minutes for 4+ hours:
02:15 [health-monitor] [discord:default] restarting (reason: stale-socket)
02:30 [health-monitor] [discord:default] restarting (reason: stuck)
02:45 [health-monitor] [discord:default] restarting (reason: stuck)
03:00 [health-monitor] [discord:default] restarting (reason: stuck)
03:10 [health-monitor] [discord:default] restarting (reason: stuck)
03:20 [health-monitor] [discord:default] restarting (reason: stuck)
03:25 [gateway/channels] restarting discord channel
03:44 [gateway/channels] restarting discord channel (×4 in 8 seconds)
04:50 [health-monitor] [discord:default] restarting (reason: stale-socket)
05:00 [health-monitor] [discord:default] restarting (reason: stuck)
Key observations:
- Zero
connection stalled: no HELLO messages — the HELLO stall circuit breaker never fires because HELLO IS received
- WebSocket code 1005 disconnects appear between restarts:
gateway: Attempting resume with backoff: 1000ms / WebSocket connection closed with code 1005
- After each health monitor restart, Discord reconnects successfully (
channels resolved, logged in to discord) but eventually goes unstable again
Root cause analysis
The health monitor restart creates a new provider lifecycle, which resets two protective mechanisms:
1. consecutiveHelloStalls counter (local variable)
// Inside runDiscordGatewayLifecycle — recreated on every restart
let consecutiveHelloStalls = 0;
Each health monitor restart creates a new lifecycle with a fresh counter at 0. Even if HELLO stalls did occur, the counter can never accumulate across restarts.
2. reconnectStallWatchdog (disarmed on socket open)
The watchdog is armed on disconnect/hello-timeout but disarmed on every WebSocket connection opened event. In a rapid open→close→open cycle, the watchdog timeout (5 min) is never reached because each new open resets it.
Result
Neither circuit breaker can protect against the pattern: "connection succeeds → runs briefly → dies → health monitor restarts → repeat." The health monitor's DEFAULT_MAX_RESTARTS_PER_HOUR = 10 limits restart frequency but doesn't break the cycle.
Impact
Messages sent to the bot during "stuck" periods are silently lost — no error, no retry, no notification to the user. The bot appears online (Discord presence) but doesn't respond.
Suggested fixes
-
Persist restart context across provider lifecycles — track "number of health monitor restarts in last N minutes" at the monitor level (not inside the lifecycle). After a threshold (e.g., 5 restarts in 30 min), take more drastic action:
- Force a fresh IDENTIFY (clear session state before restart)
- Apply a longer cooldown before reconnecting
- Emit a warning event for monitoring/alerting
-
Don't disarm reconnectStallWatchdog on socket open — disarm it on READY/RESUMED instead, so rapid open→close cycles are caught.
-
Track Discord-specific health separately from the generic channel monitor — the WebSocket code 1005/1006 pattern and zombie heartbeat detection could inform a Discord-specific circuit breaker.
Relationship to other issues
Problem
Discord connections successfully complete the handshake (HELLO → READY/RESUMED), run for a period, then become zombie/unstable. The health monitor detects this ("stuck" / "stale-socket") and restarts the provider, but the cycle repeats indefinitely — creating an effective infinite restart loop where messages sent during unstable periods are silently dropped.
This is distinct from #13688 (HELLO never received / infinite resume loop), which appears fixed in v2026.3.2.
Environment
Production evidence (Mar 7, 2026)
Health monitor restarted Discord provider every 5–15 minutes for 4+ hours:
Key observations:
connection stalled: no HELLOmessages — the HELLO stall circuit breaker never fires because HELLO IS receivedgateway: Attempting resume with backoff: 1000ms/WebSocket connection closed with code 1005channels resolved,logged in to discord) but eventually goes unstable againRoot cause analysis
The health monitor restart creates a new provider lifecycle, which resets two protective mechanisms:
1.
consecutiveHelloStallscounter (local variable)Each health monitor restart creates a new lifecycle with a fresh counter at 0. Even if HELLO stalls did occur, the counter can never accumulate across restarts.
2.
reconnectStallWatchdog(disarmed on socket open)The watchdog is armed on disconnect/hello-timeout but disarmed on every
WebSocket connection openedevent. In a rapid open→close→open cycle, the watchdog timeout (5 min) is never reached because each newopenresets it.Result
Neither circuit breaker can protect against the pattern: "connection succeeds → runs briefly → dies → health monitor restarts → repeat." The health monitor's
DEFAULT_MAX_RESTARTS_PER_HOUR = 10limits restart frequency but doesn't break the cycle.Impact
Messages sent to the bot during "stuck" periods are silently lost — no error, no retry, no notification to the user. The bot appears online (Discord presence) but doesn't respond.
Suggested fixes
Persist restart context across provider lifecycles — track "number of health monitor restarts in last N minutes" at the monitor level (not inside the lifecycle). After a threshold (e.g., 5 restarts in 30 min), take more drastic action:
Don't disarm
reconnectStallWatchdogon socketopen— disarm it onREADY/RESUMEDinstead, so rapid open→close cycles are caught.Track Discord-specific health separately from the generic channel monitor — the WebSocket code 1005/1006 pattern and zombie heartbeat detection could inform a Discord-specific circuit breaker.
Relationship to other issues
reconnectAttemptsfix)