
Discord: health monitor restart loop — post-connection zombie sessions evade circuit breakers #38596

@jeanmonet

Problem

Discord connections successfully complete the handshake (HELLO → READY/RESUMED), run for a period, then become zombie/unstable. The health monitor detects this ("stuck" / "stale-socket") and restarts the provider, but the cycle repeats indefinitely — creating an effective infinite restart loop where messages sent during unstable periods are silently dropped.

This is distinct from #13688 (HELLO never received / infinite resume loop), which appears fixed in v2026.3.2.

Environment

  • OpenClaw: v2026.3.2 (stable)
  • OS: Ubuntu 24.04
  • Discord: Single private server, one user
  • Other channels: Telegram and WhatsApp remain operational throughout

Production evidence (Mar 7, 2026)

Health monitor restarted Discord provider every 5–15 minutes for 4+ hours:

02:15 [health-monitor] [discord:default] restarting (reason: stale-socket)
02:30 [health-monitor] [discord:default] restarting (reason: stuck)
02:45 [health-monitor] [discord:default] restarting (reason: stuck)
03:00 [health-monitor] [discord:default] restarting (reason: stuck)
03:10 [health-monitor] [discord:default] restarting (reason: stuck)
03:20 [health-monitor] [discord:default] restarting (reason: stuck)
03:25 [gateway/channels] restarting discord channel
03:44 [gateway/channels] restarting discord channel (×4 in 8 seconds)
04:50 [health-monitor] [discord:default] restarting (reason: stale-socket)
05:00 [health-monitor] [discord:default] restarting (reason: stuck)

Key observations:

  • Zero "connection stalled: no HELLO" messages: the HELLO stall circuit breaker never fires because HELLO IS received
  • WebSocket code 1005 disconnects appear between restarts: "gateway: Attempting resume with backoff: 1000ms" / "WebSocket connection closed with code 1005"
  • After each health monitor restart, Discord reconnects successfully ("channels resolved", "logged in to discord") but eventually becomes unstable again

Root cause analysis

The health monitor restart creates a new provider lifecycle, which resets two protective mechanisms:

1. consecutiveHelloStalls counter (local variable)

// Inside runDiscordGatewayLifecycle — recreated on every restart
let consecutiveHelloStalls = 0;

Each health monitor restart creates a new lifecycle with a fresh counter at 0. Even if HELLO stalls did occur, the counter can never accumulate across restarts.
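A minimal sketch of the difference, assuming the counter could instead be owned by whatever restarts the lifecycle. Only consecutiveHelloStalls is taken from the code above; the function names, the StallState type, and the simulated stall inputs are hypothetical and are not OpenClaw's actual API.

```typescript
// Hypothetical stand-in for the current shape: the counter is a local
// variable, so every new lifecycle starts again from 0.
function runLifecycleWithLocalCounter(helloStalled: boolean): number {
  let consecutiveHelloStalls = 0;
  if (helloStalled) consecutiveHelloStalls += 1;
  return consecutiveHelloStalls;
}

// Hypothetical alternative: the caller owns the state and passes it in,
// so the count survives a lifecycle restart.
interface StallState {
  consecutiveHelloStalls: number;
}

function runLifecycleWithSharedState(state: StallState, helloStalled: boolean): number {
  // A successful HELLO resets the streak; a stall extends it.
  state.consecutiveHelloStalls = helloStalled ? state.consecutiveHelloStalls + 1 : 0;
  return state.consecutiveHelloStalls;
}

// Simulate three restarts in a row, each of which stalls on HELLO.
const stalls = [true, true, true];
const shared: StallState = { consecutiveHelloStalls: 0 };

const localCounts = stalls.map((s) => runLifecycleWithLocalCounter(s));
const sharedCounts = stalls.map((s) => runLifecycleWithSharedState(shared, s));

console.log(localCounts);  // [ 1, 1, 1 ]  never exceeds 1
console.log(sharedCounts); // [ 1, 2, 3 ]  can reach a threshold
```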

2. reconnectStallWatchdog (disarmed on socket open)

The watchdog is armed on disconnect/hello-timeout but disarmed on every "WebSocket connection opened" event. In a rapid open→close→open cycle, the watchdog timeout (5 min) is never reached because each new open disarms it and the next close starts the timer from zero.
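The effect can be reproduced with a small simulation. This is a sketch under stated assumptions, not the real reconnectStallWatchdog: the StallWatchdog class, its event names, and the one-flap-per-minute schedule are all hypothetical, and the sketch assumes the watchdog does not re-arm (and so does not extend its deadline) when it is already armed.

```typescript
type GatewayEvent = "close" | "open" | "ready";

// Hypothetical watchdog that tracks a deadline from explicit timestamps
// instead of real timers, so the behavior is easy to inspect.
class StallWatchdog {
  private armedAt: number | null = null;
  private readonly timeoutMs: number;
  private readonly disarmOn: "open" | "ready";

  constructor(timeoutMs: number, disarmOn: "open" | "ready") {
    this.timeoutMs = timeoutMs;
    this.disarmOn = disarmOn;
  }

  handle(event: GatewayEvent, nowMs: number): void {
    if (event === "close") {
      // Assumption: an already-armed watchdog keeps its original start time.
      if (this.armedAt === null) this.armedAt = nowMs;
    } else if (event === this.disarmOn) {
      this.armedAt = null;
    }
  }

  hasFired(nowMs: number): boolean {
    return this.armedAt !== null && nowMs - this.armedAt >= this.timeoutMs;
  }
}

// The socket closes and reopens once a minute for 10 minutes and never
// reaches READY. Returns true if the 5-minute watchdog fires.
function firesDuringFlapping(disarmOn: "open" | "ready"): boolean {
  const watchdog = new StallWatchdog(5 * 60_000, disarmOn);
  for (let minute = 0; minute < 10; minute++) {
    const t = minute * 60_000;
    watchdog.handle("close", t);
    if (watchdog.hasFired(t)) return true;
    watchdog.handle("open", t + 1_000);
  }
  return false;
}

console.log(firesDuringFlapping("open"));  // false: every open disarms it
console.log(firesDuringFlapping("ready")); // true: fires at the 5-minute mark
```

Note that the second result depends on the no-re-arm assumption: if each close restarted the timer, closes arriving less than 5 minutes apart would keep pushing the deadline out regardless of when the watchdog is disarmed.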

Result

Neither circuit breaker can protect against the pattern: "connection succeeds → runs briefly → dies → health monitor restarts → repeat." The health monitor's DEFAULT_MAX_RESTARTS_PER_HOUR = 10 limits restart frequency but doesn't break the cycle.

Impact

Messages sent to the bot during "stuck" periods are silently lost — no error, no retry, no notification to the user. The bot appears online (Discord presence) but doesn't respond.

Suggested fixes

  1. Persist restart context across provider lifecycles: track the number of health monitor restarts in the last N minutes at the monitor level (not inside the lifecycle). After a threshold (e.g., 5 restarts in 30 min), take more drastic action:
    • Force a fresh IDENTIFY (clear session state before restart)
    • Apply a longer cooldown before reconnecting
    • Emit a warning event for monitoring/alerting
  2. Don't disarm reconnectStallWatchdog on socket open: disarm it on READY/RESUMED instead, so rapid open→close cycles are caught.
  3. Track Discord-specific health separately from the generic channel monitor: the WebSocket code 1005/1006 pattern and zombie heartbeat detection could inform a Discord-specific circuit breaker.
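Fix 1 could look roughly like the sliding-window tracker below. This is only a sketch: RestartTracker, recordRestart, and the synthetic restart times are hypothetical, and the 30-minute window and threshold of 5 are the example values from the list above rather than tuned settings. The escalation actions themselves (fresh IDENTIFY, longer cooldown, warning event) are left to the caller.

```typescript
// Hypothetical monitor-level tracker. It lives outside the provider
// lifecycle, so its history survives each restart.
class RestartTracker {
  private timestamps: number[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;

  constructor(windowMs: number, threshold: number) {
    this.windowMs = windowMs;
    this.threshold = threshold;
  }

  // Records a restart at nowMs and returns true when the number of
  // restarts inside the window has reached the threshold.
  recordRestart(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    // Drop restarts that have aged out of the window.
    this.timestamps = this.timestamps.filter((t) => nowMs - t < this.windowMs);
    return this.timestamps.length >= this.threshold;
  }
}

const MINUTE = 60_000;
const tracker = new RestartTracker(30 * MINUTE, 5);

// Synthetic schedule: one restart every 6 minutes.
const restartMinutes = [0, 6, 12, 18, 24];
const escalations = restartMinutes.map((m) => tracker.recordRestart(m * MINUTE));

console.log(escalations); // [ false, false, false, false, true ]
```

The caller would switch from a plain restart to the more drastic actions when recordRestart returns true.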

Relationship to other issues
