BUG: Discord health-monitor triggers uncaught exception crash loop (v2026.3.24) #54931

@kAIborg24


Bug type

Regression (worked before, now fails)

Summary

After upgrading from v2026.3.11 to v2026.3.24, the gateway crashes every ~35 minutes due to Discord's health-monitor detecting stale sockets and triggering a reconnection path that throws an uncaught exception. Zero crashes occurred across 5+ days on v2026.3.11. On v2026.3.24, 16 crashes occurred in a single day.

Steps to reproduce

  1. Install v2026.3.24 with Discord channel enabled (single guild, allowlist-only)
  2. Gateway runs normally for ~30-35 minutes
  3. Discord health-monitor detects a stale WebSocket (no events within staleSocketMinutes, default 30)
  4. Health-monitor calls stopChannel() → triggers onAbort()
  5. onAbort() sets gateway.options.reconnect = { maxAttempts: 0 } then calls gateway.disconnect()
  6. WebSocket closes with code 1005 ("No Status Received")
  7. handleClose(1005) → handleReconnectionAttempt() → checks reconnectAttempts(0) >= maxAttempts(0) → true
  8. Emits new Error("Max reconnect attempts (0) reached after code 1005")
  9. Error is uncaught → entire Node.js process crashes
  10. systemd restarts → cycle repeats every ~35 minutes
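Steps 5 through 9 can be modelled in isolation. The sketch below is not the OpenClaw source: GatewayModel is a hypothetical stand-in that borrows only the names quoted in this report (options.reconnect, reconnectAttempts, handleClose, handleReconnectionAttempt, emitter), and it assumes the emitter is a plain Node.js EventEmitter with no "error" listener attached during shutdown. Under that assumption, emit("error", ...) throws the Error itself.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical stand-in for the gateway plugin (assumed structure).
class GatewayModel {
  constructor() {
    this.emitter = new EventEmitter();
    this.options = {};
    this.reconnectAttempts = 0;
  }

  disconnect() {
    // A close that carries no status code is reported as 1005.
    this.handleClose(1005);
  }

  handleClose(code) {
    this.handleReconnectionAttempt(code);
  }

  handleReconnectionAttempt(code) {
    // The default of 5 applies only when maxAttempts is undefined, so 0 is kept.
    const { maxAttempts = 5 } = this.options.reconnect ?? {};
    if (this.reconnectAttempts >= maxAttempts) {
      // With no "error" listener registered, emit() throws this Error.
      this.emitter.emit(
        "error",
        new Error(`Max reconnect attempts (${maxAttempts}) reached after code ${code}`)
      );
      return;
    }
    this.reconnectAttempts += 1;
  }
}

// Step 5: the abort path zeroes maxAttempts, then disconnects.
function onAbort(gateway) {
  gateway.options.reconnect = { maxAttempts: 0 };
  gateway.disconnect();
}

// The try/catch exists only to observe the error in this sketch.
function reproduce() {
  const gateway = new GatewayModel();
  try {
    onAbort(gateway);
    return null;
  } catch (err) {
    return err.message;
  }
}

console.log(reproduce()); // prints "Max reconnect attempts (0) reached after code 1005"
```

In the reported stack trace the throw originates inside a WebSocket close callback, where no surrounding try/catch exists, which would explain why it reaches the process-level uncaught exception handler.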

Expected behavior

Health-monitor should gracefully restart the Discord channel without crashing the gateway process.

Actual behavior

The onAbort handler sets maxAttempts: 0 before disconnecting. The WebSocket close handler then fires and immediately triggers the max-attempts error path (0 >= 0 is true), emitting an uncaught exception that crashes the entire Node.js process.

OpenClaw version

2026.3.24 (upgraded from 2026.3.11)

Operating system

Ubuntu 24.04 LTS (Linux 6.18.7 x64)

Install method

npm global

Model

anthropic/claude-opus-4-6 / anthropic/claude-sonnet-4-6

Provider / routing chain

openclaw -> anthropic (direct)

Additional provider/model setup details

Bug is in Discord WebSocket lifecycle management, not model-specific.

Logs, screenshots, and evidence

[health-monitor] [discord:default] health-monitor: restarting (reason: stale-socket)
[openclaw] Uncaught exception: Error: Max reconnect attempts (0) reached after code 1005
    at SafeGatewayPlugin.handleReconnectionAttempt (provider-CAlWEl41.js:3318:47)
    at SafeGatewayPlugin.handleClose (provider-CAlWEl41.js:3364:8)
    at WebSocket.<anonymous> (provider-CAlWEl41.js:3307:9)

Root cause in provider-CAlWEl41.js:

Line 6952 — onAbort sets: gateway.options.reconnect = { maxAttempts: 0 };

Lines 3316-3318 — Reconnection handler checks:

const { maxAttempts = 5 } = this.options.reconnect ?? {};
if (this.reconnectAttempts >= maxAttempts) {
    // After onAbort sets maxAttempts to 0, this is 0 >= 0 and is true on the first close.
    this.emitter.emit("error", new Error(`Max reconnect attempts (${maxAttempts}) reached...`));
}

Crash frequency data:

• Mar 17-24 (v2026.3.11): 0 crashes across 5+ days
• Mar 25 (v2026.3.24): 16 crashes in one day, every ~35 min

Impact and severity

High — Gateway crashes every ~35 minutes. All running subagent sessions are disrupted or killed. Subagent completion announce-back fails after restart ("Outbound not configured for channel: telegram"). Long-running subagent tasks (30-75 min) have near-zero chance of completing.

Additional information

Suggested fixes:

  1. (Preferred) Use a flag to suppress the close handler instead of manipulating maxAttempts. lifecycleStopping already exists on line 6944; add a check for it in handleClose
  2. Set maxAttempts to a sentinel value that handleReconnectionAttempt treats as "intentional shutdown, don't emit error"
  3. Catch the error in the health-monitor's restart flow so it doesn't propagate as uncaught
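A minimal sketch of fixes 1 and 3, under stated assumptions. PatchedGatewayModel, stopChannel as written here, reconnectsScheduled, and attachErrorGuard are hypothetical illustrations; only lifecycleStopping, handleClose, handleReconnectionAttempt, and options.reconnect are names from this report. The emitter is again assumed to be a Node.js EventEmitter, and an "error" listener is only one possible way to keep the error from propagating as uncaught.

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical model of the gateway with fix 1 applied (assumed structure).
class PatchedGatewayModel {
  constructor() {
    this.emitter = new EventEmitter();
    this.options = {};
    this.reconnectAttempts = 0;
    this.lifecycleStopping = false;
    this.reconnectsScheduled = 0; // illustration-only counter
  }

  disconnect() {
    this.handleClose(1005);
  }

  handleClose(code) {
    // Fix 1: an intentional stop skips the reconnection path entirely.
    if (this.lifecycleStopping) return;
    this.handleReconnectionAttempt(code);
  }

  handleReconnectionAttempt(code) {
    const { maxAttempts = 5 } = this.options.reconnect ?? {};
    if (this.reconnectAttempts >= maxAttempts) {
      this.emitter.emit(
        "error",
        new Error(`Max reconnect attempts (${maxAttempts}) reached after code ${code}`)
      );
      return;
    }
    this.reconnectAttempts += 1;
    this.reconnectsScheduled += 1;
  }
}

// Stop path that marks the shutdown as intentional instead of zeroing maxAttempts.
function stopChannel(gateway) {
  gateway.lifecycleStopping = true;
  gateway.disconnect();
}

// Fix 3 (one possible form): a listener turns the error into a log entry.
function attachErrorGuard(gateway, log) {
  gateway.emitter.on("error", (err) => log.push(err.message));
}

function demo() {
  // Intentional stop: nothing is emitted and no reconnect is scheduled.
  const stopped = new PatchedGatewayModel();
  stopChannel(stopped);

  // Guarded emitter: the max-attempts error is logged rather than thrown.
  const guarded = new PatchedGatewayModel();
  const log = [];
  attachErrorGuard(guarded, log);
  guarded.options.reconnect = { maxAttempts: 0 };
  guarded.disconnect();

  return { reconnectsAfterStop: stopped.reconnectsScheduled, logged: log };
}

console.log(demo());
```

The flag check keeps unexpected closes on the normal reconnect path, since lifecycleStopping is false unless the stop path set it. The listener is a safety net only and does not change the 0 >= 0 comparison.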

Workaround: Disable Discord (channels.discord.enabled: false).

Note: A secondary issue was also observed. With the Discord channel disabled but the Discord plugin still enabled (plugins.entries.discord.enabled: true), message-action-discovery still tries to resolve the Discord token SecretRef, causing a separate crash ("Unhandled promise rejection: channels.discord.token: unresolved SecretRef"). Both the channel and the plugin must be disabled as a workaround.
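For reference, the two dotted keys named in the workaround expand to the nested shape below. This is an illustrative object only: the on-disk config file format and location are not shown in this report and are not assumed here, and workaroundOverrides is a hypothetical name.

```javascript
// Hypothetical object showing the two settings from the workaround in nested form.
const workaroundOverrides = {
  channels: {
    discord: { enabled: false }, // channels.discord.enabled
  },
  plugins: {
    entries: {
      discord: { enabled: false }, // plugins.entries.discord.enabled
    },
  },
};

console.log(JSON.stringify(workaroundOverrides, null, 2));
```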


Labels: bug (Something isn't working), regression (Behavior that previously worked and now fails)
