
[Bug]: Discord plugin: post-restart gateway wedges at "awaiting gateway readiness" on macOS (no 15s timeout fires) #70841

@HeilbronAILabs

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

On macOS, restarting the OpenClaw gateway often leaves the Discord plugin permanently stuck at `awaiting gateway readiness`. The underlying TCP/TLS connection to gateway.discord.gg is established, but the Discord IDENTIFY → READY exchange in `carbonGateway.GatewayPlugin` never completes. The plugin's wait-for-ready guard returns "ready" anyway because of a `?? true` nullish default, so the documented `discord: gateway was not ready after 15000ms` error never fires and the plugin proceeds with a non-functional gateway. Inbound Discord messages are silently dropped (outbound REST still works).

The failure is intermittent (~20% of restarts succeed on the affected host), so it looks like a race in the carbon gateway's initial handshake that is lost far more often on this Mac than on Linux.

Environment

  • OpenClaw: 2026.4.21 (f788c88)
  • Host OS: macOS 26.4.1 (build 25E253), Darwin 25.4.0
  • Node: v25.9.0
  • Plugins enabled: acpx, browser, device-pair, discord, phone-control, talk-voice, telegram
  • Other host (works): DGX Spark / Linux, same OpenClaw 2026.4.21 — no reproduction

Symptom

After (almost any) gateway restart on the affected Mac:

```
[gateway] ready (7 plugins: ...; 11.7s)
[discord] [default] starting provider (@bot…)
[discord] client initialized as <bot-id> (bot<bot-id>); awaiting gateway readiness
```

…and that's it. No further Discord log entries. No `MESSAGE_CREATE`, no dispatched events. `openclaw status --deep` self-reports Discord as "OK" because it inspects local token state, not gateway liveness.

Restart success rate today on this host (from gateway.log): 20 successful logins / 101 starts ≈ 20%.

Reproducer

On the affected macOS host, with Discord plugin configured and previously working:

  1. Send message to a Discord channel the bot is in → reply received from agent (gateway is healthy).
  2. Trigger any process-level restart: kill -9 <gateway-pid>, openclaw gateway restart, SIGUSR1, SIGTERM+respawn, supervisor reload, or full Mac reboot.
  3. Wait for [gateway] ready ... to appear in ~/.openclaw/logs/gateway.log.
  4. Within ~1s, [discord] client initialized as ...; awaiting gateway readiness is logged.
  5. Send a Discord message → no reply, no inbound Discord events in the log.

Roughly 1 in 5 restarts skips this and reaches [discord] logged in to discord as <bot-id> instead.

The only reliable recovery path observed is an in-process channel-level hot-reload from an already-healthy gateway (e.g. health-monitor stale-socket recovery). All process-level restart paths land in the stuck state with the same probability.

Evidence the WS layer is fine

  • Bare-metal probe from the same host with the same bot token: WS open in 141 ms, IDENTIFY → READY in 1.6 s, GUILD_CREATE arrives. So the network path, token, and intents are all good.
  • lsof -p <gateway-pid> -i on a stuck gateway shows an ESTABLISHED TCP to 162.159.135.234:443 (= gateway.discord.gg). The TLS/WSS handshake completed; the wedge is at the Discord protocol layer (HELLO/IDENTIFY/READY), inside the carbon library.
  • No duplicate processes, no on-disk Discord/resume state to clear, no system sleep around the failures, intents (incl. Message Content) confirmed enabled.
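For reference, the "bare-metal probe" mentioned above can be approximated with Node ≥ 22's built-in `WebSocket`, no libraries needed. This is a hedged sketch: the URL, opcodes, and payload shapes follow the public Discord Gateway v10 protocol, but the logging and structure are ours, and real code should also handle heartbeat ACKs (op 11) and resume:

```javascript
// Minimal Discord gateway handshake probe (sketch, not production code).
// Run with DISCORD_TOKEN set; prints how long WS-open and READY take.
const GATEWAY_URL = "wss://gateway.discord.gg/?v=10&encoding=json";

// GUILDS (1<<0) | GUILD_MESSAGES (1<<9) | MESSAGE_CONTENT (1<<15)
const INTENTS = (1 << 0) | (1 << 9) | (1 << 15);

function buildIdentify(token, intents = INTENTS) {
  return {
    op: 2, // IDENTIFY
    d: {
      token,
      intents,
      properties: { os: process.platform, browser: "probe", device: "probe" },
    },
  };
}

function probe(token) {
  const t0 = Date.now();
  const ws = new WebSocket(GATEWAY_URL);
  ws.addEventListener("open", () => console.log(`WS open in ${Date.now() - t0} ms`));
  ws.addEventListener("message", (ev) => {
    const msg = JSON.parse(ev.data);
    if (msg.op === 10) {
      // HELLO: start heartbeating, then IDENTIFY
      setInterval(() => ws.send(JSON.stringify({ op: 1, d: null })), msg.d.heartbeat_interval);
      ws.send(JSON.stringify(buildIdentify(token)));
    } else if (msg.t === "READY") {
      console.log(`IDENTIFY -> READY in ${Date.now() - t0} ms`);
    } else if (msg.t === "GUILD_CREATE") {
      console.log("GUILD_CREATE arrived; handshake fully healthy");
      process.exit(0);
    }
  });
}

if (process.env.DISCORD_TOKEN) probe(process.env.DISCORD_TOKEN);
```

If a probe like this completes READY while the plugin stays wedged, the fault is isolated to the plugin/carbon layer rather than the host or token.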

Suspected bug in provider-CraktAkD.js

waitForGatewayReady (lines 6049–6090 of dist/extensions/discord/provider-CraktAkD.js) starts roughly:

```js
async function waitForGatewayReady(params) {
  if (params.gateway?.isConnected ?? true) return;   // line ~6056
  // ...poll up to DISCORD_GATEWAY_READY_TIMEOUT_MS (15000ms, line 5849)
  // ...else throw "discord: gateway was not ready after 15000ms"
}
```

The ?? true short-circuit means: if params.gateway is undefined, the function reports "ready" immediately. This is what we believe is happening on the wedged restarts — no carbon gateway instance has been wired in by the time waitForGatewayReady runs, so it returns success without waiting and without throwing.
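The footgun is easy to demonstrate in isolation. The snippet below mimics the guard's expression outside OpenClaw (function names are ours): optional chaining on a missing object yields `undefined`, which `??` then replaces with `true`:

```javascript
// Mirrors `params.gateway?.isConnected ?? true` from the bundle.
const looksReady = (params) => Boolean(params.gateway?.isConnected ?? true);

console.log(looksReady({}));                                  // true  <- bug: no gateway at all
console.log(looksReady({ gateway: { isConnected: false } })); // false <- the only case that waits
console.log(looksReady({ gateway: { isConnected: true } }));  // true

// The `=== true` variant proposed below fails closed instead:
const isReady = (params) => params.gateway?.isConnected === true;
console.log(isReady({}));                                     // false
```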

That hypothesis is consistent with the second observation: the documented timeout error

```
discord: gateway was not ready after 15000ms
```

never appears in either gateway.log or gateway.err.log for any of the broken restarts in the past 24 h. If the wait actually ran against a real-but-not-ready gateway, we would expect this log line.

The downstream formatDiscordStartupStatusMessage (line ~7024) then logs:

```js
gatewayReady: lifecycleGateway?.isConnected === true
```

which evaluates to false (because lifecycleGateway is undefined), producing the awaiting gateway readiness log line — but the plugin lifecycle has already proceeded past the readiness gate.

Suggested fix(es)

  1. Tighten the readiness guard. Replace `params.gateway?.isConnected ?? true` with `params.gateway?.isConnected === true` (matching the check at line 7024). If `params.gateway` is missing, the plugin should fail loudly, not silently report ready.
  2. Validate `params.gateway` is non-null at the call site (line ~6131). If the lifecycle is invoking `waitForGatewayReady` before a gateway instance has been attached, that is the upstream bug to fix. Either way, the wait should not silently pass.
  3. Consider whether the carbon gateway IDENTIFY can wedge when it loses a race during initial connect on macOS / Node 25, and whether a bounded retry with a full reconnect (rather than just polling `isConnected`) is appropriate.
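Fixes (1) and (2) could be combined along these lines. This is an illustrative sketch, not the bundle's actual code: the timeout constant and error string come from the report above, while `POLL_INTERVAL_MS` and the exact polling shape are assumptions:

```javascript
const DISCORD_GATEWAY_READY_TIMEOUT_MS = 15000; // per line 5849 of the bundle
const POLL_INTERVAL_MS = 250;                   // assumed; actual interval not verified

async function waitForGatewayReady(params) {
  if (!params.gateway) {
    // Fix (2): a missing gateway is an upstream lifecycle bug -- fail loudly.
    throw new Error("discord: waitForGatewayReady called before a gateway was attached");
  }
  const deadline = Date.now() + DISCORD_GATEWAY_READY_TIMEOUT_MS;
  // Fix (1): only an explicit `isConnected === true` counts as ready.
  while (params.gateway.isConnected !== true) {
    if (Date.now() >= deadline) {
      throw new Error(`discord: gateway was not ready after ${DISCORD_GATEWAY_READY_TIMEOUT_MS}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}
```

With this shape, a wedged restart would surface the documented timeout error instead of silently proceeding, which also makes the failure visible to log-based monitoring.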

Data points & log excerpts

Today's restarts on the affected host (2026-04-23, all EDT):

| Time     | Trigger                                       | Result |
|----------|-----------------------------------------------|--------|
| 11:59:44 | health-monitor stale-socket restart           | logged in (1.7 s) |
| 12:34:44 | health-monitor stale-socket restart           | stuck |
| 13:34, 13:39, 14:04 | manual full process restart        | stuck |
| 15:29:27 | health-monitor restart                        | stuck |
| 16:04:27 | health-monitor restart                        | logged in |
| 16:23:53 | SIGTERM → supervisor respawn (config reload)  | stuck |
| 16:42:24 | SIGUSR1 full-process restart                  | stuck |
| 17:10:59 | `kill -9` + supervisor respawn                | stuck |
| ~17:30   | full Mac reboot                               | stuck |
| 17:29:27, 17:40:14, 18:13:27, 18:14:32 | various reload restarts | stuck |
| 18:35:54 | supervisor restart                            | logged in |

Two `Gateway websocket closed: 1006` events (12:41:00 and 12:55:11 EDT) immediately preceded the broken streak; nothing similar fires now.

`gateway.err.log` has zero `discord: gateway was not ready after 15000ms` entries across the whole day.

What we ruled out

Routing/binding gap • embedded-agent auth • Message Content Intent • IDENTIFY rate-limit (969/1000 remaining at probe time) • token validity (REST works) • macOS sleep • orphaned processes • on-disk session state • network-layer reachability.

Workaround

None reliable. health-monitor's in-process channel hot-reload sometimes recovers a wedged gateway, but only when there's a live gateway to hot-reload — which is exactly what we don't have post-restart. Restarting in a loop until you "win the race" works in ~1/5 attempts.
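If that restart loop has to be run by hand today, it can at least be scripted. A hedged sketch, assuming the log path and log strings quoted in this report and the `openclaw gateway restart` command from the repro steps:

```shell
#!/bin/sh
# Restart the gateway until Discord actually logs in, up to MAX_TRIES times.
# Log path and strings are the ones quoted in this report; adjust as needed.
LOG="${LOG:-$HOME/.openclaw/logs/gateway.log}"
MAX_TRIES="${MAX_TRIES:-10}"

# Returns 0 if the recent log tail shows a successful Discord login.
discord_logged_in() {
  tail -n 200 "$1" | grep -q "logged in to discord"
}

main() {
  i=1
  while [ "$i" -le "$MAX_TRIES" ]; do
    openclaw gateway restart
    sleep 30   # give "[gateway] ready" plus the Discord handshake time to land
    if discord_logged_in "$LOG"; then
      echo "Discord login succeeded on attempt $i"
      return 0
    fi
    echo "Attempt $i: stuck at 'awaiting gateway readiness'; retrying"
    i=$((i + 1))
  done
  echo "Gave up after $MAX_TRIES attempts" >&2
  return 1
}

# Uncomment to run:
# main
```

At the ~20% per-attempt success rate observed here, 10 attempts succeed with probability about 1 − 0.8^10 ≈ 89%, so this is a stopgap, not a fix.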

Steps to reproduce

See "Reproducer" above; every process-level restart path (`kill -9`, `openclaw gateway restart`, SIGUSR1, SIGTERM+respawn, supervisor reload, full Mac reboot) reproduces it, with roughly 1 in 5 restarts escaping the wedge.

Expected behavior

Messages sent to the Discord channel assigned to the OpenClaw agent should reach the agent and receive replies after any gateway restart.

Actual behavior

See "Symptom" above: after almost any gateway restart the plugin logs `awaiting gateway readiness` and then nothing further; inbound Discord events never arrive, while `openclaw status --deep` still self-reports Discord as "OK".

OpenClaw version

4.21

Operating system

macOS Tahoe 26.4.1

Install method

nom

Model

Claude Opus 4.7

Provider / routing chain

openclaw, Claude opus 4.7

Additional provider/model setup details

No response

Logs, screenshots, and evidence

See "Suspected bug in `provider-CraktAkD.js`", "Data points & log excerpts", "What we ruled out", and "Workaround" above.

Impact and severity

Discord is completely unusable with this release on this server. Telegram works fine.

Additional information

We have confirmed this feature works fine after the update to 4.21 on an NVIDIA DGX Spark, and it appeared to work briefly on macOS. It continues to work fine on the Spark (against a different Discord server).

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), duplicate (This issue or pull request already exists), regression (Behavior that previously worked and now fails)
