
[Bug]: Discord plugin: post-restart gateway wedges at "awaiting gateway readiness" on macOS (no 15s timeout fires) #70841

@HeilbronAILabs

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

On macOS, restarting the OpenClaw gateway often leaves the Discord plugin permanently stuck at `awaiting gateway readiness`. The underlying TCP/TLS connection to gateway.discord.gg is established, but the Discord IDENTIFY → READY exchange in `carbonGateway.GatewayPlugin` never completes. The plugin's wait-for-ready guard returns "ready" anyway because of a `?? true` nullish default, so the documented `discord: gateway was not ready after 15000ms` error never fires and the plugin proceeds with a non-functional gateway. Inbound Discord messages are silently dropped (outbound REST still works).

The failure is intermittent (~20% of restarts succeed on the affected host), so it looks like a race in the carbon gateway's initial handshake that is lost far more often on this Mac than on Linux.

Environment

  • OpenClaw: 2026.4.21 (f788c88)
  • Host OS: macOS 26.4.1 (build 25E253), Darwin 25.4.0
  • Node: v25.9.0
  • Plugins enabled: acpx, browser, device-pair, discord, phone-control, talk-voice, telegram
  • Other host (works): DGX Spark / Linux, same OpenClaw 2026.4.21 — no reproduction

Symptom

After (almost any) gateway restart on the affected Mac:

```
[gateway] ready (7 plugins: ...; 11.7s)
[discord] [default] starting provider (@bot…)
[discord] client initialized as <bot-id> (bot<bot-id>); awaiting gateway readiness
```

…and that's it. No further Discord log entries. No `MESSAGE_CREATE`, no dispatched events. `openclaw status --deep` self-reports Discord as "OK" because it inspects local token state, not gateway liveness.

Restart success rate today on this host (from gateway.log): 20 successful logins / 101 starts ≈ 20%.

Reproducer

On the affected macOS host, with Discord plugin configured and previously working:

  1. Send message to a Discord channel the bot is in → reply received from agent (gateway is healthy).
  2. Trigger any process-level restart: kill -9 <gateway-pid>, openclaw gateway restart, SIGUSR1, SIGTERM+respawn, supervisor reload, or full Mac reboot.
  3. Wait for [gateway] ready ... to appear in ~/.openclaw/logs/gateway.log.
  4. Within ~1s, [discord] client initialized as ...; awaiting gateway readiness is logged.
  5. Send a Discord message → no reply, no inbound Discord events in the log.

Roughly 1 in 5 restarts skips this and reaches [discord] logged in to discord as <bot-id> instead.

The only reliable recovery path observed is an in-process channel-level hot-reload from an already-healthy gateway (e.g. health-monitor stale-socket recovery). All process-level restart paths land in the stuck state with the same probability.

Evidence the WS layer is fine

  • Bare-metal probe from the same host with the same bot token: WS open in 141 ms, IDENTIFY → READY in 1.6 s, GUILD_CREATE arrives. So the network path, token, and intents are all good.
  • lsof -p <gateway-pid> -i on a stuck gateway shows an ESTABLISHED TCP to 162.159.135.234:443 (= gateway.discord.gg). The TLS/WSS handshake completed; the wedge is at the Discord protocol layer (HELLO/IDENTIFY/READY), inside the carbon library.
  • No duplicate processes, no on-disk Discord/resume state to clear, no system sleep around the failures, intents (incl. Message Content) confirmed enabled.
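For reference, the "bare-metal probe" mentioned above can be approximated with Node ≥ 22's built-in `WebSocket`, no libraries needed. This is a hedged sketch: the URL, opcodes, and payload shapes follow the public Discord Gateway v10 protocol, but the logging and structure are ours, and real code should also handle heartbeat ACKs (op 11) and resume:

```javascript
// Minimal Discord gateway handshake probe (sketch, not production code).
// Run with DISCORD_TOKEN set; prints how long WS-open and READY take.
const GATEWAY_URL = "wss://gateway.discord.gg/?v=10&encoding=json";

// GUILDS (1<<0) | GUILD_MESSAGES (1<<9) | MESSAGE_CONTENT (1<<15)
const INTENTS = (1 << 0) | (1 << 9) | (1 << 15);

function buildIdentify(token, intents = INTENTS) {
  return {
    op: 2, // IDENTIFY
    d: {
      token,
      intents,
      properties: { os: process.platform, browser: "probe", device: "probe" },
    },
  };
}

function probe(token) {
  const t0 = Date.now();
  const ws = new WebSocket(GATEWAY_URL);
  ws.addEventListener("open", () => console.log(`WS open in ${Date.now() - t0} ms`));
  ws.addEventListener("message", (ev) => {
    const msg = JSON.parse(ev.data);
    if (msg.op === 10) {
      // HELLO: start heartbeating, then IDENTIFY
      setInterval(() => ws.send(JSON.stringify({ op: 1, d: null })), msg.d.heartbeat_interval);
      ws.send(JSON.stringify(buildIdentify(token)));
    } else if (msg.t === "READY") {
      console.log(`IDENTIFY -> READY in ${Date.now() - t0} ms`);
    } else if (msg.t === "GUILD_CREATE") {
      console.log("GUILD_CREATE arrived; handshake fully healthy");
      process.exit(0);
    }
  });
}

if (process.env.DISCORD_TOKEN) probe(process.env.DISCORD_TOKEN);
```

If a probe like this completes READY while the plugin stays wedged, the fault is isolated to the plugin/carbon layer rather than the host or token.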

Suspected bug in provider-CraktAkD.js

waitForGatewayReady (lines 6049–6090 of dist/extensions/discord/provider-CraktAkD.js) starts roughly:

```js
async function waitForGatewayReady(params) {
  if (params.gateway?.isConnected ?? true) return;   // line ~6056
  // ...poll up to DISCORD_GATEWAY_READY_TIMEOUT_MS (15000ms, line 5849)
  // ...else throw "discord: gateway was not ready after 15000ms"
}
```

The ?? true short-circuit means: if params.gateway is undefined, the function reports "ready" immediately. This is what we believe is happening on the wedged restarts — no carbon gateway instance has been wired in by the time waitForGatewayReady runs, so it returns success without waiting and without throwing.
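The footgun is easy to demonstrate in isolation. The snippet below mimics the guard's expression outside OpenClaw (function names are ours): optional chaining on a missing object yields `undefined`, which `??` then replaces with `true`:

```javascript
// Mirrors `params.gateway?.isConnected ?? true` from the bundle.
const looksReady = (params) => Boolean(params.gateway?.isConnected ?? true);

console.log(looksReady({}));                                  // true  <- bug: no gateway at all
console.log(looksReady({ gateway: { isConnected: false } })); // false <- the only case that waits
console.log(looksReady({ gateway: { isConnected: true } }));  // true

// The `=== true` variant proposed below fails closed instead:
const isReady = (params) => params.gateway?.isConnected === true;
console.log(isReady({}));                                     // false
```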

That hypothesis is consistent with the second observation: the documented timeout error

```
discord: gateway was not ready after 15000ms
```

never appears in either gateway.log or gateway.err.log for any of the broken restarts in the past 24 h. If the wait actually ran against a real-but-not-ready gateway, we would expect this log line.

The downstream formatDiscordStartupStatusMessage (line ~7024) then logs:

```js
gatewayReady: lifecycleGateway?.isConnected === true
```

which evaluates to false (because lifecycleGateway is undefined), producing the awaiting gateway readiness log line — but the plugin lifecycle has already proceeded past the readiness gate.

Suggested fix(es)

  1. Tighten the readiness guard. Replace `params.gateway?.isConnected ?? true` with `params.gateway?.isConnected === true` (matching the check at line 7024). If `params.gateway` is missing, the plugin should fail loudly, not silently report ready.
  2. Validate `params.gateway` is non-null at the call site (line ~6131). If the lifecycle is invoking `waitForGatewayReady` before a gateway instance has been attached, that is the upstream bug to fix. Either way, the wait should not silently pass.
  3. Consider whether the carbon gateway IDENTIFY can wedge when it loses a race during initial connect on macOS / Node 25, and whether a bounded retry with a full reconnect (rather than just polling `isConnected`) is appropriate.
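Fixes (1) and (2) could be combined along these lines. This is an illustrative sketch, not the bundle's actual code: the timeout constant and error string come from the report above, while `POLL_INTERVAL_MS` and the exact polling shape are assumptions:

```javascript
const DISCORD_GATEWAY_READY_TIMEOUT_MS = 15000; // per line 5849 of the bundle
const POLL_INTERVAL_MS = 250;                   // assumed; actual interval not verified

async function waitForGatewayReady(params) {
  if (!params.gateway) {
    // Fix (2): a missing gateway is an upstream lifecycle bug -- fail loudly.
    throw new Error("discord: waitForGatewayReady called before a gateway was attached");
  }
  const deadline = Date.now() + DISCORD_GATEWAY_READY_TIMEOUT_MS;
  // Fix (1): only an explicit `isConnected === true` counts as ready.
  while (params.gateway.isConnected !== true) {
    if (Date.now() >= deadline) {
      throw new Error(`discord: gateway was not ready after ${DISCORD_GATEWAY_READY_TIMEOUT_MS}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}
```

With this shape, a wedged restart would surface the documented timeout error instead of silently proceeding, which also makes the failure visible to log-based monitoring.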

Data points & log excerpts

Today's restarts on the affected host (2026-04-23, all EDT):

| Time     | Trigger                                       | Result |
|----------|-----------------------------------------------|--------|
| 11:59:44 | health-monitor stale-socket restart           | logged in (1.7 s) |
| 12:34:44 | health-monitor stale-socket restart           | stuck |
| 13:34, 13:39, 14:04 | manual full process restart        | stuck |
| 15:29:27 | health-monitor restart                        | stuck |
| 16:04:27 | health-monitor restart                        | logged in |
| 16:23:53 | SIGTERM → supervisor respawn (config reload)  | stuck |
| 16:42:24 | SIGUSR1 full-process restart                  | stuck |
| 17:10:59 | `kill -9` + supervisor respawn                | stuck |
| ~17:30   | full Mac reboot                               | stuck |
| 17:29:27, 17:40:14, 18:13:27, 18:14:32 | various reload restarts | stuck |
| 18:35:54 | supervisor restart                            | logged in |

Two `Gateway websocket closed: 1006` events (12:41:00 and 12:55:11 EDT) immediately preceded the broken streak; nothing similar fires now.

`gateway.err.log` has zero `discord: gateway was not ready after 15000ms` entries across the whole day.

What we ruled out

Routing/binding gap • embedded-agent auth • Message Content Intent • IDENTIFY rate-limit (969/1000 remaining at probe time) • token validity (REST works) • macOS sleep • orphaned processes • on-disk session state • network-layer reachability.

Workaround

None reliable. health-monitor's in-process channel hot-reload sometimes recovers a wedged gateway, but only when there's a live gateway to hot-reload — which is exactly what we don't have post-restart. Restarting in a loop until you "win the race" works in ~1/5 attempts.
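If that restart loop has to be run by hand today, it can at least be scripted. A hedged sketch, assuming the log path and log strings quoted in this report and the `openclaw gateway restart` command from the repro steps:

```shell
#!/bin/sh
# Restart the gateway until Discord actually logs in, up to MAX_TRIES times.
# Log path and strings are the ones quoted in this report; adjust as needed.
LOG="${LOG:-$HOME/.openclaw/logs/gateway.log}"
MAX_TRIES="${MAX_TRIES:-10}"

# Returns 0 if the recent log tail shows a successful Discord login.
discord_logged_in() {
  tail -n 200 "$1" | grep -q "logged in to discord"
}

main() {
  i=1
  while [ "$i" -le "$MAX_TRIES" ]; do
    openclaw gateway restart
    sleep 30   # give "[gateway] ready" plus the Discord handshake time to land
    if discord_logged_in "$LOG"; then
      echo "Discord login succeeded on attempt $i"
      return 0
    fi
    echo "Attempt $i: stuck at 'awaiting gateway readiness'; retrying"
    i=$((i + 1))
  done
  echo "Gave up after $MAX_TRIES attempts" >&2
  return 1
}

# Uncomment to run:
# main
```

At the ~20% per-attempt success rate observed here, 10 attempts succeed with probability about 1 − 0.8^10 ≈ 89%, so this is a stopgap, not a fix.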

Steps to reproduce

See "Reproducer" above; every process-level restart path (`kill -9`, `openclaw gateway restart`, SIGUSR1, SIGTERM+respawn, supervisor reload, full Mac reboot) reproduces it, with roughly 1 in 5 restarts escaping the wedge.

Expected behavior

Messages sent to the Discord channel assigned to the OpenClaw agent should reach the agent and receive replies after any gateway restart.

Actual behavior

See "Symptom" above: after almost any gateway restart the plugin logs `awaiting gateway readiness` and then nothing further; inbound Discord events never arrive, while `openclaw status --deep` still self-reports Discord as "OK".

OpenClaw version

4.21

Operating system

macOS Tahoe 26.4.1

Install method

nom

Model

Claude Opus 4.7

Provider / routing chain

openclaw, Claude opus 4.7

Additional provider/model setup details

No response

Logs, screenshots, and evidence

See "Suspected bug in `provider-CraktAkD.js`", "Data points & log excerpts", "What we ruled out", and "Workaround" above.

Impact and severity

Discord is completely unusable with this release on this server. Telegram works fine.

Additional information

We have confirmed this feature works fine after the update to 4.21 on an NVIDIA DGX Spark, and it appeared to work briefly on macOS. It continues to work fine on the Spark (against a different Discord server).

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working), duplicate (This issue or pull request already exists), regression (Behavior that previously worked and now fails)
