Bug type
Regression (worked before, now fails)
Beta release blocker
No
Summary
On macOS, restarting the OpenClaw gateway often leaves the Discord plugin permanently stuck at `awaiting gateway readiness`. The underlying TCP/TLS connection to `gateway.discord.gg` is established, but the Discord IDENTIFY → READY exchange in `carbonGateway.GatewayPlugin` never completes. The plugin's wait-for-ready guard returns "ready" anyway because of a `?? true` nullish default, so the documented `discord: gateway was not ready after 15000ms` error never fires and the plugin proceeds with a non-functional gateway. Inbound Discord messages are silently dropped (outbound REST still works).
The failure is intermittent (~20% of restarts succeed on the affected host), so it looks like a race condition in the carbon gateway's initial handshake that lands badly far more often on this Mac than on Linux.
Environment
| Component | Details |
|-----------|---------|
| OpenClaw | 2026.4.21 (`f788c88`) |
| Host OS | macOS 26.4.1 (build 25E253), Darwin 25.4.0 |
| Node | v25.9.0 |
| Plugins enabled | acpx, browser, device-pair, discord, phone-control, talk-voice, telegram |
| Other host (works) | DGX Spark / Linux, same OpenClaw 2026.4.21 — no reproduction |
Symptom
After (almost any) gateway restart on the affected Mac:
```
[gateway] ready (7 plugins: ...; 11.7s)
[discord] [default] starting provider (@bot…)
[discord] client initialized as <bot-id> (bot<bot-id>); awaiting gateway readiness
```
…and that's it. No further Discord log entries. No `MESSAGE_CREATE`, no dispatched events. `openclaw status --deep` self-reports Discord as "OK" because it inspects local token state, not gateway liveness.
Restart success rate today on this host (from gateway.log): 20 successful logins / 101 starts ≈ 20%.
Reproducer
On the affected macOS host, with Discord plugin configured and previously working:
- Send a message to a Discord channel the bot is in → reply received from the agent (gateway is healthy).
- Trigger any process-level restart: `kill -9 <gateway-pid>`, `openclaw gateway restart`, SIGUSR1, SIGTERM + respawn, supervisor reload, or a full Mac reboot.
- Wait for `[gateway] ready ...` to appear in `~/.openclaw/logs/gateway.log`.
- Within ~1 s, `[discord] client initialized as ...; awaiting gateway readiness` is logged.
- Send a Discord message → no reply, no inbound Discord events in the log.

Roughly 1 in 5 restarts skips this and reaches `[discord] logged in to discord as <bot-id>` instead.
The only reliable recovery path observed is an in-process channel-level hot-reload from an already-healthy gateway (e.g. `health-monitor` stale-socket recovery). All process-level restart paths land in the stuck state with the same probability.
Evidence the WS layer is fine
- Bare-metal probe from the same host with the same bot token: WS open in 141 ms, IDENTIFY → READY in 1.6 s, GUILD_CREATE arrives. So the network path, token, and intents are all good.
- `lsof -p <gateway-pid> -i` on a stuck gateway shows an `ESTABLISHED` TCP connection to `162.159.135.234:443` (= `gateway.discord.gg`). The TLS/WSS handshake completed; the wedge is at the Discord protocol layer (HELLO/IDENTIFY/READY), inside the carbon library.
- No duplicate processes, no on-disk Discord/resume state to clear, no system sleep around the failures, and intents (incl. Message Content) confirmed enabled.
Suspected bug in provider-CraktAkD.js
`waitForGatewayReady` (lines 6049–6090 of `dist/extensions/discord/provider-CraktAkD.js`) starts roughly:

```js
async function waitForGatewayReady(params) {
  if (params.gateway?.isConnected ?? true) return; // line ~6056
  // ...poll up to DISCORD_GATEWAY_READY_TIMEOUT_MS (15000ms, line 5849)
  // ...else throw "discord: gateway was not ready after 15000ms"
}
```
The `?? true` short-circuit means: **if `params.gateway` is undefined, the function reports "ready" immediately.** This is what we believe is happening on the wedged restarts — no carbon gateway instance has been wired in by the time `waitForGatewayReady` runs, so it returns success without waiting and without throwing.
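This failure mode can be reproduced in isolation. A minimal sketch of the guard expression (only the `params` shape is taken from the report; the helper name is ours):

```javascript
// Reduced model of the suspected guard: the optional chain yields undefined
// when no gateway object is attached, and `?? true` turns that into "ready".
const guardSaysReady = (params) => params.gateway?.isConnected ?? true;

console.log(guardSaysReady({}));                                  // true  (no gateway wired in → silently "ready")
console.log(guardSaysReady({ gateway: { isConnected: false } })); // false (real gateway, not ready → would wait)
console.log(guardSaysReady({ gateway: { isConnected: true } }));  // true  (genuinely ready)
```

Only the middle case ever reaches the polling/timeout path, which matches the absence of the timeout error in the logs.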
That hypothesis is consistent with the second observation: the documented timeout error

```
discord: gateway was not ready after 15000ms
```

never appears in either `gateway.log` or `gateway.err.log` for any of the broken restarts in the past 24 h. If the wait actually ran against a real-but-not-ready gateway, we would expect this log line.
The downstream `formatDiscordStartupStatusMessage` (line ~7024) then logs:

```js
gatewayReady: lifecycleGateway?.isConnected === true
```

which evaluates to `false` (because `lifecycleGateway` is undefined), producing the `awaiting gateway readiness` log line — but the plugin lifecycle has already proceeded past the readiness gate.
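The two expressions disagree exactly in the missing-gateway case, which is why the status line can say "awaiting" while the gate has already passed. A small sketch of that divergence (helper names are ours; the two expressions are from the report):

```javascript
// The readiness gate (~line 6056) and the status log (~line 7024) evaluate
// the same state with different defaults for "no gateway attached".
const gateSaysReady = (gateway) => (gateway?.isConnected ?? true);   // readiness gate
const logSaysReady  = (gateway) => (gateway?.isConnected === true);  // status log

const gateway = undefined; // no carbon gateway instance wired in yet
console.log(gateSaysReady(gateway)); // true  → lifecycle proceeds past the gate
console.log(logSaysReady(gateway));  // false → log prints "awaiting gateway readiness"
```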
Suggested fix(es)
1. **Tighten the readiness guard.** Replace `params.gateway?.isConnected ?? true` with `params.gateway?.isConnected === true` (matching the check at line 7024). If `params.gateway` is missing, the plugin should fail loudly, not silently report ready.
2. **Validate `params.gateway` is non-null at the call site** (line ~6131). If the lifecycle is invoking `waitForGatewayReady` before a gateway instance has been attached, that is the upstream bug to fix. Either way, the wait should not silently pass.
3. **Consider whether the carbon gateway IDENTIFY can wedge** when it loses a race during initial connect on macOS Node 25, and whether a bounded retry with a full reconnect (rather than just polling `isConnected`) is appropriate.
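A minimal sketch of what fixes 1 and 2 could look like together. The function shape, timeout, and error string come from the report; the polling loop and parameter names are our assumptions, not the actual provider source:

```javascript
// Hypothetical tightened guard: a missing gateway throws immediately, and a
// real-but-not-ready gateway polls until the documented timeout, then throws.
async function waitForGatewayReadyFixed(params, timeoutMs = 15000, pollMs = 100) {
  if (!params.gateway) {
    // Fix 2: surface the upstream wiring bug instead of silently passing.
    throw new Error("discord: waitForGatewayReady called with no gateway attached");
  }
  const deadline = Date.now() + timeoutMs;
  // Fix 1: only an explicit `isConnected === true` counts as ready.
  while (params.gateway.isConnected !== true) {
    if (Date.now() >= deadline) {
      throw new Error(`discord: gateway was not ready after ${timeoutMs}ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

With this shape, the wedged restarts described above would have produced a loud startup error within 15 s instead of a silently dead plugin.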
Data points & log excerpts
Today's restarts on the affected host (2026-04-23, all EDT):
| Time | Trigger | Result |
|------|---------|--------|
| 11:59:44 | health-monitor stale-socket restart | logged in (1.7 s) |
| 12:34:44 | health-monitor stale-socket restart | stuck |
| 13:34, 13:39, 14:04 | manual full process restart | stuck |
| 15:29:27 | health-monitor restart | stuck |
| 16:04:27 | health-monitor restart | logged in |
| 16:23:53 | SIGTERM → supervisor respawn (config reload) | stuck |
| 16:42:24 | SIGUSR1 full-process restart | stuck |
| 17:10:59 | `kill -9` + supervisor respawn | stuck |
| ~17:30 | full Mac reboot | stuck |
| 17:29:27, 17:40:14, 18:13:27, 18:14:32 | various reload restarts | stuck |
| 18:35:54 | supervisor restart | logged in |
Two `Gateway websocket closed: 1006` events (12:41:00 and 12:55:11 EDT) immediately preceded the broken streak; nothing similar fires now.
`gateway.err.log` has zero `discord: gateway was not ready after 15000ms` entries across the whole day.
What we ruled out
Routing/binding gap • embedded-agent auth • Message Content Intent • IDENTIFY rate-limit (969/1000 remaining at probe time) • token validity (REST works) • macOS sleep • orphaned processes • on-disk session state • network-layer reachability.
Workaround
None reliable. `health-monitor`'s in-process channel hot-reload sometimes recovers a wedged gateway, but only when there's a live gateway to hot-reload — which is exactly what we don't have post-restart. Restarting in a loop until you "win the race" works in ~1/5 attempts.
Steps to reproduce
On the affected macOS host (Tahoe 26.4.1), with Discord plugin configured and previously working:
- Send a message to a Discord channel the bot is in → reply received from the agent (gateway is healthy).
- Trigger any process-level restart: `kill -9 <gateway-pid>`, `openclaw gateway restart`, SIGUSR1, SIGTERM + respawn, supervisor reload, or a full Mac reboot.
- Wait for `[gateway] ready ...` to appear in `~/.openclaw/logs/gateway.log`.
- Within ~1 s, `[discord] client initialized as ...; awaiting gateway readiness` is logged.
- Send a Discord message → no reply, no inbound Discord events in the log.

Roughly 1 in 5 restarts skips this and reaches `[discord] logged in to discord as <bot-id>` instead.
The only reliable recovery path observed is an in-process channel-level hot-reload from an already-healthy gateway (e.g. `health-monitor` stale-socket recovery). All process-level restart paths land in the stuck state with the same probability.
Evidence the WS layer is fine
- Bare-metal probe from the same host with the same bot token: WS open in 141 ms, IDENTIFY → READY in 1.6 s, GUILD_CREATE arrives. So the network path, token, and intents are all good.
- `lsof -p <gateway-pid> -i` on a stuck gateway shows an `ESTABLISHED` TCP connection to `162.159.135.234:443` (= `gateway.discord.gg`). The TLS/WSS handshake completed; the wedge is at the Discord protocol layer (HELLO/IDENTIFY/READY), inside the carbon library.
- No duplicate processes, no on-disk Discord/resume state to clear, no system sleep around the failures, and intents (incl. Message Content) confirmed enabled.
Expected behavior
The agent should receive and reply to messages sent via Discord on the channel assigned to the OpenClaw agent.
Actual behavior
After (almost any) gateway restart on the affected Mac:
```
[gateway] ready (7 plugins: ...; 11.7s)
[discord] [default] starting provider (@bot…)
[discord] client initialized as <bot-id> (bot<bot-id>); awaiting gateway readiness
```
…and that's it. No further Discord log entries. No `MESSAGE_CREATE`, no dispatched events. `openclaw status --deep` self-reports Discord as "OK" because it inspects local token state, not gateway liveness.
Restart success rate today on this host (from gateway.log): 20 successful logins / 101 starts ≈ 20%.
OpenClaw version
4.21
Operating system
macOS Tahoe 26.4.1
Install method
nom
Model
Claude Opus 4.7
Provider / routing chain
openclaw, Claude Opus 4.7
Additional provider/model setup details
No response
Logs, screenshots, and evidence
## Suspected bug in `provider-CraktAkD.js`
`waitForGatewayReady` (lines 6049–6090 of `dist/extensions/discord/provider-CraktAkD.js`) starts roughly:
```js
async function waitForGatewayReady(params) {
  if (params.gateway?.isConnected ?? true) return; // line ~6056
  // ...poll up to DISCORD_GATEWAY_READY_TIMEOUT_MS (15000ms, line 5849)
  // ...else throw "discord: gateway was not ready after 15000ms"
}
```
The `?? true` short-circuit means: **if `params.gateway` is undefined, the function reports "ready" immediately.** This is what we believe is happening on the wedged restarts — no carbon gateway instance has been wired in by the time `waitForGatewayReady` runs, so it returns success without waiting and without throwing.
That hypothesis is consistent with the second observation: the documented timeout error
```
discord: gateway was not ready after 15000ms
```
**never appears** in either `gateway.log` or `gateway.err.log` for any of the broken restarts in the past 24 h. If the wait actually ran against a real-but-not-ready gateway, we would expect this log line.
The downstream `formatDiscordStartupStatusMessage` (line ~7024) then logs:
```js
gatewayReady: lifecycleGateway?.isConnected === true
```
which evaluates to `false` (because `lifecycleGateway` is undefined), producing the `awaiting gateway readiness` log line — but the plugin lifecycle has already proceeded past the readiness gate.
## Suggested fix(es)
1. **Tighten the readiness guard.** Replace `params.gateway?.isConnected ?? true` with `params.gateway?.isConnected === true` (matching the check at line 7024). If `params.gateway` is missing, the plugin should fail loudly, not silently report ready.
2. **Validate `params.gateway` is non-null at call-site** (line ~6131). If the lifecycle is invoking `waitForGatewayReady` before a gateway instance has been attached, that's the upstream bug to fix. Either way the wait should not silently pass.
3. **Consider whether the carbon gateway IDENTIFY can wedge** when it loses a race during initial connect on macOS Node 25, and whether a bounded retry with full reconnect (rather than just polling `isConnected`) is appropriate.
## Data points & log excerpts
Today's restarts on the affected host (2026-04-23, all EDT):
| Time | Trigger | Result |
|----------|---------------------------------------------------|--------|
| 11:59:44 | health-monitor stale-socket restart | logged in (1.7 s) |
| 12:34:44 | health-monitor stale-socket restart | stuck |
| 13:34, 13:39, 14:04 | manual full process restart | stuck |
| 15:29:27 | health-monitor restart | stuck |
| 16:04:27 | health-monitor restart | logged in |
| 16:23:53 | SIGTERM → supervisor respawn (config reload) | stuck |
| 16:42:24 | SIGUSR1 full-process restart | stuck |
| 17:10:59 | `kill -9` + supervisor respawn | stuck |
| ~17:30 | full Mac reboot | stuck |
| 17:29:27, 17:40:14, 18:13:27, 18:14:32 | various reload restarts | stuck |
| 18:35:54 | supervisor restart | logged in |
Two `Gateway websocket closed: 1006` events (12:41:00 and 12:55:11 EDT) immediately preceded the broken streak; nothing similar fires now.
`gateway.err.log` has zero `discord: gateway was not ready after 15000ms` entries across the whole day.
## What we ruled out
Routing/binding gap • embedded-agent auth • Message Content Intent • IDENTIFY rate-limit (969/1000 remaining at probe time) • token validity (REST works) • macOS sleep • orphaned processes • on-disk session state • network-layer reachability.
## Workaround
None reliable. `health-monitor`'s in-process channel hot-reload sometimes recovers a wedged gateway, but only when there's a live gateway to hot-reload — which is exactly what we don't have post-restart. Restarting in a loop until you "win the race" works in ~1/5 attempts.
Impact and severity
Discord is completely unusable with this release on this server. Telegram works fine.
Additional information
We have confirmed this feature works fine after the update to 4.21 on an NVIDIA DGX Spark, and it appeared to work briefly on macOS. It continues to work fine on the Spark (connected to a different Discord server).