fix(discord): prevent Identify silent-drop race in gateway startup#53039
fix(discord): prevent Identify silent-drop race in gateway startup#53039IVY-AI-gif wants to merge 3 commits intoopenclaw:mainfrom
Conversation
… identify() silent-drop race Carbon's Client constructor calls plugin.registerClient() without awaiting the returned promise. SafeGatewayPlugin.registerClient contains an async gateway-info fetch that yields control before super.registerClient (which sets this.client) is reached. If the lifecycle readiness-timeout handler calls gateway.connect() during this window, the resulting Hello -> identify() flow finds this.client === undefined and silently returns without sending an Identify payload. The gateway never receives READY, isConnected stays false, and the Discord bot is permanently stuck at "awaiting gateway readiness". Fix: - Set this.client = client at the top of registerClient, before the async fetch, so identify() always has a valid client ref. - Skip super.registerClient when an external connect() has already been triggered to avoid tearing down a live WebSocket. Closes openclaw#52372
|
Hi maintainers! 👋 This is a minimal fix for the Discord gateway stuck-at-"awaiting gateway readiness" issue reported in #52372. The root cause is a race condition between the async The change is intentionally small (two additions: an early Thanks for your time reviewing! 🙏 |
Greptile SummaryThis PR fixes a race condition in
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/discord/src/monitor/gateway-plugin.ts
Line: 267-269
Comment:
**Fragile private-property access via double type cast**
The guard reads `ws` and `isConnecting` through `as unknown as { ws?: unknown; isConnecting?: boolean }`, bypassing TypeScript's access checks to reach what are effectively private fields on Carbon's `GatewayPlugin`. If Carbon ever renames or removes either field (e.g., `_ws`, `socket`, `connecting`), the expression will silently evaluate to `false || false` and `super.registerClient` will be called while a live WebSocket is already open — re-introducing the double-connect teardown this guard is meant to prevent.
Fix #1 (`this.client = client` at the top) is the principal protection against the permanent-stuck state, so the worst-case degraded behavior here is an unnecessary reconnection rather than a hang. Still, a comment noting the exact Carbon source locations that define these fields (or a reference to the Carbon version pinned in `package.json`) would make a future breakage much easier to diagnose.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/discord/src/monitor/gateway-plugin.ts
Line: 239-272
Comment:
**No test coverage for the race-condition guard**
The test file (`provider.proxy.test.ts`) mocks `GatewayPlugin` without `ws` or `isConnecting` fields, so the guard on lines 267–270 is never exercised by the suite. More critically, none of the tests assert that `this.client` is assigned _before_ the async `fetchDiscordGatewayInfoWithTimeout` call returns — which is the core invariant that prevents the identify silent-drop.
A regression test that:
1. delays the fetch mock so it resolves after a simulated `connect()` call,
2. confirms `baseRegisterClientSpy` is **not** called (guard fires), and
3. confirms `plugin.client` is already set by the time `connect()` runs
would give confidence that the fix holds across future refactors and Carbon upgrades.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(discord): set client reference early..." | Re-trigger Greptile |
| const gatewayState = this as unknown as { ws?: unknown; isConnecting?: boolean }; | ||
| if (gatewayState.ws || gatewayState.isConnecting) { | ||
| return; |
There was a problem hiding this comment.
Fragile private-property access via double type cast
The guard reads ws and isConnecting through as unknown as { ws?: unknown; isConnecting?: boolean }, bypassing TypeScript's access checks to reach what are effectively private fields on Carbon's GatewayPlugin. If Carbon ever renames or removes either field (e.g., _ws, socket, connecting), the expression will silently evaluate to false || false and super.registerClient will be called while a live WebSocket is already open — re-introducing the double-connect teardown this guard is meant to prevent.
Fix #1 (this.client = client at the top) is the principal protection against the permanent-stuck state, so the worst-case degraded behavior here is an unnecessary reconnection rather than a hang. Still, a comment noting the exact Carbon source locations that define these fields (or a reference to the Carbon version pinned in package.json) would make a future breakage much easier to diagnose.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/gateway-plugin.ts
Line: 267-269
Comment:
**Fragile private-property access via double type cast**
The guard reads `ws` and `isConnecting` through `as unknown as { ws?: unknown; isConnecting?: boolean }`, bypassing TypeScript's access checks to reach what are effectively private fields on Carbon's `GatewayPlugin`. If Carbon ever renames or removes either field (e.g., `_ws`, `socket`, `connecting`), the expression will silently evaluate to `false || false` and `super.registerClient` will be called while a live WebSocket is already open — re-introducing the double-connect teardown this guard is meant to prevent.
Fix #1 (`this.client = client` at the top) is the principal protection against the permanent-stuck state, so the worst-case degraded behavior here is an unnecessary reconnection rather than a hang. Still, a comment noting the exact Carbon source locations that define these fields (or a reference to the Carbon version pinned in `package.json`) would make a future breakage much easier to diagnose.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Good catch — added a comment in 8cddf6e documenting the exact Carbon source locations (GatewayPlugin.ts lines 49 and 62 for ws and isConnecting) and explicitly noting the degradation behavior: if Carbon renames these fields, the guard becomes a no-op and super.registerClient runs, which may cause a brief unnecessary reconnect but not a permanent hang (Fix #1 guarantees this.client is set regardless).
| override async registerClient(client: Parameters<GatewayPlugin["registerClient"]>[0]) { | ||
| // Set client reference immediately so that identify() works even if | ||
| // the lifecycle timeout handler calls connect() before this async | ||
| // method finishes. Carbon's Client constructor does not await | ||
| // registerClient(), so there is a window where an external connect() | ||
| // call reaches identify() while this.client is still undefined — | ||
| // causing the Identify payload to be silently dropped and the | ||
| // gateway to never reach READY. | ||
| this.client = client; | ||
|
|
||
| if (!this.gatewayInfo || this.gatewayInfoUsedFallback) { | ||
| const resolved = await fetchDiscordGatewayInfoWithTimeout({ | ||
| token: client.options.token, | ||
| fetchImpl: params.fetchImpl, | ||
| fetchInit: params.fetchInit, | ||
| }) | ||
| .then((info) => ({ | ||
| info, | ||
| usedFallback: false, | ||
| })) | ||
| .catch((error) => resolveGatewayInfoWithFallback({ runtime: params.runtime, error })); | ||
| this.gatewayInfo = resolved.info; | ||
| this.gatewayInfoUsedFallback = resolved.usedFallback; | ||
| } | ||
|
|
||
| // If an external caller (e.g. the lifecycle readiness-timeout handler) | ||
| // already triggered connect() while we were fetching gateway metadata, | ||
| // skip super.registerClient to avoid tearing down the live WebSocket. | ||
| const gatewayState = this as unknown as { ws?: unknown; isConnecting?: boolean }; | ||
| if (gatewayState.ws || gatewayState.isConnecting) { | ||
| return; | ||
| } | ||
|
|
||
| return super.registerClient(client); |
There was a problem hiding this comment.
No test coverage for the race-condition guard
The test file (provider.proxy.test.ts) mocks GatewayPlugin without ws or isConnecting fields, so the guard on lines 267–270 is never exercised by the suite. More critically, none of the tests assert that this.client is assigned before the async fetchDiscordGatewayInfoWithTimeout call returns — which is the core invariant that prevents the identify silent-drop.
A regression test that:
- delays the fetch mock so it resolves after a simulated
connect()call, - confirms
baseRegisterClientSpyis not called (guard fires), and - confirms
plugin.clientis already set by the timeconnect()runs
would give confidence that the fix holds across future refactors and Carbon upgrades.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/gateway-plugin.ts
Line: 239-272
Comment:
**No test coverage for the race-condition guard**
The test file (`provider.proxy.test.ts`) mocks `GatewayPlugin` without `ws` or `isConnecting` fields, so the guard on lines 267–270 is never exercised by the suite. More critically, none of the tests assert that `this.client` is assigned _before_ the async `fetchDiscordGatewayInfoWithTimeout` call returns — which is the core invariant that prevents the identify silent-drop.
A regression test that:
1. delays the fetch mock so it resolves after a simulated `connect()` call,
2. confirms `baseRegisterClientSpy` is **not** called (guard fires), and
3. confirms `plugin.client` is already set by the time `connect()` runs
would give confidence that the fix holds across future refactors and Carbon upgrades.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Agreed — the race-condition guard deserves a dedicated regression test. I'll add one that:
- Delays the fetch mock so it resolves after a simulated
connect()call setsws - Asserts
super.registerClientis not called (guard fires) - Asserts
plugin.clientis already set by the timeconnect()runs
Will push the test in a follow-up commit.
There was a problem hiding this comment.
Added two regression tests in 82e7c91:
sets client reference before the async gateway-info fetch completes— delays the fetch mock, confirmsplugin.clientis already set before the fetch resolvesskips super.registerClient when an external connect() sets ws during fetch— simulates the lifecycle timeout handler settingwsduring the fetch window, confirmssuper.registerClientis NOT called
Also added client/ws/isConnecting fields to the mock GatewayPlugin to support the guard logic.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 23a2e88e81
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (gatewayState.ws || gatewayState.isConnecting) { | ||
| return; |
There was a problem hiding this comment.
Keep Carbon's interaction setup when skipping reconnect
When the readiness-timeout path calls connect(false) before the metadata fetch finishes, this early return skips all of Carbon's GatewayPlugin.registerClient() side effects, not just the second connect(). The upstream base method also registers InteractionEventListener when autoInteractions is enabled, and this provider explicitly enables that mode while wiring commands/components/modals in extensions/discord/src/monitor/provider.ts:740-799. In the same race this patch is trying to handle, the gateway will now come up without that listener, so slash commands, buttons, and modals stop working after startup/restart even though the socket is connected.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Thanks for catching this — verified against Carbon's GatewayPlugin.registerClient() source. The base method does indeed register InteractionEventListener when autoInteractions: true, along with shard info setup on the client, before calling this.connect().
When the guard fires and skips super.registerClient(), the following side effects are missed:
InteractionEventListenerregistration — affects slash commands, buttons, modals- Shard info setup —
client.shardId/client.totalShards
The guard is designed to fire in a narrow race-condition window (when an external connect() is already in progress during the async gateway-info fetch). Fix #1 (this.client = client at the top) is the primary fix for the IDENTIFY hang and works regardless of whether the guard fires.
That said, the concern about interaction handling being silently broken in the guard path is valid. Happy to push a follow-up that explicitly handles InteractionEventListener registration and shard info before the early return, so the guard only skips the redundant connect() call while preserving all other side effects. Let me know if you'd like that.
Address Greptile review: document the exact Carbon source locations (GatewayPlugin.ts lines 49, 62) for the ws/isConnecting fields used in the double-connect guard, and note degradation behavior if Carbon renames them.
…guard Add two tests exercising the core race-condition fix: 1. "sets client reference before the async gateway-info fetch completes" — verifies this.client is assigned immediately at the top of registerClient, before the fetch resolves. 2. "skips super.registerClient when an external connect() sets ws during fetch" — simulates the lifecycle timeout handler calling connect() while registerClient is still awaiting the fetch, confirms super.registerClient is NOT called (guard fires). Also adds client/ws/isConnecting fields to the mock GatewayPlugin to support the guard logic.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82e7c918e7
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const gatewayState = this as unknown as { ws?: unknown; isConnecting?: boolean }; | ||
| if (gatewayState.ws || gatewayState.isConnecting) { | ||
| return; |
There was a problem hiding this comment.
Move this reconnect guard into the reachable startup path
The new ws/isConnecting fast-path only helps if gateway.connect(...) can run before registerClient() finishes fetching /gateway/bot, but the repo’s own startup order makes that impossible. fetchDiscordGatewayInfoWithTimeout() always resolves or falls back within 10 seconds, while the only production reconnect callers are provider.lifecycle.ts:269 and provider.lifecycle.ts:356, and they do not run until runDiscordGatewayLifecycle() is entered at provider.ts:1011 after deployDiscordCommands() and client.fetchUser("@me") have both completed. On the reported startup-not-ready path this branch never fires, so the silent-Identify hang remains unfixed even though this commit adds a guard for it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good analysis of the startup ordering. You're right that under normal conditions, fetchDiscordGatewayInfoWithTimeout() resolves well before the lifecycle timeout handler calls connect() — so the guard rarely fires on the happy path.
The guard is intentionally a defensive safety net rather than the primary fix:
-
Fix fix: add @lid format support and allowFrom wildcard handling #1 (
this.client = clientat the top ofregisterClient) is the primary fix for the reported silent-IDENTIFY hang. It ensuresidentify()always has access tothis.clientregardless of when it runs or who triggers it. Fix fix: add @lid format support and allowFrom wildcard handling #1 alone resolves the issue described in Discord WSS Gateway never sends Identify after WebSocket opened (Linux) #52372. -
Fix Login fails with 'WebSocket Error (socket hang up)' ECONNRESET #2 (the guard) protects against a secondary scenario: if the gateway-info fetch is unusually slow (DNS resolution hang, network timeouts approaching the 10 s fallback boundary),
super.registerClient()would callthis.connect()after an external caller already established a live WebSocket — tearing it down and causing a brief disruption. The guard prevents this at zero overhead when it doesn't fire.
Both fixes together provide defense-in-depth. The bot successfully connects on restart with this patch applied (confirmed on the reporting user's production server), whereas before it would permanently hang at "awaiting gateway readiness".
|
I tested this patch on a live 4-bot setup (v2026.3.23 compiled JS, patched with this PR's changes). Detailed results in #53132. TL;DR: This fix is necessary and improves things (from 0–2/4 bots connecting to a consistent 2/4), but there appears to be a second race condition in the Carbon beta that prevents the remaining bots from reaching READY even with Test results with and without the patch:
The second issue is per-bot (not a multi-bot race) — adding a 5s stagger between account starts didn't help. Likely a separate race inside This PR should still be merged — it fixes a real race. But a second fix (likely in Carbon itself) is needed to fully resolve the startup hang. |
|
Correction to my earlier comment: this PR IS the complete fix. I added diagnostic instrumentation to Carbon's GatewayPlugin on a 4-bot setup (v2026.3.23 + this PR's patch). Results:
My earlier report of "2/4 bots stuck" was wrong — I was misinterpreting OpenClaw's "awaiting gateway readiness" log message as a failure. It's a cosmetic timing issue: the status log fires ~200ms before READY arrives, but the lifecycle READY wait picks it up immediately. All 4 bots are fully functional. Full diagnostic trace and analysis in #53132. |
Supporting evidence: runtime diagnostic traceInstrumented Carbon's Startup sequence (all timestamps within 700ms)Post-startup (heartbeats continue normally for all 4 bots)Verified: all 4 agents respond to Discord messagesManually tested by messaging each bot (Alfred, Krause, Donna, Gilfoyle) on Discord after this startup. All 4 responded. Contrast: without this PR's fixWithout the eager Test environment
|
|
Superseded by #68159. This branch had diverged from #68159 re-applies the same two-line logical fix (early Closing this PR in favor of #68159 to keep the review surface clean. #52372 tracks the underlying issue. |
|
Closing in favor of #68159 (rebased on current main). |
What
Fixes the Discord gateway getting permanently stuck at "awaiting gateway readiness" after restart — the bot connects to the WebSocket but never sends an Identify (op 2) payload after receiving Hello (op 10).
Closes #52372
Root Cause
Carbon's
Clientconstructor callsplugin.registerClient(this)without awaiting the returned promise:OpenClaw's
SafeGatewayPlugin.registerClientcontains an async gateway-info fetch (fetchDiscordGatewayInfoWithTimeout, up to 10s) that yields control beforesuper.registerClient(which setsthis.client) is reached.If the lifecycle's 15-second readiness timeout fires during this window and calls
gateway.connect(false):identify()→ first line:if (!this.client) return→ silently returnsisConnectedstaysfalseforeverThis explains both symptoms reported in #52372:
registerClientdoesn't finish before the first HelloInvalidSessionresponse, which delays the successful Identify past the readiness timeout — the forced reconnect hits the samethis.client === undefinedraceFix
Two changes in
SafeGatewayPlugin.registerClient:Set
this.client = clientimmediately at the top ofregisterClient, before the async fetch. This ensuresidentify()always has a valid client reference, even whenconnect()is called externally.Skip
super.registerClientwhen an externalconnect()has already been triggered (detected by checkingthis.wsorthis.isConnecting). This prevents tearing down a live WebSocket that was established by the lifecycle timeout handler.Testing
provider.proxy.test.tstests cover the SafeGatewayPlugin crash-guard behavioridentify()→send()path now works whenconnect()is called beforeregisterClientfinishesRelated