fix(discord): prevent Identify silent-drop race in gateway startup#68159
Conversation
Greptile SummaryThis PR fixes a real race condition where Carbon's Confidence Score: 5/5Safe to merge — the fix is minimal, well-scoped to SafeGatewayPlugin, and field-level test coverage directly verifies both race conditions. No logic changes to Carbon's public contract or any other shared surface. No P0 or P1 issues found. The early No files require special attention. Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.proxy.test.ts
Line: 476-481
Comment:
**`fetchResolve?.` optional chaining could hide setup failures**
`fetchResolve` is always assigned synchronously before `registerClient` suspends (the mock implementation runs before the first `await` in the fetch chain), so the optional chaining here won't mask a real gap. A definite call with a prior assertion would make the test fail faster and with a clearer message if the setup ever changes:
```suggestion
expect(fetchResolve).toBeDefined();
fetchResolve!({
ok: true,
status: 200,
text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
} as Response);
```
Same pattern applies to the second regression test at the corresponding `fetchResolve?.` call.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "test(discord): use as-unknown-as cast fo..." | Re-trigger Greptile |
| fetchResolve?.({ | ||
| ok: true, | ||
| status: 200, | ||
| text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }), | ||
| } as Response); | ||
| await registerPromise; |
There was a problem hiding this comment.
fetchResolve?. optional chaining could hide setup failures
fetchResolve is always assigned synchronously before registerClient suspends (the mock implementation runs before the first await in the fetch chain), so the optional chaining here won't mask a real gap. A definite call with a prior assertion would make the test fail faster and with a clearer message if the setup ever changes:
| fetchResolve?.({ | |
| ok: true, | |
| status: 200, | |
| text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }), | |
| } as Response); | |
| await registerPromise; | |
| expect(fetchResolve).toBeDefined(); | |
| fetchResolve!({ | |
| ok: true, | |
| status: 200, | |
| text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }), | |
| } as Response); |
Same pattern applies to the second regression test at the corresponding fetchResolve?. call.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.proxy.test.ts
Line: 476-481
Comment:
**`fetchResolve?.` optional chaining could hide setup failures**
`fetchResolve` is always assigned synchronously before `registerClient` suspends (the mock implementation runs before the first `await` in the fetch chain), so the optional chaining here won't mask a real gap. A definite call with a prior assertion would make the test fail faster and with a clearer message if the setup ever changes:
```suggestion
expect(fetchResolve).toBeDefined();
fetchResolve!({
ok: true,
status: 200,
text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
} as Response);
```
Same pattern applies to the second regression test at the corresponding `fetchResolve?.` call.
How can I resolve this? If you propose a fix, please make it concise.c675a29 to
48347fd
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 48347fd3ae
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if (gatewayState.ws || gatewayState.isConnecting) { | ||
| return; |
There was a problem hiding this comment.
Keep base client registration side effects on race path
Returning here when ws/isConnecting is truthy skips super.registerClient(client), but that base method performs more than just opening a socket. In Carbon it also applies base registration behavior (including the interaction listener path used when autoInteractions is enabled, which this plugin sets), so on the exact slow-metadata race this patch targets, startup can recover while native interaction handling is left uninitialized. That means bots can come online but silently lose slash-command/component handling until restart.
Useful? React with 👍 / 👎.
|
Quick status note — all Discord- and extension-scoped CI jobs are green on this PR, including the one that was red on #53039:
The two remaining red checks (
Recent main runs confirm this is branch-wide — e.g. run Happy to rebase once |
a6a7646 to
96a69aa
Compare
|
All CI checks are now green ✅ (40 passed, 0 failures). The previously failing Could you please review and merge when you get a chance? Thank you! cc @vincentkoc |
|
Maintainer review: this is a PR, and I think this is the Discord gateway fix we should land. Why this one is stronger than the older issue-only diagnoses:
I also checked it against current Land recommendation:
No blocking code findings from my review. |
96a69aa to
7254782
Compare
|
Landed via squash: 5adf9d2. Thanks @IVY-AI-gif. Carbon source check: What changed before landing:
Proof:
|
Verified against Carbon 0.16.0 source: - Client constructor calls plugin.registerClient(this) without awaiting it. - GatewayPlugin.registerClient publishes client before its awaited metadata fetch. - identify() silently returns when client is missing. This patch matches Carbon's ordering in OpenClaw's subclass, avoids a second super.registerClient call if lifecycle connect already opened the socket during metadata loading, and keeps regression coverage for both ws and isConnecting cases. Local proof: - pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts - pnpm lint:tmp:no-raw-channel-fetch - pnpm check:changed - pnpm check - pnpm test GitHub checks green for 7254782.
Verified against Carbon 0.16.0 source: - Client constructor calls plugin.registerClient(this) without awaiting it. - GatewayPlugin.registerClient publishes client before its awaited metadata fetch. - identify() silently returns when client is missing. This patch matches Carbon's ordering in OpenClaw's subclass, avoids a second super.registerClient call if lifecycle connect already opened the socket during metadata loading, and keeps regression coverage for both ws and isConnecting cases. Local proof: - pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts - pnpm lint:tmp:no-raw-channel-fetch - pnpm check:changed - pnpm check - pnpm test GitHub checks green for 7254782.
Verified against Carbon 0.16.0 source: - Client constructor calls plugin.registerClient(this) without awaiting it. - GatewayPlugin.registerClient publishes client before its awaited metadata fetch. - identify() silently returns when client is missing. This patch matches Carbon's ordering in OpenClaw's subclass, avoids a second super.registerClient call if lifecycle connect already opened the socket during metadata loading, and keeps regression coverage for both ws and isConnecting cases. Local proof: - pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts - pnpm lint:tmp:no-raw-channel-fetch - pnpm check:changed - pnpm check - pnpm test GitHub checks green for 7254782.
Summary
Fixes the Discord gateway stuck at "awaiting gateway readiness" reported in #52372.
Root cause:
@buape/carbon'sClientconstructor does notawait registerClient(). If the lifecycle readiness-timeout handler invokesconnect()→identify()whileSafeGatewayPlugin.registerClientis still awaiting the gateway-metadata fetch, Carbon'sidentify()readsthis.clientasundefinedand silently drops the Identify frame. The WebSocket opens and receives Hello, but never reaches READY.Two minimal additions in
SafeGatewayPlugin.registerClient(extensions/discord/src/monitor/gateway-plugin.ts):this.client = clientat the top of the method so a concurrentidentify()always sees a defined client while the asyncfetchDiscordGatewayInfoWithTimeoutis still resolving.wsorisConnectingon the plugin (i.e. aconnect()raced ahead), skipsuper.registerClient()to avoid tearing down the live WebSocket that the externalconnect()just opened.The
ws/isConnectingaccess is a typed cast: if Carbon ever renames these fields, the guard silently degrades to a no-op andsuper.registerClient()runs again — which may cause a brief reconnect but not a permanent hang, because fix (1) already prevents Identify from being dropped.Why low-risk
SafeGatewayPlugin, Carbon's public contract is unchanged.this.client = clientis a no-op for all existing paths except the pre-awaitrace window.ws/isConnectingguard is placed after thetesting.registerClienthook (which already bypassessuper.registerClient()), so the existing test injection semantics are preserved.override connect()heartbeat-stale-timer guard added in the meantime (for [Bug]: Gateway crashes with Attempted to reconnect zombie connection after disconnecting first and is auto-restarted by systemd #65009 / [Bug]: Discord gateway crash exits OpenClaw on Windows with "Attempted to reconnect zombie connection after disconnecting first" #64011 / [Bug]: Gateway/provider race: stale heartbeat reconnect callback throws after disconnect #63387) is untouched.Tests
Adds two regression tests in
provider.proxy.test.ts:sets client reference before the async gateway-info fetch resolves— mocks a pendingfetch, callsregisterClient, assertsthis.client === clientArgbefore resolving the fetch.skips super.registerClient when an external connect() sets ws during the metadata fetch— simulates the race by settingplugin.wsmid-fetch, assertsbaseRegisterClientSpywas not called.The mock
GatewayPluginin that file is extended withclient,ws,isConnectingfields to match Carbon's shape.Community validation
@Skeptomenos verified the same fix on a live 4-bot setup with runtime instrumentation of Carbon's
GatewayPlugin(see #53039 comments):connect → WS open → IDENTIFY sent → READY received → heartbeats flowingHistory
Supersedes #53039. That branch had diverged ~10k commits from
mainand could no longer be rebased cleanly. The fix is re-applied here on top of currentmain, which has since added anoverride connect()heartbeat-timer guard and atesting.registerClientinjection hook that both need to coexist with this change.Closes #52372
Test plan
extensions/discordunit suite (provider.proxy.test.ts,gateway-plugin.test.ts) — new regression tests added, existing assertions unchanged.check,extension-fast (discord),checks (node, extensions, pnpm test:extensions)) will run on push.Made with Cursor