Skip to content

fix(discord): prevent Identify silent-drop race in gateway startup#68159

Merged
steipete merged 1 commit intoopenclaw:mainfrom
IVY-AI-gif:fix/discord-gateway-identify-race-v2
Apr 24, 2026
Merged

fix(discord): prevent Identify silent-drop race in gateway startup#68159
steipete merged 1 commit intoopenclaw:mainfrom
IVY-AI-gif:fix/discord-gateway-identify-race-v2

Conversation

@IVY-AI-gif
Copy link
Copy Markdown
Contributor

Summary

Fixes the Discord gateway stuck at "awaiting gateway readiness" reported in #52372.

Root cause: @buape/carbon's Client constructor does not await registerClient(). If the lifecycle readiness-timeout handler invokes connect()identify() while SafeGatewayPlugin.registerClient is still awaiting the gateway-metadata fetch, Carbon's identify() reads this.client as undefined and silently drops the Identify frame. The WebSocket opens and receives Hello, but never reaches READY.

Two minimal additions in SafeGatewayPlugin.registerClient (extensions/discord/src/monitor/gateway-plugin.ts):

  1. Assign this.client = client at the top of the method so a concurrent identify() always sees a defined client while the async fetchDiscordGatewayInfoWithTimeout is still resolving.
  2. After metadata has been resolved, if an external caller already set ws or isConnecting on the plugin (i.e. a connect() raced ahead), skip super.registerClient() to avoid tearing down the live WebSocket that the external connect() just opened.

The ws / isConnecting access is a typed cast: if Carbon ever renames these fields, the guard silently degrades to a no-op and super.registerClient() runs again — which may cause a brief reconnect but not a permanent hang, because fix (1) already prevents Identify from being dropped.

Why low-risk

Tests

Adds two regression tests in provider.proxy.test.ts:

  • sets client reference before the async gateway-info fetch resolves — mocks a pending fetch, calls registerClient, asserts this.client === clientArg before resolving the fetch.
  • skips super.registerClient when an external connect() sets ws during the metadata fetch — simulates the race by setting plugin.ws mid-fetch, asserts baseRegisterClientSpy was not called.

The mock GatewayPlugin in that file is extended with client, ws, isConnecting fields to match Carbon's shape.

Community validation

@Skeptomenos verified the same fix on a live 4-bot setup with runtime instrumentation of Carbon's GatewayPlugin (see #53039 comments):

  • All 4 bots: connect → WS open → IDENTIFY sent → READY received → heartbeats flowing
  • Zero errors, zero guard blocks, zero close events during startup.

History

Supersedes #53039. That branch had diverged ~10k commits from main and could no longer be rebased cleanly. The fix is re-applied here on top of current main, which has since added an override connect() heartbeat-timer guard and a testing.registerClient injection hook that both need to coexist with this change.

Closes #52372

Test plan

  • extensions/discord unit suite (provider.proxy.test.ts, gateway-plugin.test.ts) — new regression tests added, existing assertions unchanged.
  • Community field test on 4-bot Discord setup — see fix(discord): prevent Identify silent-drop race in gateway startup #53039 thread for full WS-event trace.
  • CI (check, extension-fast (discord), checks (node, extensions, pnpm test:extensions)) will run on push.

Made with Cursor

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 17, 2026

Greptile Summary

This PR fixes a real race condition where Carbon's identify() could silently fail because this.client was undefined when an external connect() call (from the lifecycle readiness-timeout handler) raced ahead of the async registerClient() metadata fetch. Two minimal additions in SafeGatewayPlugin.registerClient address this: early assignment of this.client = client before any await, and a skip guard on super.registerClient() when ws or isConnecting indicates a live connection was already established. Regression tests directly verify both race conditions.

Confidence Score: 5/5

Safe to merge — the fix is minimal, well-scoped to SafeGatewayPlugin, and field-level test coverage directly verifies both race conditions. No logic changes to Carbon's public contract or any other shared surface.

No P0 or P1 issues found. The early this.client = client assignment is a clear correctness improvement with no downside. The ws/isConnecting guard is intentionally fragile-by-design and documented to degrade gracefully. Regression tests use fake timers correctly and timer state is cleaned up by the existing beforeEach. The only finding is a P2 style nit about optional chaining in tests.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.proxy.test.ts
Line: 476-481

Comment:
**`fetchResolve?.` optional chaining could hide setup failures**

`fetchResolve` is always assigned synchronously before `registerClient` suspends (the mock implementation runs before the first `await` in the fetch chain), so the optional chaining here won't mask a real gap. A definite call with a prior assertion would make the test fail faster and with a clearer message if the setup ever changes:

```suggestion
    expect(fetchResolve).toBeDefined();
    fetchResolve!({
      ok: true,
      status: 200,
      text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
    } as Response);
```

Same pattern applies to the second regression test at the corresponding `fetchResolve?.` call.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "test(discord): use as-unknown-as cast fo..." | Re-trigger Greptile

Comment on lines +476 to +481
fetchResolve?.({
ok: true,
status: 200,
text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
} as Response);
await registerPromise;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 fetchResolve?. optional chaining could hide setup failures

fetchResolve is always assigned synchronously before registerClient suspends (the mock implementation runs before the first await in the fetch chain), so the optional chaining here won't mask a real gap. A definite call with a prior assertion would make the test fail faster and with a clearer message if the setup ever changes:

Suggested change
fetchResolve?.({
ok: true,
status: 200,
text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
} as Response);
await registerPromise;
expect(fetchResolve).toBeDefined();
fetchResolve!({
ok: true,
status: 200,
text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
} as Response);

Same pattern applies to the second regression test at the corresponding fetchResolve?. call.

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/discord/src/monitor/provider.proxy.test.ts
Line: 476-481

Comment:
**`fetchResolve?.` optional chaining could hide setup failures**

`fetchResolve` is always assigned synchronously before `registerClient` suspends (the mock implementation runs before the first `await` in the fetch chain), so the optional chaining here won't mask a real gap. A definite call with a prior assertion would make the test fail faster and with a clearer message if the setup ever changes:

```suggestion
    expect(fetchResolve).toBeDefined();
    fetchResolve!({
      ok: true,
      status: 200,
      text: async () => JSON.stringify({ url: "wss://gateway.discord.gg" }),
    } as Response);
```

Same pattern applies to the second regression test at the corresponding `fetchResolve?.` call.

How can I resolve this? If you propose a fix, please make it concise.

@IVY-AI-gif IVY-AI-gif force-pushed the fix/discord-gateway-identify-race-v2 branch from c675a29 to 48347fd Compare April 17, 2026 15:07
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 48347fd3ae

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +316 to +317
if (gatewayState.ws || gatewayState.isConnecting) {
return;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep base client registration side effects on race path

Returning here when ws/isConnecting is truthy skips super.registerClient(client), but that base method performs more than just opening a socket. In Carbon it also applies base registration behavior (including the interaction listener path used when autoInteractions is enabled, which this plugin sets), so on the exact slow-metadata race this patch targets, startup can recover while native interaction handling is left uninitialized. That means bots can come online but silently lose slash-command/component handling until restart.

Useful? React with 👍 / 👎.

@IVY-AI-gif
Copy link
Copy Markdown
Contributor Author

Quick status note — all Discord- and extension-scoped CI jobs are green on this PR, including the one that was red on #53039:

  • extension-fast (extension-fast-discord, discord)
  • checks-node-extensions-shard-1..6
  • checks-node-extensions, checks-fast-bundled, checks-fast-contracts-protocol
  • checks-windows-node-test, build-smoke, build-artifacts

The two remaining red checks (check, check-additional) are pre-existing breakage on main and not caused by this PR — the failing compilation / lint findings all point to files this PR never touches:

  • extensions/discord/src/monitor/agent-components.deps.runtime.ts:2TS2305 on readSessionUpdatedAt / resolveStorePath from openclaw/plugin-sdk/session-store-runtime (re-exports out of sync on main).
  • src/commands/agent.acp.test.ts:400, src/commands/message.test.ts:200/243, src/commands/status.test.ts:201/237/239, src/commands/channel-setup/plugin-install.test.ts:38 — assorted TS2322 / TS2339 / TS2551 / TS2556 / TS7006 in core command tests.
  • lint:tmp:no-raw-channel-fetch — flags extensions/discord/src/monitor/gateway-plugin.ts:411,477 (the fetch(input, init as RequestInit) inside createDiscordGatewayPlugin), which is unrelated to SafeGatewayPlugin.registerClient changed here. git blame shows those lines come from the latest main head (commit 48aa076), not this branch.

Recent main runs confirm this is branch-wide — e.g. run 24572210333 on c37caa5 and run 24572086917 on 40c9da1 both failed with the same errors, while 24571597706 on a45ebf3 was the last green.

Happy to rebase once main is green again if you'd like me to retrigger, but no code change is needed on this PR.

@IVY-AI-gif
Copy link
Copy Markdown
Contributor Author

All CI checks are now green ✅ (40 passed, 0 failures).

The previously failing check-additional-boundaries job was caused by a stale line-number allowlist in check-no-raw-channel-fetch.mjs — the fix inserts 24 lines above the existing fetch() callsites, shifting them from L387/L453 to L411/L477. The allowlist has been updated accordingly.

Could you please review and merge when you get a chance? Thank you!

cc @vincentkoc

@steipete
Copy link
Copy Markdown
Contributor

Maintainer review: this is a PR, and I think this is the Discord gateway fix we should land.

Why this one is stronger than the older issue-only diagnoses:

  • It touches the implicated path directly: SafeGatewayPlugin.registerClient() in extensions/discord/src/monitor/gateway-plugin.ts.
  • Current main still awaits gateway metadata before Carbon's normal super.registerClient(client) path sets the client reference. Since Carbon does not await plugin registerClient(), an external lifecycle reconnect can reach identify() while this.client is still unset and silently drop IDENTIFY.
  • The first change (this.client = client before the await) closes that race without changing Carbon's public contract.
  • The second change (skip super.registerClient() when ws/isConnecting was already set during the await) prevents the reconnect path from tearing down the live socket it just opened.
  • The tests cover both race windows directly, not just the broad "awaiting gateway readiness" symptom.

I also checked it against current origin/main (5b8bd63). The merge simulation is clean. The only touched shared file that moved recently is the raw-fetch allowlist; the merged result keeps the current-main BlueBubbles line updates and the PR's Discord line updates (gateway-plugin.ts raw fetch callsites at 411/477), so that part should survive rebase.

Land recommendation:

  1. Update/rebase the branch on current main.
  2. Rerun current CI, especially extension-fast (discord), checks-node-extensions, check, and check-additional-boundaries.
  3. Land if green. Add the changelog entry during landing since this is user-facing Discord startup recovery. Suggested wording: Discord: prevent gateway startup from getting stuck at awaiting gateway readiness when Carbon gateway registration races with a lifecycle reconnect. Fixes #52372. (#68159) Thanks @IVY-AI-gif.

No blocking code findings from my review.

@steipete steipete force-pushed the fix/discord-gateway-identify-race-v2 branch from 96a69aa to 7254782 Compare April 24, 2026 21:37
@steipete steipete merged commit 5adf9d2 into openclaw:main Apr 24, 2026
64 checks passed
@steipete
Copy link
Copy Markdown
Contributor

Landed via squash: 5adf9d2. Thanks @IVY-AI-gif.

Carbon source check: @buape/carbon@0.16.0 has Client call plugin.registerClient(this) without awaiting it; GatewayPlugin.registerClient() publishes client before awaiting metadata; identify() returns silently when client is missing. That confirms the race this PR fixes.

What changed before landing:

  • extensions/discord/src/monitor/gateway-plugin.ts: publish Carbon's client reference before OpenClaw's metadata fetch can yield, and skip a second super.registerClient() if lifecycle connect() already started the socket.
  • extensions/discord/src/monitor/provider.proxy.test.ts: tightened regression coverage for both ws and isConnecting reconnect-start cases.
  • CHANGELOG.md: added user-facing fix note for Discord WSS Gateway never sends Identify after WebSocket opened (Linux) #52372.

Proof:

  • pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts
  • pnpm lint:tmp:no-raw-channel-fetch
  • pnpm check:changed
  • pnpm check
  • pnpm test
  • GitHub checks green for PR head 7254782.

Angfr95 pushed a commit to Angfr95/openclaw that referenced this pull request Apr 25, 2026
Verified against Carbon 0.16.0 source:
- Client constructor calls plugin.registerClient(this) without awaiting it.
- GatewayPlugin.registerClient publishes client before its awaited metadata fetch.
- identify() silently returns when client is missing.

This patch matches Carbon's ordering in OpenClaw's subclass, avoids a second super.registerClient call if lifecycle connect already opened the socket during metadata loading, and keeps regression coverage for both ws and isConnecting cases.

Local proof:
- pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts
- pnpm lint:tmp:no-raw-channel-fetch
- pnpm check:changed
- pnpm check
- pnpm test

GitHub checks green for 7254782.
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
Verified against Carbon 0.16.0 source:
- Client constructor calls plugin.registerClient(this) without awaiting it.
- GatewayPlugin.registerClient publishes client before its awaited metadata fetch.
- identify() silently returns when client is missing.

This patch matches Carbon's ordering in OpenClaw's subclass, avoids a second super.registerClient call if lifecycle connect already opened the socket during metadata loading, and keeps regression coverage for both ws and isConnecting cases.

Local proof:
- pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts
- pnpm lint:tmp:no-raw-channel-fetch
- pnpm check:changed
- pnpm check
- pnpm test

GitHub checks green for 7254782.
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Verified against Carbon 0.16.0 source:
- Client constructor calls plugin.registerClient(this) without awaiting it.
- GatewayPlugin.registerClient publishes client before its awaited metadata fetch.
- identify() silently returns when client is missing.

This patch matches Carbon's ordering in OpenClaw's subclass, avoids a second super.registerClient call if lifecycle connect already opened the socket during metadata loading, and keeps regression coverage for both ws and isConnecting cases.

Local proof:
- pnpm test extensions/discord/src/monitor/provider.proxy.test.ts extensions/discord/src/monitor/gateway-plugin.test.ts
- pnpm lint:tmp:no-raw-channel-fetch
- pnpm check:changed
- pnpm check
- pnpm test

GitHub checks green for 7254782.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: discord Channel integration: discord scripts Repository scripts size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discord WSS Gateway never sends Identify after WebSocket opened (Linux)

2 participants