fix(discord): clear stale heartbeat timers in SafeGatewayPlugin.connect() #65087

Merged
vincentkoc merged 5 commits into openclaw:main from SARAMALI15792:fix/zombie-heartbeat-race
Apr 12, 2026

SARAMALI15792 (Contributor) commented Apr 12, 2026

What does this PR do?

Adds a connect() override to SafeGatewayPlugin that clears stale heartbeat timers before delegating to the parent, preventing an intermittent uncaught exception that crashes the Discord gateway process and drops in-flight replies.

Root Cause

@buape/carbon@0.15.0 has a race in its heartbeat initialisation:

setTimeout(() => {
    sendHeartbeat()                                          // stopHeartbeat() runs here —
                                                             // but heartbeatInterval is still undefined
    heartbeatInterval = setInterval(sendHeartbeat, interval) // stale interval created after the clear
}, initialDelay)

When sendHeartbeat detects a zombie connection it calls stopHeartbeat(), which clears heartbeatInterval — but the interval has not been assigned yet. The setInterval on the next line then creates a timer whose closure holds closed=true. When it fires ~41 seconds later, reconnectCallback(closed=true) throws inside a setInterval callback. Node.js routes this to process.on('uncaughtException'), bypassing the EventEmitter.on('error') path the gateway supervisor monitors. systemd restarts the service and in-flight replies fail.
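The ordering problem can be reproduced in isolation. The sketch below is a simplified, hypothetical model: the names `HeartbeatModel`, `startBuggy`, and `startFixed` are illustrative and not taken from @buape/carbon, and the initial `setTimeout` delay is omitted so the effect is visible synchronously.

```typescript
// Simplified, hypothetical model of the race described above.
// Not the actual @buape/carbon source; all names are illustrative only.
type IntervalHandle = ReturnType<typeof setInterval>;

class HeartbeatModel {
  heartbeatInterval: IntervalHandle | undefined = undefined;
  zombie = false; // when true, sendHeartbeat() treats the connection as dead

  stopHeartbeat(): void {
    if (this.heartbeatInterval !== undefined) {
      clearInterval(this.heartbeatInterval);
      this.heartbeatInterval = undefined;
    }
  }

  sendHeartbeat(): void {
    if (this.zombie) {
      this.stopHeartbeat();
    }
  }

  // Buggy ordering: the first beat runs before the interval handle exists,
  // so stopHeartbeat() has nothing to clear and the interval outlives it.
  startBuggy(intervalMs: number): void {
    this.sendHeartbeat();
    this.heartbeatInterval = setInterval(() => this.sendHeartbeat(), intervalMs);
  }

  // One possible ordering that avoids the leak: assign the handle first.
  startFixed(intervalMs: number): void {
    this.heartbeatInterval = setInterval(() => this.sendHeartbeat(), intervalMs);
    this.sendHeartbeat();
  }
}

const buggy = new HeartbeatModel();
buggy.zombie = true;
buggy.startBuggy(60_000);
const leaked = buggy.heartbeatInterval !== undefined; // stale interval survived
buggy.stopHeartbeat(); // clean up so the process can exit

const fixed = new HeartbeatModel();
fixed.zombie = true;
fixed.startFixed(60_000);
const cleared = fixed.heartbeatInterval === undefined; // handle was cleared

console.log({ leaked, cleared }); // { leaked: true, cleared: true }
```

In the buggy ordering the handle is live even though `stopHeartbeat()` already ran, which matches the stale interval that this PR clears on the next `connect()`.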

Solution Applied

Override connect() in SafeGatewayPlugin to unconditionally clear both heartbeatInterval and firstHeartbeatTimeout before calling super.connect():

public override connect(resume = false): void {
  if (this.heartbeatInterval !== undefined) {
    clearInterval(this.heartbeatInterval);
    this.heartbeatInterval = undefined;
  }
  if (this.firstHeartbeatTimeout !== undefined) {
    clearTimeout(this.firstHeartbeatTimeout);
    this.firstHeartbeatTimeout = undefined;
  }
  super.connect(resume);
}

The parent's connect() only calls stopHeartbeat() when isConnecting=false. When isConnecting=true it returns early — leaving any stale timer alive. This override runs before that early-return check, ensuring stale timers are always cleared on reconnect.
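The effect of running the cleanup before the early return can be sketched with a small model. Everything here is an assumption for illustration: `ParentGatewayModel` and `SafeGatewayModel` are hypothetical stand-ins, and the parent's `connect()` body only mirrors the control flow described above, not the real library code.

```typescript
// Hypothetical model of the control flow described above. The class names and
// the body of the parent connect() are assumptions for illustration, not the
// real @buape/carbon or SafeGatewayPlugin source.
type TimerHandle = ReturnType<typeof setTimeout>;

class ParentGatewayModel {
  heartbeatInterval: TimerHandle | undefined;
  firstHeartbeatTimeout: TimerHandle | undefined;
  isConnecting = false;

  stopHeartbeat(): void {
    if (this.heartbeatInterval !== undefined) {
      clearInterval(this.heartbeatInterval);
      this.heartbeatInterval = undefined;
    }
    if (this.firstHeartbeatTimeout !== undefined) {
      clearTimeout(this.firstHeartbeatTimeout);
      this.firstHeartbeatTimeout = undefined;
    }
  }

  connect(resume = false): void {
    if (this.isConnecting) {
      return; // early return: stopHeartbeat() is never reached
    }
    this.stopHeartbeat();
    this.isConnecting = true;
    void resume; // socket setup omitted in this model
  }
}

class SafeGatewayModel extends ParentGatewayModel {
  // Same shape as the override in this PR: clear first, then delegate.
  override connect(resume = false): void {
    if (this.heartbeatInterval !== undefined) {
      clearInterval(this.heartbeatInterval);
      this.heartbeatInterval = undefined;
    }
    if (this.firstHeartbeatTimeout !== undefined) {
      clearTimeout(this.firstHeartbeatTimeout);
      this.firstHeartbeatTimeout = undefined;
    }
    super.connect(resume);
  }
}

const noop = (): void => {};

// Counts timers still alive after connect() is called mid-connection.
function staleTimersAfterConnect(gateway: ParentGatewayModel): number {
  gateway.isConnecting = true; // a connect attempt is already in flight
  gateway.heartbeatInterval = setInterval(noop, 60_000);
  gateway.firstHeartbeatTimeout = setTimeout(noop, 60_000);
  gateway.connect(true);
  const stale = [gateway.heartbeatInterval, gateway.firstHeartbeatTimeout]
    .filter((t) => t !== undefined).length;
  gateway.stopHeartbeat(); // clean up so the process can exit
  return stale;
}

const parentStale = staleTimersAfterConnect(new ParentGatewayModel());
const safeStale = staleTimersAfterConnect(new SafeGatewayModel());
console.log({ parentStale, safeStale }); // { parentStale: 2, safeStale: 0 }
```

Clearing in the subclass keeps the workaround local to SafeGatewayPlugin, so it does not depend on the parent's early-return branch changing upstream.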

Bottleneck Solved

  • Eliminates intermittent gateway crashes caused by the stale setInterval firing with a closed reconnectCallback
  • No process restart, no dropped in-flight replies
  • Works with the currently published @buape/carbon@0.15.0 without requiring a version bump

Testing

pnpm test:extension discord

Two new unit tests added to extensions/discord/src/monitor/gateway-plugin.test.ts verifying that heartbeatInterval and firstHeartbeatTimeout are cleared when connect() is called while isConnecting=true.

Fixes #65009

openclaw-barnacle Bot added the channel: discord and size: S labels Apr 12, 2026

greptile-apps Bot commented Apr 12, 2026

Greptile Summary

Adds a connect() override in SafeGatewayPlugin that unconditionally clears heartbeatInterval and firstHeartbeatTimeout before delegating to super.connect(), preventing a stale-timer crash introduced by a race in @buape/carbon@0.15.0. Two unit tests cover both timer types under the isConnecting=true scenario.

Confidence Score: 5/5

Safe to merge — targeted defensive fix with correct logic and appropriate test coverage for the reported crash scenario.

No P0 or P1 issues found. The override correctly clears stale timers on both the normal and early-return code paths without introducing any new races. The mock faithfully represents the fields needed to exercise the override, and the test structure follows repo conventions.

No files require special attention.

Reviews (1): Last reviewed commit: "test(discord): assert super.connect() de..."

SARAMALI15792 and others added 5 commits April 12, 2026 18:31
…ct()

The @buape/carbon@0.15.0 heartbeat setup has a race where stopHeartbeat()
runs before heartbeatInterval is assigned, leaving a stale setInterval with
a closed reconnectCallback. When the stale interval fires ~41s later it
throws an uncaught exception that bypasses the EventEmitter error path and
crashes the gateway process via process.on('uncaughtException').

Add a connect() override in SafeGatewayPlugin that unconditionally clears
both heartbeatInterval and firstHeartbeatTimeout before calling super. The
parent's connect() only calls stopHeartbeat() when isConnecting=false; when
isConnecting=true it returns early without clearing — this override fills
that gap.

Fixes openclaw#65009. Related: openclaw#64011, openclaw#63387, openclaw#62038.

fix(ci): update raw-fetch allowlist line numbers for gateway-plugin.ts

The connect() override added in the heartbeat fix shifted the two
pre-existing fetch() callsites from lines 370/436 to 387/453.
vincentkoc force-pushed the fix/zombie-heartbeat-race branch from ff03a1e to 360d38e April 12, 2026 17:32
openclaw-barnacle Bot added the cli label Apr 12, 2026
vincentkoc merged commit 7995e40 into openclaw:main Apr 12, 2026
39 of 42 checks passed
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
…ct() (openclaw#65087)

* fix(discord): clear stale heartbeat timers in SafeGatewayPlugin.connect()

The @buape/carbon@0.15.0 heartbeat setup has a race where stopHeartbeat()
runs before heartbeatInterval is assigned, leaving a stale setInterval with
a closed reconnectCallback. When the stale interval fires ~41s later it
throws an uncaught exception that bypasses the EventEmitter error path and
crashes the gateway process via process.on('uncaughtException').

Add a connect() override in SafeGatewayPlugin that unconditionally clears
both heartbeatInterval and firstHeartbeatTimeout before calling super. The
parent's connect() only calls stopHeartbeat() when isConnecting=false; when
isConnecting=true it returns early without clearing — this override fills
that gap.

Fixes openclaw#65009. Related: openclaw#64011, openclaw#63387, openclaw#62038.

* test(discord): assert super.connect() delegation in SafeGatewayPlugin tests

* fix(ci): update raw-fetch allowlist line numbers for gateway-plugin.ts

The connect() override added in the heartbeat fix shifted the two
pre-existing fetch() callsites from lines 370/436 to 387/453.

* docs(changelog): add discord heartbeat crash note

* test(cli): align plugin registry load-context mock

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026

Labels

channel: discord (Channel integration: discord), cli (CLI command changes), scripts (Repository scripts), size: S

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway crashes with Attempted to reconnect zombie connection after disconnecting first and is auto-restarted by systemd

2 participants