Description
The Feishu channel health-monitor periodically detects the connection as stuck and triggers a restart (~every 35 minutes). Due to a bug in `@larksuiteoapi/node-sdk`'s `WSClient.reConnect()` (reported upstream: larksuite/node-sdk#177), each restart leaks the previous reconnect loop's `setTimeout` handles, causing unbounded parallel reconnect attempts.
Observed Behavior
After running openclaw-gateway v2026.3.1 for ~3 days:
- 62,000+ `[ws] ws connect failed` log entries
- Dozens of concurrent leaked `loopReConnect` loops (visible from simultaneous retry counters like 66, 364, 1446, 1241, etc.)
- Memory growth from ~750MB RSS to 1.9GB peak
- DingTalk health-monitor exhibits the same stuck → restart pattern every ~35 minutes
Meanwhile, actual message delivery via Feishu and DingTalk works fine — the latest connection established by each restart is successful. The leaked loops are all from orphaned reconnect chains.
Logs
```
[health-monitor] [feishu:default] health-monitor: restarting (reason: stuck)
[feishu] feishu[default]: abort signal received, stopping
[feishu] starting feishu[default] (mode: websocket)
[feishu] feishu[default]: WebSocket client started
[error]: [ '[ws]', 'ws connect failed' ]
[error]: [ '[ws]', 'connect failed' ]
[info]: [ '[ws]', 'reconnect' ]
[info]: [ '[ws]', 'ws client ready' ]
# ...then old loops continue:
[info]: [ 'ws', 'unable to connect to the server after trying 828 times")' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1424 times")' ]
```
Root Cause
The upstream SDK bug (larksuite/node-sdk#177): `WSClient.reConnect()` stores only the latest `setTimeout` ID in `this.reconnectInterval`, so `clearTimeout` in subsequent restarts can only cancel the most recent timer, not older ones.
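The leak pattern can be illustrated with a minimal sketch. This is a hypothetical stand-in, not the SDK's actual source: the point is only that a single stored timer ID cannot cancel two concurrent retry chains.

```javascript
// Minimal sketch of the leak pattern (hypothetical stand-in, NOT the SDK's
// actual source): each restart kicks off a new reconnect chain, but only the
// most recent setTimeout ID survives in `reconnectInterval`.
class LeakySocket {
  constructor() {
    this.reconnectInterval = null; // holds exactly ONE timer ID
    this.attempts = 0;
  }

  loopReConnect() {
    this.attempts += 1;
    // Schedules the next retry and OVERWRITES any previously stored timer ID.
    this.reconnectInterval = setTimeout(() => this.loopReConnect(), 50);
  }

  stop() {
    // Can only cancel the most recently scheduled timer; older chains keep
    // firing with nothing left pointing at them.
    clearTimeout(this.reconnectInterval);
  }
}

const ws = new LeakySocket();
ws.loopReConnect();                  // chain A: its timer ID is stored
const chainA = ws.reconnectInterval;
ws.loopReConnect();                  // chain B: chain A's ID is overwritten
ws.stop();                           // cancels chain B only
clearTimeout(chainA);                // manual cleanup so this demo exits
console.log(chainA !== ws.reconnectInterval); // → true: chain A was orphaned
```

In the real SDK each restart repeats this overwrite, so every health-monitor cycle strands one more chain, matching the dozens of concurrent retry counters in the logs.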
Suggested Mitigations (openclaw side)
- Before restarting the Feishu provider, ensure the SDK's `WSClient` is fully destroyed (not just abort-signaled). Consider creating a fresh `WSClient` instance on each restart rather than reusing the same one.
- Tune health-monitor sensitivity: if the connection is actually working (messages are being received and dispatched), adjust the stuck-detection logic to avoid unnecessary restarts. The current ~35-minute cycle creates a new leaked loop each time.
- Apply a workaround until the SDK is fixed: wrap the `WSClient` with a generation counter or `AbortController` that forcefully exits orphaned `loopReConnect` callbacks.
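The third mitigation could look like the sketch below. `GenerationGuard` and its wiring are hypothetical, not an existing openclaw or SDK API; it only shows the generation-counter technique.

```javascript
// Hypothetical generation-counter guard (a sketch, not an existing openclaw
// or SDK API): callbacks wrapped under an older generation become no-ops
// after bump(), so orphaned reconnect chains exit instead of retrying forever.
class GenerationGuard {
  constructor() {
    this.generation = 0;
  }

  // Wrap a callback so it silently exits once a newer generation exists.
  wrap(fn) {
    const mine = this.generation;
    return (...args) => {
      if (mine !== this.generation) return; // stale chain: bail out
      return fn(...args);
    };
  }

  // Call before each provider restart to invalidate all older callbacks.
  bump() {
    this.generation += 1;
  }
}

const guard = new GenerationGuard();
let retries = 0;
const retry = guard.wrap(() => { retries += 1; });

retry();      // generations match: the retry runs
guard.bump(); // provider restart: everything wrapped earlier goes stale
retry();      // no-op: the orphaned chain exits immediately
console.log(retries); // → 1
```

An `AbortController` created per restart achieves the same effect: pass its signal into the wrapper and check `signal.aborted` instead of comparing counters.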
Environment
- openclaw-gateway: v2026.3.1
- @larksuiteoapi/node-sdk: 1.59.0
- Node.js: v22 (Linux x86_64, Ubuntu 24.04)
- Uptime at observation: ~3 days