
Feishu health-monitor restarts cause leaked reconnect loops in @larksuiteoapi/node-sdk WSClient #40451

@noodle-bag

Description


The Feishu channel health-monitor periodically detects the connection as stuck and triggers a restart (~every 35 minutes). Due to a bug in @larksuiteoapi/node-sdk's WSClient.reConnect() (reported upstream: larksuite/node-sdk#177), each restart leaks the previous reconnect loop's setTimeout handles, causing unbounded parallel reconnect attempts.

Observed Behavior

After running openclaw-gateway v2026.3.1 for ~3 days:

  • 62,000+ [ws] ws connect failed log entries
  • Dozens of concurrent leaked loopReConnect loops (visible from simultaneous retry counters like 66, 364, 1446, 1241, etc.)
  • Memory growth from ~750MB RSS to 1.9GB peak
  • DingTalk health-monitor exhibits the same stuck → restart pattern every ~35 minutes

Meanwhile, actual message delivery via Feishu and DingTalk works fine — the latest connection established by each restart is successful. The leaked loops are all from orphaned reconnect chains.

Logs

[health-monitor] [feishu:default] health-monitor: restarting (reason: stuck)
[feishu] feishu[default]: abort signal received, stopping
[feishu] starting feishu[default] (mode: websocket)
[feishu] feishu[default]: WebSocket client started
[error]: [ '[ws]', 'ws connect failed' ]
[error]: [ '[ws]', 'connect failed' ]
[info]: [ '[ws]', 'reconnect' ]
[info]: [ '[ws]', 'ws client ready' ]
# ...then old loops continue:
[info]: [ 'ws', 'unable to connect to the server after trying 828 times' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1424 times' ]

Root Cause

The upstream SDK bug (larksuite/node-sdk#177): WSClient.reConnect() stores only the latest setTimeout ID in this.reconnectInterval, so clearTimeout in subsequent restarts can only cancel the most recent timer, not older ones.
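The leak pattern can be shown with a minimal, self-contained model (illustrative only; this is not the SDK's actual source, and `LeakyReconnector` plus the fake scheduler are invented for the demonstration):

```typescript
// Deterministic stand-in for setTimeout/clearTimeout so the leak can be
// demonstrated without real timers.
type Task = () => void;
const pending = new Map<number, Task>();
let nextId = 0;
const fakeSetTimeout = (fn: Task): number => {
  pending.set(++nextId, fn);
  return nextId;
};
const fakeClearTimeout = (id?: number): void => {
  if (id !== undefined) pending.delete(id);
};
const tick = (): void => {
  const tasks = [...pending.values()];
  pending.clear();
  tasks.forEach((t) => t());
};

class LeakyReconnector {
  // Only the LATEST timer id is kept; ids of older chains are lost.
  private reconnectInterval?: number;
  attempts = 0;

  loopReConnect(): void {
    this.attempts++;
    this.reconnectInterval = fakeSetTimeout(() => this.loopReConnect());
  }

  stop(): void {
    // Cancels at most ONE pending timer: the most recently scheduled one.
    fakeClearTimeout(this.reconnectInterval);
  }
}

const client = new LeakyReconnector();
client.loopReConnect(); // chain A schedules its next attempt
client.loopReConnect(); // chain B overwrites chain A's stored timer id
client.stop();          // cancels only chain B's pending timer
tick();                 // chain A still fires and re-arms itself
console.log(client.attempts); // 3: chain A survived stop()
```

Each health-monitor restart plays the role of the second `loopReConnect()` plus `stop()` here: the newest chain is cancelled and replaced, while every older chain keeps re-arming forever.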

Suggested Mitigations (openclaw side)

  1. Before restarting the Feishu provider, ensure the SDK's WSClient is fully torn down (not merely sent an abort signal). Consider creating a fresh WSClient instance on each restart rather than reusing the same one.

  2. Tune health-monitor sensitivity: if the connection is actually working (messages are being received and dispatched), adjust the stuck-detection logic to avoid unnecessary restarts. The current ~35-minute restart cycle creates a new leaked loop each time.

  3. Apply a workaround until the SDK is fixed: wrap the WSClient with a generation counter or AbortController that forcefully exits orphaned loopReConnect callbacks.
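The generation-counter idea in mitigation 3 can be sketched as follows (`GuardedReconnector` and the injectable `schedule` callback are hypothetical illustrations, not openclaw or SDK API):

```typescript
// Sketch of a generation-counter guard. Each restart bumps `generation`;
// a reconnect callback captured under an older generation exits instead of
// re-arming itself, so orphaned chains die out after one scheduled tick.
class GuardedReconnector {
  private generation = 0;
  attempts = 0;

  // `schedule` is injectable so the loop can be driven manually below;
  // in production it would be something like fn => setTimeout(fn, delayMs).
  constructor(private schedule: (fn: () => void) => void) {}

  start(): void {
    const gen = ++this.generation; // snapshot the current generation
    const loop = (): void => {
      if (gen !== this.generation) return; // orphaned chain: bail out here
      this.attempts++;
      this.schedule(loop); // re-arm only while still current
    };
    loop();
  }
}

// Drive two overlapping "restarts" through a manual task queue.
const queue: Array<() => void> = [];
const guarded = new GuardedReconnector((fn) => queue.push(fn));
guarded.start(); // generation 1: attempts = 1
guarded.start(); // generation 2 orphans generation 1: attempts = 2
queue.splice(0).forEach((fn) => fn()); // gen-1 loop bails; gen-2 re-arms
console.log(guarded.attempts); // 3
```

The same effect can be had with an AbortController by checking `signal.aborted` at the top of the loop and calling `abort()` on restart; the generation counter is just the dependency-free version of that check.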

Environment

  • openclaw-gateway: v2026.3.1
  • @larksuiteoapi/node-sdk: 1.59.0
  • Node.js: v22 (Linux x86_64, Ubuntu 24.04)
  • Uptime at observation: ~3 days
