fix(feishu): supervisor loop with health-check for WebSocket reconnection
## Problem
The Lark SDK's `WSClient.start()` is fire-and-forget — it returns
immediately and manages reconnection internally. When the SDK exhausts
its server-configured `reconnectCount` retries, it stops **silently**:
no error thrown, no event emitted, no promise rejected.
The existing `monitorWebSocket()` implementation parks on `abortSignal`
after calling `start()`, so it can never detect this silent death. This
is the root cause of #52618.
## Root cause analysis
Reading the upstream Lark SDK source (`larksuite/node-sdk`, `ws-client/index.ts`):
1. `start()` is declared `async` but calls `this.reConnect(true)` without
`await` — fire-and-forget by design.
2. Internal reconnection: `communicate()` → `ws.on('close')` → `reConnect()`
loops up to `reconnectCount` times (server-configured, typically 7).
3. After exhausting retries, the SDK simply stops. No callback, no event,
no rejected promise. The WSClient instance becomes a zombie.
4. The SDK exposes `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }`
which is the only observable signal of reconnection state.
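
To make points 1 to 3 concrete, here is a minimal sketch of the fire-and-forget shape described above. `FakeWsClient` is a hypothetical stand-in written for illustration, not the SDK's code; it only imitates the pattern of an `async start()` that kicks off an un-awaited retry loop which ends without signalling anyone.

```typescript
// Hypothetical stand-in for the SDK client; NOT the real WSClient.
class FakeWsClient {
  attempts = 0;
  gaveUp = false;
  private reconnectCount: number;

  constructor(reconnectCount: number) {
    this.reconnectCount = reconnectCount;
  }

  // Retries up to reconnectCount times, then simply returns.
  private async reConnect(): Promise<void> {
    while (this.attempts < this.reconnectCount) {
      this.attempts += 1;
      // Pretend each connection attempt takes a moment and fails.
      await new Promise<void>((resolve) => setTimeout(resolve, 1));
    }
    // Retries exhausted: no throw, no event, no rejected promise.
    this.gaveUp = true;
  }

  // Declared async, but the retry loop is not awaited, so the returned
  // promise resolves long before reconnection has finished or failed.
  async start(): Promise<void> {
    void this.reConnect();
  }
}

const client = new FakeWsClient(7);
const started = client.start();
// `started` resolves right away; nothing about it reveals that the
// background loop will later set `gaveUp` to true.
```

A caller that awaits `start()` and then waits only on an abort signal has no code path that observes `gaveUp`, which is the situation the analysis describes.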
## Solution
Replace the park-on-abortSignal pattern with a **supervisor loop +
health-check** that mirrors Slack/Telegram channel patterns:
1. **Outer supervisor loop** owns WSClient lifetime (create → monitor → destroy → retry).
2. **Inner health-check loop** polls `getReconnectInfo()` every 30s.
3. If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is
also stale → SDK declared dead → `close({ force: true })` → supervisor
recreates client with exponential backoff.
4. Non-recoverable errors (bad credentials, disabled app) break the loop.
5. `abortSignal` cleanly exits at any point (gateway restart, config reload).
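
The staleness rule in step 3 and the backoff in the same step can be sketched as two pure functions. This is an illustration under assumptions, not the code in this change: `HealthState`, `checkHealth`, and `backoffMs` are hypothetical names, both timestamps are assumed to be epoch milliseconds, and the backoff base and cap are made-up values. Only the 120s threshold and the `{ lastConnectTime, nextConnectTime }` shape come from the description.

```typescript
// Shape taken from the description; treating both fields as
// epoch-millisecond numbers is an assumption.
interface ReconnectInfo {
  lastConnectTime: number;
  nextConnectTime: number;
}

// Hypothetical bookkeeping carried from one poll to the next.
interface HealthState {
  lastSeenNext: number; // nextConnectTime observed on the previous poll
  lastAdvanceAt: number; // when nextConnectTime last changed
}

const STALE_MS = 120_000; // 120s threshold from step 3

// Declares the client dead only when nextConnectTime has not advanced
// for STALE_MS AND lastConnectTime is at least STALE_MS old.
function checkHealth(
  state: HealthState,
  info: ReconnectInfo,
  now: number,
): { dead: boolean; state: HealthState } {
  if (info.nextConnectTime !== state.lastSeenNext) {
    // The SDK scheduled another attempt, so it is still working.
    return {
      dead: false,
      state: { lastSeenNext: info.nextConnectTime, lastAdvanceAt: now },
    };
  }
  const nextStale = now - state.lastAdvanceAt >= STALE_MS;
  const lastStale = now - info.lastConnectTime >= STALE_MS;
  return { dead: nextStale && lastStale, state };
}

// Exponential backoff between client recreations.
// The 1s base and 60s cap are assumed values.
function backoffMs(attempt: number, baseMs = 1_000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

The sketch applies the rule exactly as worded in step 3. The description does not say how a healthy, long-lived connection is kept from looking stale, so a real check may need additional signals.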
## Why the previous approach didn't work
v1 of this PR had a supervisor while-loop, but after `start()` it
immediately parked on `abortSignal`. Since `start()` returns instantly
(fire-and-forget), nothing ever woke the loop when the SDK silently gave
up, so it could never iterate again. The health-check polling pattern
solves this.
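
The difference between parking and polling can be sketched as a small supervisor loop. Everything here is hypothetical scaffolding for illustration (`WsLike`, `sleep`, `supervise`, the injected `isDead` predicate, and the backoff constants); it is not the code in this PR, and it leaves out the non-recoverable-error handling and any backoff reset.

```typescript
interface ReconnectInfo {
  lastConnectTime: number;
  nextConnectTime: number;
}

// Minimal client surface this sketch needs; the method names follow the
// description, the interface itself is an assumption.
interface WsLike {
  start(): Promise<void>;
  close(opts?: { force?: boolean }): void;
  getReconnectInfo(): ReconnectInfo;
}

// Resolves true after `ms`, or false as soon as the signal aborts.
function sleep(ms: number, signal: AbortSignal): Promise<boolean> {
  return new Promise((resolve) => {
    if (signal.aborted) return resolve(false);
    const onAbort = () => {
      clearTimeout(timer);
      resolve(false);
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve(true);
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

// Returns how many times the client was declared dead and replaced.
async function supervise(
  createClient: () => WsLike,
  isDead: (info: ReconnectInfo, now: number) => boolean,
  signal: AbortSignal,
  pollMs = 30_000,
  baseBackoffMs = 1_000, // assumed value
): Promise<number> {
  let restarts = 0;
  while (!signal.aborted) {
    const client = createClient();
    await client.start(); // returns at once; says nothing about health

    // v1 parked on the signal here. Polling instead gives the loop a
    // way to notice that the SDK has stopped retrying.
    while (await sleep(pollMs, signal)) {
      if (isDead(client.getReconnectInfo(), Date.now())) break;
    }

    client.close({ force: true });
    if (signal.aborted) break;

    restarts += 1;
    const delay = Math.min(60_000, baseBackoffMs * 2 ** (restarts - 1));
    await sleep(delay, signal);
  }
  return restarts;
}
```

Because every wait goes through `sleep`, aborting the signal ends the loop from any point, whether it is polling or backing off.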
Fixes #52618