Commit 969f3e3
fix(feishu): supervisor loop with health-check for WebSocket reconnection
## Problem

The Lark SDK's `WSClient.start()` is fire-and-forget: it returns immediately and manages reconnection internally. When the SDK exhausts its server-configured `reconnectCount` retries, it stops **silently**: no error thrown, no event emitted, no promise rejected. The existing `monitorWebSocket()` implementation parks on `abortSignal` after calling `start()`, so it can never detect this silent death. This is the root cause of #52618.

## Root cause analysis

Reading the upstream Lark SDK source (`larksuite/node-sdk`, `ws-client/index.ts`):

1. `start()` is declared `async` but calls `this.reConnect(true)` without `await`; it is fire-and-forget by design.
2. Internal reconnection: `communicate()` → `ws.on('close')` → `reConnect()` loops up to `reconnectCount` times (server-configured, typically 7).
3. After exhausting retries, the SDK simply stops. No callback, no event, no rejected promise. The WSClient instance becomes a zombie.
4. The SDK exposes `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }`, which is the only observable signal of reconnection state.

## Solution

Replace the park-on-abortSignal pattern with a **supervisor loop + health-check** that mirrors the Slack/Telegram channel patterns:

1. **Outer supervisor loop** owns the WSClient lifetime (create → monitor → destroy → retry).
2. **Inner health-check loop** polls `getReconnectInfo()` every 30s.
3. If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is also stale, the SDK is declared dead: `close({ force: true })`, then the supervisor recreates the client with exponential backoff.
4. Non-recoverable errors (bad credentials, disabled app) break the loop.
5. `abortSignal` cleanly exits at any point (gateway restart, config reload).

## Why the previous approach didn't work

The v1 of this PR had a supervisor while-loop, but after `start()` it immediately parked on `abortSignal`. Since `start()` returns instantly (fire-and-forget), the while-loop could never re-iterate when the SDK silently gave up. The health-check polling pattern solves this.

Fixes #52618
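The backoff schedule in the Solution above (2s initial, 60s cap, 1.8x factor, ±25% jitter) can be sketched as a small standalone function. This is an illustrative restatement of the `computeReconnectDelay` helper added by the diff, not part of the change itself:

```typescript
// Illustrative restatement of the supervisor backoff: exponential growth
// from 2s by a factor of 1.8, capped at 60s, with +/-25% uniform jitter.
const POLICY = { initialMs: 2_000, maxMs: 60_000, factor: 1.8, jitter: 0.25 } as const;

function computeReconnectDelay(attempt: number): number {
  const { initialMs, maxMs, factor, jitter } = POLICY;
  // attempt 1 -> 2s base, attempt 2 -> 3.6s, ... capped at 60s.
  const base = Math.min(initialMs * factor ** (attempt - 1), maxMs);
  // Spread the delay uniformly across [base - 25%, base + 25%].
  return base + (Math.random() * 2 - 1) * base * jitter;
}
```

Attempt 1 therefore sleeps between 1.5s and 2.5s; by attempt 7 the base hits the 60s cap, so every later retry waits 45-75s.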
1 parent 5ab3782 commit 969f3e3

1 file changed

Lines changed: 179 additions & 26 deletions

File tree

extensions/feishu/src/monitor.transport.ts

@@ -28,6 +28,71 @@ export type MonitorTransportParams = {
   eventDispatcher: Lark.EventDispatcher;
 };
 
+/**
+ * Supervisor reconnection policy for Feishu WebSocket.
+ * Aligned with Slack/Telegram channel patterns (2s initial, 60s max, 1.8x factor).
+ */
+const FEISHU_WS_RECONNECT_POLICY = {
+  initialMs: 2_000,
+  maxMs: 60_000,
+  factor: 1.8,
+  jitter: 0.25,
+} as const;
+
+/**
+ * Health check interval for detecting SDK silent death.
+ *
+ * The Lark SDK's WSClient manages its own reconnection internally, but after
+ * exhausting `reconnectCount` attempts it stops silently — no error thrown, no
+ * event emitted, no promise rejected. The only observable signal is that
+ * `getReconnectInfo().nextConnectTime` stops advancing.
+ *
+ * We poll `getReconnectInfo()` at this interval. If `nextConnectTime` has not
+ * advanced for `HEALTH_CHECK_STALE_MS`, we consider the SDK dead and recreate
+ * the client from scratch.
+ */
+const HEALTH_CHECK_INTERVAL_MS = 30_000;
+
+/**
+ * How long `nextConnectTime` can remain unchanged before we declare the SDK
+ * dead. Must be longer than the SDK's own `reconnectInterval` (server-sent,
+ * typically ~10-30s) to avoid false positives during normal SDK reconnection.
+ */
+const HEALTH_CHECK_STALE_MS = 120_000;
+
+function computeReconnectDelay(attempt: number): number {
+  const { initialMs, maxMs, factor, jitter } = FEISHU_WS_RECONNECT_POLICY;
+  const base = Math.min(initialMs * factor ** (attempt - 1), maxMs);
+  const jitterRange = base * jitter;
+  return base + (Math.random() * 2 - 1) * jitterRange;
+}
+
+function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<"slept" | "aborted"> {
+  if (signal?.aborted) return Promise.resolve("aborted");
+  return new Promise((resolve) => {
+    const timer = setTimeout(() => resolve("slept"), ms);
+    signal?.addEventListener(
+      "abort",
+      () => {
+        clearTimeout(timer);
+        resolve("aborted");
+      },
+      { once: true },
+    );
+  });
+}
+
+function isNonRecoverableFeishuError(err: unknown): boolean {
+  const msg = String(err);
+  return (
+    msg.includes("credentials not configured") ||
+    msg.includes("app_id") ||
+    msg.includes("invalid app") ||
+    msg.includes("app has been disabled") ||
+    msg.includes("app not exist")
+  );
+}
+
 function isFeishuWebhookPayload(value: unknown): value is Record<string, unknown> {
   return !!value && typeof value === "object" && !Array.isArray(value);
 }
@@ -81,6 +146,31 @@ function respondText(res: http.ServerResponse, statusCode: number, body: string)
   res.end(body);
 }
 
+/**
+ * WebSocket supervisor loop for Feishu connections.
+ *
+ * ## Problem
+ *
+ * The Lark SDK's `WSClient.start()` is fire-and-forget: it kicks off an
+ * internal reconnection loop but returns immediately without awaiting the
+ * connection. When the SDK exhausts its server-configured `reconnectCount`
+ * retries, it stops silently — no error, no event, no rejected promise.
+ * The previous implementation parked on `abortSignal` after calling
+ * `start()`, so it could never detect this silent death.
+ *
+ * ## Solution
+ *
+ * We use `WSClient.getReconnectInfo()` — an SDK method that exposes
+ * `{ lastConnectTime, nextConnectTime }`. A periodic health check polls
+ * this info; if `nextConnectTime` hasn't advanced for `HEALTH_CHECK_STALE_MS`
+ * and the last successful connection is also old, the SDK has given up. We
+ * then `close()` the dead client, apply exponential backoff, and create a
+ * fresh `WSClient`.
+ *
+ * This mirrors the Slack and Telegram supervisor patterns already in
+ * OpenClaw: an outer loop that owns client lifetime, with the transport
+ * SDK managing short-term reconnection internally.
+ */
 export async function monitorWebSocket({
   account,
   accountId,
@@ -89,41 +179,105 @@ export async function monitorWebSocket({
   eventDispatcher,
 }: MonitorTransportParams): Promise<void> {
   const log = runtime?.log ?? console.log;
-  log(`feishu[${accountId}]: starting WebSocket connection...`);
+  const error = runtime?.error ?? console.error;
 
-  const wsClient = createFeishuWSClient(account);
-  wsClients.set(accountId, wsClient);
+  let supervisorAttempts = 0;
 
-  return new Promise((resolve, reject) => {
-    const cleanup = () => {
-      wsClients.delete(accountId);
-      botOpenIds.delete(accountId);
-      botNames.delete(accountId);
-    };
+  while (!abortSignal?.aborted) {
+    log(`feishu[${accountId}]: starting WebSocket connection...`);
 
-    const handleAbort = () => {
-      log(`feishu[${accountId}]: abort signal received, stopping`);
-      cleanup();
-      resolve();
-    };
+    const wsClient = createFeishuWSClient(account);
+    wsClients.set(accountId, wsClient);
 
-    if (abortSignal?.aborted) {
-      cleanup();
-      resolve();
-      return;
-    }
-
-    abortSignal?.addEventListener("abort", handleAbort, { once: true });
+    let sdkDied = false;
 
     try {
       wsClient.start({ eventDispatcher });
       log(`feishu[${accountId}]: WebSocket client started`);
+
+      // SDK started successfully; reset supervisor backoff.
+      supervisorAttempts = 0;
+
+      // Health-check loop: poll getReconnectInfo() to detect SDK silent death.
+      // The SDK's internal reconnection updates nextConnectTime on each retry.
+      // When it gives up, nextConnectTime freezes.
+      let lastSeenNextConnectTime = 0;
+      let staleStartedAt = 0;
+
+      while (!abortSignal?.aborted) {
+        const result = await sleepWithAbort(HEALTH_CHECK_INTERVAL_MS, abortSignal);
+        if (result === "aborted") break;
+
+        const info = wsClient.getReconnectInfo();
+        const now = Date.now();
+
+        if (info.nextConnectTime !== lastSeenNextConnectTime) {
+          // SDK is still actively reconnecting — reset staleness tracker.
+          lastSeenNextConnectTime = info.nextConnectTime;
+          staleStartedAt = 0;
+          continue;
+        }
+
+        // nextConnectTime hasn't changed. If lastConnectTime is recent,
+        // the connection is healthy (no reconnection needed). Only flag
+        // staleness when the last successful connect is also old.
+        if (info.lastConnectTime > 0 && now - info.lastConnectTime < HEALTH_CHECK_STALE_MS) {
+          staleStartedAt = 0;
+          continue;
+        }
+
+        // Start or continue the staleness timer.
+        if (staleStartedAt === 0) {
+          staleStartedAt = now;
+          continue;
+        }
+
+        if (now - staleStartedAt >= HEALTH_CHECK_STALE_MS) {
+          error(
+            `feishu[${accountId}]: SDK reconnection stale for ${Math.round((now - staleStartedAt) / 1000)}s, ` +
+              `last connect at ${info.lastConnectTime ? new Date(info.lastConnectTime).toISOString() : "never"}. ` +
+              `Recreating client.`,
+          );
+          sdkDied = true;
+          break;
+        }
+      }
     } catch (err) {
-      cleanup();
-      abortSignal?.removeEventListener("abort", handleAbort);
-      reject(err);
+      if (isNonRecoverableFeishuError(err)) {
+        error(`feishu[${accountId}]: non-recoverable error, giving up: ${String(err)}`);
+        break;
+      }
+      sdkDied = true;
+      error(`feishu[${accountId}]: WebSocket start error: ${String(err)}`);
+    } finally {
+      try {
+        wsClient.close({ force: true });
+      } catch {
+        // close() may throw if already torn down.
+      }
+      wsClients.delete(accountId);
+      botOpenIds.delete(accountId);
+      botNames.delete(accountId);
     }
-  });
+
+    if (abortSignal?.aborted) break;
+
+    if (sdkDied) {
+      supervisorAttempts += 1;
+      const delayMs = computeReconnectDelay(supervisorAttempts);
+      log(
+        `feishu[${accountId}]: supervisor reconnect attempt ${supervisorAttempts}, ` +
+          `waiting ${Math.round(delayMs / 1000)}s...`,
+      );
+      const result = await sleepWithAbort(delayMs, abortSignal);
+      if (result === "aborted") break;
+      continue;
+    }
+
+    break;
+  }
+
+  log(`feishu[${accountId}]: WebSocket monitor stopped`);
 }
 
 export async function monitorWebhook({
128282

129283
export async function monitorWebhook({
@@ -192,7 +346,6 @@ export async function monitorWebhook({
     return;
   }
 
-  // Lark's default adapter drops invalid signatures as an empty 200. Reject here instead.
   if (
     !isFeishuWebhookSignatureValid({
       headers: req.headers,

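For clarity, the staleness decision made by the inner health-check loop can be restated as a pure function. `pollHealth`, `ReconnectInfo`, and `PollState` below are hypothetical names introduced for illustration only (they do not appear in the diff), but the branch logic matches the loop body above:

```typescript
// Hypothetical pure extraction of the health-check decision, for
// illustration only. One call corresponds to one 30s poll tick.
type ReconnectInfo = { lastConnectTime: number; nextConnectTime: number };
type PollState = { lastSeenNextConnectTime: number; staleStartedAt: number };

const STALE_MS = 120_000; // mirrors HEALTH_CHECK_STALE_MS

function pollHealth(
  info: ReconnectInfo,
  state: PollState,
  now: number,
): { state: PollState; dead: boolean } {
  // nextConnectTime advanced: the SDK is still reconnecting on its own.
  if (info.nextConnectTime !== state.lastSeenNextConnectTime) {
    return {
      state: { lastSeenNextConnectTime: info.nextConnectTime, staleStartedAt: 0 },
      dead: false,
    };
  }
  // A recent successful connect means the connection is simply healthy.
  if (info.lastConnectTime > 0 && now - info.lastConnectTime < STALE_MS) {
    return { state: { ...state, staleStartedAt: 0 }, dead: false };
  }
  // First stale observation starts the timer; later ones check its age.
  if (state.staleStartedAt === 0) {
    return { state: { ...state, staleStartedAt: now }, dead: false };
  }
  return { state, dead: now - state.staleStartedAt >= STALE_MS };
}
```

Once `nextConnectTime` has been frozen, with no recent `lastConnectTime`, for 120s after the timer starts, the client is declared dead and the supervisor recreates it.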