fix(feishu): add supervisor loop for WebSocket reconnection #52623
bridzl wants to merge 1 commit into openclaw:main
Conversation
Greptile Summary: This PR replaces the previous park-on-abortSignal pattern with a supervisor loop + health-check. Key issues found:
Confidence Score: 3/5
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 85-94
Comment:
**`"app_id"` substring match is too broad**
`msg.includes("app_id")` will match any error message that contains the literal string `app_id`, including internal SDK messages like `"missing field: app_id in event payload"` or similar. This could cause a transient or recoverable error to be silently classified as non-recoverable, permanently stopping the supervisor.
The other patterns (`"invalid app"`, `"app not exist"`, `"app has been disabled"`) are specific and safe. Consider removing the naked `"app_id"` term or replacing it with the exact Lark error string you intend to match (e.g., `"invalid app_id"` or `"app_id not found"`):
```suggestion
function isNonRecoverableFeishuError(err: unknown): boolean {
const msg = String(err);
return (
msg.includes("credentials not configured") ||
msg.includes("invalid app_id") ||
msg.includes("invalid app") ||
msg.includes("app has been disabled") ||
msg.includes("app not exist")
);
}
```
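For illustration, here is how the naked substring check misfires — the error strings below are hypothetical stand-ins, not exact SDK messages:

```typescript
// Old predicate: naked "app_id" substring check.
const matchesOld = (err: unknown) => String(err).includes("app_id");

// A transient validation error that merely mentions the field name is
// classified as non-recoverable — the supervisor would stop for good:
matchesOld(new Error("missing field: app_id in event payload")); // → true (false positive)

// The tightened pattern fires only on the genuinely fatal message:
const matchesNew = (err: unknown) => String(err).includes("invalid app_id");
matchesNew(new Error("missing field: app_id in event payload")); // → false
matchesNew(new Error("invalid app_id"));                         // → true
```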
How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix(feishu): supervisor loop with health..."
```typescript
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// Reset attempt counter on successful start.
reconnectAttempts = 0;

// Wait until abort signal fires. The Lark SDK manages the connection
// internally; when it exhausts its own retries (typically 7 attempts)
// the start() promise settles and we fall through to the reconnect
// loop below. Intentional stops (gateway restart, config reload) fire
// the abort signal and exit the loop cleanly.
await new Promise<void>((resolve) => {
  if (abortSignal?.aborted) {
    resolve();
    return;
  }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
});
```
**`wsClient.start()` is not awaited — reconnection loop never triggers**

`wsClient.start()` returns a Promise, but it is called without `await`. This means:

1. The SDK runs in the background, disconnected from the `try/catch` and the outer `while` loop.
2. The code immediately falls through to `await new Promise<void>(...)`, which resolves **only** when the abort signal fires.
3. When the Lark SDK exhausts its 7 internal retries and its Promise settles (rejects), that rejection becomes an unhandled promise rejection — the supervisor loop never sees it.
4. The function then hangs indefinitely waiting for an abort signal that never comes.

As a result, the core promise of the PR — "wrap `monitorWebSocket()` in a supervisor while-loop" — is not fulfilled. The reconnect path and the `isNonRecoverableFeishuError` fast-fail path are both unreachable.

The fix is to race `wsClient.start()` against the abort signal so that either path can advance the loop:

```typescript
const startPromise = wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);
reconnectAttempts = 0;

const abortedPromise = new Promise<void>((resolve) => {
  if (abortSignal?.aborted) { resolve(); return; }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
});

// Race: SDK gives up naturally, OR caller aborts intentionally.
await Promise.race([startPromise, abortedPromise]);

// If the abort signal won the race, break cleanly.
if (abortSignal?.aborted) break;

// Otherwise the SDK exhausted retries; fall through to reconnect.
```

Note: if `startPromise` rejects, the `catch` block below already handles it correctly. If `abortedPromise` wins while `startPromise` is still pending, the `finally` block tears down the client maps and the next iteration creates a fresh client — which is the intended behavior.
```diff
@@ -28,6 +28,50 @@ export type MonitorTransportParams = {
   eventDispatcher: Lark.EventDispatcher;
 };
+
+/**
+ * Reconnection policy for Feishu WebSocket connections.
+ * Aligned with Slack/Telegram patterns (2s initial, 30s max, 1.8x factor).
```
**Stale JSDoc comment — "30s max" doesn't match `maxMs: 60_000`**

The comment says `30s max` but the constant directly below sets `maxMs: 60_000` (60 seconds).

```suggestion
 * Aligned with Slack/Telegram patterns (2s initial, 60s max, 1.8x factor).
```
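For context, the numbers in this policy map onto a small backoff helper. The name `computeReconnectDelay` appears in the review comments on this PR, but the body below is only a sketch of the stated policy (2s initial, 1.8× factor, 60s cap — consistent with `computeReconnectDelay(1)` ≈ 2s), not the PR's actual implementation; jitter is omitted:

```typescript
const RECONNECT_POLICY = {
  initialMs: 2_000, // first retry after ~2s
  maxMs: 60_000,    // cap at 60s — the value the JSDoc should state
  factor: 1.8,      // exponential growth factor
};

// attempt is 1-based: attempt 1 → 2s, attempt 2 → 3.6s, …, capped at 60s.
function computeReconnectDelay(attempt: number): number {
  const delay = RECONNECT_POLICY.initialMs * RECONNECT_POLICY.factor ** (attempt - 1);
  return Math.min(RECONNECT_POLICY.maxMs, delay);
}
```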
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30cce342b4
| await new Promise<void>((resolve) => { | ||
| if (abortSignal?.aborted) { | ||
| resolve(); | ||
| return; | ||
| } | ||
| abortSignal?.addEventListener("abort", () => resolve(), { once: true }); |
**Drive the supervisor from WS failure, not only abort**

I checked the upstream `@larksuiteoapi/node-sdk` `WSClient.start()` flow used here: it returns after setting up its own reconnect machinery, not when the socket later exhausts retries or gives up. In this implementation we immediately park on `abortSignal`, so after a real network drop the outer `while` never iterates and never recreates the client. That means the channel stays offline until a manual restart, which is the exact outage this change is trying to fix.
…tion

## Problem

The Lark SDK's `WSClient.start()` is fire-and-forget — it returns immediately and manages reconnection internally. When the SDK exhausts its server-configured `reconnectCount` retries, it stops **silently**: no error thrown, no event emitted, no promise rejected. The existing `monitorWebSocket()` implementation parks on `abortSignal` after calling `start()`, so it can never detect this silent death. This is the root cause of openclaw#52618.

## Root cause analysis

Reading the upstream Lark SDK source (`larksuite/node-sdk`, `ws-client/index.ts`):

1. `start()` is declared `async` but calls `this.reConnect(true)` without `await` — fire-and-forget by design.
2. Internal reconnection: `communicate()` → `ws.on('close')` → `reConnect()` loops up to `reconnectCount` times (server-configured, typically 7).
3. After exhausting retries, the SDK simply stops. No callback, no event, no rejected promise. The WSClient instance becomes a zombie.
4. The SDK exposes `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }`, which is the only observable signal of reconnection state.

## Solution

Replace the park-on-abortSignal pattern with a **supervisor loop + health-check** that mirrors Slack/Telegram channel patterns:

1. **Outer supervisor loop** owns WSClient lifetime (create → monitor → destroy → retry).
2. **Inner health-check loop** polls `getReconnectInfo()` every 30s.
3. If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is also stale → SDK declared dead → `close({ force: true })` → supervisor recreates client with exponential backoff.
4. Non-recoverable errors (bad credentials, disabled app) break the loop.
5. `abortSignal` cleanly exits at any point (gateway restart, config reload).

## Why the previous approach didn't work

The v1 of this PR had a supervisor while-loop, but after `start()` it immediately parked on `abortSignal`. Since `start()` returns instantly (fire-and-forget), the while-loop could never re-iterate when the SDK silently gave up. The health-check polling pattern solves this.

Fixes openclaw#52618

30cce34 to 969f3e3 (Compare)
**v2: Rewrote based on Lark SDK source analysis + bot review feedback**

Thanks to @greptile-apps and @chatgpt-codex-connector for catching the critical bug in v1.

What we found in the SDK source:

```typescript
// larksuite/node-sdk: ws-client/index.ts
async start(params) {
  this.eventDispatcher = eventDispatcher;
  this.reConnect(true); // ← no await! fire-and-forget by design
}
```

The SDK's `start()` returns immediately, before any connection outcome is known. This means parking on `abortSignal` after calling it can never observe the SDK silently giving up.

Why v1 couldn't work: v1 had the right idea (supervisor while-loop) but parked on `abortSignal`. v2 approach: health-check polling via `getReconnectInfo()`.
```typescript
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// SDK started successfully; reset supervisor backoff.
supervisorAttempts = 0;
```
**Exponential backoff is always ~2s — reset fires before any failure is observable**

`supervisorAttempts = 0` is reset on line 199 immediately after `wsClient.start()` returns. Because `start()` is fire-and-forget and returns instantly regardless of connection state, this reset always runs before `sdkDied` can ever be set to `true`.

Tracing the cycle for rapid consecutive failures:

1. Iteration 1: `supervisorAttempts = 0` (initial). `start()` called, **reset to 0**. SDK dies → `sdkDied = true` → `supervisorAttempts += 1` → **1**. Delay = `computeReconnectDelay(1)` ≈ 2 s.
2. Iteration 2: `start()` called, **reset to 0 again**. SDK dies → `supervisorAttempts += 1` → **1**. Delay ≈ 2 s.
3. …repeats indefinitely at ~2 s.

The exponential growth (`1.8×`) is never observed because `supervisorAttempts` is reset to 0 before it can accumulate. The intended 60 s cap is unreachable.

The practical risk is that if the Lark server is down or rejecting connections, the supervisor will hammer it every ~2 s continuously, which can trigger server-side rate limiting or IP bans.

The reset should only happen after the health-check confirms a genuinely stable connection — for example, after the first successful `getReconnectInfo()` poll where `lastConnectTime` is recent:

```typescript
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// Do NOT reset here — start() is fire-and-forget and returns immediately.
// Reset is deferred to inside the health-check loop once we observe
// a recent lastConnectTime (connection actually established).
let backoffReset = false;
let lastSeenNextConnectTime = 0;
let staleStartedAt = 0;

while (!abortSignal?.aborted) {
  const result = await sleepWithAbort(HEALTH_CHECK_INTERVAL_MS, abortSignal);
  if (result === "aborted") break;

  const info = wsClient.getReconnectInfo();
  const now = Date.now();

  // First evidence of a live connection → reset backoff.
  if (!backoffReset && info.lastConnectTime > 0 && now - info.lastConnectTime < HEALTH_CHECK_STALE_MS) {
    supervisorAttempts = 0;
    backoffReset = true;
  }

  // … rest of staleness checks unchanged
}
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 969f3e3ba9
```typescript
botOpenIds.delete(accountId);
botNames.delete(accountId);
```
**Preserve bot identity across reconnect cycles**

This cleanup runs on every supervisor iteration, not just final shutdown, so any reconnect clears `botOpenIds`/`botNames` permanently for the account. After the first WS client recreation, handlers in `monitor.account.ts` read `botOpenIds.get(accountId)`/`botNames.get(accountId)` as empty, which breaks mention detection and bot-self checks for subsequent events until the whole account monitor is restarted. One way to scope the cleanup is sketched below.
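A minimal sketch of that scoping, assuming the surrounding supervisor structure (`feishuClients` is a hypothetical name for the per-account client map; the review comment above confirms a `finally` block exists):

```typescript
try {
  // … create wsClient, run the health-check loop …
} finally {
  // Per-iteration teardown: this client is dead and will be recreated.
  feishuClients.delete(accountId); // hypothetical map name

  // Identity teardown only on final shutdown, so reconnect cycles keep
  // mention detection and bot-self checks working.
  if (abortSignal?.aborted) {
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```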
```typescript
signal?.addEventListener(
  "abort",
  () => {
    clearTimeout(timer);
    resolve("aborted");
```
**Remove abort listeners when sleep completes normally**

Each call to `sleepWithAbort` registers a new abort listener, but when the timeout resolves (`"slept"`) the listener is never removed. In the 30s health-check loop this accumulates listeners for the lifetime of the process, causing avoidable memory growth on long-running gateways (especially with multiple Feishu accounts).
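A leak-free variant is straightforward. This is a minimal sketch matching the `"slept"`/`"aborted"` return values visible in the snippets above; the rest of the signature is assumed:

```typescript
type SleepResult = "slept" | "aborted";

// Resolves "slept" after ms, or "aborted" if the signal fires first.
// The abort listener is removed on the normal path, so repeated calls
// in a long-lived polling loop do not accumulate listeners.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<SleepResult> {
  return new Promise((resolve) => {
    if (signal?.aborted) {
      resolve("aborted");
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve("aborted");
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve("slept");
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```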
Closing this as superseded after Codex automated review. PR #52623 is superseded by the newer Feishu WebSocket supervisor PR #57978. Current main still has the original one-shot Feishu WebSocket monitor, but #57978 tracks the same #52618 recovery bug, explicitly addresses the unresolved review defects found in #52623, and adds regression coverage. What I checked:

So I'm closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 76cf013df507.
Summary

Replaces the park-on-`abortSignal` pattern in `monitorWebSocket()` with a supervisor loop + health-check that can detect and recover from the Lark SDK's silent reconnection exhaustion.

Fixes #52618

Root Cause (from Lark SDK source analysis)

We read the upstream `@larksuiteoapi/node-sdk` source (`ws-client/index.ts`) and found:

- `WSClient.start()` is fire-and-forget — declared `async` but calls `this.reConnect(true)` without `await`; it returns immediately.
- `communicate()` → `ws.on('close')` → `reConnect()` loops up to `reconnectCount` times (server-configured, typically 7).
- `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }` — `nextConnectTime` stops advancing when the SDK gives up.

The existing `monitorWebSocket()` parks on `abortSignal` after calling `start()`. Since `start()` returns instantly, the function can never detect when the SDK silently stops reconnecting. This is the root cause of #52618.

Solution

Supervisor loop + health-check polling, aligned with Slack/Telegram channel patterns (a sketch of the staleness check follows the list):

- Outer supervisor loop owns `WSClient` lifetime (create → monitor → destroy → retry with exponential backoff).
- Inner health-check loop polls `getReconnectInfo()` every 30s.
- If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is also old → SDK is dead → `close({ force: true })` → supervisor recreates client.
- `abortSignal` cleanly exits at any point (gateway restart, config reload).
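The snippets in the review thread elide the staleness logic itself ("… rest of staleness checks unchanged"). As a rough sketch of what "`nextConnectTime` hasn't advanced AND `lastConnectTime` is old" could look like — the helper name and exact structure are assumptions; only the 120s threshold comes from the PR:

```typescript
const HEALTH_CHECK_STALE_MS = 120_000;

type ReconnectInfo = { lastConnectTime: number; nextConnectTime: number };

// Returns a poll function that reports true once nextConnectTime has been
// frozen AND lastConnectTime has been stale for HEALTH_CHECK_STALE_MS.
function makeStalenessDetector() {
  let lastSeenNextConnectTime = 0;
  let staleSince = 0;
  return (info: ReconnectInfo, now: number): boolean => {
    if (info.nextConnectTime !== lastSeenNextConnectTime) {
      // SDK is still scheduling reconnects — alive.
      lastSeenNextConnectTime = info.nextConnectTime;
      staleSince = 0;
      return false;
    }
    if (now - info.lastConnectTime < HEALTH_CHECK_STALE_MS) {
      // Recently connected — alive even though nextConnectTime is frozen.
      staleSince = 0;
      return false;
    }
    if (staleSince === 0) staleSince = now;
    return now - staleSince >= HEALTH_CHECK_STALE_MS;
  };
}
```

Each health-check tick would call the returned function with `wsClient.getReconnectInfo()` and `Date.now()`; a `true` result triggers `close({ force: true })` and a supervised restart.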
Why v1 didn't work

The previous version of this PR had a supervisor while-loop, but after `start()` it immediately parked on a Promise that only resolved on `abortSignal`. Since `start()` is fire-and-forget, the while-loop could never re-iterate when the SDK silently gave up. Bot review correctly identified this. The health-check polling pattern solves this fundamental issue.

Changes

- `extensions/feishu/src/monitor.transport.ts` — rewrite `monitorWebSocket()`:
  - `HEALTH_CHECK_INTERVAL_MS` (30s) and `HEALTH_CHECK_STALE_MS` (120s) constants
  - health-check polling via `wsClient.getReconnectInfo()`
  - `wsClient.close({ force: true })` cleanup before recreation

Test plan

- `abortSignal` (gateway stop/restart) cleanly exits the loop