fix(feishu): add supervisor loop for WebSocket reconnection #52623
bridzl wants to merge 1 commit into openclaw:main
Conversation
Greptile Summary: This PR replaces the previous park-on-abortSignal pattern with a supervisor loop + health-check. Key issues found:
Confidence Score: 3/5
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 85-94
Comment:
**`"app_id"` substring match is too broad**
`msg.includes("app_id")` will match any error message that contains the literal string `app_id`, including internal SDK messages like `"missing field: app_id in event payload"` or similar. This could cause a transient or recoverable error to be silently classified as non-recoverable, permanently stopping the supervisor.
The other patterns (`"invalid app"`, `"app not exist"`, `"app has been disabled"`) are specific and safe. Consider removing the naked `"app_id"` term or replacing it with the exact Lark error string you intend to match (e.g., `"invalid app_id"` or `"app_id not found"`):
```suggestion
function isNonRecoverableFeishuError(err: unknown): boolean {
const msg = String(err);
return (
msg.includes("credentials not configured") ||
msg.includes("invalid app_id") ||
msg.includes("invalid app") ||
msg.includes("app has been disabled") ||
msg.includes("app not exist")
);
}
```
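For illustration, here is how the naked substring check misfires — the error strings below are hypothetical stand-ins, not exact SDK messages:

```typescript
// Old predicate: naked "app_id" substring check.
const matchesOld = (err: unknown) => String(err).includes("app_id");

// A transient validation error that merely mentions the field name is
// classified as non-recoverable — the supervisor would stop for good:
matchesOld(new Error("missing field: app_id in event payload")); // → true (false positive)

// The tightened pattern fires only on the genuinely fatal message:
const matchesNew = (err: unknown) => String(err).includes("invalid app_id");
matchesNew(new Error("missing field: app_id in event payload")); // → false
matchesNew(new Error("invalid app_id"));                         // → true
```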
How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix(feishu): supervisor loop with health..."
```typescript
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// Reset attempt counter on successful start.
reconnectAttempts = 0;

// Wait until abort signal fires. The Lark SDK manages the connection
// internally; when it exhausts its own retries (typically 7 attempts)
// the start() promise settles and we fall through to the reconnect
// loop below. Intentional stops (gateway restart, config reload) fire
// the abort signal and exit the loop cleanly.
await new Promise<void>((resolve) => {
  if (abortSignal?.aborted) {
    resolve();
    return;
  }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
});
```
**`wsClient.start()` is not awaited — reconnection loop never triggers**

`wsClient.start()` returns a Promise, but it is called without `await`. This means:

1. The SDK runs in the background, disconnected from the `try/catch` and the outer `while` loop.
2. The code immediately falls through to `await new Promise<void>(...)`, which resolves **only** when the abort signal fires.
3. When the Lark SDK exhausts its 7 internal retries and its Promise settles (rejects), that rejection becomes an unhandled promise rejection — the supervisor loop never sees it.
4. The function then hangs indefinitely waiting for an abort signal that never comes.

As a result, the core promise of the PR — "wrap `monitorWebSocket()` in a supervisor while-loop" — is not fulfilled. The reconnect path and the `isNonRecoverableFeishuError` fast-fail path are both unreachable.

The fix is to race `wsClient.start()` against the abort signal so that either path can advance the loop:

```typescript
const startPromise = wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);
reconnectAttempts = 0;

const abortedPromise = new Promise<void>((resolve) => {
  if (abortSignal?.aborted) { resolve(); return; }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
});

// Race: SDK gives up naturally, OR caller aborts intentionally.
await Promise.race([startPromise, abortedPromise]);

// If the abort signal won the race, break cleanly.
if (abortSignal?.aborted) break;

// Otherwise the SDK exhausted retries; fall through to reconnect.
```

Note: if `startPromise` rejects, the `catch` block below already handles it correctly. If `abortedPromise` wins while `startPromise` is still pending, the `finally` block tears down the client maps and the next iteration creates a fresh client — which is the intended behavior.
```diff
@@ -28,6 +28,50 @@ export type MonitorTransportParams = {
   eventDispatcher: Lark.EventDispatcher;
 };
+
+/**
+ * Reconnection policy for Feishu WebSocket connections.
+ * Aligned with Slack/Telegram patterns (2s initial, 30s max, 1.8x factor).
```
**Stale JSDoc comment — "30s max" doesn't match `maxMs: 60_000`**

The comment says `30s max` but the constant directly below sets `maxMs: 60_000` (60 seconds).

```suggestion
 * Aligned with Slack/Telegram patterns (2s initial, 60s max, 1.8x factor).
```
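For context, the numbers in this policy map onto a small backoff helper. The name `computeReconnectDelay` appears in the review comments on this PR, but the body below is only a sketch of the stated policy (2s initial, 1.8× factor, 60s cap — consistent with `computeReconnectDelay(1)` ≈ 2s), not the PR's actual implementation; jitter is omitted:

```typescript
const RECONNECT_POLICY = {
  initialMs: 2_000, // first retry after ~2s
  maxMs: 60_000,    // cap at 60s — the value the JSDoc should state
  factor: 1.8,      // exponential growth factor
};

// attempt is 1-based: attempt 1 → 2s, attempt 2 → 3.6s, …, capped at 60s.
function computeReconnectDelay(attempt: number): number {
  const delay = RECONNECT_POLICY.initialMs * RECONNECT_POLICY.factor ** (attempt - 1);
  return Math.min(RECONNECT_POLICY.maxMs, delay);
}
```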
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30cce342b4
| await new Promise<void>((resolve) => { | ||
| if (abortSignal?.aborted) { | ||
| resolve(); | ||
| return; | ||
| } | ||
| abortSignal?.addEventListener("abort", () => resolve(), { once: true }); |
**Drive the supervisor from WS failure, not only abort**

I checked the upstream `@larksuiteoapi/node-sdk` `WSClient.start()` flow used here: it returns after setting up its own reconnect machinery, not when the socket later exhausts retries or gives up. In this implementation we immediately park on `abortSignal`, so after a real network drop the outer `while` never iterates and never recreates the client. That means the channel stays offline until a manual restart, which is the exact outage this change is trying to fix.
…tion

## Problem

The Lark SDK's `WSClient.start()` is fire-and-forget — it returns immediately and manages reconnection internally. When the SDK exhausts its server-configured `reconnectCount` retries, it stops **silently**: no error thrown, no event emitted, no promise rejected. The existing `monitorWebSocket()` implementation parks on `abortSignal` after calling `start()`, so it can never detect this silent death. This is the root cause of openclaw#52618.

## Root cause analysis

Reading the upstream Lark SDK source (`larksuite/node-sdk`, `ws-client/index.ts`):

1. `start()` is declared `async` but calls `this.reConnect(true)` without `await` — fire-and-forget by design.
2. Internal reconnection: `communicate()` → `ws.on('close')` → `reConnect()` loops up to `reconnectCount` times (server-configured, typically 7).
3. After exhausting retries, the SDK simply stops. No callback, no event, no rejected promise. The WSClient instance becomes a zombie.
4. The SDK exposes `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }`, which is the only observable signal of reconnection state.

## Solution

Replace the park-on-abortSignal pattern with a **supervisor loop + health-check** that mirrors Slack/Telegram channel patterns:

1. **Outer supervisor loop** owns WSClient lifetime (create → monitor → destroy → retry).
2. **Inner health-check loop** polls `getReconnectInfo()` every 30s.
3. If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is also stale → SDK declared dead → `close({ force: true })` → supervisor recreates client with exponential backoff.
4. Non-recoverable errors (bad credentials, disabled app) break the loop.
5. `abortSignal` cleanly exits at any point (gateway restart, config reload).

## Why the previous approach didn't work

The v1 of this PR had a supervisor while-loop, but after `start()` it immediately parked on `abortSignal`. Since `start()` returns instantly (fire-and-forget), the while-loop could never re-iterate when the SDK silently gave up. The health-check polling pattern solves this.

Fixes openclaw#52618

30cce34 to 969f3e3 (Compare)
**v2: Rewrote based on Lark SDK source analysis + bot review feedback**

Thanks to @greptile-apps and @chatgpt-codex-connector for catching the critical bug in v1.

What we found in the SDK source:

```typescript
// larksuite/node-sdk: ws-client/index.ts
async start(params) {
  this.eventDispatcher = eventDispatcher;
  this.reConnect(true); // ← no await! fire-and-forget by design
}
```

The SDK's `start()` returns immediately, before any connection outcome is known. This means parking on `abortSignal` after calling it can never observe the SDK silently giving up.

Why v1 couldn't work: v1 had the right idea (supervisor while-loop) but parked on `abortSignal`. v2 approach: health-check polling via `getReconnectInfo()`.
```typescript
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// SDK started successfully; reset supervisor backoff.
supervisorAttempts = 0;
```
**Exponential backoff is always ~2s — reset fires before any failure is observable**

`supervisorAttempts = 0` is reset on line 199 immediately after `wsClient.start()` returns. Because `start()` is fire-and-forget and returns instantly regardless of connection state, this reset always runs before `sdkDied` can ever be set to `true`.

Tracing the cycle for rapid consecutive failures:

1. Iteration 1: `supervisorAttempts = 0` (initial). `start()` called, **reset to 0**. SDK dies → `sdkDied = true` → `supervisorAttempts += 1` → **1**. Delay = `computeReconnectDelay(1)` ≈ 2 s.
2. Iteration 2: `start()` called, **reset to 0 again**. SDK dies → `supervisorAttempts += 1` → **1**. Delay ≈ 2 s.
3. …repeats indefinitely at ~2 s.

The exponential growth (`1.8×`) is never observed because `supervisorAttempts` is reset to 0 before it can accumulate. The intended 60 s cap is unreachable.

The practical risk is that if the Lark server is down or rejecting connections, the supervisor will hammer it every ~2 s continuously, which can trigger server-side rate limiting or IP bans.

The reset should only happen after the health-check confirms a genuinely stable connection — for example, after the first successful `getReconnectInfo()` poll where `lastConnectTime` is recent:

```typescript
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// Do NOT reset here — start() is fire-and-forget and returns immediately.
// Reset is deferred to inside the health-check loop once we observe
// a recent lastConnectTime (connection actually established).
let backoffReset = false;
let lastSeenNextConnectTime = 0;
let staleStartedAt = 0;

while (!abortSignal?.aborted) {
  const result = await sleepWithAbort(HEALTH_CHECK_INTERVAL_MS, abortSignal);
  if (result === "aborted") break;

  const info = wsClient.getReconnectInfo();
  const now = Date.now();

  // First evidence of a live connection → reset backoff.
  if (!backoffReset && info.lastConnectTime > 0 && now - info.lastConnectTime < HEALTH_CHECK_STALE_MS) {
    supervisorAttempts = 0;
    backoffReset = true;
  }

  // … rest of staleness checks unchanged
}
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 969f3e3ba9
```typescript
botOpenIds.delete(accountId);
botNames.delete(accountId);
```
**Preserve bot identity across reconnect cycles**

This cleanup runs on every supervisor iteration, not just final shutdown, so any reconnect clears `botOpenIds`/`botNames` permanently for the account. After the first WS client recreation, handlers in `monitor.account.ts` read `botOpenIds.get(accountId)`/`botNames.get(accountId)` as empty, which breaks mention detection and bot-self checks for subsequent events until the whole account monitor is restarted. One way to scope the cleanup is sketched below.
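A minimal sketch of that scoping, assuming the surrounding supervisor structure (`feishuClients` is a hypothetical name for the per-account client map; the review comment above confirms a `finally` block exists):

```typescript
try {
  // … create wsClient, run the health-check loop …
} finally {
  // Per-iteration teardown: this client is dead and will be recreated.
  feishuClients.delete(accountId); // hypothetical map name

  // Identity teardown only on final shutdown, so reconnect cycles keep
  // mention detection and bot-self checks working.
  if (abortSignal?.aborted) {
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```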
```typescript
signal?.addEventListener(
  "abort",
  () => {
    clearTimeout(timer);
    resolve("aborted");
```
**Remove abort listeners when sleep completes normally**

Each call to `sleepWithAbort` registers a new abort listener, but when the timeout resolves (`"slept"`) the listener is never removed. In the 30s health-check loop this accumulates listeners for the lifetime of the process, causing avoidable memory growth on long-running gateways (especially with multiple Feishu accounts).
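A leak-free variant is straightforward. This is a minimal sketch matching the `"slept"`/`"aborted"` return values visible in the snippets above; the rest of the signature is assumed:

```typescript
type SleepResult = "slept" | "aborted";

// Resolves "slept" after ms, or "aborted" if the signal fires first.
// The abort listener is removed on the normal path, so repeated calls
// in a long-lived polling loop do not accumulate listeners.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<SleepResult> {
  return new Promise((resolve) => {
    if (signal?.aborted) {
      resolve("aborted");
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve("aborted");
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve("slept");
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```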
Closing this as superseded after Codex automated review. PR #52623 is superseded by the newer Feishu WebSocket supervisor PR #57978. Current main still has the original one-shot Feishu WebSocket monitor, but #57978 tracks the same #52618 recovery bug, explicitly addresses the unresolved review defects found in #52623, and adds regression coverage. What I checked:

So I'm closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 76cf013df507.
Summary

Replaces the park-on-`abortSignal` pattern in `monitorWebSocket()` with a supervisor loop + health-check that can detect and recover from the Lark SDK's silent reconnection exhaustion.

Fixes #52618

Root Cause (from Lark SDK source analysis)

We read the upstream `@larksuiteoapi/node-sdk` source (`ws-client/index.ts`) and found:

- `WSClient.start()` is fire-and-forget — declared `async` but calls `this.reConnect(true)` without `await`; it returns immediately.
- `communicate()` → `ws.on('close')` → `reConnect()` loops up to `reconnectCount` times (server-configured, typically 7).
- `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }` — `nextConnectTime` stops advancing when the SDK gives up.

The existing `monitorWebSocket()` parks on `abortSignal` after calling `start()`. Since `start()` returns instantly, the function can never detect when the SDK silently stops reconnecting. This is the root cause of #52618.

Solution

Supervisor loop + health-check polling, aligned with Slack/Telegram channel patterns (a sketch of the staleness check follows the list):

- Outer supervisor loop owns `WSClient` lifetime (create → monitor → destroy → retry with exponential backoff).
- Inner health-check loop polls `getReconnectInfo()` every 30s.
- If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is also old → SDK is dead → `close({ force: true })` → supervisor recreates client.
- `abortSignal` cleanly exits at any point (gateway restart, config reload).
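The snippets in the review thread elide the staleness logic itself ("… rest of staleness checks unchanged"). As a rough sketch of what "`nextConnectTime` hasn't advanced AND `lastConnectTime` is old" could look like — the helper name and exact structure are assumptions; only the 120s threshold comes from the PR:

```typescript
const HEALTH_CHECK_STALE_MS = 120_000;

type ReconnectInfo = { lastConnectTime: number; nextConnectTime: number };

// Returns a poll function that reports true once nextConnectTime has been
// frozen AND lastConnectTime has been stale for HEALTH_CHECK_STALE_MS.
function makeStalenessDetector() {
  let lastSeenNextConnectTime = 0;
  let staleSince = 0;
  return (info: ReconnectInfo, now: number): boolean => {
    if (info.nextConnectTime !== lastSeenNextConnectTime) {
      // SDK is still scheduling reconnects — alive.
      lastSeenNextConnectTime = info.nextConnectTime;
      staleSince = 0;
      return false;
    }
    if (now - info.lastConnectTime < HEALTH_CHECK_STALE_MS) {
      // Recently connected — alive even though nextConnectTime is frozen.
      staleSince = 0;
      return false;
    }
    if (staleSince === 0) staleSince = now;
    return now - staleSince >= HEALTH_CHECK_STALE_MS;
  };
}
```

Each health-check tick would call the returned function with `wsClient.getReconnectInfo()` and `Date.now()`; a `true` result triggers `close({ force: true })` and a supervised restart.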
Why v1 didn't work

The previous version of this PR had a supervisor while-loop, but after `start()` it immediately parked on a Promise that only resolved on `abortSignal`. Since `start()` is fire-and-forget, the while-loop could never re-iterate when the SDK silently gave up. Bot review correctly identified this. The health-check polling pattern solves this fundamental issue.

Changes

- `extensions/feishu/src/monitor.transport.ts` — rewrite `monitorWebSocket()`:
  - `HEALTH_CHECK_INTERVAL_MS` (30s) and `HEALTH_CHECK_STALE_MS` (120s) constants
  - health-check polling via `wsClient.getReconnectInfo()`
  - `wsClient.close({ force: true })` cleanup before recreation

Test plan

- `abortSignal` (gateway stop/restart) cleanly exits the loop