
fix(feishu): add supervisor loop for WebSocket reconnection#52623

Closed
bridzl wants to merge 1 commit into openclaw:main from bridzl:fix/feishu-websocket-reconnect

Conversation

@bridzl

@bridzl bridzl commented Mar 23, 2026

Summary

Replaces the park-on-abortSignal pattern in monitorWebSocket() with a supervisor loop + health-check that can detect and recover from the Lark SDK's silent reconnection exhaustion.

Fixes #52618

Root Cause (from Lark SDK source analysis)

We read the upstream @larksuiteoapi/node-sdk source (ws-client/index.ts) and found:

  1. WSClient.start() is fire-and-forget — declared async but calls this.reConnect(true) without await, returns immediately.
  2. Internal reconnection: communicate() → ws.on('close') → reConnect() loops up to reconnectCount times (server-configured, typically 7).
  3. Silent death: after exhausting retries, the SDK simply stops. No error thrown, no event emitted, no promise rejected. The WSClient becomes a zombie.
  4. The only observable signal: getReconnectInfo() → { lastConnectTime, nextConnectTime }; nextConnectTime stops advancing when the SDK gives up.

The existing monitorWebSocket() parks on abortSignal after calling start(). Since start() returns instantly, the function can never detect when the SDK silently stops reconnecting. This is the root cause of #52618.

Solution

Supervisor loop + health-check polling, aligned with Slack/Telegram channel patterns (a condensed sketch follows the list):

  1. Outer supervisor loop owns WSClient lifetime (create → monitor → destroy → retry with exponential backoff).
  2. Inner health-check loop polls getReconnectInfo() every 30s.
  3. Staleness detection: if nextConnectTime hasn't advanced for 120s AND lastConnectTime is also old → SDK is dead → close({ force: true }) → supervisor recreates client.
  4. Non-recoverable errors (bad credentials, disabled app) break the loop immediately.
  5. abortSignal cleanly exits at any point (gateway restart, config reload).
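
For orientation, here is a condensed sketch of that control flow. Everything in it is illustrative: WsLike stands in for the Lark WSClient surface this PR relies on, and sleepWithAbort/computeReconnectDelay are injected rather than assumed to exist under those names in monitor.transport.ts (non-recoverable error fast-fail omitted for brevity).

```typescript
type WsLike = {
  start(opts: object): void;
  getReconnectInfo(): { lastConnectTime: number; nextConnectTime: number };
  close(opts: { force: boolean }): void;
};

const HEALTH_CHECK_INTERVAL_MS = 30_000;
const HEALTH_CHECK_STALE_MS = 120_000;

async function supervise(
  createClient: () => WsLike,
  sleepWithAbort: (ms: number, signal?: AbortSignal) => Promise<"slept" | "aborted">,
  computeReconnectDelay: (attempt: number) => number,
  abortSignal?: AbortSignal,
): Promise<void> {
  let attempts = 0;
  while (!abortSignal?.aborted) {
    const ws = createClient();
    try {
      ws.start({}); // fire-and-forget: returns before any connection exists
      let lastSeenNext = 0;
      let staleSince = Date.now();
      // Inner health-check loop: exit on abort, or when reconnect info freezes.
      while (!abortSignal?.aborted) {
        if ((await sleepWithAbort(HEALTH_CHECK_INTERVAL_MS, abortSignal)) === "aborted") return;
        const info = ws.getReconnectInfo();
        const now = Date.now();
        if (info.nextConnectTime !== lastSeenNext) {
          lastSeenNext = info.nextConnectTime; // still advancing: SDK is alive
          staleSince = now;
        } else if (
          now - staleSince >= HEALTH_CHECK_STALE_MS &&
          now - info.lastConnectTime >= HEALTH_CHECK_STALE_MS
        ) {
          break; // frozen: SDK silently gave up, recreate the client
        }
      }
    } finally {
      ws.close({ force: true }); // always destroy before recreating
    }
    attempts += 1;
    if ((await sleepWithAbort(computeReconnectDelay(attempts), abortSignal)) === "aborted") return;
  }
}
```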

Why v1 didn't work

The previous version of this PR had a supervisor while-loop, but after start() it immediately parked on a Promise that only resolved on abortSignal. Since start() is fire-and-forget, the while-loop could never re-iterate when the SDK silently gave up. Bot review correctly identified this. The health-check polling pattern solves this fundamental issue.

Changes

  • extensions/feishu/src/monitor.transport.ts — rewrite monitorWebSocket():
    • Add HEALTH_CHECK_INTERVAL_MS (30s) and HEALTH_CHECK_STALE_MS (120s) constants
    • Supervisor outer loop with exponential backoff (2s initial, 60s max, 1.8x factor; delay schedule sketched after this list)
    • Health-check inner loop using wsClient.getReconnectInfo()
    • Proper wsClient.close({ force: true }) cleanup before recreation
    • Non-recoverable error detection (credential/app errors)
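
For reference, the delay schedule those backoff numbers imply. The exact formula of the repo's computeReconnectDelay is an assumption here; only the 2s/60s/1.8x parameters come from this PR.

```typescript
// Assumed shape of the backoff policy (2s initial, 60s cap, 1.8x factor);
// the repo's computeReconnectDelay may differ in detail.
function computeReconnectDelay(attempt: number): number {
  const initialMs = 2_000;
  const maxMs = 60_000;
  const factor = 1.8;
  return Math.min(maxMs, Math.round(initialMs * factor ** Math.max(0, attempt - 1)));
}
// attempt 1 → 2000 ms, 2 → 3600 ms, 3 → 6480 ms, …, capped at 60 000 ms from attempt 7.
```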

Test plan

  • Verify gateway starts with Feishu WebSocket channel enabled
  • Simulate network disruption (disconnect WiFi) — supervisor should detect stale connection and recreate client
  • Verify abortSignal (gateway stop/restart) cleanly exits the loop
  • Verify non-recoverable errors (invalid appId) don't trigger infinite retry

@openclaw-barnacle openclaw-barnacle Bot added channel: feishu Channel integration: feishu size: S labels Mar 23, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 23, 2026

Greptile Summary

This PR replaces the previous park-on-abortSignal implementation in monitorWebSocket() with a supervisor loop + health-check polling approach to detect and recover from the Lark SDK's silent reconnection exhaustion. The overall architecture — outer supervisor loop owning WSClient lifetime, inner loop polling getReconnectInfo(), staleness detection comparing nextConnectTime and lastConnectTime — is well-designed and aligns with the Slack/Telegram patterns in the codebase.

Key issues found:

  • Exponential backoff is non-functional (P1): supervisorAttempts = 0 is reset immediately after the fire-and-forget wsClient.start() returns, which happens before sdkDied can ever be observed. Every supervisor retry therefore calls computeReconnectDelay(1) (~2 s), so the 1.8× growth factor and 60 s cap are never reached. Under sustained failure this causes the supervisor to hammer the Lark endpoint at ~2 s intervals, risking server-side rate limiting.

  • Overly broad "app_id" substring match (P2): The isNonRecoverableFeishuError guard matches any error containing the literal app_id, which could permanently stop the supervisor for a recoverable error that merely references an app_id field in its message.

Confidence Score: 3/5

  • The core health-check supervisor pattern is sound, but the exponential backoff is functionally broken — requires a targeted fix before merge to avoid hammering the Lark endpoint under sustained failures.
  • The staleness detection logic, sleepWithAbort helper, finally-block cleanup, and abort-signal integration are all implemented correctly. The fundamental design goal (detect SDK silent death via getReconnectInfo()) is valid and an improvement over the previous version. However, resetting supervisorAttempts = 0 immediately after a fire-and-forget start() means the exponential backoff degenerates to a constant ~2 s — a concrete production reliability concern. One targeted fix (deferred reset to first confirmed healthy health-check) unblocks merge.
  • extensions/feishu/src/monitor.transport.ts — specifically the supervisorAttempts reset on line 199 and the isNonRecoverableFeishuError pattern on line 89.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 85-94

Comment:
**`"app_id"` substring match is too broad**

`msg.includes("app_id")` will match any error message that contains the literal string `app_id`, including internal SDK messages like `"missing field: app_id in event payload"` or similar. This could cause a transient or recoverable error to be silently classified as non-recoverable, permanently stopping the supervisor.

The other patterns (`"invalid app"`, `"app not exist"`, `"app has been disabled"`) are specific and safe. Consider removing the naked `"app_id"` term or replacing it with the exact Lark error string you intend to match (e.g., `"invalid app_id"` or `"app_id not found"`):

```suggestion
function isNonRecoverableFeishuError(err: unknown): boolean {
  const msg = String(err);
  return (
    msg.includes("credentials not configured") ||
    msg.includes("invalid app_id") ||
    msg.includes("invalid app") ||
    msg.includes("app has been disabled") ||
    msg.includes("app not exist")
  );
}
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "fix(feishu): supervisor loop with health..."

Comment on lines +147 to +164
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// Reset attempt counter on successful start.
reconnectAttempts = 0;

// Wait until abort signal fires. The Lark SDK manages the connection
// internally; when it exhausts its own retries (typically 7 attempts)
// the start() promise settles and we fall through to the reconnect
// loop below. Intentional stops (gateway restart, config reload) fire
// the abort signal and exit the loop cleanly.
await new Promise<void>((resolve) => {
  if (abortSignal?.aborted) {
    resolve();
    return;
  }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
});

P1 wsClient.start() is not awaited — reconnection loop never triggers

wsClient.start() returns a Promise, but it is called without await. This means:

  1. The SDK runs in the background, disconnected from the try/catch and the outer while loop.
  2. The code immediately falls through to await new Promise<void>(...), which resolves only when the abort signal fires.
  3. When the Lark SDK exhausts its 7 internal retries and its Promise settles (rejects), that rejection becomes an unhandled promise rejection — the supervisor loop never sees it.
  4. The function then hangs indefinitely waiting for an abort signal that never comes.

As a result, the core promise of the PR — "wrap monitorWebSocket() in a supervisor while-loop" — is not fulfilled. The reconnect path and the isNonRecoverableFeishuError fast-fail path are both unreachable.

The fix is to race wsClient.start() against the abort signal so that either path can advance the loop:

const startPromise = wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);
reconnectAttempts = 0;

const abortedPromise = new Promise<void>((resolve) => {
  if (abortSignal?.aborted) { resolve(); return; }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });
});

// Race: SDK gives up naturally, OR caller aborts intentionally.
await Promise.race([startPromise, abortedPromise]);

// If the abort signal won the race, break cleanly.
if (abortSignal?.aborted) break;
// Otherwise the SDK exhausted retries; fall through to reconnect.

Note: if startPromise rejects, the catch block below already handles it correctly. If abortedPromise wins while startPromise is still pending, the finally block tears down the client maps and the next iteration creates a fresh client — which is the intended behavior.


@@ -28,6 +28,50 @@ export type MonitorTransportParams = {
eventDispatcher: Lark.EventDispatcher;
};

/**
* Reconnection policy for Feishu WebSocket connections.
* Aligned with Slack/Telegram patterns (2s initial, 30s max, 1.8x factor).

P2 Stale JSDoc comment — "30s max" doesn't match maxMs: 60_000

The comment says 30s max but the constant directly below sets maxMs: 60_000 (60 seconds).

Suggested change
* Aligned with Slack/Telegram patterns (2s initial, 30s max, 1.8x factor).
* Aligned with Slack/Telegram patterns (2s initial, 60s max, 1.8x factor).


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 30cce342b4


Comment on lines +158 to +163
await new Promise<void>((resolve) => {
  if (abortSignal?.aborted) {
    resolve();
    return;
  }
  abortSignal?.addEventListener("abort", () => resolve(), { once: true });

P1 Drive the supervisor from WS failure, not only abort

I checked the upstream @larksuiteoapi/node-sdk WSClient.start() flow used here: it returns after setting up its own reconnect machinery, not when the socket later exhausts retries or gives up. In this implementation we immediately park on abortSignal, so after a real network drop the outer while never iterates and never recreates the client. That means the channel still stays offline until a manual restart, which is the exact outage this change is trying to fix.


…tion

## Problem

The Lark SDK's `WSClient.start()` is fire-and-forget — it returns
immediately and manages reconnection internally. When the SDK exhausts
its server-configured `reconnectCount` retries, it stops **silently**:
no error thrown, no event emitted, no promise rejected.

The existing `monitorWebSocket()` implementation parks on `abortSignal`
after calling `start()`, so it can never detect this silent death. This
is the root cause of openclaw#52618.

## Root cause analysis

Reading the upstream Lark SDK source (`larksuite/node-sdk`, `ws-client/index.ts`):

1. `start()` is declared `async` but calls `this.reConnect(true)` without
   `await` — fire-and-forget by design.
2. Internal reconnection: `communicate()` → `ws.on('close')` → `reConnect()`
   loops up to `reconnectCount` times (server-configured, typically 7).
3. After exhausting retries, the SDK simply stops. No callback, no event,
   no rejected promise. The WSClient instance becomes a zombie.
4. The SDK exposes `getReconnectInfo()` → `{ lastConnectTime, nextConnectTime }`
   which is the only observable signal of reconnection state.

## Solution

Replace the park-on-abortSignal pattern with a **supervisor loop +
health-check** that mirrors Slack/Telegram channel patterns:

1. **Outer supervisor loop** owns WSClient lifetime (create → monitor → destroy → retry).
2. **Inner health-check loop** polls `getReconnectInfo()` every 30s.
3. If `nextConnectTime` hasn't advanced for 120s AND `lastConnectTime` is
   also stale → SDK declared dead → `close({ force: true })` → supervisor
   recreates client with exponential backoff.
4. Non-recoverable errors (bad credentials, disabled app) break the loop.
5. `abortSignal` cleanly exits at any point (gateway restart, config reload).

## Why the previous approach didn't work

The v1 of this PR had a supervisor while-loop, but after `start()` it
immediately parked on `abortSignal`. Since `start()` returns instantly
(fire-and-forget), the while-loop could never re-iterate when the SDK
silently gave up. The health-check polling pattern solves this.

Fixes openclaw#52618
@bridzl bridzl force-pushed the fix/feishu-websocket-reconnect branch from 30cce34 to 969f3e3 on March 24, 2026 at 04:55
@bridzl
Author

bridzl commented Mar 24, 2026

v2: Rewrote based on Lark SDK source analysis + bot review feedback

Thanks to @greptile-apps and @chatgpt-codex-connector for catching the critical bug in v1 — wsClient.start() not being awaited. That review led us to dig into the actual Lark SDK source code (larksuite/node-sdk, ws-client/index.ts), which revealed the full picture:

What we found in the SDK source

// larksuite/node-sdk: ws-client/index.ts
async start(params) {
    this.eventDispatcher = params.eventDispatcher;
    this.reConnect(true);  // ← no await! fire-and-forget by design
}

The SDK's start() is intentionally fire-and-forget. The internal reconnection loop (reConnect → communicate → ws.on('close') → reConnect) runs up to reconnectCount times, then stops silently — no error, no event, no rejection.

This means await wsClient.start() wouldn't help either — the Promise resolves before the connection is even established.

Why v1 couldn't work

v1 had the right idea (supervisor while-loop) but parked on abortSignal after start(). Since start() returns instantly, the loop could never detect SDK death.

v2 approach: health-check polling via getReconnectInfo()

The SDK exposes getReconnectInfo() → { lastConnectTime, nextConnectTime }. During active reconnection, nextConnectTime advances with each retry. When the SDK gives up, it freezes.

v2 polls this every 30s. If stale for 120s → declare dead → close({ force: true }) → recreate client with exponential backoff. This is the same supervisor pattern used by the Slack and Telegram channels in OpenClaw.
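
In code terms, the staleness rule reduces to a predicate like the following. The names here are illustrative stand-ins; in the actual patch the check lives inline in the health-check loop.

```typescript
// Illustrative staleness predicate matching the rule above: declared dead only
// when nextConnectTime has been frozen for 120s AND lastConnectTime is old too.
const STALE_MS = 120_000;

function sdkLooksDead(
  info: { lastConnectTime: number; nextConnectTime: number },
  lastSeenNextConnectTime: number, // value observed on a previous poll
  staleSince: number,              // timestamp when freezing was first seen
  now: number,
): boolean {
  return (
    info.nextConnectTime === lastSeenNextConnectTime &&
    now - staleSince >= STALE_MS &&
    now - info.lastConnectTime >= STALE_MS
  );
}
```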

Rebased onto latest main (e864421). Ready for maintainer review.

Comment on lines 195 to +199
wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// SDK started successfully; reset supervisor backoff.
supervisorAttempts = 0;

P1 Exponential backoff is always ~2s — reset fires before any failure is observable

supervisorAttempts = 0 is reset on line 199 immediately after wsClient.start() returns. Because start() is fire-and-forget and returns instantly regardless of connection state, this reset always runs before sdkDied can ever be set to true.

Tracing the cycle for rapid consecutive failures:

  1. Iteration 1: supervisorAttempts = 0 (initial). start() called, reset to 0. SDK dies → sdkDied = true → supervisorAttempts += 1 → 1. Delay = computeReconnectDelay(1) ≈ 2 s.
  2. Iteration 2: start() called, reset to 0 again. SDK dies → supervisorAttempts += 1 → 1. Delay ≈ 2 s.
  3. …repeats indefinitely at ~2 s.

The exponential growth (1.8×) is never observed because supervisorAttempts is reset to 0 before it can accumulate. The intended 60 s cap is unreachable.

The practical risk is that if the Lark server is down or rejecting connections, the supervisor will hammer it every ~2 s continuously, which can trigger server-side rate limiting or IP bans.

The reset should only happen after the health-check confirms a genuinely stable connection — for example, after the first successful getReconnectInfo() poll where lastConnectTime is recent:

wsClient.start({ eventDispatcher });
log(`feishu[${accountId}]: WebSocket client started`);

// Do NOT reset here — start() is fire-and-forget and returns immediately.
// Reset is deferred to inside the health-check loop once we observe
// a recent lastConnectTime (connection actually established).
let backoffReset = false;
let lastSeenNextConnectTime = 0;
let staleStartedAt = 0;

while (!abortSignal?.aborted) {
  const result = await sleepWithAbort(HEALTH_CHECK_INTERVAL_MS, abortSignal);
  if (result === "aborted") break;

  const info = wsClient.getReconnectInfo();
  const now = Date.now();

  // First evidence of a live connection → reset backoff.
  if (!backoffReset && info.lastConnectTime > 0 && now - info.lastConnectTime < HEALTH_CHECK_STALE_MS) {
    supervisorAttempts = 0;
    backoffReset = true;
  }
  // … rest of staleness checks unchanged
}


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 969f3e3ba9


Comment on lines +259 to +260
botOpenIds.delete(accountId);
botNames.delete(accountId);

P1 Preserve bot identity across reconnect cycles

This cleanup runs on every supervisor iteration, not just final shutdown, so any reconnect clears botOpenIds/botNames permanently for the account. After the first WS client recreation, handlers in monitor.account.ts read botOpenIds.get(accountId)/botNames.get(accountId) as empty, which breaks mention detection and bot-self checks for subsequent events until the whole account monitor is restarted.
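
One way to address this, sketched under the assumption that the cleanup sits in the supervisor's per-iteration teardown; WSClientLike, botOpenIds, and botNames are stand-ins for the repo's internals.

```typescript
// Sketch: destroy the client on every iteration, but only erase per-account
// identity on intentional shutdown, so reconnect cycles keep mention
// detection and bot-self checks working.
type WSClientLike = { close(opts: { force: boolean }): void };

const botOpenIds = new Map<string, string>();
const botNames = new Map<string, string>();

function teardownIteration(
  accountId: string,
  wsClient: WSClientLike | undefined,
  abortSignal?: AbortSignal,
): void {
  wsClient?.close({ force: true }); // the dead client is always destroyed
  if (abortSignal?.aborted) {
    // Final shutdown only: a supervisor retry must not clear identity maps.
    botOpenIds.delete(accountId);
    botNames.delete(accountId);
  }
}
```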


Comment on lines +74 to +78
signal?.addEventListener(
  "abort",
  () => {
    clearTimeout(timer);
    resolve("aborted");

P2 Remove abort listeners when sleep completes normally

Each call to sleepWithAbort registers a new abort listener, but when the timeout resolves ("slept") the listener is never removed. In the 30s health-check loop this accumulates listeners for the lifetime of the process, causing avoidable memory growth on long-running gateways (especially with multiple Feishu accounts).
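
A sleepWithAbort variant that detaches its listener when the timer wins would avoid the accumulation. This sketch mirrors the snippet above but is not the repo's exact helper.

```typescript
// Sketch: same contract as above, but the "abort" listener is removed when
// the sleep completes normally, so repeated 30s polls do not accumulate
// listeners on a long-lived AbortSignal.
function sleepWithAbort(ms: number, signal?: AbortSignal): Promise<"slept" | "aborted"> {
  return new Promise((resolve) => {
    if (signal?.aborted) {
      resolve("aborted");
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve("aborted");
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve("slept");
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```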


@clawsweeper
Contributor

clawsweeper Bot commented Apr 26, 2026

Closing this as duplicate or superseded after Codex automated review.

PR #52623 is superseded by the newer Feishu WebSocket supervisor PR #57978. Current main still has the original one-shot Feishu WebSocket monitor, but #57978 tracks the same #52618 recovery bug and explicitly addresses the unresolved review defects found in #52623.

Best possible solution:

Close #52623 as superseded and keep the remaining Feishu WebSocket recovery work focused on #57978, where the same #52618 bug is tracked with fixes for the known review regressions and added regression coverage.

What I checked:

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Codex Review notes: model gpt-5.5, reasoning high; reviewed against 76cf013df507.


Labels

channel: feishu Channel integration: feishu size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(feishu): WebSocket connection not recovered after network disruption

1 participant