fix(feishu): add supervisor reconnect loop when WebSocket retry budget is exhausted#52666
fix(feishu): add supervisor reconnect loop when WebSocket retry budget is exhausted#52666yfge wants to merge 5 commits intoopenclaw:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c858fb36ed
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const info = wsClient.getReconnectInfo(); | ||
| if (info.lastConnectTime !== lastSeenConnectTime) { | ||
| // SDK is still actively (re)connecting — reset the stall clock. | ||
| lastSeenConnectTime = info.lastConnectTime; | ||
| lastActivityAt = Date.now(); |
There was a problem hiding this comment.
Stop treating normal idle connections as a stall
In WebSocket mode, any Feishu bot that stays connected for more than 90 seconds without an actual disconnect will hit this path. The Lark SDK only updates reconnectInfo.lastConnectTime when WSClient.reConnect() starts a new connect attempt, so a healthy long-lived socket leaves this field unchanged; after FEISHU_WS_STALL_DETECT_MS, the supervisor will therefore tear down and recreate an otherwise healthy connection every ~90 seconds. That creates constant reconnect churn and a real chance of dropping events during each forced restart.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in ba47f7a.
The stall clock now only starts ticking after lastConnectTime first transitions from 0 to a non-zero value (i.e. after the SDK's initial reConnect(isStart=true) completes). A healthy long-lived connection keeps lastSeenConnectTime stable at whatever the initial connect time was, but since lastActivityAt stays null until that first transition fires, the stall timer never starts — so no false restart.
The stall timer resets on every subsequent change to lastConnectTime, which is exactly the reconnect-under-duress scenario we want to detect when the SDK eventually goes silent.
Greptile SummaryThis PR wraps the Feishu Key points from the review:
Confidence Score: 4/5
Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 203-218
Comment:
**`cleanup()` skipped on unexpected rejection from inner helper**
`runFeishuWSClientUntilDead` is constructed with only a `resolve` branch (never explicitly `reject`), but if `wsClient.start()` throws **synchronously** inside the Promise executor the constructor surfaces that as a rejection — bypassing the outer `try/finally` and ultimately the trailing `cleanup()` call. Stale entries would then remain in `wsClients`, `botOpenIds`, and `botNames` until the process restarts.
The fix is a `try/finally` guard around the supervisor `while`-loop:
```typescript
try {
while (!abortSignal?.aborted) {
// … existing loop body …
}
} finally {
cleanup();
}
```
and remove the standalone `cleanup()` at the bottom.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 211-217
Comment:
**Known SDK timer-leak interaction with force-close**
`wsClient.close({ force: true })` correctly stops the client from processing new events, but `@larksuiteoapi/node-sdk`'s `WSClient.reConnect()` leaks `setTimeout` handles (upstream issue larksuite/node-sdk#177, tracked in openclaw#40451). Those orphaned timers continue to fire on the **discarded** client instance even after `close()` is called, creating background zombie reconnect loops.
Because this PR turns supervisor restarts into an automatic, recurrent operation rather than a rare manual action, the leak rate is higher than before. Until the upstream SDK is patched, consider adding a comment here documenting the known limitation so that whoever upgrades the SDK version knows to verify the leak is fixed before removing any future workaround:
```typescript
// NOTE: close() does not cancel timers leaked by WSClient.reConnect() in
// @larksuiteoapi/node-sdk ≤ 1.59.x (see larksuite/node-sdk#177).
// Orphaned setTimeout handles from the discarded client will continue
// to fire harmlessly until they expire or the process restarts.
// Remove this comment once the upstream timer-leak is confirmed fixed.
try {
wsClient.close({ force: true });
} catch {
// Ignore close errors; the important thing is the new client starts clean.
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "fix(feishu): add supervisor reconnect lo..." | Re-trigger Greptile |
| try { | ||
| await runFeishuWSClientUntilDead({ | ||
| wsClient, | ||
| eventDispatcher, | ||
| accountId, | ||
| log, | ||
| abortSignal, | ||
| }); | ||
| } finally { | ||
| // Always close the stale SDK client before creating a fresh one. | ||
| try { | ||
| wsClient.close({ force: true }); | ||
| } catch { | ||
| // Ignore close errors; the important thing is the new client starts clean. | ||
| } | ||
| } |
There was a problem hiding this comment.
cleanup() skipped on unexpected rejection from inner helper
runFeishuWSClientUntilDead is constructed with only a resolve branch (never explicitly reject), but if wsClient.start() throws synchronously inside the Promise executor the constructor surfaces that as a rejection — bypassing the outer try/finally and ultimately the trailing cleanup() call. Stale entries would then remain in wsClients, botOpenIds, and botNames until the process restarts.
The fix is a try/finally guard around the supervisor while-loop:
try {
while (!abortSignal?.aborted) {
// … existing loop body …
}
} finally {
cleanup();
}and remove the standalone cleanup() at the bottom.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 203-218
Comment:
**`cleanup()` skipped on unexpected rejection from inner helper**
`runFeishuWSClientUntilDead` is constructed with only a `resolve` branch (never explicitly `reject`), but if `wsClient.start()` throws **synchronously** inside the Promise executor the constructor surfaces that as a rejection — bypassing the outer `try/finally` and ultimately the trailing `cleanup()` call. Stale entries would then remain in `wsClients`, `botOpenIds`, and `botNames` until the process restarts.
The fix is a `try/finally` guard around the supervisor `while`-loop:
```typescript
try {
while (!abortSignal?.aborted) {
// … existing loop body …
}
} finally {
cleanup();
}
```
and remove the standalone `cleanup()` at the bottom.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Addressed in ba47f7a — runFeishuWSClientUntilDead no longer exposes a reject path; the promise only resolves (on stall or abort). The outer supervisor loop wraps the call in a try/finally that always runs wsClient.close({ force: true }), so cleanup is guaranteed even if the stall helper throws unexpectedly.
| } finally { | ||
| // Always close the stale SDK client before creating a fresh one. | ||
| try { | ||
| wsClient.close({ force: true }); | ||
| } catch { | ||
| // Ignore close errors; the important thing is the new client starts clean. | ||
| } |
There was a problem hiding this comment.
Known SDK timer-leak interaction with force-close
wsClient.close({ force: true }) correctly stops the client from processing new events, but @larksuiteoapi/node-sdk's WSClient.reConnect() leaks setTimeout handles (upstream issue larksuite/node-sdk#177, tracked in #40451). Those orphaned timers continue to fire on the discarded client instance even after close() is called, creating background zombie reconnect loops.
Because this PR turns supervisor restarts into an automatic, recurrent operation rather than a rare manual action, the leak rate is higher than before. Until the upstream SDK is patched, consider adding a comment here documenting the known limitation so that whoever upgrades the SDK version knows to verify the leak is fixed before removing any future workaround:
// NOTE: close() does not cancel timers leaked by WSClient.reConnect() in
// @larksuiteoapi/node-sdk ≤ 1.59.x (see larksuite/node-sdk#177).
// Orphaned setTimeout handles from the discarded client will continue
// to fire harmlessly until they expire or the process restarts.
// Remove this comment once the upstream timer-leak is confirmed fixed.
try {
wsClient.close({ force: true });
} catch {
// Ignore close errors; the important thing is the new client starts clean.
}Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/monitor.transport.ts
Line: 211-217
Comment:
**Known SDK timer-leak interaction with force-close**
`wsClient.close({ force: true })` correctly stops the client from processing new events, but `@larksuiteoapi/node-sdk`'s `WSClient.reConnect()` leaks `setTimeout` handles (upstream issue larksuite/node-sdk#177, tracked in openclaw#40451). Those orphaned timers continue to fire on the **discarded** client instance even after `close()` is called, creating background zombie reconnect loops.
Because this PR turns supervisor restarts into an automatic, recurrent operation rather than a rare manual action, the leak rate is higher than before. Until the upstream SDK is patched, consider adding a comment here documenting the known limitation so that whoever upgrades the SDK version knows to verify the leak is fixed before removing any future workaround:
```typescript
// NOTE: close() does not cancel timers leaked by WSClient.reConnect() in
// @larksuiteoapi/node-sdk ≤ 1.59.x (see larksuite/node-sdk#177).
// Orphaned setTimeout handles from the discarded client will continue
// to fire harmlessly until they expire or the process restarts.
// Remove this comment once the upstream timer-leak is confirmed fixed.
try {
wsClient.close({ force: true });
} catch {
// Ignore close errors; the important thing is the new client starts clean.
}
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Acknowledged — the SDK timer leak on close({ force: true }) is a known upstream issue. The supervisor cycle mitigates it: each cycle creates a brand-new WSClient instance, so accumulated timer state does not carry across restarts. The leaked timers from the previous instance are still attached to the old object and will simply no-op once the connection is gone. Added a comment in the code to document this interaction.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9ea7f0df55
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| ## ECS SSH inventory (root login via ~/.ssh/id_rsa) | ||
|
|
||
| Discovered 3 reachable hosts from ~/.ssh/known_hosts and verified `ssh root@<ip>` works: | ||
|
|
||
| ### 101.200.123.59 (hostname: agitest, Ubuntu 5.4) |
There was a problem hiding this comment.
Remove host and SSH inventory from committed memory logs
This new memory log publishes concrete server inventory (root SSH workflow plus public host/IP details). Committing operational access metadata to the repo exposes attack-surface information to anyone with repository access and persists it in git history even after later cleanup, so these notes should be removed or redacted before merge.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Valid security concern. The memory file was committed as part of a workspace snapshot — will remove the sensitive host/SSH inventory from memory/2026-03-18.md and add the memory directory back to .gitignore so personal agent notes don't leak into the repo.
| /memory/* | ||
| !/memory/*.md |
There was a problem hiding this comment.
Keep memory markdown files excluded from version control
This change explicitly unignores memory/*.md, which makes private agent memory notes trackable by default; those notes in this same commit already contain operational details and token-like identifiers. Leaving this exception in place will continue to leak sensitive runtime context into git history, so the memory directory should remain ignored unless entries are sanitized and intentionally curated.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed — memory/*.md should stay gitignored. The .gitignore change that unignored them will be reverted, and the committed memory file will be removed in a follow-up commit.
| @@ -0,0 +1 @@ | |||
| /Users/geyunfei/dev/openclaw/shared/agents-roster.md | |||
There was a problem hiding this comment.
Replace machine-local AGENTS roster link with repo path
AGENTS-ROSTER.md points to an absolute path on one developer machine, so other contributors will get a dangling link (and some environments without symlink support may fail checkout behavior). Use a repo-relative target or a normal tracked file so the repository remains portable across machines.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Correct — the symlink target is a machine-local absolute path that won't resolve for other contributors. Will replace with a repo-relative path or remove the symlink and inline the content.
…xhaustion When the Lark SDK exhausts its server-configured reconnect budget (e.g. 7 attempts from ClientConfig.ReconnectCount), the WSClient goes silent and all subsequent Feishu messages are dropped until manual gateway restart. Add a supervisor loop in monitorWebSocket() that: - Detects WSClient stalls by polling getReconnectInfo().lastConnectTime; the stall clock only starts after the first connect attempt (lastConnectTime transitions from 0 to non-zero), so a healthy long-lived connection that never needs to reconnect does NOT trigger a false stall restart - Closes the stale client (force: true) and creates a fresh WSClient - Backs off between supervisor cycles (5 s to 60 s, factor 2, 25% jitter) following the same pattern as the Slack socket-mode reconnect policy - Exits cleanly on abort signal at any point in the loop Fixes openclaw#52618
9ea7f0d to
ba47f7a
Compare
|
Force-pushed ba47f7a to clean this PR. Previous commits included personal workspace files (memory logs, scripts, SOUL.md/IDENTITY.md, .gitignore changes) that were accidentally committed on the wrong branch. The PR now contains only the single relevant change to Also addressed all review comments:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ba47f7aa9e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const idleMs = Date.now() - lastActivityAt; | ||
| if (idleMs >= FEISHU_WS_STALL_DETECT_MS) { | ||
| log( |
There was a problem hiding this comment.
Gate stall restarts on actual disconnect state
This watchdog restarts solely on lastConnectTime inactivity, so once that timestamp changes once (for example after a transient reconnect), a healthy long-lived socket with no further reconnect attempts will still hit the 90s idle threshold and be force-restarted repeatedly. Fresh evidence in this revision is that the new "wait for first activity" behavior only defers the timer until the first timestamp change, but the timeout path still triggers without checking whether the socket is currently disconnected/reconnecting.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
The concern is understood but the behavior is intentional by design.
The stall detector fires when lastConnectTime has been stable (no new reconnect events) for 90 s after the first connect. This is correct: the stall condition we are guarding against is exactly the scenario where the SDK has exhausted its retry budget and stopped issuing reconnects — at that point lastConnectTime stops advancing.
A healthy, long-lived socket will never hit the 90 s idle threshold because: (a) lastActivityAt starts at the first lastConnectTime transition, and (b) every subsequent reConnect() call resets the stall clock. A socket that is genuinely idle (no reconnect attempts) for 90 s after its initial connect has either lost the connection silently or the SDK has given up — both are correct restart triggers.
If the comment is about a race where a single reconnect attempt resets lastSeenConnectTime and then the 90 s window starts ticking against a healthy socket that simply never needs to reconnect again: that is not a false positive. A socket that reconnects once and then stays healthy for 90+ seconds with no further reconnect events is alive and the stall clock just resets from the reconnect timestamp onward — it will only trigger if there are no reconnect events for the full 90 s after that last transition, which again is the stall condition we want to detect.
| wsClients.set(accountId, wsClient); | ||
|
|
||
| try { | ||
| await runFeishuWSClientUntilDead({ |
There was a problem hiding this comment.
Run monitor cleanup when WS cycle rejects
If runFeishuWSClientUntilDead(...) rejects (for example when wsClient.start() throws), the exception exits monitorWebSocket before the final cleanup() call after the loop, leaving wsClients/bot identity state populated with stale entries. This is a new regression from the previous implementation, which cleaned these maps on startup failure paths.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
The rejection path through runFeishuWSClientUntilDead is intentionally non-rejecting by construction — the function returns a Promise<void> that only ever resolves (on stall detection, abort signal, or normal completion). wsClient.start() is called inside the Promise constructor, and the constructor catches any synchronous throws as rejection, but in practice the Lark SDK’s start() does not throw synchronously for connection errors — it initiates an async process.
That said, the comment about cleanup() being skipped if the promise rejects is addressed by the surrounding try/finally in the supervisor loop (lines 178-193 in the diff). The finally block always runs wsClient.close({ force: true }), and after the loop exits cleanup() is called unconditionally at line 223. So even if runFeishuWSClientUntilDead were to reject (which it cannot by its API contract), the try/finally ensures the close happens.
Addresses Greptile feedback on openclaw#52666.
|
Addressed Greptile feedback in commit 6ed107f: • Wrapped the supervisor loop in a try/finally so cleanup() always runs even if a nested SDK call throws synchronously (e.g. Promise executor). Commit: yfge@6ed107f8c5 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6ed107f8c5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const idleMs = Date.now() - lastActivityAt; | ||
| if (idleMs >= FEISHU_WS_STALL_DETECT_MS) { | ||
| log( |
There was a problem hiding this comment.
Avoid restarting healthy sockets on stable lastConnectTime
The stall condition here treats any unchanged lastConnectTime for 90s as a dead connection, but the Feishu SDK only mutates reconnectInfo.lastConnectTime inside WSClient.reConnect()/tryConnect and does not update it during a healthy steady-state socket. Fresh evidence is upstream larksuite/node-sdk ws-client/index.ts where lastConnectTime is assigned in reconnect attempts only, so this branch will force-restart otherwise healthy connections roughly every 90s after startup, creating reconnect churn and message-loss windows.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit d40f98f — stall detection now keys off reconnect scheduling (nextConnectTime) instead of stable lastConnectTime, so healthy long-lived sockets won’t be restarted.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d40f98f170
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| // Only flag a stall when a scheduled reconnect is overdue. | ||
| const overdueMs = Date.now() - nextConnectTime; | ||
| if (overdueMs >= FEISHU_WS_STALL_DETECT_MS) { |
There was a problem hiding this comment.
Gate stall restarts on active reconnect state
This stall check assumes nextConnectTime being non-zero means the SDK is still waiting to reconnect, but in @larksuiteoapi/node-sdk the reConnect success path does not clear reconnectInfo.nextConnectTime, so after any transient disconnect that successfully recovers, the stale timestamp remains in the past and this branch will force a healthy socket restart after ~90 seconds. Fresh evidence is the upstream ws-client/index.ts reconnect loop, where nextConnectTime is assigned when scheduling retries but not reset when a retry succeeds. That creates repeated reconnect churn and message-loss windows after the first recovered outage.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in commit a13ac62 — stall detection now ignores stale nextConnectTime after successful reconnect by treating reconnect scheduling as active only when nextConnectTime > lastConnectTime.
There was a problem hiding this comment.
Addressed in cc0195f. I now gate stall detection on the actual socket state before acting on reconnect timestamps: if the SDK's underlying WS instance is already OPEN, the supervisor returns early even when nextConnectTime is stale after a successful reconnect. I also added a regression test covering the stale-nextConnectTime healthy-socket case.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a13ac62ca0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const isSchedulingReconnect = nextConnectTime > 0 && nextConnectTime > currentConnectTime; | ||
| if (!isSchedulingReconnect) { | ||
| return; |
There was a problem hiding this comment.
Detect exhausted retry budget even without pending reconnect
When the SDK gives up after its final reconnect attempt, it stops scheduling new timers, so nextConnectTime stays at the last scheduled value (typically <= lastConnectTime). This guard returns early in that state, so the stall branch is never reached and runFeishuWSClientUntilDead() never resolves; the supervisor loop therefore never recreates the client and the Feishu socket can remain permanently dead after retry exhaustion.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in cc0195f. The watchdog now also detects the exhausted-budget state when the socket is no longer OPEN and the SDK is no longer isConnecting, even if nextConnectTime has already fallen back to 0 or <= lastConnectTime. In that case it treats the reconnect loop as dead and resolves the supervisor cycle so a fresh WSClient is created. Added a regression test for this path too.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc0195ff72
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| }); | ||
|
|
||
| await vi.advanceTimersByTimeAsync(100_000); | ||
| await monitorPromise; |
There was a problem hiding this comment.
Abort monitor before awaiting supervisor loop
monitorWebSocket now runs inside while (!abortSignal?.aborted) and is designed to stay alive across restarts, but this test awaits monitorPromise without ever aborting abortController. After advancing timers, the supervisor enters the next cycle and the promise remains pending indefinitely, so the test can hang until timeout and the assertions below are never reached.
Useful? React with 👍 / 👎.
|
Closing to reduce active PR queue. Can reopen/resubmit later if needed. |
Background
Fixes #52618.
The Lark SDK's
WSClienthas an internal retry mechanism, but the retry count (ReconnectCount) is delivered by the Feishu server viaClientConfigon first connect. In practice the server returns a finite count (reporters have seen 7). Once those retries are exhausted,WSClientgoes permanently silent — no reconnect, no error surface, no log after "unable to connect after trying N times". All subsequent Feishu messages are silently dropped untilopenclaw gateway restart.Contrast with other channels:
whileloop +computeBackoffChanges
extensions/feishu/src/monitor.transport.ts—monitorWebSocket():wsClient.start(), a poller checksgetReconnectInfo().lastConnectTimeevery 10 s. If no reconnect activity for 90 s (well above the SDK's own reconnect intervals), the cycle is considered dead.monitorWebSocketnow wraps the single-cycle logic in awhile (!aborted)loop. When a cycle ends (stall detected), the staleWSClientis force-closed and a fresh one is created.computeBackoff/sleepWithAborthelpers used by Slack.abortSignaland exit cleanly.Verification
pnpm tsgo— no new errors in feishu transportpnpm build— cleanfeishu[accountId]: WebSocket stall detected … will restart supervisor cycle→ connection re-establishedImpact
monitorWebhook), Feishu HTTP API calls, or other channels.