
fix(feishu): reconcile WebSocket reconnect backoff#73998

Open
openclaw-clownfish[bot] wants to merge 1 commit into main from clownfish/ghcrawl-207048-agentic-merge

Conversation

@openclaw-clownfish
Contributor

Summary

Repair the remaining Feishu WebSocket reconnect/backoff and PingInterval guard path for #55532 without process-wide Lark.WSClient prototype mutation. Keep the patch narrow to the Feishu client/monitor surfaces, building on current main after #72411.

Credit

Carries forward @sirfengyu's investigation from #55619 and the closed ProjectClownfish replacement attempt in #73945. Preserve @alex-xuweilong's heartbeat attribution from #45674 if that constructor path is touched.

Review blockers to address

  • Replace global WSClient prototype mutation with explicit Feishu-owned wrapper/composition behavior.
  • Reset reconnect backoff after successful recovery.
  • Cancel pending reconnect waits when the monitor/client closes.
  • Avoid stacking custom backoff on top of the SDK reconnect delay.
  • Remove the unreachable cleanup/dead-code paths from fix(feishu): reconcile WebSocket reconnect backoff (#73945) if reusing that implementation.
  • Decide how to handle factory/startup failures during reconnect recovery with a narrow, test-backed delay policy.
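The backoff behavior described in the second and fourth blockers can be sketched as a small Feishu-owned helper. This is illustrative only; the class and method names below are not taken from the PR diff:

```typescript
// Illustrative sketch: exponential reconnect backoff that the monitor resets
// from the SDK lifecycle hooks (onReady/onReconnected) after recovery.
class FeishuReconnectBackoff {
  private attempt = 0;

  constructor(
    private readonly baseMs = 1_000,
    private readonly maxMs = 60_000,
  ) {}

  // Delay doubles per failed attempt and is capped at maxMs, so the custom
  // backoff never stacks unbounded delay on top of the SDK's own retry wait.
  nextDelayMs(): number {
    const delay = Math.min(this.baseMs * 2 ** this.attempt, this.maxMs);
    this.attempt += 1;
    return delay;
  }

  // Called on successful recovery so the next outage starts from baseMs.
  reset(): void {
    this.attempt = 0;
  }
}

const backoff = new FeishuReconnectBackoff();
console.log(backoff.nextDelayMs()); // 1000
console.log(backoff.nextDelayMs()); // 2000
backoff.reset();
console.log(backoff.nextDelayMs()); // 1000
```

Cancelling pending reconnect waits on close (the third blocker) would pair a counter like this with an abortable sleep rather than a bare setTimeout.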

Validation

  • pnpm test:serial extensions/feishu/src/client.test.ts extensions/feishu/src/monitor.cleanup.test.ts
  • pnpm check:changed

Fixes #55532. Related to #42354. Security-routed refs #46472, #68865, and #72411 are not modified by this PR.

ProjectClownfish replacement details:

openclaw-clownfish Bot added the clawsweeper (Tracked by ClawSweeper automation) label on Apr 29, 2026
openclaw-barnacle Bot added the channel: feishu (Channel integration: feishu), size: M, and r: too-many-prs (Auto-close: author has more than twenty active PRs) labels, and removed the r: too-many-prs label, on Apr 29, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR replaces the Feishu WebSocket client's static wsConfig constructor override and prototype-mutation approach with SDK-native lifecycle hooks (onReady, onError, onReconnecting, onReconnected) and a sanitizing httpInstance that guards invalid server-returned ClientConfig fields (PingInterval, ReconnectCount, ReconnectInterval, ReconnectNonce). The monitor backoff counter is reset on onReady/onReconnected, and onError (SDK reconnect exhaustion) triggers a clean monitor-level retry without stacking backoff on top of the SDK's own reconnect delay.

Confidence Score: 4/5

Safe to merge with one behavior change to verify: PingTimeout: 3 from the old static wsConfig is no longer set anywhere.

All changes are P2. The only noteworthy gap is that the old PingTimeout: 3 constructor-level guard is dropped and not replicated — if the Lark SDK default is much larger, dead-connection detection may be slower. The rest of the implementation (backoff reset, error-triggered retry, ClientConfig sanitization) is logically sound and well-tested.

extensions/feishu/src/client.ts — confirm whether PingTimeout needs to be set explicitly on the WSClient constructor alongside the new httpInstance approach.
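The sanitizing step the summary describes could look roughly like the sketch below. The interface shape and function name are assumptions for illustration and may not match the PR's actual sanitizeFeishuWsEndpointResponse:

```typescript
// Illustrative sketch: replace absent or invalid server-returned ClientConfig
// fields with safe defaults before handing them to the Lark SDK.
interface FeishuWsClientConfig {
  PingInterval?: number;
  ReconnectCount?: number;
  ReconnectInterval?: number;
  ReconnectNonce?: number;
}

const DEFAULTS: Required<FeishuWsClientConfig> = {
  PingInterval: 30,
  ReconnectCount: -1, // -1 means "retry forever" in the SDK's contract
  ReconnectInterval: 120,
  ReconnectNonce: 30,
};

function sanitizeClientConfig(
  cfg: FeishuWsClientConfig,
): Required<FeishuWsClientConfig> {
  // ReconnectCount may legitimately be negative, so it gets a looser check.
  const valid = (v: number | undefined, allowNonPositive = false) =>
    typeof v === "number" && Number.isFinite(v) && (allowNonPositive || v > 0);
  return {
    PingInterval: valid(cfg.PingInterval) ? cfg.PingInterval! : DEFAULTS.PingInterval,
    ReconnectCount: valid(cfg.ReconnectCount, true) ? cfg.ReconnectCount! : DEFAULTS.ReconnectCount,
    ReconnectInterval: valid(cfg.ReconnectInterval) ? cfg.ReconnectInterval! : DEFAULTS.ReconnectInterval,
    ReconnectNonce: valid(cfg.ReconnectNonce) ? cfg.ReconnectNonce! : DEFAULTS.ReconnectNonce,
  };
}
```

An empty or malformed response thus always yields a complete config, which is what keeps the SDK's pullConnectConfig from failing on a missing PingInterval.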



Comment on lines +18 to 23
const FEISHU_WS_CLIENT_CONFIG_DEFAULTS = {
  PingInterval: 30,
  PingTimeout: 3,
  ReconnectCount: -1,
  ReconnectInterval: 120,
  ReconnectNonce: 30,
} as const;
Contributor


P2 PingTimeout guard no longer applied

The old FEISHU_WS_CONFIG included PingTimeout: 3, passed as a constructor-level wsConfig override that limited how long the SDK would wait for a pong before treating the connection as dead. The new FEISHU_WS_CLIENT_CONFIG_DEFAULTS and sanitizeFeishuWsEndpointResponse only sanitize server-returned fields (PingInterval, ReconnectCount, ReconnectInterval, ReconnectNonce). PingTimeout is a static SDK constructor option and does not come from the API response, so it is silently dropped. If the SDK's default PingTimeout is much larger than 3 s, a dead connection may stall for longer before the heartbeat timeout triggers — which was presumably the original motivation for the 3 s override.


Comment on lines +163 to 175
const abortResult = new Promise<FeishuWsStopReason>((resolve) => {
  const handleAbort = () => {
    abortSignal.removeEventListener("abort", handleAbort);
    resolve({ status: "abort" });
  };
  removeAbortListener = () => {
    abortSignal.removeEventListener("abort", handleAbort);
  };
  abortSignal.addEventListener("abort", handleAbort, { once: true });
  if (abortSignal.aborted) {
    handleAbort();
  }
});
Contributor


P2 handleAbort redundantly calls removeEventListener when registered with { once: true }

The listener is added with { once: true } (line 171), which auto-removes it on first fire. handleAbort then also calls removeEventListener manually (line 165), making that call a no-op. Both branches of cleanup (handleAbort itself and removeAbortListener in .finally()) attempt the same removal. The code is correct but the manual call inside handleAbort is dead. Consider removing it to avoid confusion about which mechanism actually deregisters the listener.

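Following the reviewer's suggestion, the abort wiring can lean entirely on { once: true } for auto-removal, keeping only the external remover for the not-yet-fired case. This is a sketch, with the FeishuWsStopReason shape assumed from the quoted snippet:

```typescript
// Sketch: { once: true } deregisters the listener when it fires, so
// handleAbort no longer needs a manual removeEventListener call.
type FeishuWsStopReason = { status: "abort" };

function watchAbort(abortSignal: AbortSignal): {
  abortResult: Promise<FeishuWsStopReason>;
  removeAbortListener: () => void;
} {
  let removeAbortListener = () => {};
  const abortResult = new Promise<FeishuWsStopReason>((resolve) => {
    const handleAbort = () => resolve({ status: "abort" });
    // Used by the caller's .finally() when the signal never fires.
    removeAbortListener = () =>
      abortSignal.removeEventListener("abort", handleAbort);
    abortSignal.addEventListener("abort", handleAbort, { once: true });
    // addEventListener does not fire retroactively, so handle the
    // already-aborted case explicitly.
    if (abortSignal.aborted) handleAbort();
  });
  return { abortResult, removeAbortListener };
}
```

With this shape exactly one mechanism removes the listener on each path: the once option when the signal fires, and removeAbortListener when it does not.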

@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs changes before merge.

Summary
The PR replaces Feishu WSClient constructor wsConfig handling with a guarded httpInstance and lifecycle-hook retry handling, adds Feishu tests, and adds a changelog entry.

Next step before merge
This is a narrow, concrete repair on an existing ClawSweeper PR branch.

Security
Security review cleared: the diff is limited to Feishu channel source/tests and a changelog entry, with no concrete security or supply-chain regression found.

Review findings

  • [P2] Handle missing WS ClientConfig — extensions/feishu/src/client.ts:126-127
  • [P2] Preserve bot identity across retry cycles — extensions/feishu/src/monitor.transport.ts:227
Review details

Best possible solution:

Keep the Feishu-owned SDK lifecycle/httpInstance composition, preserve the current-main supervisor cleanup contract, and make missing or non-OK WS endpoint config deterministically reach a finite retry path with focused tests.

Do we have a high-confidence way to reproduce the issue?

Yes. Mock the WS endpoint response as code: 0 with missing ClientConfig, or as a system-busy non-OK response, and verify the SDK returns false without reaching the PR's monitor onError retry path; separately trigger SDK onError and assert bot identity maps survive the retry cycle.

Is this the best way to solve the issue?

No. The direction is maintainable, but the current branch is incomplete; the safer fix is a rebased Feishu-local recovery policy that handles malformed endpoint responses while carrying forward current-main identity-preserving cleanup.

Full review comments:

  • [P2] Handle missing WS ClientConfig — extensions/feishu/src/client.ts:126-127
    The sanitizer returns the endpoint response unchanged when data.ClientConfig is absent. In the pinned Lark SDK, pullConnectConfig then catches the later ClientConfig.PingInterval failure and returns false; with the default infinite reconnect count, the monitor onError recovery path may never fire. Route missing ClientConfig and non-OK endpoint responses into a finite Feishu recovery path and cover it in tests.
    Confidence: 0.88
  • [P2] Preserve bot identity across retry cycles — extensions/feishu/src/monitor.transport.ts:227
    After waitForFeishuWsStop returns an SDK error, this cleanup deletes botOpenIds and botNames before the retry creates a replacement WSClient. Current main deliberately keeps those maps during supervisor recovery and clears them only on true shutdown, so carry that behavior forward and test it.
    Confidence: 0.8

Overall correctness: patch is incorrect
Overall confidence: 0.86

Acceptance criteria:

  • pnpm test:serial extensions/feishu/src/client.test.ts extensions/feishu/src/monitor.cleanup.test.ts
  • pnpm check:changed

What I checked:

  • PR sanitizer gap: At PR head, sanitizeFeishuWsEndpointResponse returns responses unchanged when data.ClientConfig is missing, so the branch does not force malformed endpoint responses into a bounded Feishu-owned recovery path. (extensions/feishu/src/client.ts:126, 255e1262454d)
  • Dependency reconnect contract: @larksuiteoapi/node-sdk@1.62.1 catches pullConnectConfig failures and returns false; onError is only invoked for disabled auto-reconnect or when a non-negative reconnect count is exhausted, while the SDK default reconnect count is -1.
  • Current-main retry identity behavior: Current main preserves botOpenIds and botNames during supervisor retry cleanup with clearIdentity: false, clearing identity only on true monitor shutdown paths. (extensions/feishu/src/monitor.transport.ts:242, f7549079ceb3)
  • PR cleanup regression path: At PR head, cleanup always deletes Feishu bot identity maps, and the SDK-error stop path calls that cleanup before retrying the replacement WSClient. (extensions/feishu/src/monitor.transport.ts:227, 255e1262454d)
  • Security scope: The public PR diff is limited to CHANGELOG.md and Feishu source/tests; it does not touch workflows, dependency manifests, lockfiles, install/build/release scripts, generated/vendor files, or permissions. (255e1262454d)

Likely related people:

Remaining risk / open question:

  • The branch is based before the current-main recovery cleanup from fix(feishu): recover WebSocket after SDK retry exhaustion (#73739) and needs a careful rebase so it does not drop current identity-preservation semantics.
  • No local tests were run in this read-only review; the findings are from source, PR-diff, dependency-source, and history inspection.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f7549079ceb3.


Labels

channel: feishu (Channel integration: feishu) · clawsweeper (Tracked by ClawSweeper automation) · size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feishu WebSocket: No exponential backoff on reconnect

0 participants