
fix(feishu): reconcile WebSocket reconnect backoff #73945

Closed

openclaw-clownfish[bot] wants to merge 2 commits into main from clownfish/ghcrawl-156816-autonomous-smoke


Conversation

@openclaw-clownfish
Contributor

Summary

Repair the Feishu WebSocket reconnect path from #55619 without process-wide Lark.WSClient prototype mutation. Rebase the contributor branch onto current main, verify the pinned Lark SDK start/onReady/onError/reConnect behavior, and keep the fix inside the Feishu app-layer monitor/client code.

Credit

Carries forward @sirfengyu's reconnect/backoff investigation from #55619. Preserve existing heartbeat attribution from #45674 if that path is touched.

Review blockers to address

Validation

  • pnpm test:serial extensions/feishu/src/async.test.ts extensions/feishu/src/monitor.cleanup.test.ts extensions/feishu/src/client.test.ts
  • pnpm check:changed

Fixes #55532. Related to #42354.

Project Clownfish replacement details:

@openclaw-clownfish openclaw-clownfish Bot added the clawsweeper Tracked by ClawSweeper automation label Apr 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added channel: feishu Channel integration: feishu size: L r: too-many-prs Auto-close: author has more than twenty active PRs. labels Apr 29, 2026
@openclaw-barnacle

Closing this PR because the author has more than 10 active PRs in this repo. Please close some of your own PRs to get back under the limit, then reopen or resubmit.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR replaces the previous global Lark.WSClient prototype mutation with a per-instance ManagedFeishuWebSocketClient wrapper, separates the startup-retry and reconnect-backoff counters (1 s→30 s vs 120 s→15 min), resets both counters after a successful start(), and guards against pong frames that lack a PingInterval at the instance level. The previously flagged concerns remain the residual risks: a double-close on the reconnect handoff path, unreachable cleanup after the terminal-error throw, and recoveringFromDisconnect forcing reconnect backoff on factory-level start failures.
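The per-instance pong guard described above can be illustrated with a minimal sketch; the payload shape, default interval, and units are assumptions for illustration, not the plugin's actual types:

```typescript
// Hypothetical per-instance guard: keep the last known ping interval when a
// Pong payload arrives without a usable PingInterval, instead of clobbering
// the schedule with undefined/NaN. Default and units are assumptions.
class PingIntervalGuard {
  private intervalMs: number;

  constructor(defaultMs = 30_000) {
    this.intervalMs = defaultMs;
  }

  // Called once per pong frame; returns the interval to schedule with.
  onPong(payload: { PingInterval?: number }): number {
    const v = payload.PingInterval;
    if (typeof v === "number" && Number.isFinite(v) && v > 0) {
      this.intervalMs = v;
    }
    return this.intervalMs;
  }
}
```

Because the guard lives on the instance rather than the prototype, clients for different accounts no longer share mutated state.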

Confidence Score: 5/5

Safe to merge; no new P0/P1 findings beyond those already in open review threads.

All identified concerns are P2 or were already raised in prior review threads. The core logic — managed client wrapper, separated backoff counters, reset after successful start, per-instance PingInterval guard — is sound. raceWithTimeoutAndAbort propagates start() rejections correctly via Promise.race semantics, and waitForTerminalError only ever resolves, so status-handling paths are correct.

extensions/feishu/src/client.ts (scheduleReconnectClose / double-close interaction) and extensions/feishu/src/monitor.transport.ts (recoveringFromDisconnect classification for factory-level failures) — both already tracked in open threads.

Reviews (3): Last reviewed commit: "fix(feishu): reconcile WebSocket reconne..."

Comment thread extensions/feishu/src/monitor.transport.ts
Comment on lines +220 to +224
const isReconnectFailure =
recoveringFromDisconnect || isFeishuWebSocketReconnectRequiredError(err);
if (isReconnectFailure) {
reconnectAttempt += 1;
recoveringFromDisconnect = true;

P2 Start-failure during reconnect recovery uses full reconnect backoff

When recoveringFromDisconnect is true and createFeishuWSClient(account) itself throws (e.g., transient DNS failure, auth rejection), the failure is classified as a reconnect failure (isReconnectFailure = true) and uses getFeishuWsReconnectDelayMs (starting at 120 s) instead of the startup retry delay (starting at 1 s). If the intent is to stay aggressive only for SDK-initiated reconnects and apply normal start-retry delays for factory-level failures, consider checking isFeishuWebSocketReconnectRequiredError(err) solely rather than short-circuiting on recoveringFromDisconnect.

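A hedged sketch of the change this thread suggests: classify a failure by the error itself instead of short-circuiting on the sticky recoveringFromDisconnect flag, so factory-level start failures fall back to the short startup retry delay. The delay ranges come from this PR's description; the exact exponential curves, helper shapes, and names are assumptions:

```typescript
// Two independent counters, as described in this PR: one drives the
// 1 s -> 30 s startup retry delay, the other the 120 s -> 15 min
// reconnect delay. Both would be reset after a successful start().
type BackoffState = {
  startAttempt: number;
  reconnectAttempt: number;
};

function classifyFailureDelayMs(
  err: unknown,
  state: BackoffState,
  isReconnectRequired: (e: unknown) => boolean, // stand-in for isFeishuWebSocketReconnectRequiredError
): number {
  // Key change vs. the flagged code: look only at the error itself,
  // not at a sticky recoveringFromDisconnect flag.
  if (isReconnectRequired(err)) {
    state.reconnectAttempt += 1;
    return Math.min(120_000 * 2 ** (state.reconnectAttempt - 1), 15 * 60_000);
  }
  state.startAttempt += 1;
  return Math.min(1_000 * 2 ** (state.startAttempt - 1), 30_000);
}
```

With this shape, a transient DNS or auth failure during recovery is retried on the aggressive startup curve, while only SDK-signaled reconnect errors pay the long reconnect delay.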

@vincentkoc vincentkoc reopened this Apr 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added r: too-many-prs Auto-close: author has more than twenty active PRs. and removed r: too-many-prs Auto-close: author has more than twenty active PRs. labels Apr 29, 2026
@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Thanks for the context here. I swept through the related work; this PR is now a duplicate or has been superseded.

Close as superseded. The same Feishu reconnect/backoff repair is now tracked by #73998, which explicitly carries forward this branch and its source work while addressing the review blockers; current main still lacks the complete repair, so the remaining work should continue on the newer PR rather than this older branch.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Close this older PR and review the newer #73998 branch as the canonical Feishu reconnect/backoff repair, preserving the contributor credit from #55619 and the heartbeat attribution while keeping #55532 open until a replacement lands.

Do we have a high-confidence way to reproduce the issue?

Yes. Source-level reproduction is high-confidence: current main waits only for abort after wsClient.start() resolves, while the pinned SDK reports reconnect/error lifecycle through callbacks and Pong control handling, so the reported reconnect/backoff gap can be verified from code without live Feishu credentials.

Is this the best way to solve the issue?

No. This branch is no longer the best fix path because #73998 supersedes it with a narrower replacement plan that explicitly carries forward this work and addresses this branch's residual review blockers.

Security review:

Security review cleared: The diff is limited to Feishu plugin runtime, tests, and changelog; it does not change dependencies, workflows, package resolution, permissions, or secret handling surfaces.

What I checked:

  • current-main Feishu WS client: Current main still constructs a raw Lark.WSClient with wsConfig in createFeishuWSClient; it does not contain this PR's managed wrapper or terminal-error API. (extensions/feishu/src/client.ts:227, 7969f1f07ccc)
  • current-main Feishu monitor: Current main retries only wsClient.start() failures with a 1s-to-30s backoff, then waits only for abort after a successful start; it does not supervise SDK reconnect callbacks. (extensions/feishu/src/monitor.transport.ts:173, 7969f1f07ccc)
  • Lark SDK contract: The pinned @larksuiteoapi/node-sdk@1.62.0 types expose autoReconnect, onReady, onError, onReconnecting, onReconnected, getReconnectInfo, and private reconnect/control handling, matching the dependency surface being repaired.
  • Lark SDK reconnect source: The SDK source invokes onReconnecting after an established connection drops, invokes onError when reconnect is exhausted, and reads PingInterval from Pong payloads while updating reconnect timing.
  • superseding PR: Open PR fix(feishu): reconcile WebSocket reconnect backoff #73998 targets the same Feishu reconnect/backoff and PingInterval repair, explicitly carries forward this PR and fix(feishu): exponential backoff + PingInterval guard for WS reconnect #55619, and includes the residual cleanup topics from this branch as review blockers. (255e1262454d)
  • PR residual defect: This PR classifies any later startup/factory failure as reconnect recovery once recoveringFromDisconnect is true, so ordinary start failures after a disconnect use the long reconnect delay instead of the startup retry policy. (extensions/feishu/src/monitor.transport.ts:220, 5a4282896384)
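The supervision gap noted in the first two bullets can be sketched as follows. Instead of waiting only for abort after start() resolves, a managed wrapper races the abort signal against a promise that settles when the SDK reports a terminal error; the interface and names here are illustrative assumptions, not the plugin's actual API:

```typescript
// Minimal stand-in for a managed WS client: start() plus a terminal-error
// callback (the real SDK surfaces this through onError after reconnects
// are exhausted).
interface WsLifecycle {
  start(): Promise<void>;
  onTerminalError(cb: (err: Error) => void): void;
}

async function superviseClient(
  client: WsLifecycle,
  abort: AbortSignal,
): Promise<"aborted" | "terminal-error"> {
  await client.start();

  const aborted = new Promise<"aborted">((resolve) => {
    if (abort.aborted) return resolve("aborted");
    abort.addEventListener("abort", () => resolve("aborted"), { once: true });
  });

  // Mirrors the waitForTerminalError behavior described above: this
  // promise only ever resolves, never rejects, so Promise.race cleanly
  // yields a plain status value on either path.
  const failed = new Promise<"terminal-error">((resolve) => {
    client.onTerminalError(() => resolve("terminal-error"));
  });

  return Promise.race([aborted, failed]);
}
```

Current main effectively implements only the `aborted` branch, which is why SDK-side reconnect exhaustion goes unobserved there.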

Likely related people:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7969f1f07ccc.

@clawsweeper clawsweeper Bot closed this Apr 29, 2026
@vincentkoc vincentkoc reopened this Apr 29, 2026
@vincentkoc vincentkoc force-pushed the clownfish/ghcrawl-156816-autonomous-smoke branch 3 times, most recently from 871970c to bc33d7f Compare April 29, 2026 05:53
@vincentkoc vincentkoc force-pushed the clownfish/ghcrawl-156816-autonomous-smoke branch from bc33d7f to 5a42828 Compare April 29, 2026 06:08
@clawsweeper clawsweeper Bot closed this Apr 30, 2026

Labels

channel: feishu Channel integration: feishu clawsweeper Tracked by ClawSweeper automation size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feishu WebSocket: No exponential backoff on reconnect

1 participant