
fix(feishu): reconcile WebSocket reconnect backoff #73945

Closed

openclaw-clownfish[bot] wants to merge 2 commits into main from clownfish/ghcrawl-156816-autonomous-smoke


Conversation

@openclaw-clownfish
Contributor

Summary

Repair the Feishu WebSocket reconnect path from #55619 without process-wide Lark.WSClient prototype mutation. Rebase the contributor branch onto current main, verify the pinned Lark SDK start/onReady/onError/reConnect behavior, and keep the fix inside the Feishu app-layer monitor/client code.

Credit

Carries forward @sirfengyu's reconnect/backoff investigation from #55619. Preserve existing heartbeat attribution from #45674 if that path is touched.

Review blockers to address

Validation

  • pnpm test:serial extensions/feishu/src/async.test.ts extensions/feishu/src/monitor.cleanup.test.ts extensions/feishu/src/client.test.ts
  • pnpm check:changed

Fixes #55532. Related to #42354.

Project Clownfish replacement details:

@openclaw-clownfish openclaw-clownfish Bot added the clawsweeper Tracked by ClawSweeper automation label Apr 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added channel: feishu Channel integration: feishu size: L r: too-many-prs Auto-close: author has more than twenty active PRs. labels Apr 29, 2026
@openclaw-barnacle

Closing this PR because the author has more than 10 active PRs in this repo. Please close some of your own PRs to get back under the limit, then reopen or resubmit.

@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR replaces the previous global Lark.WSClient prototype mutation with a per-instance ManagedFeishuWebSocketClient wrapper, separates the startup-retry and reconnect-backoff counters (1 s→30 s vs 120 s→15 min), resets both counters after a successful start(), and guards against pong frames that lack a PingInterval at the instance level. The previously flagged concerns remain the residual risks: a double-close on the reconnect handoff path, unreachable cleanup after the terminal-error throw, and recoveringFromDisconnect forcing reconnect backoff on factory-level start failures.
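The per-instance pong guard described above can be illustrated with a minimal sketch; the payload shape, default interval, and units are assumptions for illustration, not the plugin's actual types:

```typescript
// Hypothetical per-instance guard: keep the last known ping interval when a
// Pong payload arrives without a usable PingInterval, instead of clobbering
// the schedule with undefined/NaN. Default and units are assumptions.
class PingIntervalGuard {
  private intervalMs: number;

  constructor(defaultMs = 30_000) {
    this.intervalMs = defaultMs;
  }

  // Called once per pong frame; returns the interval to schedule with.
  onPong(payload: { PingInterval?: number }): number {
    const v = payload.PingInterval;
    if (typeof v === "number" && Number.isFinite(v) && v > 0) {
      this.intervalMs = v;
    }
    return this.intervalMs;
  }
}
```

Because the guard lives on the instance rather than the prototype, clients for different accounts no longer share mutated state.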

Confidence Score: 5/5

Safe to merge; no new P0/P1 findings beyond those already in open review threads.

All identified concerns are P2 or were already raised in prior review threads. The core logic — managed client wrapper, separated backoff counters, reset after successful start, per-instance PingInterval guard — is sound. raceWithTimeoutAndAbort propagates start() rejections correctly via Promise.race semantics, and waitForTerminalError only ever resolves, so status-handling paths are correct.

extensions/feishu/src/client.ts (scheduleReconnectClose / double-close interaction) and extensions/feishu/src/monitor.transport.ts (recoveringFromDisconnect classification for factory-level failures) — both already tracked in open threads.

Reviews (3): Last reviewed commit: "fix(feishu): reconcile WebSocket reconne..."

Comment thread extensions/feishu/src/monitor.transport.ts
Comment on lines +220 to +224
const isReconnectFailure =
recoveringFromDisconnect || isFeishuWebSocketReconnectRequiredError(err);
if (isReconnectFailure) {
reconnectAttempt += 1;
recoveringFromDisconnect = true;

P2 Start-failure during reconnect recovery uses full reconnect backoff

When recoveringFromDisconnect is true and createFeishuWSClient(account) itself throws (e.g., transient DNS failure, auth rejection), the failure is classified as a reconnect failure (isReconnectFailure = true) and uses getFeishuWsReconnectDelayMs (starting at 120 s) instead of the startup retry delay (starting at 1 s). If the intent is to stay aggressive only for SDK-initiated reconnects and apply normal start-retry delays for factory-level failures, consider checking isFeishuWebSocketReconnectRequiredError(err) solely rather than short-circuiting on recoveringFromDisconnect.

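A hedged sketch of the change this thread suggests: classify a failure by the error itself instead of short-circuiting on the sticky recoveringFromDisconnect flag, so factory-level start failures fall back to the short startup retry delay. The delay ranges come from this PR's description; the exact exponential curves, helper shapes, and names are assumptions:

```typescript
// Two independent counters, as described in this PR: one drives the
// 1 s -> 30 s startup retry delay, the other the 120 s -> 15 min
// reconnect delay. Both would be reset after a successful start().
type BackoffState = {
  startAttempt: number;
  reconnectAttempt: number;
};

function classifyFailureDelayMs(
  err: unknown,
  state: BackoffState,
  isReconnectRequired: (e: unknown) => boolean, // stand-in for isFeishuWebSocketReconnectRequiredError
): number {
  // Key change vs. the flagged code: look only at the error itself,
  // not at a sticky recoveringFromDisconnect flag.
  if (isReconnectRequired(err)) {
    state.reconnectAttempt += 1;
    return Math.min(120_000 * 2 ** (state.reconnectAttempt - 1), 15 * 60_000);
  }
  state.startAttempt += 1;
  return Math.min(1_000 * 2 ** (state.startAttempt - 1), 30_000);
}
```

With this shape, a transient DNS or auth failure during recovery is retried on the aggressive startup curve, while only SDK-signaled reconnect errors pay the long reconnect delay.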

@vincentkoc vincentkoc reopened this Apr 29, 2026
@openclaw-barnacle openclaw-barnacle Bot added r: too-many-prs Auto-close: author has more than twenty active PRs. and removed r: too-many-prs Auto-close: author has more than twenty active PRs. labels Apr 29, 2026
@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Thanks for the context here. I swept through the related work; this PR is now a duplicate or has been superseded.

Close as superseded. The same Feishu reconnect/backoff repair is now tracked by #73998, which explicitly carries forward this branch and its source work while addressing the review blockers; current main still lacks the complete repair, so the remaining work should continue on the newer PR rather than this older branch.

So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution:

Close this older PR and review the newer #73998 branch as the canonical Feishu reconnect/backoff repair, preserving the contributor credit from #55619 and the heartbeat attribution while keeping #55532 open until a replacement lands.

Do we have a high-confidence way to reproduce the issue?

Yes. Source-level reproduction is high-confidence: current main waits only for abort after wsClient.start() resolves, while the pinned SDK reports reconnect/error lifecycle through callbacks and Pong control handling, so the reported reconnect/backoff gap can be verified from code without live Feishu credentials.

Is this the best way to solve the issue?

No. This branch is no longer the best fix path because #73998 supersedes it with a narrower replacement plan that explicitly carries forward this work and addresses this branch's residual review blockers.

Security review:

Security review cleared: The diff is limited to Feishu plugin runtime, tests, and changelog; it does not change dependencies, workflows, package resolution, permissions, or secret handling surfaces.

What I checked:

  • current-main Feishu WS client: Current main still constructs a raw Lark.WSClient with wsConfig in createFeishuWSClient; it does not contain this PR's managed wrapper or terminal-error API. (extensions/feishu/src/client.ts:227, 7969f1f07ccc)
  • current-main Feishu monitor: Current main retries only wsClient.start() failures with a 1s-to-30s backoff, then waits only for abort after a successful start; it does not supervise SDK reconnect callbacks. (extensions/feishu/src/monitor.transport.ts:173, 7969f1f07ccc)
  • Lark SDK contract: The pinned @larksuiteoapi/node-sdk@1.62.0 types expose autoReconnect, onReady, onError, onReconnecting, onReconnected, getReconnectInfo, and private reconnect/control handling, matching the dependency surface being repaired.
  • Lark SDK reconnect source: The SDK source invokes onReconnecting after an established connection drops, invokes onError when reconnect is exhausted, and reads PingInterval from Pong payloads while updating reconnect timing.
  • superseding PR: Open PR fix(feishu): reconcile WebSocket reconnect backoff #73998 targets the same Feishu reconnect/backoff and PingInterval repair, explicitly carries forward this PR and fix(feishu): exponential backoff + PingInterval guard for WS reconnect #55619, and includes the residual cleanup topics from this branch as review blockers. (255e1262454d)
  • PR residual defect: This PR classifies any later startup/factory failure as reconnect recovery once recoveringFromDisconnect is true, so ordinary start failures after a disconnect use the long reconnect delay instead of the startup retry policy. (extensions/feishu/src/monitor.transport.ts:220, 5a4282896384)
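The supervision gap noted in the first two bullets can be sketched as follows. Instead of waiting only for abort after start() resolves, a managed wrapper races the abort signal against a promise that settles when the SDK reports a terminal error; the interface and names here are illustrative assumptions, not the plugin's actual API:

```typescript
// Minimal stand-in for a managed WS client: start() plus a terminal-error
// callback (the real SDK surfaces this through onError after reconnects
// are exhausted).
interface WsLifecycle {
  start(): Promise<void>;
  onTerminalError(cb: (err: Error) => void): void;
}

async function superviseClient(
  client: WsLifecycle,
  abort: AbortSignal,
): Promise<"aborted" | "terminal-error"> {
  await client.start();

  const aborted = new Promise<"aborted">((resolve) => {
    if (abort.aborted) return resolve("aborted");
    abort.addEventListener("abort", () => resolve("aborted"), { once: true });
  });

  // Mirrors the waitForTerminalError behavior described above: this
  // promise only ever resolves, never rejects, so Promise.race cleanly
  // yields a plain status value on either path.
  const failed = new Promise<"terminal-error">((resolve) => {
    client.onTerminalError(() => resolve("terminal-error"));
  });

  return Promise.race([aborted, failed]);
}
```

Current main effectively implements only the `aborted` branch, which is why SDK-side reconnect exhaustion goes unobserved there.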

Likely related people:

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7969f1f07ccc.

@clawsweeper clawsweeper Bot closed this Apr 29, 2026
@vincentkoc vincentkoc reopened this Apr 29, 2026
@vincentkoc vincentkoc force-pushed the clownfish/ghcrawl-156816-autonomous-smoke branch 3 times, most recently from 871970c to bc33d7f Compare April 29, 2026 05:53
@vincentkoc vincentkoc force-pushed the clownfish/ghcrawl-156816-autonomous-smoke branch from bc33d7f to 5a42828 Compare April 29, 2026 06:08
@clawsweeper clawsweeper Bot closed this Apr 30, 2026

Labels

channel: feishu Channel integration: feishu clawsweeper Tracked by ClawSweeper automation size: L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feishu WebSocket: No exponential backoff on reconnect

1 participant