
fix(feishu): reconcile WebSocket reconnect backoff#73998

Open
openclaw-clownfish[bot] wants to merge 1 commit into main from clownfish/ghcrawl-207048-agentic-merge

Conversation

@openclaw-clownfish
Contributor

Summary

Repair the remaining Feishu WebSocket reconnect/backoff and PingInterval guard path for #55532 without process-wide Lark.WSClient prototype mutation. Keep the patch narrow to the Feishu client/monitor surfaces, building on current main after #72411.

Credit

Carries forward @sirfengyu's investigation from #55619 and the closed ProjectClownfish replacement attempt in #73945. Preserve @alex-xuweilong's heartbeat attribution from #45674 if that constructor path is touched.

Review blockers to address

  • Replace global WSClient prototype mutation with explicit Feishu-owned wrapper/composition behavior.
  • Reset reconnect backoff after successful recovery.
  • Cancel pending reconnect waits when the monitor/client closes.
  • Avoid stacking custom backoff on top of the SDK reconnect delay.
  • Remove the unreachable cleanup/dead-code paths from fix(feishu): reconcile WebSocket reconnect backoff (#73945) if reusing that implementation.
  • Decide how to handle factory/startup failures during reconnect recovery with a narrow, test-backed delay policy.
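The backoff behavior described in the second and fourth blockers can be sketched as a small Feishu-owned helper. This is illustrative only; the class and method names below are not taken from the PR diff:

```typescript
// Illustrative sketch: exponential reconnect backoff that the monitor resets
// from the SDK lifecycle hooks (onReady/onReconnected) after recovery.
class FeishuReconnectBackoff {
  private attempt = 0;

  constructor(
    private readonly baseMs = 1_000,
    private readonly maxMs = 60_000,
  ) {}

  // Delay doubles per failed attempt and is capped at maxMs, so the custom
  // backoff never stacks unbounded delay on top of the SDK's own retry wait.
  nextDelayMs(): number {
    const delay = Math.min(this.baseMs * 2 ** this.attempt, this.maxMs);
    this.attempt += 1;
    return delay;
  }

  // Called on successful recovery so the next outage starts from baseMs.
  reset(): void {
    this.attempt = 0;
  }
}

const backoff = new FeishuReconnectBackoff();
console.log(backoff.nextDelayMs()); // 1000
console.log(backoff.nextDelayMs()); // 2000
backoff.reset();
console.log(backoff.nextDelayMs()); // 1000
```

Cancelling pending reconnect waits on close (the third blocker) would pair a counter like this with an abortable sleep rather than a bare setTimeout.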

Validation

  • pnpm test:serial extensions/feishu/src/client.test.ts extensions/feishu/src/monitor.cleanup.test.ts
  • pnpm check:changed

Fixes #55532. Related to #42354. Security-routed refs #46472, #68865, and #72411 are not modified by this PR.

ProjectClownfish replacement details:

openclaw-clownfish Bot added the clawsweeper (Tracked by ClawSweeper automation) label on Apr 29, 2026
openclaw-barnacle Bot added the channel: feishu (Channel integration: feishu), size: M, and r: too-many-prs (Auto-close: author has more than twenty active PRs) labels, and removed the r: too-many-prs label, on Apr 29, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR replaces the Feishu WebSocket client's static wsConfig constructor override and prototype-mutation approach with SDK-native lifecycle hooks (onReady, onError, onReconnecting, onReconnected) and a sanitizing httpInstance that guards invalid server-returned ClientConfig fields (PingInterval, ReconnectCount, ReconnectInterval, ReconnectNonce). The monitor backoff counter is reset on onReady/onReconnected, and onError (SDK reconnect exhaustion) triggers a clean monitor-level retry without stacking backoff on top of the SDK's own reconnect delay.

Confidence Score: 4/5

Safe to merge with one behavior change to verify: PingTimeout: 3 from the old static wsConfig is no longer set anywhere.

All changes are P2. The only noteworthy gap is that the old PingTimeout: 3 constructor-level guard is dropped and not replicated — if the Lark SDK default is much larger, dead-connection detection may be slower. The rest of the implementation (backoff reset, error-triggered retry, ClientConfig sanitization) is logically sound and well-tested.

extensions/feishu/src/client.ts — confirm whether PingTimeout needs to be set explicitly on the WSClient constructor alongside the new httpInstance approach.
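The sanitizing step the summary describes could look roughly like the sketch below. The interface shape and function name are assumptions for illustration and may not match the PR's actual sanitizeFeishuWsEndpointResponse:

```typescript
// Illustrative sketch: replace absent or invalid server-returned ClientConfig
// fields with safe defaults before handing them to the Lark SDK.
interface FeishuWsClientConfig {
  PingInterval?: number;
  ReconnectCount?: number;
  ReconnectInterval?: number;
  ReconnectNonce?: number;
}

const DEFAULTS: Required<FeishuWsClientConfig> = {
  PingInterval: 30,
  ReconnectCount: -1, // -1 means "retry forever" in the SDK's contract
  ReconnectInterval: 120,
  ReconnectNonce: 30,
};

function sanitizeClientConfig(
  cfg: FeishuWsClientConfig,
): Required<FeishuWsClientConfig> {
  // ReconnectCount may legitimately be negative, so it gets a looser check.
  const valid = (v: number | undefined, allowNonPositive = false) =>
    typeof v === "number" && Number.isFinite(v) && (allowNonPositive || v > 0);
  return {
    PingInterval: valid(cfg.PingInterval) ? cfg.PingInterval! : DEFAULTS.PingInterval,
    ReconnectCount: valid(cfg.ReconnectCount, true) ? cfg.ReconnectCount! : DEFAULTS.ReconnectCount,
    ReconnectInterval: valid(cfg.ReconnectInterval) ? cfg.ReconnectInterval! : DEFAULTS.ReconnectInterval,
    ReconnectNonce: valid(cfg.ReconnectNonce) ? cfg.ReconnectNonce! : DEFAULTS.ReconnectNonce,
  };
}
```

An empty or malformed response thus always yields a complete config, which is what keeps the SDK's pullConnectConfig from failing on a missing PingInterval.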



Comment on lines +18 to 23
const FEISHU_WS_CLIENT_CONFIG_DEFAULTS = {
  PingInterval: 30,
  PingTimeout: 3,
  ReconnectCount: -1,
  ReconnectInterval: 120,
  ReconnectNonce: 30,
} as const;
Contributor


P2 PingTimeout guard no longer applied

The old FEISHU_WS_CONFIG included PingTimeout: 3, passed as a constructor-level wsConfig override that limited how long the SDK would wait for a pong before treating the connection as dead. The new FEISHU_WS_CLIENT_CONFIG_DEFAULTS and sanitizeFeishuWsEndpointResponse only sanitize server-returned fields (PingInterval, ReconnectCount, ReconnectInterval, ReconnectNonce). PingTimeout is a static SDK constructor option and does not come from the API response, so it is silently dropped. If the SDK's default PingTimeout is much larger than 3 s, a dead connection may stall for longer before the heartbeat timeout triggers — which was presumably the original motivation for the 3 s override.


Comment on lines +163 to 175
const abortResult = new Promise<FeishuWsStopReason>((resolve) => {
  const handleAbort = () => {
    abortSignal.removeEventListener("abort", handleAbort);
    resolve({ status: "abort" });
  };
  removeAbortListener = () => {
    abortSignal.removeEventListener("abort", handleAbort);
  };
  abortSignal.addEventListener("abort", handleAbort, { once: true });
  if (abortSignal.aborted) {
    handleAbort();
  }
});
Contributor


P2 handleAbort redundantly calls removeEventListener when registered with { once: true }

The listener is added with { once: true } (line 171), which auto-removes it on first fire. handleAbort then also calls removeEventListener manually (line 165), making that call a no-op. Both branches of cleanup (handleAbort itself and removeAbortListener in .finally()) attempt the same removal. The code is correct but the manual call inside handleAbort is dead. Consider removing it to avoid confusion about which mechanism actually deregisters the listener.

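Following the reviewer's suggestion, the abort wiring can lean entirely on { once: true } for auto-removal, keeping only the external remover for the not-yet-fired case. This is a sketch, with the FeishuWsStopReason shape assumed from the quoted snippet:

```typescript
// Sketch: { once: true } deregisters the listener when it fires, so
// handleAbort no longer needs a manual removeEventListener call.
type FeishuWsStopReason = { status: "abort" };

function watchAbort(abortSignal: AbortSignal): {
  abortResult: Promise<FeishuWsStopReason>;
  removeAbortListener: () => void;
} {
  let removeAbortListener = () => {};
  const abortResult = new Promise<FeishuWsStopReason>((resolve) => {
    const handleAbort = () => resolve({ status: "abort" });
    // Used by the caller's .finally() when the signal never fires.
    removeAbortListener = () =>
      abortSignal.removeEventListener("abort", handleAbort);
    abortSignal.addEventListener("abort", handleAbort, { once: true });
    // addEventListener does not fire retroactively, so handle the
    // already-aborted case explicitly.
    if (abortSignal.aborted) handleAbort();
  });
  return { abortResult, removeAbortListener };
}
```

With this shape exactly one mechanism removes the listener on each path: the once option when the signal fires, and removeAbortListener when it does not.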

@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs changes before merge.

Summary
The PR replaces Feishu WSClient constructor wsConfig handling with a guarded httpInstance and lifecycle-hook retry handling, adds Feishu tests, and adds a changelog entry.

Next step before merge
This is a narrow, concrete repair on an existing ClawSweeper PR branch.

Security
Security review cleared: the diff is limited to Feishu channel source/tests and a changelog entry, with no concrete security or supply-chain regression found.

Review findings

  • [P2] Handle missing WS ClientConfig — extensions/feishu/src/client.ts:126-127
  • [P2] Preserve bot identity across retry cycles — extensions/feishu/src/monitor.transport.ts:227
Review details

Best possible solution:

Keep the Feishu-owned SDK lifecycle/httpInstance composition, preserve the current-main supervisor cleanup contract, and make missing or non-OK WS endpoint config deterministically reach a finite retry path with focused tests.

Do we have a high-confidence way to reproduce the issue?

Yes. Mock the WS endpoint response as code: 0 with missing ClientConfig, or as a system-busy non-OK response, and verify the SDK returns false without reaching the PR's monitor onError retry path; separately trigger SDK onError and assert bot identity maps survive the retry cycle.

Is this the best way to solve the issue?

No. The direction is maintainable, but the current branch is incomplete; the safer fix is a rebased Feishu-local recovery policy that handles malformed endpoint responses while carrying forward current-main identity-preserving cleanup.

Full review comments:

  • [P2] Handle missing WS ClientConfig — extensions/feishu/src/client.ts:126-127
    The sanitizer returns the endpoint response unchanged when data.ClientConfig is absent. In the pinned Lark SDK, pullConnectConfig then catches the later ClientConfig.PingInterval failure and returns false; with the default infinite reconnect count, the monitor onError recovery path may never fire. Route missing ClientConfig and non-OK endpoint responses into a finite Feishu recovery path and cover it in tests.
    Confidence: 0.88
  • [P2] Preserve bot identity across retry cycles — extensions/feishu/src/monitor.transport.ts:227
    After waitForFeishuWsStop returns an SDK error, this cleanup deletes botOpenIds and botNames before the retry creates a replacement WSClient. Current main deliberately keeps those maps during supervisor recovery and clears them only on true shutdown, so carry that behavior forward and test it.
    Confidence: 0.8

Overall correctness: patch is incorrect
Overall confidence: 0.86

Acceptance criteria:

  • pnpm test:serial extensions/feishu/src/client.test.ts extensions/feishu/src/monitor.cleanup.test.ts
  • pnpm check:changed

What I checked:

  • PR sanitizer gap: At PR head, sanitizeFeishuWsEndpointResponse returns responses unchanged when data.ClientConfig is missing, so the branch does not force malformed endpoint responses into a bounded Feishu-owned recovery path. (extensions/feishu/src/client.ts:126, 255e1262454d)
  • Dependency reconnect contract: @larksuiteoapi/node-sdk@1.62.1 catches pullConnectConfig failures and returns false; onError is only invoked for disabled auto-reconnect or when a non-negative reconnect count is exhausted, while the SDK default reconnect count is -1.
  • Current-main retry identity behavior: Current main preserves botOpenIds and botNames during supervisor retry cleanup with clearIdentity: false, clearing identity only on true monitor shutdown paths. (extensions/feishu/src/monitor.transport.ts:242, f7549079ceb3)
  • PR cleanup regression path: At PR head, cleanup always deletes Feishu bot identity maps, and the SDK-error stop path calls that cleanup before retrying the replacement WSClient. (extensions/feishu/src/monitor.transport.ts:227, 255e1262454d)
  • Security scope: The public PR diff is limited to CHANGELOG.md and Feishu source/tests; it does not touch workflows, dependency manifests, lockfiles, install/build/release scripts, generated/vendor files, or permissions. (255e1262454d)

Likely related people:

Remaining risk / open question:

  • The branch is based before the current-main recovery cleanup from fix(feishu): recover WebSocket after SDK retry exhaustion (#73739) and needs a careful rebase so it does not drop current identity-preservation semantics.
  • No local tests were run in this read-only review; the findings are from source, PR-diff, dependency-source, and history inspection.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f7549079ceb3.


Labels

channel: feishu (Channel integration: feishu) · clawsweeper (Tracked by ClawSweeper automation) · size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feishu WebSocket: No exponential backoff on reconnect

0 participants