fix(feishu): WebSocket connection not recovered after network disruption #52618

@bridzl

Bug Description

When the Feishu WebSocket connection drops due to a network disruption, the connection is never re-established. The gateway silently stops receiving Feishu messages until openclaw gateway restart is run manually.

Steps to Reproduce

  1. Start OpenClaw with Feishu channel enabled (connectionMode: "websocket")
  2. Wait for WebSocket to connect ([ws] ws client ready)
  3. Cause a network interruption (e.g., router restart, ISP hiccup) lasting ~10 minutes
  4. Observe gateway logs:

     [ws] unable to connect to the server after trying 1 times
     [ws] unable to connect to the server after trying 2 times
     ...
     [ws] unable to connect to the server after trying 7 times  ← ECONNRESET

  5. After network recovers, Feishu WebSocket does not reconnect
  6. Messages sent to the bot are silently dropped

Expected Behavior

After the Lark SDK exhausts its internal retries, OpenClaw should implement a supervisor loop that periodically attempts to re-establish the WebSocket connection with exponential backoff, similar to how Slack and Telegram channels handle reconnection.

Root Cause Analysis

In extensions/feishu/src/monitor.transport.ts (lines 84-127), monitorWebSocket() calls wsClient.start() once inside a Promise that resolves only on the abort signal. As a result, there is:

  • No reconnection loop after SDK retry exhaustion
  • No stall/disconnect detection
  • No backoff or supervisor logic
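
To make the failure mode concrete, here is a simplified illustration of the pattern described above. It is not the actual contents of monitor.transport.ts: `FakeWsClient`, `simulateRetryExhaustion`, and `monitorOnce` are hypothetical stand-ins, and the give-up behavior is assumed from the logs in the reproduction steps.

```typescript
// Stand-in for an SDK client. This is NOT the real Lark WSClient API.
class FakeWsClient {
  public connected = false;
  public gaveUp = false;

  start(): void {
    this.connected = true;
  }

  // Simulates the SDK exhausting its internal retries after a network drop.
  simulateRetryExhaustion(): void {
    this.connected = false;
    this.gaveUp = true;
  }

  close(): void {
    this.connected = false;
  }
}

// The shape described in the analysis: start() is called exactly once,
// and the promise settles only when the abort signal fires.
function monitorOnce(client: FakeWsClient, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) {
      resolve();
      return;
    }
    signal.addEventListener(
      "abort",
      () => {
        client.close();
        resolve();
      },
      { once: true },
    );
    client.start();
    // Nothing here observes the client giving up, so nothing restarts it.
  });
}
```

In this sketch, once the fake client gives up, the promise stays pending and the client stays disconnected until the caller aborts, which matches the "dead until manual restart" symptom.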

Comparison with other channels:

| Channel | Reconnection | Backoff | Stall Detection |
|---|---|---|---|
| Slack | Explicit while loop with SLACK_SOCKET_RECONNECT_POLICY | Exponential (2s→30s, 1.8x, 25% jitter) | No |
| Telegram | grammY runner (maxRetryTime: 60min) + explicit loop | Exponential (2s→30s, 1.8x, 25% jitter) | Yes (90s watchdog) |
| Discord | Full lifecycle controller with createArmableStallWatchdog | Exponential via computeBackoff | Yes (5min reconnect stall) |
| Feishu | None; delegates entirely to Lark SDK (7 retries, then dead) | None | None |
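
For reference, the "2s→30s, 1.8x, 25% jitter" schedule listed for Slack and Telegram could be computed roughly as follows. This is a sketch, not OpenClaw's computeBackoff: the `BackoffPolicy` shape, the `backoffDelayMs` name, and the reading of "25% jitter" as a random addition of up to 25% of the base delay are all assumptions.

```typescript
// Hypothetical policy shape; the values mirror the Slack/Telegram rows.
interface BackoffPolicy {
  initialMs: number;
  maxMs: number;
  factor: number;
  jitter: number; // fraction of the base delay added at random (assumed interpretation)
}

const EXAMPLE_POLICY: BackoffPolicy = {
  initialMs: 2000,
  maxMs: 30000,
  factor: 1.8,
  jitter: 0.25,
};

// attempt is 1-based: attempt 1 waits roughly initialMs.
// `random` is injectable so the schedule can be checked deterministically.
function backoffDelayMs(
  attempt: number,
  policy: BackoffPolicy,
  random: () => number = Math.random,
): number {
  const exponent = Math.max(attempt - 1, 0);
  const base = policy.initialMs * Math.pow(policy.factor, exponent);
  const jitter = base * policy.jitter * random();
  // The cap applies after jitter, so no delay exceeds maxMs.
  return Math.min(policy.maxMs, Math.round(base + jitter));
}
```

With jitter forced to zero, the first three delays under these assumptions are 2000 ms, 3600 ms, and 6480 ms, and later attempts are capped at 30000 ms.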

Proposed Fix

Wrap the existing monitorWebSocket() in a supervisor loop that:

  1. Catches connection failures after Lark SDK retry exhaustion
  2. Recreates the WSClient and retries with exponential backoff
  3. Reuses OpenClaw's existing createArmableStallWatchdog for disconnect detection
  4. Follows the Slack reconnection pattern (while (!aborted) { try/catch + backoff })
  5. Fails fast on non-recoverable errors (invalid credentials, app disabled)
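
The steps above can be sketched as a small, transport-agnostic supervisor. Every name here (`superviseConnection`, `connectOnce`, `isFatal`, `delayMs`, `abortableSleep`) is hypothetical and nothing in it is OpenClaw or Lark SDK code; in a real fix, `connectOnce` would presumably create a fresh WSClient and reject once the SDK gives up. The stall watchdog from step 3 is left out to keep the sketch short.

```typescript
// Creates a fresh client, calls onReady once connected, and settles when the
// connection ends (rejecting on failure). Hypothetical contract.
type ConnectOnce = (signal: AbortSignal, onReady: () => void) => Promise<void>;

interface SupervisorOptions {
  connectOnce: ConnectOnce;
  isFatal: (err: unknown) => boolean; // e.g. invalid credentials, app disabled
  delayMs: (attempt: number) => number; // backoff schedule, 1-based attempt
  signal: AbortSignal;
  sleep?: (ms: number, signal: AbortSignal) => Promise<void>;
}

// Resolves after ms, or immediately when the signal aborts.
function abortableSleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve) => {
    if (signal.aborted) {
      resolve();
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

async function superviseConnection(opts: SupervisorOptions): Promise<void> {
  const sleep = opts.sleep ?? abortableSleep;
  let attempt = 0;
  while (!opts.signal.aborted) {
    try {
      // A successful connect resets the backoff schedule.
      await opts.connectOnce(opts.signal, () => {
        attempt = 0;
      });
    } catch (err) {
      // Non-recoverable errors end the loop instead of retrying forever.
      if (opts.isFatal(err)) throw err;
    }
    if (opts.signal.aborted) break;
    attempt += 1;
    await sleep(opts.delayMs(attempt), opts.signal);
  }
}
```

Injecting `delayMs` and `sleep` keeps the loop independent of any particular backoff policy and makes it possible to exercise it in tests without real timers.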

Environment

  • OpenClaw: 2026.3.13
  • macOS (Mac Mini M4), Feishu channel via WebSocket
  • @larksuiteoapi/node-sdk: ^1.59.0
  • Home network (occasional ISP disruptions)
