
Feishu health-monitor restarts cause leaked reconnect loops in @larksuiteoapi/node-sdk WSClient #40451

@noodle-bag

Description


The Feishu channel health-monitor periodically detects the connection as stuck and triggers a restart (~every 35 minutes). Due to a bug in @larksuiteoapi/node-sdk's WSClient.reConnect() (reported upstream: larksuite/node-sdk#177), each restart leaks the previous reconnect loop's setTimeout handles, causing unbounded parallel reconnect attempts.

Observed Behavior

After running openclaw-gateway v2026.3.1 for ~3 days:

  • 62,000+ [ws] ws connect failed log entries
  • Dozens of concurrent leaked loopReConnect loops (visible from simultaneous retry counters like 66, 364, 1446, 1241, etc.)
  • Memory growth from ~750MB RSS to 1.9GB peak
  • DingTalk health-monitor exhibits the same stuck → restart pattern every ~35 minutes

Meanwhile, actual message delivery via Feishu and DingTalk works fine — the latest connection established by each restart is successful. The leaked loops are all from orphaned reconnect chains.

Logs

[health-monitor] [feishu:default] health-monitor: restarting (reason: stuck)
[feishu] feishu[default]: abort signal received, stopping
[feishu] starting feishu[default] (mode: websocket)
[feishu] feishu[default]: WebSocket client started
[error]: [ '[ws]', 'ws connect failed' ]
[error]: [ '[ws]', 'connect failed' ]
[info]: [ '[ws]', 'reconnect' ]
[info]: [ '[ws]', 'ws client ready' ]
# ...then old loops continue:
[info]: [ 'ws', 'unable to connect to the server after trying 828 times' ]
[info]: [ 'ws', 'unable to connect to the server after trying 1424 times' ]

Root Cause

The upstream SDK bug (larksuite/node-sdk#177): WSClient.reConnect() stores only the latest setTimeout ID in this.reconnectInterval, so clearTimeout in subsequent restarts can only cancel the most recent timer, not older ones.
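The leak pattern can be shown with a minimal, self-contained model (illustrative only; this is not the SDK's actual source, and `LeakyReconnector` plus the fake scheduler are invented for the demonstration):

```typescript
// Deterministic stand-in for setTimeout/clearTimeout so the leak can be
// demonstrated without real timers.
type Task = () => void;
const pending = new Map<number, Task>();
let nextId = 0;
const fakeSetTimeout = (fn: Task): number => {
  pending.set(++nextId, fn);
  return nextId;
};
const fakeClearTimeout = (id?: number): void => {
  if (id !== undefined) pending.delete(id);
};
const tick = (): void => {
  const tasks = [...pending.values()];
  pending.clear();
  tasks.forEach((t) => t());
};

class LeakyReconnector {
  // Only the LATEST timer id is kept; ids of older chains are lost.
  private reconnectInterval?: number;
  attempts = 0;

  loopReConnect(): void {
    this.attempts++;
    this.reconnectInterval = fakeSetTimeout(() => this.loopReConnect());
  }

  stop(): void {
    // Cancels at most ONE pending timer: the most recently scheduled one.
    fakeClearTimeout(this.reconnectInterval);
  }
}

const client = new LeakyReconnector();
client.loopReConnect(); // chain A schedules its next attempt
client.loopReConnect(); // chain B overwrites chain A's stored timer id
client.stop();          // cancels only chain B's pending timer
tick();                 // chain A still fires and re-arms itself
console.log(client.attempts); // 3: chain A survived stop()
```

Each health-monitor restart plays the role of the second `loopReConnect()` plus `stop()` here: the newest chain is cancelled and replaced, while every older chain keeps re-arming forever.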

Suggested Mitigations (openclaw side)

  1. Before restarting the Feishu provider, ensure the SDK's WSClient is fully torn down (not merely sent an abort signal). Consider creating a fresh WSClient instance on each restart rather than reusing the same one.

  2. Tune health-monitor sensitivity: if the connection is actually working (messages are being received and dispatched), adjust the stuck-detection logic to avoid unnecessary restarts. The current ~35-minute restart cycle creates a new leaked loop each time.

  3. Apply a workaround until the SDK is fixed: wrap the WSClient with a generation counter or AbortController that forcefully exits orphaned loopReConnect callbacks.
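The generation-counter idea in mitigation 3 can be sketched as follows (`GuardedReconnector` and the injectable `schedule` callback are hypothetical illustrations, not openclaw or SDK API):

```typescript
// Sketch of a generation-counter guard. Each restart bumps `generation`;
// a reconnect callback captured under an older generation exits instead of
// re-arming itself, so orphaned chains die out after one scheduled tick.
class GuardedReconnector {
  private generation = 0;
  attempts = 0;

  // `schedule` is injectable so the loop can be driven manually below;
  // in production it would be something like fn => setTimeout(fn, delayMs).
  constructor(private schedule: (fn: () => void) => void) {}

  start(): void {
    const gen = ++this.generation; // snapshot the current generation
    const loop = (): void => {
      if (gen !== this.generation) return; // orphaned chain: bail out here
      this.attempts++;
      this.schedule(loop); // re-arm only while still current
    };
    loop();
  }
}

// Drive two overlapping "restarts" through a manual task queue.
const queue: Array<() => void> = [];
const guarded = new GuardedReconnector((fn) => queue.push(fn));
guarded.start(); // generation 1: attempts = 1
guarded.start(); // generation 2 orphans generation 1: attempts = 2
queue.splice(0).forEach((fn) => fn()); // gen-1 loop bails; gen-2 re-arms
console.log(guarded.attempts); // 3
```

The same effect can be had with an AbortController by checking `signal.aborted` at the top of the loop and calling `abort()` on restart; the generation counter is just the dependency-free version of that check.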

Environment

  • openclaw-gateway: v2026.3.1
  • @larksuiteoapi/node-sdk: 1.59.0
  • Node.js: v22 (Linux x86_64, Ubuntu 24.04)
  • Uptime at observation: ~3 days
