fix(feishu): WebSocket ping timeout causes zombie gateway process without auto-recovery #65

@bugmaker2

Problem

When the Feishu WebSocket connection hits a keepalive ping timeout, the SDK's message loop exits, but the main Hermes agent process neither terminates nor reconnects. The gateway is left in a zombie-like state: systemd still reports it as "running", but it no longer receives any messages.

Log output:

[Lark] [ERROR] receive message loop exit, err: sent 1011 (internal error) keepalive ping timeout; no close frame received
[Lark] [WARNING] ping failed, err: sent 1011 (internal error) keepalive ping timeout

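Expected behavior: Under a crash-only architecture, a dead message loop in the feishu.py integration thread should take the whole process down with a non-zero exit status (e.g. SystemExit(1) raised on the main thread; raised inside a worker thread, SystemExit only ends that thread), so systemd-level Restart=always can respawn a healthy stack.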
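
One detail worth verifying before implementing this: in CPython, a SystemExit raised inside a non-main thread is silently swallowed by the threading module and ends only that thread, while os._exit() terminates the whole process from any thread. A minimal check, run in child interpreters so the result is visible as an exit status (the helper name run_child is made up for this example):

```python
import subprocess
import sys
import textwrap


def run_child(body: str) -> int:
    """Run `body` in a fresh interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", textwrap.dedent(body)]).returncode


# SystemExit raised in a worker thread: the thread ends, the process exits 0.
thread_exit = run_child("""
    import threading

    def worker():
        raise SystemExit(1)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
""")

# os._exit() in a worker thread: the whole process exits with status 1.
hard_exit = run_child("""
    import os
    import threading

    def worker():
        os._exit(1)

    t = threading.Thread(target=worker)
    t.start()
    t.join()
""")

print(thread_exit, hard_exit)  # prints "0 1"
```

If the integration runs off the main thread, the exit therefore has to be routed to the main thread or done with os._exit(1); note that os._exit() skips atexit handlers and buffered-output flushing.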

Reference

Upstream: NousResearch#10616

Design (from OpenClaw)

OpenClaw's monitor.ts has a complete lifecycle management system:

  • monitorSingleAccount() with abort signal support
  • Health check probe via fetchBotIdentityForMonitor()
  • State management in monitor.state.ts
  • Webhook anomaly tracking
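
For illustration only, the health-probe idea could look roughly like this on the Python side. This is not OpenClaw's code (which is TypeScript); probe_identity and the injected fetch_identity callable are hypothetical names, and how the bot identity is actually fetched from Feishu is deliberately left to the caller.

```python
import concurrent.futures
from typing import Callable, Optional


def probe_identity(
    fetch_identity: Callable[[], dict],  # hypothetical: caller supplies the real API call
    timeout: float = 5.0,
) -> Optional[dict]:
    """Return the bot identity, or None if the probe raises or times out."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        # result() raises a timeout error if the call does not finish in time.
        return pool.submit(fetch_identity).result(timeout=timeout)
    except Exception:
        return None
    finally:
        # Do not block on a hung call; the worker thread is left to finish on its own.
        pool.shutdown(wait=False)
```

A caller would treat a None result as "connection unhealthy" and decline to report the adapter as connected.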

Implementation Plan

  1. Add a watchdog thread/task in FeishuAdapter.connect() that monitors the SDK's message loop
  2. Detect ping timeout conditions and exit the whole process with a non-zero status (SystemExit(1) only does this when raised on the main thread) to trigger systemd restart
  3. Add botIdentity pre-fetch to validate connection health on startup
  4. Reference: gateway/platforms/feishu.py line ~1042 FeishuAdapter class
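
Steps 1 and 2 could be sketched as below. This is a rough outline under assumptions, not the actual FeishuAdapter code: it assumes the SDK's message loop runs in a dedicated thread that the adapter holds a reference to, and LoopWatchdog, beat(), and on_fatal are hypothetical names that exist neither in the Lark SDK nor in Hermes. The default on_fatal uses os._exit(1) because the watchdog itself runs in a worker thread.

```python
import os
import sys
import threading
import time
from typing import Callable, Optional


class LoopWatchdog:
    """Hypothetical helper: fires on_fatal once the monitored loop looks dead."""

    def __init__(
        self,
        loop_thread: threading.Thread,
        stale_after: float,
        poll_interval: float = 1.0,
        on_fatal: Callable[[], None] = lambda: os._exit(1),
    ) -> None:
        self._loop_thread = loop_thread
        self._stale_after = stale_after
        self._poll_interval = poll_interval
        self._on_fatal = on_fatal
        self._last_beat = time.monotonic()
        self._stop = threading.Event()

    def beat(self) -> None:
        """Record activity; call this whenever a message or pong is seen."""
        self._last_beat = time.monotonic()

    def check(self) -> Optional[str]:
        """Return a failure reason, or None if the loop still looks healthy."""
        if not self._loop_thread.is_alive():
            return "message loop thread exited"
        if time.monotonic() - self._last_beat > self._stale_after:
            return "no activity within stale_after seconds"
        return None

    def _run(self) -> None:
        # Event.wait() returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._poll_interval):
            reason = self.check()
            if reason is not None:
                print(f"[watchdog] {reason}; exiting for restart", file=sys.stderr)
                self._on_fatal()
                return

    def start(self) -> threading.Thread:
        """Start the watchdog in a daemon thread and return that thread."""
        thread = threading.Thread(target=self._run, name="feishu-watchdog", daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        """Stop polling, e.g. during an intentional disconnect."""
        self._stop.set()
```

Injecting on_fatal keeps the class testable and lets the adapter choose between a hard exit and signalling the main thread. The stale_after threshold should be comfortably larger than the SDK's ping interval, otherwise a quiet but healthy connection would trigger a restart.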
