Feishu/Lark WebSocket drops lead to Zombie Gateway Process without auto-recovery #10616

@watsonctl

Description

When using the Feishu integration, if the underlying WebSocket connection hits a keepalive ping timeout, the SDK's message loop exits, but the main Hermes agent process neither terminates nor reconnects. This leaves the Gateway in a zombie state: it still appears "running" to service managers such as systemd, but it no longer receives any messages.

Logs

[Lark] [ERROR] receive message loop exit, err: sent 1011 (internal error) keepalive ping timeout; no close frame received
[Lark] [WARNING] ping failed, err: sent 1011 (internal error) keepalive ping timeout

Expected Behavior (Crash-Only Architecture)

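If the Feishu WebSocket loop drops permanently and cannot reconnect on its own, the feishu.py integration thread should bring the whole process down with a non-zero exit status, or propagate the failure to the main thread so that it can exit. Note that raising SystemExit(1) inside a worker thread ends only that thread, so the exit has to happen in the main thread or through a hard exit such as os._exit(1). A system-level manager (for example systemd with Restart=always) can then respawn a healthy agent stack.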
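
A minimal sketch of the crash-only behavior described above, not the actual feishu.py code. The names `run_crash_only`, `loop`, and `on_fatal` are hypothetical; `loop` stands in for whatever blocking call runs the SDK's receive loop. The wrapper treats both an exception and a plain return as fatal, since the logs suggest the loop can exit without raising into the caller.

```python
import os
import threading
from typing import Callable


def run_crash_only(
    loop: Callable[[], None],
    on_fatal: Callable[[int], None] = os._exit,
    exit_code: int = 1,
) -> threading.Thread:
    """Run `loop` in a thread and end the whole process when it stops."""

    def supervised() -> None:
        try:
            loop()  # hypothetical: the blocking WebSocket receive loop
        except BaseException as exc:
            print(f"[feishu] receive loop raised: {exc!r}", flush=True)
        else:
            print("[feishu] receive loop returned unexpectedly", flush=True)
        # Reached on both paths. os._exit terminates the process from any
        # thread; SystemExit raised here would only end this thread.
        on_fatal(exit_code)

    thread = threading.Thread(target=supervised, name="feishu-ws", daemon=True)
    thread.start()
    return thread
```

Because os._exit skips atexit handlers and normal interpreter cleanup, a real implementation might prefer to signal the main thread and let it exit in an orderly way. The `on_fatal` parameter exists only so the behavior can be swapped or exercised without killing the process.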

Environment

  • OS: Ubuntu 24.04 via WSL2
  • Deploy type: systemd service
  • Provider: feishu

Metadata

Labels

  • P2 (Medium: degraded but workaround exists)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • platform/feishu (Feishu / Lark adapter)
  • type/bug (Something isn't working)
