
fix(feishu): terminate process on unrecoverable websocket drops #10620

Open
watsonctl wants to merge 1 commit into NousResearch:main from watsonctl:fix-feishu-zombie

Conversation

@watsonctl

Fixes #10616

Description

In the Feishu integration, when the underlying connection hits a keepalive ping timeout, the SDK's message loop exits, but the main Hermes agent process neither terminates nor reconnects. The gateway is left in a zombie state: it appears "running" to service managers such as systemd but accepts no messages.

This PR adds a crash-only mechanism: if the Feishu websocket loop terminates unexpectedly while the gateway adapter is still running, the process is forced to exit via os._exit(1). A service manager configured to restart the service (for example systemd with Restart=always) can then detect the failure and restart the agent.
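
For reviewers, here is a minimal sketch of the crash-only idea described above. It is not the actual diff: `supervise_ws_loop`, `start_supervised`, `run_loop`, `is_running` and `exit_fn` are hypothetical names, and the sketch assumes the SDK exposes its message loop as a blocking callable. `exit_fn` defaults to `os._exit` and is injectable only so the behaviour can be exercised without killing the interpreter.

```python
import logging
import os
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def supervise_ws_loop(
    run_loop: Callable[[], None],
    is_running: Callable[[], bool],
    exit_fn: Callable[[int], None] = os._exit,
) -> None:
    """Run the blocking websocket loop and exit the process if it ends unexpectedly.

    run_loop   -- hypothetical stand-in for the SDK's blocking message loop
    is_running -- hypothetical flag: True while the adapter has not been stopped
    exit_fn    -- process-exit hook; os._exit by default
    """
    try:
        run_loop()  # blocks until the loop returns or raises
    except Exception:
        logger.exception("Feishu websocket loop raised")

    # Reaching this point means the loop is gone. If nobody asked the adapter
    # to stop, treat it as unrecoverable and let the service manager restart us.
    if is_running():
        logger.critical("Feishu websocket loop ended while adapter is running; exiting")
        exit_fn(1)


def start_supervised(
    run_loop: Callable[[], None],
    is_running: Callable[[], bool],
    exit_fn: Callable[[int], None] = os._exit,
) -> threading.Thread:
    """Start the supervised loop on a daemon thread and return that thread."""
    thread = threading.Thread(
        target=supervise_ws_loop,
        args=(run_loop, is_running, exit_fn),
        daemon=True,
    )
    thread.start()
    return thread
```

Note that `os._exit` ends the process immediately without running `atexit` handlers or `finally` blocks and without flushing stdio buffers. That is what lets it terminate the whole process from a worker thread, but any state that must survive a restart has to be persisted before this point.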

@alt-glitch added the labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), platform/feishu (Feishu / Lark adapter), and comp/gateway (Gateway runner, session dispatch, delivery) on Apr 25, 2026
@alt-glitch

Collaborator

Related: #10801 bundles a WS watchdog fix for the same zombie gateway issue (#10616).

Development

Successfully merging this pull request may close these issues.

Feishu/Lark WebSocket drops lead to Zombie Gateway Process without auto-recovery