fix(gateway): add timeout to adapter.disconnect() during shutdown#14130
Open
happy5318 wants to merge 1 commit into
Prevent gateway shutdown hangs when a platform adapter's disconnect() method blocks indefinitely (e.g., Feishu WebSocket thread waiting for network I/O). Without this timeout, systemd sends SIGKILL after TimeoutStopSec, but SIGKILL doesn't trigger Python's atexit handlers, leaving a stale PID file that causes 'PID file race lost' errors on restart.

Changes:
- Wrap adapter.disconnect() in asyncio.wait_for() with 15s timeout
- Log warning on timeout and continue with shutdown instead of hanging
- Ensures PID file cleanup always runs even if adapter cleanup fails

Fixes NousResearch#14128
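The SIGKILL/atexit interaction described in the commit message can be reproduced in isolation. The following is a minimal POSIX-only sketch, not the gateway's actual code: the child script, the `gateway.pid` file name, the SIGTERM handler, and the helper `pid_file_survives` are all hypothetical stand-ins chosen for illustration.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Child process: writes a PID file, registers an atexit hook that removes it,
# then blocks, standing in for a gateway that is stuck during shutdown.
# The SIGTERM handler is an assumption; it turns SIGTERM into a normal
# interpreter exit so that atexit hooks get a chance to run.
CHILD = textwrap.dedent("""
    import atexit, os, signal, sys, time
    pid_file = sys.argv[1]
    with open(pid_file, "w") as f:
        f.write(str(os.getpid()))
    atexit.register(os.remove, pid_file)
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    print("ready", flush=True)
    time.sleep(60)
""")


def pid_file_survives(sig):
    """Start the child, send it `sig`, and report whether its PID file is left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = os.path.join(tmp, "gateway.pid")  # hypothetical file name
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, pid_file],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()  # wait until the PID file exists and the hook is registered
        proc.send_signal(sig)
        proc.wait(timeout=10)
        proc.stdout.close()
        return os.path.exists(pid_file)


if __name__ == "__main__":
    print("stale after SIGTERM:", pid_file_survives(signal.SIGTERM))
    print("stale after SIGKILL:", pid_file_survives(signal.SIGKILL))
```

SIGKILL cannot be caught, so the interpreter never reaches its exit hooks and the file is left behind. This is why the shutdown path has to finish on its own before systemd escalates.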
This was referenced Apr 22, 2026
austinpickett
requested changes
May 18, 2026
austinpickett
left a comment
Collaborator
LGTM - Use PULL_REQUEST_TEMPLATE.md, fix merge conflicts
Contributor
Pull request overview
Adds a 15-second timeout around each adapter's disconnect() call during gateway shutdown so a stuck adapter (e.g., Feishu WebSocket) cannot block the entire shutdown sequence and prevent PID file cleanup.
Changes:
- Wraps adapter.disconnect() in asyncio.wait_for() with a 15s per-adapter timeout.
- Adds an asyncio.TimeoutError handler that logs a warning and continues with the next adapter.
Problem
When the Hermes gateway is restarted under systemd, the gateway process sometimes hangs during shutdown and gets SIGKILL'd by systemd after TimeoutStopSec (default 60s). This leaves a stale PID file, causing the new gateway instance to fail with "PID file race lost to another gateway instance. Exiting."
Root Cause
The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., Feishu adapter's WebSocket thread waiting for network response), the entire shutdown process hangs.
When systemd sends SIGKILL after the timeout, Python's atexit handlers don't run, so the PID file is never cleaned up. The new instance sees the stale PID file and exits.
Solution
Add a timeout wrapper (asyncio.wait_for) around adapter.disconnect() with a 15-second timeout per adapter. On timeout, log a warning and continue with the shutdown sequence instead of hanging indefinitely.
Changes
- Wrap adapter.disconnect() in asyncio.wait_for() with 15s timeout
- Add asyncio.TimeoutError handler to log warning and continue
Testing
Manually tested by triggering gateway restart while Feishu WebSocket was active. The gateway now shuts down cleanly within the timeout and restarts successfully without "PID file race lost" errors.
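For reviewers, the per-adapter timeout pattern described in this PR can be sketched roughly as follows. This is a minimal illustration, not the diff itself: the function name `disconnect_adapters`, the name-to-adapter mapping, the logger name, the generic `except Exception` branch, and the two fake adapter classes are assumptions made so the example runs on its own.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.shutdown")  # hypothetical logger name

DISCONNECT_TIMEOUT_S = 15.0  # per-adapter timeout stated in this PR


async def disconnect_adapters(adapters, timeout=DISCONNECT_TIMEOUT_S):
    """Disconnect each adapter in turn, giving each at most `timeout` seconds.

    `adapters` is assumed to be a mapping of platform name -> object with an
    async disconnect() method. Returns the names of adapters that timed out.
    """
    timed_out = []
    for name, adapter in adapters.items():
        try:
            await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
        except asyncio.TimeoutError:
            # Give up on this adapter and move on so later cleanup still runs.
            logger.warning(
                "%s disconnect timed out after %.1fs; continuing shutdown",
                name,
                timeout,
            )
            timed_out.append(name)
        except Exception:
            # Assumption: other adapter failures should not abort shutdown either.
            logger.exception("%s disconnect failed; continuing shutdown", name)
    return timed_out


class StuckAdapter:
    """Fake adapter whose disconnect() never finishes on its own."""

    async def disconnect(self):
        await asyncio.sleep(3600)


class QuickAdapter:
    """Fake adapter that disconnects immediately."""

    def __init__(self):
        self.disconnected = False

    async def disconnect(self):
        self.disconnected = True


if __name__ == "__main__":
    fakes = {"stuck": StuckAdapter(), "quick": QuickAdapter()}
    print(asyncio.run(disconnect_adapters(fakes, timeout=0.1)))
```

One caveat worth keeping in mind: asyncio.wait_for() can only cut short a disconnect() that yields to the event loop. It cancels the awaited coroutine and waits for that cancellation to complete, so a disconnect() that blocks the loop thread synchronously, or that suppresses cancellation, would still stall shutdown.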
Fixes #14128