fix(gateway): add timeout to adapter.disconnect() during shutdown #14130

Open
happy5318 wants to merge 1 commit into NousResearch:main from happy5318:fix/gateway-shutdown-adapter-timeout
@happy5318

Contributor

Problem

When the Hermes gateway is restarted, the gateway process sometimes hangs during shutdown and is SIGKILL'd by systemd after TimeoutStopSec (default 60s). This leaves a stale PID file, causing the new gateway instance to fail with "PID file race lost to another gateway instance. Exiting."

Root Cause

The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., Feishu adapter's WebSocket thread waiting for network response), the entire shutdown process hangs.

When systemd sends SIGKILL after timeout, Python's atexit handlers don't run, so the PID file is never cleaned up. The new instance sees the stale PID file and exits.
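
For reviewers who want to see the SIGKILL behaviour in isolation, here is a minimal POSIX-only sketch. It is not the gateway's actual PID-file code: the child script, the `gateway.pid` filename, and the `pid_file_survives` helper are hypothetical stand-ins. The child removes its PID file from an atexit handler and exits cleanly on SIGTERM; the parent then checks whether the file is still there after each signal.

```python
import os
import signal
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical child process: writes a PID file and removes it via atexit.
CHILD = textwrap.dedent("""
    import atexit, os, signal, sys, time
    pid_path = sys.argv[1]
    with open(pid_path, "w") as f:
        f.write(str(os.getpid()))
    atexit.register(os.remove, pid_path)
    # Turn SIGTERM into a normal interpreter exit so atexit handlers run.
    signal.signal(signal.SIGTERM, lambda *_: sys.exit(0))
    print("ready", flush=True)
    time.sleep(60)
""")


def pid_file_survives(sig):
    """Start the child, send it `sig`, and report whether the PID file remains."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "gateway.pid")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, pid_path],
            stdout=subprocess.PIPE,
        )
        proc.stdout.readline()  # block until the child has written its PID file
        proc.send_signal(sig)
        proc.wait()
        proc.stdout.close()
        return os.path.exists(pid_path)
```

Under these assumptions, `pid_file_survives(signal.SIGTERM)` should report that the file was cleaned up, while `pid_file_survives(signal.SIGKILL)` should report that it was left behind, since SIGKILL cannot be caught and the interpreter never reaches its atexit handlers.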

Solution

Add a timeout wrapper (asyncio.wait_for) around adapter.disconnect() with a 15-second timeout per adapter. On timeout, log a warning and continue with the shutdown sequence instead of hanging indefinitely.

Changes

  • Wrap adapter.disconnect() in asyncio.wait_for() with 15s timeout
  • Add asyncio.TimeoutError handler to log warning and continue
  • Ensure PID file cleanup still runs when an adapter's disconnect() hangs
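
The changes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in gateway/run.py: the `adapters` mapping, the `disconnect_all` and `shutdown` helpers, the `remove_pid_file` callback, and the logger name are all hypothetical. Only `asyncio.wait_for`, `asyncio.TimeoutError`, and the 15-second per-adapter limit come from this PR.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.shutdown")  # hypothetical logger name

DISCONNECT_TIMEOUT_S = 15.0  # per-adapter limit described in this PR


async def disconnect_all(adapters, timeout=DISCONNECT_TIMEOUT_S):
    """Disconnect each adapter in turn; return the names that timed out.

    `adapters` is assumed to map a platform name to an object with an
    async disconnect() method. The real structure may differ.
    """
    timed_out = []
    for name, adapter in adapters.items():
        try:
            await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
        except asyncio.TimeoutError:
            # Log and move on to the next adapter instead of hanging.
            logger.warning(
                "%s disconnect timed out after %.2fs; continuing shutdown",
                name,
                timeout,
            )
            timed_out.append(name)
    return timed_out


async def shutdown(adapters, remove_pid_file, timeout=DISCONNECT_TIMEOUT_S):
    """Disconnect adapters, then run PID file cleanup regardless of outcome."""
    try:
        return await disconnect_all(adapters, timeout)
    finally:
        remove_pid_file()  # hypothetical cleanup callback
```

One design point worth keeping in mind: `asyncio.wait_for` cancels the awaited coroutine, which takes effect at an await point. A disconnect() that blocks the event loop thread synchronously would not be interrupted by this timeout.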

Testing

Manually tested by triggering gateway restart while Feishu WebSocket was active. The gateway now shuts down cleanly within the timeout and restarts successfully without "PID file race lost" errors.

Fixes #14128

Prevent gateway shutdown hangs when a platform adapter's disconnect()
method blocks indefinitely (e.g., Feishu WebSocket thread waiting for
network I/O). Without this timeout, systemd sends SIGKILL after
TimeoutStopSec, but SIGKILL doesn't trigger Python's atexit handlers,
leaving a stale PID file that causes 'PID file race lost' errors on
restart.

Changes:
- Wrap adapter.disconnect() in asyncio.wait_for() with 15s timeout
- Log warning on timeout and continue with shutdown instead of hanging
- Ensures PID file cleanup always runs even if adapter cleanup fails

Fixes NousResearch#14128

@austinpickett austinpickett left a comment

Collaborator

LGTM - Use PULL_REQUEST_TEMPLATE.md, fix merge conflicts

Copilot AI left a comment

Contributor

Pull request overview

Adds a 15-second timeout around each adapter's disconnect() call during gateway shutdown so a stuck adapter (e.g., Feishu WebSocket) cannot block the entire shutdown sequence and prevent PID file cleanup.

Changes:

  • Wraps adapter.disconnect() in asyncio.wait_for() with a 15s per-adapter timeout.
  • Adds an asyncio.TimeoutError handler that logs a warning and continues with the next adapter.

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P1: High (major feature broken, no workaround)
  • platform/feishu: Feishu / Lark adapter
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

fix(gateway): Gateway shutdown hangs causing 'PID file race lost' on restart
