Bug Description
When the Hermes gateway is restarted via systemctl --user restart hermes-gateway, the gateway process sometimes hangs during shutdown and is SIGKILL'd by systemd once TimeoutStopSec expires (60s in the logs below). This leaves a stale PID file, causing the new gateway instance to fail with "PID file race lost to another gateway instance. Exiting."
Steps to Reproduce
- Configure Hermes gateway with Feishu/Lark platform adapter
- Run systemctl --user restart hermes-gateway
- If the Feishu WebSocket thread happens to be blocked (e.g., waiting for network I/O), the gateway hangs during shutdown
- After 60 seconds, systemd sends SIGKILL
- New instance starts but fails with "PID file race lost" error
- Gateway enters restart loop until manually fixed
Root Cause
The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., the Feishu adapter's WebSocket thread waiting for a network response), the entire shutdown process hangs.
When systemd sends SIGKILL after the stop timeout, the process is terminated immediately and Python's atexit handlers don't run, so the PID file (~/.hermes/gateway.pid) is never cleaned up. The new instance sees the stale PID file and exits with "PID file race lost".
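To make the second half of the root cause concrete, here is a minimal standalone sketch (not Hermes code) showing that an atexit cleanup removes a PID file on a normal exit but leaves it behind after SIGKILL. The child script, the demo.pid file name, and the pid_file_survives helper are all made up for illustration, and the demo assumes a POSIX system where signal.SIGKILL is available.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Hypothetical child process: writes a PID file, registers an atexit cleanup,
# then either exits normally ("exit") or sleeps as if stuck in shutdown ("hang").
CHILD = r"""
import atexit, os, sys, time
pid_file = sys.argv[1]
with open(pid_file, "w") as f:
    f.write(str(os.getpid()))
atexit.register(os.remove, pid_file)
if sys.argv[2] == "hang":
    time.sleep(60)
"""


def pid_file_survives(mode: str) -> bool:
    """Run the child in the given mode and report whether its PID file is left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = os.path.join(tmp, "demo.pid")
        proc = subprocess.Popen([sys.executable, "-c", CHILD, pid_file, mode])
        if mode == "hang":
            # Wait until the child has written its PID file, then kill it the
            # way systemd does once the stop timeout expires.
            deadline = time.monotonic() + 10
            while time.monotonic() < deadline:
                if os.path.exists(pid_file) and os.path.getsize(pid_file) > 0:
                    break
                time.sleep(0.05)
            proc.send_signal(signal.SIGKILL)
        proc.wait()
        return os.path.exists(pid_file)


if __name__ == "__main__":
    print("stale after SIGKILL:", pid_file_survives("hang"))
    print("stale after normal exit:", pid_file_survives("exit"))
```

The same applies to any cleanup that lives in atexit or finally blocks: SIGKILL cannot be caught, so the only reliable defence is to finish shutdown before systemd escalates.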
Relevant Logs
Apr 23 03:21:57 python[1782979]: WARNING gateway.run: Shutdown diagnostic — other hermes processes running:
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: State 'stop-sigterm' timed out. Killing.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Killing process 1782979 (python) with signal SIGKILL.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Failed with result 'timeout'.
Apr 23 03:22:58 python[1783144]: ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
Proposed Fix
Add a timeout wrapper around adapter.disconnect() in the shutdown sequence:
    _adapter_disconnect_timeout = 15.0  # seconds per adapter
    for platform, adapter in list(self.adapters.items()):
        try:
            await asyncio.wait_for(adapter.disconnect(), timeout=_adapter_disconnect_timeout)
            logger.info("✓ %s disconnected", platform.value)
        except asyncio.TimeoutError:
            logger.warning(
                "✗ %s disconnect timed out after %.1fs - forcing continue",
                platform.value, _adapter_disconnect_timeout
            )
This ensures the shutdown sequence always completes within a reasonable time, allowing PID file cleanup to run properly.
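The pattern can be tried in isolation with the sketch below. It is not the actual gateway code: FastAdapter, HungAdapter, and disconnect_all are hypothetical stand-ins, and the hung adapter simply awaits a very long sleep to imitate a disconnect() that never returns.

```python
import asyncio
import logging

logger = logging.getLogger("shutdown_demo")


class FastAdapter:
    """Hypothetical adapter whose disconnect() finishes promptly."""

    async def disconnect(self) -> None:
        await asyncio.sleep(0.01)


class HungAdapter:
    """Hypothetical adapter whose disconnect() never returns on its own."""

    async def disconnect(self) -> None:
        await asyncio.sleep(3600)


async def disconnect_all(adapters: dict, timeout: float) -> list:
    """Disconnect each adapter with a per-adapter timeout.

    Returns the names of the adapters that timed out.
    """
    timed_out = []
    for name, adapter in list(adapters.items()):
        try:
            await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
            logger.info("%s disconnected", name)
        except asyncio.TimeoutError:
            logger.warning("%s disconnect timed out after %.1fs", name, timeout)
            timed_out.append(name)
        except Exception:
            # One failing adapter should not stop the rest of the shutdown.
            logger.exception("%s disconnect failed", name)
    return timed_out


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    adapters = {"fast": FastAdapter(), "hung": HungAdapter()}
    print(asyncio.run(disconnect_all(adapters, timeout=0.1)))
```

Two caveats. asyncio.wait_for can only interrupt a coroutine that is suspended at an await; if disconnect() blocks the event loop synchronously, the timeout cannot fire. Also, because adapters are disconnected one after another, the worst case is (number of adapters) × (timeout), so that total needs to stay below the systemd stop timeout for the fix to hold.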
Environment
- Hermes Agent version: latest main branch
- Platform: Feishu/Lark
- OS: Linux (systemd user service)