fix(gateway): Gateway shutdown hangs causing 'PID file race lost' on restart #14128

@happy5318

Bug Description

When restarting the Hermes gateway via systemctl --user restart hermes-gateway, the gateway process sometimes hangs during shutdown and is killed with SIGKILL by systemd once TimeoutStopSec expires (60s in the logs below). This leaves a stale PID file, causing the new gateway instance to fail with "PID file race lost to another gateway instance. Exiting."

Steps to Reproduce

  1. Configure Hermes gateway with Feishu/Lark platform adapter
  2. Run systemctl --user restart hermes-gateway
  3. If the Feishu WebSocket thread happens to be blocked (e.g., waiting for network I/O), the gateway hangs during shutdown
  4. After 60 seconds, systemd sends SIGKILL
  5. New instance starts but fails with "PID file race lost" error
  6. Gateway enters restart loop until manually fixed

Root Cause

The shutdown sequence in gateway/run.py calls await adapter.disconnect() for each platform adapter without a timeout. If any adapter's disconnect() method blocks (e.g., Feishu adapter's WebSocket thread waiting for network response), the entire shutdown process hangs.

When systemd sends SIGKILL after timeout, Python's atexit handlers don't run, so the PID file (~/.hermes/gateway.pid) is never cleaned up. The new instance sees the stale PID file and exits with "PID file race lost".
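To illustrate the stale-file situation described above, here is a minimal sketch of how a startup check could tell a stale PID file from one owned by a live process. It is not Hermes's actual logic: the function names are hypothetical, and it assumes the PID file holds a single integer, which may not match the real file format.

```python
import os
import tempfile
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking, nothing is sent
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def pid_file_is_stale(pid_file: Path) -> bool:
    """Return True if pid_file exists but does not name a live process.

    Assumption: the file contains just the PID as text.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except FileNotFoundError:
        return False  # no file, so nothing to clean up
    except ValueError:
        return True  # unreadable contents, treat as stale
    return not pid_is_alive(pid)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = Path(tmp) / "gateway.pid"
        pid_file.write_text(str(os.getpid()))
        # Our own PID is alive, so this file is not stale.
        print(pid_file_is_stale(pid_file))
```

A liveness check like this cannot rule out PID reuse by an unrelated process, so it would complement, not replace, a clean shutdown that removes the file.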

Relevant Logs

Apr 23 03:21:57 python[1782979]: WARNING gateway.run: Shutdown diagnostic — other hermes processes running:
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: State 'stop-sigterm' timed out. Killing.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Killing process 1782979 (python) with signal SIGKILL.
Apr 23 03:22:57 systemd[965]: hermes-gateway.service: Failed with result 'timeout'.
Apr 23 03:22:58 python[1783144]: ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Proposed Fix

Add a timeout wrapper around adapter.disconnect() in the shutdown sequence:

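_adapter_disconnect_timeout = 15.0  # seconds allowed per adapter
for platform, adapter in list(self.adapters.items()):
    try:
        # Bound each disconnect so one stuck adapter cannot stall shutdown.
        await asyncio.wait_for(adapter.disconnect(), timeout=_adapter_disconnect_timeout)
        logger.info("✓ %s disconnected", platform.value)
    except asyncio.TimeoutError:
        logger.warning(
            "✗ %s disconnect timed out after %.1fs - forcing continue",
            platform.value, _adapter_disconnect_timeout
        )
    except Exception:
        # Any other failure should not stop the remaining adapters from disconnecting.
        logger.exception("✗ %s disconnect failed", platform.value)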

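This bounds the time spent on each adapter to 15 seconds, so the shutdown sequence can finish and clean up the PID file before systemd's stop timeout expires, provided the combined disconnect time of all adapters stays below that timeout.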
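The effect of the timeout wrapper can be tried in isolation. The sketch below is not the gateway code: FastAdapter, HungAdapter, and disconnect_all are hypothetical stand-ins, with one adapter whose disconnect() never returns on its own.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.demo")


class FastAdapter:
    """Hypothetical adapter whose disconnect() returns promptly."""

    async def disconnect(self) -> None:
        await asyncio.sleep(0)


class HungAdapter:
    """Hypothetical adapter whose disconnect() never completes on its own."""

    async def disconnect(self) -> None:
        await asyncio.Event().wait()  # the event is never set


async def disconnect_all(adapters: dict, timeout: float) -> dict:
    """Disconnect each adapter in turn and record the outcome per name."""
    results = {}
    for name, adapter in list(adapters.items()):
        try:
            await asyncio.wait_for(adapter.disconnect(), timeout=timeout)
            results[name] = "ok"
        except asyncio.TimeoutError:
            logger.warning("%s disconnect timed out after %.1fs", name, timeout)
            results[name] = "timeout"
        except Exception:
            logger.exception("%s disconnect failed", name)
            results[name] = "error"
    return results


if __name__ == "__main__":
    adapters = {"fast": FastAdapter(), "hung": HungAdapter()}
    print(asyncio.run(disconnect_all(adapters, timeout=0.1)))
```

One caveat: asyncio.wait_for enforces the timeout by cancelling the awaited coroutine, so it only helps when disconnect() is suspended at an await. If an adapter blocks the event loop with a synchronous call, such as joining a thread directly, the timeout cannot fire and that call would need to be moved off the loop first.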

Environment

  • Hermes Agent version: latest main branch
  • Platform: Feishu/Lark
  • OS: Linux (systemd user service)

Labels

  • P1: High (major feature broken, no workaround)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • platform/feishu: Feishu / Lark adapter
  • type/bug: Something isn't working
