Bug: Stale gateway.pid causes gateway restart loop after crash/SIGKILL #13655

@ObiJuanDeanobi

Severity

Medium — causes complete gateway service outage requiring manual intervention

Affected Versions

Current stable as of 2026-04-21

Problem Description

The Hermes Gateway service enters a restart loop when the Python gateway process is killed unexpectedly (SIGKILL, OOM, crash). On restart, the gateway.pid file still exists with the dead process's PID. The gateway startup logic treats this as a live process conflict and exits immediately with:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

Because the service is configured with Restart=on-failure and RestartSec=30, systemd retries every 30 seconds and fails each time, eventually hitting StartLimitBurst=5, at which point it stops restarting the unit. The service stays down until an operator manually deletes ~/.hermes/gateway.pid.

Root Cause

In gateway/run.py, the startup sequence:

  1. Reads gateway.pid to check for an existing gateway
  2. Writes gateway.pid, which fails with FileExistsError if the file already exists, including when the old process is gone and its file was never cleaned up
  3. Registers atexit.register(remove_pid_file) to delete the PID file on clean exit

The problem: if the process is killed with SIGKILL or crashes, the atexit handler never fires and gateway.pid is left behind. The next startup sees the stale file, treats it as a conflict, and exits without attempting to validate whether the PID is actually alive.
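The failure mode is easy to reproduce outside the gateway. The following is a minimal sketch, not the gateway's actual code: `CHILD_SOURCE` and `pid_file_survives` are hypothetical names, and the child process only imitates the write-then-register-atexit sequence described above. It assumes a POSIX system, where `Popen.kill()` sends SIGKILL.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# Hypothetical stand-in for the gateway: create the PID file exclusively,
# register an atexit cleanup, then block until stdin reaches EOF.
CHILD_SOURCE = textwrap.dedent(
    """
    import atexit, os, sys
    pid_path = sys.argv[1]
    with open(pid_path, "x") as fh:      # "x" raises FileExistsError if present
        fh.write(str(os.getpid()))
    atexit.register(os.remove, pid_path)
    print("ready", flush=True)
    sys.stdin.readline()                 # returns on EOF, then the script ends
    """
)


def pid_file_survives(hard_kill: bool) -> bool:
    """Run the child, stop it, and report whether its PID file is left behind."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "gateway.pid")
        with subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE, pid_path],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        ) as child:
            child.stdout.readline()      # wait until the PID file is written
            if hard_kill:
                child.kill()             # SIGKILL on POSIX: atexit never runs
            else:
                child.stdin.close()      # EOF: child exits normally
            child.wait()
            return os.path.exists(pid_path)
```

Calling `pid_file_survives(hard_kill=True)` leaves the file in place, while `pid_file_survives(hard_kill=False)` does not, which matches the behavior described above: cleanup that depends on atexit only covers clean exits.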

Recommended Fix (In Gateway Code)

The gateway startup should validate whether a PID in gateway.pid is actually alive before treating it as a conflict:

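# In gateway/run.py, before writing the PID file (needs `import os`):
stale_pid = get_running_pid()
if stale_pid is not None:
    try:
        os.kill(stale_pid, 0)  # Signal 0 delivers nothing; it only checks that we can signal the PID
        # Real gateway running: exit
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False
    except ProcessLookupError:
        # Stale PID: the file exists but the process is dead, safe to overwrite
        remove_pid_file()
    except PermissionError:
        # The PID now belongs to a process we cannot signal, so it is not a
        # gateway owned by this user (most likely PID reuse). Treat as stale.
        remove_pid_file()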

This makes the gateway resilient to crashes and eliminates the need for the workaround below.
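For reference, here is a self-contained sketch of the same idea that can be run and tested in isolation. The helper names (`read_pid`, `owned_process_alive`, `claim_pid_file`) are hypothetical and do not come from the Hermes codebase. It follows the proposal above in treating a PID we are not permitted to signal as "not our gateway", and it does not try to close every race between concurrent starters.

```python
import os
from pathlib import Path
from typing import Optional


def read_pid(pid_path: Path) -> Optional[int]:
    """Return the PID stored in the file, or None if missing or unparsable."""
    try:
        return int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def owned_process_alive(pid: int) -> bool:
    """True if a process with this PID exists and we are allowed to signal it."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)        # signal 0: existence/permission check only
    except ProcessLookupError:
        return False           # no such process: the PID is stale
    except PermissionError:
        return False           # exists, but not ours (assumed PID reuse)
    return True


def claim_pid_file(pid_path: Path) -> bool:
    """Write our PID, clearing a stale file first. False means a live conflict."""
    existing = read_pid(pid_path)
    if existing is not None and existing != os.getpid() and owned_process_alive(existing):
        return False
    pid_path.unlink(missing_ok=True)     # stale, empty, or unparsable file
    try:
        with open(pid_path, "x") as fh:  # exclusive create
            fh.write(str(os.getpid()))
    except FileExistsError:
        return False                     # another starter created it first
    return True
```

A caller would use it as `if not claim_pid_file(Path.home() / ".hermes" / "gateway.pid"): exit with an error`. A stricter variant could treat `PermissionError` as "alive", which is the more common convention when processes of several users may share one PID file.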

Workaround (Service Definition)

As a workaround, I added an ExecStartPre step to the systemd service template in hermes_cli/gateway.py:

[Service]
Type=simple
ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid
ExecStart=...gateway run --replace

This clears any stale PID file before every service start, breaking the restart loop.

Quick Fix (One-Liner)

When this happens, clear the lock and restart:

rm -f ~/.hermes/gateway.pid && hermes gateway start

Metadata

Assignees

No one assigned

Labels

P1 (High: major feature broken, no workaround)
comp/gateway (Gateway runner, session dispatch, delivery)
sweeper:implemented-on-main (Sweeper: behavior already present on current main)
type/bug (Something isn't working)
