Severity
Medium — causes complete gateway service outage requiring manual intervention
Affected Versions
Current stable as of 2026-04-21
Problem Description
The Hermes Gateway service enters a restart loop when the Python gateway process is killed unexpectedly (SIGKILL, OOM, crash). On restart, the gateway.pid file still exists with the dead process's PID. The gateway startup logic treats this as a live process conflict and exits immediately with:
ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
Because the service is configured with Restart=on-failure and RestartSec=30, systemd re-attempts every 30 seconds and fails repeatedly, eventually hitting StartLimitBurst=5 and rate-limiting itself. The service becomes unreachable until an operator manually deletes ~/.hermes/gateway.pid.
Root Cause
In gateway/run.py, the startup sequence:
- Reads gateway.pid to check for an existing gateway
- Writes gateway.pid (which fails with FileExistsError if the file already exists and the old PID is gone but the file wasn't cleaned up)
- Registers atexit.register(remove_pid_file) to delete the PID file on clean exit
The problem: if the process is killed with SIGKILL or crashes, the atexit handler never fires and gateway.pid is left behind. The next startup sees the stale file, treats it as a conflict, and exits without attempting to validate whether the PID is actually alive.
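The failure mode can be reproduced in isolation. The following is a minimal sketch, not the gateway's code: a child process writes a PID file, registers an atexit hook that deletes it, and is then killed with SIGKILL. The file name demo.pid and the helper name are hypothetical, and the sketch assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Child script: writes its PID, registers an atexit cleanup, then sleeps.
CHILD = """
import atexit, os, sys, time
pid_path = sys.argv[1]
with open(pid_path, "w") as f:
    f.write(str(os.getpid()))
atexit.register(os.remove, pid_path)
time.sleep(60)
"""


def pid_file_survives_sigkill() -> bool:
    """Return True if the PID file is still present after the child is SIGKILLed."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "demo.pid")  # hypothetical file name
        child = subprocess.Popen([sys.executable, "-c", CHILD, pid_path])
        # Wait (up to 10 s) until the child has written its PID file.
        deadline = time.monotonic() + 10
        while time.monotonic() < deadline:
            if os.path.exists(pid_path) and os.path.getsize(pid_path) > 0:
                break
            time.sleep(0.05)
        child.send_signal(signal.SIGKILL)  # cannot be caught, so atexit hooks are skipped
        child.wait()
        return os.path.exists(pid_path)
```

On a POSIX system the function is expected to return True: the PID file outlives the process that was supposed to delete it, which is the state the next gateway start encounters.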
Recommended Fix (In Gateway Code)
The gateway startup should validate whether a PID in gateway.pid is actually alive before treating it as a conflict:
# In gateway/run.py, before writing the PID file:
stale_pid = get_running_pid()
if stale_pid is not None:
    try:
        os.kill(stale_pid, 0)  # signal 0: checks that the PID exists and can be signalled
        # Real gateway running: exit
        logger.error("Another gateway instance (PID %d) is running. Exiting.", stale_pid)
        return False
    except (ProcessLookupError, PermissionError):
        # ProcessLookupError: the process is dead, so the file is stale.
        # PermissionError: the PID exists but cannot be signalled by this user,
        # so it is treated as not being this user's gateway.
        # In both cases the file is safe to overwrite.
        remove_pid_file()
This makes the gateway resilient to crashes and eliminates the need for the workaround below.
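A self-contained version of the same check, runnable outside the gateway, might look like the sketch below. The helper names pid_is_alive and claim_pid_file and the pid_path parameter are hypothetical and not part of the Hermes code base. The sketch assumes a POSIX system, where os.kill(pid, 0) performs an existence check without delivering a signal. Unlike the snippet above, it reports a PermissionError as a live process, which is the more conservative reading; which behavior is right for the gateway is a design choice.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def claim_pid_file(pid_path: Path) -> bool:
    """Write our PID to pid_path unless a live process already holds it."""
    try:
        recorded = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        recorded = None  # missing or unparsable file: nothing to conflict with
    if recorded is not None and recorded != os.getpid() and pid_is_alive(recorded):
        return False  # a live process owns the file
    pid_path.write_text(str(os.getpid()))  # overwrite a stale or absent file
    return True
```

Two limitations apply to any check of this kind. A recorded PID may have been reused by an unrelated process, which the check cannot distinguish from a running gateway. There is also a window between the liveness check and the write in which a second instance could start.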
Workaround (Service Definition)
As a workaround, an ExecStartPre step was added to the systemd service template in hermes_cli/gateway.py:
[Service]
Type=simple
ExecStartPre=/bin/rm -f {hermes_home}/gateway.pid
ExecStart=...gateway run --replace
This clears any stale PID file before every service start, breaking the restart loop.
Quick Fix (One-Liner)
When this happens, clear the lock and restart:
rm -f ~/.hermes/gateway.pid && hermes gateway start