When the gateway crashes or is killed and its PID is no longer alive, a subsequent gateway run --replace fails immediately with:
ERROR gateway.run: PID file race lost to another gateway instance. Exiting.
The gateway then exits with code 1 and systemd's Restart=on-failure creates a flap loop that cannot self-recover without manually deleting gateway.pid.
Root Cause
In gateway/run.py, the --replace startup flow is:
1. existing_pid = get_running_pid() (line ~10799)
2. If existing_pid is not None: terminate old process, then unlink gateway.pid (lines ~10801-10857)
3. write_pid_file() with O_CREAT|O_EXCL (line ~11012)
When the old gateway is already dead, get_running_pid() correctly returns None (the PID is not found, and get_running_pid is expected to clean up the file). However, step 2 is skipped entirely because its condition is existing_pid is not None. The stale gateway.pid file is supposed to be unlinked by get_running_pid's internal _cleanup_invalid_pid_path call, but there is a race: systemd may restart the gateway process before the file is deleted, or the file may persist if cleanup_stale doesn't run in time.
More critically: if --replace is specified, the startup should always ensure the PID file is gone before attempting write_pid_file(), regardless of whether the old process is alive or dead. The current code only cleans up inside the if existing_pid is not None block.
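The failure mode can be reproduced in isolation. The sketch below is a minimal stand-in, not the actual gateway code: write_pid_file_exclusive is a hypothetical helper that mimics the O_CREAT|O_EXCL behaviour attributed to write_pid_file() above, and a temporary directory stands in for the Hermes home.

```python
import os
import tempfile
from pathlib import Path


def write_pid_file_exclusive(pid_path: Path) -> None:
    """Hypothetical stand-in for write_pid_file(): create the file exclusively."""
    # O_EXCL makes os.open fail with FileExistsError if the path already
    # exists, no matter whether the PID recorded inside it is alive.
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))


def demo_stale_file_blocks_startup() -> bool:
    """Return True if a leftover PID file makes the exclusive create fail."""
    with tempfile.TemporaryDirectory() as home:
        pid_path = Path(home) / "gateway.pid"
        pid_path.write_text("999999")  # leftover file from a dead process
        try:
            write_pid_file_exclusive(pid_path)
        except FileExistsError:
            return True
        return False
```

The exclusive create never inspects the file's contents, so any leftover file blocks startup until something removes it.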
Reproduction
- Start the gateway via systemd (hermes-gateway.service with Restart=on-failure)
- Kill the gateway process (SIGKILL, OOM, etc.)
- Note that ~/.hermes/gateway.pid still exists with the dead PID
- Systemd restarts the gateway → get_running_pid() returns None → stale PID file not cleaned by the replace block → write_pid_file() raises FileExistsError → gateway exits 1 → systemd restarts again → flap loop
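To confirm the third step, a small check along these lines can tell whether the PID recorded in the file is still alive. This is a sketch for POSIX systems only, and it assumes the file holds a bare integer PID, which this report does not state. read_pid and pid_is_alive are hypothetical helpers, not functions from gateway/run.py.

```python
import os
from pathlib import Path
from typing import Optional


def read_pid(pid_path: Path) -> Optional[int]:
    """Return the integer PID stored in pid_path, or None if missing or unparseable."""
    try:
        return int(pid_path.read_text().strip())
    except (OSError, ValueError):
        return None


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True
```

For example, calling pid_is_alive(read_pid(Path.home() / ".hermes" / "gateway.pid")) after the kill should return False while the file is still present, provided read_pid did not return None.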
Fix
After the if existing_pid is not None block and before write_pid_file(), add cleanup when --replace is set:
# (after the existing_pid block, around line 10870)
if replace:
    # Ensure stale PID file is removed even when the old process
    # is already dead (get_running_pid returned None). Without this,
    # write_pid_file()'s O_CREAT|O_EXCL races with a leftover file.
    try:
        (get_hermes_home() / "gateway.pid").unlink(missing_ok=True)
    except Exception:
        pass
This matches the existing force-unlink on line ~10855 but covers the case where the replace logic was skipped because the old process was already gone.
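The effect of the proposed ordering can be checked outside the gateway. The following is a self-contained sketch under stated assumptions: claim_pid_file is a hypothetical function combining the unconditional unlink with an exclusive create, and it is not the code in gateway/run.py.

```python
import os
import tempfile
from pathlib import Path


def claim_pid_file(pid_path: Path, replace: bool) -> int:
    """Hypothetical startup step: optionally clear a leftover file, then create exclusively."""
    if replace:
        # Mirrors the proposed fix: remove the file unconditionally so a dead
        # predecessor's leftover cannot trip the exclusive create below.
        try:
            pid_path.unlink(missing_ok=True)  # missing_ok needs Python 3.8+
        except OSError:
            pass
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))
    return os.getpid()


def demo_replace_recovers() -> str:
    """Start with a stale file and return the PID file contents after a --replace style claim."""
    with tempfile.TemporaryDirectory() as home:
        pid_path = Path(home) / "gateway.pid"
        pid_path.write_text("999999")  # leftover from a dead process
        claim_pid_file(pid_path, replace=True)
        return pid_path.read_text()
```

With replace=False the same stale file still raises FileExistsError, so the non-replace path keeps its existing protection. The unconditional unlink does not by itself arbitrate between two simultaneous --replace starts.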
Workaround
sudo systemctl stop hermes-gateway
rm -f ~/.hermes/gateway.pid
sudo systemctl reset-failed hermes-gateway
sudo systemctl start hermes-gateway
Environment
- Hermes Agent: latest (post-v2.47)
- OS: Ubuntu inside Proxmox LXC
- Systemd: system-level service (not user systemd)