Gateway fails to start with --replace when previous instance PID is already dead (stale gateway.pid) #14203

@SDMD92

Description

When the gateway crashes or is killed and its PID is no longer alive, a subsequent gateway run --replace fails immediately with:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

The gateway then exits with code 1, and systemd's Restart=on-failure turns this into a flap loop that cannot recover until gateway.pid is deleted manually.

Root Cause

In gateway/run.py, the --replace startup flow is:

  1. existing_pid = get_running_pid() (line ~10799)
  2. If existing_pid is not None: terminate old process, then unlink gateway.pid (lines ~10801-10857)
  3. write_pid_file() with O_CREAT|O_EXCL (line ~11012)

When the old gateway is already dead, get_running_pid() correctly returns None: the PID is not found, and the function's internal _cleanup_invalid_pid_path call is expected to unlink the stale gateway.pid. Step 2 is then skipped entirely, because its condition is existing_pid is not None. That leaves the internal cleanup as the only thing removing the file, and it is racy: systemd may restart the gateway process before the file is deleted, or the file may persist if cleanup_stale doesn't run in time.

More critically: if --replace is specified, the startup should always ensure the PID file is gone before attempting write_pid_file(), regardless of whether the old process is alive or dead. The current code only cleans up inside the if existing_pid is not None block.
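The failure mode described above can be illustrated in isolation. The sketch below is not the gateway's code: write_pid_exclusive is a hypothetical stand-in for write_pid_file(), assumed only to use the same O_CREAT|O_EXCL flags, and the PID written into the leftover file is arbitrary.

```python
import os
import tempfile
from pathlib import Path


def write_pid_exclusive(pid_path: Path) -> None:
    """Hypothetical stand-in for write_pid_file(): create the file only if absent."""
    # O_EXCL combined with O_CREAT makes os.open fail when the path already exists.
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))


def stale_file_blocks_startup() -> bool:
    """Return True if a leftover PID file makes the exclusive create fail."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        pid_path.write_text("999999")  # leftover file; the PID value is arbitrary
        try:
            write_pid_exclusive(pid_path)
        except FileExistsError:
            return True
        return False
```

The exclusive create does not look at the file's contents, so a file naming a dead PID blocks startup exactly as a file owned by a live instance would.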

Reproduction

  1. Start the gateway via systemd (hermes-gateway.service with Restart=on-failure)
  2. Kill the gateway process (SIGKILL, OOM, etc.)
  3. Note that ~/.hermes/gateway.pid still exists with the dead PID
  4. Systemd restarts the gateway → get_running_pid() returns None → stale PID file not cleaned by the replace block → write_pid_file() raises FileExistsError → gateway exits 1 → systemd restarts again → flap loop
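To confirm step 3, it may help to check whether the PID recorded in the file still names a live process. This is a minimal sketch for POSIX systems; pid_is_alive and pid_file_is_stale are hypothetical helpers, not functions from the gateway codebase.

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one process
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pid_file_is_stale(pid_path: Path) -> bool:
    """True if the file exists but names no live process (or cannot be parsed)."""
    try:
        pid = int(pid_path.read_text().strip())
    except FileNotFoundError:
        return False  # no file at all, so nothing is stale
    except ValueError:
        return True  # unparseable content cannot belong to a running instance
    return not pid_is_alive(pid)
```

Calling pid_file_is_stale(Path.home() / ".hermes" / "gateway.pid") after step 2 should return True if the reproduction is on track. A recycled PID could produce a false negative, so treat the result as a hint rather than proof.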

Fix

After the if existing_pid is not None block and before write_pid_file(), add cleanup when --replace is set:

        # (after the existing_pid block, around line 10870)
        if replace:
            # Ensure stale PID file is removed even when the old process
            # is already dead (get_running_pid returned None).  Without this,
            # write_pid_file()'s O_CREAT|O_EXCL races with a leftover file.
            try:
                (get_hermes_home() / "gateway.pid").unlink(missing_ok=True)
            except Exception:
                pass

This matches the existing force-unlink on line ~10855 but covers the case where the replace logic was skipped because the old process was already gone.
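The effect of the proposed cleanup can be checked outside the gateway with a self-contained sketch. claim_pid_file is a hypothetical function that folds the unconditional unlink and an exclusive create into one step; it assumes, as in the flow above, that any live previous instance has already been terminated before this point.

```python
import os
from pathlib import Path


def claim_pid_file(pid_path: Path, replace: bool) -> None:
    """Hypothetical startup step: optionally clear a leftover file, then create exclusively."""
    if replace:
        # Unconditional cleanup: runs whether or not a live process was found earlier.
        pid_path.unlink(missing_ok=True)
    # Exclusive create still guards against a concurrent instance claiming the file.
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
```

With replace=True, a leftover file no longer blocks startup; with replace=False, the call still raises FileExistsError. The unlink is only safe after the old process has been stopped, since otherwise it would remove a live instance's PID file.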

Workaround

sudo systemctl stop hermes-gateway
rm -f ~/.hermes/gateway.pid
sudo systemctl reset-failed hermes-gateway
sudo systemctl start hermes-gateway

Environment

  • Hermes Agent: latest (post-v2.47)
  • OS: Ubuntu inside Proxmox LXC
  • Systemd: system-level service (not user systemd)

Metadata

Labels

  • P1 (High: major feature broken, no workaround)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • type/bug (Something isn't working)