fix(gateway): clear stale PID file from crashed gateway on startup (closes #13655) #13709

Closed
RhythrosaLabs wants to merge 1 commit into NousResearch:main from RhythrosaLabs:fix/stale-pid-restart-loop

@RhythrosaLabs

Summary

Closes #13655

After a gateway process is killed with SIGKILL (or dies from OOM, a systemd stop timeout, etc.), the atexit handler never fires and gateway.pid is left on disk containing the dead process's PID. On the next startup, this stale file causes a restart loop that does not recover until the file is deleted manually.

Root Cause

The call chain on startup was:

get_running_pid()
  → os.kill(dead_pid, 0)          # ProcessLookupError — process is gone
  → _cleanup_invalid_pid_path()
      → remove_pid_file()          # BUG: checks file_pid != os.getpid()
                                   #      dead_pid ≠ new_pid → does nothing
  → returns None

write_pid_file()                   # O_CREAT | O_EXCL
  → FileExistsError                # stale file still present
  → 'PID file race lost. Exiting.'
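
The first step of the chain depends on how signal 0 behaves. The following is a minimal sketch of that liveness probe under POSIX semantics; `pid_is_alive` and `spawn_and_reap` are hypothetical names used for illustration and are not taken from gateway/status.py.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX semantics)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without sending a signal
    except ProcessLookupError:
        return False  # no such process: the recorded PID is dead
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def spawn_and_reap() -> int:
    """Start a short-lived child, wait for it to exit, and return its PID."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # after wait() the child has been reaped, so its PID is dead
    return child.pid
```

A probe like this only answers whether the PID exists right now. It says nothing about whether the PID file itself was cleaned up, which is where the chain above goes wrong.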

remove_pid_file() intentionally guards against removing a file belonging to another live process (the --replace handoff race), but after a crash the guarded process is dead — the guard is wrong here.
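
To make the failure concrete, here is a hedged reproduction in isolation. It assumes the PID file holds a plain integer (the real file format may differ), and `guarded_remove`, `exclusive_write`, and `reproduce_stale_loop` are hypothetical stand-ins for the behaviour described above, not the project's actual functions.

```python
import os
import tempfile
from pathlib import Path


def guarded_remove(pid_path: Path) -> None:
    """Remove the PID file only if it records the current process's PID."""
    try:
        file_pid = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return
    if file_pid != os.getpid():
        return  # ownership guard: the file belongs to a different PID
    pid_path.unlink(missing_ok=True)


def exclusive_write(pid_path: Path) -> None:
    """Create the PID file atomically; raises FileExistsError if it exists."""
    fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))


def reproduce_stale_loop() -> bool:
    """Return True if a stale file survives cleanup and blocks the write."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        # Any PID other than ours; stands in for the crashed gateway's PID.
        pid_path.write_text(str(os.getpid() + 1))
        guarded_remove(pid_path)  # does nothing because the PIDs differ
        try:
            exclusive_write(pid_path)
        except FileExistsError:
            return True  # the startup path would exit here
        return False
```

Under these assumptions `reproduce_stale_loop()` returns True: the guard leaves the stale file in place and the exclusive create fails on every restart.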

Fix

In _cleanup_invalid_pid_path(), replace the call to remove_pid_file() with a direct force-unlink (pid_path.unlink(missing_ok=True)). By the time this helper is called we have already confirmed the recorded PID is dead or invalid, so skipping the ownership guard is correct.
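
A minimal sketch of the fixed cleanup path follows. It shows only the force-unlink idea; `cleanup_invalid_pid_path` and `restart_after_crash` are illustrative names, the placeholder PID is arbitrary, and the real helper in gateway/status.py may do more than this.

```python
import os
import tempfile
from pathlib import Path


def cleanup_invalid_pid_path(pid_path: Path) -> None:
    """Force-remove a PID file whose recorded PID is already known to be dead."""
    pid_path.unlink(missing_ok=True)  # missing_ok requires Python 3.8+


def restart_after_crash() -> int:
    """Simulate a restart after a crash and return the PID left in the file."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        pid_path.write_text("999999")  # arbitrary placeholder for the dead PID
        cleanup_invalid_pid_path(pid_path)
        cleanup_invalid_pid_path(pid_path)  # second call is a harmless no-op
        fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))
        return int(pid_path.read_text())
```

With the stale file gone, the exclusive create succeeds and the file records the new process's PID.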

The concurrent-write race (--replace with two competing starters) is unaffected: both processes see the stale file, both unlink it (idempotent), then their write_pid_file() O_CREAT|O_EXCL calls race as intended — exactly one wins.
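
The ordering described above can be sketched with threads standing in for competing starters. This is an illustration under a stated assumption: a barrier forces every unlink to finish before any create, which is the ordering the paragraph describes. It does not model an interleaving in which one starter unlinks after another has already written its file. `race_starters` is a hypothetical name.

```python
import os
import tempfile
import threading
from pathlib import Path


def race_starters(n_starters: int = 2) -> int:
    """Return how many starters succeed in creating the PID file."""
    winners = []
    barrier = threading.Barrier(n_starters)

    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        pid_path.write_text("999999")  # arbitrary placeholder for the dead PID

        def starter(name: str) -> None:
            pid_path.unlink(missing_ok=True)  # idempotent across starters
            barrier.wait()  # assumption: all unlinks happen before any create
            try:
                fd = os.open(pid_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            except FileExistsError:
                return  # lost the race
            os.close(fd)
            winners.append(name)

        threads = [
            threading.Thread(target=starter, args=(f"starter-{i}",))
            for i in range(n_starters)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    return len(winners)
```

Because `O_CREAT | O_EXCL` makes creation atomic on a local filesystem, only one starter can create the file once all unlinks are done, whatever the number of starters.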

Changes

| File | Change |
| --- | --- |
| gateway/status.py | _cleanup_invalid_pid_path(): direct unlink() instead of remove_pid_file() |
| tests/gateway/test_status.py | Two new regression tests covering the crash→restart scenario |

Regression Tests Added

All 29 existing tests in tests/gateway/test_status.py continue to pass.

Before / After

Checklist

_cleanup_invalid_pid_path() delegated to remove_pid_file(), which
refuses to delete files whose recorded PID != os.getpid().  After a
gateway crash (SIGKILL, OOM, systemd stop timeout) the atexit handler
never fires, leaving gateway.pid on disk with the dead process's PID.

On the next startup:
  1. get_running_pid() reads the dead PID, raises ProcessLookupError
     from os.kill(pid, 0), calls _cleanup_invalid_pid_path() -- which
     calls remove_pid_file() -- which sees file_pid != os.getpid() and
     leaves the file intact -- and returns None.
  2. The 'already running' guard is skipped (correctly, since None).
  3. write_pid_file() tries O_CREAT|O_EXCL -- FileExistsError because
     the stale file was never removed.
  4. 'PID file race lost to another gateway instance. Exiting.'
  5. systemd Restart=on-failure retries indefinitely until an operator
     manually deletes ~/.hermes/gateway.pid.

Fix: in _cleanup_invalid_pid_path(), force-unlink the path directly
(missing_ok=True) instead of going through remove_pid_file().  At the
point this function is called we have already confirmed the recorded
PID is dead or invalid, so the ownership guard in remove_pid_file() is
wrong.  The concurrent-write race (two --replace instances) is still
handled correctly: both see the dead file, both unlink (idempotent),
then their respective write_pid_file() O_CREAT|O_EXCL calls race as
intended, with exactly one winner.

Adds two regression tests:
  - test_get_running_pid_removes_stale_pid_file_after_crash
  - test_write_pid_file_succeeds_after_stale_pid_cleared

Fixes NousResearch#13655
@alt-glitch added the type/bug (Something isn't working) and comp/gateway (Gateway runner, session dispatch, delivery) labels on Apr 21, 2026
@alt-glitch

Collaborator

Likely duplicate of #13658 — both fix stale PID file cleanup on gateway startup after hard crash. Also overlaps with #9703. Recommend consolidating.

@teknium1

Contributor

Thanks for the well-researched PR and clear root cause analysis — the call chain diagram and the explanation of why remove_pid_file()'s ownership guard is wrong in the crash scenario are exactly right.

This is an automated hermes-sweeper review. The fix has already landed on main via a consolidated salvage:

Closing as implemented. Your analysis directly informed the fix — appreciate the contribution!

Development

Successfully merging this pull request may close these issues.

Bug: Stale gateway.pid causes gateway restart loop after crash/SIGKILL
