fix(gateway): clear stale PID file from crashed gateway on startup (closes #13655)#13709
Closed
RhythrosaLabs wants to merge 1 commit into
Closed
fix(gateway): clear stale PID file from crashed gateway on startup (closes #13655)#13709RhythrosaLabs wants to merge 1 commit into
RhythrosaLabs wants to merge 1 commit into
Conversation
_cleanup_invalid_pid_path() delegated to remove_pid_file(), which
refuses to delete files whose recorded PID != os.getpid(). After a
gateway crash (SIGKILL, OOM, systemd stop timeout) the atexit handler
never fires, leaving gateway.pid on disk with the dead process's PID.
On the next startup:
1. get_running_pid() reads the dead PID, raises ProcessLookupError
from os.kill(pid, 0), calls _cleanup_invalid_pid_path() -- which
calls remove_pid_file() -- which sees file_pid != os.getpid() and
leaves the file intact -- and returns None.
2. The 'already running' guard is skipped (correctly, since None).
3. write_pid_file() tries O_CREAT|O_EXCL -- FileExistsError because
the stale file was never removed.
4. 'PID file race lost to another gateway instance. Exiting.'
5. systemd Restart=on-failure retries indefinitely until an operator
manually deletes ~/.hermes/gateway.pid.
Fix: in _cleanup_invalid_pid_path(), force-unlink the path directly
(missing_ok=True) instead of going through remove_pid_file(). At the
point this function is called we have already confirmed the recorded
PID is dead or invalid, so the ownership guard in remove_pid_file() is
wrong. The concurrent-write race (two --replace instances) is still
handled correctly: both see the dead file, both unlink (idempotent),
then their respective write_pid_file() O_CREAT|O_EXCL calls race as
intended, with exactly one winner.
Adds two regression tests:
- test_get_running_pid_removes_stale_pid_file_after_crash
- test_write_pid_file_succeeds_after_stale_pid_cleared
Fixes NousResearch#13655
Collaborator
1 similar comment
Collaborator
This was referenced Apr 21, 2026
This was referenced Apr 23, 2026
Contributor
|
Thanks for the well-researched PR and clear root cause analysis — the call chain diagram and the explanation of why This is an automated hermes-sweeper review. The fix has already landed on
Closing as implemented. Your analysis directly informed the fix — appreciate the contribution! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #13655
After a gateway process is killed with SIGKILL (or dies from OOM, systemd stop-timeout, etc.) the handler never fires and is left on disk containing the dead process's PID. On the next startup this stale file causes an unrecoverable restart loop.
Root Cause
The call chain on startup was:
remove_pid_file()intentionally guards against removing a file belonging to another live process (the --replace handoff race), but after a crash the guarded process is dead — the guard is wrong here.Fix
In
_cleanup_invalid_pid_path(), replace the call toremove_pid_file()with a direct force-unlink (pid_path.unlink(missing_ok=True)). By the time this helper is called we have already confirmed the recorded PID is dead or invalid, so skipping the ownership guard is correct.The concurrent-write race (
--replacewith two competing starters) is unaffected: both processes see the stale file, both unlink it (idempotent), then theirwrite_pid_file()O_CREAT|O_EXCLcalls race as intended — exactly one wins.Changes
gateway/status.py_cleanup_invalid_pid_path(): directunlink()instead ofremove_pid_file()tests/gateway/test_status.pyRegression Tests Added
All 29 existing tests in
tests/gateway/test_status.pycontinue to pass.Before / After
Checklist
--replacerace path