
fix: recover from stale PID file on gateway startup #14388

Closed
eLeanwang wants to merge 1 commit into NousResearch:main from eLeanwang:fix/stale-pid-file-recovery

Conversation

@eLeanwang


When the gateway is killed by SIGKILL (systemd timeout, OOM), atexit handlers don't run and gateway.pid persists. On the next startup, write_pid_file() hits FileExistsError and logs 'PID file race lost', causing systemd to exhaust its restart attempts.

Fix: on FileExistsError, check if the recorded PID is alive via get_running_pid(cleanup_stale=True). If dead, clean up the stale file and retry write_pid_file() once. Genuine races (live competing instance) still fail immediately.

When a gateway process is killed by SIGKILL (e.g. systemd
TimeoutStopSec exceeded, OOM killer), atexit handlers never run and
the PID file is left behind. On the next startup, write_pid_file()
hits FileExistsError and logs "PID file race lost", causing systemd
to exhaust its restart burst (StartLimitBurst=5) and leave the
gateway in a failed state requiring manual intervention.

Fix: when write_pid_file() raises FileExistsError, call
get_running_pid(cleanup_stale=True) to check whether the recorded
PID is still alive. If the process is gone, the stale file is
cleaned up automatically and write_pid_file() is retried once. A
genuine race (live competing instance) still fails immediately.
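The retry described above can be sketched roughly as follows. This is a minimal, POSIX-only illustration and not the code in this PR: `pid_is_alive`, `write_pid_exclusive`, `read_live_pid`, and `claim_pid_file` are hypothetical stand-ins for the liveness check, `write_pid_file()`, `get_running_pid(cleanup_stale=True)`, and the caller in the gateway, whose real signatures may differ.

```python
import os
from pathlib import Path
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence, delivers nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def write_pid_exclusive(path: Path) -> None:
    """Create the PID file atomically; raise FileExistsError if present."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))


def read_live_pid(path: Path, cleanup_stale: bool = True) -> Optional[int]:
    """Return the recorded PID if it is alive, else None.

    Assumption: an unreadable or non-numeric file is treated as stale.
    """
    try:
        pid = int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        pid = None
    if pid is not None and pid_is_alive(pid):
        return pid
    if cleanup_stale:
        path.unlink(missing_ok=True)
    return None


def claim_pid_file(path: Path) -> None:
    """Write the PID file, recovering once from a stale leftover."""
    try:
        write_pid_exclusive(path)
    except FileExistsError:
        if read_live_pid(path, cleanup_stale=True) is not None:
            raise  # genuine race: a live instance owns the file
        write_pid_exclusive(path)  # stale file removed; retry exactly once
```

Retrying only once keeps the failure mode bounded: if a second instance wins the file between the cleanup and the retry, the second `write_pid_exclusive()` call raises FileExistsError again and startup fails as before.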
@alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), and comp/gateway (Gateway runner, session dispatch, delivery) on Apr 23, 2026
@alt-glitch

Collaborator

Competing fix for #13655 (stale PID restart loop). Multiple PRs target the same issue: #13709, #9703, and #14203 (closed as a duplicate). Maintainers should pick one approach.

@alt-glitch

Collaborator

Competing fix for #13655.

@teknium1

Contributor

Thanks for the contribution, @eLeanwang! This stale-PID restart loop was fixed on main just before this PR was opened.

This is an automated hermes-sweeper review.

Evidence:

  • Commit 402d048eb (2026-04-22) rewrote _cleanup_invalid_pid_path() in gateway/status.py to directly unlink() both the PID file and its sibling lock file, bypassing the remove_pid_file() ownership guard that was the root cause of the stale-file failure.
  • The commit message explicitly describes the same bug scenario: crashed/SIGKILL'd gateway leaves a PID file owned by a dead foreign PID; remove_pid_file()'s guard then refuses to clean it up.
  • This fix shipped in v2026.4.23 (released 2026-04-23).
  • Your PR's caller-side retry in gateway/run.py is no longer needed because get_running_pid() now cleans up stale files before write_pid_file() is ever called, so the FileExistsError path is only reachable in a genuine live-process race.
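The difference between the guarded removal and the direct unlink described in the evidence above can be illustrated with a short sketch. It is not the code from commit 402d048eb: both function bodies are guesses at the behavior the review describes, and the `.lock` suffix for the sibling lock file is an assumption.

```python
import os
from pathlib import Path


def remove_pid_file_guarded(pid_path: Path) -> bool:
    """Remove the PID file only if it records the current process.

    Hypothetical model of an ownership guard: a file left behind by a
    dead foreign PID is never removed. Returns True if the file was removed.
    """
    try:
        recorded = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    if recorded != os.getpid():
        return False  # foreign PID: the guard refuses to clean up
    pid_path.unlink(missing_ok=True)
    return True


def cleanup_invalid_pid_path(pid_path: Path) -> None:
    """Unlink the PID file and its sibling lock file directly.

    Meant to be called only after the recorded PID is known to be dead
    or invalid. The ".lock" naming is an assumption for illustration.
    """
    lock_path = pid_path.with_name(pid_path.name + ".lock")
    for path in (pid_path, lock_path):
        path.unlink(missing_ok=True)
```

With a stale file recorded by another PID, the guarded variant returns False and leaves the file in place, which matches the restart loop reported in #13655; the direct variant removes both files regardless of which PID wrote them.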

The original bug report is tracked in #13655. Related open PRs #13709 and #9703 address the same issue and may also be closeable now.
