fix(gateway): clean stale PID file before O_EXCL write #25569

Closed
zccyman wants to merge 1 commit into NousResearch:main from atyou2happy:fix/zccyman/stale-pid-file-cleanup

@zccyman (Contributor) commented May 14, 2026

Summary

Fixes #13511

After an abnormal gateway exit (SIGKILL, crash, OOM, terminal disconnect), the gateway.pid file remains on disk. On the next gateway start, write_pid_file() opens the file with O_CREAT | O_EXCL, which raises FileExistsError, so the gateway exits with "PID file race lost to another gateway instance" even though the old process is dead.
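The failure mode can be reproduced in isolation. The following is a minimal sketch, not the project's code: `write_pid_file_exclusive` and `reproduce` are hypothetical names, and the temporary directory stands in for wherever the gateway actually keeps gateway.pid.

```python
import os
import tempfile


def write_pid_file_exclusive(path: str) -> None:
    """Hypothetical stand-in: create the PID file atomically, fail if it exists."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))


def reproduce() -> bool:
    """Return True if a leftover PID file blocks the exclusive create."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "gateway.pid")
        # Simulate the file left behind by a gateway that was killed.
        with open(pid_path, "w") as fh:
            fh.write("999999")
        try:
            write_pid_file_exclusive(pid_path)
        except FileExistsError:
            # O_EXCL only checks that the file exists, not who owns it.
            return True
        return False
```

O_EXCL makes the create fail whenever the path exists, regardless of whether the PID stored inside still refers to a running process.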

Root Cause

write_pid_file() in gateway/status.py does not check whether the process recorded in a pre-existing PID file is still alive before attempting the atomic O_EXCL create. The FileExistsError is unconditionally re-raised, blocking restart.

Fix

Before the os.open(path, O_CREAT | O_EXCL | ...) call, add a stale PID cleanup step:

  1. If the PID file exists, read the recorded PID
  2. Probe the process with os.kill(pid, 0)
  3. If the probe raises ProcessLookupError (the process is dead), remove the stale file
  4. Then proceed with the normal O_EXCL creation

This preserves the atomic race behavior for truly concurrent starts while fixing the stale-file blocking case.
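The four steps above can be sketched as follows. This is an illustration under assumptions, not the PR's diff: the `path` parameter, the `_remove_if_stale` helper, and `demo` are hypothetical, and the real write_pid_file() in gateway/status.py may resolve its path and handle errors differently. The probe is POSIX-only; on Windows, os.kill with signal 0 terminates the target process rather than testing for it.

```python
import os
import subprocess
import sys
import tempfile


def _remove_if_stale(path: str) -> None:
    """Hypothetical helper: unlink `path` if the PID it records is dead."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return  # no file, or unparseable content: leave it to O_EXCL
    if pid <= 0:
        return  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        try:
            os.unlink(path)  # recorded process is gone, so the file is stale
        except FileNotFoundError:
            pass  # another starter removed it first
    except PermissionError:
        pass  # process exists but belongs to another user: treat as live


def write_pid_file(path: str) -> None:
    """Sketch of the proposed flow: stale cleanup, then atomic create."""
    _remove_if_stale(path)
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))


def demo() -> str:
    """Write over a PID file left by a reaped child; return the new content."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()  # after reaping, the child's PID no longer exists
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "gateway.pid")
        with open(pid_path, "w") as fh:
            fh.write(str(child.pid))
        write_pid_file(pid_path)
        with open(pid_path) as fh:
            return fh.read()
```

One caveat with this shape: the read, probe, and unlink sequence is not itself atomic, so two starters that both observe the same stale file could in principle each unlink and recreate it. The O_EXCL create still guarantees a single winner per create attempt, but a lock held around the cleanup would be needed to close that window completely.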

Testing

Added 2 new tests in tests/gateway/test_status.py:

  • test_write_pid_file_cleans_stale_pid_from_dead_process — stale PID file is cleaned, write succeeds
  • test_write_pid_file_preserves_pid_file_for_live_process — live process PID file preserved, FileExistsError raised

All 13 TestGatewayPidState tests pass.
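Tests of this kind need one PID that is reliably dead and one that is reliably alive. A minimal sketch of one way to obtain both is shown below; `pid_is_alive`, `make_dead_pid`, and the two test functions are hypothetical and are not the tests added in tests/gateway/test_status.py. It assumes a POSIX system and that the kernel does not reuse the child's PID between the reap and the probe, which is unlikely but possible.

```python
import os
import subprocess
import sys


def pid_is_alive(pid: int) -> bool:
    """POSIX liveness probe using signal 0 (no signal is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def make_dead_pid() -> int:
    """Spawn a short-lived child, reap it, and return its now-unused PID."""
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    child.wait()
    return child.pid


def test_dead_pid_is_reported_dead() -> None:
    assert not pid_is_alive(make_dead_pid())


def test_own_pid_is_reported_alive() -> None:
    # The test process itself is the simplest guaranteed-live PID.
    assert pid_is_alive(os.getpid())
```

Using a reaped child avoids hard-coding a large "probably unused" PID, which can collide with a real process on systems with a high PID limit.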

@alt-glitch added the labels type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and P2 (Medium: degraded but workaround exists) on May 14, 2026
@teknium1 (Contributor)

This looks implemented on current main now. Automated hermes-sweeper review found the stale gateway PID restart case already covered by the gateway startup/status path.

Evidence:

  • gateway/run.py:16103 calls get_running_pid() before write_pid_file(), so stale cleanup runs before the O_CREAT | O_EXCL PID-file write.
  • gateway/status.py:290 defines _cleanup_invalid_pid_path(), which force-unlinks stale gateway.pid and sibling lock metadata once the runtime lock is inactive.
  • tests/gateway/test_status.py:55 covers the crash/stale-PID scenario and explicitly guards against the "PID file race lost" restart loop.
  • The relevant mainline commit is 413990c94537e9c9da973bb21a6afcd332400b91, contained in v2026.5.16.

@teknium1 closed this Jun 12, 2026
@teknium1 added the sweeper:implemented-on-main label (Sweeper: behavior already present on current main) on Jun 12, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: Hermes Gateway PID file race condition and proposed fix