Skip to content

fix(gateway): always unlink stale PID files, even with foreign PID#13947

Closed
stevenoliver7 wants to merge 1 commit into
NousResearch:mainfrom
stevenoliver7:fix/gateway-pid-cleanup-stale-foreign-records
Closed

fix(gateway): always unlink stale PID files, even with foreign PID#13947
stevenoliver7 wants to merge 1 commit into
NousResearch:mainfrom
stevenoliver7:fix/gateway-pid-cleanup-stale-foreign-records

Conversation

@stevenoliver7

Copy link
Copy Markdown

Summary

gateway/status.py::_cleanup_invalid_pid_path was delegating to remove_pid_file(), whose "only unlink if PID matches os.getpid()" guard is correct for the live --replace atexit-handoff case it was written for, but wrongly preserves stale foreign records during startup cleanup. The result: once a gateway exits uncleanly (SIGKILL, power loss, launchd kill -9), the next startup's liveness check sees a stale PID file belonging to a dead foreign PID, refuses to delete it, and then write_pid_file()'s O_CREAT|O_EXCL trips forever. Under launchd's KeepAlive, this becomes an infinite respawn crash-loop.

Observed failure (macOS + launchd, 2026-04-22)

com.hermes.cutieclawie respawned ~every 15s for ~12 hours (throttled by ThrottleInterval). Each child logged:

ERROR gateway.run: PID file race lost to another gateway instance. Exiting.

The "other instance" was a long-dead PID whose record was left behind by an earlier SIGKILL, because _cleanup_invalid_pid_pathremove_pid_file() → identity guard → no-op → file survives → next O_CREAT|O_EXCL trips again. Forever.

Fix

When we reach _cleanup_invalid_pid_path, get_running_pid has already established the file is unreadable, names a dead process, or references a PID whose identity no longer looks like a gateway — so the identity guard that protected live --replace handoffs is actively harmful here. Replaced the delegation with a direct pid_path.unlink(missing_ok=True) and let the next O_CREAT|O_EXCL serialize any real concurrent-startup race.

remove_pid_file() itself is left untouched — its guard is still correct for the atexit path it was written for.

Repro / regression test

tests/gateway/test_status.py::test_get_running_pid_cleans_up_stale_foreign_gateway_pid_file writes a gateway-shaped record with a foreign, provably-dead PID, stubs os.kill to raise ProcessLookupError, and asserts that get_running_pid() (a) returns None, and (b) removes the stale file. Before the fix: the assertion that the file is gone fails. After the fix: passes.

Test plan

  • pytest tests/gateway/test_status.py -v → 28 passed
  • Broader sweep pytest tests/ → 3619 passed / 3 pre-existing unrelated failures (Matrix upload, WhatsApp bridge phase2, AgentCache idle resume — confirmed by reverting fix and re-running: same 3 fail, unrelated to this change)
  • Real-agent verification on live Hermes profile (hermes -p hub gateway run --foreground with a stale foreign PID injected): gateway started cleanly, wrote its own pid, hermes -p hub chat -q "ping" returned the expected response, atexit removed the pid on shutdown.

Why this matters

This is the difference between a gateway that recovers from a kill -9 and one that requires manual rm ~/.hermes/profiles/<name>/gateway.pid intervention after every unclean exit. Under a supervisor like launchd / systemd, the latter means a service outage that grows the longer nobody notices.

`_cleanup_invalid_pid_path` was delegating to `remove_pid_file()`, whose
"only unlink if PID matches os.getpid()" guard is correct for the live
`--replace` atexit-handoff case it was written for, but wrongly preserves
stale foreign records during startup cleanup.

Observed failure (macOS + launchd, 2026-04-22): a gateway SIGKILL'd (or
otherwise exited without atexit firing) left `gateway.pid` pointing at a
dead PID. On every respawn, the new process:
  1. Read the stale record;
  2. os.kill(pid, 0) → ProcessLookupError → stale-cleanup path;
  3. `_cleanup_invalid_pid_path` called `remove_pid_file()` which
     bailed out because record.pid != os.getpid();
  4. `write_pid_file()` O_CREAT|O_EXCL tripped on the surviving file
     and logged "PID file race lost to another gateway instance.
     Exiting.";
  5. launchd respawned — goto 1.

Fix: when we reach `_cleanup_invalid_pid_path`, the caller
(`get_running_pid`) has already established the file is stale or
invalid, so the identity guard is actively harmful. Use a direct
`pid_path.unlink(missing_ok=True)` and let the next O_CREAT|O_EXCL
serialize any real concurrent-startup race.

The `remove_pid_file()` atexit protection is still correct for its
real use case (--replace handoffs) and is unchanged.

Regression test: `test_get_running_pid_cleans_up_stale_foreign_gateway_pid_file`
reproduces the macOS on-disk shape (`start_time: null`) with a
monkeypatched-dead PID and asserts the file is removed.
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #13709 — same root cause: _cleanup_invalid_pid_path delegates to remove_pid_file() whose identity guard preserves stale foreign PID files. Both fix by force-unlinking. Also overlaps with #13934 and #13872.

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists labels Apr 22, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Likely duplicate of #13709.

@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the detailed write-up and regression test, @stevenoliver7 — this is a well-diagnosed bug with a correct fix.

This is an automated hermes-sweeper review.

The change you propose is already on main:

Closing as implemented on main. The regression test you contributed is a great addition — if any part of it wasn't captured by the merged commit's tests, feel free to open a focused test-only PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants