fix(gateway): always unlink stale PID files, even with foreign PID#13947
Closed
stevenoliver7 wants to merge 1 commit into
Closed
fix(gateway): always unlink stale PID files, even with foreign PID#13947stevenoliver7 wants to merge 1 commit into
stevenoliver7 wants to merge 1 commit into
Conversation
`_cleanup_invalid_pid_path` was delegating to `remove_pid_file()`, whose
"only unlink if PID matches os.getpid()" guard is correct for the live
`--replace` atexit-handoff case it was written for, but wrongly preserves
stale foreign records during startup cleanup.
Observed failure (macOS + launchd, 2026-04-22): a gateway SIGKILL'd (or
otherwise exited without atexit firing) left `gateway.pid` pointing at a
dead PID. On every respawn, the new process:
1. Read the stale record;
2. os.kill(pid, 0) → ProcessLookupError → stale-cleanup path;
3. `_cleanup_invalid_pid_path` called `remove_pid_file()` which
bailed out because record.pid != os.getpid();
4. `write_pid_file()` O_CREAT|O_EXCL tripped on the surviving file
and logged "PID file race lost to another gateway instance.
Exiting.";
5. launchd respawned — goto 1.
Fix: when we reach `_cleanup_invalid_pid_path`, the caller
(`get_running_pid`) has already established the file is stale or
invalid, so the identity guard is actively harmful. Use a direct
`pid_path.unlink(missing_ok=True)` and let the next O_CREAT|O_EXCL
serialize any real concurrent-startup race.
The `remove_pid_file()` atexit protection is still correct for its
real use case (--replace handoffs) and is unchanged.
Regression test: `test_get_running_pid_cleans_up_stale_foreign_gateway_pid_file`
reproduces the macOS on-disk shape (`start_time: null`) with a
monkeypatched-dead PID and asserts the file is removed.
Collaborator
Collaborator
|
Likely duplicate of #13709. |
Contributor
|
Thanks for the detailed write-up and regression test, @stevenoliver7 — this is a well-diagnosed bug with a correct fix. This is an automated hermes-sweeper review. The change you propose is already on
Closing as implemented on main. The regression test you contributed is a great addition — if any part of it wasn't captured by the merged commit's tests, feel free to open a focused test-only PR. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
gateway/status.py::_cleanup_invalid_pid_pathwas delegating toremove_pid_file(), whose "only unlink if PID matchesos.getpid()" guard is correct for the live--replaceatexit-handoff case it was written for, but wrongly preserves stale foreign records during startup cleanup. The result: once a gateway exits uncleanly (SIGKILL, power loss, launchdkill -9), the next startup's liveness check sees a stale PID file belonging to a dead foreign PID, refuses to delete it, and thenwrite_pid_file()'sO_CREAT|O_EXCLtrips forever. Under launchd'sKeepAlive, this becomes an infinite respawn crash-loop.Observed failure (macOS + launchd, 2026-04-22)
com.hermes.cutieclawierespawned ~every 15s for ~12 hours (throttled byThrottleInterval). Each child logged:The "other instance" was a long-dead PID whose record was left behind by an earlier SIGKILL, because
_cleanup_invalid_pid_path→remove_pid_file()→ identity guard → no-op → file survives → nextO_CREAT|O_EXCLtrips again. Forever.Fix
When we reach
_cleanup_invalid_pid_path,get_running_pidhas already established the file is unreadable, names a dead process, or references a PID whose identity no longer looks like a gateway — so the identity guard that protected live--replacehandoffs is actively harmful here. Replaced the delegation with a directpid_path.unlink(missing_ok=True)and let the nextO_CREAT|O_EXCLserialize any real concurrent-startup race.remove_pid_file()itself is left untouched — its guard is still correct for the atexit path it was written for.Repro / regression test
tests/gateway/test_status.py::test_get_running_pid_cleans_up_stale_foreign_gateway_pid_filewrites a gateway-shaped record with a foreign, provably-dead PID, stubsos.killto raiseProcessLookupError, and asserts thatget_running_pid()(a) returnsNone, and (b) removes the stale file. Before the fix: the assertion that the file is gone fails. After the fix: passes.Test plan
pytest tests/gateway/test_status.py -v→ 28 passedpytest tests/→ 3619 passed / 3 pre-existing unrelated failures (Matrix upload, WhatsApp bridge phase2, AgentCache idle resume — confirmed by reverting fix and re-running: same 3 fail, unrelated to this change)hermes -p hub gateway run --foregroundwith a stale foreign PID injected): gateway started cleanly, wrote its own pid,hermes -p hub chat -q "ping"returned the expected response, atexit removed the pid on shutdown.Why this matters
This is the difference between a gateway that recovers from a kill -9 and one that requires manual
rm ~/.hermes/profiles/<name>/gateway.pidintervention after every unclean exit. Under a supervisor like launchd / systemd, the latter means a service outage that grows the longer nobody notices.