fix(gateway): treat recycled PID with unreadable start_time as stale (#14176)#14332
Open
Sanjays2402 wants to merge 1 commit into
Open
fix(gateway): treat recycled PID with unreadable start_time as stale (#14176)#14332Sanjays2402 wants to merge 1 commit into
Sanjays2402 wants to merge 1 commit into
Conversation
…ousResearch#14176) On Linux, when the PID file's pid has been recycled by the OS to a process the current user can't introspect (typical when /proc/<pid>/stat is owned by another UID), `_get_process_start_time` returns `None`. The recorded start_time mismatch check therefore can't fire, and the `_looks_like_gateway_process` heuristic can give a false positive on a long-lived python or hermes-related process \u2014 leaving a stale PID file that blocks `hermes gateway start` until the user manually removes `~/.hermes/gateway.pid`. Be conservative: when the PID record has a recorded start_time but the current process's start_time is unreadable, treat the PID file as stale and clean it up. Same code path as the existing start_time-mismatch case, just covering the unreadable variant. Adds 1 regression test under `TestGatewayPidState` that monkeypatches `_get_process_start_time` to return None and `_looks_like_gateway_process` to return True \u2014 the strongest stress for the false-positive path \u2014 and asserts the PID file is cleaned and `get_running_pid()` returns None.
Contributor
Author
|
CI status note for maintainers — the failing Verified by diffing the failing-test sets:
The clusters on main:
Happy to open targeted fix PRs for any of these clusters if it helps unblock the queue. Otherwise this PR is ready whenever main is green. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
gateway/status.py::find_gateway_pids()iterates over the PIDs recorded in~/.hermes/gateway.lockto decide whether the gateway is "still running". For each candidate it:os.kill(pid, 0)).start_timeagainst the live process'sstart_timeto detect PID recycling._looks_like_gateway_process(pid)/_record_looks_like_gateway(record)heuristics.When the recycled PID is owned by a different UID (typical on Linux when
/proc/<pid>/statis owned by another user, or under rootless container setups),_get_process_start_timereturnsNone. The recorded-vs-live mismatch check then can't fire (current_start is None), and_looks_like_gateway_processcan give a false positive on any long-lived python or hermes-related process the user happens to own. Result: the gateway thinks it's still running, refuses to start, and the user has to manuallyrm ~/.hermes/gateway.pidto recover.Reporter (#14176) sees this in production with a systemd user service that restarts the gateway nightly — every few weeks the next PID up the queue lands on a recycled foreign PID, the lock file goes stale, and
hermes gateway startfails with "Gateway already running".Fix: be conservative. When the PID record carries a
recorded_startbut we can't read the candidate'scurrent_start, skip the candidate (treat as stale) instead of falling through to the heuristic. Outside/proc-readable territory we don't have enough information to confirm this is the same gateway process, so prefer "no" over "maybe".Related Issue
Fixes #14176
Type of Change
Changes Made
gateway/status.py(+10 / −0): in thefind_gateway_pids()candidate loop, skip any PID whose recordedstart_timeexists but whose livestart_timeis unreadable. Same code path as the existing recorded-vs-live mismatch case, just covering the unreadable variant.tests/gateway/test_status.py(+40 / −0): one new regression case underTestGatewayPidState,test_get_running_pid_treats_recycled_pid_with_unreadable_start_time_as_stale. Monkeypatches_get_process_start_timeto returnNoneand_looks_like_gateway_processto returnTrue(the strongest stress for the false-positive path) and asserts the PID file is cleaned andget_running_pid()returnsNone.Core diff:
recorded_start = record.get("start_time") current_start = _get_process_start_time(pid) if recorded_start is not None and current_start is not None and current_start != recorded_start: continue + # If the PID record carries a recorded start_time but we can't read + # the current process's start_time, the PID may have been recycled by + # the OS to a process the current user can't introspect (typical on + # Linux when /proc/<pid>/stat is owned by another UID). The downstream + # _looks_like_gateway_process heuristic can give a false positive in + # that situation — e.g. another long-lived python process — leaving + # a stale PID file that blocks future starts. Be conservative and + # skip this candidate. See #14176. + if recorded_start is not None and current_start is None: + continue if _looks_like_gateway_process(pid) or _record_looks_like_gateway(record): return pidHow to Test
Reporter-style repro on a Linux host:
~/.hermes/gateway.pidand~/.hermes/gateway.lockpopulated with the dead PID.hermesdaemon under another account) that the test user can see viapsbut NOT via/proc/<pid>/stat. Note its PID.start_timefield intact.hermes gateway start.Before: refuses to start with "Gateway already running".
After: detects the start_time mismatch is unverifiable, treats the entry as stale, cleans the lock file, and starts a fresh gateway.
Automated regression suite:
Checklist
Code
fix(gateway):)pytest tests/gateway/test_status.py::TestGatewayPidState -qand all tests pass (11/11)Documentation & Housekeeping
cli-config.yaml.example— N/A (no new config)CONTRIBUTING.md/AGENTS.md— N/ANot in scope
hermes gateway clean-pidcommand) — that's nice-to-have but a separate UX surface; the in-band fix here is the higher-impact change since it stops the bad state from forming.gateway.target/Restart=semantics in the example systemd unit (the issue's secondary note) — that's a docs change fordocs/deploy/that deserves its own PR.kill -9of the gateway doesn't leave a PID file behind — a real concern but out of scope for the reported bug, which is about PID-file interpretation, not creation.Screenshots / Logs
Verification
(11 = 10 existing + 1 new regression case, all green.)