Skip to content

fix(gateway): treat recycled PID with unreadable start_time as stale (#14176)#14332

Open
Sanjays2402 wants to merge 1 commit into
NousResearch:mainfrom
Sanjays2402:fix/14176-stale-pid-file-cleanup
Open

fix(gateway): treat recycled PID with unreadable start_time as stale (#14176)#14332
Sanjays2402 wants to merge 1 commit into
NousResearch:mainfrom
Sanjays2402:fix/14176-stale-pid-file-cleanup

Conversation

@Sanjays2402

Copy link
Copy Markdown
Contributor

What does this PR do?

gateway/status.py::find_gateway_pids() iterates over the PIDs recorded in ~/.hermes/gateway.lock to decide whether the gateway is "still running". For each candidate it:

  1. Checks the PID is alive (os.kill(pid, 0)).
  2. Compares the recorded start_time against the live process's start_time to detect PID recycling.
  3. Falls back to _looks_like_gateway_process(pid) / _record_looks_like_gateway(record) heuristics.

When the recycled PID is owned by a different UID (typical on Linux when /proc/<pid>/stat is owned by another user, or under rootless container setups), _get_process_start_time returns None. The recorded-vs-live mismatch check then can't fire (current_start is None), and _looks_like_gateway_process can give a false positive on any long-lived python or hermes-related process the user happens to own. Result: the gateway thinks it's still running, refuses to start, and the user has to manually rm ~/.hermes/gateway.pid to recover.

Reporter (#14176) sees this in production with a systemd user service that restarts the gateway nightly — every few weeks the next PID up the queue lands on a recycled foreign PID, the lock file goes stale, and hermes gateway start fails with "Gateway already running".

Fix: be conservative. When the PID record carries a recorded_start but we can't read the candidate's current_start, skip the candidate (treat as stale) instead of falling through to the heuristic. Outside /proc-readable territory we don't have enough information to confirm this is the same gateway process, so prefer "no" over "maybe".

Related Issue

Fixes #14176

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/status.py (+10 / −0): in the find_gateway_pids() candidate loop, skip any PID whose recorded start_time exists but whose live start_time is unreadable. Same code path as the existing recorded-vs-live mismatch case, just covering the unreadable variant.
  • tests/gateway/test_status.py (+40 / −0): one new regression case under TestGatewayPidState, test_get_running_pid_treats_recycled_pid_with_unreadable_start_time_as_stale. Monkeypatches _get_process_start_time to return None and _looks_like_gateway_process to return True (the strongest stress for the false-positive path) and asserts the PID file is cleaned and get_running_pid() returns None.

Core diff:

         recorded_start = record.get("start_time")
         current_start = _get_process_start_time(pid)
         if recorded_start is not None and current_start is not None and current_start != recorded_start:
             continue
+        # If the PID record carries a recorded start_time but we can't read
+        # the current process's start_time, the PID may have been recycled by
+        # the OS to a process the current user can't introspect (typical on
+        # Linux when /proc/<pid>/stat is owned by another UID). The downstream
+        # _looks_like_gateway_process heuristic can give a false positive in
+        # that situation — e.g. another long-lived python process — leaving
+        # a stale PID file that blocks future starts. Be conservative and
+        # skip this candidate. See #14176.
+        if recorded_start is not None and current_start is None:
+            continue

         if _looks_like_gateway_process(pid) or _record_looks_like_gateway(record):
             return pid

How to Test

Reporter-style repro on a Linux host:

  1. Run the gateway, kill -9 the parent process to leave ~/.hermes/gateway.pid and ~/.hermes/gateway.lock populated with the dead PID.
  2. Start a long-lived python process under a different UID (e.g. another hermes daemon under another account) that the test user can see via ps but NOT via /proc/<pid>/stat. Note its PID.
  3. Edit the lock file to point at that recycled PID, keeping the original start_time field intact.
  4. Run hermes gateway start.

Before: refuses to start with "Gateway already running".
After: detects the start_time mismatch is unverifiable, treats the entry as stale, cleans the lock file, and starts a fresh gateway.

Automated regression suite:

pytest tests/gateway/test_status.py::TestGatewayPidState -q

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway):)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix
  • I've run pytest tests/gateway/test_status.py::TestGatewayPidState -q and all tests pass (11/11)
  • I've added tests for my changes
  • I've tested on my platform: macOS 26.5 (arm64), Python 3.11.14 via uv

Documentation & Housekeeping

  • Documentation updates — N/A (internal helper, no user-visible API change beyond bug fix)
  • cli-config.yaml.example — N/A (no new config)
  • CONTRIBUTING.md / AGENTS.md — N/A
  • Cross-platform impact considered — change is conservative on every platform; the false positive fix matters most on Linux but doesn't regress macOS/Windows behavior
  • Tool descriptions/schemas — N/A

Not in scope

  • The reporter's bash script idea (a dedicated hermes gateway clean-pid command) — that's nice-to-have but a separate UX surface; the in-band fix here is the higher-impact change since it stops the bad state from forming.
  • Auditing gateway.target / Restart= semantics in the example systemd unit (the issue's secondary note) — that's a docs change for docs/deploy/ that deserves its own PR.
  • Hardening atexit-vs-SIGKILL paths so a kill -9 of the gateway doesn't leave a PID file behind — a real concern but out of scope for the reported bug, which is about PID-file interpretation, not creation.

Screenshots / Logs

Verification

$ python3 -m py_compile gateway/status.py tests/gateway/test_status.py
OK

$ uv run --no-project --with pytest --with pytest-xdist --with pyyaml \
       --with python-dotenv --with prompt_toolkit --with rich --with httpx \
       --with fastapi --with pydantic python -m pytest \
       tests/gateway/test_status.py::TestGatewayPidState -q
...........                                                              [100%]
11 passed in 0.51s

(11 = 10 existing + 1 new regression case, all green.)

…ousResearch#14176)

On Linux, when the PID file's pid has been recycled by the OS to a
process the current user can't introspect (typical when /proc/<pid>/stat
is owned by another UID), `_get_process_start_time` returns `None`.
The recorded start_time mismatch check therefore can't fire, and the
`_looks_like_gateway_process` heuristic can give a false positive on a
long-lived python or hermes-related process \u2014 leaving a stale PID file
that blocks `hermes gateway start` until the user manually removes
`~/.hermes/gateway.pid`.

Be conservative: when the PID record has a recorded start_time but the
current process's start_time is unreadable, treat the PID file as stale
and clean it up. Same code path as the existing start_time-mismatch
case, just covering the unreadable variant.

Adds 1 regression test under `TestGatewayPidState` that monkeypatches
`_get_process_start_time` to return None and `_looks_like_gateway_process`
to return True \u2014 the strongest stress for the false-positive path \u2014
and asserts the PID file is cleaned and `get_running_pid()` returns
None.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Apr 23, 2026
@Sanjays2402

Copy link
Copy Markdown
Contributor Author

CI status note for maintainers — the failing test check on this PR is from a set of 15 pre-existing test failures on main, not regressions introduced here.

Verified by diffing the failing-test sets:

  • Latest main push run (25250051126): 16 failed
  • This PR's run: same 15 (subset)
  • Net new failures introduced by this PR: 0

The clusters on main:

Cluster Tests Likely cause
Systemd TimeoutStopSec test_*_unit_avoids_recursive_execstop_and_uses_extended_stop_timeout Code emits 210, test asserts 90 — drift
Gateway restart kill semantics test_update_* Recent change in expected .kill() call counts
update --yes flag test_update_yes_flag TTY prompt / stash restore behavior changed
Dockerfile pid1 test_dockerfile_* Dockerfile regenerated, dropped TUI ink references
Concurrent interrupt _Stub test fixture missing _tool_guardrails attr
dotenv vs os.environ test_os_environ_still_wins_over_dotenv Same class as #18757; happy to add to my fix PR if useful
ACP commands test_send_available_commands_update Command list ordering
Teams typing test_send_typing Mock not awaited
TUI pending_title test_session_create_drops_pending_title_on_valueerror ValueError no longer drops title

Happy to open targeted fix PRs for any of these clusters if it helps unblock the queue. Otherwise this PR is ready whenever main is green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway hang on clean exit / restart race with stale PID

2 participants