
fix(gateway): resolve stale PID file blocking startup after forced kill #13559

Open
batumilove wants to merge 2 commits into NousResearch:main from batumilove:fix/stale-pid-after-forced-kill

Conversation

@batumilove


Problem

When systemd kills the gateway process (e.g. TimeoutStopSec exceeded during drain of in-flight Telegram sessions), the PID file (gateway.pid) remains on disk. If the OS later reuses that PID for an unrelated process, get_running_pid() incorrectly reports it as "still running", causing a restart loop:

  1. systemd restarts the gateway
  2. Gateway sees the stale PID as "another instance running"
  3. Gateway exits with error
  4. systemd restarts → repeat

The root cause: _looks_like_gateway_process() returns False (the reused PID is not the gateway), but _record_looks_like_gateway() returns True (the stale record still contains gateway metadata). The old code only cleaned up when both checks agreed the PID was stale.
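
The disagreement between the two checks can be made concrete with a minimal sketch. The function name and signature below are hypothetical and only model the cleanup rule described above; this is not the code in gateway/status.py.

```python
def old_should_clean_up(process_looks_like_gateway: bool,
                        record_looks_like_gateway: bool) -> bool:
    """Model of the old rule: clean up only when BOTH checks say 'not a gateway'."""
    return not process_looks_like_gateway and not record_looks_like_gateway


# PID reuse after a forced kill: the live process is unrelated (False),
# but the stale record still carries gateway metadata (True).
reused_pid_cleanup = old_should_clean_up(False, True)
print(reused_pid_cleanup)  # False: the stale PID file is kept
```

Under this model the reused-PID case never triggers cleanup, which matches the restart loop described in the steps above.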

Fix

In get_running_pid(), when _looks_like_gateway_process() returns False, read the actual /proc/<pid>/cmdline before falling back to the stored record:

  • cmdline is readable → trust it over the stale record. A non-gateway cmdline means the PID file is stale → clean up.
  • cmdline is unreadable (container/capability edge case) → fall back to _record_looks_like_gateway() (preserves old behavior).
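
The two branches above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual patch: `read_cmdline`, `pid_file_is_stale`, the `proc_root` parameter, and `GATEWAY_MARKER` are hypothetical names introduced here, and the real matching logic lives in `_looks_like_gateway_process()` and `_record_looks_like_gateway()`.

```python
from pathlib import Path
from typing import Optional

# Hypothetical marker used only for this sketch; the real check is
# whatever _looks_like_gateway_process() does in gateway/status.py.
GATEWAY_MARKER = "gateway"


def read_cmdline(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Return the process command line, or None if it cannot be read."""
    try:
        raw = Path(proc_root, str(pid), "cmdline").read_bytes()
    except OSError:
        # Missing entry or insufficient permissions (e.g. restricted container).
        return None
    # /proc/<pid>/cmdline separates arguments with NUL bytes.
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def pid_file_is_stale(cmdline: Optional[str],
                      record_looks_like_gateway: bool) -> bool:
    """Decide staleness after the live-process check has already returned False."""
    if cmdline is not None:
        # Readable cmdline: trust it over the stored record.
        return GATEWAY_MARKER not in cmdline
    # Unreadable cmdline: fall back to the stored record (old behavior).
    return not record_looks_like_gateway
```

For example, `pid_file_is_stale("sleep 100", True)` reports the file as stale even though the record still looks like a gateway, while `pid_file_is_stale(None, True)` keeps the old answer because nothing better than the record is available.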

Testing

  • Reproduced the issue: killed gateway via systemctl --user kill hermes-gateway, confirmed restart loop with "PID file race lost" errors
  • Applied patch, repeated forced kill → gateway starts cleanly, stale PID cleaned up automatically
  • Verified --replace flow still works correctly

Changes

  • gateway/status.py: 9 lines added, 1 removed in get_running_pid() around line 610

Hermes Agent added 2 commits April 21, 2026 10:57
When systemd kills the gateway process (e.g. TimeoutStopSec exceeded
during drain), the PID file remains on disk. If the OS later reuses
that PID for an unrelated process, get_running_pid() would see a live
PID and check _record_looks_like_gateway() — but the stale record
still contains gateway metadata, so it matched and reported the PID
as 'running'. This caused a restart loop: systemd restarts the
gateway, it sees its own stale PID as 'another instance running',
exits, systemd restarts again.

The fix: when _looks_like_gateway_process() returns False (the live
PID's /proc entry doesn't look like the gateway), read the actual
cmdline from /proc before falling back to the stored record. If the
cmdline is readable, trust it over the stale record — a non-gateway
cmdline means the PID file is stale and should be cleaned up. Only
fall back to _record_looks_like_gateway() when /proc is unreadable
(container/capability edge case).
@alt-glitch added the type/bug (Something isn't working) and comp/gateway (Gateway runner, session dispatch, delivery) labels on Apr 21, 2026
@alt-glitch

Collaborator

Related to #13658, #13709, and #9703 — all address stale PID file cleanup but via different code paths (pre-startup cleanup vs. get_running_pid() logic).



2 participants