fix(gateway): resolve stale PID file blocking startup after forced kill by batumilove · Pull Request #13559 · NousResearch/hermes-agent

batumilove · 2026-04-21T14:21:41Z

Problem

When systemd kills the gateway process (e.g. TimeoutStopSec exceeded during drain of in-flight Telegram sessions), the PID file (gateway.pid) remains on disk. If the OS later reuses that PID for an unrelated process, get_running_pid() incorrectly reports it as "still running", causing a restart loop:

1. systemd restarts the gateway
2. Gateway sees the stale PID as "another instance running"
3. Gateway exits with error
4. systemd restarts → repeat

The root cause: _looks_like_gateway_process() returns False (the reused PID is not the gateway), but _record_looks_like_gateway() returns True (the stale record still contains gateway metadata). The old code only cleaned up when both checks agreed the PID was stale.
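
To make the failure mode concrete, here is a minimal sketch of the pre-fix decision as described above. It is a reconstruction from this description, not the code in gateway/status.py: `pid_is_alive` and `old_decision` are hypothetical names, and the three boolean inputs stand in for the liveness probe, `_looks_like_gateway_process()` and `_record_looks_like_gateway()`.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX): checks existence, delivers nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def old_decision(alive: bool, process_matches: bool, record_matches: bool) -> str:
    """Hypothetical reconstruction of the pre-fix logic (an assumption).

    The PID file is only treated as stale when BOTH checks say so.
    """
    if not alive:
        return "stale"
    if process_matches or record_matches:
        return "running"
    return "stale"


# A reused PID: alive, not the gateway, but the stale record still matches.
print(old_decision(alive=True, process_matches=False, record_matches=True))
```

With a reused PID the liveness probe succeeds and the stale record still matches, so the sketch reports "running", which is the state that keeps the restart loop going.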

Fix

In get_running_pid(), when _looks_like_gateway_process() returns False, read the actual /proc/<pid>/cmdline before falling back to the stored record:

cmdline is readable → trust it over the stale record. A non-gateway cmdline means the PID file is stale → clean up.
cmdline is unreadable (container/capability edge case) → fall back to _record_looks_like_gateway() (preserves old behavior).
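
The two cases above can be sketched as follows. This is an illustration under stated assumptions, not the actual patch: `read_cmdline`, `cmdline_looks_like_gateway` and `pid_file_is_stale` are hypothetical helpers, the `"gateway"` substring match is a placeholder for whatever matching gateway/status.py really performs, and treating an empty cmdline as unreadable is a choice made for this sketch.

```python
from pathlib import Path
from typing import Optional


def read_cmdline(pid: int, proc_root: str = "/proc") -> Optional[str]:
    """Return the command line of `pid`, or None if it cannot be read."""
    try:
        raw = Path(proc_root, str(pid), "cmdline").read_bytes()
    except OSError:
        return None  # no /proc, missing entry, or insufficient permissions
    if not raw:
        return None  # assumption: treat an empty cmdline as unreadable
    # Arguments in /proc/<pid>/cmdline are NUL-separated.
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def cmdline_looks_like_gateway(cmdline: str) -> bool:
    # Placeholder rule (assumption); the real check lives in gateway/status.py.
    return "gateway" in cmdline


def pid_file_is_stale(
    pid: int,
    process_matches: bool,
    record_matches: bool,
    proc_root: str = "/proc",
) -> bool:
    """Decide staleness for a live PID, preferring the live cmdline."""
    if process_matches:
        return False  # the live process is the gateway
    cmdline = read_cmdline(pid, proc_root)
    if cmdline is not None:
        # Readable cmdline: trust it over the stored record.
        return not cmdline_looks_like_gateway(cmdline)
    # Unreadable cmdline: fall back to the stored record (old behavior).
    return not record_matches
```

The `proc_root` parameter exists only so the sketch can be exercised against a temporary directory instead of the real /proc.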

Testing

Reproduced the issue: killed gateway via systemctl --user kill hermes-gateway, confirmed restart loop with "PID file race lost" errors
Applied patch, repeated forced kill → gateway starts cleanly, stale PID cleaned up automatically
Verified --replace flow still works correctly

Changes

gateway/status.py: 9 lines added, 1 removed in get_running_pid() around line 610

When systemd kills the gateway process (e.g. TimeoutStopSec exceeded during drain), the PID file remains on disk. If the OS later reuses that PID for an unrelated process, get_running_pid() would see a live PID and check _record_looks_like_gateway(). The stale record still contains gateway metadata, so it matched and reported the PID as 'running'.

This caused a restart loop: systemd restarts the gateway, it sees its own stale PID as 'another instance running', exits, and systemd restarts it again.

The fix: when _looks_like_gateway_process() returns False (the live PID's /proc entry doesn't look like the gateway), read the actual cmdline from /proc before falling back to the stored record. If the cmdline is readable, trust it over the stale record: a non-gateway cmdline means the PID file is stale and should be cleaned up. Only fall back to _record_looks_like_gateway() when /proc is unreadable (container/capability edge case).

alt-glitch · 2026-04-21T22:36:57Z

Related to #13658, #13709, and #9703 — all address stale PID file cleanup but via different code paths (pre-startup cleanup vs. get_running_pid() logic).

Hermes Agent added 2 commits April 21, 2026 10:57

Fix GLM error 1213: handle empty prompts gracefully

62b6a26

This was referenced Apr 21, 2026

fix: force-clean stale PID file on gateway startup #13658

Closed

[codex] Fix gateway update restart race #13713

Closed

alt-glitch added the type/bug and comp/gateway labels Apr 21, 2026

This was referenced Apr 22, 2026

fix: clean up stale gateway pid files and skip wrapper memory plugins #13872

Closed

[Bug]: Gateway hang on clean exit / restart race with stale PID #14176

Open

batumilove wants to merge 2 commits into NousResearch:main from batumilove:fix/stale-pid-after-forced-kill
