fix(gateway): clear stale pid file before service start #9703

Closed
john-livingston wants to merge 1 commit into NousResearch:main from john-livingston:fix/gateway-service-stale-pid

@john-livingston

Summary

When the gateway process dies abruptly (SIGKILL, power loss, OOM kill),
it cannot clean up gateway.pid, so the PID file is still present on the
next systemctl start. If the OS has recycled the old PID,
get_running_pid() may mistake the unrelated process for a running
gateway and block startup. If it correctly identifies the PID as stale,
startup proceeds, but the file lingers as noise until Python cleans it up.
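
To illustrate the PID-recycling hazard described above, here is a minimal sketch of a naive PID-file check. It is an assumption for illustration only: naive_running_pid() is a hypothetical helper, not the project's get_running_pid(), whose implementation is not shown in this PR.

```python
import os
from typing import Optional


def naive_running_pid(pid_path: str) -> Optional[int]:
    """Hypothetical check: trust the PID file if that PID is alive."""
    try:
        with open(pid_path) as f:
            pid = int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None  # no file, or unparseable contents
    if pid <= 0:
        return None  # never signal process groups by accident
    try:
        os.kill(pid, 0)  # signal 0 only probes for existence
    except ProcessLookupError:
        return None  # nothing has this PID: the file is stale
    except PermissionError:
        pass  # the PID exists but belongs to another user
    # The PID is alive, but nothing here proves it is the gateway.
    return pid
```

If the kernel has handed the recorded PID to an unrelated process, this check still returns it, which is the false positive that would block startup.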

Fix: Add ExecStartPre=-/bin/rm -f {hermes_home}/gateway.pid to
both service templates (user-level and system-level) in
generate_systemd_unit(). The - prefix makes the step non-fatal:
rm -f already succeeds when the file is absent, and if the command
fails for any other reason, systemd still proceeds to ExecStart.
The path is the Python f-string value of hermes_home, baked in at
install time, so it targets the correct profile directory.
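
The shape of the change can be sketched as follows. This is not the project's actual template: render_unit() is a hypothetical stand-in for generate_systemd_unit(), and the description text, ExecStart command, and example home directory are placeholders.

```python
from pathlib import PurePosixPath


def render_unit(hermes_home: PurePosixPath, exec_start: str) -> str:
    """Hypothetical, trimmed-down unit template (illustration only)."""
    pid_file = hermes_home / "gateway.pid"
    return (
        "[Unit]\n"
        "Description=Gateway service (illustrative)\n"
        "\n"
        "[Service]\n"
        # Leading '-' tells systemd to ignore a non-zero exit status.
        f"ExecStartPre=-/bin/rm -f {pid_file}\n"
        f"ExecStart={exec_start}\n"
        "\n"
        "[Install]\n"
        "WantedBy=default.target\n"
    )


# The path is interpolated once, when the unit file is generated.
unit = render_unit(PurePosixPath("/home/alice/.hermes"), "/path/to/gateway-command")
```

Because the path is resolved when the unit text is rendered, each installed unit removes only its own profile's gateway.pid.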

This is a belt-and-suspenders complement to --replace: --replace
handles live processes via Python; ExecStartPre handles the stale-file
case at the OS level before Python starts.

Changes

  • hermes_cli/gateway.py — both generate_systemd_unit() templates:
    insert ExecStartPre=-/bin/rm -f {hermes_home}/gateway.pid before
    ExecStart

Related

Test plan

  • Manually verified: service starts cleanly after simulating a stale
    gateway.pid left by a killed process
  • No automated tests added (existing suite requires dependencies not
    available in this environment)

@teknium1

Contributor

Thanks for this careful fix, @john-livingston! The underlying stale-PID-on-abrupt-kill problem has since been addressed at a deeper layer on main.

Automated hermes-sweeper review found this superseded by commit b52123e (fix(gateway): recover stale pid and planned restart state, shipped in v2026.4.23):

  • gateway/status.py now uses an fcntl-backed runtime lock file (gateway.lock) alongside gateway.pid. When the gateway is killed by SIGKILL, the OS releases the lock automatically. get_running_pid() calls is_gateway_runtime_lock_active() first; if the lock is not held, it immediately calls _cleanup_invalid_pid_path() and returns None, unblocking startup without any systemd pre-start hook. (gateway/status.py:757)
  • The ExecStartPre=-/bin/rm -f this PR adds is now redundant — the Python startup path handles the stale-file case before any platform adapters are opened.
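
The lock-based approach described in the first bullet can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in gateway/status.py: the helper names are hypothetical, and the use of fcntl.flock (rather than another fcntl locking call) is an assumption. It relies on the kernel dropping the lock when the holding process exits for any reason, including SIGKILL.

```python
import fcntl
import os
from typing import Optional


def try_acquire_lock(lock_path: str) -> Optional[int]:
    """Return an fd holding an exclusive lock, or None if it is already held."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def lock_is_active(lock_path: str) -> bool:
    """Probe the lock: if we can take it, no live gateway holds it."""
    fd = try_acquire_lock(lock_path)
    if fd is None:
        return True
    os.close(fd)  # closing the fd releases the probe lock
    return False


def running_pid(pid_path: str, lock_path: str) -> Optional[int]:
    """Hypothetical lock-first variant of a PID-file check."""
    if not lock_is_active(lock_path):
        # No lock holder, so any PID file is stale: remove it.
        try:
            os.unlink(pid_path)
        except FileNotFoundError:
            pass
        return None
    try:
        with open(pid_path) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None
```

In this sketch a long-running process would call try_acquire_lock() once at startup and keep the returned descriptor open for its lifetime. A recycled PID cannot fool the check, because the unrelated process does not hold the lock.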

Closing as implemented on main. The cross-references to #7227 / #9574 (Windows stale PID) remain open for the Windows path.

teknium1 closed this on Jun 10, 2026
teknium1 added the sweeper:implemented-on-main label on Jun 10, 2026
Labels

  • comp/cli (CLI entry point, hermes_cli/, setup wizard)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • P2 (Medium: degraded but workaround exists)
  • sweeper:implemented-on-main (Sweeper: behavior already present on current main)
  • type/bug (Something isn't working)