Fix stale gateway PID recovery and add systemd startup cleanup #14002

Closed
qwertysc wants to merge 1 commit into NousResearch:main from qwertysc:fix/gateway-stale-pid-and-systemd-hardening

@qwertysc

Summary

  • force-remove invalid or stale gateway.pid records during runtime PID checks
  • add ExecStartPre startup cleanup to generated systemd units so stale PID files self-heal before launch
  • add regression coverage for stale foreign PID cleanup and generated unit startup cleanup

Testing

  • venv/bin/python -m pytest -q tests/gateway/test_status.py tests/hermes_cli/test_gateway_service.py

Background

This fixes a restart loop: an unclean shutdown leaves gateway.pid behind, systemd kills the old process after the timeout, and the next startup fails with "PID file race lost to another gateway instance". The fix is split into core correctness (PID cleanup) and service-level self-healing (systemd ExecStartPre).
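For readers unfamiliar with the failure mode, the stale-record check can be sketched roughly as follows. This is an illustration, not the code in this PR: `read_live_pid` is a hypothetical helper, the PID file is assumed to hold a bare integer (the real gateway.pid record may carry more metadata), and the liveness probe via `os.kill(pid, 0)` is POSIX-only.

```python
import os
from pathlib import Path
from typing import Optional


def read_live_pid(pid_path: Path) -> Optional[int]:
    """Return the recorded PID if that process exists, else None.

    Hypothetical helper: invalid or stale records are deleted as a side effect.
    Assumes the file contains a bare integer PID.
    """
    try:
        pid = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Missing or unparsable record: make sure nothing is left behind.
        pid_path.unlink(missing_ok=True)
        return None

    if pid <= 0:
        # PIDs <= 0 address process groups in kill(2); never probe them.
        pid_path.unlink(missing_ok=True)
        return None

    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except (ProcessLookupError, OverflowError):
        # No such process: the record is stale, so remove it.
        pid_path.unlink(missing_ok=True)
        return None
    except PermissionError:
        # The process exists but belongs to another user.
        return pid
    return pid
```

A PID that has been recycled by an unrelated process still passes this probe, which is presumably why the real check also inspects the recorded metadata.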

@alt-glitch added labels on Apr 22, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), comp/cli (CLI entry point, hermes_cli/, setup wizard), duplicate (This issue or pull request already exists)
@alt-glitch

Collaborator

Likely duplicate of #13709 — same root cause: stale gateway.pid after unclean shutdown. Also overlaps with #13947 and #13934.

@alt-glitch

Collaborator

Likely duplicate of #13709

@teknium1

Contributor

Thanks for the detailed write-up and the regression coverage! The core fix here — replacing the defensive remove_pid_file() call in _cleanup_invalid_pid_path with a direct force-unlink — was already merged to main as part of a salvage of the overlapping PRs (#13709, #13947, #13934) flagged by @alt-glitch.

Evidence:

  • gateway/status.py: _cleanup_invalid_pid_path now uses pid_path.unlink(missing_ok=True) + sibling lock unlink directly (commit 402d048eb)
  • tests/gateway/test_status.py:153: test_get_running_pid_cleans_stale_metadata_from_dead_foreign_pid is already present
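As a rough illustration of what a direct force-unlink looks like (not a copy of the code on main), the sketch below removes a PID file and a sibling lock unconditionally. The function name `force_remove_pid_files` and the `.lock` suffix for the sibling lock are assumptions made for the example; the actual lock naming in gateway/status.py may differ.

```python
from pathlib import Path


def force_remove_pid_files(pid_path: Path) -> None:
    """Delete the PID file and its sibling lock without any ownership checks.

    Hypothetical sketch; the lock naming (gateway.pid -> gateway.pid.lock)
    is an assumption, not taken from the repository.
    """
    # missing_ok=True (Python 3.8+) makes the call idempotent.
    pid_path.unlink(missing_ok=True)

    lock_path = pid_path.with_name(pid_path.name + ".lock")
    lock_path.unlink(missing_ok=True)
```

The point of unlinking directly is that a defensive remover which refuses to delete a record it does not own can leave a dead foreign PID in place forever.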

The ExecStartPre systemd startup-cleanup clause you added to generate_systemd_unit() is not yet on main — if you'd like to follow up with just that hardening layer as a fresh PR, it would be a clean, separable addition.
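For anyone picking up that follow-up, a unit generator with a startup-cleanup clause might look roughly like the sketch below. It is not the body of generate_systemd_unit(): `render_unit`, its parameters, and the unit contents are hypothetical, and the unconditional `rm -f` is only one possible cleanup strategy. The leading `-` on ExecStartPre is standard systemd syntax that makes a failing cleanup command non-fatal.

```python
def render_unit(exec_start: str, pid_path: str) -> str:
    """Render a minimal service unit that clears a stale PID file before launch.

    Hypothetical sketch. Paths are assumed to contain no whitespace;
    otherwise they would need systemd-style quoting.
    """
    lines = [
        "[Unit]",
        "Description=Example gateway service (illustrative)",
        "",
        "[Service]",
        "Type=simple",
        # "-" prefix: a non-zero exit from the cleanup does not fail the start.
        f"ExecStartPre=-/bin/rm -f {pid_path}",
        f"ExecStart={exec_start}",
        "Restart=on-failure",
        "",
        "[Install]",
        "WantedBy=default.target",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_unit("/usr/bin/example-gateway run", "/tmp/example/gateway.pid"))
```

Because systemd runs ExecStartPre commands before ExecStart, the cleanup happens after the previous instance has been stopped and before the new one tries to claim the PID file.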

This is an automated hermes-sweeper review.
