fix: force-clean stale PID file on gateway startup #13658

Closed
natehale wants to merge 1 commit into NousResearch:main from natehale:fix/stale-pid-cleanup

Conversation

@natehale

Problem

When the gateway crashes hard (SIGKILL, OOM-kill, segfault), the atexit handler never fires, leaving a stale gateway.pid file. On restart, the duplicate-instance guard can fail to detect the dead process under certain race conditions, preventing the gateway from starting and requiring manual intervention to delete the PID file.

Solution

Add a pre-startup force-cleanup block in gateway/run.py that runs BEFORE the existing get_running_pid() duplicate-instance guard:

  1. Reads the PID file
  2. Checks if the recorded PID is actually alive via os.kill(pid, 0)
  3. If dead → force-deletes the stale PID file with a log message
  4. Then the normal duplicate-instance guard continues

This catches the most common crash scenario: previous gateway dies hard, PID file stays, next startup cleans it up automatically.
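
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes the PID file holds either a bare integer or a JSON object with a `pid` key, and the helper names `pid_is_alive` and `clean_stale_pid_file` are hypothetical. It also assumes a POSIX host; on Windows, `os.kill` with signal 0 is not a harmless probe.

```python
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX): checks existence, delivers nothing."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def clean_stale_pid_file(pid_path: Path) -> bool:
    """Delete pid_path if its recorded PID is dead. Returns True if removed."""
    try:
        raw = pid_path.read_text().strip()
    except FileNotFoundError:
        return False  # nothing to clean

    try:
        record = json.loads(raw)
        # Assumed formats: {"pid": 1234, ...} or a bare 1234.
        pid = int(record["pid"]) if isinstance(record, dict) else int(record)
    except (ValueError, KeyError, TypeError):
        return False  # unreadable record: leave it for the existing guard

    if pid_is_alive(pid):
        return False  # possibly a real running instance; do not touch

    logger.warning("Removing stale PID file %s (PID %d is not running)", pid_path, pid)
    pid_path.unlink(missing_ok=True)
    return True
```

A liveness probe alone cannot tell a live gateway from an unrelated process that has reused the PID, which is why the sketch leaves any live PID for the existing guard to judge.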

Testing

  • Verified that the existing get_running_pid() handles PID reuse (start_time mismatch) and cmdline validation
  • The new block covers the simpler "process is dead but PID file remains" case that users hit in practice
  • json, os, logger, and get_hermes_home are all already in scope

When the gateway crashes hard (SIGKILL, OOM-kill, segfault), the
atexit handler never fires, leaving a stale PID file. On restart,
the duplicate-instance guard may fail to clean it up under certain
race conditions, preventing the gateway from starting.

Add a pre-startup defense that checks if the PID file exists but
the recorded PID is no longer running, and force-deletes the stale
file before the existing guards execute.
@alt-glitch added the type/bug (Something isn't working) and comp/gateway (Gateway runner, session dispatch, delivery) labels on Apr 21, 2026
@alt-glitch

Collaborator

Likely duplicate of #13559 — both add pre-startup stale PID file cleanup in gateway/run.py after hard crashes. Also overlaps with #9703.

@natehale

Author

Comparison with Related PRs (#9703 and #13559)

Hi all — I wanted to cross-reference this PR with two other open efforts that address the same stale gateway.pid issue, in the interest of converging on the best combined approach.

cc @batumilove @john-livingston — would love your input!

PR #9703 (fix(gateway): clear stale pid file before service start) by @john-livingston takes a clean, minimal approach by adding ExecStartPre=-/bin/rm -f to the systemd service templates. This is elegant in its simplicity — it sidesteps the race condition entirely by removing the file before Python even starts. The limitation is that it only covers systemd-managed deployments, leaving CLI and container starts without the same protection.

PR #13559 (fix(gateway): resolve stale PID file blocking startup after forced kill) by @batumilove does excellent work identifying a logic bug in get_running_pid() where _looks_like_gateway_process() and _record_looks_like_gateway() can return contradictory results, causing the stale-file case to go unhandled. Their fix — reading /proc/<pid>/cmdline to verify the live process — is thorough and works across all deployment types. Really nice root-cause analysis.
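
For readers unfamiliar with the procfs check mentioned above, here is a rough sketch of reading a process's command line on Linux. It is not taken from #13559: `read_cmdline`, `cmdline_mentions`, the `proc_root` parameter, and the `gateway` marker string are all hypothetical names chosen for illustration.

```python
from pathlib import Path
from typing import List, Optional, Sequence


def read_cmdline(pid: int, proc_root: Path = Path("/proc")) -> Optional[List[str]]:
    """Return the argv of `pid` from procfs, or None if it cannot be read."""
    try:
        raw = (proc_root / str(pid) / "cmdline").read_bytes()
    except OSError:
        return None  # process gone, no procfs, or permission denied
    # Arguments are NUL-separated; empty fragments are dropped for simplicity.
    return [part.decode("utf-8", "replace") for part in raw.split(b"\0") if part]


def cmdline_mentions(argv: Optional[Sequence[str]], markers: Sequence[str] = ("gateway",)) -> bool:
    """True if any marker substring appears in the joined command line."""
    if not argv:
        return False  # unreadable or empty (e.g. a zombie) counts as no match
    joined = " ".join(argv)
    return any(marker in joined for marker in markers)
```

Making `proc_root` a parameter keeps the helper testable against a fake directory tree, and returning `None` on failure lets the caller decide how to treat platforms without `/proc`.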

This PR (#13658) adds a pre-startup os.kill(pid, 0) cleanup block as a lightweight fallback before the existing guard runs. It's designed to be purely additive and deployment-agnostic.

How they relate:

These approaches are complementary rather than competing:

I think #13559 addresses the most important issue — the logic contradiction in get_running_pid() — and would be great to get merged. If there's interest, I'd be happy to incorporate that fix here, or rebase this PR on top of #13559 to provide the defense-in-depth layer.

One small note: #13559 also includes a GLM/Zhipu error 1213 fix, which might be cleaner as a separate PR to keep review scope focused.

Would love to hear thoughts from both of you on whether a combined approach makes sense. Great work from everyone on tackling this issue from different angles.

@john-livingston


Thanks for the thorough cross-reference @natehale — really helpful to have all three approaches laid out together.

Having now read the actual diffs for #13559 and #13709, I think the picture is clearer than it first appeared. They address genuinely different failure modes:

Both should merge; neither makes the other redundant.

As for #9703: it operates at a completely different layer (hermes_cli/gateway.py, not gateway/status.py) and covers a scenario none of the Python-level fixes address — when Python can't start cleanly at all. The ExecStartPre=-/bin/rm -f line runs before Python is invoked, so even if get_running_pid() logic is perfect, a stale file left after a hard kill between systemd restarts is cleared unconditionally. It's the OS-level belt to the Python-level suspenders. No conflict with any of the above.

Happy to rebase #9703 on top of whatever merges first.

@natehale closed this on Apr 29, 2026

Labels

comp/gateway (Gateway runner, session dispatch, delivery), type/bug (Something isn't working)


3 participants