fix(gateway): recover stale pid and planned restart state (salvage #14179) #14200
Merged
Follow-up for salvaged PR #14179. `_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the default PID path, but that helper defensively refuses to delete a PID file whose pid field differs from `os.getpid()` (to protect --replace handoffs). Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd gateway left behind a PID file owned by a now-dead foreign PID. Once `get_running_pid()` has confirmed the runtime lock is inactive, the on-disk metadata is known to belong to a dead process, so we can force-unlink both the PID file and the sibling `gateway.lock` directly instead of going through the defensive helper. Also adds a regression test with a dead foreign PID that would have failed against the previous cleanup logic.
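The difference between the defensive helper and the forced cleanup can be sketched roughly as follows. This is an illustration only, not the PR's code: `defensive_remove` and `force_unlink_stale` are hypothetical names, and the JSON PID-file layout with a `pid` key is an assumption. It also assumes the caller has already confirmed that the runtime lock is inactive.

```python
import json
import os
import tempfile
from pathlib import Path


def defensive_remove(pid_path: Path) -> bool:
    """Hypothetical stand-in for the defensive helper: delete only our own PID file."""
    try:
        recorded = json.loads(pid_path.read_text()).get("pid")
    except (OSError, ValueError, AttributeError):
        return False  # unreadable or malformed metadata: leave it alone
    if recorded != os.getpid():
        return False  # foreign PID: refuse, to protect a --replace handoff
    pid_path.unlink(missing_ok=True)
    return True


def force_unlink_stale(pid_path: Path, lock_path: Path) -> None:
    """Remove both files no matter which PID wrote them.

    Assumption: the caller has already established that no live gateway
    holds the runtime lock, so the metadata belongs to a dead process.
    """
    for path in (pid_path, lock_path):
        path.unlink(missing_ok=True)  # a file that is already gone is fine


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"
        lock_path = Path(tmp) / "gateway.lock"
        # Simulate a PID file left behind by some other (foreign) process.
        pid_path.write_text(json.dumps({"pid": os.getpid() + 1}))
        lock_path.touch()
        print("defensive helper removed file:", defensive_remove(pid_path))
        force_unlink_stale(pid_path, lock_path)
        print("files remain:", pid_path.exists(), lock_path.exists())
```

The defensive check stays useful during a live handoff. The forced path is only safe because the lock check has already ruled out a running owner.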
Salvages #14179 by @helix4u onto current main plus a follow-up cleanup fix.
What this PR does
Adds a real OS-held `gateway.lock` (fcntl on POSIX, msvcrt on Windows) as the source of truth for "is a gateway alive?", so a stale `gateway.pid` from a Ctrl+C'd or crashed gateway no longer blocks the next `hermes gateway start`. The OS releases the lock automatically when the process dies, so the next startup cleanly detects the old instance is gone.
Also hardens `hermes gateway restart` for systemd-managed gateways: clears failed state, issues `reset-failed` + `start`, and waits for the replacement process instead of leaving operators in a silent dead window after a planned `exit 75` restart.
Closes the Discord thread where a user hit Ctrl+C, closed the terminal before the gateway finished shutting down, and couldn't start the gateway again, matching helix4u's original report.
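A minimal sketch of the OS-held lock idea, for readers unfamiliar with the pattern. It is not the PR's implementation: `try_acquire` and `is_lock_held` are hypothetical names, and using `fcntl.flock` (rather than another fcntl call) on POSIX and `msvcrt.locking` on one byte on Windows are assumptions made for illustration.

```python
import sys
import tempfile
from pathlib import Path
from typing import IO, Optional

if sys.platform == "win32":
    import msvcrt
else:
    import fcntl


def try_acquire(lock_path: Path) -> Optional[IO[str]]:
    """Return an open handle holding an exclusive lock, or None if it is already held."""
    handle = open(lock_path, "a+")
    try:
        if sys.platform == "win32":
            handle.seek(0)
            # Non-blocking lock on the first byte of the file.
            msvcrt.locking(handle.fileno(), msvcrt.LK_NBLCK, 1)
        else:
            # Non-blocking exclusive advisory lock on the whole file.
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def is_lock_held(lock_path: Path) -> bool:
    """Probe the lock: if we can take it, nobody else is holding it."""
    handle = try_acquire(lock_path)
    if handle is None:
        return True
    handle.close()  # closing the handle gives up the lock we just took
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "gateway.lock"
        holder = try_acquire(path)
        print("held while the owner is alive:", is_lock_held(path))
        holder.close()
        print("held after the owner is gone:", is_lock_held(path))
```

The point of the design is that the lock lives in the kernel rather than in file contents. A process that crashes or is killed cannot leave the lock behind, even though the `gateway.lock` and `gateway.pid` files themselves may still be on disk.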
Follow-up commit (ours, on top of the salvage)
`_cleanup_invalid_pid_path` originally called `remove_pid_file()` for the default PID path, but that helper defensively refuses to delete a PID file whose pid differs from `os.getpid()` (to protect `--replace` handoffs). Every realistic stale-PID scenario is exactly that case: a crashed gateway's PID file belongs to a now-dead foreign PID.
Fix: once `get_running_pid()` has confirmed the runtime lock is inactive, the on-disk metadata is known stale, so force-unlink both `gateway.pid` and `gateway.lock` directly. Added a regression test with a dead foreign PID.
Validation
`os._exit` (abnormal) → parent sees lock released AND both files unlinked on next `get_running_pid()` call
Credit
Authored by @helix4u (commit 1c5d088 preserved via rebase-merge). Follow-up cleanup commit by us.