fix(gateway): resolve stale PID file blocking startup after forced kill #13559
Open
batumilove wants to merge 2 commits into
added 2 commits
April 21, 2026 10:57
When systemd kills the gateway process (e.g. TimeoutStopSec exceeded during drain), the PID file remains on disk. If the OS later reuses that PID for an unrelated process, get_running_pid() would see a live PID and check _record_looks_like_gateway(). Because the stale record still contains gateway metadata, it matched and the PID was reported as 'running'.

This caused a restart loop: systemd restarts the gateway, it sees its own stale PID as 'another instance running', exits, systemd restarts again.

The fix: when _looks_like_gateway_process() returns False (the live PID's /proc entry doesn't look like the gateway), read the actual cmdline from /proc before falling back to the stored record. If the cmdline is readable, trust it over the stale record: a non-gateway cmdline means the PID file is stale and should be cleaned up. Only fall back to _record_looks_like_gateway() when /proc is unreadable (container/capability edge case).
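The decision described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in gateway/status.py: the helper names read_proc_cmdline and pid_file_is_stale, the marker substring "gateway", and the choice to treat an empty cmdline like an unreadable one are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional


def read_proc_cmdline(pid: int) -> Optional[str]:
    """Return the command line of `pid` from /proc, or None if it cannot be read.

    Hypothetical helper; the real implementation may differ.
    """
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        # No /proc, the process is gone, or permission was denied.
        return None
    # The kernel separates arguments with NUL bytes.
    return raw.replace(b"\x00", b" ").decode("utf-8", errors="replace").strip()


def pid_file_is_stale(cmdline: Optional[str], record_looks_like_gateway: bool,
                      marker: str = "gateway") -> bool:
    """Decide staleness once the live-process check has already returned False.

    `marker` is an assumed substring that identifies a gateway command line.
    """
    if cmdline:
        # /proc was readable: trust the live cmdline over the stored record.
        return marker not in cmdline
    # /proc unreadable (or empty): keep the old behaviour and trust the record.
    return not record_looks_like_gateway
```

With this shape, a reused PID whose cmdline is something like "/usr/bin/sleep 1000" is reported as stale even though the stored record still looks like a gateway, while an unreadable /proc leaves the previous record-based answer unchanged.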
Problem
When systemd kills the gateway process (e.g. TimeoutStopSec exceeded during drain of in-flight Telegram sessions), the PID file (gateway.pid) remains on disk. If the OS later reuses that PID for an unrelated process, get_running_pid() incorrectly reports it as "still running", causing a restart loop.

The root cause: _looks_like_gateway_process() returns False (the reused PID is not the gateway), but _record_looks_like_gateway() returns True (the stale record still contains gateway metadata). The old code only cleaned up when both checks agreed the PID was stale.

Fix
In get_running_pid(), when _looks_like_gateway_process() returns False, read the actual /proc/<pid>/cmdline before falling back to the stored record:
- If the cmdline is readable, trust it over the stored record; a non-gateway cmdline means the PID file is stale and is cleaned up.
- If /proc is unreadable, fall back to _record_looks_like_gateway() (preserves old behavior).

Testing
- systemctl --user kill hermes-gateway, confirmed restart loop with "PID file race lost" errors
- --replace flow still works correctly

Changes
gateway/status.py: 9 lines added, 1 removed in get_running_pid() around line 610
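The systemd reproduction needs a real service, but the underlying condition (a PID file carrying gateway metadata that points at a live, unrelated process) can be approximated in isolation. The sketch below is a rough stand-in, not the project's test suite: the JSON record with pid and kind fields is a hypothetical format, since the real layout of gateway.pid is not shown here, and cmdline_of is an illustrative helper.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def cmdline_of(pid):
    """Return argv of `pid` from /proc as a list, or None if unreadable."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return None
    return [arg.decode("utf-8", "replace") for arg in raw.split(b"\x00") if arg]


def simulate_reused_pid():
    """Write a stale 'gateway' record that points at an unrelated live process."""
    # The child stands in for whatever process the OS gave the old PID to.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        with tempfile.TemporaryDirectory() as tmp:
            pid_file = Path(tmp) / "gateway.pid"
            # Hypothetical record format: the metadata still claims "gateway".
            pid_file.write_text(json.dumps({"pid": child.pid, "kind": "gateway"}))
            record = json.loads(pid_file.read_text())
            return record, cmdline_of(record["pid"])
    finally:
        child.kill()
        child.wait()
```

On a Linux host the returned argv belongs to the sleeping interpreter, so the record and the live cmdline disagree, which is the case the fix resolves in favour of the cmdline. Where /proc is unavailable the helper returns None, which corresponds to the fallback path.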