fix: force-clean stale PID file on gateway startup #13658
When the gateway crashes hard (SIGKILL, OOM-kill, segfault), the atexit handler never fires, leaving a stale PID file. On restart, the duplicate-instance guard may fail to clean it up under certain race conditions, preventing the gateway from starting. Add a pre-startup defense that checks if the PID file exists but the recorded PID is no longer running, and force-deletes the stale file before the existing guards execute.
Likely duplicate of #13559
Comparison with Related PRs (#9703 and #13559)
Hi all, I wanted to cross-reference this PR with two other open efforts that address the same stale
cc @batumilove @john-livingston, would love your input!
PR #9703 (
PR #13559 (
This PR (#13658) adds a pre-startup
How they relate: These approaches are complementary rather than competing:
I think #13559 addresses the most important issue: the logic contradiction in
One small note: #13559 also includes a GLM/Zhipu error 1213 fix, which might be cleaner as a separate PR to keep review scope focused.
Would love to hear thoughts from both of you on whether a combined approach makes sense. Great work from everyone on tackling this issue from different angles.
Thanks for the thorough cross-reference @natehale, it is really helpful to have all three approaches laid out together.
Having now read the actual diffs for #13559 and #13709, I think the picture is clearer than it first appeared. They address genuinely different failure modes:
Both should merge; neither makes the other redundant.
As for #9703: it operates at a completely different layer (
Happy to rebase #9703 on top of whatever merges first.
Problem
When the gateway crashes hard (SIGKILL, OOM-kill, segfault), the atexit handler never fires, leaving a stale gateway.pid file. On restart, the duplicate-instance guard can fail to detect the dead process under certain race conditions, preventing the gateway from starting and requiring manual intervention to delete the PID file.
Solution
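Add a pre-startup force-cleanup block in gateway/run.py that runs BEFORE the existing get_running_pid() duplicate-instance guard. If the PID file exists but os.kill(pid, 0) shows that the recorded PID is no longer running, the block force-deletes the stale file.
This catches the most common crash scenario: the previous gateway dies hard, the PID file stays, and the next startup cleans it up automatically.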
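For reviewers who want the shape of the check without opening the diff, here is a minimal sketch of a pre-startup stale-PID cleanup. It is not the code from this PR: the function names pid_is_alive and force_clean_stale_pid_file are hypothetical, and it assumes the PID file is JSON with a "pid" key, which is a guess about the file format. It also assumes a POSIX system, since os.kill(pid, 0) does not act as a harmless probe on Windows.

```python
import json
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without delivering a signal."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def force_clean_stale_pid_file(pid_path: Path) -> bool:
    """Delete pid_path if it records a PID that is no longer running.

    Returns True only when a stale file was removed.
    ASSUMPTION: the file holds JSON such as {"pid": 1234}.
    """
    try:
        raw = pid_path.read_text()
    except FileNotFoundError:
        return False  # nothing to clean

    try:
        pid = int(json.loads(raw)["pid"])
    except (ValueError, KeyError, TypeError):
        # Unreadable content: leave it for the existing guards to handle.
        return False

    if pid_is_alive(pid):
        return False  # a live process owns this PID; not stale by this check

    try:
        pid_path.unlink()
    except FileNotFoundError:
        pass  # another starter removed it first
    logger.warning("Removed stale PID file %s (PID %d is not running)", pid_path, pid)
    return True
```

A liveness probe alone cannot tell a surviving gateway from an unrelated process that has reused the same PID, so a check like this only makes sense ahead of guards that also compare start time and command line, as the existing get_running_pid() is described as doing.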
Testing
get_running_pid() handles PID reuse (start_time mismatch) and cmdline validation
json, os, logger, and get_hermes_home are all already in scope