fix(gateway): catch OSError in get_running_pid() PID existence check (Windows crash loop) #11140
ArecaNon wants to merge 1 commit into
On Windows, os.kill(pid, 0) can raise a bare OSError (notably WinError 11 "An attempt was made to load a program with an incorrect format") when the recorded PID belongs to a process of a different architecture or in an inspection-blocked state. The previous exception handler only caught ProcessLookupError and PermissionError, so the bare OSError propagated out of get_running_pid() and crashed start_gateway().

Under a supervisor that auto-restarts on crash (PM2, NSSM, Windows Service wrapper), this produces a tight restart loop because every restart reads the same stale PID file and re-raises immediately (observed: 2000+ restarts in ~3h on PM2).

ProcessLookupError and PermissionError are both OSError subclasses, so widening the catch to OSError is a strict superset and preserves POSIX behavior. Semantically every os.kill(pid, 0) failure at this point means "cannot confirm process is alive and ours", which is exactly when we should treat the PID file as stale and return None.

Adds a regression test simulating WinError 11 via monkeypatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Duplicate of #14504 (merged): the same OSError catch for os.kill(pid, 0) on Windows in gateway/status.py was already applied.

Thanks for this, appreciate the work. We're closing the entire cluster of open native-Windows PRs (44 of them spanning installer, terminal routing, file ops, gateway PID handling, encoding, docs, and more) because the surface area needs a designed, consolidated approach rather than piecemeal merges. Cherry-picking individual fixes keeps leaving inconsistencies and we'd rather land Windows support properly, in one coherent pass.

Your PR is catalogued in our internal Windows support plan. When we pick this back up (soon), we'll mine every PR in the cluster for its fix shape and credit all contributors whose work informs the final patch via lines. Watch for the consolidating PR and feel free to chime in with context on the specific failure mode you were hitting.

Closing for now, not as a rejection of the fix, just queueing it for the designed rollout. Thanks again.

Problem
On Windows, `os.kill(pid, 0)` can raise a bare `OSError` (specifically `OSError: [WinError 11] An attempt was made to load a program with an incorrect format.`) when the target PID belongs to a process of a different architecture or is in an inspection-blocked state. The current exception handler at `gateway/status.py:419` only catches `ProcessLookupError` and `PermissionError`, so the bare `OSError` propagates up and crashes `start_gateway()`.
When the gateway is managed by a supervisor that auto-restarts on crash (PM2, NSSM, Windows Service wrapper, etc.), this produces a tight crash loop because every restart reads the same stale PID file and re-raises immediately.
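
The failure mode can be reproduced on any platform without touching a real process. The sketch below is illustrative only: `fake_windows_kill`, `fake_posix_kill`, `pid_is_alive_narrow`, and `probe` are hypothetical names, not functions from `gateway/status.py`, and the Windows error is simulated with a plain `OSError` carrying the WinError 11 message.

```python
def fake_windows_kill(pid, sig):
    """Stand-in for os.kill on Windows; raises a bare OSError like WinError 11."""
    raise OSError(
        "[WinError 11] An attempt was made to load a program "
        "with an incorrect format"
    )


def fake_posix_kill(pid, sig):
    """Stand-in for os.kill on POSIX when the PID no longer exists."""
    raise ProcessLookupError("No such process")


def pid_is_alive_narrow(pid, kill):
    """Existence check using the narrow (pre-fix) exception tuple."""
    try:
        kill(pid, 0)
    except (ProcessLookupError, PermissionError):
        return False
    return True


def probe(pid, kill):
    """Report what a caller such as start_gateway() would observe."""
    try:
        return f"handled: alive={pid_is_alive_narrow(pid, kill)}"
    except OSError as exc:
        return f"escaped: {type(exc).__name__}"


if __name__ == "__main__":
    print(probe(12345, fake_posix_kill))    # handled: alive=False
    print(probe(12345, fake_windows_kill))  # escaped: OSError
```

The POSIX-style error is absorbed by the narrow tuple, while the bare `OSError` passes straight through it, which is the path that reaches the supervisor as a crash.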
Observed impact
Local reproduction on Windows 11 + PM2:
- [...] `OSError: [WinError 11]` at `gateway/status.py:418`
- [...] `remove_pid_file()` could run
Root cause
The Python docs (https://docs.python.org/3/library/os.html#os.kill) note that on Windows, `os.kill` may raise various `OSError` subclasses depending on the error mode, and `ProcessLookupError` is not guaranteed even when the target is genuinely absent or inspection-denied. POSIX `os.kill` is more predictable in raising `ProcessLookupError` specifically.
Fix
Add `OSError` to the except tuple. This is a safe superset: `ProcessLookupError` and `PermissionError` are both subclasses of `OSError`, so the new handler catches everything the old one caught, plus the Windows edge cases. Semantically, any `os.kill(pid, 0)` failure here means "I cannot confirm the process is alive and mine", which is exactly when we should remove the stale PID file and return `None`.
Testing
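
A regression check for the widened handler could look roughly like the following. This is a sketch under stated assumptions, not the test added in this PR: `get_running_pid` here is a simplified, hypothetical stand-in for the real function, the `gateway.pid` file name is made up, and the stdlib `unittest.mock` is used in place of pytest's `monkeypatch` so the block stays self-contained.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock

WINERROR_11 = (
    "[WinError 11] An attempt was made to load a program with an incorrect format"
)


def get_running_pid(pid_path):
    """Hypothetical stand-in for the real function; not the project's code."""
    try:
        pid = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    try:
        os.kill(pid, 0)  # always patched in the checks below, never really sent
    except OSError:
        # Widened catch: also covers ProcessLookupError and PermissionError.
        pid_path.unlink(missing_ok=True)  # treat the PID file as stale
        return None
    return pid


def run_check(kill_error):
    """Run get_running_pid() with os.kill patched.

    Returns (result, pid_file_still_exists). Passing None as kill_error
    makes the patched os.kill succeed, simulating a live process.
    """
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = Path(tmp) / "gateway.pid"  # assumed file name
        pid_path.write_text("12345")
        with mock.patch.object(os, "kill", side_effect=kill_error):
            result = get_running_pid(pid_path)
        return result, pid_path.exists()


if __name__ == "__main__":
    print(run_check(OSError(WINERROR_11)))   # bare OSError: stale, file removed
    print(run_check(ProcessLookupError()))   # POSIX behavior preserved
    print(run_check(PermissionError()))      # POSIX behavior preserved
    print(run_check(None))                   # live process: PID returned
```

Checking the two original exception types alongside the bare `OSError` covers the superset claim: the widened handler should behave the same as before for the cases the old tuple already handled.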
- [...] `test_get_running_pid_handles_bare_oserror_from_os_kill` in `tests/gateway/test_status.py`
- [...] `tests/gateway/test_status.py` continue to pass
Alternative considered
Narrower catch: add only the specific Windows error codes. Rejected because:
Notes
Hermes docs state native Windows is not supported and recommend WSL2. This fix does not change that recommendation -- it simply prevents a crash loop for users who run Hermes on Windows anyway (e.g., via PM2 fork mode before migrating to WSL2), and makes the degradation path graceful.