Skip to content

fix(gateway): catch OSError in get_running_pid() PID existence check (Windows crash loop)#11140

Closed
ArecaNon wants to merge 1 commit into
NousResearch:mainfrom
ArecaNon:fix/windows-oserror-pid-check
Closed

fix(gateway): catch OSError in get_running_pid() PID existence check (Windows crash loop)#11140
ArecaNon wants to merge 1 commit into
NousResearch:mainfrom
ArecaNon:fix/windows-oserror-pid-check

Conversation

@ArecaNon

Copy link
Copy Markdown
Contributor

Problem

On Windows, os.kill(pid, 0) can raise a bare OSError (specifically OSError: [WinError 11] An attempt was made to load a program with an incorrect format.) when the target PID belongs to a process of a different architecture or is in an inspection-blocked state. The current exception handler at gateway/status.py:419 only catches ProcessLookupError and PermissionError, so the bare OSError propagates up and crashes start_gateway().

When the gateway is managed by a supervisor that auto-restarts on crash (PM2, NSSM, Windows Service wrapper, etc.), this produces a tight crash loop because every restart reads the same stale PID file and re-raises immediately.

Observed impact

Local reproduction on Windows 11 + PM2:

  • 2123 restarts in ~3 hours before detection
  • Error log line: OSError: [WinError 11] at gateway/status.py:418
  • Stale PID file was never removed because the handler bailed before remove_pid_file() could run

Root cause

The Python docs (https://docs.python.org/3/library/os.html#os.kill) note that on Windows, os.kill may raise various OSError subclasses depending on the error mode, and ProcessLookupError is not guaranteed even when the target is genuinely absent or inspection-denied. POSIX os.kill is more predictable in raising ProcessLookupError specifically.

Fix

Add OSError to the except tuple. This is a safe superset: ProcessLookupError and PermissionError are both subclasses of OSError, so the new handler catches everything the old one caught, plus the Windows edge cases. Semantically, any os.kill(pid, 0) failure here means "I cannot confirm the process is alive and mine", which is exactly when we should remove the stale PID file and return None.

Testing

  • Added regression test test_get_running_pid_handles_bare_oserror_from_os_kill in tests/gateway/test_status.py
  • Verified on Windows 11 Pro 10.0.22621 + Python 3.12 + PM2
  • Before patch: 2123 restarts in 3h, all failing at the same line
  • After patch: clean uptime with zero new crashes, PID file correctly cleaned up on stale detection
  • POSIX behavior unchanged (ProcessLookupError + PermissionError still handled identically)
  • All 15 existing tests in tests/gateway/test_status.py continue to pass

Alternative considered

Narrower catch: add only the specific Windows error codes. Rejected because:

  1. Windows error code mapping is version-dependent
  2. The semantics of "any os.kill(pid, 0) failure here = treat as not running" is correct for ALL failure modes, including future Windows error codes
  3. The broader catch does not mask any semantically meaningful exception

Notes

Hermes docs state native Windows is not supported and recommend WSL2. This fix does not change that recommendation -- it simply prevents a crash loop for users who run Hermes on Windows anyway (e.g., via PM2 fork mode before migrating to WSL2), and makes the degradation path graceful.

On Windows, os.kill(pid, 0) can raise a bare OSError (notably
WinError 11 "An attempt was made to load a program with an incorrect
format") when the recorded PID belongs to a process of a different
architecture or in an inspection-blocked state. The previous
exception handler only caught ProcessLookupError and PermissionError,
so the bare OSError propagated out of get_running_pid() and crashed
start_gateway(). Under a supervisor that auto-restarts on crash
(PM2, NSSM, Windows Service wrapper), this produces a tight restart
loop because every restart reads the same stale PID file and
re-raises immediately (observed: 2000+ restarts in ~3h on PM2).

ProcessLookupError and PermissionError are both OSError subclasses,
so widening the catch to OSError is a strict superset and preserves
POSIX behavior. Semantically every os.kill(pid, 0) failure at this
point means "cannot confirm process is alive and ours", which is
exactly when we should treat the PID file as stale and return None.

Adds a regression test simulating WinError 11 via monkeypatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists labels Apr 25, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Duplicate of #14504 (merged) — the same OSError catch for os.kill(pid, 0) on Windows in gateway/status.py was already applied.

@teknium1

teknium1 commented May 7, 2026

Copy link
Copy Markdown
Contributor

Thanks for this — appreciate the work. We're closing the entire cluster of open native-Windows PRs (44 of them spanning installer, terminal routing, file ops, gateway PID handling, encoding, docs, and more) because the surface area needs a designed, consolidated approach rather than piecemeal merges. Cherry-picking individual fixes keeps leaving inconsistencies and we'd rather land Windows support properly, in one coherent pass.\n\nYour PR is catalogued in our internal Windows support plan. When we pick this back up (soon), we'll mine every PR in the cluster for its fix shape and credit all contributors whose work informs the final patch via lines. Watch for the consolidating PR and feel free to chime in with context on the specific failure mode you were hitting.\n\nClosing for now, not as a rejection of the fix — just queueing it for the designed rollout. Thanks again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery duplicate This issue or pull request already exists P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants