Skip to content

fix(gateway): catch OSError from os.kill on Windows for stale PID detection#14364

Closed
Bartok9 wants to merge 1 commit into
NousResearch:mainfrom
Bartok9:fix/gateway-windows-oserror-pid-check
Closed

fix(gateway): catch OSError from os.kill on Windows for stale PID detection#14364
Bartok9 wants to merge 1 commit into
NousResearch:mainfrom
Bartok9:fix/gateway-windows-oserror-pid-check

Conversation

@Bartok9

@Bartok9 Bartok9 commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Problem

Fixes #14359.

On Windows, os.kill(existing_pid, 0) raises OSError([WinError 11]) (and other OSError variants) instead of ProcessLookupError when the process no longer exists. The current handler only catches ProcessLookupError and PermissionError:

# Before
except (ProcessLookupError, PermissionError):
    stale = True

This causes every gateway restart on Windows after a non-graceful shutdown (terminal close, crash, Ctrl+C) to fail with an unhandled OSError — requiring manual deletion of the PID file each time.

Root Cause

Windows does not raise ProcessLookupError (ESRCH) for dead PIDs in os.kill. It raises a platform-specific OSError. This is a known Python/Windows platform difference.

Fix

Add OSError to the exception tuple at the one affected call site:

# After
except (ProcessLookupError, PermissionError, OSError):
    stale = True

This matches the pattern already used at two other os.kill call sites in the same file (lines 522 and 633 in the current main), which already include OSError.

Changes

File Change
gateway/status.py Add OSError to exception tuple at line 499
tests/test_gateway_status_pid_check.py New test file — 4 regression tests

Tests

New test file tests/test_gateway_status_pid_check.py covers:

  • ✅ Windows path: os.kill raises OSError(11, ...) → treated as stale (the regression)
  • ✅ Unix path: os.kill raises ProcessLookupError → treated as stale (non-regression)
  • PermissionError path — no unhandled exception (non-regression)
  • ✅ Missing PID file → None returned (baseline)
python -m pytest tests/test_gateway_status_pid_check.py -v

…ection

On Windows, os.kill(pid, 0) raises OSError([WinError 11]) instead of
ProcessLookupError when the target process no longer exists. The current
handler only catches ProcessLookupError and PermissionError, so the OSError
propagates unhandled and the gateway refuses to start until the stale PID
file is manually deleted.

Fix: add OSError to the exception tuple. This matches the pattern already
used at other os.kill call sites in the same file (lines 522, 633).

Adds regression tests covering the Windows OSError path, the existing
ProcessLookupError path, and a missing-PID-file baseline.

Fixes NousResearch#14359
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery labels Apr 23, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing with open PR #11140 for the same root cause. Also related to issues #5760, #12359, #9574 (all os.kill OSError on Windows).

@Bartok9

Bartok9 commented Apr 23, 2026

Copy link
Copy Markdown
Contributor Author

Related: #14388 by @eLeanwang addresses a complementary stale-PID scenario — FileExistsError in write_pid_file() after SIGKILL (atexit doesn't run). This PR covers the os.kill Windows OSError path. Together they close two separate failure modes in gateway startup recovery.

@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the fix, @Bartok9! This is an automated hermes-sweeper review.

The functional change in this PR is already on main. Commit 4c02e459 (fix(status): catch OSError in os.kill(pid, 0) for Windows compatibility) landed about 90 minutes before this PR was opened and made the exact same one-line change to acquire_scoped_lock() plus an additional except OSError in get_running_pid():

  • gateway/status.py line 503: except (ProcessLookupError, PermissionError, OSError):
  • gateway/status.py ~line 780: separate except OSError: continue in get_running_pid()

The one thing not yet on main is your new test file (tests/test_gateway_status_pid_check.py). If you'd like, feel free to open a follow-up PR with just those four regression tests against the existing implementation — that would be a welcome addition.

@teknium1 teknium1 closed this Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

os.kill(pid, 0) in gateway/status.py raises unhandled OSError on Windows

3 participants