fix(windows): handle OSError from os.kill() for non-existent PIDs #12363

Open
octo-patch wants to merge 1 commit into NousResearch:main from octo-patch:fix/issue-12359-windows-oskill-oserror

@octo-patch

Contributor

Fixes #12359

Problem

On Windows, os.kill(pid, 0) raises OSError: [WinError 87] ERROR_INVALID_PARAMETER when the target PID does not exist — unlike POSIX where it raises ProcessLookupError. The existing liveness checks only caught (ProcessLookupError, PermissionError), so a stale gateway.pid file left behind after a crash/reboot would cause hermes gateway run to crash immediately instead of cleaning up and starting fresh.

The same pattern surfaced in three additional call sites that share the same POSIX-only assumption.
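
To make the failure mode concrete, here is a small sketch that runs on any platform. It does not call os.kill() at all; it builds the two exceptions by hand and checks which exception tuples catch them. The helper name is_caught and the hand-built windows_missing error are illustrative assumptions, standing in for the errno 22 OSError that Windows reports.

```python
import errno

# Exception tuple the original liveness checks used, per the description above.
OLD_HANDLED = (ProcessLookupError, PermissionError)
# Tuple after this PR: the same two, plus the OSError base class.
NEW_HANDLED = OLD_HANDLED + (OSError,)


def is_caught(exc: BaseException, handled: tuple) -> bool:
    """Return True if `exc` would be caught by `except handled:` (hypothetical helper)."""
    try:
        raise exc
    except handled:
        return True
    except Exception:
        return False


# What POSIX raises for a missing PID (ESRCH).
posix_missing = ProcessLookupError(errno.ESRCH, "No such process")

# Stand-in for the Windows case: a plain OSError with errno 22 (EINVAL).
# Built by hand here so the sketch does not depend on running on Windows.
windows_missing = OSError(errno.EINVAL, "The parameter is incorrect")

print(is_caught(posix_missing, OLD_HANDLED))
print(is_caught(windows_missing, OLD_HANDLED))
print(is_caught(windows_missing, NEW_HANDLED))
```

The plain OSError is not an instance of either subclass in the old tuple, so it propagates out of the handler; that is the crash described above.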

Solution

Add OSError to the exception tuple at all four affected os.kill(pid, 0) liveness checks, treating it identically to ProcessLookupError (i.e. "process is gone, clean up"):

  • gateway/status.py: get_running_pid() (line 578) and acquire_scoped_lock() (line 343)
  • tools/process_registry.py: _is_host_pid_alive() (line 258)
  • gateway/run.py: --replace wait loop (line 10364)

Each change is a one-line addition. Because ProcessLookupError and PermissionError are both subclasses of OSError, the errors already handled on POSIX are still handled the same way; in practice only the Windows code path is affected.
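
The shape of the patched check can be sketched roughly as follows. This is not the code from the four call sites: the function name pid_is_alive, the injectable kill parameter, and the choice to return False from the handler are assumptions made so the example is self-contained and can be exercised without signalling a real process.

```python
import errno
import os
from typing import Callable


def pid_is_alive(pid: int, kill: Callable[[int, int], None] = os.kill) -> bool:
    """Hypothetical liveness helper mirroring the patched exception tuple.

    `kill` is injectable purely for illustration, so the handler can be
    exercised with fakes instead of a real os.kill() call.
    """
    try:
        kill(pid, 0)
    except (ProcessLookupError, PermissionError, OSError):
        # OSError is the addition from this PR; it covers the Windows
        # error for a PID that no longer exists.
        return False
    return True


def fake_kill_windows(pid: int, sig: int) -> None:
    # Stand-in for the Windows failure (errno 22), built by hand.
    raise OSError(errno.EINVAL, "The parameter is incorrect")


def fake_kill_posix_missing(pid: int, sig: int) -> None:
    raise ProcessLookupError(errno.ESRCH, "No such process")


def fake_kill_ok(pid: int, sig: int) -> None:
    return None


print(pid_is_alive(1234, kill=fake_kill_windows))
print(pid_is_alive(1234, kill=fake_kill_posix_missing))
print(pid_is_alive(1234, kill=fake_kill_ok))
```

Listing the two subclasses next to OSError is redundant from the interpreter's point of view, but it keeps the diff to a one-line addition and leaves the original intent readable.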

Testing

Verified locally on Windows 11 (Python 3.11.15, Hermes v0.10.0) with a stale gateway.pid — gateway now starts cleanly after applying these changes.
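
The manual check above could also be approximated on any platform by simulating the Windows error with unittest.mock. The sketch below is an assumption-laden illustration, not part of this PR: read_live_pid is a hypothetical stand-in for the stale-PID handling (it is not the real get_running_pid()), and the PID value is arbitrary.

```python
import errno
import os
import tempfile
from pathlib import Path
from typing import Optional
from unittest import mock


def read_live_pid(pid_file: Path) -> Optional[int]:
    """Hypothetical stand-in: return the recorded PID if alive, else clean up."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    try:
        os.kill(pid, 0)
    except (ProcessLookupError, PermissionError, OSError):
        # Stale file: the recorded process is gone, so remove it and start fresh.
        pid_file.unlink(missing_ok=True)
        return None
    return pid


def check_stale_pid_is_cleaned_up() -> bool:
    """Simulate the Windows error for a dead PID and verify the cleanup path."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = Path(tmp) / "gateway.pid"
        pid_file.write_text("424242")  # arbitrary PID for the simulation
        windows_error = OSError(errno.EINVAL, "The parameter is incorrect")
        # Patch os.kill so no real process is ever signalled.
        with mock.patch("os.kill", side_effect=windows_error):
            result = read_live_pid(pid_file)
        return result is None and not pid_file.exists()


print(check_stale_pid_is_cleaned_up())
```

A check along these lines would let the regression be covered in CI on non-Windows runners, at the cost of trusting that the simulated error matches what Windows actually raises.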

 NousResearch#12359)

On Windows, os.kill(pid, 0) raises OSError (WinError 87 / errno 22)
for non-existent PIDs rather than ProcessLookupError as on POSIX.
The existing liveness checks only caught (ProcessLookupError, PermissionError),
causing a crash when a stale gateway.pid references a dead PID.

Add OSError to all four affected exception handlers:
- gateway/status.py: get_running_pid() and acquire_scoped_lock()
- tools/process_registry.py: _is_host_pid_alive()
- gateway/run.py: --replace wait loop

@alt-glitch added labels on Apr 23, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery)

@alt-glitch

Collaborator

Competing fix with #14364 and #11140 for the same Windows OSError. This PR covers 4 call sites; others may be more narrowly scoped.

Development

Successfully merging this pull request may close these issues.

[BUG]:Windows: os.kill(pid, 0) raises OSError [WinError 87] and crashes gateway startup

2 participants