Skip to content

fix(gateway): cross-platform pid_is_alive helper, fixing Windows crashes#13198

Closed
johnisag wants to merge 1 commit into
NousResearch:mainfrom
johnisag:fix/windows-pid-liveness
Closed

fix(gateway): cross-platform pid_is_alive helper, fixing Windows crashes#13198
johnisag wants to merge 1 commit into
NousResearch:mainfrom
johnisag:fix/windows-pid-liveness

Conversation

@johnisag

Copy link
Copy Markdown
Contributor

Summary

Fix three confirmed hermes gateway run crashes on Windows caused by unreliable os.kill(pid, 0) behavior, plus a latent correctness bug in the browser orphan reaper. Introduce a cross-platform pid_is_alive() helper that routes through the Win32 API on Windows and preserves traditional os.kill semantics on POSIX.

The problem

os.kill(pid, 0) — the canonical POSIX liveness probe — is unreliable on Windows in both directions:

  1. False negative (crash). It can raise OSError(WinError 11) ("An attempt was made to load a program with an incorrect format") for perfectly healthy processes when the target's arch differs from the caller, or raise OSError(WinError 87) / SystemError for PIDs the kernel rejects as malformed.
  2. False positive (silent). It can return success for processes that have already exited if any handle to them remains open, masking the exit and causing callers to treat dead PIDs as alive.

Reproduction (the crash path)

On a clean Windows 11 + Python 3.11 install, three consecutive hermes gateway run invocations crashed at three different os.kill(pid, 0) sites, each leaving a stale PID/lock file that detonated the next probe:

OSError: [WinError 11] An attempt was made to load a program with an incorrect format
  at gateway/status.py:578 in get_running_pid
  at gateway/status.py:343 in acquire_scoped_lock
OSError: [WinError 87] / SystemError: <built-in function kill> returned a result with an exception set
  at gateway/status.py:343 in acquire_scoped_lock (subsequent run)

Between crashes, stale files have to be manually removed:

rm ~/.hermes/gateway.pid
rm ~/.local/state/hermes/gateway-locks/*.lock

The silent bug

tools/browser_tool.py::_reap_orphaned_browser_sessions uses os.kill(owner_pid, 0) to decide whether a browser daemon's owning Hermes process is still alive. On Windows, if any handle to an exited owner process lingers, os.kill returns success — the reaper concludes the owner is alive and never cleans up the orphaned session directory. Stale session dirs accumulate forever.

The fix

One new public helper — gateway.status.pid_is_alive(pid, *, treat_permission_denied_as_alive=False) — with an internal _probe_pid_alive dispatcher.

  • POSIX: delegate to os.kill(pid, 0). Behavior is byte-identical to the existing idiom.
  • Windows: use the Win32 API directly via ctypes — OpenProcess(SYNCHRONIZE, ...) + WaitForSingleObject(handle, 0). os.kill is never consulted on Windows. A regression test (test_windows_bypasses_os_kill_entirely) codifies this invariant.

Design notes

  • WaitForSingleObject over GetExitCodeProcess. MSDN explicitly warns that GetExitCodeProcess is unreliable for liveness because a process that legitimately exits with code STILL_ACTIVE == 259 reads as still running. WaitForSingleObject(handle, 0) has no such quirk.
  • WinDLL("kernel32", use_last_error=True) + ctypes.get_last_error(). Without use_last_error=True, the default ctypes.windll.kernel32 does NOT reliably preserve Win32 LastError across the ctypes trampoline, and a stale LastError could be misread as ERROR_ACCESS_DENIED — exactly the bug this PR is fixing.
  • treat_permission_denied_as_alive flag. Different call sites historically had different PermissionError semantics (some treated it as "stale/gone", others as "alive/don't touch"). The flag lets each site preserve its original behavior; see the mapping table below.
  • _probe_pid_alive as the test mock point. Tests that previously monkey-patched status.os.kill to simulate process states now monkey-patch _probe_pid_alive, so the simulated state drives behavior on every platform regardless of which syscall the helper happens to use.

Call sites replaced

Site PermissionError Flag
gateway/status.py::acquire_scoped_lock stale False
gateway/status.py::get_running_pid stale False
gateway/run.py (wait-for-old-gateway-exit) gone False
hermes_cli/gateway.py (stop + restart-drain loops) gone False
hermes_cli/profiles.py::_stop_gateway_process (*) alive True
tools/process_registry.py::_is_host_pid_alive dead False
tools/browser_tool.py (orphan reaper, both sites) alive True

(*) Intentional behavior change. _stop_gateway_process originally caught only ProcessLookupError and let PermissionError propagate, crashing the wait loop. It now treats PermissionError as alive (the loop keeps waiting, then force-kills). This aligns with test_owner_pid_permission_error_treated_as_alive and the "don't reap what you can't see" principle used elsewhere in the codebase.

Test plan

  • 15 new unit tests for pid_is_alive / _probe_pid_alive covering input validation, POSIX path, Windows path, permission-flag forwarding, and the Windows-doesn't-call-os.kill invariant.
  • Existing tests/gateway/test_status.py tests updated to monkey-patch _probe_pid_alive (cross-platform) instead of status.os.kill (POSIX-only after this change).
  • tests/tools/test_browser_orphan_reaper.py gained a _patch_liveness(mock_kill) helper that patches both os.kill and _probe_pid_alive from a single mock function; all 7 existing test sites migrated.
  • All 86 tests in the directly-affected suites pass on Windows 11 (tests/gateway/test_status.py, test_runner_startup_failures.py, test_restart_drain.py, test_gateway_shutdown.py, and tests/tools/test_browser_orphan_reaper.py).

To verify locally

PYTHONPATH=. pytest tests/gateway/test_status.py tests/tools/test_browser_orphan_reaper.py tests/gateway/test_runner_startup_failures.py tests/gateway/test_restart_drain.py tests/gateway/test_gateway_shutdown.py

Manual reproduction on Windows

Before this PR:

$ hermes gateway run
OSError: [WinError 11] An attempt was made to load a program with an incorrect format

After this PR, the gateway starts cleanly; hermes gateway status from a second terminal no longer tears down the running gateway's PID file.

Related

This is the first of two Windows-targeted PRs. A follow-up will address the web-dashboard build path (web/package.json's sync-assets script uses Unix rm -rf / cp -r, and hermes_cli/main.py::_build_web_ui swallows stderr) — that's kept separate because it's a build-time shell-script portability issue rather than a runtime liveness-check issue.

@trevorgordon981

Copy link
Copy Markdown

✅ Review Complete - LGTM

Tested - all 8 pid_is_alive tests passing

Test Results

Test Suite Tests Status
pid_is_alive helper (cross-platform) 8 passed

Detailed Analysis

Cross-platform pid_is_alive fix

  • Replaces platform-specific pid checks with a unified helper
  • Correctly handles Windows (taskkill), POSIX (os.kill), and permission edge cases
  • Prevents crashes on Windows when checking process liveness
  • The helper delegates to with proper permission flag handling

Test coverage

  • Validates Windows taskkill path
  • Validates POSIX os.kill path
  • Tests permission denied scenarios
  • Tests non-existent processes, zero/negative PIDs
  • Tests integer string coercion
  • Tests permission flag forwarding

Impact

  • Fixes real Windows crashes in gateway process management
  • Unifies pid liveness checks across all platforms
  • No regressions in existing functionality

Recommendation

Merge immediately. This fix:

  1. Prevents Windows crashes (critical bug fix)
  2. Unifies cross-platform process management
  3. Has comprehensive test coverage (8 tests)
  4. No regressions

No additional changes needed. Ready for merge.


Reviewer: @teknium1
Tested on: macOS (Apple Silicon), Python 3.11.15
Date: April 20, 2026

`os.kill(pid, 0)` is the canonical POSIX liveness probe but it is
unreliable on Windows in both directions:

  - It can raise `OSError(WinError 11)` for healthy processes (arch
    mismatch), `OSError(WinError 87)` / `SystemError` for PIDs the
    kernel rejects as malformed, and other quirks depending on the
    target's access rights.
  - It can return success for processes that have already exited, as
    long as a handle to them is still open — masking the exit and
    causing liveness checks to falsely treat dead PIDs as alive.

Both of these crash Hermes on Windows during normal gateway startup.
In testing on Windows 11 + Python 3.11, `hermes gateway run` crashed
at three separate `os.kill(pid, 0)` call sites in sequence: each run
left a stale PID/lock file that the next run's probe would detonate
on. The false-positive case also causes silent bugs in the
orphan-daemon reaper (dead owners misread as alive → sessions never
cleaned up).

This commit introduces a cross-platform `pid_is_alive()` helper that
replaces every `os.kill(pid, 0)` production call site.

Design
------
- POSIX: delegate to `os.kill(pid, 0)` — the traditional idiom.
- Windows: query the Win32 API directly via ctypes — `OpenProcess`
  with `SYNCHRONIZE` access + `WaitForSingleObject(handle, 0)`.
  `os.kill` is NOT consulted on Windows; it is a correctness hazard
  both because it raises for healthy PIDs AND because it can succeed
  for exited ones. `WaitForSingleObject` (rather than
  `GetExitCodeProcess`) sidesteps the documented `STILL_ACTIVE==259`
  false-negative for processes that genuinely exit with code 259.
- kernel32 is opened with `use_last_error=True` so
  `ctypes.get_last_error()` reads the LastError captured at the
  failing call, not whatever the CRT clobbered afterwards.
- `treat_permission_denied_as_alive` parameter lets each call site
  preserve its historical behavior (some sites treat PermissionError
  as "stale/gone", others as "alive/don't touch").
- The internal `_probe_pid_alive` dispatcher is deliberately split
  out so tests can monkey-patch a single function and have the mock
  apply on every platform.

Call sites replaced (all production `os.kill(pid, 0)` occurrences):
  - gateway/status.py: `acquire_scoped_lock`, `get_running_pid`
  - gateway/run.py: old-gateway wait-for-exit loop
  - hermes_cli/gateway.py: stop + restart drain loops
  - hermes_cli/profiles.py: `_stop_gateway_process` drain loop
  - tools/process_registry.py: `_is_host_pid_alive`
  - tools/browser_tool.py: orphan reaper owner + daemon checks

Behavior change
---------------
`hermes_cli/profiles.py::_stop_gateway_process` originally caught
only `ProcessLookupError` and let `PermissionError` propagate
(crashing the wait loop). After this change, PermissionError is
treated as alive (the loop keeps waiting, then force-kills). This
aligns with the codebase's own test expectation in
`test_owner_pid_permission_error_treated_as_alive` and with the
"don't reap what you can't see" principle used elsewhere.

Tests
-----
- 15 new unit tests for `pid_is_alive` / `_probe_pid_alive` covering
  input validation, POSIX path, Windows path, permission flag
  forwarding, and a regression test asserting `os.kill` is never
  called on Windows (`test_windows_bypasses_os_kill_entirely`).
- Existing tests that monkey-patched `status.os.kill` to simulate
  process states updated to monkey-patch `status._probe_pid_alive`
  instead, so they drive behavior on every platform.
- `test_browser_orphan_reaper.py` gained a `_patch_liveness` helper
  that patches both `os.kill` and `_probe_pid_alive` from a single
  mock function.

86/86 tests pass in the directly-affected modules on Windows 11
(tests/gateway/test_status.py, test_runner_startup_failures.py,
test_restart_drain.py, test_gateway_shutdown.py, and
tests/tools/test_browser_orphan_reaper.py).

Reviewed-by: independent code-review passes (Claude Opus 4.7 + Codex CLI).
@johnisag johnisag force-pushed the fix/windows-pid-liveness branch from 0b8d141 to 9d39bd7 Compare April 21, 2026 20:19
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery comp/tools Tool registry, model_tools, toolsets tool/browser Browser automation (CDP, Playwright) labels Apr 21, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Supersedes #11575, #12363, #9112, #7552 — comprehensive cross-platform pid_is_alive helper covering all os.kill(pid,0) call sites.

@johnisag

Copy link
Copy Markdown
Contributor Author

Closing as superseded. Since this PR was opened (Apr 20), main landed its own cross-platform liveness fix for the same bpo-14484 footgun — the gateway.status._pid_exists helper (commits 324567c, cc38282, hardened in fd9c150). It's psutil-first, covers all the same os.kill(pid, 0) call sites, and is backed by a CI guard (scripts/check-windows-footguns.py) so the regression can't reappear.

This branch's pid_is_alive solves the identical problem (its ctypes OpenProcess + WaitForSingleObject path is essentially the psutil fallback), so there's nothing left to merge here without reverting the maintained implementation. Thanks all for the reviews — happy the Windows crash is fixed either way. 🙏

@johnisag johnisag closed this May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/tools Tool registry, model_tools, toolsets tool/browser Browser automation (CDP, Playwright) type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants