fix(gateway): cross-platform pid_is_alive helper, fixing Windows crashes#13198
fix(gateway): cross-platform pid_is_alive helper, fixing Windows crashes#13198johnisag wants to merge 1 commit into
Conversation
✅ Review Complete - LGTMTested - all 8 pid_is_alive tests passing Test Results
Detailed AnalysisCross-platform pid_is_alive fix ✅
Test coverage ✅
Impact ✅
RecommendationMerge immediately. This fix:
No additional changes needed. Ready for merge. Reviewer: @teknium1 |
`os.kill(pid, 0)` is the canonical POSIX liveness probe but it is
unreliable on Windows in both directions:
- It can raise `OSError(WinError 11)` for healthy processes (arch
mismatch), `OSError(WinError 87)` / `SystemError` for PIDs the
kernel rejects as malformed, and other quirks depending on the
target's access rights.
- It can return success for processes that have already exited, as
long as a handle to them is still open — masking the exit and
causing liveness checks to falsely treat dead PIDs as alive.
Both of these crash Hermes on Windows during normal gateway startup.
In testing on Windows 11 + Python 3.11, `hermes gateway run` crashed
at three separate `os.kill(pid, 0)` call sites in sequence: each run
left a stale PID/lock file that the next run's probe would detonate
on. The false-positive case also causes silent bugs in the
orphan-daemon reaper (dead owners misread as alive → sessions never
cleaned up).
This commit introduces a cross-platform `pid_is_alive()` helper that
replaces every `os.kill(pid, 0)` production call site.
Design
------
- POSIX: delegate to `os.kill(pid, 0)` — the traditional idiom.
- Windows: query the Win32 API directly via ctypes — `OpenProcess`
with `SYNCHRONIZE` access + `WaitForSingleObject(handle, 0)`.
`os.kill` is NOT consulted on Windows; it is a correctness hazard
both because it raises for healthy PIDs AND because it can succeed
for exited ones. `WaitForSingleObject` (rather than
`GetExitCodeProcess`) sidesteps the documented `STILL_ACTIVE==259`
false-negative for processes that genuinely exit with code 259.
- kernel32 is opened with `use_last_error=True` so
`ctypes.get_last_error()` reads the LastError captured at the
failing call, not whatever the CRT clobbered afterwards.
- `treat_permission_denied_as_alive` parameter lets each call site
preserve its historical behavior (some sites treat PermissionError
as "stale/gone", others as "alive/don't touch").
- The internal `_probe_pid_alive` dispatcher is deliberately split
out so tests can monkey-patch a single function and have the mock
apply on every platform.
Call sites replaced (all production `os.kill(pid, 0)` occurrences):
- gateway/status.py: `acquire_scoped_lock`, `get_running_pid`
- gateway/run.py: old-gateway wait-for-exit loop
- hermes_cli/gateway.py: stop + restart drain loops
- hermes_cli/profiles.py: `_stop_gateway_process` drain loop
- tools/process_registry.py: `_is_host_pid_alive`
- tools/browser_tool.py: orphan reaper owner + daemon checks
Behavior change
---------------
`hermes_cli/profiles.py::_stop_gateway_process` originally caught
only `ProcessLookupError` and let `PermissionError` propagate
(crashing the wait loop). After this change, PermissionError is
treated as alive (the loop keeps waiting, then force-kills). This
aligns with the codebase's own test expectation in
`test_owner_pid_permission_error_treated_as_alive` and with the
"don't reap what you can't see" principle used elsewhere.
Tests
-----
- 15 new unit tests for `pid_is_alive` / `_probe_pid_alive` covering
input validation, POSIX path, Windows path, permission flag
forwarding, and a regression test asserting `os.kill` is never
called on Windows (`test_windows_bypasses_os_kill_entirely`).
- Existing tests that monkey-patched `status.os.kill` to simulate
process states updated to monkey-patch `status._probe_pid_alive`
instead, so they drive behavior on every platform.
- `test_browser_orphan_reaper.py` gained a `_patch_liveness` helper
that patches both `os.kill` and `_probe_pid_alive` from a single
mock function.
86/86 tests pass in the directly-affected modules on Windows 11
(tests/gateway/test_status.py, test_runner_startup_failures.py,
test_restart_drain.py, test_gateway_shutdown.py, and
tests/tools/test_browser_orphan_reaper.py).
Reviewed-by: independent code-review passes (Claude Opus 4.7 + Codex CLI).
0b8d141 to
9d39bd7
Compare
|
Closing as superseded. Since this PR was opened (Apr 20), This branch's |
Summary
Fix three confirmed
hermes gateway runcrashes on Windows caused by unreliableos.kill(pid, 0)behavior, plus a latent correctness bug in the browser orphan reaper. Introduce a cross-platformpid_is_alive()helper that routes through the Win32 API on Windows and preserves traditionalos.killsemantics on POSIX.The problem
os.kill(pid, 0)— the canonical POSIX liveness probe — is unreliable on Windows in both directions:OSError(WinError 11)("An attempt was made to load a program with an incorrect format") for perfectly healthy processes when the target's arch differs from the caller, or raiseOSError(WinError 87)/SystemErrorfor PIDs the kernel rejects as malformed.Reproduction (the crash path)
On a clean Windows 11 + Python 3.11 install, three consecutive
hermes gateway runinvocations crashed at three differentos.kill(pid, 0)sites, each leaving a stale PID/lock file that detonated the next probe:Between crashes, stale files have to be manually removed:
The silent bug
tools/browser_tool.py::_reap_orphaned_browser_sessionsusesos.kill(owner_pid, 0)to decide whether a browser daemon's owning Hermes process is still alive. On Windows, if any handle to an exited owner process lingers,os.killreturns success — the reaper concludes the owner is alive and never cleans up the orphaned session directory. Stale session dirs accumulate forever.The fix
One new public helper —
gateway.status.pid_is_alive(pid, *, treat_permission_denied_as_alive=False)— with an internal_probe_pid_alivedispatcher.os.kill(pid, 0). Behavior is byte-identical to the existing idiom.OpenProcess(SYNCHRONIZE, ...)+WaitForSingleObject(handle, 0).os.killis never consulted on Windows. A regression test (test_windows_bypasses_os_kill_entirely) codifies this invariant.Design notes
WaitForSingleObjectoverGetExitCodeProcess. MSDN explicitly warns thatGetExitCodeProcessis unreliable for liveness because a process that legitimately exits with codeSTILL_ACTIVE == 259reads as still running.WaitForSingleObject(handle, 0)has no such quirk.WinDLL("kernel32", use_last_error=True)+ctypes.get_last_error(). Withoutuse_last_error=True, the defaultctypes.windll.kernel32does NOT reliably preserve Win32 LastError across the ctypes trampoline, and a stale LastError could be misread asERROR_ACCESS_DENIED— exactly the bug this PR is fixing.treat_permission_denied_as_aliveflag. Different call sites historically had differentPermissionErrorsemantics (some treated it as "stale/gone", others as "alive/don't touch"). The flag lets each site preserve its original behavior; see the mapping table below._probe_pid_aliveas the test mock point. Tests that previously monkey-patchedstatus.os.killto simulate process states now monkey-patch_probe_pid_alive, so the simulated state drives behavior on every platform regardless of which syscall the helper happens to use.Call sites replaced
PermissionError→gateway/status.py::acquire_scoped_lockFalsegateway/status.py::get_running_pidFalsegateway/run.py(wait-for-old-gateway-exit)Falsehermes_cli/gateway.py(stop + restart-drain loops)Falsehermes_cli/profiles.py::_stop_gateway_process(*)Truetools/process_registry.py::_is_host_pid_aliveFalsetools/browser_tool.py(orphan reaper, both sites)True(*) Intentional behavior change.
_stop_gateway_processoriginally caught onlyProcessLookupErrorand letPermissionErrorpropagate, crashing the wait loop. It now treatsPermissionErroras alive (the loop keeps waiting, then force-kills). This aligns withtest_owner_pid_permission_error_treated_as_aliveand the "don't reap what you can't see" principle used elsewhere in the codebase.Test plan
pid_is_alive/_probe_pid_alivecovering input validation, POSIX path, Windows path, permission-flag forwarding, and the Windows-doesn't-call-os.killinvariant.tests/gateway/test_status.pytests updated to monkey-patch_probe_pid_alive(cross-platform) instead ofstatus.os.kill(POSIX-only after this change).tests/tools/test_browser_orphan_reaper.pygained a_patch_liveness(mock_kill)helper that patches bothos.killand_probe_pid_alivefrom a single mock function; all 7 existing test sites migrated.tests/gateway/test_status.py,test_runner_startup_failures.py,test_restart_drain.py,test_gateway_shutdown.py, andtests/tools/test_browser_orphan_reaper.py).To verify locally
Manual reproduction on Windows
Before this PR:
After this PR, the gateway starts cleanly;
hermes gateway statusfrom a second terminal no longer tears down the running gateway's PID file.Related
This is the first of two Windows-targeted PRs. A follow-up will address the web-dashboard build path (
web/package.json'ssync-assetsscript uses Unixrm -rf/cp -r, andhermes_cli/main.py::_build_web_uiswallows stderr) — that's kept separate because it's a build-time shell-script portability issue rather than a runtime liveness-check issue.