fix(gateway): handle Windows OSError in PID liveness probe #11575

Closed
xy200303 wants to merge 7 commits into NousResearch:main from xy200303:windows-gateway-pid-probe-fix

@xy200303 xy200303 commented Apr 17, 2026

Contributor

What does this PR do?

This PR fixes a Windows-specific gateway startup failure caused by PID liveness probing.

Hermes uses os.kill(pid, 0) to check whether a PID recorded in the gateway PID file is still alive. That works in most cases, but on Windows some stale or invalid PIDs can raise a plain OSError such as WinError 11 instead of the expected ProcessLookupError or PermissionError. Because that case was not handled, gateway startup could crash while checking for an existing process instead of safely treating the PID record as stale.

This fix centralizes PID liveness probing in a helper and treats Windows-specific OSError probe failures as “not alive”. That keeps startup resilient and prevents stale PID files from crashing the gateway.
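The probe described above can be sketched roughly as follows. This is an illustration only, not the code from gateway/status.py: the name pid_looks_alive is a stand-in for the PR's _pid_looks_alive(pid), and the handling of PermissionError (process exists but belongs to another user) follows the usual POSIX convention, which is an assumption about the real helper.

```python
import os


def pid_looks_alive(pid: int) -> bool:
    """Best-effort check of whether `pid` refers to a running process.

    Hypothetical stand-in for the PR's _pid_looks_alive(); the real
    helper may differ in detail.
    """
    if pid <= 0:
        # Zero and negative values address process groups on POSIX,
        # so they are never treated as a valid recorded PID here.
        return False
    try:
        # On POSIX, signal 0 performs error checking only and delivers nothing.
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process: the PID record is stale.
        return False
    except PermissionError:
        # Assumption: the process exists but is owned by another user.
        return True
    except OSError:
        # Any other probe failure (for example a plain OSError raised on
        # Windows) is treated as "not alive" instead of propagating.
        return False
    return True
```

The order of the except clauses matters: ProcessLookupError and PermissionError are subclasses of OSError, so the generic handler has to come last.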

Related Issue

Fixes #

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Updated gateway/status.py to add _pid_looks_alive(pid) for centralized PID liveness checks.
  • Changed gateway PID validation to treat Windows OSError probe failures as stale/non-running PIDs instead of crashing.
  • Reused the new helper in both get_running_pid() and scoped lock liveness checks.
  • Added a regression test in tests/gateway/test_status.py covering the Windows OSError PID probe case.
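A regression test along the lines of the last item might look like the sketch below. It is not the test from tests/gateway/test_status.py. The function running_pid_from_file is a hypothetical stand-in for get_running_pid(), and the PID file is assumed to hold a bare integer, which may not match the real file format. The Windows failure is simulated by patching os.kill, so the test runs on any platform.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional
from unittest import mock


def running_pid_from_file(pid_file: Path) -> Optional[int]:
    """Hypothetical stand-in for get_running_pid().

    Returns the recorded PID if it looks alive, otherwise None.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except (OSError, ValueError):
        return None  # missing or unparsable PID record
    if pid <= 0:
        return None
    try:
        os.kill(pid, 0)
    except PermissionError:
        return pid  # assumption: process exists but is not ours to signal
    except OSError:
        return None  # includes ProcessLookupError and plain OSError
    return pid


def test_plain_oserror_marks_pid_as_stale() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = Path(tmp) / "gateway.pid"
        pid_file.write_text("12345")
        # Simulate the Windows case: the probe raises a plain OSError.
        with mock.patch.object(os, "kill", side_effect=OSError("simulated probe failure")):
            assert running_pid_from_file(pid_file) is None


def test_successful_probe_returns_recorded_pid() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        pid_file = Path(tmp) / "gateway.pid"
        pid_file.write_text("12345")
        with mock.patch.object(os, "kill", return_value=None):
            assert running_pid_from_file(pid_file) == 12345
```

Both functions use plain assert statements and the test_ prefix, so pytest would collect them without extra fixtures.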

How to Test

  1. Run python -m pytest tests/gateway/test_status.py -q
  2. Confirm the new Windows regression test passes along with the existing gateway status tests.
  3. Reproduce the original scenario on Windows with a stale gateway.pid and verify Hermes no longer crashes during startup PID detection.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Windows 11

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Test run:

python -m pytest tests/gateway/test_status.py -q
...............                                                          [100%]
15 passed in 6.98s

@alt-glitch added labels on Apr 24, 2026: type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/gateway (Gateway runner, session dispatch, delivery)
@teknium1

Contributor

Thanks for the contribution @xy200303! The core fix in this PR — catching OSError from os.kill(pid, 0) in gateway/status.py for Windows PID liveness probing — has already landed on main independently.

Automated hermes-sweeper review found the fix implemented at:

  • Commit 4c02e4597: fix(status): catch OSError in os.kill(pid, 0) for Windows compatibility
  • gateway/status.py line 778 (get_running_pid) and line 504 (acquire_scoped_lock) both already have except OSError with the WinError 87 comment

The PR also bundles two unrelated changes (fix(run_agent): preserve provider-specific headers on /model switch and fix(auxiliary): preserve provider-specific headers for named custom providers) that may still be worth contributing as separate, focused PRs if they aren't yet on main.
