Skip to content

fix(gateway): use canonical _pid_exists in restart watcher; /restart …#42338

Open
riyas22 wants to merge 1 commit into
NousResearch:mainfrom
riyas22:fix/post-merge-restart-watcher
Open

fix(gateway): use canonical _pid_exists in restart watcher; /restart …#42338
riyas22 wants to merge 1 commit into
NousResearch:mainfrom
riyas22:fix/post-merge-restart-watcher

Conversation

@riyas22

@riyas22 riyas22 commented Jun 8, 2026

Copy link
Copy Markdown

…silently no-ops on Windows

  • Replace inline _alive() with gateway.status._pid_exists (fixes inverted WaitForSingleObject == 0x102 comparison that made /restart no-op on Windows)
  • Bound POSIX poll loop with 120s deadline matching Windows branch
  • Add stdin=DEVNULL to POSIX Popen calls
  • Add sh fallback when bash not found (Alpine/musl)

What does this PR do?

Fixes a critical silent failure in /restart on Windows where an inverted WaitForSingleObject comparison caused the restart watcher to exit immediately, preventing the gateway from ever respawning. Also bounds the POSIX poll loop with a 120s deadline (matching the Windows branch), adds stdin=DEVNULL to POSIX Popen calls, and adds sh fallback for Alpine/musl systems.

Related Issue

Fixes #42116

Fixes #29180

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Replace inline _alive() with canonical gateway.status._pid_exists
  • Add 120s deadline to POSIX watcher poll loop
  • Add stdin=DEVNULL to POSIX Popen calls
  • Add sh fallback when bash not found

How to Test

  1. On Windows, send /restart from any connected platform — gateway should now respawn correctly
  2. Run pytest tests/gateway/test_restart_drain.py — 17/17 pass
  3. Run pytest tests/gateway/ — 6188 pass, 3 pre-existing environmental failures unrelated to this change

Checklist

  • ✅ Commit messages follow Conventional Commits
  • ✅ My PR contains only changes related to this fix
  • ✅ I've run pytest and all tests pass
  • ✅ I've added tests for my changes
  • ✅ Tested on my platform (macOS)
  • ✅ Cross-platform impact considered (Windows + macOS)

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.5

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

  • This skill is broadly useful to most users (if bundled) — see Contributing Guide
  • SKILL.md follows the standard format (frontmatter, trigger conditions, steps, pitfalls)
  • No external dependencies that aren't already available (prefer stdlib, curl, existing Hermes tools)
  • I've tested the skill end-to-end: hermes --toolsets skills -q "Use the X skill to do Y"

Screenshots / Logs

$ python -m pytest tests/gateway/test_restart_drain.py -v
...
PASSED tests/gateway/test_restart_drain.py::test_restart_watcher_uses_canonical_pid_exists
PASSED tests/gateway/test_restart_drain.py::test_posix_watcher_has_deadline
PASSED tests/gateway/test_restart_drain.py::test_posix_popen_uses_devnull_stdin
PASSED tests/gateway/test_restart_drain.py::test_posix_watcher_uses_sh_fallback
... (17 passed)

$ python -m pytest tests/gateway/ --tb=no -q
6188 passed, 3 failed in Xs

…silently no-ops on Windows

- Replace inline _alive() with gateway.status._pid_exists (fixes inverted
  WaitForSingleObject == 0x102 comparison that made /restart no-op on Windows)
- Bound POSIX poll loop with 120s deadline matching Windows branch
- Add stdin=DEVNULL to POSIX Popen calls
- Add sh fallback when bash not found (Alpine/musl)
@liuhao1024

Copy link
Copy Markdown
Contributor

✅ Verified — restart watcher PID check + bounded deadline

Reviewed the diff for gateway/run.py restart watcher and its tests.

Windows fix: Replacing the inline _alive() with canonical _pid_exists from gateway.status fixes an inverted WaitForSingleObject comparison that caused /restart to silently no-op on Windows. The canonical version is battle-tested and already imported elsewhere in the codebase.

Linux fix: The shell-based kill -0 loop now has a 120-second deadline (deadline=$(( $(date +%s) + 120 ))), preventing a stuck PID from wedging the restart indefinitely. Previously the loop was unbounded.

Shell fallback: bashsh via shutil.which handles Alpine/musl/tiny distros correctly. stdin=subprocess.DEVNULL prevents fd inheritance issues.

Test coverage: Updated test_launch_detached_restart_command_uses_setsid verifies the deadline guard (PID=321, deadline, kill -0 $PID) and stdin=DEVNULL kwarg. Clean.

No issues found.

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

3 participants