
fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix#40909

Merged: teknium1 merged 2 commits into main from fix/windows-gateway-reliability on Jun 7, 2026


teknium1 (Contributor) commented Jun 7, 2026

Why

Three reports from a Windows 11 user:

  1. hermes gateway status seems to lie about gateway aliveness.
  2. Gateway dies after the Desktop GUI runs the update flow and does not come back.
  3. hermes gateway start flashes a cmd window or appears foreground.

Investigation on the user's box found:

  • (2) is real. The post-update relaunch watcher (launch_detached_profile_gateway_restart) uses windows_detach_popen_kwargs(), which set DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP | CREATE_NO_WINDOW but not CREATE_BREAKAWAY_FROM_JOB. The Tauri updater is spawned by Electron inside Electron's job object; the watcher inherits that job membership and dies when Electron exits. _spawn_detached() in gateway_windows.py already set the flag, but the rest of the codebase did not.
  • (3) was a Startup-folder leak from test_install_wsl_no_systemd / test_start_wsl_no_systemd / test_status_wsl_* in tests/hermes_cli/test_gateway_wsl.py. Those tests stub is_linux/is_wsl/supports_systemd_services but not is_windows(). On a Windows host, the dispatcher hits the is_windows() branch before the WSL guidance branch and invokes gateway_windows.install() for real. That call writes to the real %APPDATA%\…\Startup\Hermes_Gateway.cmd (never sandboxed by tmp_path) and points it at a pytest fixture path that is deleted after the run. Every login since then flashed a cmd window that failed with "file not found".
  • (1) is harder to reproduce at will, but the existing status collapses several signals into one ✓/✗ line, and --deep only printed paths. When the high-level summary is wrong, the user has no way to see which signal is wrong.

What this PR does

A. CREATE_BREAKAWAY_FROM_JOB on every detached spawn

hermes_cli/_subprocess_compat.py::windows_detach_flags() now ORs in 0x01000000. Every helper using windows_detach_popen_kwargs() inherits it. The inlined watcher respawn snippet in hermes_cli/gateway.py::launch_detached_profile_gateway_restart() also picks up the flag.
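
As a rough illustration of the flag change described above, the helper pair might look something like the following. This is a minimal sketch, not the repository's actual code: the `platform` parameter is a hypothetical addition so the example can be exercised on any host, and the `creationflags` key is an assumption about how the kwargs are shaped on Windows. The four hex values are the documented Win32 process creation flags.

```python
import sys

# Documented Win32 process creation flag values.
DETACHED_PROCESS = 0x00000008
CREATE_NEW_PROCESS_GROUP = 0x00000200
CREATE_NO_WINDOW = 0x08000000
CREATE_BREAKAWAY_FROM_JOB = 0x01000000  # the bit this PR adds


def windows_detach_flags(platform: str = sys.platform) -> int:
    """Return the detach bundle on Windows, 0 elsewhere.

    The `platform` argument is hypothetical; it exists only so the
    sketch can be tested off Windows.
    """
    if platform != "win32":
        return 0
    return (
        DETACHED_PROCESS
        | CREATE_NEW_PROCESS_GROUP
        | CREATE_NO_WINDOW
        | CREATE_BREAKAWAY_FROM_JOB
    )


def windows_detach_popen_kwargs(platform: str = sys.platform) -> dict:
    """Keyword arguments for subprocess.Popen that detach the child."""
    if platform != "win32":
        return {"start_new_session": True}
    # Assumed shape of the Windows kwargs; the real helper may add more keys.
    return {"creationflags": windows_detach_flags(platform)}
```

Because every caller goes through the kwargs helper, adding the bit in one place is what lets all detached spawns pick it up.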

B. hermes gateway status --deep shows per-probe PASS/FAIL

Deep probes:
  [1] PASS  PID file present: …\gateway.pid (pid=4324)
  [2] PASS  Lock file held by a live process: …\gateway.lock
  [3] PASS  get_running_pid() => 4324
  [4] PASS  _pid_exists(4324) => True
  [5] PASS  gateway_state.json state='running' (updated 12s ago)
  [6] PASS  Last lifecycle event: tag=gateway.start pid=4324 ts=2026-06-06T14:48:48Z
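
The per-probe output above could be produced by a loop along these lines. This is only a sketch of the idea: `Probe` and `run_probes` are hypothetical names, and the real probes (PID file, lock file, `get_running_pid()`, and so on) are replaced by plain callables that return a pass flag and a detail string.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Probe:
    """One independent liveness signal (hypothetical structure)."""
    label: str
    check: Callable[[], Tuple[bool, str]]  # returns (passed, detail)


def run_probes(probes: List[Probe]) -> List[str]:
    """Run every probe and render one PASS/FAIL line per probe."""
    lines = ["Deep probes:"]
    for index, probe in enumerate(probes, start=1):
        try:
            passed, detail = probe.check()
        except Exception as exc:
            # A probe that blows up is reported as a failure instead of
            # hiding the remaining signals.
            passed, detail = False, f"probe raised {exc!r}"
        status = "PASS" if passed else "FAIL"
        lines.append(f"  [{index}] {status}  {probe.label}: {detail}")
    return lines
```

Keeping each signal on its own line is the point: when the one-line summary disagrees with reality, the failing probe is visible directly.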

C. Test-leak fix + defense in depth

Stubs is_windows=False on the 4 WSL-guidance tests so they can never reach the real gateway_windows.install() on a Windows host. Plus: _build_startup_launcher() now prefixes the launcher with if not exist <target> exit /b 0 so any future stale Startup entry silently no-ops instead of flashing a window.
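
To make the defense-in-depth guard concrete, a launcher builder might emit something like the text below. This is a hedged sketch: `build_startup_launcher` is a stand-in for the private `_build_startup_launcher()`, and everything other than the `if not exist <target> exit /b 0` line (the `@echo off` header, the `call`, the quoting, the CRLF line endings) is an assumption for illustration.

```python
def build_startup_launcher(target: str) -> str:
    """Build a Startup-folder .cmd that no-ops when its target is gone.

    Only the `if not exist ... exit /b 0` guard comes from the PR text;
    the surrounding lines are illustrative assumptions.
    """
    lines = [
        "@echo off",
        # Guard: a stale entry exits quietly instead of erroring out.
        f'if not exist "{target}" exit /b 0',
        f'call "{target}"',
    ]
    return "\r\n".join(lines) + "\r\n"
```

The guard has to come before any line that touches the target, otherwise a stale entry would still surface an error.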

What this PR explicitly does NOT do

An earlier iteration of this branch added a per-minute schtasks-based supervisor task for crash recovery. It flashed a console window every minute that stole window focus. This is a documented Windows quirk: Task Scheduler preallocates conhost for the spawn target even when the target is GUI-subsystem, and <Hidden>true</Hidden> in the XML only hides the task in Task Scheduler's UI, not the spawned window. We tried multiple variants (cmd.exe wrapper, direct pythonw, base GUI pythonw, XML+Hidden+InteractiveToken); all flash.

Research into prior art:

| Project | Mechanism |
| --- | --- |
| Ollama | GUI-subsystem tray exe + Startup folder shortcut. Internal monitor+worker for crash recovery. No supervisor. |
| Tailscale | Real Windows Service via sc.exe/mgr.CreateService. Session 0, no console possible. SCM auto-restart. |
| Syncthing | --no-console flag inside the binary + Startup folder shortcut. |
| openclaw | VBS wrapper with WScript.Shell.Run(..., 0, False). Suppresses the window, but Super User Q971162 confirms focus-steal still occurs in some cases. |

None of these use a per-minute polling scheduled task. The "auto-restart on crash" responsibility belongs INSIDE the daemon (Ollama-style monitor+worker) OR is delegated to the Windows Service Control Manager (Tailscale) — not Task Scheduler.

So we removed the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB fix alone resolves the original user-reported "gateway dies after GUI update" complaint — the post-update watcher now actually survives Electron's job teardown and respawns the gateway, exactly as designed.

Live verification on the reporting user's machine

  • Removed the broken Startup .cmd (the one pointing to the pytest tmp path).
  • Wrote a fresh Startup\Hermes_Gateway.cmd using the patched _build_startup_launcher (with the if not exist guard) pointing to the real $HERMES_HOME\gateway-service\Hermes_Gateway.cmd.
  • hermes gateway status --deep shows 6/6 PASS for the running gateway PID 4324.
  • No scheduled tasks installed (schtasks /Query confirms zero Hermes tasks).
  • Real gateway running throughout, untouched.

Test results

tests/hermes_cli/test_gateway_windows.py     ✓ all pass
tests/hermes_cli/test_gateway_wsl.py         ✓ TestGatewayCommandWSLMessages: 4/4 pass
                                             (2 pre-existing TestSupportsSystemdServicesWSL
                                              failures on Windows hosts, not caused by this PR)
tests/gateway/test_status.py                 ✓ all pass
tests/gateway/test_status_command.py         ✓ all pass

Confirmed pre-existing failures by stashing this PR's changes and running on origin/main — same 2 fail.

Follow-up issue

#40899 — track the real Windows Service auto-restart via pywin32 (Tailscale-style). Out of scope here; this PR ships the focused fix.

Files changed

  • hermes_cli/_subprocess_compat.py: add CREATE_BREAKAWAY_FROM_JOB to windows_detach_flags()
  • hermes_cli/gateway.py: same flag in the inlined watcher respawn snippet
  • hermes_cli/gateway_windows.py: --deep probes; if not exist guard in startup launcher; supervisor block fully removed
  • tests/hermes_cli/test_gateway_wsl.py: is_windows=False stubs on the 4 WSL guidance tests

Infographic

[Image: windows-gateway-reliability]

teknium1 added 2 commits June 6, 2026 08:10
…tatus --deep

Three coordinated fixes for the Windows gateway reliability story:

1. CREATE_BREAKAWAY_FROM_JOB on every detached spawn

   The 'hermes update' triggered from the Electron Desktop GUI ran inside
   Electron's job object. Without breakaway, the post-update gateway
   watcher spawned by update — already DETACHED_PROCESS — was still
   reaped when Electron's job tore down, so the gateway never came back
   after a GUI-initiated update. Adds CREATE_BREAKAWAY_FROM_JOB (0x01000000)
   to:
     - hermes_cli/_subprocess_compat.py::windows_detach_flags() — used by
       every helper that calls windows_detach_popen_kwargs(), including
       launch_detached_profile_gateway_restart()
     - The watcher subprocess's own respawn snippet in
       hermes_cli/gateway.py (inlined flags so the watcher's child
       respawn also breaks away)

   _spawn_detached() in gateway_windows.py already had the flag; this
   change brings the rest of the codebase to parity.

2. Per-minute supervisor Scheduled Task — Windows equivalent of
   systemd Restart=always

   Introduces hermes_cli/gateway_supervisor.py and registers it as a
   second Scheduled Task ('Hermes_Gateway_Supervisor', /SC MINUTE /MO 1,
   LIMITED rights) alongside the existing ONLOGON task. Every minute,
   the supervisor uses the same gateway.status.get_running_pid() probe
   as 'hermes gateway status' and, if no gateway is alive, calls
   gateway_windows._spawn_detached() (which now includes BREAKAWAY) to
   bring one back.

   Covers every crash mode, not just 'machine rebooted': taskkill,
   OOM, GUI update SIGTERM, parent job teardown. Cheap — one pythonw
   startup per minute when down, one PID-existence check per minute
   when up.

   Wired into both the schtasks-success and Startup-folder-fallback
   install paths via _install_supervisor_best_effort(), and removed in
   uninstall(). Best-effort: a failing supervisor install logs a
   warning but doesn't roll back the primary install.

3. 'hermes gateway status --deep' shows per-probe PASS/FAIL

   Replaces the existing terse '--deep' output (which only printed
   paths) with an actual diagnostic table:
     [1] PID file present
     [2] Lock file held by a live process
     [3] get_running_pid() result
     [4] _pid_exists(pid) — OS-level liveness
     [5] gateway_state.json (state + age)
     [6] Last lifecycle event from gateway-exit-diag.log

   When the high-level summary disagrees with reality, the user can
   see exactly which signal is lying.

Test-leak fix
-------------

tests/hermes_cli/test_gateway_wsl.py::TestGatewayCommandWSLMessages
monkey-patched is_linux/is_wsl/supports_systemd_services to simulate
WSL but did NOT stub is_windows(). On a Windows host, the dispatcher
in _gateway_command_inner takes the is_windows() branch BEFORE the
WSL guidance branch, so the test invoked gateway_windows.install()
for real. install() writes to %APPDATA%\...\Startup\Hermes_Gateway.cmd
— the REAL user Startup folder, never sandboxed by tmp_path — pointing
at the test's pytest-of-<user>/pytest-<N>/.../gateway-service/ wrapper.
When pytest tore down the tmp_path, every subsequent Windows login
flashed a cmd.exe window that failed to find the missing target.

Stubs is_windows=False on all four affected tests:
  test_install_wsl_no_systemd
  test_start_wsl_no_systemd
  test_status_wsl_running_manual
  test_status_wsl_not_running

Defense-in-depth: _build_startup_launcher() now prefixes the launcher
with 'if not exist <target> exit /b 0', so any future stale Startup
entry silently no-ops instead of flashing a console window.

Status enhancements
-------------------

- status() now reports supervisor task presence alongside the existing
  schtasks/Startup info, and nudges the user to reinstall if the
  supervisor isn't registered.
- Deep mode dumps both the supervisor task name + script path.
…kaway + deep probes

Earlier in this branch we added a per-minute schtasks-based supervisor to
respawn the gateway after crashes / GUI-update SIGTERMs. The implementation
flashed a brief console window on every firing, which stole window focus.
We tried several variants:

  - cmd.exe wrapper invoking pythonw  -> flashes (cmd.exe is console-subsystem)
  - schtasks /TR pointing at pythonw  -> flashes (uv venv launcher pythonw is
    actually subsystem=Console, not GUI; it respawns the real pythonw)
  - schtasks /TR pointing at base uv  -> still flashes (Task Scheduler-side
    conhost preallocation; documented Windows quirk)
  - XML registration with <Hidden>true</Hidden>  -> still flashes (<Hidden> only hides
    the task in the Task Scheduler UI, not the spawned window)

Researched what leading projects do:

  - Ollama: GUI-subsystem tray exe + Startup-folder shortcut. No supervisor.
  - Tailscale: real Windows Service via SCM. Session 0, no console possible.
  - Syncthing: --no-console flag inside the binary + Startup folder.
  - openclaw: VBS Run(..., 0, False) wrapper. Suppresses the *window* but
    Super User Q971162 confirms focus-steal still occurs in some cases.

None of these use a per-minute polling scheduled task. The 'auto-restart on
crash' responsibility belongs INSIDE the daemon (Ollama's monitor+worker
pair) OR is delegated to the Windows Service Control Manager (Tailscale),
not Task Scheduler.

So this commit drops the supervisor entirely. The CREATE_BREAKAWAY_FROM_JOB
fix in _subprocess_compat.py (from commit c1e5fa4) survives — that is the
*real* fix for problem #2 (GUI-update kills gateway): the post-update
watcher in launch_detached_profile_gateway_restart() now breaks out of
Electron's job object, so the gateway respawn watcher survives the GUI
quit and successfully respawns the gateway.

Surviving from c1e5fa4:
  * CREATE_BREAKAWAY_FROM_JOB in hermes_cli/_subprocess_compat.py (fixes #2)
  * Inlined breakaway flag in the watcher respawn snippet in gateway.py
  * hermes gateway status --deep PASS/FAIL probes (fixes #1 — visibility)
  * 'if not exist <target> exit /b 0' guard in _build_startup_launcher
    (fixes #3 — silent no-op for stale Startup entries)
  * tests/hermes_cli/test_gateway_wsl.py is_windows=False stubs (root cause
    of #3 — pytest WSL tests no longer leak Startup entries on Win hosts)

Removed in this commit:
  * hermes_cli/gateway_supervisor.py (entire file)
  * Supervisor section in hermes_cli/gateway_windows.py (~180 lines):
      get_supervisor_task_name, get_supervisor_script_path,
      _build_supervisor_cmd_script, _write_supervisor_script,
      _install_supervisor_task, is_supervisor_task_registered,
      _install_supervisor_best_effort
  * _install_supervisor_best_effort() calls in install() (3 spots)
  * supervisor cleanup block in uninstall()
  * supervisor display lines in status() / status(deep=True)

Future direction (out of scope for this PR): the right place for Windows
'Restart=always' semantics is a real Windows Service installed via
pywin32's win32serviceutil.ServiceFramework — session-0 isolation, SCM
auto-restart, no console window possible. That's a meaningful next-PR
project, not a band-aid.

Tests: 51 pass / 2 pre-existing failures in
tests/hermes_cli/test_gateway_{windows,wsl}.py (the 2 failures are
TestSupportsSystemdServicesWSL cases that fail on origin/main too —
unrelated to this PR).
github-actions Bot (Contributor) commented Jun 7, 2026

🔎 Lint report: fix/windows-gateway-reliability vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9970 on HEAD, 9970 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5172 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@teknium1 teknium1 merged commit fc086da into main Jun 7, 2026
23 checks passed
@teknium1 teknium1 deleted the fix/windows-gateway-reliability branch June 7, 2026 02:54
teknium1 added a commit that referenced this pull request Jun 7, 2026
…enies it, plus regression tests for the breakaway bit (#40956)

#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:

  "Callers in this codebase already wrap detached spawns in
  try/except OSError and fall back to a cmd.exe wrapper, so the
  breakaway-denied case degrades gracefully rather than crashing."

That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were
missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.
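
A minimal sketch of what the sibling helper might look like, assuming the same flag bundle as `windows_detach_flags()`. The `platform` parameter is a hypothetical addition for testability and is not part of the described API.

```python
import sys

DETACHED_PROCESS = 0x00000008
CREATE_NEW_PROCESS_GROUP = 0x00000200
CREATE_NO_WINDOW = 0x08000000
CREATE_BREAKAWAY_FROM_JOB = 0x01000000


def windows_detach_flags(platform: str = sys.platform) -> int:
    """Default detach bundle, including the breakaway bit (0 off Windows)."""
    if platform != "win32":
        return 0
    return (
        DETACHED_PROCESS
        | CREATE_NEW_PROCESS_GROUP
        | CREATE_NO_WINDOW
        | CREATE_BREAKAWAY_FROM_JOB
    )


def windows_detach_flags_without_breakaway(platform: str = sys.platform) -> int:
    """Same bundle with only the breakaway bit cleared.

    Callers use this for the retry path instead of repeating the
    `& ~0x01000000` mask at every site.
    """
    return windows_detach_flags(platform) & ~CREATE_BREAKAWAY_FROM_JOB
```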

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:

1. The outer watcher Popen is now wrapped in `try/except OSError` and, on
   failure, retries with `windows_detach_flags_without_breakaway()`.
   The retry is Windows-only: on POSIX the detach kwargs are just
   `start_new_session=True`, so there is no breakaway bit to be denied.
2. The inlined respawn payload (the `python -c` watcher) also
   wraps its CreateProcess in try/except OSError and retries with
   `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
   because the watcher's job-object membership is independent of the
   outer process's: even if the outer Popen succeeds with breakaway,
   the watcher may itself sit in a job that denies breakaway when it
   respawns the gateway.
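
The retry-without-breakaway pattern in both items can be sketched as follows. `spawn_with_breakaway_fallback` is a hypothetical name, and the injectable `popen` parameter is an assumption added so the fallback can be exercised without a real Windows job object; the actual code calls `subprocess.Popen` directly.

```python
import subprocess
from typing import Callable, Sequence

CREATE_BREAKAWAY_FROM_JOB = 0x01000000


def spawn_with_breakaway_fallback(
    argv: Sequence[str],
    flags: int,
    popen: Callable = subprocess.Popen,
):
    """Try to spawn with breakaway; on OSError retry once without it.

    A restrictive parent job can refuse breakaway, which surfaces as
    OSError. Dropping only that bit keeps the other detach flags.
    """
    try:
        return popen(argv, creationflags=flags)
    except OSError:
        if not flags & CREATE_BREAKAWAY_FROM_JOB:
            raise  # nothing to drop, so the failure has another cause
        return popen(argv, creationflags=flags & ~CREATE_BREAKAWAY_FROM_JOB)
```

The guard before the retry avoids repeating an identical call when the breakaway bit was never set.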

### Regression tests in `tests/tools/test_windows_native_support.py`

#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:

- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
  assertion that the breakaway bit is in the default bundle, with the
  rationale spelled out in the docstring so a future maintainer
  staring at this test understands why removing it would resurrect
  the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static-text check on the stringified watcher payload. The inlined
  Python program isn't reachable via normal import-time inspection
  because it lives in a `textwrap.dedent("""...""")` literal that
  gets passed to a separate `python -c` interpreter. Asserting that
  both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
  literal) appear inside the dedent block is a sufficient regression
  guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
  — static check that this PR's fallback retry is wired up
  symbolically. Without standing up a real Windows job object that
  refuses breakaway, we can't trigger the OSError in a unit test;
  the text guard catches the case where a future refactor removes
  the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.
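
For readers unfamiliar with static-text guards, the idea behind the last two tests can be sketched like this. `build_watcher_payload` and `payload_has_breakaway_guard` are hypothetical names, and the payload body is an illustrative stand-in rather than the real inlined watcher.

```python
import textwrap


def build_watcher_payload() -> str:
    """Illustrative stand-in for the program passed to `python -c`."""
    return textwrap.dedent(
        """
        import subprocess, sys
        _CREATE_BREAKAWAY_FROM_JOB = 0x01000000
        _flags = 0x00000008 | 0x00000200 | 0x08000000 | _CREATE_BREAKAWAY_FROM_JOB
        try:
            subprocess.Popen(sys.argv[1:], creationflags=_flags)
        except OSError:
            subprocess.Popen(
                sys.argv[1:],
                creationflags=_flags & ~_CREATE_BREAKAWAY_FROM_JOB,
            )
        """
    )


def payload_has_breakaway_guard(payload: str) -> bool:
    """Text-level check: symbol, hex literal, and retry mask all present."""
    required = (
        "_CREATE_BREAKAWAY_FROM_JOB",
        "0x01000000",
        "& ~_CREATE_BREAKAWAY_FROM_JOB",
    )
    return all(token in payload for token in required)
```

A text check like this cannot prove the fallback works at runtime, but it does fail loudly if a refactor drops the flag or the retry from the string literal.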

Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-#40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around
  line 3068) also uses `windows_detach_popen_kwargs()` + `except
  OSError`. It deserves the same fallback treatment but the codepath
  is different enough (not the update-flow watcher) that it warrants
  a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
  too — same reasoning.

## Context

Follow-up to #40909 (merged). I had a parallel PR (#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that #40909 didn't cover are the contents of this one. Closing #40934
and opening this slimmed-down version as the focused follow-up.
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…bes + test-leak fix (NousResearch#40909)

* fix(gateway,windows): reliability — supervisor task, JOB breakaway, status --deep


* fix(gateway,windows): drop the per-minute supervisor task — keep breakaway + deep probes

changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…enies it, plus regression tests for the breakaway bit (NousResearch#40956)
