fix(windows): respawn gateway after Desktop GUI update by breaking the watcher away from Electron's job object#40934
Conversation
…e watcher away from Electron's job object
`hermes update --gateway` (which the Tauri / Electron Desktop updater calls
to relaunch the user's gateway after an update) spawns a tiny watcher
subprocess that polls the old gateway PID, SIGTERMs it, and respawns the
new one. The watcher and its respawned gateway both need to outlive the
Electron Desktop process that triggered the update — Electron exits
mid-update so the venv shim is unlocked.
On Windows that was failing in a subtle way: the watcher was spawned
with `DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP | CREATE_NO_WINDOW`
but NOT `CREATE_BREAKAWAY_FROM_JOB`. Electron wraps its child processes
in a Win32 job object; when Electron quits, the OS tears the job down
and reaps every descendant that didn't explicitly break away — DETACHED
or not. The watcher died before it could relaunch the gateway. End
result: every Desktop-GUI-driven update left the user without a gateway
until they ran `hermes gateway start` manually.
`hermes gateway start` itself already does the right thing (its
`_spawn_detached` in `gateway_windows.py` includes the breakaway bit
plus a retry without it for restrictive job objects). The fix is to
mirror that pattern in the two spawn sites the update flow uses:
1. `windows_detach_flags()` in `_subprocess_compat.py` now includes
`CREATE_BREAKAWAY_FROM_JOB` (0x01000000) in its default bundle.
`windows_detach_popen_kwargs()` inherits the new flag automatically.
A new helper `windows_detach_flags_without_breakaway()` exposes the
pre-breakaway bundle as the documented retry payload.
2. `launch_detached_profile_gateway_restart()` in `gateway.py`:
- The inlined respawn script (the watcher's payload — a stringified
Python program passed to `python -c`) now sets the breakaway bit
on the respawned gateway, with the same access-denied fallback.
- The outer Popen that launches the watcher itself wraps in a
try/except OSError and retries without the breakaway bit using
`windows_detach_flags_without_breakaway()`, mirroring
`_spawn_detached` in `gateway_windows.py`.
POSIX paths are unchanged: `windows_detach_flags()` returns 0 off
Windows and `windows_detach_popen_kwargs()` still yields
`{"start_new_session": True}`.
Tests:
- `test_windows_detach_flags_includes_breakaway_from_job` — regression
guard so the breakaway bit isn't accidentally dropped from the
default flag bundle.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
— fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
— static check that the stringified watcher payload contains
`CREATE_BREAKAWAY_FROM_JOB` symbolically and as the hex literal, so
the intent is greppable and future refactors don't silently drop it.
- The existing `test_windows_detach_flags_has_expected_win32_bits` now
also asserts the breakaway bit is present.
🔎 Lint report:
|
|
Closing in favor of #40909 (now merged) which contained the same core CREATE_BREAKAWAY_FROM_JOB fix bundled with the status --deep probes and Startup-folder leak fix. The bits of this PR that #40909 did NOT include — the windows_detach_flags_without_breakaway() helper, the OSError access-denied fallback on the watcher Popen / inlined respawn (mirroring what _spawn_detached in gateway_windows.py already does), and the static-text regression test on the inlined watcher payload — are still potentially useful. Will open a focused follow-up PR for just those if the maintainers want them. |
…enies it, plus regression tests for the breakaway bit (#40956) #40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` #40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to #40909 (merged). I had a parallel PR (#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are the contents of this one. Closing #40934 and opening this slimmed-down version as the focused follow-up.
…enies it, plus regression tests for the breakaway bit (NousResearch#40956) NousResearch#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`, which fixed the headline bug (gateway dies after Desktop GUI update and never comes back). The flag's own docstring acknowledges that restrictive parent job objects can still refuse breakaway with `ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen` call: "Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing." That's true for `_spawn_detached` in `gateway_windows.py` (the `hermes gateway start` path), which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in `launch_detached_profile_gateway_restart` (`hermes_cli/gateway.py`), which only has `except OSError: return False` and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway. This PR closes that gap and adds the regression tests that were missing from the original fix. ## Changes ### `hermes_cli/_subprocess_compat.py` Adds a sibling helper `windows_detach_flags_without_breakaway()` so callers can express the fallback symbolically (via the helper) rather than coding the magic `& ~0x01000000` mask at every site. Documented on `windows_detach_flags` and `windows_detach_flags_without_breakaway` with the recommended try/except pattern. ### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart` Two changes, both aligned with the canonical pattern in `gateway_windows._spawn_detached`: 1. The outer watcher Popen now wraps in `try/except OSError`, and on failure retries with `windows_detach_flags_without_breakaway()` (POSIX never reaches this branch — `start_new_session=True` can't raise OSError). 2. The inlined respawn payload (the `python -c` watcher) also wraps its CreateProcess in try/except OSError and retries with `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't. ### Regression tests in `tests/tools/test_windows_native_support.py` NousResearch#40909 shipped the fix without any test that the breakaway bit is present (the existing `test_windows_detach_flags_has_expected_win32_bits` asserted only the three legacy bits). Four new tests close that: - `test_windows_detach_flags_includes_breakaway_from_job` — explicit assertion that the breakaway bit is in the default bundle, with the rationale spelled out in the docstring so a future maintainer staring at this test understands why removing it would resurrect the gateway-dies-after-GUI-update bug. - `test_windows_detach_flags_without_breakaway_drops_only_that_bit` — fallback payload keeps the other three detach bits intact. - `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway` — static-text check on the stringified watcher payload. The inlined Python program isn't reachable via normal import-time inspection because it lives in a `textwrap.dedent("""...""")` literal that gets passed to a separate `python -c` interpreter. Asserting that both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex literal) appear inside the dedent block is a sufficient regression guard against accidental refactors. - `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback` — static check that this PR's fallback retry is wired up symbolically. Without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test; the text guard catches the case where a future refactor removes the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry. Also extends `test_windows_detach_flags_has_expected_win32_bits` to include the breakaway bit assertion and updates `test_windows_flags_zero_on_posix` to cover the new helper. ## Tests Locally on Windows: 8/8 in the `-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"` slice pass. Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-NousResearch#40909 main reproduces all 10 identically — none are regressions. POSIX paths unchanged: `windows_detach_flags()` and `windows_detach_flags_without_breakaway()` both return 0 off Windows, `windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`. ## Out of scope - The other detached-spawn site in `hermes_cli/gateway.py` (around line 3068) also uses `windows_detach_popen_kwargs()` + `except OSError`. It deserves the same fallback treatment but the codepath is different enough (not the update-flow watcher) that it warrants a separate PR with its own scrutiny. - `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs` too — same reasoning. ## Context Follow-up to NousResearch#40909 (merged). I had a parallel PR (NousResearch#40934, closed) that duplicated the core breakaway fix; the bits unique to that PR that NousResearch#40909 didn't cover are the contents of this one. Closing NousResearch#40934 and opening this slimmed-down version as the focused follow-up.
Symptom
On Windows, every Desktop-GUI-driven update leaves the user without a
running gateway until they manually run
hermes gateway start. This wasmine, here's the gateway status output after a GUI update with the
gateway alive beforehand:
The user has to know to manually restart it. Cron jobs / kanban
dispatch / messaging bots silently stop firing in the meantime.
Root cause
hermes update --gateway(the flag the Tauri / Electron Desktop updateralready passes via
apps/bootstrap-installer/src-tauri/src/update.rs)calls
launch_detached_profile_gateway_restart()inhermes_cli/gateway.pyfor each running profile gateway. That function spawns a tiny watcher
subprocess that polls the old PID until it exits, then respawns the
gateway.
The watcher and its respawned gateway both need to outlive the Electron
Desktop process that triggered the update — Electron exits mid-update so
the venv shim is unlocked. On Windows, the watcher was spawned with:
…but NOT
CREATE_BREAKAWAY_FROM_JOB(0x01000000).Electron wraps its child processes in a Win32 job object. When Electron
quits, the OS tears that job down and reaps every descendant that didn't
explicitly break away —
DETACHED_PROCESSdoesn't save you. The watcherdied before it could relaunch the gateway, and the user was left with no
gateway and no on-crash recovery (Startup folder
.cmdonly fires onnext login).
hermes gateway startitself already does the right thing — its_spawn_detachedingateway_windows.pyincludes the breakaway bitplus a retry without it for restrictive job-object configurations.
This PR mirrors that pattern in the two spawn sites the update flow uses.
Fix
1.
hermes_cli/_subprocess_compat.pywindows_detach_flags()now includesCREATE_BREAKAWAY_FROM_JOBinits default bundle.
windows_detach_popen_kwargs()inherits the newflag automatically — every existing caller of the helper gets the
fix for free.
windows_detach_flags_without_breakaway()exposes thepre-breakaway bundle as the documented retry payload for callers
that handle the
ERROR_ACCESS_DENIEDcase explicitly.2.
hermes_cli/gateway.py::launch_detached_profile_gateway_restartPython program passed to
python -c) now sets the breakaway bit onthe respawned gateway, with the same access-denied fallback as
_spawn_detached.try/except OSErrorand retries without the breakaway bit usingthe new helper, mirroring
_spawn_detached.POSIX paths are completely unchanged:
windows_detach_flags()returns 0off Windows and
windows_detach_popen_kwargs()still yields{"start_new_session": True}on Linux/macOS.Why a static-text test for the inlined watcher
The watcher is a stringified Python program —
textwrap.dedent("""...""")— that gets passed to a separate
python -cinterpreter. Parsing it foran exact AST node would be brittle (it would fight every future refactor
of the dedent block). The textual presence of
0x01000000AND thesymbolic name
_CREATE_BREAKAWAY_FROM_JOBin the watcher block is asufficient regression guard: anyone refactoring the watcher payload has
to consciously delete both before the bit silently goes away.
Tests
test_windows_detach_flags_includes_breakaway_from_job— regressionguard so the breakaway bit isn't accidentally dropped from the
default flag bundle.
test_windows_detach_flags_without_breakaway_drops_only_that_bit—fallback payload keeps the other three detach bits intact.
test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway— static check on the stringified payload (see above).
test_windows_detach_flags_has_expected_win32_bitsnowalso asserts the breakaway bit is present.
Locally on Windows the targeted slice (
-k "detach or breakaway") is 7passed, 0 failed. The broader
tests/hermes_cli/test_gateway*.pysliceruns 154 passed with 8 pre-existing POSIX-only-test-on-Windows failures
that also fail identically on
origin/main(os.geteuidetc.) — notregressions from this change.
Out of scope
apps/desktop/electron/main.cjsalready does the right thing on itsside:
releaseBackendLockForUpdate()reaps the backend tree beforespawning
hermes-setup.exe, and the Tauri updater callsforce_kill_other_hermes()ifwait_for_venv_freetimes out. Thebug fixed here is downstream of both — it's the post-update respawn
watcher, not the pre-update lock release.
hermes gateway installregisters aHermes_Gateway_Supervisorscheduled task that polls + respawns the gateway on Windows. This PR
doesn't depend on or interact with that task; the supervisor is a
belt-and-braces recovery path that would eventually bring the gateway
back, but with a per-minute polling delay. This PR makes the direct
update→respawn path work, so the supervisor's recovery role stays
for actual crashes.
Related skill
I added the analysis pattern to my private Hermes diagnostics skill
(
hermes-gateway-diagnostics, Step 6) when this first surfaced. Happyto fold it into the docs/troubleshooting site if useful.