Skip to content

fix(windows): respawn gateway after Desktop GUI update by breaking the watcher away from Electron's job object#40934

Closed
teknium1 wants to merge 1 commit into
mainfrom
fix/windows-update-gateway-respawn-breakaway
Closed

fix(windows): respawn gateway after Desktop GUI update by breaking the watcher away from Electron's job object#40934
teknium1 wants to merge 1 commit into
mainfrom
fix/windows-update-gateway-respawn-breakaway

Conversation

@teknium1

@teknium1 teknium1 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Symptom

On Windows, every Desktop-GUI-driven update leaves the user without a
running gateway until they manually run hermes gateway start. This was
mine, here's the gateway status output after a GUI update with the
gateway alive beforehand:

PS C:\WINDOWS\system32> hermes gateway status
✓ Windows login item installed: ...\Startup\Hermes_Gateway.cmd
✗ No gateway process detected

The user has to know to manually restart it. Cron jobs / kanban
dispatch / messaging bots silently stop firing in the meantime.

Root cause

hermes update --gateway (the flag the Tauri / Electron Desktop updater
already passes via apps/bootstrap-installer/src-tauri/src/update.rs)
calls launch_detached_profile_gateway_restart() in hermes_cli/gateway.py
for each running profile gateway. That function spawns a tiny watcher
subprocess that polls the old PID until it exits, then respawns the
gateway.

The watcher and its respawned gateway both need to outlive the Electron
Desktop process that triggered the update — Electron exits mid-update so
the venv shim is unlocked. On Windows, the watcher was spawned with:

DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP | CREATE_NO_WINDOW

…but NOT CREATE_BREAKAWAY_FROM_JOB (0x01000000).

Electron wraps its child processes in a Win32 job object. When Electron
quits, the OS tears that job down and reaps every descendant that didn't
explicitly break away — DETACHED_PROCESS doesn't save you. The watcher
died before it could relaunch the gateway, and the user was left with no
gateway and no on-crash recovery (Startup folder .cmd only fires on
next login).

hermes gateway start itself already does the right thing — its
_spawn_detached in gateway_windows.py includes the breakaway bit
plus a retry without it for restrictive job-object configurations.

# hermes_cli/gateway_windows.py:_spawn_detached  (already correct)
flags = 0x00000008 | 0x00000200 | 0x08000000 | 0x01000000
try:
    proc = subprocess.Popen(argv, ..., creationflags=flags, ...)
except OSError:
    flags_no_breakaway = flags & ~0x01000000
    proc = subprocess.Popen(argv, ..., creationflags=flags_no_breakaway, ...)

This PR mirrors that pattern in the two spawn sites the update flow uses.

Fix

1. hermes_cli/_subprocess_compat.py

  • windows_detach_flags() now includes CREATE_BREAKAWAY_FROM_JOB in
    its default bundle. windows_detach_popen_kwargs() inherits the new
    flag automatically — every existing caller of the helper gets the
    fix for free.
  • New helper windows_detach_flags_without_breakaway() exposes the
    pre-breakaway bundle as the documented retry payload for callers
    that handle the ERROR_ACCESS_DENIED case explicitly.

2. hermes_cli/gateway.py::launch_detached_profile_gateway_restart

  • The inlined respawn script (the watcher's payload — a stringified
    Python program passed to python -c) now sets the breakaway bit on
    the respawned gateway, with the same access-denied fallback as
    _spawn_detached.
  • The outer Popen that launches the watcher itself wraps in a
    try/except OSError and retries without the breakaway bit using
    the new helper, mirroring _spawn_detached.

POSIX paths are completely unchanged: windows_detach_flags() returns 0
off Windows and windows_detach_popen_kwargs() still yields
{"start_new_session": True} on Linux/macOS.

Why a static-text test for the inlined watcher

The watcher is a stringified Python program — textwrap.dedent("""...""")
— that gets passed to a separate python -c interpreter. Parsing it for
an exact AST node would be brittle (it would fight every future refactor
of the dedent block). The textual presence of 0x01000000 AND the
symbolic name _CREATE_BREAKAWAY_FROM_JOB in the watcher block is a
sufficient regression guard: anyone refactoring the watcher payload has
to consciously delete both before the bit silently goes away.

Tests

  • test_windows_detach_flags_includes_breakaway_from_job — regression
    guard so the breakaway bit isn't accidentally dropped from the
    default flag bundle.
  • test_windows_detach_flags_without_breakaway_drops_only_that_bit
    fallback payload keeps the other three detach bits intact.
  • test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway
    — static check on the stringified payload (see above).
  • The existing test_windows_detach_flags_has_expected_win32_bits now
    also asserts the breakaway bit is present.

Locally on Windows the targeted slice (-k "detach or breakaway") is 7
passed, 0 failed. The broader tests/hermes_cli/test_gateway*.py slice
runs 154 passed with 8 pre-existing POSIX-only-test-on-Windows failures
that also fail identically on origin/main (os.geteuid etc.) — not
regressions from this change.

Out of scope

  • apps/desktop/electron/main.cjs already does the right thing on its
    side: releaseBackendLockForUpdate() reaps the backend tree before
    spawning hermes-setup.exe, and the Tauri updater calls
    force_kill_other_hermes() if wait_for_venv_free times out. The
    bug fixed here is downstream of both — it's the post-update respawn
    watcher, not the pre-update lock release.
  • hermes gateway install registers a Hermes_Gateway_Supervisor
    scheduled task that polls + respawns the gateway on Windows. This PR
    doesn't depend on or interact with that task; the supervisor is a
    belt-and-braces recovery path that would eventually bring the gateway
    back, but with a per-minute polling delay. This PR makes the direct
    update→respawn path work, so the supervisor's recovery role stays
    for actual crashes.

Related skill

I added the analysis pattern to my private Hermes diagnostics skill
(hermes-gateway-diagnostics, Step 6) when this first surfaced. Happy
to fold it into the docs/troubleshooting site if useful.

…e watcher away from Electron's job object

`hermes update --gateway` (which the Tauri / Electron Desktop updater calls
to relaunch the user's gateway after an update) spawns a tiny watcher
subprocess that polls the old gateway PID, SIGTERMs it, and respawns the
new one. The watcher and its respawned gateway both need to outlive the
Electron Desktop process that triggered the update — Electron exits
mid-update so the venv shim is unlocked.

On Windows that was failing in a subtle way: the watcher was spawned
with `DETACHED_PROCESS | CREATE_NEW_PROCESS_GROUP | CREATE_NO_WINDOW`
but NOT `CREATE_BREAKAWAY_FROM_JOB`. Electron wraps its child processes
in a Win32 job object; when Electron quits, the OS tears the job down
and reaps every descendant that didn't explicitly break away — DETACHED
or not. The watcher died before it could relaunch the gateway. End
result: every Desktop-GUI-driven update left the user without a gateway
until they ran `hermes gateway start` manually.

`hermes gateway start` itself already does the right thing (its
`_spawn_detached` in `gateway_windows.py` includes the breakaway bit
plus a retry without it for restrictive job objects). The fix is to
mirror that pattern in the two spawn sites the update flow uses:

1. `windows_detach_flags()` in `_subprocess_compat.py` now includes
   `CREATE_BREAKAWAY_FROM_JOB` (0x01000000) in its default bundle.
   `windows_detach_popen_kwargs()` inherits the new flag automatically.
   A new helper `windows_detach_flags_without_breakaway()` exposes the
   pre-breakaway bundle as the documented retry payload.

2. `launch_detached_profile_gateway_restart()` in `gateway.py`:
   - The inlined respawn script (the watcher's payload — a stringified
     Python program passed to `python -c`) now sets the breakaway bit
     on the respawned gateway, with the same access-denied fallback.
   - The outer Popen that launches the watcher itself wraps in a
     try/except OSError and retries without the breakaway bit using
     `windows_detach_flags_without_breakaway()`, mirroring
     `_spawn_detached` in `gateway_windows.py`.

POSIX paths are unchanged: `windows_detach_flags()` returns 0 off
Windows and `windows_detach_popen_kwargs()` still yields
`{"start_new_session": True}`.

Tests:

- `test_windows_detach_flags_includes_breakaway_from_job` — regression
  guard so the breakaway bit isn't accidentally dropped from the
  default flag bundle.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static check that the stringified watcher payload contains
  `CREATE_BREAKAWAY_FROM_JOB` symbolically and as the hex literal, so
  the intent is greppable and future refactors don't silently drop it.
- The existing `test_windows_detach_flags_has_expected_win32_bits` now
  also asserts the breakaway bit is present.
@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/windows-update-gateway-respawn-breakaway vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9969 on HEAD, 9969 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5171 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists labels Jun 7, 2026
@teknium1

teknium1 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor Author

Closing in favor of #40909 (now merged) which contained the same core CREATE_BREAKAWAY_FROM_JOB fix bundled with the status --deep probes and Startup-folder leak fix.

The bits of this PR that #40909 did NOT include — the windows_detach_flags_without_breakaway() helper, the OSError access-denied fallback on the watcher Popen / inlined respawn (mirroring what _spawn_detached in gateway_windows.py already does), and the static-text regression test on the inlined watcher payload — are still potentially useful. Will open a focused follow-up PR for just those if the maintainers want them.

@teknium1 teknium1 closed this Jun 7, 2026
teknium1 added a commit that referenced this pull request Jun 7, 2026
…enies it, plus regression tests for the breakaway bit (#40956)

#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:

  "Callers in this codebase already wrap detached spawns in
  try/except OSError and fall back to a cmd.exe wrapper, so the
  breakaway-denied case degrades gracefully rather than crashing."

That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were
missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:

1. The outer watcher Popen now wraps in `try/except OSError`, and on
   failure retries with `windows_detach_flags_without_breakaway()`
   (POSIX never reaches this branch — `start_new_session=True` can't
   raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
   wraps its CreateProcess in try/except OSError and retries with
   `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
   because the watcher's job-object inheritance is independent of the
   outer process's — even if the outer Popen succeeds with breakaway,
   the respawned gateway might inherit a job that doesn't.

### Regression tests in `tests/tools/test_windows_native_support.py`

#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:

- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
  assertion that the breakaway bit is in the default bundle, with the
  rationale spelled out in the docstring so a future maintainer
  staring at this test understands why removing it would resurrect
  the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static-text check on the stringified watcher payload. The inlined
  Python program isn't reachable via normal import-time inspection
  because it lives in a `textwrap.dedent("""...""")` literal that
  gets passed to a separate `python -c` interpreter. Asserting that
  both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
  literal) appear inside the dedent block is a sufficient regression
  guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
  — static check that this PR's fallback retry is wired up
  symbolically. Without standing up a real Windows job object that
  refuses breakaway, we can't trigger the OSError in a unit test;
  the text guard catches the case where a future refactor removes
  the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.

Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-#40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around
  line 3068) also uses `windows_detach_popen_kwargs()` + `except
  OSError`. It deserves the same fallback treatment but the codepath
  is different enough (not the update-flow watcher) that it warrants
  a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
  too — same reasoning.

## Context

Follow-up to #40909 (merged). I had a parallel PR (#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that #40909 didn't cover are the contents of this one. Closing #40934
and opening this slimmed-down version as the focused follow-up.
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…enies it, plus regression tests for the breakaway bit (NousResearch#40956)

NousResearch#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:

  "Callers in this codebase already wrap detached spawns in
  try/except OSError and fall back to a cmd.exe wrapper, so the
  breakaway-denied case degrades gracefully rather than crashing."

That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were
missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:

1. The outer watcher Popen now wraps in `try/except OSError`, and on
   failure retries with `windows_detach_flags_without_breakaway()`
   (POSIX never reaches this branch — `start_new_session=True` can't
   raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
   wraps its CreateProcess in try/except OSError and retries with
   `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
   because the watcher's job-object inheritance is independent of the
   outer process's — even if the outer Popen succeeds with breakaway,
   the respawned gateway might inherit a job that doesn't.

### Regression tests in `tests/tools/test_windows_native_support.py`

NousResearch#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:

- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
  assertion that the breakaway bit is in the default bundle, with the
  rationale spelled out in the docstring so a future maintainer
  staring at this test understands why removing it would resurrect
  the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static-text check on the stringified watcher payload. The inlined
  Python program isn't reachable via normal import-time inspection
  because it lives in a `textwrap.dedent("""...""")` literal that
  gets passed to a separate `python -c` interpreter. Asserting that
  both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
  literal) appear inside the dedent block is a sufficient regression
  guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
  — static check that this PR's fallback retry is wired up
  symbolically. Without standing up a real Windows job object that
  refuses breakaway, we can't trigger the OSError in a unit test;
  the text guard catches the case where a future refactor removes
  the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.

Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-NousResearch#40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around
  line 3068) also uses `windows_detach_popen_kwargs()` + `except
  OSError`. It deserves the same fallback treatment but the codepath
  is different enough (not the update-flow watcher) that it warrants
  a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
  too — same reasoning.

## Context

Follow-up to NousResearch#40909 (merged). I had a parallel PR (NousResearch#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that NousResearch#40909 didn't cover are the contents of this one. Closing NousResearch#40934
and opening this slimmed-down version as the focused follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants