Skip to content

fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit#40956

Merged
teknium1 merged 1 commit into
mainfrom
fix/windows-update-gateway-respawn-breakaway-fallback
Jun 7, 2026
Merged

fix(windows): retry watcher Popen without breakaway when parent job denies it, plus regression tests for the breakaway bit#40956
teknium1 merged 1 commit into
mainfrom
fix/windows-update-gateway-respawn-breakaway-fallback

Conversation

@teknium1

@teknium1 teknium1 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Why

Follow-up to #40909 (merged), which fixed the headline "gateway dies after Desktop GUI update and never comes back" bug by adding CREATE_BREAKAWAY_FROM_JOB to windows_detach_flags(). That PR's own docstring acknowledges the next-tier failure mode:

If a process is in a job that disallows breakaway (rare — JOB_OBJECT_LIMIT_BREAKAWAY_OK isn't set), CreateProcess returns ERROR_ACCESS_DENIED. Python surfaces that as PermissionError on the subprocess.Popen call. Callers in this codebase already wrap detached spawns in try/except OSError and fall back to a cmd.exe wrapper, so the breakaway-denied case degrades gracefully rather than crashing.

That's true for _spawn_detached in gateway_windows.py (the hermes gateway start path) which has both the breakaway bit AND a retry-without-breakaway fallback. It's NOT true for the post-update watcher path in launch_detached_profile_gateway_restart (hermes_cli/gateway.py), which has except OSError: return False and gives up entirely. If a user's shell/terminal/container wraps Hermes in a breakaway-denying job, the gateway-respawn watcher silently fails to launch instead of trying again without breakaway.

This PR closes the gap and adds the regression tests that were missing from the original fix.

Changes

hermes_cli/_subprocess_compat.py

Adds a sibling helper windows_detach_flags_without_breakaway() so callers can express the fallback symbolically rather than coding & ~0x01000000 magic at every site. Documented on both helpers with the recommended try/except pattern.

hermes_cli/gateway.py::launch_detached_profile_gateway_restart

Two changes aligned with the canonical pattern in gateway_windows._spawn_detached:

  1. The outer watcher Popen now wraps in try/except OSError and retries with windows_detach_flags_without_breakaway(). POSIX never reaches this branch (start_new_session=True can't raise OSError) so the fallback is Windows-only in practice.
  2. The inlined respawn payload (the stringified Python program passed to python -c) wraps its CreateProcess in try/except OSError and retries with _flags & ~_CREATE_BREAKAWAY_FROM_JOB on failure. This matters because the watcher's job-object inheritance is independent of the outer process's — even if the outer Popen succeeds with breakaway, the respawned gateway might inherit a job that doesn't.

Regression tests in tests/tools/test_windows_native_support.py

#40909 shipped the fix without any test that the breakaway bit is present (the existing test_windows_detach_flags_has_expected_win32_bits asserted only the three legacy bits). Four new tests close that:

Test What it guards
test_windows_detach_flags_includes_breakaway_from_job Explicit assertion that breakaway bit is in the default bundle. Removing it would resurrect the gateway-dies-after-GUI-update bug.
test_windows_detach_flags_without_breakaway_drops_only_that_bit Fallback payload keeps the other three detach bits intact.
test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway Static-text guard on the stringified watcher payload (since it lives in a textwrap.dedent("""...""") literal that's not reachable via import-time inspection).
test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback Static-text guard for the new fallback retry.

Also extends test_windows_detach_flags_has_expected_win32_bits to include the breakaway bit and updates test_windows_flags_zero_on_posix to cover the new helper.

Why static-text tests for the inlined payload

The watcher is a stringified Python program — textwrap.dedent("""...""") — that gets passed to a separate python -c interpreter. Parsing it for an exact AST node would be brittle (it would fight every future refactor of the dedent block). The textual presence of 0x01000000 AND the symbolic name _CREATE_BREAKAWAY_FROM_JOB in the watcher block is a sufficient regression guard: anyone refactoring the watcher payload has to consciously delete both before the bit silently goes away. Similarly for the ~_CREATE_BREAKAWAY_FROM_JOB retry — without standing up a real Windows job object that refuses breakaway, we can't trigger the OSError in a unit test.

Tests

On Windows, locally (post-#40909 main + this PR):

  • Targeted slice (-k "detach or breakaway or popen_kwargs or launch_detached or gateway_run_update or hermes_cli_gateway"): 8/8 pass, including all 4 new tests.
  • Broader tests/hermes_cli/test_gateway*.py + tests/tools/test_windows_native_support.py: 172 passed, 10 failed. All 10 failures are pre-existing POSIX-only tests running on a Windows host (os.geteuid, SIGKILL fallback, is_linux fixture mismatches). Stashing this PR and re-running on bare post-fix(gateway,windows): reliability — JOB breakaway + status --deep probes + test-leak fix #40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: windows_detach_flags() and windows_detach_flags_without_breakaway() both return 0 off Windows, windows_detach_popen_kwargs() still yields {"start_new_session": True} on Linux/macOS.

Out of scope

  • The other detached-spawn site in hermes_cli/gateway.py (around line 3068, also uses windows_detach_popen_kwargs() + except OSError) deserves the same fallback treatment, but the codepath is different enough that it warrants a separate PR with its own scrutiny.
  • gateway/run.py has Windows branches with windows_detach_popen_kwargs too — same reasoning.

Context

Follow-up to #40909 (merged). I had opened a parallel PR (#40934, since closed) that duplicated the core breakaway fix; the bits unique to that PR that #40909 didn't cover are exactly what's in this one. This is the focused follow-up.

…enies it, plus regression tests for the breakaway bit

#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:

  "Callers in this codebase already wrap detached spawns in
  try/except OSError and fall back to a cmd.exe wrapper, so the
  breakaway-denied case degrades gracefully rather than crashing."

That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were
missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:

1. The outer watcher Popen now wraps in `try/except OSError`, and on
   failure retries with `windows_detach_flags_without_breakaway()`
   (POSIX never reaches this branch — `start_new_session=True` can't
   raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
   wraps its CreateProcess in try/except OSError and retries with
   `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
   because the watcher's job-object inheritance is independent of the
   outer process's — even if the outer Popen succeeds with breakaway,
   the respawned gateway might inherit a job that doesn't.

### Regression tests in `tests/tools/test_windows_native_support.py`

#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:

- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
  assertion that the breakaway bit is in the default bundle, with the
  rationale spelled out in the docstring so a future maintainer
  staring at this test understands why removing it would resurrect
  the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static-text check on the stringified watcher payload. The inlined
  Python program isn't reachable via normal import-time inspection
  because it lives in a `textwrap.dedent("""...""")` literal that
  gets passed to a separate `python -c` interpreter. Asserting that
  both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
  literal) appear inside the dedent block is a sufficient regression
  guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
  — static check that this PR's fallback retry is wired up
  symbolically. Without standing up a real Windows job object that
  refuses breakaway, we can't trigger the OSError in a unit test;
  the text guard catches the case where a future refactor removes
  the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.

Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-#40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around
  line 3068) also uses `windows_detach_popen_kwargs()` + `except
  OSError`. It deserves the same fallback treatment but the codepath
  is different enough (not the update-flow watcher) that it warrants
  a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
  too — same reasoning.

## Context

Follow-up to #40909 (merged). I had a parallel PR (#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that #40909 didn't cover are the contents of this one. Closing #40934
and opening this slimmed-down version as the focused follow-up.
@github-actions

github-actions Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/windows-update-gateway-respawn-breakaway-fallback vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9970 on HEAD, 9970 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5172 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery labels Jun 7, 2026
@teknium1 teknium1 merged commit fe0b3f2 into main Jun 7, 2026
23 checks passed
@teknium1 teknium1 deleted the fix/windows-update-gateway-respawn-breakaway-fallback branch June 7, 2026 08:22
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…enies it, plus regression tests for the breakaway bit (NousResearch#40956)

NousResearch#40909 added `CREATE_BREAKAWAY_FROM_JOB` to `windows_detach_flags()`,
which fixed the headline bug (gateway dies after Desktop GUI update
and never comes back). The flag's own docstring acknowledges that
restrictive parent job objects can still refuse breakaway with
`ERROR_ACCESS_DENIED`, surfacing as `OSError` on the `subprocess.Popen`
call:

  "Callers in this codebase already wrap detached spawns in
  try/except OSError and fall back to a cmd.exe wrapper, so the
  breakaway-denied case degrades gracefully rather than crashing."

That's true for `_spawn_detached` in `gateway_windows.py` (the
`hermes gateway start` path), which has both the breakaway bit AND a
retry-without-breakaway fallback. It's NOT true for the post-update
watcher path in `launch_detached_profile_gateway_restart`
(`hermes_cli/gateway.py`), which only has `except OSError: return
False` and gives up entirely. If a user's shell/terminal/container
wraps Hermes in a breakaway-denying job, the gateway-respawn watcher
silently fails to launch instead of trying again without breakaway.

This PR closes that gap and adds the regression tests that were
missing from the original fix.

## Changes

### `hermes_cli/_subprocess_compat.py`

Adds a sibling helper `windows_detach_flags_without_breakaway()` so
callers can express the fallback symbolically (via the helper) rather
than coding the magic `& ~0x01000000` mask at every site. Documented
on `windows_detach_flags` and `windows_detach_flags_without_breakaway`
with the recommended try/except pattern.

### `hermes_cli/gateway.py::launch_detached_profile_gateway_restart`

Two changes, both aligned with the canonical pattern in
`gateway_windows._spawn_detached`:

1. The outer watcher Popen now wraps in `try/except OSError`, and on
   failure retries with `windows_detach_flags_without_breakaway()`
   (POSIX never reaches this branch — `start_new_session=True` can't
   raise OSError).
2. The inlined respawn payload (the `python -c` watcher) also
   wraps its CreateProcess in try/except OSError and retries with
   `_flags & ~_CREATE_BREAKAWAY_FROM_JOB` on failure. This matters
   because the watcher's job-object inheritance is independent of the
   outer process's — even if the outer Popen succeeds with breakaway,
   the respawned gateway might inherit a job that doesn't.

### Regression tests in `tests/tools/test_windows_native_support.py`

NousResearch#40909 shipped the fix without any test that the breakaway bit is
present (the existing `test_windows_detach_flags_has_expected_win32_bits`
asserted only the three legacy bits). Four new tests close that:

- `test_windows_detach_flags_includes_breakaway_from_job` — explicit
  assertion that the breakaway bit is in the default bundle, with the
  rationale spelled out in the docstring so a future maintainer
  staring at this test understands why removing it would resurrect
  the gateway-dies-after-GUI-update bug.
- `test_windows_detach_flags_without_breakaway_drops_only_that_bit`
  — fallback payload keeps the other three detach bits intact.
- `test_launch_detached_profile_gateway_restart_inlined_watcher_uses_breakaway`
  — static-text check on the stringified watcher payload. The inlined
  Python program isn't reachable via normal import-time inspection
  because it lives in a `textwrap.dedent("""...""")` literal that
  gets passed to a separate `python -c` interpreter. Asserting that
  both `_CREATE_BREAKAWAY_FROM_JOB` (symbolic) and `0x01000000` (hex
  literal) appear inside the dedent block is a sufficient regression
  guard against accidental refactors.
- `test_launch_detached_profile_gateway_restart_outer_popen_has_access_denied_fallback`
  — static check that this PR's fallback retry is wired up
  symbolically. Without standing up a real Windows job object that
  refuses breakaway, we can't trigger the OSError in a unit test;
  the text guard catches the case where a future refactor removes
  the helper import or the `& ~_CREATE_BREAKAWAY_FROM_JOB` retry.

Also extends `test_windows_detach_flags_has_expected_win32_bits` to
include the breakaway bit assertion and updates
`test_windows_flags_zero_on_posix` to cover the new helper.

## Tests

Locally on Windows: 8/8 in the `-k "detach or breakaway or
popen_kwargs or launch_detached or gateway_run_update or
hermes_cli_gateway"` slice pass.

Broader `tests/hermes_cli/test_gateway*.py + test_windows_native_support.py`:
172 passed, 10 failed. All 10 failures are pre-existing POSIX-only
tests running on a Windows host (os.geteuid, SIGKILL fallback,
is_linux fixture mismatches). Stashing this PR and re-running on bare
post-NousResearch#40909 main reproduces all 10 identically — none are regressions.

POSIX paths unchanged: `windows_detach_flags()` and
`windows_detach_flags_without_breakaway()` both return 0 off Windows,
`windows_detach_popen_kwargs()` still yields `{"start_new_session": True}`.

## Out of scope

- The other detached-spawn site in `hermes_cli/gateway.py` (around
  line 3068) also uses `windows_detach_popen_kwargs()` + `except
  OSError`. It deserves the same fallback treatment but the codepath
  is different enough (not the update-flow watcher) that it warrants
  a separate PR with its own scrutiny.
- `gateway/run.py` has Windows branches with `windows_detach_popen_kwargs`
  too — same reasoning.

## Context

Follow-up to NousResearch#40909 (merged). I had a parallel PR (NousResearch#40934, closed)
that duplicated the core breakaway fix; the bits unique to that PR
that NousResearch#40909 didn't cover are the contents of this one. Closing NousResearch#40934
and opening this slimmed-down version as the focused follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants