Skip to content

fix(gateway): treat stat-error on PATH probes as missing dir, not crash#26640

Closed
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/gateway-service-path-permission-error
Closed

fix(gateway): treat stat-error on PATH probes as missing dir, not crash#26640
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/gateway-service-path-permission-error

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

Summary

  • _build_service_path_dirs() propagated PermissionError (and other OSErrors) from is_dir() probes, crashing generate_systemd_unit() / generate_launchd_plist() instead of just skipping the unreachable path.
  • Wrap each probe in a small _safe_is_dir() helper that returns False on OSError. Existing "skip non-existent" semantics are unchanged.
  • Adds two regression tests (PermissionError + generic OSError) and verifies the change unblocks the previously-broken TestSystemUnitHermesHome cases on CI.

The bug

hermes_cli/gateway.py:2106 (_build_service_path_dirs, introduced by d69eab1ef on 2026-05-15) probes four candidate directories with pathlib.Path.is_dir() to decide what to include in the service unit's PATH. Two of those probes target ~/.hermes/node/bin and ~/.hermes/node_modules/.bin resolved through get_hermes_home().

Path.is_dir() calls os.stat() and re-raises whatever stat returns. On CI runners executing as root, os.stat('/root/.hermes/node/bin') raises EACCES (the directory exists in a partially-locked-down layout) — and the unhandled PermissionError then propagates all the way up out of generate_systemd_unit(). Same shape would hit any production setup with a partially-locked-down ~/.hermes (multi-user systemd installs, sandboxed services).

This is what made tests/hermes_cli/test_gateway_service.py::TestSystemUnitHermesHome's two _remap/_target_user_home cases start failing on origin/main CI starting with run 25944419334 — the symptom of the bug, not the cause.

hermes_cli/gateway.py:2125: in _build_service_path_dirs
    if hermes_node.is_dir():
...
E   PermissionError: [Errno 13] Permission denied: '/root/.hermes/node/bin'

The fix

Add _safe_is_dir(path: Path) -> bool that wraps path.is_dir() in a try/except for OSError and returns False on stat failure. Route every is_dir() probe in _build_service_path_dirs() through it. No other semantics change — a path that succeeds at is_dir() is still treated identically.

Test plan

  • Focused regression test (new): test_service_path_treats_permission_error_as_missing patches Path.is_dir to raise PermissionError only for the /.hermes/node/bin probe and asserts the function still returns without crashing and excludes that path. Mirror test for generic OSError(EIO).
  • Adjacent suite: tests/hermes_cli/test_gateway_service_paths.py (5/5 passed) + tests/hermes_cli/test_gateway_service.py::TestSystemUnitHermesHome (5/5 passed locally).
  • Regression guard: with the production hunk reverted but tests kept, both new tests fail with raw OSError/PermissionError propagating out of _build_service_path_dirs. With the fix restored, both pass.

The other failures in test_gateway_service.py on local Mac (TestSystemdServiceRefresh::*, TestGatewaySystemServiceRouting::*UserSystemdUnavailableError) are pre-existing baseline failures on clean origin/main (no systemd / loginctl on macOS) and unrelated to this change.

Related

  • Picks up where d69eab1ef (fix(gateway): build service PATH from existing dirs only, include ~/.hermes/node_modules) left off — that commit correctly introduced the existence-check pattern but didn't account for stat-time errors.

Sibling code paths that may need the same fix: none located in this module — the other filesystem probes (_detect_venv_dir, _build_user_local_paths, _build_wsl_interop_paths) already guard or operate on paths where EACCES is unreachable. Intentionally left out of this PR's scope to keep the diff small — happy to widen if a sibling crash surfaces.

Copilot AI review requested due to automatic review settings May 15, 2026 23:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a defensive wrapper around Path.is_dir() so service unit PATH construction doesn’t fail when directory probes raise OSError (e.g., permission issues, broken symlinks, I/O errors), and introduces regression tests for those scenarios.

Changes:

  • Introduce _safe_is_dir() that treats OSError from Path.is_dir() as “missing directory”.
  • Update _build_service_path_dirs() to use _safe_is_dir() for all candidate directories.
  • Add unit tests covering PermissionError and generic OSError during PATH dir probing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
tests/hermes_cli/test_gateway_service_paths.py Adds regression tests ensuring service PATH generation doesn’t crash on is_dir() errors.
hermes_cli/gateway.py Adds _safe_is_dir() and uses it to make PATH construction resilient to stat/filesystem errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +48 to +54
def fake_is_dir(self):
# Only the hermes_home node/bin probe trips EACCES; other paths
# must keep their real behavior so the rest of the function is
# exercised normally.
if str(self).endswith("/.hermes/node/bin"):
raise PermissionError(13, "Permission denied", str(self))
return real_is_dir(self)
Comment on lines +70 to +72

def fake_is_dir(self):
if str(self).endswith("/.hermes/node_modules/.bin"):
Comment on lines +38 to +42
Reproduces the CI baseline (#26622 audit): on Ubuntu runners executing
as root, ``stat('/root/.hermes/node/bin')`` can raise ``EACCES`` even
though the calling user owns ``/root``. Before the guard,
``generate_systemd_unit()`` and ``generate_launchd_plist()`` would
propagate the ``OSError`` and refuse to produce any unit at all.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery labels May 15, 2026
@briandevans

Copy link
Copy Markdown
Contributor Author

CI audit — both failing checks are pre-existing baselines unrelated to this PR's touched code (hermes_cli/gateway.py, tests/hermes_cli/test_gateway_service_paths.py).

Test Symptom on CI Root cause on main
tests/tools/test_transcription_dotenv_fallback.py::TestProviderSelectionGate::test_explicit_xai_sees_dotenv AssertionError: assert 'none' == 'xai' After e13c1b806 (recent), xAI credential resolution moved into tools.xai_http.resolve_xai_http_credentials(); the test still only patches hermes_cli.config.load_env, not the new tools.xai_http.get_env_value entry point used by the resolver.
tests/tools/test_transcription_dotenv_fallback.py::TestEndToEndRegressionGuard::test_xai_key_only_in_dotenv_before_fix assert False is True Same root cause — _transcribe_xai now reaches the xAI key through the same resolver.

Same two failures show on PR #26622 and PR #26594 in the same window. Happy to re-run CI once these are addressed on main.

@briandevans

Copy link
Copy Markdown
Contributor Author

@copilot All three findings addressed in commit 3f4f934a7:

  • Lines 54 & 72 (path-portability): Replaced str(self).endswith("/.hermes/node/bin") and the /.hermes/node_modules/.bin counterpart with direct Path equality (self == target). The predicate now fires on Windows runners where str(Path) uses backslashes — without this fix the injected exception would never be raised on Windows and the regression test would silently no-op.
  • Line 42 (docstring inconsistency): Reworded the EACCES rationale to the concrete failure mode (locked-down sandboxed filesystem layer returning EACCES from stat) without asserting /root ownership semantics that may be incorrect.

Both regression tests still pass locally and still trigger the production guard against an unguarded baseline.

briandevans and others added 2 commits May 16, 2026 00:20
`_build_service_path_dirs()` calls `is_dir()` on candidate venv, node,
and `~/.hermes/node*` paths to build PATH for systemd / launchd units.
Each probe assumed `is_dir()` would only ever return True/False — but
`pathlib.Path.is_dir()` calls `os.stat()` under the hood and can raise
`PermissionError` (and other `OSError` variants for broken symlinks,
unreachable network mounts, etc.).

On CI runners executing as root, `os.stat('/root/.hermes/node/bin')`
can raise `EACCES` even though the calling user technically owns the
path. The unhandled `OSError` propagated out of `generate_systemd_unit`
and `generate_launchd_plist`, refusing to produce any unit at all.

Same shape would hit any production setup with a partially-locked-down
`~/.hermes` (multi-user systemd installs, sandboxed services).

Wrap each `is_dir()` probe in a small helper that returns False on
`OSError`. The remaining is-it-present semantics are unchanged.

Regression evidence: this exact code path is what made
`tests/hermes_cli/test_gateway_service.py::TestSystemUnitHermesHome`'s
two `_remap`/`_target_user_home` cases start failing on `origin/main`
CI immediately after d69eab1 extracted `_build_service_path_dirs`
(2026-05-15). The tests reproduce on CI run 25944419334 against clean
main; local Mac runs pass because `/root/.hermes` doesn't exist there.

Sibling code paths that may need the same fix: none located — the
other filesystem probes in this module (`_detect_venv_dir`,
`_build_user_local_paths`, `_build_wsl_interop_paths`) already guard
or operate on paths where EACCES is unreachable. Happy to widen if a
sibling crash surfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address three Copilot review findings on PR NousResearch#26640:

- Replace `str(self).endswith("/.hermes/node/bin")` with `self == target`
  (and likewise for `node_modules/.bin`) so the predicate fires on
  Windows where Path stringifies with backslashes. Without this the
  injected exception was never raised on Windows runners and the
  regression test silently became a no-op.
- Reword the EACCES docstring to a concrete failure mode (locked-down
  sandboxed filesystem) without asserting ownership semantics that may
  be incorrect.

Pure test-side change. The production guard in `_build_service_path_dirs`
(`hermes_cli/gateway.py`) is unchanged. Both regression tests still
pass and still trigger the guard when run against the unguarded
baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@teknium1

Copy link
Copy Markdown
Contributor

Superseded by #27562 (commit bc7c608d5), which landed @aqilaziz's parallel #26565 fix.

The two PRs are functionally equivalent — both wrap the four is_dir() calls in _build_service_path_dirs() with an OSError-catching helper so a locked-down ~/.hermes/node/bin doesn't crash unit generation. #26565 happened to merge first via the salvage chain that w0lfgang88 flagged as the unblocker for the post-#25957 fallout.

Verified: scripts/run_tests.sh tests/hermes_cli/test_gateway_service.py -q → 132/132 pass on current main with #26565 merged.

Sorry about the duplicate — appreciate the analysis and the targeted regression tests on your PR. Closing without salvage since the substance is already in. If you want to follow up with your two PermissionError-specific regression tests (test_service_path_treats_permission_error_as_missing + test_service_path_treats_oserror_as_missing) on top of _is_dir, that's a welcome additive PR — they're a cleaner repro than what #26565 brought.

@teknium1 teknium1 closed this May 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants