fix(gateway): adopt unit's HERMES_HOME for --system CLI ops #22533

Closed
mbac wants to merge 1 commit into NousResearch:main from mbac:fix/sudo-system-hermes-home

mbac commented May 9, 2026

Contributor

Authorship & review disclaimer

This patch was authored by Claude (Opus) under the guidance of @mbac. @mbac is not proficient in coding and cannot personally attest to the quality or safety of this code. Reviewers, please review carefully.

What does this PR do?

Fixes the bug reported in #22035: sudo hermes gateway restart [--system] always reports "did not become active within 60s" (twice, ~245s total) even though the gateway restarts successfully.

Root cause: under sudo, HERMES_HOME is stripped and HOME=/root, so get_hermes_home() falls back to /root/.hermes. The wait loop in _wait_for_systemd_service_restart reads gateway_state.json from the wrong path, never observes gateway_state == "running", and hits the 60s timeout. The forced fallback then SIGTERMs the in-progress new instance.
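The path mix-up can be illustrated with a minimal sketch. `resolve_hermes_home` is a hypothetical stand-in for `get_hermes_home()`, assuming it prefers `HERMES_HOME` and otherwise falls back to `$HOME/.hermes`; the real function may differ. The user name `alice` is made up for the example.

```python
from pathlib import PurePosixPath
from typing import Mapping


def resolve_hermes_home(env: Mapping[str, str]) -> PurePosixPath:
    """Hypothetical stand-in for get_hermes_home().

    Assumption: HERMES_HOME wins when set, otherwise $HOME/.hermes is used.
    """
    explicit = env.get("HERMES_HOME")
    if explicit:
        return PurePosixPath(explicit)
    return PurePosixPath(env.get("HOME", "/")) / ".hermes"


# Environment of the service user (illustrative values).
user_env = {"HOME": "/home/alice", "HERMES_HOME": "/home/alice/.hermes"}
# Environment seen by the CLI under sudo: HERMES_HOME stripped, HOME reset.
sudo_env = {"HOME": "/root"}

print(resolve_hermes_home(user_env))  # /home/alice/.hermes
print(resolve_hermes_home(sudo_env))  # /root/.hermes
```

The two calls resolve to different directories, which is why the CLI and the running gateway disagree about where `gateway_state.json` lives.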

The installed unit already pins Environment="HERMES_HOME=…", so we recover the correct path from the unit definition (systemctl show -p Environment) and mirror it into os.environ before any status read. This implements Option 1 from the issue's "Proposed Fix" section.
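For reviewers, here is a rough sketch of how the `Environment=` property could be parsed. It is not the code in this PR: `parse_environment_property` is a hypothetical name, it takes the text printed by `systemctl show -p Environment` as input rather than invoking systemctl, and it assumes that shell-style splitting is good enough for the quoting systemd emits.

```python
import shlex
from typing import Dict


def parse_environment_property(show_output: str) -> Dict[str, str]:
    """Parse 'Environment=K=V K2=V2' text into a dict (illustrative helper).

    Assumption: assignments are space-separated and values containing
    spaces are quoted in a way shlex.split can handle.
    """
    env: Dict[str, str] = {}
    for line in show_output.splitlines():
        if not line.startswith("Environment="):
            continue
        payload = line[len("Environment="):]
        for token in shlex.split(payload):
            key, sep, value = token.partition("=")
            if sep and key:  # skip tokens that are not KEY=VALUE
                env[key] = value
    return env


# Made-up sample output; the path and the second variable are illustrative.
sample = 'Environment=HERMES_HOME=/home/alice/.hermes "GREETING=hello world"\n'
parsed = parse_environment_property(sample)
print(parsed["HERMES_HOME"])  # /home/alice/.hermes
```

Keeping the parser free of subprocess calls makes it easy to unit-test without a systemd installation.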

Related Issue

Fixes #22035

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

All in hermes_cli/gateway.py:

  • New _read_systemd_unit_environment(system) helper — parses the unit's Environment= line via systemctl show.
  • New _sync_hermes_home_from_systemd_unit(system) helper — when system=True, mirrors the unit's HERMES_HOME into os.environ if missing or different. No-op for user-scope units (they already inherit the user's env).
  • Call sites added in three functions: systemd_restart (at the start, after _require_service_installed), systemd_status (after the unit-existence check), and systemd_stop (before the get_running_pid / write_planned_stop_marker block). These are the --system entrypoints that subsequently read PID or runtime-status files derived from HERMES_HOME.
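The sync rule described in the list above can be sketched as follows. This is an approximation for discussion, not the helper from the patch: `sync_hermes_home` is a hypothetical name, and it receives the parsed unit environment and the target environment as arguments so it can be exercised without systemd.

```python
from typing import Mapping, MutableMapping


def sync_hermes_home(
    unit_env: Mapping[str, str],
    process_env: MutableMapping[str, str],
    system: bool,
) -> bool:
    """Mirror the unit's HERMES_HOME into process_env; return True if changed.

    Illustrative only. User-scope units are left alone because they already
    inherit the caller's environment.
    """
    if not system:
        return False
    unit_home = unit_env.get("HERMES_HOME")
    if not unit_home:
        return False  # unit does not pin HERMES_HOME; nothing to adopt
    if process_env.get("HERMES_HOME") == unit_home:
        return False  # already in sync
    process_env["HERMES_HOME"] = unit_home
    return True


# Made-up values: the unit pins the service user's home, sudo's env lacks it.
unit_env = {"HERMES_HOME": "/home/alice/.hermes"}
cli_env = {"HOME": "/root"}
changed = sync_hermes_home(unit_env, cli_env, system=True)
print(changed, cli_env["HERMES_HOME"])  # True /home/alice/.hermes
```

In the CLI the second argument would presumably be `os.environ`, so that later path lookups pick up the adopted value.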

How to Test

Reproduction (from #22035):

  1. sudo hermes gateway install --system --run-as-user $USER
  2. sudo hermes gateway start --system
  3. Wait until ~/.hermes/gateway_state.json shows "gateway_state":"running".
  4. sudo hermes gateway restart --system

Before this patch: ~245s, two did not become active within 60s warnings, forced fallback SIGTERMs the new instance.

After this patch (verified on Ubuntu against this PR's branch on a system service running as a non-root user):

$ sudo hermes gateway restart
⏳ System service restarting gracefully (PID 654313)...
⏳ System service process started (PID 685819); waiting for gateway runtime...
✓ System service restarted (PID 685819)

A direct probe confirms the env sync produces the correct path resolution under sudo:

# Before sync: hermes_home = /root/.hermes; read_runtime_status() = None
# After sync:  hermes_home = /home/<user>/.hermes; gateway_state = "running"
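A probe along these lines can reproduce the check. It is a sketch, not the script actually used: `read_gateway_state` is a hypothetical helper that assumes `gateway_state.json` sits directly under the Hermes home and holds a JSON object with a `gateway_state` key, as the reproduction steps suggest.

```python
import json
from pathlib import Path
from typing import Optional


def read_gateway_state(hermes_home: Path) -> Optional[str]:
    """Return the 'gateway_state' value, or None if the file is unusable.

    Illustrative helper; the file layout is an assumption based on the
    reproduction steps in this PR.
    """
    state_file = hermes_home / "gateway_state.json"
    try:
        data = json.loads(state_file.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # missing, unreadable, or not valid JSON
    if not isinstance(data, dict):
        return None
    value = data.get("gateway_state")
    return value if isinstance(value, str) else None
```

Pointing it at `/root/.hermes` versus the service user's `.hermes` directory should show the `None` versus `"running"` difference above, provided the gateway is up.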

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(gateway): …)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass — not run by the author; reviewers please verify
  • I've added tests for my changes — no new tests added; the fix is small and the failure mode requires a privileged systemd-level setup that's awkward to fixture. Happy to add a unit test for _read_systemd_unit_environment parsing if reviewers want one.
  • I've tested on my platform: Ubuntu 24.04 (system-scope service, non-root run-as user)

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (no user-visible config/docs change)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact — N/A (the bug and fix are systemd-specific; helpers are guarded by system=True and only run when an existing systemd code path executes; macOS/Windows paths unaffected)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Under sudo, HERMES_HOME is stripped and HOME=/root, so get_hermes_home()
falls back to /root/.hermes. The wait loop in
_wait_for_systemd_service_restart reads gateway_state.json from that
wrong path, never observes gateway_state == "running", and times out at
60s — even though the gateway is healthy on the very first poll. The
forced fallback then SIGTERMs the in-progress new instance, producing
the ~245s flap reported in NousResearch#22035.

The installed unit already pins Environment="HERMES_HOME=…", so we
recover the correct path from the unit definition before any status
read. Apply the sync in systemd_restart, systemd_status, and
systemd_stop — these are the system-scope entrypoints that read PID /
runtime-status files derived from HERMES_HOME.

Fixes NousResearch#22035

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

teknium1 commented May 9, 2026

Contributor

Merged via salvage PR #22803. The salvage applied your patch and re-authored it to your noreply email, since the original commit used a placeholder identity (Test User, test@example.com). Your authorship is recorded in git log on main. Thanks for the contribution!

teknium1 closed this May 9, 2026
alt-glitch added labels on May 9, 2026: type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), comp/cli (CLI entry point, hermes_cli/, setup wizard), P2 (Medium: degraded but workaround exists)

Development

Successfully merging this pull request may close these issues.

[Bug]: gateway restart --system always reports failure (60s timeout × 2) — wrapper reads runtime status from root's HERMES_HOME
