fix(gateway): adopt unit's HERMES_HOME for --system CLI ops #22533
Closed
mbac wants to merge 1 commit into
Under sudo, HERMES_HOME is stripped and HOME=/root, so get_hermes_home() falls back to /root/.hermes. The wait loop in _wait_for_systemd_service_restart reads gateway_state.json from that wrong path, never observes gateway_state == "running", and times out at 60s, even though the gateway is healthy on the very first poll. The forced fallback then SIGTERMs the in-progress new instance, producing the ~245s flap reported in NousResearch#22035.
The installed unit already pins Environment="HERMES_HOME=…", so we recover the correct path from the unit definition before any status read. Apply the sync in systemd_restart, systemd_status, and systemd_stop: these are the system-scope entrypoints that read PID / runtime-status files derived from HERMES_HOME.
Fixes NousResearch#22035
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Merged via salvage PR #22803. The salvage applied your patch and re-authored it to your noreply email, since the original commit used a Test User test@example.com placeholder. Your authorship is recorded in git log on main. Thanks for the contribution!
What does this PR do?
Fixes the bug reported in #22035:
sudo hermes gateway restart [--system] always reports "did not become active within 60s" (twice, ~245s total) even though the gateway restarts successfully.
Root cause: under sudo, HERMES_HOME is stripped and HOME=/root, so get_hermes_home() falls back to /root/.hermes. The wait loop in _wait_for_systemd_service_restart reads gateway_state.json from the wrong path, never observes gateway_state == "running", and hits the 60s timeout. The forced fallback then SIGTERMs the in-progress new instance.
The installed unit already pins Environment="HERMES_HOME=…", so we recover the correct path from the unit definition (systemctl show -p Environment) and mirror it into os.environ before any status read. This implements Option 1 from the issue's "Proposed Fix" section.
Related Issue
Fixes #22035
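As a rough illustration of the recovery step described above, here is a minimal sketch of reading a unit's Environment property and turning it into a dict. It is not the code from this PR: the function names and the unit name hermes-gateway.service are hypothetical, and it assumes that systemctl show prints one Environment= line of space-separated KEY=VALUE assignments, with shell-style quoting around values that contain spaces.

```python
import shlex
import subprocess


def parse_environment_property(show_output: str) -> dict[str, str]:
    """Parse the text printed by `systemctl show -p Environment <unit>`.

    Assumption: the property appears as a single line such as
    Environment=HERMES_HOME=/home/alice/.hermes "NOTE=a b"
    """
    env: dict[str, str] = {}
    prefix = "Environment="
    for line in show_output.splitlines():
        if not line.startswith(prefix):
            continue
        try:
            # shlex strips the quotes around assignments that contain spaces.
            tokens = shlex.split(line[len(prefix):])
        except ValueError:
            # Unbalanced quotes: skip the line rather than guess.
            continue
        for token in tokens:
            key, sep, value = token.partition("=")
            if sep and key:
                env[key] = value
    return env


def read_unit_environment(unit: str, system: bool) -> dict[str, str]:
    """Hypothetical wrapper: query systemd and return the unit's environment."""
    cmd = ["systemctl"]
    if not system:
        cmd.append("--user")
    cmd += ["show", "-p", "Environment", unit]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=False
        )
    except (OSError, subprocess.SubprocessError):
        return {}  # systemctl missing or unresponsive: report nothing
    if result.returncode != 0:
        return {}
    return parse_environment_property(result.stdout)


if __name__ == "__main__":
    # "hermes-gateway.service" is a placeholder unit name for illustration.
    print(read_unit_environment("hermes-gateway.service", system=True))
```

Keeping the parsing separate from the subprocess call means the parser can be unit-tested with canned strings on machines that have no systemd.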
Type of Change
Changes Made
All in hermes_cli/gateway.py:
- _read_systemd_unit_environment(system) helper: parses the unit's Environment= line via systemctl show.
- _sync_hermes_home_from_systemd_unit(system) helper: when system=True, mirrors the unit's HERMES_HOME into os.environ if missing or different. No-op for user-scope units (they already inherit the user's env).
- The sync is applied in systemd_restart (after _require_service_installed), systemd_status (after the unit-existence check), and systemd_stop (before the get_running_pid / write_planned_stop_marker block). These are the --system entrypoints that subsequently read PID or runtime-status files derived from HERMES_HOME.
How to Test
Reproduction (from #22035):
1. sudo hermes gateway install --system --run-as-user $USER
2. sudo hermes gateway start --system
3. ~/.hermes/gateway_state.json shows "gateway_state":"running".
4. sudo hermes gateway restart --system
Before this patch: ~245s, two "did not become active within 60s" warnings, forced fallback SIGTERMs the new instance.
After this patch (verified on Ubuntu against this PR's branch on a system service running as a non-root user):
A direct probe confirms the env sync produces the correct path resolution under sudo:
Checklist
Code
- … fix(gateway): …)
- … pytest tests/ -q and all tests pass: not run by the author; reviewers please verify
- … _read_systemd_unit_environment parsing if reviewers want one.
Documentation & Housekeeping
- … cli-config.yaml.example if I added/changed config keys: N/A
- … CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows: N/A
- … system=True and only run when an existing systemd code path executes; macOS/Windows paths unaffected)
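For reviewers who want to reason about the sync in isolation, the behavior described under Changes Made can be sketched as follows. This is an illustration under stated assumptions, not the merged code: resolve_hermes_home and sync_hermes_home are hypothetical names, and the fallback rule (HERMES_HOME if set, otherwise $HOME/.hermes) is inferred from the root-cause description rather than copied from get_hermes_home().

```python
from pathlib import PurePosixPath
from typing import Mapping, MutableMapping


def resolve_hermes_home(environ: Mapping[str, str]) -> PurePosixPath:
    """Assumed resolution rule: explicit HERMES_HOME wins, else $HOME/.hermes."""
    explicit = environ.get("HERMES_HOME")
    if explicit:
        return PurePosixPath(explicit)
    return PurePosixPath(environ.get("HOME", "/")) / ".hermes"


def sync_hermes_home(
    unit_env: Mapping[str, str],
    environ: MutableMapping[str, str],
    system: bool,
) -> bool:
    """Mirror the unit's HERMES_HOME into environ; return True if it changed.

    No-op for user-scope units, when the unit pins no HERMES_HOME,
    or when the caller's value already matches the unit's.
    """
    if not system:
        return False
    wanted = unit_env.get("HERMES_HOME")
    if not wanted or environ.get("HERMES_HOME") == wanted:
        return False
    environ["HERMES_HOME"] = wanted
    return True


if __name__ == "__main__":
    # Simulated sudo environment: HERMES_HOME stripped, HOME pointing at root.
    sudo_env = {"HOME": "/root"}
    # Simulated unit environment; /home/alice is a made-up example path.
    unit_env = {"HERMES_HOME": "/home/alice/.hermes"}

    print(resolve_hermes_home(sudo_env))  # /root/.hermes
    sync_hermes_home(unit_env, sudo_env, system=True)
    print(resolve_hermes_home(sudo_env))  # /home/alice/.hermes
```

Passing the environment in as a mapping, instead of touching os.environ directly, keeps the sketch testable without sudo; a real caller would pass os.environ.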