fix(gateway): anchor service WorkingDirectory at HERMES_HOME, not the source checkout #34805

Merged: teknium1 merged 2 commits into main from hermes/hermes-432d89c0 on May 29, 2026

@teknium1

Contributor

Summary

Gateway services no longer crash-loop forever when their source checkout moves or is deleted.

The systemd unit (and launchd plist) pinned WorkingDirectory to PROJECT_ROOT — the checkout the unit was generated from. When that checkout is transient (a git worktree, or a clone that hermes update later relocates/removes), the path rots. systemd then fails the start at the CHDIR step (status=200/CHDIR, "Changing to the requested working directory failed") before Python loads, so the on-boot refresh_systemd_unit_if_needed() self-heal never runs and Restart=always crash-loops forever on a dead directory.
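The failure mode can be detected from outside the service, since Python never gets to run. The sketch below is illustrative only and not part of this PR: `stale_working_directory` is a hypothetical helper that reads unit-file text and reports a `WorkingDirectory=` value pointing at a directory that no longer exists. It assumes a plain absolute path and does not handle systemd's `-` prefix or `~` shorthand.

```python
import re
from pathlib import Path
from typing import Optional


def stale_working_directory(unit_text: str) -> Optional[str]:
    """Return the WorkingDirectory path if it is missing on disk, else None.

    Hypothetical diagnostic helper; assumes a plain absolute path value.
    """
    match = re.search(r"^WorkingDirectory=(.+)$", unit_text, flags=re.MULTILINE)
    if match is None:
        # No WorkingDirectory directive, so there is nothing that can rot.
        return None
    path = match.group(1).strip()
    # systemd would fail with status=200/CHDIR if this directory is gone.
    return None if Path(path).is_dir() else path
```

Running this over an installed unit before a restart would flag the dead-checkout case without waiting for the crash loop.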

Observed in the wild: a gateway that crash-looped 153 times overnight, Discord/Telegram/WhatsApp all offline, recovering only when a human ran hermes gateway restart (which regenerated the unit as a side effect).

Root cause

WorkingDirectory is vestigial for the gateway — ExecStart uses an absolute python interpreter + -m hermes_cli.main, so module resolution never depended on cwd. Pinning it to the volatile checkout was pure downside.

Changes

  • hermes_cli/gateway.py: new _stable_service_working_dir() → anchors WorkingDirectory at HERMES_HOME (stable, always exists), falling back to PROJECT_ROOT only if HERMES_HOME can't be resolved.
    • Applied to the user systemd unit, the system (root) unit (uses the target user's resolved HERMES_HOME), and the launchd plist (macOS had the identical latent bug).
  • tests/hermes_cli/test_gateway_service.py: TestServiceWorkingDirIsStable — HERMES_HOME anchor, PROJECT_ROOT fallback, user-unit asserts no /.worktrees/ path, launchd parity.
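A minimal sketch of the anchoring logic described above, under stated assumptions. This is not the code from `hermes_cli/gateway.py`: the function names, the `env` parameter, and the resolution order (an explicit `HERMES_HOME` variable, then `~/.hermes` under `HOME`) are all hypothetical, chosen only to illustrate "prefer HERMES_HOME, fall back to the checkout".

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_hermes_home(env: Mapping[str, str]) -> Optional[Path]:
    """Assumed resolution order: $HERMES_HOME if set, else $HOME/.hermes."""
    explicit = env.get("HERMES_HOME")
    if explicit:
        return Path(explicit)
    home = env.get("HOME")
    return Path(home) / ".hermes" if home else None


def stable_service_working_dir(
    project_root: Path, env: Mapping[str, str] = os.environ
) -> Path:
    """Pick the cwd to write into the service definition.

    Prefers HERMES_HOME; uses the checkout only when HERMES_HOME
    cannot be resolved at all.
    """
    hermes_home = resolve_hermes_home(env)
    return hermes_home if hermes_home is not None else project_root
```

For the system (root) unit, the same idea would apply with the target user's environment passed in rather than root's, so the unit does not pick up root's home directory.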

Why this prevents recurrence

  • The CHDIR death class (status=200) becomes impossible — cwd is always a real directory.
  • Existing broken installs self-heal: an old unit with a dead-checkout cwd now differs from the generated one, so refresh_systemd_unit_if_needed() rewrites it on the next start/restart/update (the on-boot refresh already runs there).
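The self-heal relies on a compare-and-rewrite step. The sketch below shows that idea in isolation; it is not the actual `refresh_systemd_unit_if_needed()`, and the name `refresh_unit_if_needed` and its signature are hypothetical. A real refresh would also need to ask systemd to reload its unit files after the write, which is omitted here.

```python
from pathlib import Path


def refresh_unit_if_needed(unit_path: Path, generated_text: str) -> bool:
    """Rewrite the installed unit when it differs from the generated one.

    Returns True if the file was rewritten. Hypothetical helper: reloading
    the service manager afterwards is left to the caller.
    """
    current = unit_path.read_text(encoding="utf-8") if unit_path.exists() else None
    if current == generated_text:
        return False  # Already up to date, nothing to heal.
    # An old unit pinned to a dead checkout differs from the new text,
    # so it lands here and is replaced.
    unit_path.write_text(generated_text, encoding="utf-8")
    return True
```

Because the generated text now carries a HERMES_HOME cwd, any unit still pointing at a checkout compares unequal and is rewritten on the first run that reaches this step.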

No speculative ExecStartPre hook: with cwd stable, the only remaining failure is a deleted venv (status=203/EXEC), which no unit rewrite can fix — it needs a reinstall.
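The distinction between the two start failures can be written down as a small lookup. This is only an illustration of the reasoning above: the table and the `remedy_for` helper are hypothetical and do not exist in the codebase.

```python
# Exit statuses systemd reports when it fails before the service process runs.
# 200 (CHDIR) and 203 (EXEC) are the two cases discussed in this PR.
START_FAILURE_REMEDIES = {
    200: "rewrite unit: WorkingDirectory is missing",
    203: "reinstall: the ExecStart interpreter is missing",
}


def remedy_for(status: int) -> str:
    """Map a start-failure status to the remedy described in the PR."""
    return START_FAILURE_REMEDIES.get(status, "unknown: inspect the journal")
```

Only the first remedy is something a unit rewrite can deliver, which is why no pre-start hook is added for the second.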

Validation

| | Before | After |
|---|---|---|
| WorkingDirectory (user unit) | `<checkout>` (can vanish) | HERMES_HOME (stable) |
| Checkout deleted + auto-restart | crash-loop on CHDIR, bot offline | unit heals on next start/restart |
| launchd (macOS) | same rot risk | anchored at HERMES_HOME |
| Tests | | 12 passed (4 new + 8 existing refresh, no regression) |

teknium1 added 2 commits May 29, 2026 12:17
… source checkout

The systemd unit (and launchd plist) pinned WorkingDirectory to PROJECT_ROOT
(the checkout the unit was generated from). When that checkout is transient —
a git worktree, or a clone hermes update later relocates/removes — the path
rots. systemd then fails the start at the CHDIR step (status=200/CHDIR) BEFORE
Python loads, so the on-boot refresh_systemd_unit_if_needed() self-heal never
runs and Restart=always crash-loops forever on a dead directory. Observed in
the wild: a gateway that crash-looped 153 times overnight, bot offline until a
manual 'hermes gateway restart' regenerated the unit.

Anchor cwd at HERMES_HOME instead — it never moves, always exists, and the
gateway never needed cwd to be the checkout (ExecStart uses an absolute python
+ -m hermes_cli.main). Existing broken units now differ from the generated unit
and self-heal on the next start/restart/update.

test_system_unit_has_no_root_paths asserted the system unit's
WorkingDirectory was the remapped *checkout* path
(/home/alice/.hermes/hermes-agent). That is the brittle pin this PR
fixes — the system unit now anchors cwd at the target user's HERMES_HOME
(/home/alice/.hermes). The test's intent (no root-home leak, target-user
paths present) is unchanged and still holds.
@alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/gateway (Gateway runner, session dispatch, delivery), and comp/cli (CLI entry point, hermes_cli/, setup wizard) on May 29, 2026
@teknium1 merged commit 38c4f8c into main on May 29, 2026
23 checks passed
@teknium1 deleted the hermes/hermes-432d89c0 branch on May 29, 2026 19:37