[Bug]: gateway should recover unloaded launchd jobs and report stale service state #1613

@PeterFile

Summary

A few service-management bugs remain in hermes gateway after #1567.

They all show up when a local service definition exists but the service manager state is stale or broken:

  1. install only checks whether the plist/unit file exists, so a stale service definition is skipped instead of repaired.
  2. launchd start does not recover from launchctl start ai.hermes.gateway returning exit status 3 when the job is unloaded.
  3. gateway restart swallows launchd/systemd restart failures and falls back to foreground run_gateway(), which can make the service look recovered when the background service is still broken.
  4. launchd status only reports whether the job is loaded; it does not surface that a local plist exists and is stale/out-of-sync with the current install.

Environment

  • OS: macOS 15.6 / Darwin 24.6.0 x86_64
  • Python: 3.11.14
  • Hermes: Hermes Agent v0.2.0 (2026.3.12)
  • Repo state: main at cfa87e77

Reproduction

1. Stale service definition is skipped by install

  1. Install the gateway service.
  2. Move or rename the repo, or otherwise make the generated WorkingDirectory / ProgramArguments differ from the installed plist.
  3. Run hermes gateway install again without --force.

Observed:

  • The command exits with "Service already installed" and leaves the stale plist/unit untouched.

Expected:

  • If the local service definition exists but no longer matches the current install, install should repair it automatically.
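One way to detect the stale case is to compare the plist on disk with the definition the current install would generate. The sketch below is illustrative only: `definition_is_stale` and the `expected` dict are hypothetical names, not Hermes' actual code, and it assumes the generated definition is available as a plain dict.

```python
import plistlib
from pathlib import Path


def definition_is_stale(installed_path: Path, expected: dict) -> bool:
    """Return True if a plist exists on disk but differs from `expected`.

    `expected` stands in for whatever the current install would generate
    (hypothetical; the real generator is not shown in this issue).
    """
    if not installed_path.exists():
        # Nothing installed, so there is nothing to be stale.
        return False
    try:
        with installed_path.open("rb") as fh:
            installed = plistlib.load(fh)
    except Exception:
        # An unreadable or corrupt plist is treated as stale so it gets repaired.
        return True
    return installed != expected
```

With a check like this, install could rewrite and reload the definition when it is stale and print "Service already installed" only when the file matches.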

2. launchd start cannot self-heal an unloaded job

  1. Have ~/Library/LaunchAgents/ai.hermes.gateway.plist present.
  2. Ensure the launchd job is not loaded (for example, launchctl unload ~/Library/LaunchAgents/ai.hermes.gateway.plist).
  3. Run hermes gateway start.

Observed:

  • launchctl start ai.hermes.gateway returns exit status 3 and the CLI does not retry with launchctl load.

Expected:

  • If the plist exists locally and start fails because the job is unloaded, Hermes should load the plist and retry once.
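The expected behavior could look roughly like the following. This is a minimal sketch, not the patch itself: `start_with_recovery` and the injectable `run` callable are hypothetical, and exit status 3 is taken from the observation above rather than from launchctl documentation.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence

LABEL = "ai.hermes.gateway"
PLIST = Path.home() / "Library" / "LaunchAgents" / f"{LABEL}.plist"
UNLOADED_EXIT = 3  # exit status observed in this issue for an unloaded job


def _run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status without raising."""
    return subprocess.run(list(cmd), check=False).returncode


def start_with_recovery(
    run: Callable[[Sequence[str]], int] = _run,
    plist: Path = PLIST,
) -> int:
    """Start the job; if it is unloaded and the plist exists, load and retry once."""
    rc = run(["launchctl", "start", LABEL])
    if rc == UNLOADED_EXIT and plist.exists():
        load_rc = run(["launchctl", "load", str(plist)])
        if load_rc != 0:
            # Loading failed, so report that failure instead of retrying.
            return load_rc
        rc = run(["launchctl", "start", LABEL])
    return rc
```

Passing the runner in as a parameter keeps the retry logic testable without touching the real launchd state.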

3. gateway restart masks broken service state

  1. Have a gateway plist/unit file present.
  2. Put the service manager into a broken/unloaded state so service restart fails.
  3. Run hermes gateway restart.

Observed:

  • The restart failure is swallowed.
  • Hermes falls back to run_gateway(verbose=False) in the foreground.
  • This makes the gateway appear recovered, while the managed background service is still broken.

Expected:

  • If a service definition exists but service restart fails, the command should report that failure clearly and exit non-zero instead of silently switching to a foreground process.
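A sketch of the expected control flow, under the assumption that the foreground fallback is kept only for the case where no service definition exists. `restart_or_fail` and its callable parameters are hypothetical stand-ins for the real restart and `run_gateway()` paths.

```python
import sys
from typing import Callable


def restart_or_fail(
    definition_exists: bool,
    restart_service: Callable[[], int],
    run_foreground: Callable[[], None],
) -> int:
    """Return a process exit code for `gateway restart`.

    If a service definition exists, a failed service restart is reported
    and returned as non-zero; the foreground path is never used to mask it.
    """
    if definition_exists:
        rc = restart_service()
        if rc != 0:
            print(
                f"gateway service restart failed (exit status {rc}); "
                "not falling back to a foreground process",
                file=sys.stderr,
            )
            return 1
        return 0
    # No managed service is installed, so running in the foreground hides nothing.
    run_foreground()
    return 0
```

The caller would pass the result to `sys.exit()` so that scripts and supervisors can see the failure.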

4. launchd status lacks local/stale plist diagnostics

  1. Leave a stale or outdated ~/Library/LaunchAgents/ai.hermes.gateway.plist on disk.
  2. Ensure the job is not loaded.
  3. Run hermes gateway status.

Observed:

  • Output only says the service is not loaded.
  • It does not mention the local plist path, whether the plist is stale, or that a repair/start command would reload it.

Expected:

  • status should show the local plist path and whether it matches the current generated service definition, so the user can distinguish "not installed" from "installed but stale/unloaded".
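As an illustration of the richer output, the sketch below builds status lines from three inputs: the plist path, whether the job is loaded, and whether the plist matches the current generated definition. The function name and the exact wording of the messages are hypothetical.

```python
from pathlib import Path


def describe_launchd_status(plist: Path, loaded: bool, matches_current: bool) -> list[str]:
    """Build status lines that separate "not installed" from "installed but stale/unloaded"."""
    if not plist.exists():
        return ["Service: not installed (no plist found)"]
    lines = [
        f"Service: {'loaded' if loaded else 'not loaded'}",
        f"Plist: {plist}",
        "Definition: "
        + ("up to date" if matches_current else "stale (differs from current install)"),
    ]
    if not loaded or not matches_current:
        # Point the user at a repair path instead of stopping at "not loaded".
        lines.append("Hint: reinstall or start the service to reload the plist")
    return lines
```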

Why this matters

These are all bug-fix / robustness issues in the service-management path:

  • They mostly affect macOS (launchd), which is one of Hermes' supported platforms; items 1 and 3 also touch the systemd path.
  • They make service recovery brittle after repo moves or failed loads.
  • They hide true background service failures behind a foreground fallback.
  • They make diagnosis harder than it needs to be.

I have a patch ready that adds targeted recovery + tests for these cases and will open a PR linked to this issue.
