macOS: launchd_restart() returns early after SIGUSR1, leaving gateway permanently dead #11932

@Ding-tech777

Summary

On macOS, when hermes update is triggered from within the gateway process tree (e.g., by an agent running it through its terminal tool), launchd_restart() sends SIGUSR1 and returns immediately, without waiting for the gateway to exit or issuing launchctl kickstart. The gateway exits with code 75, but launchd does not restart it, so the service stays dead until someone restarts it manually.

Root Cause

In hermes_cli/gateway.py, launchd_restart() has two code paths:

Path A (SIGUSR1): Triggered when the gateway PID is an ancestor of the current process. Sends SIGUSR1, prints "Service restart requested", then returns immediately — no wait for exit, no kickstart.

Path B (SIGTERM + kickstart): Triggered when the gateway PID is NOT an ancestor. Sends SIGTERM, waits for exit, then runs launchctl kickstart -k.

When Path A is taken, the gateway receives SIGUSR1 and begins a graceful shutdown (drain + exit code 75). However, since launchd_restart() already returned, nobody is responsible for restarting the service. macOS launchd does not automatically restart after exit(75) in this configuration — system logs show "pending spawn, domain in on-demand-only mode" with no follow-up WILL_SPAWN.
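The control flow described above can be modelled in a few lines. This is a reconstruction from the description in this issue, not the actual source of launchd_restart(); the function name and the injected callables (send_signal, wait_for_exit, kickstart) are hypothetical and exist only to make the two paths visible.

```python
import signal
from typing import Callable


def launchd_restart_as_described(
    gateway_is_ancestor: bool,
    send_signal: Callable[[int], None],
    wait_for_exit: Callable[[], None],
    kickstart: Callable[[], None],
) -> None:
    """Model of the two paths described in this issue (not the real code)."""
    if gateway_is_ancestor:
        # Path A: ask the gateway to drain and exit, then return at once.
        send_signal(signal.SIGUSR1)
        print("Service restart requested")
        return  # nothing waits for the exit, nothing calls kickstart

    # Path B: stop the gateway, wait for it to exit, then restart it.
    send_signal(signal.SIGTERM)
    wait_for_exit()
    kickstart()
```

With gateway_is_ancestor=True the only recorded action is the SIGUSR1, which matches the observed behaviour: the gateway exits and nothing starts it again.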

Reproduction

  1. Have Hermes gateway running on macOS with launchd.
  2. From a Telegram conversation, ask the agent to run hermes update directly via its terminal tool (NOT using the /update slash command).
  3. The agent process is a child of the gateway, so _is_pid_ancestor_of_current_process() returns True → Path A is taken.
  4. Gateway exits with code 75 → launchd does not restart → service stays dead.
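Step 3 hinges on an ancestor check. The sketch below shows one way such a check could work: walk the chain of parent PIDs and look for the gateway PID. It is an illustration only; the real _is_pid_ancestor_of_current_process() is not shown in this issue, and is_pid_ancestor and get_parent are hypothetical names.

```python
from typing import Callable, Optional


def is_pid_ancestor(
    candidate_pid: int,
    start_pid: int,
    get_parent: Callable[[int], Optional[int]],
    max_depth: int = 64,
) -> bool:
    """Return True if candidate_pid appears among start_pid's ancestors."""
    pid = get_parent(start_pid)
    for _ in range(max_depth):  # bound the walk in case of bad data
        if pid is None or pid <= 0:
            return False
        if pid == candidate_pid:
            return True
        if pid == 1:
            return False  # reached the root of the process tree
        pid = get_parent(pid)
    return False


# Illustrative tree: gateway (100) -> agent shell (200) -> hermes update (300)
parents = {300: 200, 200: 100, 100: 1}
print(is_pid_ancestor(100, 300, parents.get))  # gateway is an ancestor
```

On a real system, get_parent could be backed by something like `ps -o ppid= -p <pid>`; that choice is an assumption, not taken from the Hermes code.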

Note: The normal /update command avoids this by spawning hermes update --gateway via setsid + start_new_session=True, which detaches from the gateway process tree and takes Path B. This bug only manifests when the update command runs inside the gateway process tree.
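For reference, a detached spawn of the kind described in the note might look roughly like this. It is a minimal sketch that assumes the standard library's subprocess.Popen with start_new_session=True; spawn_detached is a hypothetical helper, and the real /update implementation may differ.

```python
import subprocess
import sys
from typing import Sequence


def spawn_detached(argv: Sequence[str]) -> subprocess.Popen:
    """Start argv in its own session, with no inherited stdio."""
    return subprocess.Popen(
        list(argv),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child calls setsid() before exec (POSIX)
    )


if __name__ == "__main__":
    # Demo: the child exits 0 only if it is the leader of a new session.
    check = "import os, sys; sys.exit(0 if os.getsid(0) == os.getpid() else 1)"
    print(spawn_detached([sys.executable, "-c", check]).wait())
```

This sketch only demonstrates the session detachment. Whether the update process also ends up outside the gateway's parent chain depends on details of the real spawn code that this issue does not show.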

Contrast with Linux

PR #9850 (merged) fixed a similar issue for Linux by adding systemctl is-active health checks and retry logic after systemctl restart. The macOS launchd path was completely omitted from that fix.

Evidence

macOS system logs consistently show exit(75) followed by no restart:

  • "exited due to exit(75)"
  • "pending spawn, domain in on-demand-only mode: ai.hermes.gateway"
  • No WILL_SPAWN entry follows

In contrast, when the gateway is killed by an external signal (SIGTERM/SIGKILL from outside the process tree), launchd immediately issues WILL_SPAWN and the service recovers within seconds.
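To check a captured log for this pattern, a small scan along the following lines may help. It relies only on the message fragments quoted above; the function name is hypothetical, and the sample lines are made up for illustration because the full log format is not given in this issue.

```python
from typing import Iterable, List

EXIT_MARKER = "exited due to exit(75)"
SPAWN_MARKER = "WILL_SPAWN"


def find_unrecovered_exits(lines: Iterable[str]) -> List[int]:
    """Return indexes of exit(75) lines that no later WILL_SPAWN follows."""
    pending: List[int] = []
    for index, line in enumerate(lines):
        if SPAWN_MARKER in line:
            pending.clear()  # a later spawn means earlier exits recovered
        elif EXIT_MARKER in line:
            pending.append(index)
    return pending


# Illustrative lines only; real launchd log lines carry more fields.
sample = [
    "ai.hermes.gateway: exited due to exit(75)",
    "pending spawn, domain in on-demand-only mode: ai.hermes.gateway",
]
print(find_unrecovered_exits(sample))
```

An empty result means every exit(75) in the capture was followed by a spawn; any returned index points at an exit that was never recovered.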

Suggested Fix

Remove the early return in Path A and let both paths converge on _wait_for_gateway_exit() + launchctl kickstart -k. This ensures the gateway is always restarted regardless of how the update was triggered.
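A rough sketch of the converged flow follows. It is not a patch against hermes_cli/gateway.py: restart_converged, kickstart_command, and the injected callables are hypothetical, the polling loop stands in for _wait_for_gateway_exit(), and the gui/<uid>/<label> service target is an assumption about how the service is loaded (the label ai.hermes.gateway is taken from the logs above).

```python
import signal
import time
from typing import Callable, List


def kickstart_command(label: str, uid: int) -> List[str]:
    # Assumes a per-user GUI domain; other setups use a different target.
    return ["launchctl", "kickstart", "-k", f"gui/{uid}/{label}"]


def restart_converged(
    gateway_pid: int,
    gateway_is_ancestor: bool,
    send_signal: Callable[[int, int], None],
    is_alive: Callable[[int], bool],
    run: Callable[[List[str]], None],
    label: str,
    uid: int,
    timeout: float = 30.0,
    poll: float = 0.2,
) -> None:
    """Signal the gateway, wait for it to exit, then always kickstart."""
    # The two paths still differ in the signal they send ...
    sig = signal.SIGUSR1 if gateway_is_ancestor else signal.SIGTERM
    send_signal(gateway_pid, sig)

    # ... but both now wait for the old process to exit (bounded by timeout)
    deadline = time.monotonic() + timeout
    while is_alive(gateway_pid) and time.monotonic() < deadline:
        time.sleep(poll)

    # ... and both finish by asking launchd to start the service again.
    run(kickstart_command(label, uid))
```

In real use, send_signal could be os.kill and run could wrap subprocess.run. One open question for Path A: the caller lives inside the gateway's process tree, so it has to survive the gateway's shutdown long enough to issue the kickstart. This issue does not say whether it does.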

Labels

  • P1: High (major feature broken, no workaround)
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • comp/gateway: Gateway runner, session dispatch, delivery
  • type/bug: Something isn't working
