[Bug]: gateway exits code 1 (→ unit 'failed') on systemctl stop; planned stops should exit 0 #41631

@aleck31

Summary

A gateway started under systemd (<name> gateway install, Restart=always) exits with code 1 on a plain systemctl --user stop, leaving the unit in the failed state. A planned operator stop should exit 0 and leave the unit inactive. The non-zero exit pollutes systemctl is-active/is-failed, requires systemctl reset-failed before a clean start, and misleads any health monitoring that reads unit state.

Reproduce

<name> gateway install          # creates hermes-gateway-<name>.service (Restart=always)
<name> gateway start
systemctl --user stop hermes-gateway-<name>
systemctl --user is-active hermes-gateway-<name>   # → "failed"  (expected: "inactive")
systemctl --user status hermes-gateway-<name>      # Main process exited, code=exited, status=1/FAILURE; Result: exit-code

Journal on stop:

Stopping hermes-gateway-<name>.service...
WARNING gateway.run: Shutdown context: signal=SIGTERM under_systemd=yes parent_name=systemd ...
INFO gateway.run: Exiting with code 1 (signal-initiated shutdown without restart request) so systemd Restart=on-failure can revive the gateway.
hermes-gateway-<name>.service: Main process exited, code=exited, status=1/FAILURE
hermes-gateway-<name>.service: Failed with result 'exit-code'.

Root cause

gateway/run.py, end of the gateway-run coroutine:

if _signal_initiated_shutdown and not runner._restart_requested:
    logger.info("Exiting with code 1 (signal-initiated shutdown without restart "
                "request) so systemd Restart=on-failure can revive the gateway.")
    return False  # → sys.exit(1)

Any SIGTERM that isn't a /restart, /update, or CLI gateway stop (which use the planned-stop marker) lands here and exits 1. That includes systemctl stop, which is a deliberate, planned operator stop and should be a clean exit.

Two issues compound it:

  1. The exit-1 rationale is self-defeating under the unit Hermes itself generates. The log message says exit-1 is "so systemd Restart=on-failure can revive" the gateway, but the installed unit uses Restart=always (see hermes_cli/gateway.py), under which a process that exits 0 is restarted as well. So exiting non-zero buys nothing for revival; it only converts a clean stop into a failed unit.

  2. systemctl stop isn't distinguished from an unexpected external kill. Both arrive as SIGTERM. But when systemd is stopping the unit it will not restart it regardless of exit code, so there's no need to exit non-zero; the non-zero exit just leaves a spurious failed. Hermes already detects INVOCATION_ID (knows it's under systemd) and has a planned-stop marker mechanism — a systemd-initiated stop could be treated as planned → exit 0.
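One way the gateway could tell a systemd stop job from an external SIGTERM is to ask systemd for the unit's ActiveState while shutting down: during a stop job the unit is reported as "deactivating", whereas after an external kill it is still "active" until the process exits. The following is a minimal Python sketch under that assumption. unit_is_stopping and its parameters are hypothetical names, not existing Hermes code, and calling systemctl from the shutdown path is itself an assumption that may not suit every deployment.

```python
import os
import subprocess


def unit_is_stopping(unit, user=True, run=subprocess.run, environ=None):
    """Return True if systemd reports `unit` as 'deactivating'.

    Hypothetical helper: `run` and `environ` are injectable so the
    check can be exercised without a real systemd.
    """
    env = os.environ if environ is None else environ
    # INVOCATION_ID is set by systemd for processes it starts.
    if not env.get("INVOCATION_ID"):
        return False

    cmd = ["systemctl"]
    if user:
        cmd.append("--user")
    cmd += ["show", "--property=ActiveState", "--value", unit]

    try:
        result = run(cmd, capture_output=True, text=True, timeout=2)
    except (OSError, subprocess.SubprocessError):
        # If systemd cannot be queried, fall back to "not a stop job",
        # which keeps the current non-zero exit behaviour.
        return False

    return result.returncode == 0 and result.stdout.strip() == "deactivating"
```

Any failure to query systemd resolves to False, so the worst case is today's behaviour (exit 1) rather than a crash being mistaken for a clean stop.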

Expected

systemctl stop (and any systemd-initiated stop of the unit) → exit 0 → unit inactive, not failed. Reserve the non-zero exit for the genuine "process got SIGTERM but the service manager is NOT stopping the unit" case (e.g. an external kill, OOM, container signal) where a restart is actually wanted.
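The expected behaviour amounts to adding one term to the existing condition in gateway/run.py. A hedged sketch of that decision as a pure function: should_exit_nonzero and the systemd_stopping flag are hypothetical names, and how the flag gets computed is left open here.

```python
def should_exit_nonzero(signal_initiated: bool,
                        restart_requested: bool,
                        systemd_stopping: bool) -> bool:
    """Mirror the existing check, plus a hypothetical systemd-stop term.

    Existing condition:  signal_initiated and not restart_requested
    Proposed addition:   ... and not systemd_stopping
    """
    return signal_initiated and not restart_requested and not systemd_stopping


if __name__ == "__main__":
    cases = {
        "systemctl stop": (True, False, True),
        "external kill": (True, False, False),
        "restart requested": (True, True, False),
    }
    for label, args in cases.items():
        code = 1 if should_exit_nonzero(*args) else 0
        print(f"{label}: exit {code}")
```

Only the external-kill case keeps the non-zero exit; the other two fall through to a clean exit.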

(Note: a unit-file SuccessExitStatus=1 is not the fix — it would mask genuine exit-1 crashes as success, defeating failure detection. The distinction needs to be made in the gateway based on whether the SIGTERM is a systemd stop job vs. an external signal.)

Impact

  • failed state requires systemctl reset-failed before a clean restart, breaking simple stop→edit→start operator workflows.
  • Any monitoring keying on is-active/is-failed to detect crashes can't distinguish a deliberate stop from a real failure.


Labels: P2 (Medium: degraded but workaround exists); comp/gateway (Gateway runner, session dispatch, delivery); type/bug (Something isn't working)
