fix(gateway): exit 0 on systemctl stop instead of exit 1 (failed unit) #41642

Open
liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/systemd-stop-exit-code

@liuhao1024

Contributor

Summary

When the gateway runs under systemd and receives SIGTERM (e.g. from systemctl stop), it exits with code 1, leaving the unit in "failed" state. This requires systemctl reset-failed before a clean restart and pollutes any health monitoring that reads unit state.

Root Cause

The signal handler treats any SIGTERM without a planned-stop marker as an unexpected kill. But systemctl stop is a deliberate operator action that sends SIGTERM without writing a marker first.

The exit-1 rationale is self-defeating: the installed unit uses Restart=always, under which exit 0 is also restarted. So a non-zero exit buys nothing for revival — it only converts a clean stop into a "failed" unit.

Fix

In gateway/run.py, the signal handler now checks if the gateway is running under systemd (via INVOCATION_ID env var or ppid == 1) and the received signal is SIGTERM. If so, it treats it as a planned stop → exit 0 → unit goes "inactive" instead of "failed".

This preserves the exit-1 behavior for:

  • Non-systemd environments (standalone, Docker without systemd)
  • Any signal that is NOT SIGTERM under systemd (future-proofing)
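
The decision described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the names under_systemd, exit_code_for_signal, and planned_stop_marker are hypothetical, the detection only mirrors the two checks named in this PR (INVOCATION_ID or ppid == 1), and it assumes a planned-stop marker maps to exit 0.

```python
import os
import signal


def under_systemd(environ=None, ppid=None):
    """Best-effort guess at whether systemd is supervising this process.

    Hypothetical helper: mirrors the two checks named in the PR text.
    """
    environ = os.environ if environ is None else environ
    ppid = os.getppid() if ppid is None else ppid
    # systemd sets INVOCATION_ID for each service invocation; ppid == 1 is a
    # weaker fallback hint.
    return bool(environ.get("INVOCATION_ID")) or ppid == 1


def exit_code_for_signal(signum, planned_stop_marker, systemd_managed):
    """Return 0 for a planned stop, 1 for an unexpected kill."""
    if planned_stop_marker:
        # Assumption: a marker written before the signal means a planned stop.
        return 0
    if systemd_managed and signum == signal.SIGTERM:
        # `systemctl stop` sends SIGTERM without writing a marker first.
        return 0
    return 1
```

Passing the environment and parent PID as arguments keeps the sketch testable without touching the real process state; the actual handler may read them differently.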

Testing

  • 8 new tests in tests/gateway/test_systemd_stop_exit_code.py
  • Tests cover: under_systemd detection with/without INVOCATION_ID, signal discrimination (SIGTERM vs SIGINT), and the decision-logic condition
  • Existing test_clean_shutdown_marker.py passes (no regression)

Reproduce → Expected Behavior

Before:

hermes gateway install && hermes gateway start
systemctl --user stop hermes-gateway-<name>
systemctl --user is-active hermes-gateway-<name>  # → "failed"

After:

hermes gateway install && hermes gateway start
systemctl --user stop hermes-gateway-<name>
systemctl --user is-active hermes-gateway-<name>  # → "inactive"

Closes #41631

When the gateway runs under systemd and receives SIGTERM (e.g. from
`systemctl stop`), it exits with code 1, leaving the unit in 'failed'
state. This requires `systemctl reset-failed` before a clean restart
and pollutes health monitoring.

Root cause: the signal handler treats any SIGTERM without a planned-stop
marker as an unexpected kill, but `systemctl stop` is a deliberate
operator action. Since the installed unit uses `Restart=always`, exit
code doesn't affect restart behavior — a non-zero exit only creates the
spurious 'failed' state.

Fix: detect systemd-managed SIGTERM (via INVOCATION_ID / ppid==1) and
treat it as a planned stop → exit 0 → unit goes 'inactive'.

Closes NousResearch#41631
@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels on Jun 8, 2026
@alt-glitch

Collaborator

Likely duplicate of #41639 — both fix #41631 by treating SIGTERM under systemd as a planned stop (exit 0) in gateway/run.py so the unit reports "inactive" instead of "failed". Same root cause, same call site.

@liuhao1024

Contributor Author

Thanks for flagging @alt-glitch. I've left a comparison on #41639 — this PR (#41642) uses the existing snapshot_shutdown_context() infrastructure with under_systemd detection and has 8 tests exercising the actual context function, while #41639 replicates the logic inline with os.environ.get() and 4 tests. Both fix #41631 equivalently. Keeping this one open as the more complete fix.

syx-labs added a commit to syx-labs/hermes-agent that referenced this pull request on Jun 11, 2026
Cherry-pick of open upstream PR NousResearch#41642
(fixes NousResearch#41631). Railway's container manager sends SIGTERM on every
redeploy; without this, the gateway exits 1 and the supervisor
treats a planned stop as a crash.
Development

Successfully merging this pull request may close these issues.

[Bug]: gateway exits code 1 (→ unit 'failed') on systemctl stop; planned stops should exit 0

2 participants