fix(gateway): exit 0 on systemctl stop instead of exit 1 (failed unit) by liuhao1024 · Pull Request #41642 · NousResearch/hermes-agent

liuhao1024 · 2026-06-08T00:45:18Z

Summary

When the gateway runs under systemd and receives SIGTERM (e.g. from systemctl stop), it exits with code 1, leaving the unit in "failed" state. This requires systemctl reset-failed before a clean restart and pollutes any health monitoring that reads unit state.

Root Cause

The signal handler treats any SIGTERM without a planned-stop marker as an unexpected kill. But systemctl stop is a deliberate operator action that sends SIGTERM without writing a marker first.

The exit-1 rationale is self-defeating: the installed unit uses Restart=always, under which systemd restarts the service even after a clean exit 0. A non-zero exit therefore buys nothing for revival; it only converts a clean stop into a "failed" unit.

Fix

In gateway/run.py, the signal handler now checks whether the gateway is running under systemd (INVOCATION_ID env var is set, or ppid == 1) and whether the received signal is SIGTERM. If both hold, it treats the signal as a planned stop → exit 0 → unit goes "inactive" instead of "failed".

This preserves the exit-1 behavior for:

Non-systemd environments (standalone, Docker without systemd)
Any signal that is NOT SIGTERM under systemd (future-proofing)
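
For readers who want the decision in one place, here is a minimal sketch of the condition described above. It is not the code in gateway/run.py: the function names, the `planned_marker` parameter, and the choice to pass the environment and ppid in as arguments are assumptions made for illustration.

```python
import signal
from typing import Mapping, Optional


def under_systemd(env: Mapping[str, str], ppid: int) -> bool:
    """Heuristic from the PR text: INVOCATION_ID is set, or the parent is PID 1."""
    return bool(env.get("INVOCATION_ID")) or ppid == 1


def shutdown_exit_code(signum: Optional[int], planned_marker: bool,
                       env: Mapping[str, str], ppid: int) -> int:
    """Return 0 for a planned stop and 1 for an unexpected kill."""
    # Assumption: an existing planned-stop marker already means "clean stop".
    if planned_marker:
        return 0
    # New case: SIGTERM delivered while running under systemd.
    if signum == signal.SIGTERM and under_systemd(env, ppid):
        return 0
    # Everything else keeps the exit-1 behavior.
    return 1
```

In a real handler the last two arguments would presumably come from `os.environ` and `os.getppid()`; passing them in keeps the sketch easy to exercise without touching the process environment.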

Testing

8 new tests in tests/gateway/test_systemd_stop_exit_code.py
Tests cover: under_systemd detection with/without INVOCATION_ID, signal discrimination (SIGTERM vs SIGINT), and the decision-logic condition
Existing test_clean_shutdown_marker.py passes (no regression)
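
The shape of those test cases can be sketched as follows. This is an illustration only, not the contents of tests/gateway/test_systemd_stop_exit_code.py: `under_systemd()` and `is_planned_stop()` here are hypothetical stand-ins defined inline, whereas the real tests exercise the project's own context function.

```python
import os
import signal
from unittest import mock


def under_systemd() -> bool:
    """Hypothetical stand-in for the detection exercised by the real tests."""
    return bool(os.environ.get("INVOCATION_ID")) or os.getppid() == 1


def is_planned_stop(signum: int) -> bool:
    """Decision condition: only SIGTERM under systemd counts as planned."""
    return signum == signal.SIGTERM and under_systemd()


def test_detected_with_invocation_id():
    with mock.patch.dict(os.environ, {"INVOCATION_ID": "abc123"}), \
            mock.patch("os.getppid", return_value=4242):
        assert under_systemd()


def test_not_detected_without_invocation_id():
    # clear=True empties os.environ for the duration of the block.
    with mock.patch.dict(os.environ, {}, clear=True), \
            mock.patch("os.getppid", return_value=4242):
        assert not under_systemd()


def test_sigterm_vs_sigint():
    with mock.patch.dict(os.environ, {"INVOCATION_ID": "abc123"}), \
            mock.patch("os.getppid", return_value=4242):
        assert is_planned_stop(signal.SIGTERM)
        assert not is_planned_stop(signal.SIGINT)
```

Patching both the environment and `os.getppid` keeps the outcome independent of where the tests run, for example inside a container whose test runner happens to be PID 1.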

Reproduce → Expected Behavior

Before:

hermes gateway install && hermes gateway start
systemctl --user stop hermes-gateway-<name>
systemctl --user is-active hermes-gateway-<name>  # → "failed"

After:

hermes gateway install && hermes gateway start
systemctl --user stop hermes-gateway-<name>
systemctl --user is-active hermes-gateway-<name>  # → "inactive"
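
The same before/after difference can be observed without systemd by simulating it. The sketch below starts a small stand-in process (not the real gateway), sends it SIGTERM, and reports its exit code with and without INVOCATION_ID in the environment. It assumes a POSIX system, and the ppid == 1 fallback is deliberately left out of the stand-in so the result does not depend on where the script runs.

```python
import os
import signal
import subprocess
import sys

# Stand-in child process; it only mimics the exit-code decision.
CHILD = r'''
import os, signal, sys, time

def handler(signum, frame):
    managed = bool(os.environ.get("INVOCATION_ID"))
    sys.exit(0 if (signum == signal.SIGTERM and managed) else 1)

signal.signal(signal.SIGTERM, handler)
print("ready", flush=True)  # tell the parent the handler is installed
while True:
    time.sleep(0.1)
'''


def exit_code_after_sigterm(invocation_id=None):
    """Start the stand-in, send SIGTERM, and return its exit code."""
    env = {k: v for k, v in os.environ.items() if k != "INVOCATION_ID"}
    if invocation_id is not None:
        env["INVOCATION_ID"] = invocation_id
    proc = subprocess.Popen([sys.executable, "-c", CHILD], env=env,
                            stdout=subprocess.PIPE, text=True)
    try:
        proc.stdout.readline()  # block until the child prints "ready"
        proc.send_signal(signal.SIGTERM)
        return proc.wait(timeout=10)
    finally:
        if proc.poll() is None:
            proc.kill()
        proc.stdout.close()
```

Calling `exit_code_after_sigterm()` corresponds to the non-systemd case (exit 1), and `exit_code_after_sigterm("demo-id")` to the systemd-managed case (exit 0).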

Closes #41631

When the gateway runs under systemd and receives SIGTERM (e.g. from `systemctl stop`), it exits with code 1, leaving the unit in 'failed' state. This requires `systemctl reset-failed` before a clean restart and pollutes health monitoring. Root cause: the signal handler treats any SIGTERM without a planned-stop marker as an unexpected kill, but `systemctl stop` is a deliberate operator action. Since the installed unit uses `Restart=always`, exit code doesn't affect restart behavior — a non-zero exit only creates the spurious 'failed' state. Fix: detect systemd-managed SIGTERM (via INVOCATION_ID / ppid==1) and treat it as a planned stop → exit 0 → unit goes 'inactive'. Closes NousResearch#41631

alt-glitch · 2026-06-08T01:23:39Z

Likely duplicate of #41639 — both fix #41631 by treating SIGTERM under systemd as a planned stop (exit 0) in gateway/run.py so the unit reports "inactive" instead of "failed". Same root cause, same call site.

liuhao1024 · 2026-06-08T01:34:02Z

Thanks for flagging @alt-glitch. I've left a comparison on #41639 — this PR (#41642) uses the existing snapshot_shutdown_context() infrastructure with under_systemd detection and has 8 tests exercising the actual context function, while #41639 replicates the logic inline with os.environ.get() and 4 tests. Both fix #41631 equivalently. Keeping this one open as the more complete fix.

Cherry-pick of open upstream PR NousResearch#41642 (fixes NousResearch#41631). Railway's container manager sends SIGTERM on every redeploy; without this, the gateway exits 1 and the supervisor treats a planned stop as a crash.

liuhao1024 mentioned this pull request Jun 8, 2026

[Bug]: gateway exits code 1 (→ unit 'failed') on systemctl stop; planned stops should exit 0 #41631

Open

alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels Jun 8, 2026

liuhao1024 mentioned this pull request Jun 8, 2026

fix(gateway): treat SIGTERM under systemd as planned stop (exit 0) #41639

Closed

13 tasks

alt-glitch mentioned this pull request Jun 8, 2026

fix(gateway): exit 0 when systemd sends SIGTERM via systemctl stop #41690

Open

2 tasks

yubingz mentioned this pull request Jun 9, 2026

fix(gateway): ExecStop should write planned-stop marker instead of inferring signal source #42517

Open

izumi0uu mentioned this pull request Jun 9, 2026

fix(gateway): write planned-stop marker from systemd ExecStop #42555

Open

14 tasks

liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/systemd-stop-exit-code
