fix(gateway): exit 0 on systemctl stop instead of exit 1 (failed unit) #41642
Open
liuhao1024 wants to merge 1 commit into
When the gateway runs under systemd and receives SIGTERM (e.g. from `systemctl stop`), it exits with code 1, leaving the unit in 'failed' state. This requires `systemctl reset-failed` before a clean restart and pollutes health monitoring.

Root cause: the signal handler treats any SIGTERM without a planned-stop marker as an unexpected kill, but `systemctl stop` is a deliberate operator action. Since the installed unit uses `Restart=always`, the exit code doesn't affect restart behavior; a non-zero exit only creates the spurious 'failed' state.

Fix: detect systemd-managed SIGTERM (via INVOCATION_ID / ppid==1) and treat it as a planned stop → exit 0 → unit goes 'inactive'.

Closes NousResearch#41631
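The decision described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the code in `gateway/run.py`: the function names `under_systemd` and `exit_code_for_signal`, their parameters, and the rule that a planned-stop marker always yields exit 0 are assumptions made for the example. Only the `INVOCATION_ID` / `ppid == 1` checks and the SIGTERM condition come from the description.

```python
import os
import signal


def under_systemd(environ=None, ppid=None):
    """Heuristic from the PR text: INVOCATION_ID is set, or the parent PID is 1.

    The parameters are hypothetical; they exist only so the check can be
    exercised without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    ppid = os.getppid() if ppid is None else ppid
    return bool(environ.get("INVOCATION_ID")) or ppid == 1


def exit_code_for_signal(signum, planned_stop_marker, environ=None, ppid=None):
    """Return the exit code the gateway would use after receiving `signum`.

    Assumption: a planned-stop marker always means a clean exit (0).
    """
    if planned_stop_marker:
        return 0
    # SIGTERM from a service manager is treated as a deliberate stop.
    if signum == signal.SIGTERM and under_systemd(environ, ppid):
        return 0
    # Everything else keeps the existing "unexpected kill" exit code.
    return 1
```

For example, `exit_code_for_signal(signal.SIGTERM, False, {"INVOCATION_ID": "abc"}, 1234)` returns 0, while the same call with `signal.SIGINT` returns 1, which mirrors the SIGTERM vs SIGINT discrimination the PR tests for.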
Thanks for flagging @alt-glitch. I've left a comparison on #41639 — this PR (#41642) uses the existing |
syx-labs added a commit to syx-labs/hermes-agent that referenced this pull request on Jun 11, 2026
Cherry-pick of open upstream PR NousResearch#41642 (fixes NousResearch#41631). Railway's container manager sends SIGTERM on every redeploy; without this, the gateway exits 1 and the supervisor treats a planned stop as a crash.
Summary
When the gateway runs under systemd and receives SIGTERM (e.g. from `systemctl stop`), it exits with code 1, leaving the unit in `"failed"` state. This requires `systemctl reset-failed` before a clean restart and pollutes any health monitoring that reads unit state.

Root Cause
The signal handler treats any SIGTERM without a planned-stop marker as an unexpected kill. But `systemctl stop` is a deliberate operator action that sends SIGTERM without writing a marker first.

The exit-1 rationale is self-defeating: the installed unit uses `Restart=always`, under which exit 0 is also restarted. So a non-zero exit buys nothing for revival; it only converts a clean stop into a `"failed"` unit.

Fix
In `gateway/run.py`, the signal handler now checks whether the gateway is running under systemd (via the `INVOCATION_ID` env var or `ppid == 1`) and whether the received signal is SIGTERM. If so, it treats the signal as a planned stop → exit 0 → unit goes `"inactive"` instead of `"failed"`.

This preserves the exit-1 behavior for:
Testing
- `tests/gateway/test_systemd_stop_exit_code.py`: `under_systemd` detection with/without `INVOCATION_ID`, signal discrimination (SIGTERM vs SIGINT), and the decision-logic condition
- `test_clean_shutdown_marker.py` passes (no regression)

Reproduce → Expected Behavior
Before:
After:
Closes #41631