fix(gateway): exit 0 when systemd sends SIGTERM via systemctl stop #41690

Open
dcain2336 wants to merge 1 commit into NousResearch:main from dcain2336:auto-fix-41631


@dcain2336

Summary

When running under systemd (detected via INVOCATION_ID), treat SIGTERM as a planned stop so the unit exits cleanly (code 0) instead of code 1.

Only signals from outside the service manager (external kill, OOM, container signal) exit non-zero so Restart=on-failure can revive the gateway.

Problem

systemctl stop hermes-gateway.service sends SIGTERM to the gateway. The gateway didn't have a "planned stop marker" (only hermes gateway stop creates one), so it treated the signal as unexpected and exited 1 — putting the unit in a failed state despite systemd having intentionally stopped it.

Fix

In shutdown_signal_handler, after checking for takeover/planned-stop markers, we now check: if received_signal == SIGTERM and INVOCATION_ID is set in the environment, set planned_stop = True so _signal_initiated_shutdown stays False and the gateway exits 0.

This is safe because:

  • INVOCATION_ID is only set by systemd (not external kill, not container signal)
  • When the service manager itself initiates the stop, there's no reason to exit non-zero
  • External SIGTERM (no INVOCATION_ID) still exits 1 for Restart=on-failure
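
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the function name `is_planned_stop` and the `has_stop_marker` parameter are hypothetical stand-ins for the handler's marker checks and its `planned_stop` flag. Note that `INVOCATION_ID` is part of the process environment, so it shows that the process was started by systemd; the sketch assumes, as the PR does, that a SIGTERM in that situation comes from the service manager.

```python
import signal
from typing import Mapping


def is_planned_stop(
    received_signal: int,
    env: Mapping[str, str],
    has_stop_marker: bool = False,
) -> bool:
    """Return True when a shutdown signal should count as a planned stop.

    Hypothetical helper: it mirrors the order of checks described in the PR,
    not the actual signal handler.
    """
    # Explicit markers (takeover, `hermes gateway stop`) are checked first.
    if has_stop_marker:
        return True
    # SIGTERM in a process with a non-empty INVOCATION_ID is treated as a
    # stop issued by the service manager. An empty value counts as unset.
    if received_signal == signal.SIGTERM and env.get("INVOCATION_ID"):
        return True
    # Everything else is left to the existing paths, which are not modeled here.
    return False


def sigterm_exit_code(env: Mapping[str, str], has_stop_marker: bool = False) -> int:
    """Exit code for a SIGTERM shutdown: 0 if planned, 1 otherwise."""
    return 0 if is_planned_stop(signal.SIGTERM, env, has_stop_marker) else 1
```

Passing the environment in as a mapping (for example `os.environ`) keeps the decision easy to exercise without touching the real process environment.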

Changes

  • gateway/run.py: Added systemd-initiated SIGTERM detection in signal handler
  • tests/test_issue_41631_fix.py: 9 regression tests covering all scenarios

Test Plan

  • All 9 regression tests pass
  • Tests cover: systemd SIGTERM → exit 0, external SIGTERM → exit 1, takeover marker precedence, SIGINT behavior, empty INVOCATION_ID
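
For readers who want to reproduce a few of these scenarios locally, one possible approach is to patch `os.environ` per test. The sketch below is not the contents of tests/test_issue_41631_fix.py; `systemd_sigterm` is a hypothetical stand-in for the guard, and only the environment-related cases are covered.

```python
import os
import signal
import unittest
from unittest import mock


def systemd_sigterm(received_signal: int) -> bool:
    """Hypothetical stand-in for the guard: SIGTERM plus a non-empty INVOCATION_ID."""
    return received_signal == signal.SIGTERM and bool(os.environ.get("INVOCATION_ID"))


class SystemdSigtermGuardTest(unittest.TestCase):
    def test_sigterm_with_invocation_id(self):
        # Simulates a process started by systemd.
        with mock.patch.dict(os.environ, {"INVOCATION_ID": "0123abcd"}):
            self.assertTrue(systemd_sigterm(signal.SIGTERM))

    def test_sigterm_without_invocation_id(self):
        # clear=True empties the environment for the duration of the block.
        with mock.patch.dict(os.environ, {}, clear=True):
            self.assertFalse(systemd_sigterm(signal.SIGTERM))

    def test_empty_invocation_id(self):
        # An empty value is treated the same as an unset variable.
        with mock.patch.dict(os.environ, {"INVOCATION_ID": ""}):
            self.assertFalse(systemd_sigterm(signal.SIGTERM))

    def test_sigint_is_not_matched(self):
        # The guard only concerns SIGTERM; SIGINT follows its existing path.
        with mock.patch.dict(os.environ, {"INVOCATION_ID": "0123abcd"}):
            self.assertFalse(systemd_sigterm(signal.SIGINT))
```

`mock.patch.dict` restores the original environment when each `with` block exits, so the tests do not leak state into one another.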

Closes #41631

When running under systemd (INVOCATION_ID set), treat SIGTERM as a
planned stop so the unit exits cleanly (code 0) instead of code 1.
Only signals from outside the service manager (external kill, OOM,
container signal) exit non-zero so Restart=on-failure can revive.

Previously, systemctl stop caused the gateway to exit 1, which put the
unit in a failed state despite systemd having intentionally stopped it.

Closes NousResearch#41631
@alt-glitch added the type/bug, P2, and comp/gateway labels on Jun 8, 2026
@alt-glitch

Collaborator

Duplicate of open PR #41642 (and closed #41639) — same fix: treat SIGTERM under systemd (INVOCATION_ID) as a planned stop so the unit exits 0. All three close #41631.

@liuhao1024

Contributor

Code Review — Positive Verification

Reviewed the full diff (211 lines: gateway/run.py + tests/test_issue_41631_fix.py).

Correctness:

  • The INVOCATION_ID environment variable check is the correct systemd-specific signal — it's set by systemd for all service processes and absent for non-systemd invocations
  • The fix correctly reuses the existing planned_stop flag to suppress the non-zero exit code, so Restart=on-failure won't trigger a spurious restart
  • The guard is placed after the existing planned-stop marker check, so explicit hermes gateway stop still takes priority

Edge cases handled:

  • Container environments without systemd: INVOCATION_ID absent → no effect
  • systemctl stop with Restart=always: SIGTERM + INVOCATION_ID → planned_stop=True → exit 0
  • External kill -TERM: no INVOCATION_ID → _signal_initiated_shutdown remains True → exit 1 (correct)

Test coverage: 163-line test file covers the key scenarios. LGTM.

@liuhao1024

Contributor

✅ Verified — systemd SIGTERM detection via INVOCATION_ID

Reviewed the diff in gateway/run.py and tests/test_issue_41631_fix.py.

  • Root cause is precise: systemctl stop sends SIGTERM but the gateway exited 1, triggering Restart=on-failure and creating a restart loop (issue #41631, "[Bug]: gateway exits code 1 (→ unit 'failed') on systemctl stop; planned stops should exit 0").
  • Detection is correct: INVOCATION_ID is systemd-specific (set for Type=simple/notify service units). The guard not planned_stop and SIGTERM and INVOCATION_ID correctly marks service-manager-initiated stops as planned.
  • No false positives: External kill -TERM (no INVOCATION_ID) still exits 1. SIGINT (Ctrl+C) still exits 0 via the existing path. The takeover marker path is unaffected.
  • Test coverage: 8 tests covering all branching paths — systemd SIGTERM, external kill, SIGINT, marker precedence, and ppid==1 without INVOCATION_ID.

LGTM — clean gateway stability fix.

Development

Successfully merging this pull request may close these issues.

[Bug]: gateway exits code 1 (→ unit 'failed') on systemctl stop; planned stops should exit 0

3 participants