Skip to content

fix(gateway): ExecStop should write planned-stop marker instead of inferring signal source #42517

@yubingz

Description

@yubingz

Related

Problem

Both PRs attempt to solve #41631 by inferring whether a SIGTERM came from systemctl stop — using INVOCATION_ID / _shutdown_ctx["under_systemd"] as the signal source indicator. This approach is fundamentally unreliable.

Why inference breaks

From inside the process, a SIGTERM is a SIGTERM — there is no reliable way to distinguish:

  • systemctl stop (intentional admin stop → should exit 0)
  • kill -SIGTERM $PID from an external process (unintended → should exit 1 so Restart=always can revive)
  • OOM killer sending SIGTERM within a systemd cgroup (unintended → should exit 1)

All three look identical inside shutdown_signal_handler(). Checking INVOCATION_ID or under_systemd only tells you "I am running under systemd" — not "systemd is the one who sent this SIGTERM."

Concrete failure cases

Scenario INVOCATION_ID Expected exit Both PRs exit
systemctl stop set 0 ✅ 0 ✅
kill -SIGTERM from external process, under systemd set 1 (restart wanted) 0
OOM killer SIGTERM in systemd cgroup set 1 (restart wanted) 0
Docker container with systemd, kill -SIGTERM set 1 (restart wanted) 0

In all the ❌ cases, the gateway exits 0, systemd marks the unit as inactive instead of failed, and no automatic restart happens — defeating the purpose of Restart=on-failure / Restart=always.

Additional issues with #41642

  • The test file (test_systemd_stop_exit_code.py) tests a simulated version of the condition logic rather than the actual shutdown_signal_handler() code path. The _simulate_planned_stop_detection() function duplicates the branching logic instead of invoking the real handler, so passing tests do not verify real behavior.
  • Relies on snapshot_shutdown_context() which performs I/O (/proc reads) — this runs synchronously inside the asyncio signal handler. If /proc is slow or unavailable, the detection silently degrades.

Proposed approach

The existing planned_stop marker mechanism is the correct design — hermes gateway stop already writes a marker file before sending SIGTERM. The problem is that systemctl stop bypasses this mechanism by sending SIGTERM directly.

Fix the systemd unit, not the signal handler:

[Service]
ExecStop=/usr/bin/hermes gateway stop
# OR: ExecStop=/bin/sh -c 'hermes gateway stop || true'

This way:

  1. systemctl stop → calls hermes gateway stop → writes planned-stop marker → sends SIGTERM → signal handler finds marker → exit 0 ✅
  2. External kill -SIGTERM → no marker → signal handler treats as unexpected → exit 1 → Restart=always revives ✅
  3. OOM killer → no marker → exit 1 → restart ✅

No inference, no guessing, no new edge cases.

If ExecStop change is not possible

If changing the installed unit is undesirable (e.g., upgrade path concerns), a lighter alternative is to add an ExecStop helper that writes the marker and then sends SIGTERM, without going through the full CLI:

#!/bin/sh
# hermes-systemd-stop-helper
MARKER_DIR=$(hermes config get marker_dir 2>/dev/null)
[ -n "$MARKER_DIR" ] && touch "$MARKER_DIR/planned-stop-$(cat $MAINPID)"
kill -SIGTERM "$MAINPID"

This still avoids the inference problem entirely.

Summary

Both PRs implement the right goal (exit 0 on systemctl stop) but use an unreliable mechanism (environment-based inference) that breaks the restart guarantee for unintended SIGTERMs. The correct fix is to make systemctl stop go through the existing planned-stop marker path by changing ExecStop, rather than teaching the signal handler to guess who sent the signal.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existscomp/gatewayGateway runner, session dispatch, deliverytype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions