fix(gateway): ExecStop should write planned-stop marker instead of inferring signal source

## Related

- Original issue: #41631
- PR #41642 (liuhao1024)
- PR #41690 (dcain2336)

## Problem

Both PRs attempt to solve #41631 by **inferring** whether a SIGTERM came from `systemctl stop` — using `INVOCATION_ID` / `_shutdown_ctx["under_systemd"]` as the signal source indicator. This approach is fundamentally unreliable.

### Why inference breaks

From inside the process, a SIGTERM is a SIGTERM — there is no reliable way to distinguish:
- `systemctl stop` (intentional admin stop → should exit 0)
- `kill -SIGTERM $PID` from an external process (unintended → should exit 1 so Restart=always can revive)
- OOM killer sending SIGTERM within a systemd cgroup (unintended → should exit 1)

All three look identical inside `shutdown_signal_handler()`. Checking `INVOCATION_ID` or `under_systemd` only tells you "I am running under systemd" — not "systemd is the one who sent this SIGTERM."

### Concrete failure cases

| Scenario | INVOCATION_ID | Expected exit | Both PRs exit |
|---|---|---|---|
| `systemctl stop` | set | 0 ✅ | 0 ✅ |
| `kill -SIGTERM` from external process, under systemd | set | 1 (restart wanted) | **0** ❌ |
| OOM killer SIGTERM in systemd cgroup | set | 1 (restart wanted) | **0** ❌ |
| Docker container with systemd, `kill -SIGTERM` | set | 1 (restart wanted) | **0** ❌ |

In all the ❌ cases, the gateway exits 0, systemd marks the unit as `inactive` instead of `failed`, and **no automatic restart happens** — defeating the purpose of `Restart=on-failure` / `Restart=always`.

### Additional issues with #41642

- The test file (`test_systemd_stop_exit_code.py`) tests a **simulated** version of the condition logic rather than the actual `shutdown_signal_handler()` code path. The `_simulate_planned_stop_detection()` function duplicates the branching logic instead of invoking the real handler, so passing tests do not verify real behavior.
- Relies on `snapshot_shutdown_context()` which performs I/O (`/proc` reads) — this runs synchronously inside the asyncio signal handler. If `/proc` is slow or unavailable, the detection silently degrades.

## Proposed approach

The existing `planned_stop` marker mechanism is the correct design — `hermes gateway stop` already writes a marker file before sending SIGTERM. The problem is that `systemctl stop` bypasses this mechanism by sending SIGTERM directly.

**Fix the systemd unit, not the signal handler:**

```ini
[Service]
ExecStop=/usr/bin/hermes gateway stop
# OR: ExecStop=/bin/sh -c 'hermes gateway stop || true'
```

This way:
1. `systemctl stop` → calls `hermes gateway stop` → writes planned-stop marker → sends SIGTERM → signal handler finds marker → exit 0 ✅
2. External `kill -SIGTERM` → no marker → signal handler treats as unexpected → exit 1 → Restart=always revives ✅
3. OOM killer → no marker → exit 1 → restart ✅

No inference, no guessing, no new edge cases.

### If ExecStop change is not possible

If changing the installed unit is undesirable (e.g., upgrade path concerns), a lighter alternative is to add an ExecStop helper that writes the marker and then sends SIGTERM, without going through the full CLI:

```bash
#!/bin/sh
# hermes-systemd-stop-helper
MARKER_DIR=$(hermes config get marker_dir 2>/dev/null)
[ -n "$MARKER_DIR" ] && touch "$MARKER_DIR/planned-stop-$(cat $MAINPID)"
kill -SIGTERM "$MAINPID"
```

This still avoids the inference problem entirely.

## Summary

Both PRs implement the right goal (exit 0 on `systemctl stop`) but use an unreliable mechanism (environment-based inference) that breaks the restart guarantee for unintended SIGTERMs. The correct fix is to make `systemctl stop` go through the existing planned-stop marker path by changing `ExecStop`, rather than teaching the signal handler to guess who sent the signal.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): ExecStop should write planned-stop marker instead of inferring signal source #42517

Related

Problem

Why inference breaks

Concrete failure cases

Additional issues with #41642

Proposed approach

If ExecStop change is not possible

Summary

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Scenario	INVOCATION_ID	Expected exit	Both PRs exit
`systemctl stop`	set	0 ✅	0 ✅
`kill -SIGTERM` from external process, under systemd	set	1 (restart wanted)	0 ❌
OOM killer SIGTERM in systemd cgroup	set	1 (restart wanted)	0 ❌
Docker container with systemd, `kill -SIGTERM`	set	1 (restart wanted)	0 ❌

fix(gateway): ExecStop should write planned-stop marker instead of inferring signal source #42517

Description

Related

Problem

Why inference breaks

Concrete failure cases

Additional issues with #41642

Proposed approach

If ExecStop change is not possible

Summary

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions