Gateway self-restart/update can exit cleanly into draining state and stay dead under systemd Restart=on-failure

## Bug Description

On a live Linux systemd deployment, Hermes can enter a self-restart/update path where the gateway transitions into `draining`, exits cleanly, and then never comes back because the unit uses `Restart=on-failure`.

The practical result is that all messaging platforms silently lose responsiveness until someone manually starts `hermes-gateway` again.

In the environment where this was observed, Telegram was the trigger path, and after the restart/update sequence the gateway stopped responding across Telegram / WeChat / DingTalk.

## Environment
- Hermes: `v0.10.0 (2026.4.16)`
- OS: Linux with systemd-managed `hermes-gateway.service`
- Service policy: `Restart=on-failure`
- Platforms configured and previously working: Telegram / DingTalk / Weixin (WeChat)

## Reproduction Path
1. Run Hermes under systemd as `hermes-gateway.service`.
2. Trigger a gateway self-restart/update from a live messaging session (in the observed case, via Telegram).
3. Hermes begins draining active work before restart.
4. The gateway process exits cleanly.
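5. systemd does not restart it: `Restart=on-failure` only relaunches a unit after an unclean exit (non-zero exit code, unclean signal, timeout, or watchdog), and this exit counted as successful.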
6. Subsequent Telegram / DingTalk / WeChat messages get no response until a manual `systemctl start hermes-gateway`.

## Observed Evidence
On the affected host, the persisted state file still showed restart/drain intent after the gateway had disappeared:

```json
{
  "gateway_state": "draining",
  "restart_requested": true,
  "active_agents": 1
}
```

At the same time, systemd no longer had a running Hermes gateway process.

This means the restart flow had begun, but the service manager did not bring the gateway back.
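
The stranded condition can be checked mechanically. This is a minimal sketch, assuming the state file is JSON with the fields shown above. The function name `looks_like_stranded_restart` and the `service_active` flag are hypothetical and are not part of Hermes.

```python
import json


def looks_like_stranded_restart(state_json: str, service_active: bool) -> bool:
    """Return True when the persisted state shows a restart in flight
    while the service itself is not running.

    `service_active` is supplied by the caller (for example from
    `systemctl is-active`); this function only inspects the state file.
    """
    state = json.loads(state_json)
    restart_in_flight = (
        state.get("gateway_state") == "draining"
        and state.get("restart_requested") is True
    )
    return restart_in_flight and not service_active
```

A watchdog or health check could use a test like this to emit a loud alert instead of leaving the gateway silently down.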

## Expected Behavior
Any self-restart/update path initiated by Hermes while running under systemd should reliably leave the unit running again afterward.

That can be implemented in one of several ways, for example:
- route all self-restart flows through a service-managed restart path
- exit with a dedicated restart code that the unit treats as restartable
- or actively verify that the service has become `active` again before considering restart complete
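
The third option could look roughly like the sketch below. It assumes the check runs in a process that outlives the gateway (for example an update CLI), and it polls `systemctl is-active --quiet`, which exits 0 when the unit is active. The helper names, timeout, and poll interval are illustrative and not taken from Hermes.

```python
import subprocess
import time
from typing import Callable


def systemctl_is_active(unit: str) -> bool:
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def wait_until_active(
    unit: str,
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
    is_active: Callable[[str], bool] = systemctl_is_active,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the unit is active or the timeout elapses.

    Returns False on timeout so the caller can report a loud failure.
    The probe, sleep, and clock are injectable to keep this testable.
    """
    deadline = clock() + timeout_s
    while True:
        if is_active(unit):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

One caveat: if the poll starts before the old process has exited, the unit may still report `active` for the outgoing instance, so a real implementation would likely also need to confirm that the old process is gone.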

## Actual Behavior
The gateway can exit successfully during restart/drain, and because the unit is configured with `Restart=on-failure`, systemd does not relaunch it.

## Why this seems distinct from related issues
This looks adjacent to, but not fully covered by, a few earlier reports/fixes:
- #8104 discussed `/restart` forcing the wrong restart path under systemd
- #6631 discussed `hermes update` not verifying that the service survived restart
- PR #8674 / #9945 improved waiting for service restart state transitions

However, this observed failure still happened on `v0.10.0 (2026.4.16)` in a real deployment, and the visible symptom was:
- restart requested
- gateway entered draining
- process exited cleanly
- service stayed dead

So there may still be at least one remaining or regressed self-restart path that bypasses the intended service-managed recovery logic.

## User Impact
From the operator's perspective this looks like:
- the bot was working
- a restart/update was triggered from chat
- after that, Hermes became silent on all channels
- no obvious user-facing error explains that the gateway never came back

## Suggested Direction
Audit every restart entrypoint used by:
- `/restart`
- upgrade/update flows
- any gateway self-managed restart helper

and ensure that under systemd they all converge on a restart mechanism that guarantees the service becomes active again (or emits a loud failure if it does not).
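
One way to make the entrypoints converge is a single helper that decides how the process should exit for a restart. The sketch below is an assumption-laden illustration, not Hermes code: the exit code 75 is an arbitrary choice, the function names are hypothetical, and detecting systemd through the `INVOCATION_ID` environment variable (which systemd sets for processes it starts as part of a unit) is a heuristic.

```python
import os
from typing import Mapping, Optional

# Hypothetical dedicated restart code. Any non-zero exit already counts as a
# failure under Restart=on-failure; listing it in RestartForceExitStatus= in
# the unit file would make the restart explicit regardless of Restart=.
RESTART_EXIT_CODE = 75


def running_under_systemd(environ: Mapping[str, str] = os.environ) -> bool:
    """Heuristic: systemd exports INVOCATION_ID to the processes of a unit."""
    return "INVOCATION_ID" in environ


def exit_code_for_restart(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Return the exit code every restart entrypoint should use under systemd.

    Returns None outside systemd, meaning the caller must fall back to
    whatever non-systemd restart path it already has.
    """
    if running_under_systemd(environ):
        return RESTART_EXIT_CODE
    return None
```

With a helper like this, `/restart`, update flows, and any internal restart helper would call the same function after draining and pass the result to `sys.exit`, so none of them can end in a clean exit 0 that `Restart=on-failure` ignores.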

