Bug Description
On a live Linux systemd deployment, Hermes can enter a self-restart/update path where the gateway transitions into draining, exits cleanly, and then never comes back because the unit uses Restart=on-failure.
The practical result is that all messaging platforms silently lose responsiveness until someone manually starts hermes-gateway again.
In the environment where this was observed, Telegram was the trigger path, and after the restart/update sequence the gateway stopped responding across Telegram / WeChat / DingTalk.
Environment
- Hermes: v0.10.0 (2026.4.16)
- OS: Linux with systemd-managed hermes-gateway.service
- Service policy: Restart=on-failure
- Platforms configured and previously working: Telegram / DingTalk / Weixin (WeChat)
Reproduction Path
- Run Hermes under systemd as hermes-gateway.service.
- Trigger a gateway self-restart/update from a live messaging session (in the observed case, via Telegram).
- Hermes begins draining active work before restart.
- The gateway process exits cleanly.
- systemd does not restart it, because the exit was treated as successful rather than failed.
- Subsequent Telegram / DingTalk / WeChat messages get no response until a manual systemctl start hermes-gateway.
Observed Evidence
On the affected host, the persisted state file still showed restart/drain intent after the gateway had disappeared:
{
"gateway_state": "draining",
"restart_requested": true,
"active_agents": 1
}
At the same time, systemd no longer had a running Hermes gateway process.
This means the restart flow had begun, but the service manager did not bring the gateway back.
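The mismatch described above (restart intent still on disk, no running service) can be checked mechanically. The following is a minimal sketch, assuming the state file has the fields shown in the evidence; the function name `is_stranded` is hypothetical and not part of Hermes.

```python
import json


def is_stranded(state: dict, service_active: bool) -> bool:
    """Return True when the state still records restart/drain intent
    but the service manager reports no active gateway."""
    wants_restart = (
        state.get("restart_requested") is True
        or state.get("gateway_state") == "draining"
    )
    return wants_restart and not service_active


# The state observed on the affected host.
observed = json.loads(
    '{"gateway_state": "draining", "restart_requested": true, "active_agents": 1}'
)

print(is_stranded(observed, service_active=False))  # True
```

A watchdog or health check could feed `service_active` from the result of `systemctl is-active --quiet hermes-gateway`, which exits 0 only when the unit is active.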
Expected Behavior
Any self-restart/update path initiated by Hermes while running under systemd should reliably leave the unit running again afterward.
That can be implemented in one of several ways, for example:
- route all self-restart flows through a service-managed restart path
- exit with a dedicated restart code that the unit treats as restartable
- or actively verify that the service has become active again before considering the restart complete
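The second and third options could look roughly like the sketch below. It is an illustration under assumptions, not Hermes's implementation: the exit code 75 and the helper names are hypothetical, and the exit-code option would also need a matching `RestartForceExitStatus=75` line in the unit so that systemd restarts on that code regardless of the Restart= setting. The verification helper has to run in a process that outlives the gateway (for example an update command), since the exiting gateway cannot check on itself.

```python
import subprocess
import time

# Hypothetical code the gateway would exit with to request a restart.
RESTART_EXIT_CODE = 75


def unit_is_active(unit: str) -> bool:
    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def wait_until_active(
    unit: str,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    probe=unit_is_active,
    sleep=time.sleep,
) -> bool:
    """Poll until the unit is active; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(unit):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would treat a `False` result as a hard failure and report it loudly instead of declaring the restart complete. The `probe` and `sleep` parameters exist only so the polling logic can be exercised without a real systemd.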
Actual Behavior
The gateway can exit successfully during restart/drain, and because the unit is configured with Restart=on-failure, systemd does not relaunch it.
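This matches the restart table in the systemd.service manual: Restart=on-failure covers unclean exit codes, unclean signals, timeouts, and watchdog expiry, but not a clean exit. The lookup below encodes that table as a sketch; the outcome labels are shorthand chosen for this example, not systemd identifiers.

```python
# Which exit outcomes trigger an automatic restart for each Restart= value,
# per the table in the systemd.service manual. "clean" means a clean exit
# code or a clean signal (by default exit status 0 or SIGHUP, SIGINT,
# SIGTERM, SIGPIPE).
RESTART_ON = {
    "no": set(),
    "on-success": {"clean"},
    "on-failure": {"unclean-exit", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
    "always": {"clean", "unclean-exit", "unclean-signal", "timeout", "watchdog"},
}


def would_restart(policy: str, outcome: str) -> bool:
    """Return True if systemd would relaunch the service for this outcome."""
    return outcome in RESTART_ON[policy]


# The failure in this report: a clean exit under Restart=on-failure.
print(would_restart("on-failure", "clean"))  # False
print(would_restart("always", "clean"))      # True
```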
Why this seems distinct from related issues
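This looks adjacent to, but not fully covered by, a few earlier reports/fixes:
- /restart forcing the wrong restart path under systemd
- hermes update not verifying that the service survived restart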
However, this observed failure still happened on v0.10.0 (2026.4.16) in a real deployment, and the visible symptom was:
- restart requested
- gateway entered draining
- process exited cleanly
- service stayed dead
So there may still be at least one remaining or regressed self-restart path that bypasses the intended service-managed recovery logic.
User Impact
From the operator's perspective this looks like:
- the bot was working
- a restart/update was triggered from chat
- after that, Hermes became silent on all channels
- no obvious user-facing error explains that the gateway never came back
Suggested Direction
Audit every restart entrypoint used by:
- /restart
- upgrade/update flows
- any gateway self-managed restart helper
and ensure that under systemd they all converge on a restart mechanism that guarantees the service becomes active again (or emits a loud failure if it does not).
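One way to make the entrypoints converge is a single decision function that every restart caller goes through. The sketch below is an assumption about how that could be structured, not a description of existing Hermes code: `plan_restart`, the `reexec` action, and exit code 75 are all hypothetical, and the code would need a matching `RestartForceExitStatus=75` in the unit. Detection relies on `INVOCATION_ID`, which systemd sets in the environment of the service processes it starts.

```python
import os

# Hypothetical exit code that the unit would be configured to restart on.
RESTART_EXIT_CODE = 75


def running_under_systemd(env=os.environ) -> bool:
    # systemd exports INVOCATION_ID to processes it launches for a unit.
    return "INVOCATION_ID" in env


def plan_restart(source: str, env=os.environ) -> dict:
    """Single decision point for /restart, update flows, and internal helpers.

    Under systemd, always hand the restart to the service manager by exiting
    with the dedicated code; otherwise fall back to a self-managed re-exec.
    """
    if running_under_systemd(env):
        return {"source": source, "action": "exit", "exit_code": RESTART_EXIT_CODE}
    return {"source": source, "action": "reexec", "exit_code": None}


# Every entrypoint produces the same plan under systemd.
fake_env = {"INVOCATION_ID": "0123456789abcdef"}
for entrypoint in ("/restart", "update", "self-restart-helper"):
    print(plan_restart(entrypoint, env=fake_env))
```

Keeping the decision in one place means an audit only has to confirm that each entrypoint calls it, instead of re-checking the restart logic in every caller.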