Gateway self-restart/update can exit cleanly into draining state and stay dead under systemd Restart=on-failure #11258

@kevinskysunny

Bug Description

On a live Linux systemd deployment, Hermes can enter a self-restart/update path where the gateway transitions into draining, exits cleanly, and then never comes back because the unit uses Restart=on-failure.

The practical result is that all messaging platforms silently lose responsiveness until someone manually starts hermes-gateway again.

In the environment where this was observed, Telegram was the trigger path, and after the restart/update sequence the gateway stopped responding across Telegram / WeChat / DingTalk.

Environment

  • Hermes: v0.10.0 (2026.4.16)
  • OS: Linux with systemd-managed hermes-gateway.service
  • Service policy: Restart=on-failure
  • Platforms configured and previously working: Telegram / DingTalk / Weixin (WeChat)

Reproduction Path

  1. Run Hermes under systemd as hermes-gateway.service.
  2. Trigger a gateway self-restart/update from a live messaging session (in the observed case, via Telegram).
  3. Hermes begins draining active work before restart.
  4. The gateway process exits cleanly.
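  5. systemd does not restart it, because Restart=on-failure only relaunches a unit after an unclean exit (non-zero exit code, fatal signal, timeout, or watchdog expiry) and this exit counted as successful.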
  6. Subsequent Telegram / DingTalk / WeChat messages get no response until a manual systemctl start hermes-gateway.
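
Step 5 can be illustrated with a deliberately simplified model of how systemd's Restart= setting reacts to a plain exit code. This is only a sketch: the function name is made up for illustration, and it ignores signals, timeouts, watchdog events, and the SuccessExitStatus= / RestartPreventExitStatus= / RestartForceExitStatus= overrides.

```python
# Simplified model of systemd's Restart= decision for a process that ends
# with a plain exit code. Signals, timeouts and watchdog events are ignored.

def systemd_would_restart(policy: str, exit_code: int) -> bool:
    """Return True if a unit with this Restart= policy would be relaunched."""
    clean_exit = exit_code == 0
    if policy == "always":
        return True
    if policy == "on-success":
        return clean_exit
    if policy == "on-failure":
        return not clean_exit
    # "no", "on-abnormal", "on-abort" and "on-watchdog" never restart
    # on an ordinary exit code, clean or not.
    return False


if __name__ == "__main__":
    # The reported case: clean exit under Restart=on-failure.
    print(systemd_would_restart("on-failure", 0))
```

Under this model, a gateway that drains and then exits with status 0 is never relaunched by an on-failure unit, which matches the behavior reported here.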

Observed Evidence

On the affected host, the persisted state file still showed restart/drain intent after the gateway had disappeared:

{
  "gateway_state": "draining",
  "restart_requested": true,
  "active_agents": 1
}

At the same time, systemd no longer had a running Hermes gateway process.

This means the restart flow had begun, but the service manager did not bring the gateway back.
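
For triage, the same condition could be detected automatically. The sketch below is an assumption-laden example, not Hermes code: the helper names are hypothetical, the state is passed in as an already-parsed dict because the state file location is not given here, and the unit name is taken from this report.

```python
import json
import subprocess


def unit_is_active(unit: str = "hermes-gateway.service") -> bool:
    """`systemctl is-active --quiet` exits 0 only when the unit is active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def is_stale_drain(state: dict, unit_active: bool) -> bool:
    """True when the state still records a pending restart but nothing runs."""
    return (
        state.get("gateway_state") == "draining"
        and state.get("restart_requested") is True
        and not unit_active
    )


# The state observed on the affected host, as quoted above.
observed = json.loads(
    '{"gateway_state": "draining", "restart_requested": true, "active_agents": 1}'
)

if __name__ == "__main__":
    if is_stale_drain(observed, unit_is_active()):
        print("gateway is stuck: restart was requested but the unit is not active")
```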

Expected Behavior

Any self-restart/update path initiated by Hermes while running under systemd should reliably leave the unit running again afterward.

That can be implemented in one of several ways, for example:

  • route all self-restart flows through a service-managed restart path
  • exit with a dedicated restart code that the unit treats as restartable
  • or actively verify that the service has become active again before considering restart complete
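
The second option might look roughly like the sketch below. The exit code value and function names are hypothetical choices for illustration, not existing Hermes behavior. With Restart=on-failure any non-zero status is already treated as a failure and restarted; systemd also offers RestartForceExitStatus= for units that want a specific code to force a restart regardless of the Restart= policy.

```python
import sys

# Hypothetical code reserved for "exit so the service manager restarts me".
# 75 is EX_TEMPFAIL in sysexits.h; any agreed non-zero value would do.
RESTART_EXIT_CODE = 75


def exit_code_for_shutdown(restart_requested: bool) -> int:
    """Pick the process exit status once draining has finished."""
    return RESTART_EXIT_CODE if restart_requested else 0


def finish_drain_and_exit(restart_requested: bool) -> None:
    """Terminate the process with a status the unit can act on."""
    sys.exit(exit_code_for_shutdown(restart_requested))
```

A plain operator-initiated stop would still exit 0 and stay stopped, while a self-requested restart would surface as a restartable status instead of a clean exit.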

Actual Behavior

The gateway can exit successfully during restart/drain, and because the unit is configured with Restart=on-failure, systemd does not relaunch it.

Why this seems distinct from related issues

This looks adjacent to, but not fully covered by, a few earlier reports/fixes:

However, this observed failure still happened on v0.10.0 (2026.4.16) in a real deployment, and the visible symptom was:

  • restart requested
  • gateway entered draining
  • process exited cleanly
  • service stayed dead

So there may still be at least one remaining or regressed self-restart path that bypasses the intended service-managed recovery logic.

User Impact

From the operator's perspective this looks like:

  • the bot was working
  • a restart/update was triggered from chat
  • after that, Hermes became silent on all channels
  • no obvious user-facing error explains that the gateway never came back

Suggested Direction

Audit every restart entrypoint used by:

  • /restart
  • upgrade/update flows
  • any gateway self-managed restart helper

and ensure that under systemd they all converge on a restart mechanism that guarantees the service becomes active again (or emits a loud failure if it does not).
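
A minimal sketch of the "verify or fail loudly" half of that idea follows. All names are hypothetical, and it assumes the check runs from something that outlives the gateway process (for example a detached helper), since the exiting gateway cannot observe its own replacement.

```python
import subprocess
import time
from typing import Callable


def systemctl_is_active(unit: str) -> bool:
    """`systemctl is-active --quiet` exits 0 only when the unit is active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def wait_until_active(
    unit: str,
    is_active: Callable[[str], bool] = systemctl_is_active,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the unit reports active, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_active(unit):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def verify_restart(unit: str, **kwargs) -> None:
    """Raise instead of staying silent if the unit never comes back."""
    if not wait_until_active(unit, **kwargs):
        raise RuntimeError(f"{unit} did not become active after restart")
```

The probe and sleep functions are injectable so the polling logic can be exercised without a real systemd; the raised error is the place where an operator-visible alert would hook in.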

Metadata

Labels

  • P1 (High: major feature broken, no workaround)
  • comp/gateway (Gateway runner, session dispatch, delivery)
  • type/bug (Something isn't working)
