
Gateway does not auto-start after container restart/upgrade — signal-initiated shutdown persists gateway_state=stopped #42675

@ZishiW

Summary

On Docker (s6-overlay) deployments, after a container restart or image upgrade (docker compose up -d --force-recreate), the gateway does not auto-start. Messaging channels (WeChat/Telegram/etc.) go silently dark while the CLI still works. Root cause: every shutdown path — including the SIGTERM a supervisor/container sends on stop/upgrade — unconditionally persists gateway_state=stopped, and container_boot then treats that as an explicit user stop and refuses to bring the gateway back up.

Environment

  • Hermes Agent 0.16.0, official Docker image (s6-overlay), hermes dashboard + gateway in one container, HERMES_HOME on a persistent volume.

Repro

  1. Gateway running (gateway_state=running).
  2. docker compose up -d --force-recreate (or any image upgrade / container restart) sends SIGTERM to the gateway process.
  3. New container boots; container_boot reads gateway_state=stopped and registers the s6 service as down, so the gateway never starts.
  4. Messaging channels stay down; hermes gateway status shows "not running". No error is surfaced to the user.

Root cause

In gateway/run.py, _stop_impl() ends with an unconditional status write:

self._update_runtime_status("stopped", self._exit_reason)   # run.py:5955 (0.16.0)

This runs for every shutdown, including signal-initiated ones (SIGTERM from s6/Docker on restart/upgrade). hermes_cli/container_boot.py intentionally preserves an explicit stopped across restarts ("explicit stopped/failed states keep winning") to respect a user who ran hermes gateway stop. But it cannot distinguish:

  • user-requested stop (hermes gateway stop) — should persist stopped; vs
  • signal/container-initiated stop (SIGTERM on upgrade/restart) — desired state is still "running"; should auto-recover.

Both currently write stopped, so a routine upgrade is misread as a deliberate stop and the gateway stays down.
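To make the conflation concrete, here is a minimal model of the behavior described above. It is not the Hermes code: the JSON state file, `write_state`, `shutdown`, `boot_should_start`, and `simulate` are hypothetical stand-ins for `_update_runtime_status`, `_stop_impl()`, and the `container_boot` decision, and the file layout is assumed.

```python
import json
import tempfile
from pathlib import Path


def write_state(state_file: Path, state: str, reason: str) -> None:
    """Persist the gateway state (hypothetical stand-in for _update_runtime_status)."""
    state_file.write_text(json.dumps({"gateway_state": state, "reason": reason}))


def shutdown(state_file: Path, reason: str) -> None:
    """Model of the reported behavior: every shutdown path writes 'stopped'."""
    write_state(state_file, "stopped", reason)


def boot_should_start(state_file: Path) -> bool:
    """Model of the boot decision: explicit stopped/failed states keep winning."""
    if not state_file.exists():
        return True  # assumption: no recorded state means "start"
    state = json.loads(state_file.read_text())["gateway_state"]
    return state not in ("stopped", "failed")


def simulate(reason: str) -> bool:
    """Run, shut down for `reason`, boot again; return whether the gateway would start."""
    with tempfile.TemporaryDirectory() as home:
        state_file = Path(home) / "gateway_state.json"
        write_state(state_file, "running", "started")
        shutdown(state_file, reason)
        return boot_should_start(state_file)


if __name__ == "__main__":
    print("start after user stop:", simulate("hermes gateway stop"))
    print("start after SIGTERM on upgrade:", simulate("SIGTERM from s6/Docker"))
```

In this model both calls return `False`: the reason string is recorded, but the boot decision only looks at the state, so the upgrade case is indistinguishable from a deliberate stop.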

Relationship to #39381

This sits one layer above the s6 down-marker issue (#39381): container_boot decides whether to lay down the s6 down marker based on this gateway_state. Fixing the state semantics prevents the upgrade case from ever reaching the down-marker path.

Possible fixes (seeking direction before a PR)

  1. In the shutdown path, only persist stopped for user-requested stops; for signal-initiated shutdown, leave the desired state untouched (or write a distinct interrupted/crashed runtime state that container_boot treats as "recover").
  2. Or have container_boot distinguish "process was signaled/interrupted" from "user explicitly stopped".
  3. Or track a persistent desired_state (running/stopped), set only by explicit start/stop commands, separate from the transient runtime status.
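As a starting point for discussion, option 3 could look roughly like the sketch below. Everything here is an assumption rather than existing Hermes code: the `GatewayState` class, its JSON file, the `interrupted` runtime status (borrowed from option 1), and the choice to default to `running` when no state has been recorded.

```python
import json
import tempfile
from pathlib import Path


class GatewayState:
    """Hypothetical store that keeps desired state apart from runtime status."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        # Assumption: with no recorded state, the gateway should start.
        return {"desired_state": "running", "runtime_status": "unknown"}

    def _save(self, data: dict) -> None:
        self.path.write_text(json.dumps(data))

    def set_desired(self, desired: str) -> None:
        """Called only by explicit start/stop commands."""
        data = self._load()
        data["desired_state"] = desired
        self._save(data)

    def set_runtime(self, status: str) -> None:
        """Called by the gateway process; never touches desired_state."""
        data = self._load()
        data["runtime_status"] = status
        self._save(data)

    def boot_should_start(self) -> bool:
        """Boot decision reads the desired state only."""
        return self._load()["desired_state"] == "running"


def simulate(user_requested_stop: bool) -> bool:
    """Return whether the gateway would start on the next container boot."""
    with tempfile.TemporaryDirectory() as home:
        state = GatewayState(Path(home) / "gateway_state.json")
        state.set_desired("running")  # explicit start
        state.set_runtime("running")
        if user_requested_stop:
            state.set_desired("stopped")  # explicit stop
            state.set_runtime("stopped")
        else:
            state.set_runtime("interrupted")  # SIGTERM on restart/upgrade
        return state.boot_should_start()
```

With this split, `simulate(True)` returns `False` (a user stop is respected) and `simulate(False)` returns `True` (a signal-initiated shutdown recovers on boot). The runtime status remains available for `hermes gateway status` style reporting without influencing the boot decision.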

Happy to implement once there's agreement on the preferred shape.

    Labels

    P1 (High: major feature broken, no workaround)
    area/docker (Docker image, Compose, packaging)
    backend/docker (Docker container execution)
    comp/gateway (Gateway runner, session dispatch, delivery)
    type/bug (Something isn't working)
