Summary
On Docker (s6-overlay) deployments, after a container restart or image upgrade (docker compose up -d --force-recreate), the gateway does not auto-start. Messaging channels (WeChat/Telegram/etc.) go silently dark while the CLI still works. Root cause: every shutdown path — including the SIGTERM a supervisor/container sends on stop/upgrade — unconditionally persists gateway_state=stopped, and container_boot then treats that as an explicit user stop and refuses to bring the gateway back up.
Environment
- Hermes Agent 0.16.0, official Docker image (s6-overlay), hermes dashboard + gateway in one container, HERMES_HOME on a persistent volume.
Repro
- Gateway running (gateway_state=running).
- docker compose up -d --force-recreate (or any image upgrade / container restart) sends SIGTERM to the gateway process.
- New container boots; container_boot reads gateway_state=stopped, registers the s6 service down → gateway never starts.
- Messaging channels stay down; hermes gateway status shows "not running". No error is surfaced to the user.
Root cause
gateway/run.py _stop_impl() ends with an unconditional:
self._update_runtime_status("stopped", self._exit_reason) # run.py:5955 (0.16.0)
This runs for every shutdown, including signal-initiated ones (SIGTERM from s6/Docker on restart/upgrade). hermes_cli/container_boot.py intentionally preserves an explicit stopped across restarts ("explicit stopped/failed states keep winning") to respect a user who ran hermes gateway stop. But it cannot distinguish:
- user-requested stop (hermes gateway stop): should persist stopped; vs
- signal/container-initiated stop (SIGTERM on upgrade/restart): desired state is still "running"; should auto-recover.
Both currently write stopped, so a routine upgrade is misread as a deliberate stop and the gateway stays down.
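The ambiguity can be reproduced with a small model. This is a sketch, not the Hermes code: persist_state, stop_gateway and should_autostart are hypothetical stand-ins for _update_runtime_status, _stop_impl() and the container_boot decision. The JSON state file layout and the "no state file means start" default are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def persist_state(state_file: Path, state: str, reason: str) -> None:
    """Hypothetical stand-in for _update_runtime_status()."""
    state_file.write_text(json.dumps({"gateway_state": state, "exit_reason": reason}))


def stop_gateway(state_file: Path, reason: str) -> None:
    """Models the reported behaviour: every shutdown path writes 'stopped'."""
    persist_state(state_file, "stopped", reason)


def should_autostart(state_file: Path) -> bool:
    """Models container_boot: an explicit stopped/failed state keeps winning."""
    if not state_file.exists():
        return True  # assumption: a fresh volume starts the gateway
    state = json.loads(state_file.read_text())["gateway_state"]
    return state not in ("stopped", "failed")


def boot_decision_after(reason: str) -> bool:
    """Run the gateway, stop it for `reason`, then ask whether boot would restart it."""
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "gateway_state.json"
        persist_state(state_file, "running", "")
        stop_gateway(state_file, reason)
        return should_autostart(state_file)


user_stop = boot_decision_after("user ran hermes gateway stop")
sigterm_stop = boot_decision_after("SIGTERM from supervisor")
print(user_stop, sigterm_stop)  # both False: the two cases look identical at boot
```

In this model the exit reason is recorded but never consulted at boot, which is why the upgrade case and the deliberate stop produce the same decision.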
Relationship to #39381
This sits one layer above the s6 down-marker issue (#39381): container_boot decides whether to lay down the s6 down marker based on this gateway_state. Fixing the state semantics prevents the upgrade case from ever reaching the down-marker path.
Possible fixes (seeking direction before a PR)
- In the shutdown path, only persist stopped for user-requested stops; for signal-initiated shutdown, leave the desired state untouched (or write a distinct interrupted/crashed runtime state that container_boot treats as "recover").
- Or have container_boot distinguish "process was signaled/interrupted" from "user explicitly stopped".
- Or track a persistent desired_state (running/stopped), set only by explicit start/stop commands, separate from the transient runtime status.
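To make the third option concrete, here is a minimal sketch of a state file that separates desired state from runtime status. Everything in it is hypothetical: the function names, the runtime_status key, the JSON layout and the default of desired_state=running when no file exists are assumptions for discussion, not the existing Hermes API.

```python
import json
import tempfile
from pathlib import Path

# Assumed default when no state has been written yet (hypothetical).
DEFAULT_STATE = {"desired_state": "running", "runtime_status": "unknown"}


def load_state(state_file: Path) -> dict:
    if state_file.exists():
        return json.loads(state_file.read_text())
    return dict(DEFAULT_STATE)


def save_state(state_file: Path, state: dict) -> None:
    state_file.write_text(json.dumps(state))


def set_desired_state(state_file: Path, desired: str) -> None:
    """Called only by explicit start/stop commands."""
    state = load_state(state_file)
    state["desired_state"] = desired
    save_state(state_file, state)


def set_runtime_status(state_file: Path, status: str) -> None:
    """Called on every shutdown, including signal-initiated ones."""
    state = load_state(state_file)
    state["runtime_status"] = status
    save_state(state_file, state)


def should_autostart(state_file: Path) -> bool:
    """Boot consults only the desired state, never the transient status."""
    return load_state(state_file)["desired_state"] == "running"


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "gateway_state.json"

    set_desired_state(state_file, "running")   # explicit start
    set_runtime_status(state_file, "stopped")  # SIGTERM during an upgrade
    recovers_after_sigterm = should_autostart(state_file)

    set_desired_state(state_file, "stopped")   # explicit user stop
    set_runtime_status(state_file, "stopped")
    stays_down_after_user_stop = not should_autostart(state_file)
```

With this split, a signal-initiated shutdown can keep writing stopped as runtime status without affecting the boot decision, and only an explicit stop command keeps the gateway down across restarts.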
Happy to implement once there's agreement on the preferred shape.