Skip to content

fix(gateway): auto-start after container restart via planned-stop marker (#42675)#43236

Merged
benbarclay merged 3 commits into
mainfrom
fix/gateway-autostart-planned-stop-marker
Jun 10, 2026
Merged

fix(gateway): auto-start after container restart via planned-stop marker (#42675)#43236
benbarclay merged 3 commits into
mainfrom
fix/gateway-autostart-planned-stop-marker

Conversation

@benbarclay

Copy link
Copy Markdown
Collaborator

Problem (#42675)

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container restarts or upgrades (docker compose up -d --force-recreate), s6 sends the gateway a plain SIGTERM. The teardown path (_stop_impl) ended with an unconditional:

self._update_runtime_status("stopped", self._exit_reason)   # gateway/run.py

That persists gateway_state=stopped to the volume. On the next boot, container_boot.py reads it and only auto-starts gateways whose last state was running (_AUTOSTART_STATES = frozenset({"running"})). So after a routine restart the gateway stays down — messaging channels (Telegram/WeChat/etc.) silently go dark, the CLI still works, and no error is surfaced.

The container_boot read-side is correct by design (it deliberately keeps stopped/startup_failed down to avoid crash-loops). The bug is on the write side: a container-restart SIGTERM should never have persisted stopped, because it isn't an explicit user stop.

Approach — use the existing planned-stop marker, not signal inference

The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self). hermes gateway stop, systemd/launchd ExecStop, and Ctrl+C all write a marker before signalling, so shutdown_signal_handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM-killer, bare kill) is signal-initiated.

This PR wires that existing classification through to the state persist — deliberately avoiding the unreliable "infer intent from the signal" approach (cf. the analysis in #42517, which correctly argues that systemctl stop, an external kill, and an OOM-kill are indistinguishable from the signal alone).

Changes

gateway/run.py

  • New GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch (the same branch that already sets the local flag for exit-code purposes).
  • In _stop_impl, a signal-initiated, non-restart teardown now persists running instead of stopped — preserving run-intent and overwriting the mid-shutdown draining marker (which is also not in _AUTOSTART_STATES, so simply skipping the write was insufficient — see E2E note). Operator stops and restarts persist stopped as before.

hermes_cli/service_manager.py

  • S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (parsed from s6-svstat) before s6-svc -d, so an in-container hermes gateway stop is classified as intentional — parity with the systemd/launchd/host stop paths, which already mark. Best-effort: a marker-write failure falls back to the safe signal-initiated path.

Tests

  • test_gateway_shutdown.py: persist-decision table — signal-initiated → running (final write), operator-initiated → stopped, restart → stopped.
  • test_service_manager.py: stop() writes the marker for the svstat PID; _supervised_pid parse; marker-write-failure tolerance.
  • The signal→running and s6-marker tests fail without the respective source change (verified by reverting each).

Real E2E (per behavioral-change requirement)

Built the image from this branch and ran a live container:

  • Started an unsupervised gateway (gateway_state=running, pid N).
  • Sent a plain SIGTERM (no marker — the container-restart case).
  • Confirmed the gateway exited AND gateway_state stayed running (fresh updated_at post-kill; shutdown-context log confirms the signal path ran). Before the fix this left stopped/draining.
  • Existing real container-restart suite (tests/docker/test_container_restart.py) still green against the built image.

Review note

This touches the shared teardown path (_stop_impl / shutdown_signal_handler), so it changes terminal-state persistence for all deployments (bare-metal systemd, OOM cases), not only containers — flagging for @teknium1 review accordingly. The change is mechanism-reuse (the marker primitive), not new inference.

Supersedes the inference-based attempts #42789 and #42740 (which key off "was it a signal?" and so misclassify OOM/external-kill), and is the s6/Docker counterpart to the systemd-focused #42555. Relates to #42517.

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).

The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.

This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in
  shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
  signal-initiated (non-restart) teardown now persists "running" instead
  of "stopped" — preserving the operator's run-intent and overwriting the
  mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
  Operator stops and restarts persist "stopped" as before.

- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
  marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
  so an in-container `hermes gateway stop` is correctly classified as
  intentional (parity with the systemd/launchd/host stop paths, which
  already mark). Best-effort: a marker-write failure falls back to the
  safe signal-initiated path.

Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/gateway-autostart-planned-stop-marker vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10624 on HEAD, 10620 on base (🆕 +4)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5564 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery area/docker Docker image, Compose, packaging backend/docker Docker container execution P1 High — major feature broken, no workaround labels Jun 10, 2026
…rom container-kill

The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of
#42675 confusion: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and
unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.
@liuhao1024

Copy link
Copy Markdown
Contributor

Verification review — reviewed the diff (6 files, +287/-3) and test coverage.

The design correctly distinguishes operator-initiated stops from unexpected external signals:

  • Planned-stop marker written by S6ServiceManager.stop() before s6-svc -d ensures the gateway's signal handler classifies the SIGTERM as intentional, persisting gateway_state=stopped.
  • Signal-initiated shutdown (container restart, OOM, bare kill) persists gateway_state=running instead, so container_boot auto-starts the gateway on the next boot — the core fix for Gateway does not auto-start after container restart/upgrade — signal-initiated shutdown persists gateway_state=stopped #42675.
  • Restart exception: when self._restart_requested is True, the normal stopped persist applies because the restarting process brings the gateway back up itself. This is correct.
  • Best-effort marker write in stop() with try/except ensures a marker-write failure never blocks the stop command.
  • Test coverage is thorough: signal-initiated (persists running), operator-initiated (persists stopped), and restart (persists stopped despite signal flag) are all covered.

No issues found. Clean implementation.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Looks Good:

  • Fixes gateway auto-start after container restart by using a planned-stop marker. When the gateway is stopped gracefully (SIGTERM with planned-stop flag), it records this in a marker file; on restart the service manager detects the planned stop and does not auto-restart the gateway, preventing orphaned processes.
  • Handles the case where the gateway was stopped by the OS (no marker) — in that case auto-restart is appropriate.
  • 6 files, +287/-3. Changes include gateway/run.py, service_manager.py, tests, and docs.
  • Good test coverage with restart_test_helpers and existing shutdown tests.

Reviewed by Hermes Agent

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.

Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered — the #42675 bug) and PASS against
the fixed image (prior_state=running action=started).
@benbarclay benbarclay merged commit 5cf6e28 into main Jun 10, 2026
24 checks passed
@benbarclay benbarclay deleted the fix/gateway-autostart-planned-stop-marker branch June 10, 2026 04:01
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
…ker (NousResearch#42675) (NousResearch#43236)

* fix(gateway): auto-start after container restart via planned-stop marker

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue NousResearch#42675).

The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.

This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in
  shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
  signal-initiated (non-restart) teardown now persists "running" instead
  of "stopped" — preserving the operator's run-intent and overwriting the
  mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
  Operator stops and restarts persist "stopped" as before.

- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
  marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
  so an in-container `hermes gateway stop` is correctly classified as
  intentional (parity with the systemd/launchd/host stop paths, which
  already mark). Best-effort: a marker-write failure falls back to the
  safe signal-initiated path.

Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.

* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill

The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of
NousResearch#42675 confusion: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and
unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.

* test(docker): real-restart autostart E2E for NousResearch#42675

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.

Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered — the NousResearch#42675 bug) and PASS against
the fixed image (prior_state=running action=started).
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
…ker (#42675) (#43236)

* fix(gateway): auto-start after container restart via planned-stop marker

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).

The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.

This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:

- run.py: GatewayRunner._signal_initiated_shutdown, set in
  shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
  signal-initiated (non-restart) teardown now persists "running" instead
  of "stopped" — preserving the operator's run-intent and overwriting the
  mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
  Operator stops and restarts persist "stopped" as before.

- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
  marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
  so an in-container `hermes gateway stop` is correctly classified as
  intentional (parity with the systemd/launchd/host stop paths, which
  already mark). Best-effort: a marker-write failure falls back to the
  safe signal-initiated path.

Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.

* docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill

The per-profile-supervision section described the autostart-across-restart
contract as "running gateways come back, stopped stay stopped" without
spelling out what records 'stopped'. That contract was the source of
#42675 confusion: users expected a restart to bring the gateway back and
it didn't. With the write-side fix, only an explicit `hermes gateway stop`
records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and
unexpected exits) leave the state 'running' so the gateway auto-starts.
Make that distinction explicit in both the multi-profile and
per-profile-supervision sections.

* test(docker): real-restart autostart E2E for #42675

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp:
a live s6-supervised gateway is killed by an actual `docker restart`
SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must
auto-start on the next boot. Exercises the WRITE side of the fix that the
existing stamp-based tests bypass.

Verified to FAIL against an origin/main image (reconciler logs
prior_state=stopped action=registered — the #42675 bug) and PASS against
the fixed image (prior_state=running action=started).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docker Docker image, Compose, packaging backend/docker Docker container execution comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants