fix(gateway): auto-start after container restart via planned-stop marker (#42675)#43236
Merged
Merged
Conversation
On Docker (s6-overlay), the gateway runs as a dynamically-registered s6
service. When the container stops/restarts/upgrades, s6 sends the gateway
a plain SIGTERM. The shutdown path (_stop_impl) ended with an
unconditional _update_runtime_status("stopped"), persisting
gateway_state=stopped to the volume. container_boot.py reads that on the
next boot and only auto-starts gateways whose last state was "running"
(_AUTOSTART_STATES) — so after a routine `docker compose up
--force-recreate` the gateway stays down and messaging channels silently
go dark, with no error surfaced (issue #42675).
The codebase already distinguishes intentional stops from unexpected
signals via the planned-stop marker (write_planned_stop_marker /
consume_planned_stop_marker_for_self): `hermes gateway stop`,
systemd/launchd ExecStop, and Ctrl+C write a marker before signalling,
so the handler classifies them as planned. An unmarked SIGTERM
(container/s6 restart, OOM, bare kill) is signal-initiated.
This wires that existing classification through to the state persist,
rather than adding unreliable signal-source inference:
- run.py: GatewayRunner._signal_initiated_shutdown, set in
shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a
signal-initiated (non-restart) teardown now persists "running" instead
of "stopped" — preserving the operator's run-intent and overwriting the
mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot.
Operator stops and restarts persist "stopped" as before.
- service_manager.py: S6ServiceManager.stop() now writes the planned-stop
marker for the supervised PID (read from s6-svstat) before `s6-svc -d`,
so an in-container `hermes gateway stop` is correctly classified as
intentional (parity with the systemd/launchd/host stop paths, which
already mark). Best-effort: a marker-write failure falls back to the
safe signal-initiated path.
Tests: shutdown persist-decision table (signal→running, operator→stopped,
restart→stopped), s6 stop marker write + svstat PID parse + failure
tolerance. The signal→running and s6-marker tests fail without the
respective source change. Verified end-to-end against a container built
from this branch: an unmarked SIGTERM to the live gateway leaves
gateway_state=running (shutdown-context log confirms signal path);
existing real container-restart suite still green.
Contributor
🔎 Lint report:
|
…rom container-kill The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of #42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections.
Contributor
|
Verification review — reviewed the diff (6 files, +287/-3) and test coverage. The design correctly distinguishes operator-initiated stops from unexpected external signals:
No issues found. Clean implementation. |
tonydwb
approved these changes
Jun 10, 2026
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved
Looks Good:
- Fixes gateway auto-start after container restart by using a planned-stop marker. When the gateway is stopped gracefully (SIGTERM with planned-stop flag), it records this in a marker file; on restart the service manager detects the planned stop and does not auto-restart the gateway, preventing orphaned processes.
- Handles the case where the gateway was stopped by the OS (no marker) — in that case auto-restart is appropriate.
- 6 files, +287/-3. Changes include gateway/run.py, service_manager.py, tests, and docs.
- Good test coverage with restart_test_helpers and existing shutdown tests.
Reviewed by Hermes Agent
Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered — the #42675 bug) and PASS against the fixed image (prior_state=running action=started).
This was referenced Jun 10, 2026
Closed
Open
changman
pushed a commit
to changman/hermes-agent
that referenced
this pull request
Jun 10, 2026
…ker (NousResearch#42675) (NousResearch#43236) * fix(gateway): auto-start after container restart via planned-stop marker On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container stops/restarts/upgrades, s6 sends the gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an unconditional _update_runtime_status("stopped"), persisting gateway_state=stopped to the volume. container_boot.py reads that on the next boot and only auto-starts gateways whose last state was "running" (_AUTOSTART_STATES) — so after a routine `docker compose up --force-recreate` the gateway stays down and messaging channels silently go dark, with no error surfaced (issue NousResearch#42675). The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self): `hermes gateway stop`, systemd/launchd ExecStop, and Ctrl+C write a marker before signalling, so the handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM, bare kill) is signal-initiated. This wires that existing classification through to the state persist, rather than adding unreliable signal-source inference: - run.py: GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a signal-initiated (non-restart) teardown now persists "running" instead of "stopped" — preserving the operator's run-intent and overwriting the mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot. Operator stops and restarts persist "stopped" as before. - service_manager.py: S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (read from s6-svstat) before `s6-svc -d`, so an in-container `hermes gateway stop` is correctly classified as intentional (parity with the systemd/launchd/host stop paths, which already mark). Best-effort: a marker-write failure falls back to the safe signal-initiated path. Tests: shutdown persist-decision table (signal→running, operator→stopped, restart→stopped), s6 stop marker write + svstat PID parse + failure tolerance. The signal→running and s6-marker tests fail without the respective source change. Verified end-to-end against a container built from this branch: an unmarked SIGTERM to the live gateway leaves gateway_state=running (shutdown-context log confirms signal path); existing real container-restart suite still green. * docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of NousResearch#42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections. * test(docker): real-restart autostart E2E for NousResearch#42675 Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered — the NousResearch#42675 bug) and PASS against the fixed image (prior_state=running action=started).
alt-glitch
pushed a commit
that referenced
this pull request
Jun 14, 2026
…ker (#42675) (#43236) * fix(gateway): auto-start after container restart via planned-stop marker On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container stops/restarts/upgrades, s6 sends the gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an unconditional _update_runtime_status("stopped"), persisting gateway_state=stopped to the volume. container_boot.py reads that on the next boot and only auto-starts gateways whose last state was "running" (_AUTOSTART_STATES) — so after a routine `docker compose up --force-recreate` the gateway stays down and messaging channels silently go dark, with no error surfaced (issue #42675). The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self): `hermes gateway stop`, systemd/launchd ExecStop, and Ctrl+C write a marker before signalling, so the handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM, bare kill) is signal-initiated. This wires that existing classification through to the state persist, rather than adding unreliable signal-source inference: - run.py: GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a signal-initiated (non-restart) teardown now persists "running" instead of "stopped" — preserving the operator's run-intent and overwriting the mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot. Operator stops and restarts persist "stopped" as before. - service_manager.py: S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (read from s6-svstat) before `s6-svc -d`, so an in-container `hermes gateway stop` is correctly classified as intentional (parity with the systemd/launchd/host stop paths, which already mark). Best-effort: a marker-write failure falls back to the safe signal-initiated path. Tests: shutdown persist-decision table (signal→running, operator→stopped, restart→stopped), s6 stop marker write + svstat PID parse + failure tolerance. The signal→running and s6-marker tests fail without the respective source change. Verified end-to-end against a container built from this branch: an unmarked SIGTERM to the live gateway leaves gateway_state=running (shutdown-context log confirms signal path); existing real container-restart suite still green. * docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of #42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections. * test(docker): real-restart autostart E2E for #42675 Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered — the #42675 bug) and PASS against the fixed image (prior_state=running action=started).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem (#42675)
On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container restarts or upgrades (
docker compose up -d --force-recreate), s6 sends the gateway a plain SIGTERM. The teardown path (_stop_impl) ended with an unconditional:That persists
gateway_state=stoppedto the volume. On the next boot,container_boot.pyreads it and only auto-starts gateways whose last state wasrunning(_AUTOSTART_STATES = frozenset({"running"})). So after a routine restart the gateway stays down — messaging channels (Telegram/WeChat/etc.) silently go dark, the CLI still works, and no error is surfaced.The
container_bootread-side is correct by design (it deliberately keepsstopped/startup_faileddown to avoid crash-loops). The bug is on the write side: a container-restart SIGTERM should never have persistedstopped, because it isn't an explicit user stop.Approach — use the existing planned-stop marker, not signal inference
The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (
write_planned_stop_marker/consume_planned_stop_marker_for_self).hermes gateway stop, systemd/launchdExecStop, and Ctrl+C all write a marker before signalling, soshutdown_signal_handlerclassifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM-killer, barekill) is signal-initiated.This PR wires that existing classification through to the state persist — deliberately avoiding the unreliable "infer intent from the signal" approach (cf. the analysis in #42517, which correctly argues that
systemctl stop, an externalkill, and an OOM-kill are indistinguishable from the signal alone).Changes
gateway/run.pyGatewayRunner._signal_initiated_shutdown, set inshutdown_signal_handler's unmarked-signal branch (the same branch that already sets the local flag for exit-code purposes)._stop_impl, a signal-initiated, non-restart teardown now persistsrunninginstead ofstopped— preserving run-intent and overwriting the mid-shutdowndrainingmarker (which is also not in_AUTOSTART_STATES, so simply skipping the write was insufficient — see E2E note). Operator stops and restarts persiststoppedas before.hermes_cli/service_manager.pyS6ServiceManager.stop()now writes the planned-stop marker for the supervised PID (parsed froms6-svstat) befores6-svc -d, so an in-containerhermes gateway stopis classified as intentional — parity with the systemd/launchd/host stop paths, which already mark. Best-effort: a marker-write failure falls back to the safe signal-initiated path.Tests
test_gateway_shutdown.py: persist-decision table — signal-initiated →running(final write), operator-initiated →stopped, restart →stopped.test_service_manager.py:stop()writes the marker for the svstat PID;_supervised_pidparse; marker-write-failure tolerance.Real E2E (per behavioral-change requirement)
Built the image from this branch and ran a live container:
gateway_state=running, pid N).gateway_statestayedrunning(freshupdated_atpost-kill; shutdown-context log confirms the signal path ran). Before the fix this leftstopped/draining.tests/docker/test_container_restart.py) still green against the built image.Review note
This touches the shared teardown path (
_stop_impl/shutdown_signal_handler), so it changes terminal-state persistence for all deployments (bare-metal systemd, OOM cases), not only containers — flagging for @teknium1 review accordingly. The change is mechanism-reuse (the marker primitive), not new inference.Supersedes the inference-based attempts #42789 and #42740 (which key off "was it a signal?" and so misclassify OOM/external-kill), and is the s6/Docker counterpart to the systemd-focused #42555. Relates to #42517.