fix(gateway): auto-start after container restart via planned-stop marker (#42675) by benbarclay · Pull Request #43236 · NousResearch/hermes-agent

benbarclay · 2026-06-10T02:29:17Z

Problem (#42675)

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container restarts or upgrades (docker compose up -d --force-recreate), s6 sends the gateway a plain SIGTERM. The teardown path (_stop_impl) ended with an unconditional:

self._update_runtime_status("stopped", self._exit_reason)   # gateway/run.py

That persists gateway_state=stopped to the volume. On the next boot, container_boot.py reads it and only auto-starts gateways whose last state was running (_AUTOSTART_STATES = frozenset({"running"})). So after a routine restart the gateway stays down — messaging channels (Telegram/WeChat/etc.) silently go dark, the CLI still works, and no error is surfaced.

The container_boot read-side is correct by design (it deliberately keeps stopped/startup_failed down to avoid crash-loops). The bug is on the write side: a container-restart SIGTERM should never have persisted stopped, because it isn't an explicit user stop.

Approach — use the existing planned-stop marker, not signal inference

The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self). hermes gateway stop, systemd/launchd ExecStop, and Ctrl+C all write a marker before signalling, so shutdown_signal_handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM-killer, bare kill) is signal-initiated.

This PR wires that existing classification through to the state persist — deliberately avoiding the unreliable "infer intent from the signal" approach (cf. the analysis in #42517, which correctly argues that systemctl stop, an external kill, and an OOM-kill are indistinguishable from the signal alone).

Changes

gateway/run.py

New GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch (the same branch that already sets the local flag for exit-code purposes).
In _stop_impl, a signal-initiated, non-restart teardown now persists running instead of stopped — preserving run-intent and overwriting the mid-shutdown draining marker (which is also not in _AUTOSTART_STATES, so simply skipping the write was insufficient — see E2E note). Operator stops and restarts persist stopped as before.

hermes_cli/service_manager.py

S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (parsed from s6-svstat) before s6-svc -d, so an in-container hermes gateway stop is classified as intentional — parity with the systemd/launchd/host stop paths, which already mark. Best-effort: a marker-write failure falls back to the safe signal-initiated path.

Tests

test_gateway_shutdown.py: persist-decision table — signal-initiated → running (final write), operator-initiated → stopped, restart → stopped.
test_service_manager.py: stop() writes the marker for the svstat PID; _supervised_pid parse; marker-write-failure tolerance.
The signal→running and s6-marker tests fail without the respective source change (verified by reverting each).

Real E2E (per behavioral-change requirement)

Built the image from this branch and ran a live container:

Started an unsupervised gateway (gateway_state=running, pid N).
Sent a plain SIGTERM (no marker — the container-restart case).
Confirmed the gateway exited AND gateway_state stayed running (fresh updated_at post-kill; shutdown-context log confirms the signal path ran). Before the fix this left stopped/draining.
Existing real container-restart suite (tests/docker/test_container_restart.py) still green against the built image.

Review note

This touches the shared teardown path (_stop_impl / shutdown_signal_handler), so it changes terminal-state persistence for all deployments (bare-metal systemd, OOM cases), not only containers — flagging for @teknium1 review accordingly. The change is mechanism-reuse (the marker primitive), not new inference.

Supersedes the inference-based attempts #42789 and #42740 (which key off "was it a signal?" and so misclassify OOM/external-kill), and is the s6/Docker counterpart to the systemd-focused #42555. Relates to #42517.

On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container stops/restarts/upgrades, s6 sends the gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an unconditional _update_runtime_status("stopped"), persisting gateway_state=stopped to the volume. container_boot.py reads that on the next boot and only auto-starts gateways whose last state was "running" (_AUTOSTART_STATES) — so after a routine `docker compose up --force-recreate` the gateway stays down and messaging channels silently go dark, with no error surfaced (issue #42675). The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self): `hermes gateway stop`, systemd/launchd ExecStop, and Ctrl+C write a marker before signalling, so the handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM, bare kill) is signal-initiated. This wires that existing classification through to the state persist, rather than adding unreliable signal-source inference: - run.py: GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a signal-initiated (non-restart) teardown now persists "running" instead of "stopped" — preserving the operator's run-intent and overwriting the mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot. Operator stops and restarts persist "stopped" as before. - service_manager.py: S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (read from s6-svstat) before `s6-svc -d`, so an in-container `hermes gateway stop` is correctly classified as intentional (parity with the systemd/launchd/host stop paths, which already mark). Best-effort: a marker-write failure falls back to the safe signal-initiated path. Tests: shutdown persist-decision table (signal→running, operator→stopped, restart→stopped), s6 stop marker write + svstat PID parse + failure tolerance. The signal→running and s6-marker tests fail without the respective source change. Verified end-to-end against a container built from this branch: an unmarked SIGTERM to the live gateway leaves gateway_state=running (shutdown-context log confirms signal path); existing real container-restart suite still green.

github-actions · 2026-06-10T02:30:00Z

🔎 Lint report: `fix/gateway-autostart-planned-stop-marker` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 10624 on HEAD, 10620 on base (🆕 +4)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5564 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

…rom container-kill The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of #42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections.

liuhao1024 · 2026-06-10T03:08:50Z

Verification review — reviewed the diff (6 files, +287/-3) and test coverage.

The design correctly distinguishes operator-initiated stops from unexpected external signals:

Planned-stop marker written by S6ServiceManager.stop() before s6-svc -d ensures the gateway's signal handler classifies the SIGTERM as intentional, persisting gateway_state=stopped.
Signal-initiated shutdown (container restart, OOM, bare kill) persists gateway_state=running instead, so container_boot auto-starts the gateway on the next boot — the core fix for Gateway does not auto-start after container restart/upgrade — signal-initiated shutdown persists gateway_state=stopped #42675.
Restart exception: when self._restart_requested is True, the normal stopped persist applies because the restarting process brings the gateway back up itself. This is correct.
Best-effort marker write in stop() with try/except ensures a marker-write failure never blocks the stop command.
Test coverage is thorough: signal-initiated (persists running), operator-initiated (persists stopped), and restart (persists stopped despite signal flag) are all covered.

No issues found. Clean implementation.

tonydwb

Code Review Summary

Verdict: Approved

Looks Good:

Fixes gateway auto-start after container restart by using a planned-stop marker. When the gateway is stopped gracefully (SIGTERM with planned-stop flag), it records this in a marker file; on restart the service manager detects the planned stop and does not auto-restart the gateway, preventing orphaned processes.
Handles the case where the gateway was stopped by the OS (no marker) — in that case auto-restart is appropriate.
6 files, +287/-3. Changes include gateway/run.py, service_manager.py, tests, and docs.
Good test coverage with restart_test_helpers and existing shutdown tests.

Reviewed by Hermes Agent

Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered — the #42675 bug) and PASS against the fixed image (prior_state=running action=started).

…ker (NousResearch#42675) (NousResearch#43236) * fix(gateway): auto-start after container restart via planned-stop marker On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container stops/restarts/upgrades, s6 sends the gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an unconditional _update_runtime_status("stopped"), persisting gateway_state=stopped to the volume. container_boot.py reads that on the next boot and only auto-starts gateways whose last state was "running" (_AUTOSTART_STATES) — so after a routine `docker compose up --force-recreate` the gateway stays down and messaging channels silently go dark, with no error surfaced (issue NousResearch#42675). The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self): `hermes gateway stop`, systemd/launchd ExecStop, and Ctrl+C write a marker before signalling, so the handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM, bare kill) is signal-initiated. This wires that existing classification through to the state persist, rather than adding unreliable signal-source inference: - run.py: GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a signal-initiated (non-restart) teardown now persists "running" instead of "stopped" — preserving the operator's run-intent and overwriting the mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot. Operator stops and restarts persist "stopped" as before. - service_manager.py: S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (read from s6-svstat) before `s6-svc -d`, so an in-container `hermes gateway stop` is correctly classified as intentional (parity with the systemd/launchd/host stop paths, which already mark). Best-effort: a marker-write failure falls back to the safe signal-initiated path. Tests: shutdown persist-decision table (signal→running, operator→stopped, restart→stopped), s6 stop marker write + svstat PID parse + failure tolerance. The signal→running and s6-marker tests fail without the respective source change. Verified end-to-end against a container built from this branch: an unmarked SIGTERM to the live gateway leaves gateway_state=running (shutdown-context log confirms signal path); existing real container-restart suite still green. * docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of NousResearch#42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections. * test(docker): real-restart autostart E2E for NousResearch#42675 Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered — the NousResearch#42675 bug) and PASS against the fixed image (prior_state=running action=started).

…ker (#42675) (#43236) * fix(gateway): auto-start after container restart via planned-stop marker On Docker (s6-overlay), the gateway runs as a dynamically-registered s6 service. When the container stops/restarts/upgrades, s6 sends the gateway a plain SIGTERM. The shutdown path (_stop_impl) ended with an unconditional _update_runtime_status("stopped"), persisting gateway_state=stopped to the volume. container_boot.py reads that on the next boot and only auto-starts gateways whose last state was "running" (_AUTOSTART_STATES) — so after a routine `docker compose up --force-recreate` the gateway stays down and messaging channels silently go dark, with no error surfaced (issue #42675). The codebase already distinguishes intentional stops from unexpected signals via the planned-stop marker (write_planned_stop_marker / consume_planned_stop_marker_for_self): `hermes gateway stop`, systemd/launchd ExecStop, and Ctrl+C write a marker before signalling, so the handler classifies them as planned. An unmarked SIGTERM (container/s6 restart, OOM, bare kill) is signal-initiated. This wires that existing classification through to the state persist, rather than adding unreliable signal-source inference: - run.py: GatewayRunner._signal_initiated_shutdown, set in shutdown_signal_handler's unmarked-signal branch. In _stop_impl, a signal-initiated (non-restart) teardown now persists "running" instead of "stopped" — preserving the operator's run-intent and overwriting the mid-shutdown "draining" marker so _AUTOSTART_STATES matches on reboot. Operator stops and restarts persist "stopped" as before. - service_manager.py: S6ServiceManager.stop() now writes the planned-stop marker for the supervised PID (read from s6-svstat) before `s6-svc -d`, so an in-container `hermes gateway stop` is correctly classified as intentional (parity with the systemd/launchd/host stop paths, which already mark). Best-effort: a marker-write failure falls back to the safe signal-initiated path. Tests: shutdown persist-decision table (signal→running, operator→stopped, restart→stopped), s6 stop marker write + svstat PID parse + failure tolerance. The signal→running and s6-marker tests fail without the respective source change. Verified end-to-end against a container built from this branch: an unmarked SIGTERM to the live gateway leaves gateway_state=running (shutdown-context log confirms signal path); existing real container-restart suite still green. * docs(docker): clarify gateway autostart distinguishes operator-stop from container-kill The per-profile-supervision section described the autostart-across-restart contract as "running gateways come back, stopped stay stopped" without spelling out what records 'stopped'. That contract was the source of #42675 confusion: users expected a restart to bring the gateway back and it didn't. With the write-side fix, only an explicit `hermes gateway stop` records 'stopped'; container/s6 restart SIGTERMs (incl. image upgrades and unexpected exits) leave the state 'running' so the gateway auto-starts. Make that distinction explicit in both the multi-profile and per-profile-supervision sections. * test(docker): real-restart autostart E2E for #42675 Adds test_live_gateway_autostarts_after_real_restart_without_manual_state_stamp: a live s6-supervised gateway is killed by an actual `docker restart` SIGTERM (no manual gateway_state stamp, no planned-stop marker) and must auto-start on the next boot. Exercises the WRITE side of the fix that the existing stamp-based tests bypass. Verified to FAIL against an origin/main image (reconciler logs prior_state=stopped action=registered — the #42675 bug) and PASS against the fixed image (prior_state=running action=started).

alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery area/docker Docker image, Compose, packaging backend/docker Docker container execution P1 High — major feature broken, no workaround labels Jun 10, 2026

tonydwb approved these changes Jun 10, 2026

View reviewed changes

benbarclay merged commit 5cf6e28 into main Jun 10, 2026
24 checks passed

benbarclay deleted the fix/gateway-autostart-planned-stop-marker branch June 10, 2026 04:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(gateway): auto-start after container restart via planned-stop marker (#42675)#43236

fix(gateway): auto-start after container restart via planned-stop marker (#42675)#43236
benbarclay merged 3 commits into
mainfrom
fix/gateway-autostart-planned-stop-marker

benbarclay commented Jun 10, 2026

Uh oh!

github-actions Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

liuhao1024 commented Jun 10, 2026

Uh oh!

tonydwb left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

benbarclay commented Jun 10, 2026

Problem (#42675)

Approach — use the existing planned-stop marker, not signal inference

Changes

Tests

Real E2E (per behavioral-change requirement)

Review note

Uh oh!

github-actions Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔎 Lint report: fix/gateway-autostart-planned-stop-marker vs origin/main

ruff

ty (type checker)

Uh oh!

liuhao1024 commented Jun 10, 2026

Uh oh!

tonydwb left a comment

Choose a reason for hiding this comment

Code Review Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions Bot commented Jun 10, 2026 •

edited

Loading

🔎 Lint report: `fix/gateway-autostart-planned-stop-marker` vs `origin/main`