Skip to content

fix(gateway): preserve running state on signal shutdown#42740

Closed
raywcm wants to merge 1 commit into
NousResearch:mainfrom
raywcm:fix/gateway-preserve-running-state-on-signal
Closed

fix(gateway): preserve running state on signal shutdown#42740
raywcm wants to merge 1 commit into
NousResearch:mainfrom
raywcm:fix/gateway-preserve-running-state-on-signal

Conversation

@raywcm

@raywcm raywcm commented Jun 9, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes the Docker/s6 gateway auto-start regression from #42675.

When the gateway receives an unexpected external SIGTERM, start_gateway() already exits non-zero so a service manager can revive it. However, the gateway teardown still persisted gateway_state="stopped". In Docker s6 deployments, container_boot only auto-starts profile gateway services whose previous state is running, so a container restart or image upgrade could leave messaging channels dark even though the gateway had been running before the supervisor stopped it.

This PR preserves gateway_state="running" for unexpected signal-initiated shutdowns while keeping intentional stops and planned restarts as stopped.

Related Issue

Fixes #42675

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py
    • Track unexpected signal-initiated shutdowns separately from planned stops/restarts.
    • Preserve the persisted runtime gateway state as running for those external signal shutdowns so Docker/s6 reconciliation can bring the gateway back up on the next container boot.
    • Keep planned stop/restart behavior unchanged.
  • tests/gateway/test_gateway_shutdown.py
    • Add regression coverage that signal-preserved shutdowns persist running.
  • tests/gateway/restart_test_helpers.py
    • Initialize the new runner test flag in the object-constructed test helper.

How to Test

  1. Reproduce the bug condition: a running Docker/s6 gateway receives an external SIGTERM during container restart/upgrade; prior behavior writes gateway_state="stopped", so container_boot registers the service down on the next boot.
  2. Verify the fix: unexpected signal shutdown now sets the preserve flag and final runtime status remains running; planned stops/restarts still use the existing paths.
  3. Local checks run:
    • .venv/bin/python -m pytest -o 'addopts=' tests/gateway/test_gateway_shutdown.py -q15 passed in 35.06s
    • .venv/bin/python -m compileall -q gateway/run.py tests/gateway/test_gateway_shutdown.py tests/gateway/restart_test_helpers.py
    • .venv/bin/python -m ruff check gateway/run.py tests/gateway/test_gateway_shutdown.py tests/gateway/restart_test_helpers.pyAll checks passed!
    • git diff --check

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux container/PRoot environment with focused gateway shutdown tests

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

$ .venv/bin/python -m pytest -o 'addopts=' tests/gateway/test_gateway_shutdown.py -q
...............                                                          [100%]
15 passed in 35.06s

$ .venv/bin/python -m ruff check gateway/run.py tests/gateway/test_gateway_shutdown.py tests/gateway/restart_test_helpers.py
All checks passed!

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery area/docker Docker image, Compose, packaging backend/docker Docker container execution P1 High — major feature broken, no workaround labels Jun 9, 2026
@liuhao1024

Copy link
Copy Markdown
Contributor

✅ Verified — signal-initiated shutdown preserves running state correctly

Reviewed the _preserve_gateway_running_state_on_stop flag flow across shutdown_signal_handler_stop_impl.

  • Flag initialization: __init__ defaults to False; restart_test_helpers.py mirror updated. Confirmed in gateway/run.py:1956.
  • Signal-only activation: The flag is set to True only in the else branch of shutdown_signal_handler (non-restart signal path), correctly excluding restart-initiated shutdowns where _restart_requested=True already controls the state.
  • State logic: _update_runtime_status receives "running" only when the flag is set AND _restart_requested is False. The not self._restart_requested guard prevents the flag from interfering with normal restart flows.
  • Test coverage: test_unexpected_signal_shutdown_preserves_running_state_for_service_revival verifies the "running" status is written and the shutdown event is still set (process actually stops).

The fix is correct — service managers (launchd, systemd) that monitor the status file will see "running" and restart the process after an unexpected SIGTERM/SIGINT. No issues found.

@benbarclay

Copy link
Copy Markdown
Collaborator

Superseded by #43236 (merged), which fixes #42675 by persisting running on a signal-initiated shutdown — the same end goal as this PR (_preserve_gateway_running_state_on_stop), but gated on the planned-stop marker classification rather than a 'was it a signal' flag set in the handler.

The distinction matters for the cases in #42517's table: an external kill/OOM under systemd is also a bare signal, and the marker is what separates those from an intentional stop. #43236 additionally wires the marker into the s6 stop() path so in-container hermes gateway stop is classified correctly, and verifies the whole thing against a real docker restart (fails on a pre-fix image, passes on the fixed one). Appreciate the work here.

@benbarclay benbarclay closed this Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docker Docker image, Compose, packaging backend/docker Docker container execution comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway does not auto-start after container restart/upgrade — signal-initiated shutdown persists gateway_state=stopped

4 participants