fix(gateway): write clean-shutdown marker at start of stop() not end#9128

Open
luinbytes wants to merge 1 commit into
NousResearch:mainfrom
luinbytes:fix/clean-shutdown-marker

@luinbytes luinbytes commented Apr 13, 2026

Summary

Moves the .clean_shutdown marker write from the end of _stop_impl() to the very beginning, before any async work (draining, disconnecting, cleanup).

Problem

#8299 introduced the .clean_shutdown marker to prevent suspend_recently_active() from firing after graceful restarts. However, the marker was written near the end of stop(), after draining agents and disconnecting adapters. When the gateway runs as a systemd service with TimeoutStopSec=60, a long-running drain can push stop() past the timeout. systemd then sends SIGKILL, killing the process before the marker is written. The next startup sees no marker, assumes a crash, and suspends all recent sessions — causing unwanted auto-resets.

Fix

Write the marker at the top of _stop_impl(), immediately on entry. The write is an idempotent touch, so repeating it is safe. A false positive (marker written, but the process crashes mid-drain) only skips one crash-recovery suspension, which is harmless because the drain was already attempted.
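
A minimal sketch of the ordering change, not the actual gateway/run.py code. `GatewaySketch`, `_drain()`, `stop_marker_last()`, `stop_marker_first()`, and `marker_survives()` are hypothetical names, the marker is assumed to be an empty file created with `Path.touch()`, and `asyncio.wait_for` cancelling the coroutine is only a rough stand-in for systemd sending SIGKILL after TimeoutStopSec.

```python
import asyncio
import tempfile
from pathlib import Path


class GatewaySketch:
    """Hypothetical stand-in for the gateway runner; all names are illustrative."""

    def __init__(self, state_dir: Path, drain_seconds: float) -> None:
        self.marker_path = state_dir / ".clean_shutdown"
        self.drain_seconds = drain_seconds

    async def _drain(self) -> None:
        # Placeholder for draining agents and disconnecting adapters.
        await asyncio.sleep(self.drain_seconds)

    async def stop_marker_last(self) -> None:
        # Old ordering: the marker is written only if the drain finishes.
        await self._drain()
        self.marker_path.touch(exist_ok=True)

    async def stop_marker_first(self) -> None:
        # New ordering: record the graceful-shutdown intent before any await.
        self.marker_path.touch(exist_ok=True)
        await self._drain()


async def marker_survives(stop_name: str, timeout: float = 0.05) -> bool:
    """Run one stop variant, cut it off at `timeout`, report whether the marker exists."""
    with tempfile.TemporaryDirectory() as tmp:
        gateway = GatewaySketch(Path(tmp), drain_seconds=1.0)
        try:
            # On timeout, wait_for cancels the coroutine mid-drain.
            await asyncio.wait_for(getattr(gateway, stop_name)(), timeout)
        except asyncio.TimeoutError:
            pass
        return gateway.marker_path.exists()


def demo() -> tuple[bool, bool]:
    """Return (marker present with old ordering, marker present with new ordering)."""
    return (
        asyncio.run(marker_survives("stop_marker_last")),
        asyncio.run(marker_survives("stop_marker_first")),
    )
```

With a drain that outlasts the timeout, only the marker-first variant leaves the marker behind, which is the behavior this PR relies on.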

Edge case considered

If the process receives SIGTERM, writes the marker, and then genuinely crashes during the drain, the next startup will skip suspend_recently_active(). This is acceptable because:

  1. The drain was already initiated (agents were asked to stop)
  2. The only risk is a "stuck" session, but systemd will restart the gateway anyway
  3. The alternative (current behavior) is worse: losing ALL session context on every planned restart
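
The startup side of this trade-off can be sketched as follows. This is an illustration under stated assumptions rather than the gateway's real startup code: `handle_startup()` is a hypothetical helper, `suspend_recently_active` is passed in as a plain callable, and the sketch assumes the marker is deleted once it has been read so that a later crash without a marker is still detected.

```python
import tempfile
from pathlib import Path
from typing import Callable


def handle_startup(state_dir: Path, suspend_recently_active: Callable[[], None]) -> bool:
    """Hypothetical startup check. Returns True if the last shutdown was marked clean."""
    marker = state_dir / ".clean_shutdown"
    if marker.exists():
        # Assumption: consume the marker so it only covers one restart.
        marker.unlink()
        return True
    # No marker: treat the previous exit as a crash.
    suspend_recently_active()
    return False


def demo() -> list[bool]:
    """Walk through three startups and report which ones were treated as clean."""
    results: list[bool] = []
    with tempfile.TemporaryDirectory() as tmp:
        state_dir = Path(tmp)
        # 1. Previous process died without a marker: sessions get suspended.
        results.append(handle_startup(state_dir, lambda: None))
        # 2. Graceful stop touched the marker, then the process died mid-drain:
        #    the suspension is skipped once (the false positive discussed above).
        (state_dir / ".clean_shutdown").touch()
        results.append(handle_startup(state_dir, lambda: None))
        # 3. A later hard crash leaves no marker, so it is detected again.
        results.append(handle_startup(state_dir, lambda: None))
    return results
```

Under these assumptions the false positive costs exactly one skipped suspension, and crash detection resumes on the following start.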

Test plan

  • Syntax check passes
  • Diff is clean against upstream main
  • No overlap with other open PRs (only touches gateway/run.py stop() method)
  • Verify: hermes gateway restart preserves session after this change

When systemd sends SIGTERM with TimeoutStopSec, the gateway has a
limited window to shut down gracefully. The .clean_shutdown marker was
written near the END of stop(), after draining agents and disconnecting
adapters. If the process exceeded systemd's timeout, it got SIGKILL
before the marker was written, causing the next startup to call
suspend_recently_active() and auto-reset all sessions.

Moving the marker write to the TOP of _stop_impl() ensures it's written
immediately on any graceful shutdown signal, regardless of how long the
drain takes. The marker write is an idempotent touch, and the worst case
of a false positive is skipping one crash-recovery suspension, which is
harmless since the drain was already attempted.

@alt-glitch added labels type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) on Apr 27, 2026

@alt-glitch (Collaborator)

Likely duplicate of #11099: the clean-shutdown marker is written at the end of stop(), so if the drain exceeds systemd's TimeoutStopSec, SIGKILL prevents the marker write and session context is lost on restart. The fix moves the marker write to the start of _stop_impl().
