fix(gateway): write clean-shutdown marker at start of stop() not end #9128
Open
luinbytes wants to merge 1 commit into
When systemd sends SIGTERM with TimeoutStopSec, the gateway has a limited window to shut down gracefully. The .clean_shutdown marker was written near the END of stop(), after draining agents and disconnecting adapters. If the process exceeded systemd's timeout, it got SIGKILL before the marker was written, causing the next startup to call suspend_recently_active() and auto-reset all sessions.

Moving the marker write to the TOP of _stop_impl() ensures it's written immediately on any graceful shutdown signal, regardless of how long the drain takes. The marker is idempotent (touch), so the worst case of a false positive is skipping one crash-recovery suspension, which is harmless since the drain was already attempted.
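The marker handshake between shutdown and the next startup can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: `GatewaySketch`, `_drain_and_disconnect()`, `startup()`, and `suspended_after_restart()` are hypothetical names, and the sketch assumes that startup deletes the marker after reading it, which the description above does not state.

```python
import asyncio
import tempfile
from pathlib import Path


class GatewaySketch:
    """Hypothetical stand-in for the gateway; not the real class."""

    def __init__(self, state_dir: Path) -> None:
        self.marker = state_dir / ".clean_shutdown"
        self.suspended = False

    async def _stop_impl(self) -> None:
        # Write the marker first, before any awaitable work can stall.
        # touch() is idempotent, so repeated calls are harmless.
        self.marker.touch()
        await self._drain_and_disconnect()

    async def _drain_and_disconnect(self) -> None:
        # Placeholder for draining agents and disconnecting adapters.
        await asyncio.sleep(0)

    def suspend_recently_active(self) -> None:
        # Placeholder: the real function suspends recent sessions.
        self.suspended = True

    def startup(self) -> None:
        if self.marker.exists():
            # Previous shutdown was graceful: skip crash recovery.
            # Assumption: the marker is consumed here.
            self.marker.unlink()
        else:
            # No marker: treat the previous exit as a crash.
            self.suspend_recently_active()


def suspended_after_restart(graceful: bool) -> bool:
    """Return True if the restarted gateway suspended its sessions."""
    with tempfile.TemporaryDirectory() as tmp:
        state_dir = Path(tmp)
        if graceful:
            asyncio.run(GatewaySketch(state_dir)._stop_impl())
        restarted = GatewaySketch(state_dir)
        restarted.startup()
        return restarted.suspended
```

In this sketch, `suspended_after_restart(True)` models a graceful stop followed by a restart, and `suspended_after_restart(False)` models a restart with no preceding stop, which startup treats as a crash.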
Summary
Moves the .clean_shutdown marker write from the end of _stop_impl() to the very beginning, before any async work (draining, disconnecting, cleanup).

Problem
#8299 introduced the .clean_shutdown marker to prevent suspend_recently_active() from firing after graceful restarts. However, the marker was written near the end of stop(), after draining agents and disconnecting adapters. When the gateway runs as a systemd service with TimeoutStopSec=60, a long-running drain can push stop() past the timeout. systemd then sends SIGKILL, killing the process before the marker is written. The next startup sees no marker, assumes a crash, and suspends all recent sessions, causing unwanted auto-resets.

Fix
Write the marker at the top of _stop_impl(), immediately on entry. The marker is idempotent (touch), so a false positive (marker written but process crashes mid-drain) only skips one crash-recovery suspension, which is harmless since the drain was already attempted.

Edge case considered
If the process receives SIGTERM, writes the marker, then genuinely crashes during drain, the next startup will skip suspend_recently_active(). This is acceptable because:

Test plan
gateway/run.py stop() method)
hermes gateway restart preserves session after this change
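One way to exercise the ordering change without systemd is to cut a slow drain short and check whether the marker exists afterwards. The sketch below is an assumption-laden illustration: `stop_sketch()` and `marker_survives_kill()` are hypothetical helpers, the drain is replaced by `asyncio.sleep()`, and task cancellation via `asyncio.wait_for()` only approximates SIGKILL (a real SIGKILL gives the process no chance to run any cleanup).

```python
import asyncio
import tempfile
from pathlib import Path


async def stop_sketch(marker: Path, write_first: bool, drain_seconds: float) -> None:
    """Hypothetical stop routine with a switchable marker position."""
    if write_first:
        marker.touch()  # new behaviour: marker written on entry
    await asyncio.sleep(drain_seconds)  # stands in for a slow drain
    if not write_first:
        marker.touch()  # old behaviour: marker written after the drain


def marker_survives_kill(
    write_first: bool, drain_seconds: float = 0.2, deadline: float = 0.01
) -> bool:
    """Return True if the marker exists after the stop is cut short."""

    async def run(marker: Path) -> None:
        try:
            # The deadline plays the role of TimeoutStopSec.
            await asyncio.wait_for(
                stop_sketch(marker, write_first, drain_seconds), deadline
            )
        except asyncio.TimeoutError:
            pass  # the stop was interrupted mid-drain

    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / ".clean_shutdown"
        asyncio.run(run(marker))
        return marker.exists()
```

With the default arguments the drain outlasts the deadline, so `marker_survives_kill(False)` reproduces the missing-marker case and `marker_survives_kill(True)` shows the marker already on disk when the stop is interrupted.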