fix(gateway): premark active sessions before drain#27831
Conversation
e5f6b02 to
83914a4
Compare
|
CI update after rebasing on The remaining failures are unrelated to this gateway shutdown diff:
They fail because python -m pytest tests/hermes_cli/test_aux_config.py -q -o 'addopts='Focused verification for this PR's gateway changes is green: python -m pytest tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='
ruff check gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
python -m py_compile gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
git diff --check |
magnus919
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved — correct implementation of a well-identified race window. No blocking issues found.
✅ Correctness
-
Pre-mark before drain: The core logic is right — mark
resume_pendingbefore entering the drain wait, so a SIGKILL during drain preserves session state. This closes a genuine race window (#27856). -
Selective clearing: The
_pre_drain_resume_marksset tracks exactly which sessions this stop() call marked, so graceful completions only clear the marks this call added. Pre-existingresume_pendingentries are left untouched — correct. -
Timeout path preserved: The existing post-timeout
mark_resume_pending()loop still runs, refreshing the marker for sessions still alive after drain. Both paths converge correctly. -
Edge cases handled: Suspended sessions, already-resume_pending sessions, and missing session_store are all skipped without error.
✅ Security & Robustness
-
The pre-mark uses the same SessionStore private APIs (
_lock,_entries,_save) as the existing timeout-path code — with proper# noqa: SLF001justification comments. Consistent with existing patterns. -
getattr(self, "_snapshot_running_agents", None)fallback for test/lightweight gateway objects is pragmatic. -
Broad
except Exceptionguard prevents the pre-mark from crashing the shutdown. Logged at DEBUG level.
✅ Code Quality
- Nested function is appropriate here — it's a shutdown-only helper that needs closure access to
self.session_store. - Docstring clearly explains why the pre-mark is necessary (systemd can kill during drain wait).
- Logging at INFO for pre-mark events is the right level — this is operational telemetry for restart debugging.
datetime.now()is consistent with existing gateway patterns (gateway/delivery.py, existing shutdown code).
✅ Testing
- Proper RED-GREEN workflow: test confirmed failure on upstream/main before the fix.
- The test directly validates that
resume_pendingis True inside the drain call, which is the exact invariant this fix protects. - Post-graceful-drain assertion checks that the marker was cleared — full round-trip coverage.
object.__new__(GatewayRunner)+ patched dependencies is a reasonable pattern for testing shutdown logic without a real gateway.
💡 Minor Suggestion (non-blocking)
The test uses asyncio.get_event_loop().run_until_complete() which is the deprecated event-loop API. The same pattern is used in the existing tests in this file, so it's consistent — but worth migrating to asyncio.run() if the tests are ever refactored.
Summary
Solid fix for a genuine race condition in gateway restart/shutdown. The pre-mark + clear-on-graceful pattern is correct, the edge cases are handled, and the test validates the invariant. No issues found beyond cosmetic suggestions. Ready to merge.
|
(Note from @magnus919: I'm a contributor, not a maintainer, so the 'Approved' state on my review above doesn't carry maintainer authority — it was submitted with the wrong event type. Consider it a friendly code review and recommendation, not a formal sign-off.) |
|
Closing this as superseded: the same pre-drain I opened a narrower follow-up for the remaining stale-recovery edge cases here: #29188. Thanks again to everyone who reviewed this thread — the original issue is covered in |
Summary
Marks currently running gateway sessions as
resume_pendingbefore the restart/shutdown drain wait begins, then clears those early markers if the drain finishes cleanly.This protects active messaging sessions when the service manager kills the gateway while it is still inside the drain window. In that case the existing post-timeout
mark_resume_pending()path never runs, so long-running sessions can be lost/stopped after restart.Why
The current flow only writes
resume_pendingafter_drain_active_agents(timeout)returnstimed_out=True.On systemd/launchd restarts the process can be terminated while still awaiting that drain, before control returns to the timeout branch. If the session's
updated_atis outside the startup fallback window, the next boot has no durable marker to resume it.This keeps the conservative
resume_pendingmodel from #12301, but moves the first durable marker earlier than the vulnerable wait.Behavior
restart_timeout/shutdown_timeoutbefore drain.suspended=Truesessions are not overridden.resume_pendingsessions are left alone and are not cleared by this cleanup._running_agentswithout the newer_snapshot_running_agents()helper.Related
Fixes #27856.
.clean_shutdownat the start of stop.resume_pendingrecovery model from fix(gateway): auto-resume sessions after drain-timeout restart (#11852) #12301 / spec: automatic session resume after gateway restart #11852.Test plan
RED:
upstream/main.tests/gateway/test_clean_shutdown_marker.py::TestCleanShutdownMarker::test_active_sessions_are_resume_marked_before_drainfailed becauseresume_pendingwas stillFalseinside the drain call.GREEN:
python -m pytest tests/gateway/test_clean_shutdown_marker.py::TestCleanShutdownMarker::test_active_sessions_are_resume_marked_before_drain tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='python -m pytest tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='python -m py_compile gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.pyruff check gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.pygit diff --checkBaseline note:
python -m pytest tests/hermes_cli/test_aux_config.py -q -o 'addopts='currently fails the same way onupstream/mainat41f1eddeebecauseDEFAULT_CONFIG["auxiliary"]is missingsession_search. This is unrelated to this PR's gateway shutdown diff.AI Disclosure
This fix was prepared with AI assistance.