fix(gateway): pre-mark sessions as resume_pending before drain to pre…#28217
LifeJiggy wants to merge 1 commit into
…vent data loss (NousResearch#27856)

Pre-mark all running agent sessions as resume_pending BEFORE the drain wait begins. If the service manager kills the process during the drain window, the durable marker is already written, so the next gateway boot can recover in-flight sessions. On graceful drain completion, clear the early markers for sessions that finished successfully.
outsourc-e
left a comment
The pre-drain persistence idea is right, but this version regresses restart marking semantics in the mixed timeout case. Because every running session is pre-marked before the drain, any session that finishes during the drain window still remains resume_pending if the overall drain later times out on some other session. That incorrectly injects the restart-interruption resume path into a turn that already completed cleanly.

Local repro:
- python3 -m pytest -q -o addopts='' tests/gateway/test_restart_resume_pending.py::test_drain_timeout_only_marks_still_running_sessions => FAIL
- same test on current origin/main => PASS

I also hit the existing clean-drain expectation failure for the new write timing:
- tests/gateway/test_restart_resume_pending.py::test_clean_drain_does_not_mark_resume_pending => FAIL on this PR, PASS on origin/main

Concrete blocker: preserve the pre-drain durability win, but clear pre-marked keys that finished during the drain window before the timeout interruption path proceeds, so only still-running sessions remain resume_pending.
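To make the requested behavior concrete, here is a minimal sketch of the reconciliation step, written as a pure function over session keys. The name `reconcile_resume_markers` and its arguments are hypothetical and do not come from the gateway code; the sketch only assumes that the set of pre-marked keys and the set of still-running keys are both available once the drain wait returns.

```python
from typing import Iterable, Set, Tuple


def reconcile_resume_markers(
    pre_marked: Iterable[str],
    still_running: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Split pre-marked session keys into (keep, clear).

    Hypothetical helper, not part of the gateway codebase.
    keep:  sessions still running when the drain ended; they stay resume_pending.
    clear: sessions that finished during the drain window; their early
           marker should be removed before the timeout path proceeds.
    """
    pre = set(pre_marked)
    running = set(still_running)
    keep = pre & running
    clear = pre - running
    return keep, clear


# Mixed timeout case: "a" and "c" finished during the drain, "b" did not.
keep, clear = reconcile_resume_markers({"a", "b", "c"}, {"b"})
```

With this split, the timeout interruption path would only see the keys in `keep`, which is what test_drain_timeout_only_marks_still_running_sessions appears to expect.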
What does this PR do?
Pre-mark all running agent sessions as resume_pending before the drain wait begins in the gateway shutdown/restart flow. If the service manager kills the process during the drain window, the durable marker is already persisted, so the next gateway boot can recover in-flight sessions via auto-resume. On graceful drain completion (no timeout), the early markers are cleared for sessions that finished successfully.
Related Issue
Fixes #27856
Type of Change
Changes Made
How to Test
Checklist
Code
Documentation & Housekeeping