fix(gateway): pre-mark sessions as resume_pending before drain to pre…#28217

Closed
LifeJiggy wants to merge 1 commit into
NousResearch:mainfrom
LifeJiggy:fix/gateway-shutdown-drain-sessions

Conversation

@LifeJiggy

Contributor

What does this PR do?

Pre-mark all running agent sessions as resume_pending before the drain wait begins in the gateway shutdown/restart flow. If the service manager kills the process during the drain window, the durable marker is already persisted, so the next gateway boot can recover in-flight sessions via auto-resume. On graceful drain completion (no timeout), the early markers are cleared for sessions that finished successfully.

Related Issue
Fixes #27856

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py (_stop_impl): Snapshot running sessions and call SessionStore.mark_resume_pending() for each before _drain_active_agents(). After a graceful drain, call SessionStore.clear_resume_pending() for sessions that completed during the window. The existing timeout path remains unchanged (pre-mark is already written and the post-timeout mark_resume_pending loop is idempotent).
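The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: `InMemorySessionStore`, `stop_with_premark`, the `running` mapping, and the method signatures are all assumptions made for the example, and the real SessionStore persists markers durably rather than in a set.

```python
import asyncio


class InMemorySessionStore:
    """Hypothetical stand-in for SessionStore; the real store writes to disk."""

    def __init__(self):
        self.pending = set()

    def mark_resume_pending(self, key):
        # Adding to a set is idempotent, so marking twice is harmless.
        self.pending.add(key)

    def clear_resume_pending(self, key):
        self.pending.discard(key)


async def stop_with_premark(store, running, drain_timeout):
    """Pre-mark, drain, then clear markers only if the drain completed.

    `running` maps a session key to the asyncio.Task of its in-flight turn
    (an assumed shape). Returns True when every session drained in time.
    """
    snapshot = dict(running)

    # 1. Persist the marker BEFORE waiting, so a hard kill during the
    #    drain window still leaves a record for the next boot.
    for key in snapshot:
        store.mark_resume_pending(key)

    if not snapshot:
        return True

    # 2. Drain: wait for in-flight turns, bounded by the timeout.
    _done, still_running = await asyncio.wait(
        list(snapshot.values()), timeout=drain_timeout
    )

    # 3. Graceful completion: every turn finished, so drop the early markers.
    if not still_running:
        for key in snapshot:
            store.clear_resume_pending(key)
        return True

    # 4. Timeout path, unchanged in spirit: re-mark whatever is still running.
    for key, task in snapshot.items():
        if not task.done():
            store.mark_resume_pending(key)
    return False
```

Note that in this sketch a session that finishes during the drain keeps its marker whenever another session causes the timeout, because step 3 only runs on a fully clean drain.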

How to Test

  1. Start the gateway with a long-running agent session active
  2. Send a restart signal (/restart or SIGTERM → exit code 75)
  3. Verify that the new pre-drain mark_resume_pending log line fires for the active session
  4. On restart, verify that the interrupted session auto-resumes (the resume_pending marker was persisted before the drain wait)
  5. Repeat with graceful (timeout=0) drain and verify clear_resume_pending fires for finished sessions

Checklist
Code

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (logic change with no new public API)
  • I've updated cli-config.yaml.example — N/A
  • I've considered cross-platform impact (Windows, macOS) — logic is platform-agn

…vent data loss (NousResearch#27856)

Pre-mark all running agent sessions as resume_pending BEFORE the drain
wait begins. If the service manager kills the process during the drain
window, the durable marker is already written, so the next gateway boot
can recover in-flight sessions. On graceful drain completion, clear the
early markers for sessions that finished successfully.
@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround labels May 18, 2026
@alt-glitch

Collaborator

Duplicate of #27831 — same fix (pre-mark sessions as resume_pending before drain), same target issue #27856. Both PRs are open.

@outsourc-e outsourc-e left a comment

Contributor


The pre-drain persistence idea is right, but this version regresses restart marking semantics in the mixed timeout case. Because every running session is pre-marked before the drain, any session that finishes during the drain window still remains resume_pending if the overall drain later times out on some other session. That incorrectly injects the restart-interruption resume path into a turn that already completed cleanly.

Local repro:
- python3 -m pytest -q -o addopts='' tests/gateway/test_restart_resume_pending.py::test_drain_timeout_only_marks_still_running_sessions => FAIL
- same test on current origin/main => PASS

I also hit the existing clean-drain expectation failure for the new write timing:
- tests/gateway/test_restart_resume_pending.py::test_clean_drain_does_not_mark_resume_pending => FAIL on this PR, PASS on origin/main

Concrete blocker: preserve the pre-drain durability win, but clear pre-marked keys that finished during the drain window before the timeout interruption path proceeds, so only still-running sessions remain resume_pending.
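One way to read the requested change is sketched below: keep the early write, but reconcile per session after the drain wait so that only still-running sessions stay marked, whether or not the drain timed out. This is an illustrative sketch under assumptions, not the code that was eventually merged; `InMemorySessionStore`, `drain_and_reconcile`, and the task-per-session `running` mapping are hypothetical names introduced for the example.

```python
import asyncio


class InMemorySessionStore:
    """Hypothetical stand-in for SessionStore; the real store writes to disk."""

    def __init__(self):
        self.pending = set()

    def mark_resume_pending(self, key):
        self.pending.add(key)

    def clear_resume_pending(self, key):
        self.pending.discard(key)


def _finished_cleanly(task):
    # A cancelled or failed turn is not treated as a clean completion.
    return task.done() and not task.cancelled() and task.exception() is None


async def drain_and_reconcile(store, running, drain_timeout):
    """Pre-mark every session, drain, then clear markers per session.

    Returns the set of session keys that remain resume_pending.
    """
    snapshot = dict(running)

    # Durability win: markers are written before the drain wait starts.
    for key in snapshot:
        store.mark_resume_pending(key)

    if snapshot:
        await asyncio.wait(list(snapshot.values()), timeout=drain_timeout)

    # Reconcile BEFORE any timeout handling: a session that completed
    # during the window must not be routed into the interruption path.
    remaining = set()
    for key, task in snapshot.items():
        if _finished_cleanly(task):
            store.clear_resume_pending(key)
        else:
            remaining.add(key)
    return remaining
```

With this shape, a clean drain leaves no markers behind and a mixed timeout leaves markers only on the sessions that were still running, which is what the two test names in the review describe.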

@teknium1

Contributor

Merged via PR #28576 (cherry-picked onto current main with your authorship preserved via rebase-merge — commit e2a1a2b). Thanks for the contribution!


Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working


Development

Successfully merging this pull request may close these issues.

Gateway restart can lose long-running sessions during shutdown drain

4 participants