Skip to content

fix(gateway): premark active sessions before drain#27831

Closed
Qwinty wants to merge 2 commits into
NousResearch:mainfrom
Qwinty:fix/premark-resume-before-drain
Closed

fix(gateway): premark active sessions before drain#27831
Qwinty wants to merge 2 commits into
NousResearch:mainfrom
Qwinty:fix/premark-resume-before-drain

Conversation

@Qwinty

@Qwinty Qwinty commented May 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Marks currently running gateway sessions as resume_pending before the restart/shutdown drain wait begins, then clears those early markers if the drain finishes cleanly.

This protects active messaging sessions when the service manager kills the gateway while it is still inside the drain window. In that case the existing post-timeout mark_resume_pending() path never runs, so long-running sessions can be lost/stopped after restart.

Why

The current flow only writes resume_pending after _drain_active_agents(timeout) returns timed_out=True.

On systemd/launchd restarts the process can be terminated while still awaiting that drain, before control returns to the timeout branch. If the session's updated_at is outside the startup fallback window, the next boot has no durable marker to resume it.

This keeps the conservative resume_pending model from #12301, but moves the first durable marker earlier than the vulnerable wait.

Behavior

  • Active sessions are pre-marked with restart_timeout / shutdown_timeout before drain.
  • If drain completes gracefully, only markers created by this stop call are cleared.
  • If drain times out, the existing post-timeout refresh and interrupt path still runs.
  • Existing suspended=True sessions are not overridden.
  • Existing resume_pending sessions are left alone and are not cleared by this cleanup.
  • The shutdown helper also tolerates lightweight test/fake gateway objects that expose _running_agents without the newer _snapshot_running_agents() helper.

Related

Fixes #27856.

Test plan

RED:

  • Applied the new regression test alone on upstream/main.
  • Confirmed tests/gateway/test_clean_shutdown_marker.py::TestCleanShutdownMarker::test_active_sessions_are_resume_marked_before_drain failed because resume_pending was still False inside the drain call.

GREEN:

  • python -m pytest tests/gateway/test_clean_shutdown_marker.py::TestCleanShutdownMarker::test_active_sessions_are_resume_marked_before_drain tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='
  • python -m pytest tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='
  • python -m py_compile gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
  • ruff check gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
  • git diff --check

Baseline note:

  • python -m pytest tests/hermes_cli/test_aux_config.py -q -o 'addopts=' currently fails the same way on upstream/main at 41f1eddee because DEFAULT_CONFIG["auxiliary"] is missing session_search. This is unrelated to this PR's gateway shutdown diff.

AI Disclosure

This fix was prepared with AI assistance.

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround labels May 18, 2026
@Qwinty Qwinty force-pushed the fix/premark-resume-before-drain branch from e5f6b02 to 83914a4 Compare May 18, 2026 08:06
@Qwinty

Qwinty commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

CI update after rebasing on upstream/main (41f1eddee): all checks passed except the full test job.

The remaining failures are unrelated to this gateway shutdown diff:

  • tests/hermes_cli/test_aux_config.py::test_session_search_defaults_include_extra_body_and_concurrency
  • tests/hermes_cli/test_aux_config.py::test_aux_tasks_keys_all_exist_in_default_config

They fail because DEFAULT_CONFIG["auxiliary"] is missing session_search. I reproduced the same two failures directly on upstream/main at 41f1eddee with:

python -m pytest tests/hermes_cli/test_aux_config.py -q -o 'addopts='

Focused verification for this PR's gateway changes is green:

python -m pytest tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_restart_resume_pending.py tests/gateway/test_shutdown_cache_cleanup.py -q -o 'addopts='
ruff check gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
python -m py_compile gateway/run.py tests/gateway/test_clean_shutdown_marker.py tests/gateway/test_shutdown_cache_cleanup.py
git diff --check

@magnus919 magnus919 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved — correct implementation of a well-identified race window. No blocking issues found.

✅ Correctness

  • Pre-mark before drain: The core logic is right — mark resume_pending before entering the drain wait, so a SIGKILL during drain preserves session state. This closes a genuine race window (#27856).

  • Selective clearing: The _pre_drain_resume_marks set tracks exactly which sessions this stop() call marked, so graceful completions only clear the marks this call added. Pre-existing resume_pending entries are left untouched — correct.

  • Timeout path preserved: The existing post-timeout mark_resume_pending() loop still runs, refreshing the marker for sessions still alive after drain. Both paths converge correctly.

  • Edge cases handled: Suspended sessions, already-resume_pending sessions, and missing session_store are all skipped without error.

✅ Security & Robustness

  • The pre-mark uses the same SessionStore private APIs (_lock, _entries, _save) as the existing timeout-path code — with proper # noqa: SLF001 justification comments. Consistent with existing patterns.

  • getattr(self, "_snapshot_running_agents", None) fallback for test/lightweight gateway objects is pragmatic.

  • Broad except Exception guard prevents the pre-mark from crashing the shutdown. Logged at DEBUG level.

✅ Code Quality

  • Nested function is appropriate here — it's a shutdown-only helper that needs closure access to self.session_store.
  • Docstring clearly explains why the pre-mark is necessary (systemd can kill during drain wait).
  • Logging at INFO for pre-mark events is the right level — this is operational telemetry for restart debugging.
  • datetime.now() is consistent with existing gateway patterns (gateway/delivery.py, existing shutdown code).

✅ Testing

  • Proper RED-GREEN workflow: test confirmed failure on upstream/main before the fix.
  • The test directly validates that resume_pending is True inside the drain call, which is the exact invariant this fix protects.
  • Post-graceful-drain assertion checks that the marker was cleared — full round-trip coverage.
  • object.__new__(GatewayRunner) + patched dependencies is a reasonable pattern for testing shutdown logic without a real gateway.

💡 Minor Suggestion (non-blocking)

The test uses asyncio.get_event_loop().run_until_complete() which is the deprecated event-loop API. The same pattern is used in the existing tests in this file, so it's consistent — but worth migrating to asyncio.run() if the tests are ever refactored.

Summary

Solid fix for a genuine race condition in gateway restart/shutdown. The pre-mark + clear-on-graceful pattern is correct, the edge cases are handled, and the test validates the invariant. No issues found beyond cosmetic suggestions. Ready to merge.

@magnus919

Copy link
Copy Markdown

(Note from @magnus919: I'm a contributor, not a maintainer, so the 'Approved' state on my review above doesn't carry maintainer authority — it was submitted with the wrong event type. Consider it a friendly code review and recommendation, not a formal sign-off.)

@Qwinty

Qwinty commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

Closing this as superseded: the same pre-drain resume_pending fix landed upstream via #28576 (salvaged from #28217, which was the duplicate of this PR).

I opened a narrower follow-up for the remaining stale-recovery edge cases here: #29188.

Thanks again to everyone who reviewed this thread — the original issue is covered in main now.

@Qwinty Qwinty closed this May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gateway restart can lose long-running sessions during shutdown drain

3 participants