
feat(gateway): atomic active-session marker for precise post-crash recovery #20776

Closed
chrisworksai wants to merge 1 commit into NousResearch:main from chrisworksai:feat/gateway-active-session-marker

Conversation

@chrisworksai

Contributor

Summary

Adds an optional active-session marker file so suspend_recently_active() can precisely target sessions that were mid-turn at shutdown, instead of broadly sweeping anything updated in the last 120 seconds. Resume semantics are unchanged — only the targeting is sharper.

Why

suspend_recently_active() today uses a 120s time-based sweep (#7536). It catches the in-flight sessions correctly, but it also catches false-positives: sessions that received a message recently but had already finished processing it before the gateway stopped. Each false-positive forces an unnecessary auto-resume on the user's next message.

For clean stops / fast restarts (which account for the bulk of restarts in practice), the gateway already knows exactly which sessions had running agents at shutdown — self._running_agents. Persisting that set across the restart lets the next startup mark the precise set instead of guessing from timestamps.

For unclean exits (crash, OOM, power loss), the marker is absent and the existing time-based sweep runs unchanged.

Implementation

Three pieces:

  1. SessionStore.save_active_sessions(active_session_keys: set): new method that writes a .active_sessions_at_shutdown JSON marker atomically (tmpfile + fsync + os.replace) into sessions_dir. The write is atomic so that a crash mid-write can't leave a half-written marker for the next startup to misinterpret.

  2. Gateway shutdown call site in GatewayRunner.stop() — right after _notify_active_sessions_of_shutdown() and before _drain_active_agents(), persist set(self._running_agents.keys()). Wrapped in try/except-with-warning-log so a marker-write failure can't block a shutdown that was already in progress.

  3. SessionStore.suspend_recently_active() — check for the marker first. If present: mark only the sessions in the marker as resume_pending, then unlink the marker. If absent: existing time-based sweep runs unchanged.
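
As an illustration of piece 1, here is a minimal sketch of an atomic marker write using only the standard library. It is not the code from this PR: the free-function form, the `sessions_dir` parameter, and the `{"active_sessions": [...]}` payload shape are assumptions made for the example (the description only states that the marker is JSON and is written via tmpfile + fsync + os.replace).

```python
import json
import os
import tempfile
from pathlib import Path

MARKER_NAME = ".active_sessions_at_shutdown"  # file name taken from the PR description


def save_active_sessions(sessions_dir: Path, active_session_keys: set) -> Path:
    """Write the marker atomically: temp file, fsync, then os.replace."""
    sessions_dir.mkdir(parents=True, exist_ok=True)
    marker = sessions_dir / MARKER_NAME
    # The temp file lives in the same directory so os.replace never crosses filesystems.
    fd, tmp_name = tempfile.mkstemp(
        dir=sessions_dir, prefix=MARKER_NAME + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            # Payload shape is an assumption; keys are assumed to be strings.
            json.dump({"active_sessions": sorted(active_session_keys)}, fh)
            fh.flush()
            os.fsync(fh.fileno())  # force the bytes to disk before the rename
        os.replace(tmp_name, marker)  # readers see the old file or the new one, never a partial
    except BaseException:
        # On any failure, remove the temp file and leave an existing marker untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return marker
```

Because the rename is the only step that makes the marker visible under its final name, a reader either finds a complete marker or none at all.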

Why these specific design choices

  • Marker file vs. database column: A file in sessions_dir requires no schema migration, no from_dict/to_dict changes, and is trivially inspectable / removable for debugging.
  • Atomic write: A crash partway through a plain open()/write() could leave an empty or truncated marker. If the next startup read that as "no sessions were active", it would be worse than having no marker, because the time-based fallback wouldn't run. Writing to a tmpfile and then calling os.replace eliminates that window.
  • Consume-on-read: Deleting the marker after reading it means that a later startup not preceded by a clean shutdown (e.g. the gateway crashed after the marker was consumed) finds no marker and reverts to the time-based path, so a stale marker is never reused.
  • Resume semantics unchanged: Existing users who rely on resume_pending=True → auto-resume get exactly that behaviour, just with fewer false-positives. No new fields on SessionEntry. No config option. Pure improvement on the existing contract.
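
The consume-on-read choice and the time-based fallback can be sketched together as below. This is a simplified illustration, not the PR's implementation: `sessions_to_mark_resume_pending`, the `last_updated` mapping, and the payload key are hypothetical names, and treating an unreadable marker the same as a missing one is an assumption. The real method also skips entries that are already resume_pending or suspended, which is omitted here.

```python
import json
import time
from pathlib import Path
from typing import Optional

MARKER_NAME = ".active_sessions_at_shutdown"
SWEEP_WINDOW_SECONDS = 120  # window the description cites for the existing sweep


def sessions_to_mark_resume_pending(
    sessions_dir: Path, last_updated: dict, now: Optional[float] = None
) -> set:
    """Pick the session keys to mark resume_pending at startup.

    `last_updated` maps session key -> last update time in epoch seconds;
    it is a hypothetical stand-in for the real session entries.
    """
    now = time.time() if now is None else now
    marker = sessions_dir / MARKER_NAME
    try:
        payload = json.loads(marker.read_text(encoding="utf-8"))
        keys = set(payload["active_sessions"])
    except FileNotFoundError:
        keys = None  # no marker: unclean exit, use the time-based sweep
    except (ValueError, KeyError, TypeError):
        keys = None  # unreadable marker: assumed here to fall back as well
    # Consume-on-read: the marker never survives past the startup that saw it.
    marker.unlink(missing_ok=True)
    if keys is not None:
        return keys & set(last_updated)  # ignore keys with no matching session
    return {
        key for key, ts in last_updated.items()
        if now - ts <= SWEEP_WINDOW_SECONDS
    }
```

Calling the function twice in a row without writing a new marker exercises both paths: the first call returns only the marker-listed keys, and the second falls back to the timestamp window.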

Compatibility

  • No schema migration needed.
  • Any installation that doesn't write the marker (e.g. an unclean exit, or an external embedder that calls SessionStore outside the gateway lifecycle) gets the existing time-based sweep — fully backward-compatible.
  • The marker file is in sessions_dir and prefixed with . so it doesn't show up in normal listings; it never collides with a session JSON file.

Test plan

  • Clean shutdown writes marker; next startup marks only marker-listed sessions as resume_pending
  • Marker is deleted after consumption
  • No marker (simulated crash) → time-based sweep runs as today
  • resume_pending and suspended entries skipped (existing behaviour preserved)
  • Atomic write — interrupting between tmp-write and os.replace doesn't corrupt the marker
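
One way the last test-plan item could be exercised is by injecting a failure at the rename step. The sketch below is an assumption about how such a check might look, not the PR's test suite: `write_marker` is a compact stand-in for save_active_sessions, and the payload shape is hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from unittest import mock

MARKER_NAME = ".active_sessions_at_shutdown"


def write_marker(sessions_dir: Path, keys: set) -> None:
    """Compact stand-in for save_active_sessions (temp file + fsync + os.replace)."""
    fd, tmp_name = tempfile.mkstemp(dir=sessions_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump({"active_sessions": sorted(keys)}, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_name, sessions_dir / MARKER_NAME)


def interrupted_write_keeps_previous_marker() -> bool:
    """Simulate dying between the temp-file write and os.replace."""
    with tempfile.TemporaryDirectory() as tmp:
        sessions_dir = Path(tmp)
        write_marker(sessions_dir, {"old"})
        # Make the rename step fail, as if the process stopped right before it.
        with mock.patch("os.replace", side_effect=OSError("simulated crash")):
            try:
                write_marker(sessions_dir, {"new"})
            except OSError:
                pass
        marker = sessions_dir / MARKER_NAME
        content = json.loads(marker.read_text(encoding="utf-8"))
        # The marker must still be the complete earlier payload, not a partial new one.
        return content == {"active_sessions": ["old"]}
```

The same pattern works with no pre-existing marker, in which case the expectation would be that the marker path does not exist after the failed write.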

🤖 Generated with Claude Code

…covery

Today suspend_recently_active() uses a 120-second time-based sweep:
any session updated within the cutoff is marked resume_pending. This
catches the in-flight sessions but also catches false-positives —
sessions that received a message recently but had already finished
processing it. Each false-positive forces an unnecessary auto-resume
on the user's next message.

This change adds an optional, more precise targeting path:

  1. SessionStore.save_active_sessions(active_session_keys) writes
     a `.active_sessions_at_shutdown` JSON marker atomically
     (tmpfile + fsync + os.replace) into sessions_dir.

  2. The gateway shutdown path calls save_active_sessions(set(
     self._running_agents.keys())) right before draining, so the
     marker captures exactly the sessions that were mid-turn.

  3. suspend_recently_active() now checks for the marker first.
     If present: only sessions in the marker are marked
     resume_pending, and the marker is consumed (deleted).
     If absent (real crash, OOM kill, power loss): falls back to
     the existing time-based sweep — no behavioural change for
     unclean exits.

The resume_pending semantics are unchanged — just the targeting is
sharper for the clean-shutdown / fast-restart cases that account for
the bulk of restarts. No new fields on SessionEntry, no migration
needed; the marker file is in sessions_dir alongside the existing
session JSON.
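
The shutdown-side call in step 2 can be sketched as a best-effort wrapper. The helper name, the logger name, and passing the writer in as a callable are assumptions for illustration; in the PR the call sits inline in GatewayRunner.stop() and is wrapped in try/except with a warning log.

```python
import logging

logger = logging.getLogger("gateway.shutdown")  # logger name is an assumption


def persist_active_sessions_best_effort(save_active_sessions, running_agents: dict) -> bool:
    """Snapshot the running-agent keys and hand them to the marker writer.

    Returns False instead of raising, so a failed marker write cannot
    block a shutdown that is already in progress.
    """
    try:
        # Copy the keys into a set so later changes to the dict don't affect the snapshot.
        save_active_sessions(set(running_agents.keys()))
        return True
    except Exception as exc:
        logger.warning("Could not write active-session marker: %s", exc)
        return False
```

If the write fails, the next startup finds no marker and takes the time-based path, so the failure costs precision rather than correctness.
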

@alt-glitch added the type/feature, P3, and comp/gateway labels on May 6, 2026
@alt-glitch

Collaborator

Related to #8143 (crash checkpoint for precise session recovery) — both aim to replace the time-based sweep in suspend_recently_active() with precise targeting. This PR uses an active-session marker file while #8143 uses agent_checkpoints.json. Consider consolidating.

@chrisworksai chrisworksai closed this by deleting the head repository May 13, 2026