feat(gateway): atomic active-session marker for precise post-crash recovery#20776
Closed
chrisworksai wants to merge 1 commit into
Closed
feat(gateway): atomic active-session marker for precise post-crash recovery#20776chrisworksai wants to merge 1 commit into
chrisworksai wants to merge 1 commit into
Conversation
…covery
Today suspend_recently_active() uses a 120-second time-based sweep:
any session updated within the cutoff is marked resume_pending. This
catches the in-flight sessions but also catches false-positives —
sessions that received a message recently but had already finished
processing it. Each false-positive forces an unnecessary auto-resume
on the user's next message.
This change adds an optional, more precise targeting path:
1. SessionStore.save_active_sessions(active_session_keys) writes
a `.active_sessions_at_shutdown` JSON marker atomically
(tmpfile + fsync + os.replace) into sessions_dir.
2. The gateway shutdown path calls save_active_sessions(set(
self._running_agents.keys())) right before draining, so the
marker captures exactly the sessions that were mid-turn.
3. suspend_recently_active() now checks for the marker first.
If present: only sessions in the marker are marked
resume_pending, and the marker is consumed (deleted).
If absent (real crash, OOM kill, power loss): falls back to
the existing time-based sweep — no behavioural change for
unclean exits.
The resume_pending semantics are unchanged — just the targeting is
sharper for the clean-shutdown / fast-restart cases that account for
the bulk of restarts. No new fields on SessionEntry, no migration
needed; the marker file is in sessions_dir alongside the existing
session JSON.
Collaborator
19 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an optional active-session marker file so
suspend_recently_active()can precisely target sessions that were mid-turn at shutdown, instead of broadly sweeping anything updated in the last 120 seconds. Resume semantics are unchanged — only the targeting is sharper.Why
suspend_recently_active()today uses a 120s time-based sweep (#7536). It catches the in-flight sessions correctly, but it also catches false-positives: sessions that received a message recently but had already finished processing it before the gateway stopped. Each false-positive forces an unnecessary auto-resume on the user's next message.For clean stops / fast restarts (which account for the bulk of restarts in practice), the gateway already knows exactly which sessions had running agents at shutdown —
self._running_agents. Persisting that set across the restart lets the next startup mark the precise set instead of guessing from timestamps.For unclean exits (crash, OOM, power loss), the marker is absent and the existing time-based sweep runs unchanged.
Implementation
Three pieces:
SessionStore.save_active_sessions(active_session_keys: set)— new method that writes a.active_sessions_at_shutdownJSON marker atomically (tmpfile + fsync + os.replace) intosessions_dir. Atomic write so a crash mid-write can't leave a half-written marker that the next startup misinterprets.Gateway shutdown call site in
GatewayRunner.stop()— right after_notify_active_sessions_of_shutdown()and before_drain_active_agents(), persistset(self._running_agents.keys()). Wrapped intry/except-with-warning-log so a marker-write failure can't block a shutdown that was already in progress.SessionStore.suspend_recently_active()— check for the marker first. If present: mark only the sessions in the marker asresume_pending, then unlink the marker. If absent: existing time-based sweep runs unchanged.Why these specific design choices
sessions_dirrequires no schema migration, nofrom_dict/to_dictchanges, and is trivially inspectable / removable for debugging.open()andwrite()would otherwise leave a marker that says "no sessions were active" — worse than no marker, because the time-based fallback wouldn't run. tmpfile + os.replace eliminates that window.resume_pending=True → auto-resumeget exactly that behaviour, just with fewer false-positives. No new fields onSessionEntry. No config option. Pure improvement on the existing contract.Compatibility
SessionStoreoutside the gateway lifecycle) gets the existing time-based sweep — fully backward-compatible.sessions_dirand prefixed with.so it doesn't show up in normal listings; it never collides with a session JSON file.Test plan
resume_pendingresume_pendingandsuspendedentries skipped (existing behaviour preserved)🤖 Generated with Claude Code