feat(gateway): add crash checkpoint for precise session recovery #8143

Open
chinadbo wants to merge 3 commits into NousResearch:main from chinadbo:feat/session-state-recovery-on-crash

@chinadbo

Contributor

Summary

  • Added a SessionCrashCheckpoint class that persists in-flight agent runs to agent_checkpoints.json
  • On agent start, the session key and ID are recorded; on clean completion, the entry is removed
  • On gateway restart, the checkpoint is read to precisely identify interrupted sessions, which are then suspended
  • The existing suspend_recently_active() time-window heuristic remains as a fallback for sessions not tracked by the checkpoint
  • Checkpoint is cleared after processing so it doesn't accumulate stale entries
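
For readers skimming the summary, the lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name, the file name, and the mark_running / mark_completed / clear names come from the description, while the constructor argument, the `load` method, and the `{session_key: session_id}` JSON layout are assumptions made for the example.

```python
import json
from pathlib import Path


class SessionCrashCheckpoint:
    """Illustrative sketch only; the class in this PR may be structured differently."""

    def __init__(self, directory: Path):
        # File name is taken from the PR summary; the directory argument is assumed.
        self.path = Path(directory) / "agent_checkpoints.json"

    def load(self) -> dict:
        """Return {session_key: session_id}; a missing or corrupt file reads as empty."""
        try:
            data = json.loads(self.path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # OSError covers a nonexistent file, ValueError covers corrupted JSON.
            return {}
        return data if isinstance(data, dict) else {}

    def _save(self, entries: dict) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(entries), encoding="utf-8")

    def mark_running(self, session_key: str, session_id: str) -> None:
        """Record an in-flight agent run when the agent starts."""
        entries = self.load()
        entries[session_key] = session_id
        self._save(entries)

    def mark_completed(self, session_key: str) -> None:
        """Remove the entry on clean completion; unknown keys are ignored."""
        entries = self.load()
        if entries.pop(session_key, None) is not None:
            self._save(entries)

    def clear(self) -> None:
        """Drop all entries, e.g. after restart processing."""
        self._save({})
```

Under these assumptions, any key still present in `load()` after a restart belongs to a run that started but never reached `mark_completed`, which is what lets the gateway identify interrupted sessions precisely.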

Test plan

  • 11 tests in tests/gateway/test_session_crash_recovery.py
  • Checkpoint write/read cycle for running and completed sessions
  • Edge cases: nonexistent file, corrupted JSON, mark_completed on unknown key, clear
  • Integration: crash simulation detects interrupted sessions, clean shutdown leaves empty checkpoint

@849497911-max

This would solve session crash issues, please merge!

@chinadbo force-pushed the feat/session-state-recovery-on-crash branch from 9aa85f3 to 63d5af7 on April 27, 2026 07:03
@alt-glitch added the type/feature (New feature or request), P3 (Low: cosmetic, nice to have), and comp/gateway (Gateway runner, session dispatch, delivery) labels on Apr 27, 2026
Persist in-flight agent runs to agent_checkpoints.json so that on
restart the gateway can precisely identify interrupted sessions instead
of relying solely on the suspend_recently_active time-window heuristic.
Sessions found in the checkpoint are suspended and the checkpoint is
cleared; the time-window heuristic remains as fallback.
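
The restart flow described in this commit message might look something like the sketch below. It is a hedged illustration: `recover_on_restart`, the `suspend` callback, and the `fallback` callback (standing in for the suspend_recently_active heuristic, whose real signature is not shown in this PR page) are hypothetical names, and the checkpoint is assumed to be a JSON object keyed by session key.

```python
import json
from pathlib import Path
from typing import Callable


def recover_on_restart(
    checkpoint_path: Path,
    suspend: Callable[[str], None],
    fallback: Callable[[set], None],
) -> set:
    """Suspend sessions listed in the checkpoint, clear it, then run the fallback.

    All names here are hypothetical; only the overall order of steps is taken
    from the commit message.
    """
    try:
        entries = json.loads(checkpoint_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        entries = {}  # missing or corrupt checkpoint: nothing is known precisely
    if not isinstance(entries, dict):
        entries = {}

    interrupted = set(entries)
    for session_key in sorted(interrupted):
        suspend(session_key)

    # Clear the checkpoint so stale entries do not accumulate across restarts.
    checkpoint_path.parent.mkdir(parents=True, exist_ok=True)
    checkpoint_path.write_text("{}", encoding="utf-8")

    # The time-window heuristic still runs for sessions the checkpoint missed;
    # passing the already-handled keys lets it skip them (an assumed design).
    fallback(interrupted)
    return interrupted
```

Passing the precisely identified keys to the fallback is one possible way to avoid suspending the same session twice; the PR does not state how the two mechanisms are deduplicated.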
- Fix startup ImportError: replace non-existent HERMES_HOME import with
  module-level _hermes_home variable (gateway was failing to start)
- Gate mark_completed on generation ownership: stale runs no longer
  clear the checkpoint entry when a newer generation owns the slot
- Add fsync + unique mkstemp to _write: checkpoint is now durable
  across power failures and safe under concurrent gateway instances
- Add defensive mark_completed on /stop sentinel fast-path for
  future-proofing if mark_running timing ever shifts
- Clear checkpoint on clean shutdown to prevent stale entries accumulating
- Add mark_completed after stale-eviction and interrupt-clear paths
- Move tempfile import to module level
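
Two of the fixes above, the fsync plus unique mkstemp write and the generation-gated mark_completed, can be illustrated with a short sketch. This is not the PR's `_write` implementation: the function names, the `"generation"` field, and the entry layout are assumptions chosen for the example, and only standard-library calls are used.

```python
import json
import os
import tempfile


def write_checkpoint_atomically(path: str, entries: dict) -> None:
    """Write JSON to a unique temp file, fsync it, then atomically replace `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # mkstemp yields a unique name, so concurrent writers never share a temp file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".agent_checkpoints.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(entries, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # atomic rename within the same directory
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def mark_completed_if_owner(entries: dict, session_key: str, generation: int) -> bool:
    """Remove the entry only if the caller's generation still owns the slot.

    The {"session_id": ..., "generation": ...} layout is hypothetical.
    """
    entry = entries.get(session_key)
    if entry is None or entry.get("generation") != generation:
        return False  # a stale run must not clear a newer generation's entry
    del entries[session_key]
    return True
```

Creating the temp file in the target directory keeps the final `os.replace` on one filesystem, which is what makes the swap atomic. Whether the PR also fsyncs the containing directory is not stated here, so the sketch leaves that out.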
3 participants