feat(gateway): add crash checkpoint for precise session recovery #8143
Open
chinadbo wants to merge 3 commits into
This would solve session crash issues, please merge!
9aa85f3 to 63d5af7
Persist in-flight agent runs to agent_checkpoints.json so that on restart the gateway can precisely identify interrupted sessions instead of relying solely on the suspend_recently_active time-window heuristic. Sessions found in the checkpoint are suspended and the checkpoint is cleared; the time-window heuristic remains as fallback.
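The recovery flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `CheckpointSketch`, the method `take_interrupted`, and the `recover` helper with its `suspend` and `fallback` callbacks are all hypothetical, and the on-disk layout (a JSON object keyed by session id) is an assumption. Only `mark_running`, `mark_completed`, and the file name `agent_checkpoints.json` come from the PR text.

```python
import json
import time
from pathlib import Path


class CheckpointSketch:
    """Hypothetical stand-in for the PR's checkpoint class (not the actual code)."""

    def __init__(self, path):
        self.path = Path(path)

    def _read(self):
        # A missing or corrupt file is treated as "no in-flight runs".
        try:
            return json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return {}

    def _write(self, data):
        # Plain write for brevity; durability is out of scope for this sketch.
        self.path.write_text(json.dumps(data))

    def mark_running(self, session_id):
        data = self._read()
        data[session_id] = {"started_at": time.time()}
        self._write(data)

    def mark_completed(self, session_id):
        data = self._read()
        if data.pop(session_id, None) is not None:
            self._write(data)

    def take_interrupted(self):
        """Return session ids still marked in-flight, then clear the file."""
        interrupted = sorted(self._read())
        self._write({})
        return interrupted


def recover(checkpoint, suspend, fallback):
    """Suspend checkpointed sessions first, then run the fallback heuristic."""
    interrupted = checkpoint.take_interrupted()
    for session_id in interrupted:
        suspend(session_id)
    fallback()  # assumed to stand in for the time-window heuristic
    return interrupted
```

On startup, a gateway using this shape would call `recover(...)` once: any session id still present in the file was mid-run when the process died, so it can be suspended directly without guessing from timestamps.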
- Fix startup ImportError: replace non-existent HERMES_HOME import with module-level _hermes_home variable (gateway was failing to start)
- Gate mark_completed on generation ownership: stale runs no longer clear the checkpoint entry when a newer generation owns the slot
- Add fsync + unique mkstemp to _write: checkpoint is now durable across power failures and safe under concurrent gateway instances
- Add defensive mark_completed on /stop sentinel fast-path for future-proofing if mark_running timing ever shifts
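Two of the fixes above, the durable write and the generation gate, follow well-known patterns that can be sketched as below. The function names `write_durably` and `mark_completed_if_owner`, the `.ckpt-` temp-file prefix, and the `{"generation": n}` entry layout are assumptions made for illustration; the PR's actual `_write` and `mark_completed` may differ.

```python
import json
import os
import tempfile


def write_durably(path, data):
    """Write JSON to a unique temp file, fsync it, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp yields a unique name, so concurrent writers never share a temp file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # atomic swap within the same directory
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def mark_completed_if_owner(entries, session_id, generation):
    """Remove the entry only when `generation` still owns the slot."""
    current = entries.get(session_id)
    if current is None or current["generation"] != generation:
        return False  # a newer run owns the slot; leave its entry alone
    del entries[session_id]
    return True
```

Creating the temp file in the target's own directory keeps `os.replace` on a single filesystem, which is what makes the rename atomic. This sketch does not fsync the containing directory, which some filesystems need for the rename itself to survive a power failure.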
- Clear checkpoint on clean shutdown to prevent stale entries accumulating
- Add mark_completed after stale-eviction and interrupt-clear paths
- Move tempfile import to module level
19fa0c0 to cfb0764
Summary
- SessionCrashCheckpoint class that persists in-flight agent runs to agent_checkpoints.json
- suspend_recently_active() time-window heuristic remains as fallback for sessions not tracked by the checkpoint

Test plan
tests/gateway/test_session_crash_recovery.py