feat: implement checkpoint recovery strategy #201
Closed
Labels
- prio:high: Important, should be prioritized
- prio:medium: Should do, but not blocking
- scope:large: 3+ days of work
- spec:agent-system: DESIGN_SPEC Section 3 - Agent System
- spec:task-workflow: DESIGN_SPEC Section 6 - Task & Workflow Engine
- type:feature: New feature implementation
Description
Summary
The spec defines a Checkpoint Recovery strategy (§6.6 Strategy 2) that persists AgentContext snapshots after each completed turn and resumes from the last checkpoint after a crash. Currently, only the Fail-and-Reassign strategy is implemented.
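The persist-then-resume cycle described above could look roughly like the sketch below. It is illustrative only: the `AgentContext` fields, the `save_checkpoint` / `load_checkpoint` helpers, and the one-file-per-task layout are assumptions rather than anything taken from the spec. To stay dependency-free it uses a dataclass with `json.dumps` as a stand-in for serializing the real model.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import List, Optional


@dataclass
class AgentContext:
    """Hypothetical stand-in for the project's AgentContext."""
    task_id: str
    turn: int = 0
    messages: List[str] = field(default_factory=list)


def save_checkpoint(ctx: AgentContext, directory: Path) -> Path:
    """Persist a snapshot after a completed turn (assumed helper).

    Writes to a temporary file first and then renames it, so a crash
    mid-write leaves the previous checkpoint intact.
    """
    path = directory / f"{ctx.task_id}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(ctx)))
    tmp.replace(path)  # rename over the old checkpoint
    return path


def load_checkpoint(task_id: str, directory: Path) -> Optional[AgentContext]:
    """Return the last persisted snapshot, or None if there is none."""
    path = directory / f"{task_id}.json"
    if not path.exists():
        return None
    return AgentContext(**json.loads(path.read_text()))
```

On restart, a caller would try `load_checkpoint(task_id, directory)` first and start from a fresh context only when it returns `None`.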
Design Spec Reference
- §6.6 Agent Crash Recovery — Strategy 2: Checkpoint Recovery
Scope
- Per-turn AgentContext checkpoint persistence (leverages model_dump_json())
- Heartbeat-based failure detection
- Resume from last checkpoint with environment reconciliation
- Max resume attempts before falling back to fail-and-reassign
- RecoveryStrategy protocol compliance
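As a rough illustration of how heartbeat-based detection and the resume cap from the scope list might interact, here is a minimal sketch. Every name in it (`HeartbeatMonitor`, `RecoveryAction`, `timeout_s`, `max_resume_attempts`) and the default values are hypothetical. It does not model the RecoveryStrategy protocol or environment reconciliation, whose details are not given here.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class RecoveryAction(Enum):
    NONE = "none"                      # agent is still considered alive
    RESUME = "resume_from_checkpoint"  # restart from the last snapshot
    REASSIGN = "fail_and_reassign"     # give up and fall back


@dataclass
class HeartbeatMonitor:
    """Hypothetical failure detector with a cap on resume attempts."""
    timeout_s: float = 30.0        # assumed value, not from the spec
    max_resume_attempts: int = 3   # assumed value, not from the spec
    _last_seen: Dict[str, float] = field(default_factory=dict)
    _attempts: Dict[str, int] = field(default_factory=dict)

    def beat(self, agent_id: str, now: Optional[float] = None) -> None:
        """Record a heartbeat from a running agent."""
        self._last_seen[agent_id] = time.monotonic() if now is None else now

    def decide(self, agent_id: str, now: Optional[float] = None) -> RecoveryAction:
        """Choose a recovery action based on heartbeat age and past resumes."""
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(agent_id)
        if last is None or now - last <= self.timeout_s:
            return RecoveryAction.NONE
        attempts = self._attempts.get(agent_id, 0)
        if attempts >= self.max_resume_attempts:
            return RecoveryAction.REASSIGN
        self._attempts[agent_id] = attempts + 1
        self._last_seen[agent_id] = now  # restart the clock for the resumed agent
        return RecoveryAction.RESUME
```

The `now` parameter exists only so the timing logic can be exercised deterministically; in normal use both methods fall back to `time.monotonic()`.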