feat: implement checkpoint recovery strategy #201

@Aureliolo

Summary

The spec defines a Checkpoint Recovery strategy (§6.6 Strategy 2) that persists an AgentContext snapshot after each completed turn and, after a crash, resumes from the last checkpoint. Currently only the Fail-and-Reassign strategy is implemented.
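To make the per-turn persistence idea concrete, here is a minimal sketch. It is not the project's implementation: `CheckpointStore`, `run_turn`, and the fields on this stand-in `AgentContext` are hypothetical names chosen for illustration, and the stand-in uses a stdlib dataclass with `json` so the example runs without dependencies. In the real code the snapshot would presumably come from `model_dump_json()` on the actual `AgentContext`.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class AgentContext:
    """Hypothetical stand-in for the real AgentContext (fields are assumed)."""
    task_id: str
    turn: int = 0
    messages: list = field(default_factory=list)


class CheckpointStore:
    """Hypothetical file-backed store: one JSON snapshot per task."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, task_id: str) -> Path:
        return self.root / f"{task_id}.json"

    def save(self, ctx: AgentContext) -> None:
        # Write to a temp file first, then rename over the old snapshot,
        # so a crash mid-write does not leave a truncated checkpoint.
        final = self._path(ctx.task_id)
        tmp = final.with_suffix(".tmp")
        tmp.write_text(json.dumps(asdict(ctx)))
        tmp.replace(final)

    def load(self, task_id: str) -> Optional[AgentContext]:
        # Return the last saved snapshot, or None if no checkpoint exists.
        path = self._path(task_id)
        if not path.exists():
            return None
        return AgentContext(**json.loads(path.read_text()))


def run_turn(ctx: AgentContext, store: CheckpointStore, message: str) -> None:
    """Complete one turn, then checkpoint the updated context."""
    ctx.messages.append(message)
    ctx.turn += 1
    store.save(ctx)
```

After a crash, a new worker would call `store.load(task_id)` and continue from the returned context. At most the turn that was in progress is lost, because the snapshot is only written once a turn completes.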

Design Spec Reference

  • §6.6 Agent Crash Recovery — Strategy 2: Checkpoint Recovery

Scope

  • Per-turn AgentContext checkpoint persistence (leverages model_dump_json())
  • Heartbeat-based failure detection
  • Resume from last checkpoint with environment reconciliation
  • Max resume attempts before falling back to fail-and-reassign
  • RecoveryStrategy protocol compliance
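Two of the scope items above, heartbeat-based failure detection and the resume-attempt limit, can be sketched as follows. This is an illustration under assumptions, not the spec's design: `HeartbeatMonitor`, `CheckpointRecoveryPolicy`, `RecoveryAction`, the timeout, and the default of two attempts are all hypothetical. The actual `RecoveryStrategy` protocol signature is not shown in this issue, so the sketch does not try to implement it, and environment reconciliation is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RecoveryAction(Enum):
    RESUME_FROM_CHECKPOINT = "resume_from_checkpoint"
    FAIL_AND_REASSIGN = "fail_and_reassign"


@dataclass
class HeartbeatMonitor:
    """Hypothetical detector: an agent is considered crashed when its
    last heartbeat is older than timeout_seconds."""
    timeout_seconds: float
    last_beat: Optional[float] = None

    def beat(self, now: float) -> None:
        self.last_beat = now

    def is_stale(self, now: float) -> bool:
        if self.last_beat is None:
            return False  # agent never started; nothing to recover
        return now - self.last_beat > self.timeout_seconds


@dataclass
class CheckpointRecoveryPolicy:
    """Hypothetical policy: resume from a checkpoint up to
    max_resume_attempts times, then fall back to fail-and-reassign."""
    max_resume_attempts: int = 2  # assumed default, not from the spec
    resume_attempts: int = 0

    def decide(self, has_checkpoint: bool) -> RecoveryAction:
        if has_checkpoint and self.resume_attempts < self.max_resume_attempts:
            self.resume_attempts += 1
            return RecoveryAction.RESUME_FROM_CHECKPOINT
        # No checkpoint, or resume budget exhausted.
        return RecoveryAction.FAIL_AND_REASSIGN
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the logic deterministic and easy to test. A caller would typically supply `time.monotonic()`.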

Labels

  • prio:high (Important, should be prioritized)
  • prio:medium (Should do, but not blocking)
  • scope:large (3+ days of work)
  • spec:agent-system (DESIGN_SPEC Section 3 - Agent System)
  • spec:task-workflow (DESIGN_SPEC Section 6 - Task & Workflow Engine)
  • type:feature (New feature implementation)
