
Implement crash recovery with fail-and-reassign strategy (DESIGN_SPEC §6.6) #129

@Aureliolo

Description

Context

When an agent execution fails unexpectedly (unhandled exception, OOM, process kill), the framework needs a recovery mechanism. The MVP implements the fail-and-reassign strategy behind a RecoveryStrategy protocol, enabling future addition of checkpoint-based recovery without modifying existing code.

Acceptance Criteria

RecoveryStrategy Protocol

  • RecoveryStrategy protocol defined with recover(task, error, context) method
  • Protocol is pluggable — new strategies can be registered via config
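
The two criteria above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `RecoveryResult`, `register_strategy`, `strategy_from_config`, the `recovery_strategy` config key, and `FailOnlyStrategy` are all hypothetical names introduced here. Only `RecoveryStrategy` and `recover(task, error, context)` come from the issue.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Protocol, runtime_checkable


@dataclass
class RecoveryResult:
    """Hypothetical return type; the issue does not specify one."""
    task_id: str
    can_reassign: bool


@runtime_checkable
class RecoveryStrategy(Protocol):
    """Structural interface: any object with a matching recover() qualifies."""

    def recover(self, task: Any, error: Exception, context: Any) -> RecoveryResult:
        ...


# Hypothetical registry: maps a name used in config to a strategy factory.
_STRATEGIES: dict[str, Callable[[], RecoveryStrategy]] = {}


def register_strategy(name: str, factory: Callable[[], RecoveryStrategy]) -> None:
    _STRATEGIES[name] = factory


def strategy_from_config(config: dict[str, Any]) -> RecoveryStrategy:
    # "recovery_strategy" is an assumed config key, not one named in the issue.
    name = config.get("recovery_strategy", "fail_and_reassign")
    if name not in _STRATEGIES:
        raise ValueError(f"unknown recovery strategy: {name!r}")
    return _STRATEGIES[name]()


class FailOnlyStrategy:
    """Placeholder strategy used only to exercise the registry."""

    def recover(self, task: Any, error: Exception, context: Any) -> RecoveryResult:
        return RecoveryResult(task_id=str(task), can_reassign=True)


register_strategy("fail_and_reassign", FailOnlyStrategy)
```

Because the protocol is structural, a later checkpoint-based strategy would only need to implement `recover` and register itself under a new name; callers that go through `strategy_from_config` would not have to change.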

Fail-and-Reassign Strategy (Default / MVP)

  • Engine catches failures at its outermost boundary
  • Redacted AgentContext snapshot logged on failure (turn count, accumulated cost — no message contents)
  • Task transitions to FAILED status
  • Failed tasks are available for reassignment (manual or automatic via task router)
  • max_retries configurable per task (default: 1)
  • Reassignment respects retry count — tasks exceeding max retries stay FAILED
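
One possible shape for this flow is sketched below. Everything beyond the names in the issue is an assumption: the `Task` and `AgentContext` fields, the `run_task` and `reassign` helpers, the `COMPLETED` status, and the reading of `max_retries` as the number of reassignments allowed after the first failure. An in-process `except` can only see failures that surface as exceptions; a killed process would need detection from outside, which this sketch does not cover.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger("engine.recovery")  # hypothetical logger name


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"
    COMPLETED = "completed"  # assumed; not listed in the issue


@dataclass
class Task:
    id: str
    status: TaskStatus = TaskStatus.ASSIGNED
    max_retries: int = 1  # default from the issue
    retry_count: int = 0  # assumed counter of reassignments so far


@dataclass
class AgentContext:
    turn_count: int = 0
    accumulated_cost: float = 0.0
    messages: list = field(default_factory=list)

    def redacted_snapshot(self) -> dict:
        # Deliberately omits message contents.
        return {"turn_count": self.turn_count,
                "accumulated_cost": self.accumulated_cost}


class FailAndReassignStrategy:
    def recover(self, task: Task, error: Exception, context: AgentContext) -> bool:
        logger.error("task %s failed with %s; context=%s",
                     task.id, type(error).__name__, context.redacted_snapshot())
        task.status = TaskStatus.FAILED
        # True while the task is still eligible for reassignment.
        return task.retry_count < task.max_retries


def run_task(task, context, agent_fn, strategy) -> None:
    """Outermost engine boundary: any exception from the agent lands here."""
    task.status = TaskStatus.IN_PROGRESS
    try:
        agent_fn(task, context)
    except Exception as exc:
        strategy.recover(task, exc, context)
    else:
        task.status = TaskStatus.COMPLETED


def reassign(task: Task) -> bool:
    """Move a FAILED task back to ASSIGNED unless its retries are used up."""
    if task.status is not TaskStatus.FAILED or task.retry_count >= task.max_retries:
        return False
    task.retry_count += 1
    task.status = TaskStatus.ASSIGNED
    return True
```

With the default `max_retries` of 1, a task in this sketch can be reassigned once; a second failure leaves it in FAILED and `reassign` returns `False`.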

New TaskStatus: FAILED

  • FAILED added to TaskStatus enum as a non-terminal state
  • Valid transitions updated: IN_PROGRESS → FAILED, FAILED → ASSIGNED (reassignment)
  • FAILED differs from CANCELLED (terminal) — failed tasks are eligible for reassignment
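
The transition rules above can be expressed as a small lookup table. This is a sketch under assumptions: only the statuses mentioned in this issue are included, the `ASSIGNED → IN_PROGRESS` edge is assumed rather than stated, and `VALID_TRANSITIONS`, `is_terminal`, and `transition` are hypothetical names.

```python
from enum import Enum


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Allowed target states per source state. Other statuses and edges that the
# real enum may have are omitted because this issue does not list them.
VALID_TRANSITIONS = {
    TaskStatus.ASSIGNED: {TaskStatus.IN_PROGRESS},  # assumed edge
    TaskStatus.IN_PROGRESS: {TaskStatus.FAILED},    # from the issue
    TaskStatus.FAILED: {TaskStatus.ASSIGNED},       # reassignment, from the issue
    TaskStatus.CANCELLED: set(),                    # terminal
}


def is_terminal(status: TaskStatus) -> bool:
    """A state is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]


def transition(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return the target state, or raise if the move is not allowed."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.name} -> {target.name}")
    return target
```

In this table FAILED has one outgoing edge and CANCELLED has none, which is the distinction the last bullet draws.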

Testing

  • Unit tests for RecoveryStrategy protocol
  • Unit tests for fail-and-reassign with mock agent failures
  • Integration test: agent fails → task marked FAILED → reassignment available

Dependencies

Design Spec Reference

  • §6.6 — Agent Crash Recovery (Strategy 1: Fail-and-Reassign)

Metadata

Assignees

No one assigned

    Labels

    prio:high (Important, should be prioritized)
    scope:medium (1-3 days of work)
    spec:agent-system (DESIGN_SPEC Section 3 - Agent System)
    spec:task-workflow (DESIGN_SPEC Section 6 - Task & Workflow Engine)
    type:feature (New feature implementation)
    type:test (Test coverage, test infrastructure)
