Implement crash recovery with fail-and-reassign strategy (DESIGN_SPEC §6.6) #129
Closed
Labels
- prio:high (Important, should be prioritized)
- scope:medium (1-3 days of work)
- spec:agent-system (DESIGN_SPEC Section 3 - Agent System)
- spec:task-workflow (DESIGN_SPEC Section 6 - Task & Workflow Engine)
- type:feature (New feature implementation)
- type:test (Test coverage, test infrastructure)
Description
Context
When an agent execution fails unexpectedly (unhandled exception, OOM, process kill), the framework needs a recovery mechanism. The MVP implements the fail-and-reassign strategy behind a RecoveryStrategy protocol, enabling future addition of checkpoint-based recovery without modifying existing code.
Acceptance Criteria
RecoveryStrategy Protocol
- RecoveryStrategy protocol defined with recover(task, error, context) method
- Protocol is pluggable: new strategies can be registered via config
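A minimal sketch of how the protocol and config-driven registration could look, assuming Python's typing.Protocol. Only RecoveryStrategy and recover(task, error, context) come from this issue; Task, AgentContext, RecoveryResult, STRATEGIES, register_strategy, strategy_from_config, the "recovery_strategy" config key and NoopStrategy are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Protocol, runtime_checkable


@dataclass
class Task:
    """Hypothetical stand-in for the framework's task model."""
    id: str


@dataclass
class AgentContext:
    """Hypothetical stand-in for the agent's execution context."""
    turn_count: int = 0
    accumulated_cost: float = 0.0


@dataclass
class RecoveryResult:
    """Hypothetical outcome reported by a strategy."""
    task_id: str
    can_reassign: bool


@runtime_checkable
class RecoveryStrategy(Protocol):
    """Any object with a matching recover() method satisfies the protocol."""

    def recover(
        self, task: Task, error: BaseException, context: AgentContext
    ) -> RecoveryResult:
        ...


# Hypothetical registry: maps a config name to a strategy factory.
STRATEGIES: dict[str, Callable[[], RecoveryStrategy]] = {}


def register_strategy(name: str, factory: Callable[[], RecoveryStrategy]) -> None:
    STRATEGIES[name] = factory


def strategy_from_config(config: dict) -> RecoveryStrategy:
    """Look up the strategy named in config (the key name is an assumption)."""
    name = config.get("recovery_strategy", "fail_and_reassign")
    try:
        return STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"unknown recovery strategy: {name!r}") from None


class NoopStrategy:
    """Toy strategy used only to demonstrate registration."""

    def recover(self, task, error, context):
        return RecoveryResult(task_id=task.id, can_reassign=False)


register_strategy("noop", NoopStrategy)
strategy = strategy_from_config({"recovery_strategy": "noop"})
```

Because the protocol is structural, a future checkpoint-based strategy would only need to expose a compatible recover() method and be registered under a new name; existing strategies stay untouched.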
Fail-and-Reassign Strategy (Default / MVP)
- Engine catches failures at its outermost boundary
- Redacted AgentContext snapshot logged on failure (turn count, accumulated cost; no message contents)
- Task transitions to FAILED status
- Failed tasks are available for reassignment (manual or automatic via task router)
- max_retries configurable per task (default: 1)
- Reassignment respects retry count: tasks exceeding max retries stay FAILED
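The criteria above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the field names (turn_count, accumulated_cost, messages, retry_count), the helpers redacted_snapshot, reassign and run_task, and the reading of max_retries as "number of reassignments allowed" are all hypothetical. The sketch only covers exceptions raised in-process; a hard process kill would need detection outside the engine.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger("recovery")


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"


@dataclass
class Task:
    id: str
    status: TaskStatus = TaskStatus.IN_PROGRESS
    retry_count: int = 0
    max_retries: int = 1  # default from the acceptance criteria


@dataclass
class AgentContext:
    turn_count: int = 0
    accumulated_cost: float = 0.0
    messages: list = field(default_factory=list)  # must never be logged


def redacted_snapshot(context):
    """Keep only non-sensitive counters; message contents are dropped."""
    return {
        "turn_count": context.turn_count,
        "accumulated_cost": context.accumulated_cost,
    }


class FailAndReassignStrategy:
    def recover(self, task, error, context):
        # Log the error type only: the error text could echo message contents.
        logger.error(
            "task %s failed with %s; context=%s",
            task.id, type(error).__name__, redacted_snapshot(context),
        )
        task.status = TaskStatus.FAILED
        return task.status


def reassign(task):
    """Move a FAILED task back to ASSIGNED if its retry budget allows it."""
    if task.status is not TaskStatus.FAILED:
        return False
    if task.retry_count >= task.max_retries:
        return False  # retries exhausted: the task stays FAILED
    task.retry_count += 1
    task.status = TaskStatus.ASSIGNED
    return True


def run_task(task, agent_fn, context, strategy):
    """Outermost engine boundary: unhandled exceptions go to the strategy."""
    try:
        agent_fn(task, context)
    except Exception as exc:
        return strategy.recover(task, exc, context)
    return task.status
```

With the default max_retries of 1, a task that fails once can be reassigned a single time; if it fails again, reassign() returns False and the task remains FAILED.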
New TaskStatus: FAILED
- FAILED added to TaskStatus enum as a non-terminal state
- Valid transitions updated: IN_PROGRESS → FAILED, FAILED → ASSIGNED (reassignment)
- FAILED differs from CANCELLED (terminal): failed tasks are eligible for reassignment
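One way to express the new state and its transitions is a lookup table keyed by status. The sketch below includes only the statuses named in this issue; the ASSIGNED → IN_PROGRESS edge and the VALID_TRANSITIONS, is_terminal and transition names are assumptions, and the real enum presumably has further members.

```python
from enum import Enum


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"        # non-terminal: eligible for reassignment
    CANCELLED = "cancelled"  # terminal


# Partial table: only the edges relevant to crash recovery are shown.
VALID_TRANSITIONS = {
    TaskStatus.ASSIGNED: {TaskStatus.IN_PROGRESS},  # assumed, not from this issue
    TaskStatus.IN_PROGRESS: {TaskStatus.FAILED},
    TaskStatus.FAILED: {TaskStatus.ASSIGNED},       # reassignment
    TaskStatus.CANCELLED: set(),
}


def is_terminal(status):
    """A status is terminal when no outgoing transition is defined for it."""
    return not VALID_TRANSITIONS[status]


def transition(current, target):
    """Return the target status, or raise if the move is not allowed."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.name} -> {target.name}")
    return target
```

Here is_terminal(TaskStatus.FAILED) is False while is_terminal(TaskStatus.CANCELLED) is True, which captures the distinction drawn in the last bullet.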
Testing
- Unit tests for RecoveryStrategy protocol
- Unit tests for fail-and-reassign with mock agent failures
- Integration test: agent fails → task marked FAILED → reassignment available
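A mock agent failure could be simulated with unittest.mock, as in the sketch below. The execute and can_reassign helpers, the dict-based task and the COMPLETED status are hypothetical stand-ins so the example runs on its own; real tests would exercise the engine and the strategy directly.

```python
from unittest.mock import Mock


def execute(task, agent):
    """Hypothetical engine entry point: mark the task FAILED on any agent exception."""
    try:
        agent.run(task)
        task["status"] = "COMPLETED"  # assumed success status
    except Exception:
        task["status"] = "FAILED"
    return task


def can_reassign(task):
    """A FAILED task is reassignable while its retry budget is not used up."""
    return task["status"] == "FAILED" and task["retries"] < task["max_retries"]


def test_agent_failure_marks_task_failed_and_reassignable():
    agent = Mock()
    agent.run.side_effect = RuntimeError("simulated crash")
    task = {"id": "t1", "status": "IN_PROGRESS", "retries": 0, "max_retries": 1}

    execute(task, agent)

    agent.run.assert_called_once()
    assert task["status"] == "FAILED"
    assert can_reassign(task)


def test_exhausted_retries_stay_failed():
    agent = Mock()
    agent.run.side_effect = MemoryError()
    task = {"id": "t2", "status": "IN_PROGRESS", "retries": 1, "max_retries": 1}

    execute(task, agent)

    assert task["status"] == "FAILED"
    assert not can_reassign(task)
```

Both functions follow pytest's test_ naming convention, so they can be collected as-is or called directly.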
Dependencies
- Implement agent engine core with ExecutionLoop protocol integration (DESIGN_SPEC §3.1, §6.1, §6.5) #11 — Agent engine core
- Implement single-task execution lifecycle (assign, execute, complete) #21 — Task execution lifecycle (status transitions)
Design Spec Reference
- §6.6 — Agent Crash Recovery (Strategy 1: Fail-and-Reassign)