feat: implement crash recovery with fail-and-reassign strategy by Aureliolo · Pull Request #149 · Aureliolo/synthorg

Aureliolo · 2026-03-07T08:22:31Z

Summary

Implement crash recovery with FailAndReassignStrategy per DESIGN_SPEC §6.6 (closes #129)
Add FAILED as a non-terminal TaskStatus with transitions: IN_PROGRESS → FAILED, ASSIGNED → FAILED, FAILED → ASSIGNED
Add RecoveryStrategy protocol and FailAndReassignStrategy default implementation
Add RecoveryResult with @computed_field can_reassign (retry_count < max_retries)
Add Task.max_retries (default=1, ge=0) and TaskExecution.retry_count fields
Integrate recovery into AgentEngine — both normal error path and fatal error path
Add 4 new event constants for structured recovery logging
Update DESIGN_SPEC §6.1 lifecycle diagram, §6.2 task definition, §6.6 recovery steps, §15.3 project structure
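
The FAILED transitions listed above can be pictured as a small transition table. This is a minimal sketch, not the contents of `task_transitions.py`: the enum holds only statuses named in this PR, the string values and the two normal-path edges are assumptions, and `is_valid_transition` is a hypothetical helper.

```python
from enum import Enum


class TaskStatus(str, Enum):
    # Partial, illustrative subset; the real enum has more members.
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"
    COMPLETED = "completed"


# FAILED-related edges come from the PR description. The edges
# ASSIGNED -> IN_PROGRESS and IN_PROGRESS -> COMPLETED are assumed here.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    TaskStatus.ASSIGNED: frozenset({TaskStatus.IN_PROGRESS, TaskStatus.FAILED}),
    TaskStatus.IN_PROGRESS: frozenset({TaskStatus.COMPLETED, TaskStatus.FAILED}),
    TaskStatus.FAILED: frozenset({TaskStatus.ASSIGNED}),  # non-terminal
    TaskStatus.COMPLETED: frozenset(),  # terminal in this sketch
}


def is_valid_transition(current: TaskStatus, target: TaskStatus) -> bool:
    """Return True when the table allows moving from current to target."""
    return target in VALID_TRANSITIONS[current]
```

A FAILED task can only go back to ASSIGNED, so a retry always passes through reassignment rather than jumping straight to IN_PROGRESS or COMPLETED.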

Key design decisions

ASSIGNED → FAILED transition added (not just IN_PROGRESS → FAILED) to handle early setup failures before the task reaches IN_PROGRESS
can_reassign is a @computed_field (not stored) per project convention for derived values
strategy_type uses NotBlankStr per project convention for identifier fields
Recovery failures are logged but never block the error result (defensive, non-propagating)
MemoryError/RecursionError always re-raised through recovery
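
To illustrate the derived `can_reassign` decision, here is a dependency-free sketch. It uses a frozen stdlib dataclass with a property as a stand-in for the Pydantic `@computed_field`, and it flattens `retry_count` and `max_retries` onto one hypothetical class (`RecoveryResultSketch`); in the PR they live on `TaskExecution` and `Task`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryResultSketch:
    """Stdlib stand-in for the PR's Pydantic RecoveryResult (hypothetical)."""

    retry_count: int  # failure-reassignment cycles already used
    max_retries: int  # per-task budget (default 1 in the PR)

    @property
    def can_reassign(self) -> bool:
        # Derived on every access, so it cannot drift out of sync
        # with the two counters it depends on.
        return self.retry_count < self.max_retries


first_failure = RecoveryResultSketch(retry_count=0, max_retries=1)
second_failure = RecoveryResultSketch(retry_count=1, max_retries=1)
no_retries = RecoveryResultSketch(retry_count=0, max_retries=0)
```

With the default budget of 1, the first failure is reassignable and the second is not; `max_retries=0` disables reassignment entirely.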

Test plan

Unit tests for FailAndReassignStrategy (happy path, max retries exceeded, zero retries, snapshot redaction, error capture, logging events)
Unit tests for RecoveryStrategy protocol conformance
Unit tests for TaskExecution.retry_count (default, factory, transition preservation, FAILED non-terminal)
Unit tests for Task.max_retries (default, custom, zero, negative rejected)
Unit tests for AgentEngine recovery (provider error → FAILED, strategy invocation, failure swallowed, None bypass, timeout trigger, custom strategy, MemoryError propagation)
Unit tests for FAILED transitions (valid: IN_PROGRESS→FAILED, ASSIGNED→FAILED, FAILED→ASSIGNED; invalid: FAILED→COMPLETED, FAILED→IN_PROGRESS)
Integration test: full engine.run() with failing provider → FAILED status
All 2087 tests pass, 95.39% coverage
Pre-reviewed by 9 agents, 17 findings addressed

Closes #129

Add crash recovery per DESIGN_SPEC §6.6:
- Add FAILED status to TaskStatus enum (non-terminal, can reassign)
- Add IN_PROGRESS → FAILED and FAILED → ASSIGNED transitions
- Add max_retries field to Task model (default=1)
- Add retry_count field to TaskExecution runtime state
- Create RecoveryStrategy protocol + FailAndReassignStrategy default
- Wire recovery into AgentEngine (_apply_recovery) for all error paths
- Recovery captures redacted AgentContextSnapshot (no message contents)
- Recovery is async (forward-compatible with checkpoint recovery M4/M5)
- Recovery failure is swallowed (never blocks error result)
- Pass recovery_strategy=None to opt out

Closes #129

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pre-reviewed by 9 agents, 17 findings addressed:

Bug fix:
- Add ASSIGNED→FAILED to valid transitions (fixes early setup failures leaving tasks stuck in ASSIGNED status)

Type improvements:
- RecoveryResult.strategy_type: str → NotBlankStr
- RecoveryResult.can_reassign: stored Field → @computed_field
- Extract "fail_reassign" to FailAndReassignStrategy.STRATEGY_TYPE constant

Documentation:
- DESIGN_SPEC §6.1: update lifecycle diagram with FAILED status
- DESIGN_SPEC §6.2: add max_retries to task definition
- DESIGN_SPEC §6.6: fix "tool call history" wording, add M3 limitation note
- DESIGN_SPEC §15.3: add recovery.py to project structure
- Task/TaskExecution/TaskStatus docstrings: add missing fields

Tests:
- Add TaskStatus.FAILED value assertion
- Add ASSIGNED→FAILED valid transition test
- Add FAILED completed_at/is_terminal verification
- Add MemoryError propagation test for _apply_recovery
- Fix import block formatting in test_recovery.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-07T08:22:42Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

coderabbitai · 2026-03-07T08:22:44Z

Caution

Review failed

The pull request is closed.

📥 Commits

Reviewing files that changed from the base of the PR and between f9383eb and b25d85a.

📒 Files selected for processing (11)

DESIGN_SPEC.md
src/ai_company/core/task_transitions.py
src/ai_company/engine/agent_engine.py
src/ai_company/engine/cost_recording.py
src/ai_company/engine/recovery.py
tests/integration/engine/test_crash_recovery.py
tests/unit/core/test_task.py
tests/unit/engine/test_agent_engine_errors.py
tests/unit/engine/test_agent_engine_lifecycle.py
tests/unit/engine/test_recovery.py
tests/unit/engine/test_task_execution.py

📝 Walkthrough

Summary by CodeRabbit

New Features
- Crash-recovery system with a fail-and-reassign strategy and task-level max_retries for controlled retries.
- New non-terminal FAILED state (reassignable) and retry_count tracking on task executions.
Improvements
- Task lifecycle extended with FAILED-related transitions and BLOCKED semantics.
- Enhanced observability with recovery events and per-execution cost recording.
Documentation
- Design docs updated to describe recovery flow, semantics, and future milestones.

Walkthrough

Implements crash-recovery (MVP fail-and-reassign): adds non-terminal TaskStatus.FAILED, per-task max_retries, TaskExecution.retry_count, a pluggable RecoveryStrategy with FailAndReassignStrategy, integrates recovery into AgentEngine, captures redacted context snapshots, and adds events and tests exercising retry/reassignment eligibility.
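
The pluggable strategy described in this walkthrough might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `recovery.py`: `Execution` and `Result` are hypothetical stdlib stand-ins for the Pydantic models, the snapshot contents are invented for the example, and only the method names `recover` / `get_strategy_type` and the `"fail_reassign"` identifier come from the PR.

```python
import asyncio
from dataclasses import dataclass, replace
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Execution:  # hypothetical stand-in for TaskExecution
    status: str
    retry_count: int = 0
    max_retries: int = 1


@dataclass(frozen=True)
class Result:  # hypothetical stand-in for RecoveryResult
    execution: Execution
    strategy_type: str
    error_message: str
    snapshot: dict[str, int]

    @property
    def can_reassign(self) -> bool:
        return self.execution.retry_count < self.execution.max_retries


@runtime_checkable
class RecoveryStrategy(Protocol):
    async def recover(
        self, execution: Execution, error_message: str, messages: list[str]
    ) -> Result: ...

    def get_strategy_type(self) -> str: ...


class FailAndReassign:
    STRATEGY_TYPE = "fail_reassign"

    async def recover(
        self, execution: Execution, error_message: str, messages: list[str]
    ) -> Result:
        # Redacted snapshot: keep counts only, never message contents.
        snapshot = {"message_count": len(messages)}
        # Frozen state is never mutated; a FAILED copy is returned instead.
        failed = replace(execution, status="FAILED")
        return Result(failed, self.STRATEGY_TYPE, error_message, snapshot)

    def get_strategy_type(self) -> str:
        return self.STRATEGY_TYPE


result = asyncio.run(
    FailAndReassign().recover(Execution("IN_PROGRESS"), "provider error", ["secret"])
)
```

Because the protocol is `runtime_checkable`, a conformance test can use `isinstance`, which checks that the methods exist but not their signatures.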

Changes

Cohort / File(s)	Summary
Task Model & State Machine `src/ai_company/core/task.py`, `src/ai_company/core/enums.py`, `src/ai_company/core/task_transitions.py`	Add `TaskStatus.FAILED` (non-terminal), add `Task.max_retries: int = 1`, update valid transitions (ASSIGNED/IN_PROGRESS → FAILED; FAILED → ASSIGNED), and adjust assignment validation.
Recovery Architecture `src/ai_company/engine/recovery.py`	New `RecoveryStrategy` protocol, `RecoveryResult` model (includes can_reassign), and `FailAndReassignStrategy` implementation that snapshots context, marks TaskExecution as FAILED, and computes reassignment eligibility.
Agent Engine Integration `src/ai_company/engine/agent_engine.py`, `src/ai_company/engine/__init__.py`	AgentEngine accepts `recovery_strategy`, adds async `_handle_fatal_error` and `_apply_recovery`, integrates recovery invocation into error paths, and re-exports recovery types.
TaskExecution & Run Results `src/ai_company/engine/task_execution.py`	Add `TaskExecution.retry_count` (default 0) and update `from_task(..., retry_count=...)` factory to preserve retry count.
Observability & Costing `src/ai_company/observability/events/execution.py`, `src/ai_company/engine/cost_recording.py`	Add recovery-related events (EXECUTION_RECOVERY_START/COMPLETE/FAILED/SNAPSHOT). New cost recording helper `record_execution_costs` for per-turn cost submission with robust logging.
Tests — Unit & Integration `tests/unit/...`, `tests/integration/engine/test_crash_recovery.py`	Add extensive unit tests for recovery protocol, FailAndReassignStrategy, AgentEngine recovery flows, Task/transition tests for FAILED and max_retries, and an integration test validating failure → FAILED and reassignment eligibility.
Docs / Design `DESIGN_SPEC.md`	Document non-terminal FAILED state, crash-recovery flow, max_retries, RecoveryStrategy/RecoveryResult expectations, and milestone (M3/M4/M5) notes.

Sequence Diagram(s)

sequenceDiagram
    participant Agent as AgentEngine
    participant Provider as Provider
    participant Recovery as RecoveryStrategy
    participant TaskExec as TaskExecution

    Agent->>Provider: execute task
    Provider-->>Agent: error/exception
    Agent->>Agent: _handle_fatal_error()
    Agent->>Recovery: _apply_recovery(task_execution, error, context)
    activate Recovery
    Recovery->>TaskExec: capture redacted context snapshot
    Recovery->>TaskExec: set status -> FAILED, record error_message
    Recovery-->>Agent: RecoveryResult{can_reassign, context_snapshot}
    deactivate Recovery
    alt can_reassign
        Agent-->>Agent: task available for reassignment (retry_count < max_retries)
    else cannot_reassign
        Agent-->>Agent: task remains FAILED (no reassignment)
    end
    Agent-->>Caller: return AgentRunResult (ERROR)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat: implement single-task execution lifecycle (#21) #144: Modifies AgentEngine/run-result and recovery-related surface; likely overlaps on recovery integration and exported run result types.
feat: implement crash recovery with fail-and-reassign strategy #149: Implements crash-recovery features (FAILED state, max_retries, RecoveryStrategy, retry semantics) similar to this PR — strong code-level overlap.
feat: implement Task, Project, Artifact, Budget, and Cost domain models #71: Previously extended task lifecycle definitions (Task, TaskStatus, transitions) that this PR further modifies to add FAILED and reassignment transitions.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 47.95% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly summarizes the main change: implementing crash recovery with fail-and-reassign strategy, which is the primary focus of the entire PR.
Description check	✅ Passed	The description provides a comprehensive overview of the changes, key design decisions, and test coverage. It is directly related to the changeset and explains the purpose and implementation approach.
Linked Issues check	✅ Passed	The PR implements all acceptance criteria from issue `#129`: RecoveryStrategy protocol, FailAndReassignStrategy with redacted snapshots and retry logic, FAILED as non-terminal TaskStatus with proper transitions, max_retries field, and comprehensive unit/integration tests for the protocol, transitions, and recovery behavior.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#129` requirements: crash recovery infrastructure, FAILED status, retry mechanics, and engine integration. No unrelated modifications detected.

gemini-code-assist · 2026-03-07T08:23:01Z

Summary of Changes

This pull request significantly enhances the system's resilience by introducing a robust crash recovery mechanism. It allows the system to gracefully handle unexpected task failures by transitioning them to a new FAILED state, making them eligible for reassignment and retry based on a configurable max_retries policy. This ensures that transient errors do not lead to permanent task loss and improves the overall stability and reliability of agent executions.

Highlights

Crash Recovery Implementation: Implemented a robust crash recovery mechanism using a 'fail-and-reassign' strategy, as outlined in DESIGN_SPEC §6.6.
New Task Status: FAILED: Introduced a new non-terminal FAILED task status, allowing tasks to be retried after encountering errors, with defined transitions: IN_PROGRESS → FAILED, ASSIGNED → FAILED, and FAILED → ASSIGNED.
Recovery Strategy Protocol: Defined a RecoveryStrategy protocol and provided a default FailAndReassignStrategy implementation to standardize how task failures are handled.
Retry Mechanism: Added max_retries to the Task model and retry_count to TaskExecution to manage the number of allowed reassignment attempts after a failure.
AgentEngine Integration: Integrated the new recovery logic directly into the AgentEngine to gracefully handle both normal and fatal execution errors.
Documentation and Observability: Updated the DESIGN_SPEC.md to reflect the new FAILED status, task lifecycle, and recovery steps, and added new event constants for structured recovery logging.

Changelog

DESIGN_SPEC.md
- Updated the task lifecycle diagram to include the FAILED state and its transitions.
- Added a note on FAILED as a non-terminal state eligible for retry.
- Introduced max_retries to the task definition.
- Revised the crash recovery section to detail the FAILED state transitions and the role of RecoveryResult.can_reassign.
- Added a limitation note regarding automated reassignment for M3.
- Updated the project structure to include recovery.py.
src/ai_company/core/enums.py
- Added FAILED as a new member to the TaskStatus enum.
- Updated the quick reference summary of task transitions to include FAILED.
src/ai_company/core/task.py
- Added max_retries field (default 1, ge=0) to the Task Pydantic model.
- Updated _validate_assignment_consistency to include FAILED tasks as potentially having no assignee.
src/ai_company/core/task_transitions.py
- Updated the VALID_TRANSITIONS map to allow transitions to FAILED from ASSIGNED and IN_PROGRESS, and from FAILED to ASSIGNED.
- Updated the module docstring to reflect the new FAILED transitions.
src/ai_company/engine/__init__.py
- Imported FailAndReassignStrategy, RecoveryResult, and RecoveryStrategy.
- Exported these new classes in __all__.
src/ai_company/engine/agent_engine.py
- Imported FailAndReassignStrategy and RecoveryStrategy.
- Added EXECUTION_RECOVERY_FAILED to imported event constants.
- Defined _DEFAULT_RECOVERY_STRATEGY as FailAndReassignStrategy.
- Updated _EXECUTABLE_STATUSES docstring to clarify FAILED tasks are not executable directly.
- Added recovery_strategy parameter to AgentEngine.__init__.
- Modified run and _handle_fatal_error to await _handle_fatal_error and _apply_recovery respectively.
- Introduced _apply_recovery method to invoke the configured recovery strategy.
src/ai_company/engine/recovery.py
- Added new file: Defined RecoveryResult Pydantic model with task_execution, strategy_type, can_reassign (computed field), context_snapshot, and error_message.
- Added new file: Defined RecoveryStrategy protocol with recover and get_strategy_type methods.
- Added new file: Implemented FailAndReassignStrategy which transitions tasks to FAILED, captures a redacted context snapshot, and logs recovery events.
src/ai_company/engine/task_execution.py
- Added retry_count field (default 0, ge=0) to the TaskExecution Pydantic model.
- Modified from_task class method to accept an optional retry_count argument.
src/ai_company/observability/events/execution.py
- Added new constants: EXECUTION_RECOVERY_START, EXECUTION_RECOVERY_COMPLETE, EXECUTION_RECOVERY_FAILED, EXECUTION_RECOVERY_SNAPSHOT for recovery logging.
tests/integration/engine/test_crash_recovery.py
- Added new file: Contains integration tests for the full crash recovery flow, including scenarios for first failure (can reassign) and second failure (cannot reassign).
tests/unit/core/test_enums.py
- Updated test_task_status_has_8_members to reflect the addition of FAILED.
- Added TaskStatus.FAILED.value assertion in test_task_status_values.
tests/unit/core/test_task.py
- Added test_failed_without_assigned_to_allowed and test_failed_with_assigned_to_allowed.
- Added TestTaskMaxRetries class with tests for default, custom, zero, and negative max_retries values.
tests/unit/core/test_task_transitions.py
- Added test_in_progress_to_failed, test_assigned_to_failed, and test_failed_to_assigned for valid transitions.
- Added test_failed_to_completed_rejected and test_failed_to_in_progress_rejected for invalid transitions.
- Added test_failed_is_non_terminal assertion.
tests/unit/engine/test_agent_engine_errors.py
- Imported TaskStatus, FailAndReassignStrategy, RecoveryResult, and TaskExecution.
- Added TestAgentEngineRecovery class with tests for provider errors leading to FAILED status, recovery strategy invocation, recovery failure swallowing, no recovery when strategy is None, timeout triggering recovery, custom strategy usage, and MemoryError propagation.
tests/unit/engine/test_agent_engine_lifecycle.py
- Renamed test_error_stays_in_progress to test_error_transitions_to_failed.
- Updated the assertion to check for TaskStatus.FAILED instead of IN_PROGRESS after an error.
tests/unit/engine/test_recovery.py
- Added new file: Contains unit tests for RecoveryStrategy protocol conformance and FailAndReassignStrategy behavior, including happy path, max retries exceeded, zero retries, snapshot redaction, error message capture, and logging events.
tests/unit/engine/test_task_execution.py
- Added TestTaskExecutionRetryCount class with tests for default retry_count, from_task with retry_count, retry_count preservation on transition, and FAILED as a non-terminal state.

Activity

The pull request was pre-reviewed by 9 agents, and 17 findings were addressed.

gemini-code-assist

Code Review

This pull request introduces a robust crash recovery mechanism using a 'fail-and-reassign' strategy. The changes are comprehensive, including updates to the task lifecycle, new data models, a recovery strategy protocol, and integration into the agent engine. The design specification has been updated accordingly, and the changes are well-covered by new unit and integration tests. My review found one critical syntax issue that will prevent the code from running and a suggestion to improve import handling to avoid potential circular dependencies. Overall, this is a solid implementation of a critical feature.

gemini-code-assist · 2026-03-07T08:25:33Z

src/ai_company/engine/agent_engine.py

+        except MemoryError, RecursionError:
+            raise


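This unparenthesized except form was Python 2 syntax and is a SyntaxError on Python 3 releases before 3.14. PEP 758 allows it again from Python 3.14 on, and only when there is no `as` clause. Unless the project targets 3.14+ exclusively, group the exceptions in a tuple so the module imports on older interpreters.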

        except (MemoryError, RecursionError):
            raise
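
For context, the parenthesized form fits into a recovery wrapper along these lines. This is a hedged sketch of the behaviour the PR description states (recovery failures are swallowed, `MemoryError`/`RecursionError` always propagate, `None` opts out); `apply_recovery` and the callable-based strategy are hypothetical simplifications, not the engine's `_apply_recovery`.

```python
import asyncio
import logging

logger = logging.getLogger("recovery_sketch")


async def apply_recovery(strategy, error_message: str):
    """Run a recovery strategy without letting its failure mask the original error."""
    if strategy is None:  # explicit opt-out
        return None
    try:
        return await strategy(error_message)
    except (MemoryError, RecursionError):
        # Non-recoverable conditions always propagate to the caller.
        raise
    except Exception:
        # Any other recovery failure is logged and swallowed.
        logger.exception("recovery failed")
        return None


async def ok(msg: str) -> str:
    return "FAILED:" + msg


async def broken(msg: str) -> str:
    raise ValueError("strategy bug")


async def out_of_memory(msg: str) -> str:
    raise MemoryError
```

The order of the handlers matters: the narrow tuple must come before the broad `except Exception`, otherwise the re-raise branch would never be reached.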

gemini-code-assist · 2026-03-07T08:25:33Z

src/ai_company/engine/recovery.py

+from typing import Protocol, runtime_checkable
+
+from pydantic import BaseModel, ConfigDict, Field, computed_field
+
+from ai_company.core.enums import TaskStatus
+from ai_company.core.types import NotBlankStr  # noqa: TC001
+from ai_company.engine.context import AgentContext, AgentContextSnapshot  # noqa: TC001
+from ai_company.engine.task_execution import TaskExecution  # noqa: TC001


AgentContext is only used as a type hint for a function argument. To avoid potential runtime circular dependencies, it's best practice to place such imports inside a TYPE_CHECKING block. This change separates the runtime-needed AgentContextSnapshot from the type-only AgentContext. The circular import was likely caused by AgentContext, so the noqa suppression on the AgentContextSnapshot import can also be removed.

from typing import Protocol, runtime_checkable, TYPE_CHECKING

from pydantic import BaseModel, ConfigDict, Field, computed_field

from ai_company.core.enums import TaskStatus
from ai_company.core.types import NotBlankStr  # noqa: TC001
from ai_company.engine.context import AgentContextSnapshot
from ai_company.engine.task_execution import TaskExecution  # noqa: TC001

if TYPE_CHECKING:
    from ai_company.engine.context import AgentContext
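
The `TYPE_CHECKING` pattern in this suggestion can be tried in isolation. The sketch below uses `decimal.Decimal` purely as a placeholder for a type-only import; `describe` is a hypothetical function, and the annotation is quoted so that nothing is looked up at runtime.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by static type checkers only; this branch never runs.
    from decimal import Decimal


def describe(amount: "Decimal") -> str:
    # The quoted annotation is stored as a plain string, so the
    # name does not need to exist when the module is imported.
    return f"amount={amount}"
```

This only works for names used purely as hints. Types that a library inspects at runtime, such as the field types of a Pydantic model, still need a real import, which is presumably why `AgentContextSnapshot` stays outside the guarded block.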

greptile-apps · 2026-03-07T08:27:14Z

Greptile Summary

This PR implements crash recovery for AgentEngine per DESIGN_SPEC §6.6, adding FAILED as a non-terminal TaskStatus, a RecoveryStrategy protocol with a FailAndReassignStrategy default, and integrating recovery into both the normal error path and the fatal-exception path.

Key verified changes:

TaskStatus.FAILED is correctly added as a non-terminal state with valid transitions ASSIGNED → FAILED, IN_PROGRESS → FAILED, and FAILED → ASSIGNED (for reassignment)
Task.max_retries and TaskExecution.retry_count fields track retry eligibility correctly
RecoveryResult.can_reassign properly computed as retry_count < max_retries, with docstring clarifying that the task router (M4) is responsible for incrementing retry_count on reassignment
_apply_recovery is called after normal ERROR-terminated loop results and inside _handle_fatal_error for uncaught exceptions — non-overlapping paths prevent double-recovery
_record_costs/_submit_cost cleanly extracted to cost_recording.py to keep agent_engine.py under the 800-line limit
DESIGN_SPEC §6.1 lifecycle diagram correctly shows both ASSIGNED → FAILED and IN_PROGRESS → FAILED transitions

One issue found: raise exc from build_exc at line 741 of agent_engine.py reverses the exception causality. The original exception is the victim; the build exception is a secondary failure that occurred while handling the first. The correct idiom is raise exc from None, which was the previous pattern.

Confidence Score: 4/5

Safe to merge after addressing the exception chain reversal at line 741 of agent_engine.py.
The crash recovery implementation is well-structured, thoroughly tested (2087 tests, 95.39% coverage), and correctly integrates the recovery logic at two non-overlapping paths. The state machine transitions are properly defined and all required fields are correctly validated. The sole issue is a semantic reversal of the Python exception chain at line 741 (raise exc from build_exc instead of raise exc from None), which affects only the error chain representation in the rare scenario where error recovery itself fails, but does not affect runtime behavior.
src/ai_company/engine/agent_engine.py — exception chain direction at line 741.
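
The difference between the two idioms shows up in the exception's chaining attributes. A minimal sketch, with both function names and the error messages invented for illustration:

```python
def reraise_with_cause() -> None:
    try:
        raise ValueError("original failure")
    except ValueError as exc:
        build_exc = RuntimeError("could not build error result")
        # Marks build_exc as the direct cause of exc, so the traceback
        # reads as if the secondary failure led to the original one.
        raise exc from build_exc


def reraise_suppressed() -> None:
    try:
        raise ValueError("original failure")
    except ValueError as exc:
        # Clears the cause and hides the implicit context in tracebacks.
        raise exc from None


try:
    reraise_with_cause()
except ValueError as err:
    cause_type = type(err.__cause__).__name__

try:
    reraise_suppressed()
except ValueError as err:
    suppressed = (err.__cause__, err.__suppress_context__)
```

Either way the same exception object reaches the caller; only the chain shown in the traceback changes.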

Sequence Diagram

sequenceDiagram
    participant Caller
    participant AgentEngine
    participant ReactLoop
    participant FailAndReassignStrategy
    participant TaskExecution

    Caller->>AgentEngine: run(identity, task)
    AgentEngine->>AgentEngine: _prepare_context() → IN_PROGRESS
    AgentEngine->>ReactLoop: execute(context)

    alt Loop raises exception
        ReactLoop-->>AgentEngine: raises Exception
        AgentEngine->>AgentEngine: _handle_fatal_error(ctx, exc)
        AgentEngine->>FailAndReassignStrategy: recover(task_execution, error_msg, ctx)
        FailAndReassignStrategy->>TaskExecution: with_transition(FAILED, reason)
        TaskExecution-->>FailAndReassignStrategy: failed_execution
        FailAndReassignStrategy-->>AgentEngine: RecoveryResult(can_reassign, FAILED)
        AgentEngine-->>Caller: AgentRunResult(ERROR, task=FAILED)
    else Loop returns TerminationReason.ERROR
        ReactLoop-->>AgentEngine: ExecutionResult(ERROR)
        AgentEngine->>AgentEngine: _apply_post_execution_transitions() → no-op
        AgentEngine->>FailAndReassignStrategy: recover(task_execution, error_msg, ctx)
        FailAndReassignStrategy->>TaskExecution: with_transition(FAILED, reason)
        TaskExecution-->>FailAndReassignStrategy: failed_execution
        FailAndReassignStrategy-->>AgentEngine: RecoveryResult(can_reassign, FAILED)
        AgentEngine-->>Caller: AgentRunResult(ERROR, task=FAILED)
    end

    Note over Caller,TaskExecution: Task Router (M4) checks can_reassign<br/>and creates TaskExecution.from_task(task, retry_count+1)

Last reviewed commit: b25d85a

src/ai_company/engine/agent_engine.py

DESIGN_SPEC.md

src/ai_company/engine/recovery.py

Copilot

Pull request overview

Implements crash recovery in the execution engine via a pluggable RecoveryStrategy (defaulting to a fail-and-reassign approach), adds a new non-terminal FAILED task status, and introduces retry tracking (max_retries / retry_count) to support safe reassignment flows.

Changes:

Add TaskStatus.FAILED plus updated valid transition map to allow failing and reassignment (FAILED → ASSIGNED).
Introduce crash recovery framework (RecoveryStrategy, FailAndReassignStrategy, RecoveryResult) and integrate it into AgentEngine error paths with new observability events.
Add retry configuration/telemetry fields (Task.max_retries, TaskExecution.retry_count) and expand unit/integration tests + update DESIGN_SPEC sections.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

File	Description
tests/unit/engine/test_task_execution.py	Adds unit coverage for `TaskExecution.retry_count` defaults and transition behavior with `FAILED`.
tests/unit/engine/test_recovery.py	New unit tests for recovery protocol/strategy behavior and recovery logging events.
tests/unit/engine/test_agent_engine_lifecycle.py	Updates lifecycle expectation: error outcomes transition task execution to `FAILED`.
tests/unit/engine/test_agent_engine_errors.py	Adds engine recovery tests (strategy invocation, swallow failure, opt-out, timeout, MemoryError propagation).
tests/unit/core/test_task_transitions.py	Extends transition validation tests for `FAILED` allowed/invalid edges and non-terminal behavior.
tests/unit/core/test_task.py	Adds tests for `FAILED` assignment rules and `Task.max_retries` validation.
tests/unit/core/test_enums.py	Updates enum member count/value checks to include `TaskStatus.FAILED`.
tests/integration/engine/test_crash_recovery.py	New integration test ensuring provider crash results in `FAILED` status and retry gating checks.
src/ai_company/observability/events/execution.py	Adds structured recovery event constants (`execution.recovery.*`).
src/ai_company/engine/task_execution.py	Adds `retry_count` field and plumbs it into `TaskExecution.from_task()`.
src/ai_company/engine/recovery.py	New recovery protocol + default `FailAndReassignStrategy`, `RecoveryResult` model, and snapshot/error logging.
src/ai_company/engine/agent_engine.py	Integrates recovery on `TerminationReason.ERROR` and fatal-error path; adds configurable strategy.
src/ai_company/engine/__init__.py	Re-exports recovery types/strategy as part of engine public API.
src/ai_company/core/task_transitions.py	Updates lifecycle transition map and docs to include `FAILED`.
src/ai_company/core/task.py	Adds `max_retries` field and relaxes assignee requirements for `FAILED`.
src/ai_company/core/enums.py	Adds `FAILED` to `TaskStatus` and updates lifecycle summary docs.
DESIGN_SPEC.md	Updates lifecycle diagram and recovery spec sections to reflect implemented crash recovery and retry limits.

Copilot · 2026-03-07T08:27:42Z

tests/unit/engine/test_recovery.py

+pytestmark = pytest.mark.timeout(30)
+
+


pytestmark = pytest.mark.timeout(30) is unique within tests/unit/engine (other unit-engine modules don’t set a module-level timeout). To keep markers consistent, consider removing the module-level timeout here or applying the same pattern across the unit/engine suite if timeouts are now desired.

Suggested change

pytestmark = pytest.mark.timeout(30)

Copilot · 2026-03-07T08:27:43Z

src/ai_company/engine/recovery.py

+    model_config = ConfigDict(frozen=True)
+
+    task_execution: TaskExecution = Field(
+        description="Updated execution with FAILED status",
+    )
+    strategy_type: NotBlankStr = Field(
+        description="Identifier of the recovery strategy used",
+    )
+    context_snapshot: AgentContextSnapshot = Field(
+        description="Redacted context snapshot (no message contents)",
+    )


RecoveryResult is declared as a frozen Pydantic model, but there’s no corresponding test_frozen coverage for it (the repo commonly asserts immutability for every ConfigDict(frozen=True) model). Adding a small unit test that field reassignment raises ValidationError would align with existing patterns and prevent accidental regression to mutability.

Copilot · 2026-03-07T08:27:43Z

src/ai_company/engine/task_execution.py

    @classmethod
-    def from_task(cls, task: Task) -> TaskExecution:
+    def from_task(
+        cls,
+        task: Task,
+        *,
+        retry_count: int = 0,
+    ) -> TaskExecution:
        """Create a fresh execution from a task definition.

        Args:
            task: The frozen task to wrap.
+            retry_count: Number of previous failure-reassignment cycles.

        Returns:
            New ``TaskExecution`` with status matching the task.
        """
-        execution = cls(task=task, status=task.status)
+        execution = cls(task=task, status=task.status, retry_count=retry_count)


TaskExecution.from_task() now accepts retry_count, but the only call site (AgentContext.from_identity) always uses the default, so engine-driven executions can’t start with a non-zero retry count. That makes max_retries/can_reassign effectively ignore previous retries unless callers bypass AgentEngine.run() and build TaskExecution manually. Consider plumbing an optional retry_count through AgentEngine.run() / AgentContext.from_identity() (or persisting it on the task) so retries can be executed via the public API.
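
One way to picture the plumbing this comment asks for is a caller that threads `retry_count` through successive attempts. Everything here is a hypothetical sketch: the `Task` and `TaskExecution` classes are stdlib stand-ins, `can_reassign` is placed on the execution for brevity (the PR puts it on `RecoveryResult`), and `run_with_retries` is not part of the PR.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:  # hypothetical stand-in
    max_retries: int = 1


@dataclass(frozen=True)
class TaskExecution:  # hypothetical stand-in
    task: Task
    retry_count: int = 0

    @classmethod
    def from_task(cls, task: Task, *, retry_count: int = 0) -> "TaskExecution":
        return cls(task=task, retry_count=retry_count)

    @property
    def can_reassign(self) -> bool:
        return self.retry_count < self.task.max_retries


def run_with_retries(
    task: Task, attempt: Callable[[TaskExecution], bool]
) -> list[int]:
    """Call attempt(execution) until it succeeds or the retry budget is spent."""
    seen: list[int] = []
    execution = TaskExecution.from_task(task)
    while True:
        seen.append(execution.retry_count)
        if attempt(execution) or not execution.can_reassign:
            return seen
        # The caller, not the strategy, bumps the counter on reassignment.
        execution = TaskExecution.from_task(
            task, retry_count=execution.retry_count + 1
        )
```

With an attempt that always fails, `max_retries=1` yields two runs (retry counts 0 and 1) and `max_retries=0` yields one, matching the first-failure and second-failure cases in the integration test description.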

Copilot · 2026-03-07T08:27:43Z

src/ai_company/engine/agent_engine.py

+_DEFAULT_RECOVERY_STRATEGY = FailAndReassignStrategy()
+"""Module-level default instance for the recovery strategy."""
+


Using a module-level _DEFAULT_RECOVERY_STRATEGY instance as the default argument means all AgentEngine instances share the same strategy object. This is safe only as long as the strategy remains strictly stateless; if it later gains config/state (metrics, caches, injected services), that shared instance can cause cross-engine coupling. Prefer defaulting to None and instantiating FailAndReassignStrategy() inside __init__ when no strategy is provided (while still allowing None to explicitly disable recovery via a separate sentinel).
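
The sentinel approach mentioned in this comment could be sketched as follows. `_UNSET` and `AgentEngineSketch` are hypothetical names, and the empty `FailAndReassignStrategy` class is a stand-in so the example runs on its own.

```python
class FailAndReassignStrategy:  # stand-in for the PR's class
    pass


_UNSET = object()  # hypothetical sentinel: "caller did not pass a strategy"


class AgentEngineSketch:
    def __init__(self, recovery_strategy=_UNSET):
        if recovery_strategy is _UNSET:
            # Fresh default per engine, so no instance is shared.
            recovery_strategy = FailAndReassignStrategy()
        # An explicit None still means "recovery disabled".
        self.recovery_strategy = recovery_strategy


first, second = AgentEngineSketch(), AgentEngineSketch()
disabled = AgentEngineSketch(recovery_strategy=None)
```

Note that any module-level assignment runs once at import time, so a module-level default built by calling a factory is still a single shared object; only construction inside `__init__` gives each engine its own instance.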

Suggested change

-_DEFAULT_RECOVERY_STRATEGY = FailAndReassignStrategy()
-"""Module-level default instance for the recovery strategy."""
+def get_default_recovery_strategy() -> RecoveryStrategy:
+    """Factory for the default recovery strategy.
+
+    Returns a new ``FailAndReassignStrategy`` instance to avoid sharing
+    mutable state across multiple ``AgentEngine`` instances.
+    """
+    return FailAndReassignStrategy()
+
+# Backwards-compatible alias: evaluate to a fresh default strategy when used.
+_DEFAULT_RECOVERY_STRATEGY: RecoveryStrategy = get_default_recovery_strategy()
+"""Module-level default for the recovery strategy.
+
+Note: this is created via ``get_default_recovery_strategy()`` to avoid
+sharing a single mutable instance across multiple engines.
+"""
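Note that a module-level alias is still evaluated once at import time, so engines falling back to it would continue to share one object. A minimal sketch of the sentinel-default alternative the comment alludes to follows. The `_Unset` sentinel and the `recover` method are hypothetical names introduced for illustration, and the classes are simplified stand-ins for the real `RecoveryStrategy` protocol and `AgentEngine`.

```python
from __future__ import annotations

from typing import Protocol


class RecoveryStrategy(Protocol):
    """Simplified stand-in; the method name is an assumption."""

    def recover(self, error: Exception) -> str: ...


class FailAndReassignStrategy:
    def recover(self, error: Exception) -> str:
        return f"failed: {error}"


class _Unset:
    """Sentinel type meaning 'the caller did not pass a strategy'."""


_UNSET = _Unset()


class AgentEngine:
    def __init__(
        self,
        recovery_strategy: RecoveryStrategy | None | _Unset = _UNSET,
    ) -> None:
        if isinstance(recovery_strategy, _Unset):
            # Not provided: each engine gets its own fresh default instance.
            self.recovery_strategy: RecoveryStrategy | None = FailAndReassignStrategy()
        else:
            # Explicit value: a custom strategy, or None to disable recovery.
            self.recovery_strategy = recovery_strategy
```

Because the default is built inside `__init__`, `AgentEngine()` never shares a strategy with another engine, while `AgentEngine(None)` keeps its meaning of "recovery disabled".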

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

src/ai_company/engine/task_execution.py (1)
97-127: ⚠️ Potential issue | 🟠 Major

Retry/reassignment can now produce invalid Task snapshots.

With retry_count enabling the new FAILED -> ASSIGNED flow, TaskExecution can represent a reassigned attempt while self.task.assigned_to is still None—which src/ai_company/core/task.py allows for FAILED but not for ASSIGNED. to_task_snapshot() later uses self.task.model_copy(update={"status": self.status}), so it can silently emit an invalid assigned task without running Task validators. Please carry the reassigned assignee in execution state, or rebuild snapshots via Task.model_validate(...) / Task.with_transition(...) with the required overrides.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/task_execution.py` around lines 97 - 127, The
TaskExecution.from_task + to_task_snapshot flow can emit an invalid ASSIGNED
Task because from_task only copies Task into execution while retry_count-driven
FAILED->ASSIGNED transitions may require a non-None assignee; update
TaskExecution to carry the reassigned assignee (e.g., add an assigned_to field
on TaskExecution) and modify from_task to accept/record that assignee when
creating a reassigned attempt, and update to_task_snapshot to rebuild a
validated Task snapshot by calling Task.model_validate(...) or
Task.with_transition(...) (instead of Task.model_copy) supplying both
status=self.status and the carried assigned_to so validators run and an ASSIGNED
Task is never emitted with a None assignee.
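The fix described above can be sketched as follows, assuming frozen dataclasses as stand-ins for the Pydantic models. Here `__post_init__` plays the role of the `Task` validator, and `dataclasses.replace()` plays the role of a validating rebuild (it goes through `__init__`, unlike Pydantic's `model_copy(update=...)`, which does not re-run validators). The `assigned_to` field on `TaskExecution` is the hypothetical addition proposed by the comment.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from enum import Enum


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    FAILED = "failed"


@dataclass(frozen=True)
class Task:
    status: TaskStatus
    assigned_to: str | None = None

    def __post_init__(self) -> None:
        # Stand-in for the Task validator: ASSIGNED requires an assignee.
        if self.status is TaskStatus.ASSIGNED and self.assigned_to is None:
            raise ValueError("ASSIGNED task requires assigned_to")


@dataclass(frozen=True)
class TaskExecution:
    task: Task
    status: TaskStatus
    assigned_to: str | None = None  # hypothetical: carries the reassigned assignee
    retry_count: int = 0

    def to_task_snapshot(self) -> Task:
        # replace() calls Task.__init__, so __post_init__ validates the result.
        return replace(
            self.task,
            status=self.status,
            assigned_to=self.assigned_to or self.task.assigned_to,
        )
```

A reassigned execution that carries an assignee yields a valid snapshot, while one without an assignee fails loudly instead of silently emitting an invalid ASSIGNED task.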
DESIGN_SPEC.md (1)
697-720: ⚠️ Potential issue | 🟠 Major

Split BLOCKED from CANCELLED in the lifecycle diagram.

Line 714 still renders BLOCKED / CANCELLED as a shared terminal sink, which contradicts Line 720 and §6.6 where BLOCKED is non-terminal and returns to ASSIGNED. Please add the BLOCKED → ASSIGNED path or separate the two states so the state machine stays unambiguous.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DESIGN_SPEC.md` around lines 697 - 720, The ASCII lifecycle diagram currently
combines "BLOCKED / CANCELLED" into one terminal sink which contradicts §6.6;
update the diagram to separate BLOCKED and CANCELLED as distinct nodes and add
an arrow from BLOCKED back to ASSIGNED labeled "unblocked" (or "blocked /
cancelled" as appropriate), keeping CANCELLED as a terminal node with no
outgoing transitions and ensuring FAILED remains non-terminal returning to
ASSIGNED; target the diagram block containing the states "ASSIGNED",
"IN_PROGRESS", "IN_REVIEW", "COMPLETED", "BLOCKED", and "CANCELLED" and adjust
the ASCII art arrows so BLOCKED → ASSIGNED is explicit and CANCELLED stays
terminal.
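To make the intended state machine concrete, here is a deliberately partial sketch of a transition table. It contains only the transitions named in this PR discussion (the real map in `task_transitions.py` presumably has more, such as the path into IN_PROGRESS); `VALID_TRANSITIONS`, `can_transition`, and `is_terminal` are illustrative names, not the project's API.

```python
from __future__ import annotations

from enum import Enum


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    FAILED = "failed"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Partial map: only the edges mentioned in the review are listed.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    TaskStatus.ASSIGNED: frozenset({TaskStatus.FAILED}),
    TaskStatus.IN_PROGRESS: frozenset({TaskStatus.FAILED}),
    TaskStatus.BLOCKED: frozenset({TaskStatus.ASSIGNED}),  # "unblocked"
    TaskStatus.FAILED: frozenset({TaskStatus.ASSIGNED}),   # retry, see §6.6
    TaskStatus.COMPLETED: frozenset(),                     # terminal
    TaskStatus.CANCELLED: frozenset(),                     # terminal
}


def can_transition(current: TaskStatus, target: TaskStatus) -> bool:
    """Return True if the (partial) table allows current -> target."""
    return target in VALID_TRANSITIONS[current]


def is_terminal(status: TaskStatus) -> bool:
    """A state is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]
```

Keeping BLOCKED and CANCELLED as separate keys is what removes the ambiguity in the diagram: only CANCELLED has an empty set of outgoing edges.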

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 973-975: The wording should explicitly place the recovery catch
point on the AgentEngine boundary: update the sentence that currently reads
“outermost `try/except` in the execution loop” to say something like “the
outermost `try/except` at the AgentEngine boundary (e.g., the top-level
try/catch in AgentEngine.run / execute loop)”; keep the rest of the step
describing logging an ERROR with a redacted AgentContext snapshot (turn count,
accumulated cost — exclude message contents) so responsibility is unambiguously
assigned to AgentEngine.
- Around line 2281-2283: Update the project-structure map entry for
run_result.py to reflect the added RecoveryResult type: change the short
description from "AgentRunResult outcome model" to something like
"AgentRunResult and RecoveryResult outcome models" so it documents both
AgentRunResult and RecoveryResult in run_result.py; ensure you reference
run_result.py and the symbols AgentRunResult and RecoveryResult in the updated
line.
- Around line 720-721: The blockquote containing "**Non-terminal states:**
BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when
unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and
CANCELLED are terminal states with no outgoing transitions." contains an extra
blank line; remove that blank line so the blockquote lines are contiguous (no
empty line inside the quote) to satisfy markdownlint MD028.

In `@src/ai_company/engine/agent_engine.py`:
- Around line 265-270: The retry_count on TaskExecution is being reset to zero
when AgentEngine.run rebuilds state via AgentContext.from_identity which calls
TaskExecution.from_task, so recovery/reassignment logic using
task_execution.retry_count (and can_reassign) never exhausts; fix by persisting
or passing the current retry_count into the rebuilt TaskExecution: update
AgentContext.from_identity (and any call sites in AgentEngine.run and the second
block around lines 575-606) to accept and propagate the existing
TaskExecution.retry_count (or the TaskExecution object) instead of constructing
a fresh TaskExecution.from_task(task) with retry_count=0, and ensure
AgentEngine._apply_recovery increments and persists retry_count back to the
task/context so subsequent run() invocations see the updated value.

In `@tests/integration/engine/test_crash_recovery.py`:
- Around line 89-92: The ModelConfig instance uses a non-canonical model_id
"test-model-001"; update the ModelConfig(..., model_id=...) to use the approved
test alias "test-small-001" (keep provider="test-provider") so the fixture
conforms to repo test-data conventions and allowed aliases.

In `@tests/unit/engine/test_agent_engine_errors.py`:
- Around line 362-364: Add a 30s pytest timeout mark for the new
TestAgentEngineRecovery test class so each test is limited to 30 seconds; either
add a class decorator `@pytest.mark.timeout`(30) above the TestAgentEngineRecovery
class or add a module-level pytestmark = pytest.mark.timeout(30) at top of the
file, and ensure pytest is imported if not already (referencing
TestAgentEngineRecovery to locate the class).

In `@tests/unit/engine/test_task_execution.py`:
- Around line 85-114: Add the required 30-second pytest timeout marker at the
class (or module) level so each test in the TestTaskExecutionRetryCount block
gets the repo-required timeout; locate the TestTaskExecutionRetryCount class and
decorate it with pytest.mark.timeout(30) (or add the same marker at the top of
the module) to enforce the per-test 30s guard for methods like
test_retry_count_default_zero, test_from_task_with_retry_count,
test_retry_count_preserved_on_transition, and
test_failed_transition_not_terminal.


ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 833c190 and f9383eb.

📒 Files selected for processing (17)

DESIGN_SPEC.md
src/ai_company/core/enums.py
src/ai_company/core/task.py
src/ai_company/core/task_transitions.py
src/ai_company/engine/__init__.py
src/ai_company/engine/agent_engine.py
src/ai_company/engine/recovery.py
src/ai_company/engine/task_execution.py
src/ai_company/observability/events/execution.py
tests/integration/engine/test_crash_recovery.py
tests/unit/core/test_enums.py
tests/unit/core/test_task.py
tests/unit/core/test_task_transitions.py
tests/unit/engine/test_agent_engine_errors.py
tests/unit/engine/test_agent_engine_lifecycle.py
tests/unit/engine/test_recovery.py
tests/unit/engine/test_task_execution.py

📜 Review details

🧰 Additional context used

📓 Path-based instructions (5)

**/*.py