
feat: implement crash recovery with fail-and-reassign strategy #149

Merged
Aureliolo merged 3 commits into main from feat/crash-recovery
Mar 7, 2026

Conversation

@Aureliolo
Owner

Summary

  • Implement crash recovery with FailAndReassignStrategy per DESIGN_SPEC §6.6 (closes #129)
  • Add FAILED as a non-terminal TaskStatus with transitions: IN_PROGRESS → FAILED, ASSIGNED → FAILED, FAILED → ASSIGNED
  • Add RecoveryStrategy protocol and FailAndReassignStrategy default implementation
  • Add RecoveryResult with @computed_field can_reassign (retry_count < max_retries)
  • Add Task.max_retries (default=1, ge=0) and TaskExecution.retry_count fields
  • Integrate recovery into AgentEngine — both normal error path and fatal error path
  • Add 4 new event constants for structured recovery logging
  • Update DESIGN_SPEC §6.1 lifecycle diagram, §6.2 task definition, §6.6 recovery steps, §15.3 project structure
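
The FAILED transitions listed above can be sketched as a small lookup table. This is only an illustration, not the contents of task_transitions.py: the VALID_TRANSITIONS name comes from the PR, but the shape of the map, the `can_transition` helper, the enum values, and the reduced set of statuses are assumptions.

```python
from enum import Enum


class TaskStatus(Enum):
    # Reduced, illustrative subset; values are assumed, not taken from the repo.
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    FAILED = "failed"


# Hypothetical shape: source status -> set of allowed target statuses.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    TaskStatus.ASSIGNED: frozenset({TaskStatus.IN_PROGRESS, TaskStatus.FAILED}),
    TaskStatus.IN_PROGRESS: frozenset({TaskStatus.COMPLETED, TaskStatus.FAILED}),
    # FAILED is non-terminal: the only edge out of it is reassignment.
    TaskStatus.FAILED: frozenset({TaskStatus.ASSIGNED}),
    TaskStatus.COMPLETED: frozenset(),
}


def can_transition(current: TaskStatus, target: TaskStatus) -> bool:
    """Return True when the map allows moving from current to target."""
    return target in VALID_TRANSITIONS.get(current, frozenset())
```

With this map, an early setup failure (ASSIGNED → FAILED) is accepted, while FAILED → COMPLETED and FAILED → IN_PROGRESS are rejected, matching the valid and invalid cases in the test plan.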

Key design decisions

  • ASSIGNED → FAILED transition added (not just IN_PROGRESS → FAILED) to handle early setup failures before the task reaches IN_PROGRESS
  • can_reassign is a @computed_field (not stored) per project convention for derived values
  • strategy_type uses NotBlankStr per project convention for identifier fields
  • Recovery failures are logged but never block the error result (defensive, non-propagating)
  • MemoryError/RecursionError always re-raised through recovery
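
The last two decisions can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's `_apply_recovery`: the strategy is modelled as a plain async callable, the execution is a placeholder string, and falling back to the original execution on failure is an assumed behavior.

```python
import asyncio
import logging

logger = logging.getLogger("recovery_sketch")


async def apply_recovery(strategy, task_execution, error_message):
    """Run a recovery strategy without letting ordinary failures escape."""
    if strategy is None:
        # Recovery explicitly disabled: hand back the execution unchanged.
        return task_execution
    try:
        return await strategy(task_execution, error_message)
    except (MemoryError, RecursionError):
        # Resource exhaustion is never swallowed.
        raise
    except Exception:
        # Log and fall back so the caller can still return its error result.
        logger.exception("recovery failed; keeping original execution")
        return task_execution


async def broken_strategy(task_execution, error_message):
    # Hypothetical strategy with a bug, used only to exercise the fallback.
    raise RuntimeError("bug inside the strategy")


async def exhausted_strategy(task_execution, error_message):
    # Hypothetical strategy that runs out of memory.
    raise MemoryError


result = asyncio.run(apply_recovery(broken_strategy, "execution", "provider error"))
```

Here `result` is the unchanged `"execution"` placeholder, whereas running `exhausted_strategy` the same way lets the MemoryError propagate.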

Test plan

  • Unit tests for FailAndReassignStrategy (happy path, max retries exceeded, zero retries, snapshot redaction, error capture, logging events)
  • Unit tests for RecoveryStrategy protocol conformance
  • Unit tests for TaskExecution.retry_count (default, factory, transition preservation, FAILED non-terminal)
  • Unit tests for Task.max_retries (default, custom, zero, negative rejected)
  • Unit tests for AgentEngine recovery (provider error → FAILED, strategy invocation, failure swallowed, None bypass, timeout trigger, custom strategy, MemoryError propagation)
  • Unit tests for FAILED transitions (valid: IN_PROGRESS→FAILED, ASSIGNED→FAILED, FAILED→ASSIGNED; invalid: FAILED→COMPLETED, FAILED→IN_PROGRESS)
  • Integration test: full engine.run() with failing provider → FAILED status
  • All 2087 tests pass, 95.39% coverage
  • Pre-reviewed by 9 agents, 17 findings addressed

Closes #129

Aureliolo and others added 2 commits March 7, 2026 08:58
Add crash recovery per DESIGN_SPEC §6.6:

- Add FAILED status to TaskStatus enum (non-terminal, can reassign)
- Add IN_PROGRESS → FAILED and FAILED → ASSIGNED transitions
- Add max_retries field to Task model (default=1)
- Add retry_count field to TaskExecution runtime state
- Create RecoveryStrategy protocol + FailAndReassignStrategy default
- Wire recovery into AgentEngine (_apply_recovery) for all error paths
- Recovery captures redacted AgentContextSnapshot (no message contents)
- Recovery is async (forward-compatible with checkpoint recovery M4/M5)
- Recovery failure is swallowed (never blocks error result)
- Pass recovery_strategy=None to opt out

Closes #129

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-reviewed by 9 agents, 17 findings addressed:

Bug fix:
- Add ASSIGNED→FAILED to valid transitions (fixes early setup failures
  leaving tasks stuck in ASSIGNED status)

Type improvements:
- RecoveryResult.strategy_type: str → NotBlankStr
- RecoveryResult.can_reassign: stored Field → @computed_field
- Extract "fail_reassign" to FailAndReassignStrategy.STRATEGY_TYPE constant

Documentation:
- DESIGN_SPEC §6.1: update lifecycle diagram with FAILED status
- DESIGN_SPEC §6.2: add max_retries to task definition
- DESIGN_SPEC §6.6: fix "tool call history" wording, add M3 limitation note
- DESIGN_SPEC §15.3: add recovery.py to project structure
- Task/TaskExecution/TaskStatus docstrings: add missing fields

Tests:
- Add TaskStatus.FAILED value assertion
- Add ASSIGNED→FAILED valid transition test
- Add FAILED completed_at/is_terminal verification
- Add MemoryError propagation test for _apply_recovery
- Fix import block formatting in test_recovery.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 08:22
@github-actions
Contributor

github-actions bot commented Mar 7, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Mar 7, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3bd699d9-4c62-4b24-9d92-5ac4ed19d430

📥 Commits

Reviewing files that changed from the base of the PR and between f9383eb and b25d85a.

📒 Files selected for processing (11)
  • DESIGN_SPEC.md
  • src/ai_company/core/task_transitions.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/cost_recording.py
  • src/ai_company/engine/recovery.py
  • tests/integration/engine/test_crash_recovery.py
  • tests/unit/core/test_task.py
  • tests/unit/engine/test_agent_engine_errors.py
  • tests/unit/engine/test_agent_engine_lifecycle.py
  • tests/unit/engine/test_recovery.py
  • tests/unit/engine/test_task_execution.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Crash-recovery system with a fail-and-reassign strategy and task-level max_retries for controlled retries.
    • New non-terminal FAILED state (reassignable) and retry_count tracking on task executions.
  • Improvements

    • Task lifecycle extended with FAILED-related transitions and BLOCKED semantics.
    • Enhanced observability with recovery events and per-execution cost recording.
  • Documentation

    • Design docs updated to describe recovery flow, semantics, and future milestones.

Walkthrough

Implements crash-recovery (MVP fail-and-reassign): adds non-terminal TaskStatus.FAILED, per-task max_retries, TaskExecution.retry_count, a pluggable RecoveryStrategy with FailAndReassignStrategy, integrates recovery into AgentEngine, captures redacted context snapshots, and adds events and tests exercising retry/reassignment eligibility.
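
The reassignment eligibility rule (retry_count < max_retries) can be sketched with standard-library dataclasses. The PR uses Pydantic models with a @computed_field; the frozen dataclasses, the property, the `title` field, and the `__post_init__` check below are stand-ins chosen to keep the example dependency-free.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    title: str  # illustrative field, not from the PR
    max_retries: int = 1

    def __post_init__(self) -> None:
        # Stand-in for the ge=0 constraint on the real field.
        if self.max_retries < 0:
            raise ValueError("max_retries must be >= 0")


@dataclass(frozen=True)
class TaskExecution:
    task: Task
    retry_count: int = 0


@dataclass(frozen=True)
class RecoveryResult:
    task_execution: TaskExecution
    error_message: str

    @property
    def can_reassign(self) -> bool:
        # Derived on every access rather than stored.
        execution = self.task_execution
        return execution.retry_count < execution.task.max_retries


task = Task(title="demo")
first_failure = RecoveryResult(TaskExecution(task, retry_count=0), "provider error")
second_failure = RecoveryResult(TaskExecution(task, retry_count=1), "provider error")
```

With the default max_retries of 1, `first_failure.can_reassign` is true and `second_failure.can_reassign` is false; a task created with max_retries=0 is never eligible.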

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Task Model & State Machine: src/ai_company/core/task.py, src/ai_company/core/enums.py, src/ai_company/core/task_transitions.py | Add TaskStatus.FAILED (non-terminal), add Task.max_retries: int = 1, update valid transitions (ASSIGNED/IN_PROGRESS → FAILED; FAILED → ASSIGNED), and adjust assignment validation. |
| Recovery Architecture: src/ai_company/engine/recovery.py | New RecoveryStrategy protocol, RecoveryResult model (includes can_reassign), and FailAndReassignStrategy implementation that snapshots context, marks TaskExecution as FAILED, and computes reassignment eligibility. |
| Agent Engine Integration: src/ai_company/engine/agent_engine.py, src/ai_company/engine/__init__.py | AgentEngine accepts recovery_strategy, adds async _handle_fatal_error and _apply_recovery, integrates recovery invocation into error paths, and re-exports recovery types. |
| TaskExecution & Run Results: src/ai_company/engine/task_execution.py | Add TaskExecution.retry_count (default 0) and update from_task(..., retry_count=...) factory to preserve retry count. |
| Observability & Costing: src/ai_company/observability/events/execution.py, src/ai_company/engine/cost_recording.py | Add recovery-related events (EXECUTION_RECOVERY_START/COMPLETE/FAILED/SNAPSHOT). New cost recording helper record_execution_costs for per-turn cost submission with robust logging. |
| Tests (Unit & Integration): tests/unit/..., tests/integration/engine/test_crash_recovery.py | Add extensive unit tests for recovery protocol, FailAndReassignStrategy, AgentEngine recovery flows, Task/transition tests for FAILED and max_retries, and an integration test validating failure → FAILED and reassignment eligibility. |
| Docs / Design: DESIGN_SPEC.md | Document non-terminal FAILED state, crash-recovery flow, max_retries, RecoveryStrategy/RecoveryResult expectations, and milestone (M3/M4/M5) notes. |

Sequence Diagram(s)

sequenceDiagram
    participant Agent as AgentEngine
    participant Provider as Provider
    participant Recovery as RecoveryStrategy
    participant TaskExec as TaskExecution

    Agent->>Provider: execute task
    Provider-->>Agent: error/exception
    Agent->>Agent: _handle_fatal_error()
    Agent->>Recovery: _apply_recovery(task_execution, error, context)
    activate Recovery
    Recovery->>TaskExec: capture redacted context snapshot
    Recovery->>TaskExec: set status -> FAILED, record error_message
    Recovery-->>Agent: RecoveryResult{can_reassign, context_snapshot}
    deactivate Recovery
    alt can_reassign
        Agent-->>Agent: task available for reassignment (retry_count < max_retries)
    else cannot_reassign
        Agent-->>Agent: task remains FAILED (no reassignment)
    end
    Agent-->>Caller: return AgentRunResult (ERROR)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.95% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: implementing crash recovery with fail-and-reassign strategy, which is the primary focus of the entire PR. |
| Description check | ✅ Passed | The description provides a comprehensive overview of the changes, key design decisions, and test coverage. It is directly related to the changeset and explains the purpose and implementation approach. |
| Linked Issues check | ✅ Passed | The PR implements all acceptance criteria from issue #129: RecoveryStrategy protocol, FailAndReassignStrategy with redacted snapshots and retry logic, FAILED as non-terminal TaskStatus with proper transitions, max_retries field, and comprehensive unit/integration tests for the protocol, transitions, and recovery behavior. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #129 requirements: crash recovery infrastructure, FAILED status, retry mechanics, and engine integration. No unrelated modifications detected. |


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's resilience by introducing a robust crash recovery mechanism. It allows the system to gracefully handle unexpected task failures by transitioning them to a new FAILED state, making them eligible for reassignment and retry based on a configurable max_retries policy. This ensures that transient errors do not lead to permanent task loss and improves the overall stability and reliability of agent executions.

Highlights

  • Crash Recovery Implementation: Implemented a robust crash recovery mechanism using a 'fail-and-reassign' strategy, as outlined in DESIGN_SPEC §6.6.
  • New Task Status: FAILED: Introduced a new non-terminal FAILED task status, allowing tasks to be retried after encountering errors, with defined transitions: IN_PROGRESS → FAILED, ASSIGNED → FAILED, and FAILED → ASSIGNED.
  • Recovery Strategy Protocol: Defined a RecoveryStrategy protocol and provided a default FailAndReassignStrategy implementation to standardize how task failures are handled.
  • Retry Mechanism: Added max_retries to the Task model and retry_count to TaskExecution to manage the number of allowed reassignment attempts after a failure.
  • AgentEngine Integration: Integrated the new recovery logic directly into the AgentEngine to gracefully handle both normal and fatal execution errors.
  • Documentation and Observability: Updated the DESIGN_SPEC.md to reflect the new FAILED status, task lifecycle, and recovery steps, and added new event constants for structured recovery logging.
Changelog
  • DESIGN_SPEC.md
    • Updated the task lifecycle diagram to include the FAILED state and its transitions.
    • Added a note on FAILED as a non-terminal state eligible for retry.
    • Introduced max_retries to the task definition.
    • Revised the crash recovery section to detail the FAILED state transitions and the role of RecoveryResult.can_reassign.
    • Added a limitation note regarding automated reassignment for M3.
    • Updated the project structure to include recovery.py.
  • src/ai_company/core/enums.py
    • Added FAILED as a new member to the TaskStatus enum.
    • Updated the quick reference summary of task transitions to include FAILED.
  • src/ai_company/core/task.py
    • Added max_retries field (default 1, ge=0) to the Task Pydantic model.
    • Updated _validate_assignment_consistency to include FAILED tasks as potentially having no assignee.
  • src/ai_company/core/task_transitions.py
    • Updated the VALID_TRANSITIONS map to allow transitions to FAILED from ASSIGNED and IN_PROGRESS, and from FAILED to ASSIGNED.
    • Updated the module docstring to reflect the new FAILED transitions.
  • src/ai_company/engine/__init__.py
    • Imported FailAndReassignStrategy, RecoveryResult, and RecoveryStrategy.
    • Exported these new classes in __all__.
  • src/ai_company/engine/agent_engine.py
    • Imported FailAndReassignStrategy and RecoveryStrategy.
    • Added EXECUTION_RECOVERY_FAILED to imported event constants.
    • Defined _DEFAULT_RECOVERY_STRATEGY as FailAndReassignStrategy.
    • Updated _EXECUTABLE_STATUSES docstring to clarify FAILED tasks are not executable directly.
    • Added recovery_strategy parameter to AgentEngine.__init__.
    • Modified run and _handle_fatal_error to await _handle_fatal_error and _apply_recovery respectively.
    • Introduced _apply_recovery method to invoke the configured recovery strategy.
  • src/ai_company/engine/recovery.py
    • Added new file: Defined RecoveryResult Pydantic model with task_execution, strategy_type, can_reassign (computed field), context_snapshot, and error_message.
    • Added new file: Defined RecoveryStrategy protocol with recover and get_strategy_type methods.
    • Added new file: Implemented FailAndReassignStrategy which transitions tasks to FAILED, captures a redacted context snapshot, and logs recovery events.
  • src/ai_company/engine/task_execution.py
    • Added retry_count field (default 0, ge=0) to the TaskExecution Pydantic model.
    • Modified from_task class method to accept an optional retry_count argument.
  • src/ai_company/observability/events/execution.py
    • Added new constants: EXECUTION_RECOVERY_START, EXECUTION_RECOVERY_COMPLETE, EXECUTION_RECOVERY_FAILED, EXECUTION_RECOVERY_SNAPSHOT for recovery logging.
  • tests/integration/engine/test_crash_recovery.py
    • Added new file: Contains integration tests for the full crash recovery flow, including scenarios for first failure (can reassign) and second failure (cannot reassign).
  • tests/unit/core/test_enums.py
    • Updated test_task_status_has_8_members to reflect the addition of FAILED.
    • Added TaskStatus.FAILED.value assertion in test_task_status_values.
  • tests/unit/core/test_task.py
    • Added test_failed_without_assigned_to_allowed and test_failed_with_assigned_to_allowed.
    • Added TestTaskMaxRetries class with tests for default, custom, zero, and negative max_retries values.
  • tests/unit/core/test_task_transitions.py
    • Added test_in_progress_to_failed, test_assigned_to_failed, and test_failed_to_assigned for valid transitions.
    • Added test_failed_to_completed_rejected and test_failed_to_in_progress_rejected for invalid transitions.
    • Added test_failed_is_non_terminal assertion.
  • tests/unit/engine/test_agent_engine_errors.py
    • Imported TaskStatus, FailAndReassignStrategy, RecoveryResult, and TaskExecution.
    • Added TestAgentEngineRecovery class with tests for provider errors leading to FAILED status, recovery strategy invocation, recovery failure swallowing, no recovery when strategy is None, timeout triggering recovery, custom strategy usage, and MemoryError propagation.
  • tests/unit/engine/test_agent_engine_lifecycle.py
    • Renamed test_error_stays_in_progress to test_error_transitions_to_failed.
    • Updated the assertion to check for TaskStatus.FAILED instead of IN_PROGRESS after an error.
  • tests/unit/engine/test_recovery.py
    • Added new file: Contains unit tests for RecoveryStrategy protocol conformance and FailAndReassignStrategy behavior, including happy path, max retries exceeded, zero retries, snapshot redaction, error message capture, and logging events.
  • tests/unit/engine/test_task_execution.py
    • Added TestTaskExecutionRetryCount class with tests for default retry_count, from_task with retry_count, retry_count preservation on transition, and FAILED as a non-terminal state.
Activity
  • The pull request was pre-reviewed by 9 agents, and 17 findings were addressed.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust crash recovery mechanism using a 'fail-and-reassign' strategy. The changes are comprehensive, including updates to the task lifecycle, new data models, a recovery strategy protocol, and integration into the agent engine. The design specification has been updated accordingly, and the changes are well-covered by new unit and integration tests. My review found one critical syntax issue that will prevent the code from running and a suggestion to improve import handling to avoid potential circular dependencies. Overall, this is a solid implementation of a critical feature.

Comment on lines +607 to +608
        except MemoryError, RecursionError:
            raise

critical

This except syntax is from Python 2. In Python 3, multiple exceptions must be grouped in a tuple. This will cause a SyntaxError when the module is imported.

        except (MemoryError, RecursionError):
            raise

Comment on lines +11 to +18
from typing import Protocol, runtime_checkable

from pydantic import BaseModel, ConfigDict, Field, computed_field

from ai_company.core.enums import TaskStatus
from ai_company.core.types import NotBlankStr # noqa: TC001
from ai_company.engine.context import AgentContext, AgentContextSnapshot # noqa: TC001
from ai_company.engine.task_execution import TaskExecution # noqa: TC001

medium

AgentContext is only used as a type hint for a function argument. To avoid potential runtime circular dependencies, it's best practice to place such imports inside a TYPE_CHECKING block. This change separates the runtime-needed AgentContextSnapshot from the type-only AgentContext. The circular import was likely caused by AgentContext, so the noqa suppression on the AgentContextSnapshot import can also be removed.

from typing import Protocol, runtime_checkable, TYPE_CHECKING

from pydantic import BaseModel, ConfigDict, Field, computed_field

from ai_company.core.enums import TaskStatus
from ai_company.core.types import NotBlankStr  # noqa: TC001
from ai_company.engine.context import AgentContextSnapshot
from ai_company.engine.task_execution import TaskExecution  # noqa: TC001

if TYPE_CHECKING:
    from ai_company.engine.context import AgentContext

@greptile-apps

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR implements crash recovery for AgentEngine per DESIGN_SPEC §6.6, adding FAILED as a non-terminal TaskStatus, a RecoveryStrategy protocol with a FailAndReassignStrategy default, and integrating recovery into both the normal error path and the fatal-exception path.

Key verified changes:

  • TaskStatus.FAILED is correctly added as a non-terminal state with valid transitions ASSIGNED → FAILED, IN_PROGRESS → FAILED, and FAILED → ASSIGNED (for reassignment)
  • Task.max_retries and TaskExecution.retry_count fields track retry eligibility correctly
  • RecoveryResult.can_reassign properly computed as retry_count < max_retries, with docstring clarifying that the task router (M4) is responsible for incrementing retry_count on reassignment
  • _apply_recovery is called after normal ERROR-terminated loop results and inside _handle_fatal_error for uncaught exceptions — non-overlapping paths prevent double-recovery
  • _record_costs/_submit_cost cleanly extracted to cost_recording.py to keep agent_engine.py under the 800-line limit
  • DESIGN_SPEC §6.1 lifecycle diagram correctly shows both ASSIGNED → FAILED and IN_PROGRESS → FAILED transitions

One issue found: raise exc from build_exc at line 741 of agent_engine.py reverses the exception causality. The original exception is the victim; the build exception is a secondary failure that occurred while handling the first. The correct idiom is raise exc from None, which was the previous pattern.
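
What each form of `raise ... from` records on the re-raised exception can be checked with a short standalone sketch. The function and the two exception types below are hypothetical and unrelated to the code in agent_engine.py; the sketch only shows the chaining attributes Python sets and does not settle which form the engine should use.

```python
def escaped_exception(chain_to_build_error: bool) -> BaseException:
    """Re-raise an original error after a secondary failure and return it."""
    try:
        try:
            raise ValueError("original failure")
        except ValueError as exc:
            try:
                # Secondary failure while handling the original one.
                raise RuntimeError("could not build the error result")
            except RuntimeError as build_exc:
                if chain_to_build_error:
                    # Records build_exc as the direct cause of exc.
                    raise exc from build_exc
                # Records no cause and hides the implicit context.
                raise exc from None
    except ValueError as escaped:
        return escaped


chained = escaped_exception(chain_to_build_error=True)
plain = escaped_exception(chain_to_build_error=False)
```

`chained.__cause__` is the RuntimeError, so a traceback would present the secondary failure as the direct cause of the original one. `plain.__cause__` is None and `plain.__suppress_context__` is true, so only the original exception is shown.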

Confidence Score: 4/5

  • Safe to merge after addressing the exception chain reversal at line 741 of agent_engine.py.
  • The crash recovery implementation is well-structured, thoroughly tested (2087 tests, 95.39% coverage), and correctly integrates the recovery logic at two non-overlapping paths. The state machine transitions are properly defined and all required fields are correctly validated. The sole issue is a semantic reversal of the Python exception chain at line 741 (raise exc from build_exc instead of raise exc from None), which affects only the error chain representation in the rare scenario where error recovery itself fails, but does not affect runtime behavior.
  • src/ai_company/engine/agent_engine.py — exception chain direction at line 741.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant AgentEngine
    participant ReactLoop
    participant FailAndReassignStrategy
    participant TaskExecution

    Caller->>AgentEngine: run(identity, task)
    AgentEngine->>AgentEngine: _prepare_context() → IN_PROGRESS
    AgentEngine->>ReactLoop: execute(context)

    alt Loop raises exception
        ReactLoop-->>AgentEngine: raises Exception
        AgentEngine->>AgentEngine: _handle_fatal_error(ctx, exc)
        AgentEngine->>FailAndReassignStrategy: recover(task_execution, error_msg, ctx)
        FailAndReassignStrategy->>TaskExecution: with_transition(FAILED, reason)
        TaskExecution-->>FailAndReassignStrategy: failed_execution
        FailAndReassignStrategy-->>AgentEngine: RecoveryResult(can_reassign, FAILED)
        AgentEngine-->>Caller: AgentRunResult(ERROR, task=FAILED)
    else Loop returns TerminationReason.ERROR
        ReactLoop-->>AgentEngine: ExecutionResult(ERROR)
        AgentEngine->>AgentEngine: _apply_post_execution_transitions() → no-op
        AgentEngine->>FailAndReassignStrategy: recover(task_execution, error_msg, ctx)
        FailAndReassignStrategy->>TaskExecution: with_transition(FAILED, reason)
        TaskExecution-->>FailAndReassignStrategy: failed_execution
        FailAndReassignStrategy-->>AgentEngine: RecoveryResult(can_reassign, FAILED)
        AgentEngine-->>Caller: AgentRunResult(ERROR, task=FAILED)
    end

    Note over Caller,TaskExecution: Task Router (M4) checks can_reassign<br/>and creates TaskExecution.from_task(task, retry_count+1)

Last reviewed commit: b25d85a
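
The note at the end of the sequence diagram describes a router that checks eligibility and starts the next attempt with retry_count + 1. A minimal sketch of that step follows. The task router is planned for M4 and does not exist in this PR, so `next_execution`, the `task_id` field, and the dataclass stand-ins for the Pydantic models are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    task_id: str  # illustrative field, not from the PR
    max_retries: int = 1


@dataclass(frozen=True)
class TaskExecution:
    task: Task
    retry_count: int = 0

    @classmethod
    def from_task(cls, task: Task, *, retry_count: int = 0) -> "TaskExecution":
        return cls(task=task, retry_count=retry_count)


def next_execution(failed: TaskExecution) -> Optional[TaskExecution]:
    """Return the next attempt, or None when the retry budget is spent."""
    if failed.retry_count < failed.task.max_retries:
        return TaskExecution.from_task(
            failed.task, retry_count=failed.retry_count + 1
        )
    return None


first = TaskExecution.from_task(Task(task_id="t-1"))
second = next_execution(first)
```

With the default max_retries of 1, `second` carries retry_count 1, and calling `next_execution(second)` returns None because the single retry has been used.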


Copilot AI left a comment


Pull request overview

Implements crash recovery in the execution engine via a pluggable RecoveryStrategy (defaulting to a fail-and-reassign approach), adds a new non-terminal FAILED task status, and introduces retry tracking (max_retries / retry_count) to support safe reassignment flows.

Changes:

  • Add TaskStatus.FAILED plus updated valid transition map to allow failing and reassignment (FAILED → ASSIGNED).
  • Introduce crash recovery framework (RecoveryStrategy, FailAndReassignStrategy, RecoveryResult) and integrate it into AgentEngine error paths with new observability events.
  • Add retry configuration/telemetry fields (Task.max_retries, TaskExecution.retry_count) and expand unit/integration tests + update DESIGN_SPEC sections.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/engine/test_task_execution.py | Adds unit coverage for TaskExecution.retry_count defaults and transition behavior with FAILED. |
| tests/unit/engine/test_recovery.py | New unit tests for recovery protocol/strategy behavior and recovery logging events. |
| tests/unit/engine/test_agent_engine_lifecycle.py | Updates lifecycle expectation: error outcomes transition task execution to FAILED. |
| tests/unit/engine/test_agent_engine_errors.py | Adds engine recovery tests (strategy invocation, swallow failure, opt-out, timeout, MemoryError propagation). |
| tests/unit/core/test_task_transitions.py | Extends transition validation tests for FAILED allowed/invalid edges and non-terminal behavior. |
| tests/unit/core/test_task.py | Adds tests for FAILED assignment rules and Task.max_retries validation. |
| tests/unit/core/test_enums.py | Updates enum member count/value checks to include TaskStatus.FAILED. |
| tests/integration/engine/test_crash_recovery.py | New integration test ensuring provider crash results in FAILED status and retry gating checks. |
| src/ai_company/observability/events/execution.py | Adds structured recovery event constants (execution.recovery.*). |
| src/ai_company/engine/task_execution.py | Adds retry_count field and plumbs it into TaskExecution.from_task(). |
| src/ai_company/engine/recovery.py | New recovery protocol + default FailAndReassignStrategy, RecoveryResult model, and snapshot/error logging. |
| src/ai_company/engine/agent_engine.py | Integrates recovery on TerminationReason.ERROR and fatal-error path; adds configurable strategy. |
| src/ai_company/engine/__init__.py | Re-exports recovery types/strategy as part of engine public API. |
| src/ai_company/core/task_transitions.py | Updates lifecycle transition map and docs to include FAILED. |
| src/ai_company/core/task.py | Adds max_retries field and relaxes assignee requirements for FAILED. |
| src/ai_company/core/enums.py | Adds FAILED to TaskStatus and updates lifecycle summary docs. |
| DESIGN_SPEC.md | Updates lifecycle diagram and recovery spec sections to reflect implemented crash recovery and retry limits. |


Comment on lines +26 to +28
pytestmark = pytest.mark.timeout(30)


Copilot AI Mar 7, 2026

pytestmark = pytest.mark.timeout(30) is unique within tests/unit/engine (other unit-engine modules don’t set a module-level timeout). To keep markers consistent, consider removing the module-level timeout here or applying the same pattern across the unit/engine suite if timeouts are now desired.

Suggested change
-pytestmark = pytest.mark.timeout(30)

Copilot uses AI. Check for mistakes.
Comment on lines +40 to +50
    model_config = ConfigDict(frozen=True)

    task_execution: TaskExecution = Field(
        description="Updated execution with FAILED status",
    )
    strategy_type: NotBlankStr = Field(
        description="Identifier of the recovery strategy used",
    )
    context_snapshot: AgentContextSnapshot = Field(
        description="Redacted context snapshot (no message contents)",
    )
Copilot AI Mar 7, 2026

RecoveryResult is declared as a frozen Pydantic model, but there’s no corresponding test_frozen coverage for it (the repo commonly asserts immutability for every ConfigDict(frozen=True) model). Adding a small unit test that field reassignment raises ValidationError would align with existing patterns and prevent accidental regression to mutability.

Comment on lines 111 to +127
     @classmethod
-    def from_task(cls, task: Task) -> TaskExecution:
+    def from_task(
+        cls,
+        task: Task,
+        *,
+        retry_count: int = 0,
+    ) -> TaskExecution:
         """Create a fresh execution from a task definition.

         Args:
             task: The frozen task to wrap.
+            retry_count: Number of previous failure-reassignment cycles.

         Returns:
             New ``TaskExecution`` with status matching the task.
         """
-        execution = cls(task=task, status=task.status)
+        execution = cls(task=task, status=task.status, retry_count=retry_count)
Copilot AI Mar 7, 2026

TaskExecution.from_task() now accepts retry_count, but the only call site (AgentContext.from_identity) always uses the default, so engine-driven executions can’t start with a non-zero retry count. That makes max_retries/can_reassign effectively ignore previous retries unless callers bypass AgentEngine.run() and build TaskExecution manually. Consider plumbing an optional retry_count through AgentEngine.run() / AgentContext.from_identity() (or persisting it on the task) so retries can be executed via the public API.

Comment on lines +63 to +65
_DEFAULT_RECOVERY_STRATEGY = FailAndReassignStrategy()
"""Module-level default instance for the recovery strategy."""

Copilot AI Mar 7, 2026

Using a module-level _DEFAULT_RECOVERY_STRATEGY instance as the default argument means all AgentEngine instances share the same strategy object. This is safe only as long as the strategy remains strictly stateless; if it later gains config/state (metrics, caches, injected services), that shared instance can cause cross-engine coupling. Prefer defaulting to None and instantiating FailAndReassignStrategy() inside __init__ when no strategy is provided (while still allowing None to explicitly disable recovery via a separate sentinel).

Suggested change
-_DEFAULT_RECOVERY_STRATEGY = FailAndReassignStrategy()
-"""Module-level default instance for the recovery strategy."""
+def get_default_recovery_strategy() -> RecoveryStrategy:
+    """Factory for the default recovery strategy.
+    Returns a new ``FailAndReassignStrategy`` instance to avoid sharing
+    mutable state across multiple ``AgentEngine`` instances.
+    """
+    return FailAndReassignStrategy()
+# Backwards-compatible alias: evaluate to a fresh default strategy when used.
+_DEFAULT_RECOVERY_STRATEGY: RecoveryStrategy = get_default_recovery_strategy()
+"""Module-level default for the recovery strategy.
+Note: this is created via ``get_default_recovery_strategy()`` to avoid
+sharing a single mutable instance across multiple engines.
+"""
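
One way to read the prose recommendation is a sentinel default, sketched below under stated assumptions. Note that a module-level assignment runs once at import time, so an alias such as `_DEFAULT_RECOVERY_STRATEGY = get_default_recovery_strategy()` would still hold a single shared instance. The `_Unset` sentinel, the constructor signature, and the `"fail_and_reassign"` string are hypothetical placeholders, not the repository's actual code.

```python
from typing import Final, Optional, Protocol, Union


class RecoveryStrategy(Protocol):
    """Stand-in for the PR's protocol (only one method is modelled)."""

    def get_strategy_type(self) -> str: ...


class FailAndReassignStrategy:
    """Stand-in for the PR's default strategy."""

    def get_strategy_type(self) -> str:
        return "fail_and_reassign"  # placeholder identifier


class _Unset:
    """Sentinel type meaning 'the caller did not pass a strategy'."""


_UNSET: Final = _Unset()


class AgentEngine:
    """Stand-in engine showing only the strategy wiring."""

    def __init__(
        self,
        recovery_strategy: Union[RecoveryStrategy, None, _Unset] = _UNSET,
    ) -> None:
        self.recovery_strategy: Optional[RecoveryStrategy]
        if isinstance(recovery_strategy, _Unset):
            # No argument given: build a fresh default for this engine only.
            self.recovery_strategy = FailAndReassignStrategy()
        else:
            # An explicit strategy, or None to disable recovery.
            self.recovery_strategy = recovery_strategy
```

Each engine built without an argument then owns its own strategy object, and passing `None` remains an explicit way to switch recovery off.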

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/ai_company/engine/task_execution.py (1)

97-127: ⚠️ Potential issue | 🟠 Major

Retry/reassignment can now produce invalid Task snapshots.

With retry_count enabling the new FAILED -> ASSIGNED flow, TaskExecution can represent a reassigned attempt while self.task.assigned_to is still None—which src/ai_company/core/task.py allows for FAILED but not for ASSIGNED. to_task_snapshot() later uses self.task.model_copy(update={"status": self.status}), so it can silently emit an invalid assigned task without running Task validators. Please carry the reassigned assignee in execution state, or rebuild snapshots via Task.model_validate(...) / Task.with_transition(...) with the required overrides.
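
The difference between the two snapshot approaches can be illustrated with a reduced Pydantic v2 model. This is a sketch, not the repository's `Task`: the two-member enum, the validator body, and the `validated_snapshot` helper are assumptions made for the example. It relies on the documented Pydantic behavior that `model_copy(update=...)` does not validate the updated data.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError, model_validator


class TaskStatus(str, Enum):
    ASSIGNED = "assigned"
    FAILED = "failed"


class Task(BaseModel):
    """Reduced stand-in: ASSIGNED requires an assignee, FAILED does not."""

    model_config = ConfigDict(frozen=True)

    status: TaskStatus
    assigned_to: Optional[str] = None

    @model_validator(mode="after")
    def _check_assignee(self) -> "Task":
        if self.status is TaskStatus.ASSIGNED and self.assigned_to is None:
            raise ValueError("ASSIGNED task requires assigned_to")
        return self


failed = Task(status=TaskStatus.FAILED)

# model_copy(update=...) skips validation, so the invalid state slips through.
unchecked = failed.model_copy(update={"status": TaskStatus.ASSIGNED})


def validated_snapshot(
    task: Task, status: TaskStatus, assigned_to: Optional[str]
) -> Task:
    """Rebuild the snapshot through validation so the validator runs."""
    data = task.model_dump()
    data.update(status=status, assigned_to=assigned_to)
    return Task.model_validate(data)
```

Rebuilding through `model_validate` rejects an ASSIGNED snapshot without an assignee, which is the failure mode the comment wants surfaced rather than emitted silently.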

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/task_execution.py` around lines 97 - 127, The
TaskExecution.from_task + to_task_snapshot flow can emit an invalid ASSIGNED
Task because from_task only copies Task into execution while retry_count-driven
FAILED->ASSIGNED transitions may require a non-None assignee; update
TaskExecution to carry the reassigned assignee (e.g., add an assigned_to field
on TaskExecution) and modify from_task to accept/record that assignee when
creating a reassigned attempt, and update to_task_snapshot to rebuild a
validated Task snapshot by calling Task.model_validate(...) or
Task.with_transition(...) (instead of Task.model_copy) supplying both
status=self.status and the carried assigned_to so validators run and an ASSIGNED
Task is never emitted with a None assignee.
DESIGN_SPEC.md (1)

697-720: ⚠️ Potential issue | 🟠 Major

Split BLOCKED from CANCELLED in the lifecycle diagram.

Line 714 still renders BLOCKED / CANCELLED as a shared terminal sink, which contradicts Line 720 and §6.6 where BLOCKED is non-terminal and returns to ASSIGNED. Please add the BLOCKED → ASSIGNED path or separate the two states so the state machine stays unambiguous.
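
The intended state machine can be written as a transition table, which makes the terminal and non-terminal split easy to check. This is a sketch under assumptions: only the edges named in this thread (ASSIGNED → FAILED, IN_PROGRESS → FAILED, FAILED → ASSIGNED, BLOCKED → ASSIGNED, and COMPLETED/CANCELLED having no outgoing edges) come from the PR; the remaining edges and the enum values are illustrative guesses.

```python
from enum import Enum


class TaskStatus(Enum):
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    IN_REVIEW = "in_review"
    COMPLETED = "completed"
    BLOCKED = "blocked"
    FAILED = "failed"
    CANCELLED = "cancelled"


S = TaskStatus

# Edges to BLOCKED, CANCELLED, IN_REVIEW and COMPLETED are assumed, not sourced.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    S.ASSIGNED: frozenset({S.IN_PROGRESS, S.BLOCKED, S.FAILED, S.CANCELLED}),
    S.IN_PROGRESS: frozenset({S.IN_REVIEW, S.BLOCKED, S.FAILED, S.CANCELLED}),
    S.IN_REVIEW: frozenset({S.COMPLETED, S.IN_PROGRESS, S.CANCELLED}),
    S.BLOCKED: frozenset({S.ASSIGNED}),  # non-terminal: returns when unblocked
    S.FAILED: frozenset({S.ASSIGNED}),  # non-terminal: returns for retry
    S.COMPLETED: frozenset(),  # terminal
    S.CANCELLED: frozenset(),  # terminal
}


def is_terminal(status: TaskStatus) -> bool:
    """A status is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]


def validate_transition(current: TaskStatus, target: TaskStatus) -> None:
    """Raise ValueError if ``current -> target`` is not an allowed edge."""
    if target not in VALID_TRANSITIONS[current]:
        msg = f"invalid transition: {current.name} -> {target.name}"
        raise ValueError(msg)
```

Keeping BLOCKED and CANCELLED as separate keys is what removes the ambiguity the shared diagram node introduces.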

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DESIGN_SPEC.md` around lines 697 - 720, The ASCII lifecycle diagram currently
combines "BLOCKED / CANCELLED" into one terminal sink which contradicts §6.6;
update the diagram to separate BLOCKED and CANCELLED as distinct nodes and add
an arrow from BLOCKED back to ASSIGNED labeled "unblocked" (or "blocked /
cancelled" as appropriate), keeping CANCELLED as a terminal node with no
outgoing transitions and ensuring FAILED remains non-terminal returning to
ASSIGNED; target the diagram block containing the states "ASSIGNED",
"IN_PROGRESS", "IN_REVIEW", "COMPLETED", "BLOCKED", and "CANCELLED" and adjust
the ASCII art arrows so BLOCKED → ASSIGNED is explicit and CANCELLED stays
terminal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 973-975: The wording should explicitly place the recovery catch
point on the AgentEngine boundary: update the sentence that currently reads
“outermost `try/except` in the execution loop” to say something like “the
outermost `try/except` at the AgentEngine boundary (e.g., the top-level
try/catch in AgentEngine.run / execute loop)”; keep the rest of the step
describing logging an ERROR with a redacted AgentContext snapshot (turn count,
accumulated cost — exclude message contents) so responsibility is unambiguously
assigned to AgentEngine.
- Around line 2281-2283: Update the project-structure map entry for
run_result.py to reflect the added RecoveryResult type: change the short
description from "AgentRunResult outcome model" to something like
"AgentRunResult and RecoveryResult outcome models" so it documents both
AgentRunResult and RecoveryResult in run_result.py; ensure you reference
run_result.py and the symbols AgentRunResult and RecoveryResult in the updated
line.
- Around line 720-721: The blockquote containing "**Non-terminal states:**
BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when
unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and
CANCELLED are terminal states with no outgoing transitions." contains an extra
blank line; remove that blank line so the blockquote lines are contiguous (no
empty line inside the quote) to satisfy markdownlint MD028.

In `@src/ai_company/engine/agent_engine.py`:
- Around line 265-270: The retry_count on TaskExecution is being reset to zero
when AgentEngine.run rebuilds state via AgentContext.from_identity which calls
TaskExecution.from_task, so recovery/reassignment logic using
task_execution.retry_count (and can_reassign) never exhausts; fix by persisting
or passing the current retry_count into the rebuilt TaskExecution: update
AgentContext.from_identity (and any call sites in AgentEngine.run and the second
block around lines 575-606) to accept and propagate the existing
TaskExecution.retry_count (or the TaskExecution object) instead of constructing
a fresh TaskExecution.from_task(task) with retry_count=0, and ensure
AgentEngine._apply_recovery increments and persists retry_count back to the
task/context so subsequent run() invocations see the updated value.

In `@tests/integration/engine/test_crash_recovery.py`:
- Around line 89-92: The ModelConfig instance uses a non-canonical model_id
"test-model-001"; update the ModelConfig(..., model_id=...) to use the approved
test alias "test-small-001" (keep provider="test-provider") so the fixture
conforms to repo test-data conventions and allowed aliases.

In `@tests/unit/engine/test_agent_engine_errors.py`:
- Around line 362-364: Add a 30s pytest timeout mark for the new
TestAgentEngineRecovery test class so each test is limited to 30 seconds; either
add a class decorator `@pytest.mark.timeout`(30) above the TestAgentEngineRecovery
class or add a module-level pytestmark = pytest.mark.timeout(30) at top of the
file, and ensure pytest is imported if not already (referencing
TestAgentEngineRecovery to locate the class).

In `@tests/unit/engine/test_task_execution.py`:
- Around line 85-114: Add the required 30-second pytest timeout marker at the
class (or module) level so each test in the TestTaskExecutionRetryCount block
gets the repo-required timeout; locate the TestTaskExecutionRetryCount class and
decorate it with pytest.mark.timeout(30) (or add the same marker at the top of
the module) to enforce the per-test 30s guard for methods like
test_retry_count_default_zero, test_from_task_with_retry_count,
test_retry_count_preserved_on_transition, and
test_failed_transition_not_terminal.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 598115f0-a607-4eab-9205-e5f38a107f77

📥 Commits

Reviewing files that changed from the base of the PR and between 833c190 and f9383eb.

📒 Files selected for processing (17)
  • DESIGN_SPEC.md
  • src/ai_company/core/enums.py
  • src/ai_company/core/task.py
  • src/ai_company/core/task_transitions.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/task_execution.py
  • src/ai_company/observability/events/execution.py
  • tests/integration/engine/test_crash_recovery.py
  • tests/unit/core/test_enums.py
  • tests/unit/core/test_task.py
  • tests/unit/core/test_task_transitions.py
  • tests/unit/engine/test_agent_engine_errors.py
  • tests/unit/engine/test_agent_engine_lifecycle.py
  • tests/unit/engine/test_recovery.py
  • tests/unit/engine/test_task_execution.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Use Python 3.14+ with PEP 649 native lazy annotations
Do NOT use from __future__ import annotations—Python 3.14 has PEP 649
Use except A, B: syntax (no parentheses) for exception handling on Python 3.14—ruff enforces this
Add type hints to all public functions in Python; mypy strict mode is enforced
Use Google-style docstrings on all public classes and functions—ruff D rules enforce this
Create new objects instead of mutating existing ones; use copy.deepcopy() at construction for non-Pydantic internal collections and MappingProxyType wrapping for read-only enforcement
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with BaseModel, model_validator, computed_field, and ConfigDict
Use @computed_field for derived values instead of storing + validating redundant fields (e.g. TokenUsage.total_tokens)
Use NotBlankStr (from core.types) for all identifier/name fields—including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants—instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Enforce line length of 88 characters (ruff enforces this)
Functions should be less than 50 lines, files less than 800 lines
Handle errors explicitly; never silently swallow errors in Python code
Validate at system boundaries (user input, external APIs, config files)

Files:

  • src/ai_company/observability/events/execution.py
  • tests/unit/engine/test_recovery.py
  • src/ai_company/engine/__init__.py
  • tests/unit/core/test_enums.py
  • src/ai_company/engine/task_execution.py
  • src/ai_company/engine/recovery.py
  • tests/unit/core/test_task.py
  • tests/unit/engine/test_agent_engine_errors.py
  • src/ai_company/core/task_transitions.py
  • src/ai_company/core/enums.py
  • tests/unit/engine/test_task_execution.py
  • src/ai_company/core/task.py
  • tests/integration/engine/test_crash_recovery.py
  • tests/unit/core/test_task_transitions.py
  • tests/unit/engine/test_agent_engine_lifecycle.py
  • src/ai_company/engine/agent_engine.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: Every module with business logic MUST import from ai_company.observability import get_logger then logger = get_logger(__name__)
Never use import logging, logging.getLogger(), or print() in application code
Always use logger as the variable name for loggers (not _logger, not log)
Use event name constants from domain-specific modules under ai_company.observability.events (e.g. PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import directly: from ai_company.observability.events.<domain> import EVENT_CONSTANT
Always use structured logging with logger.info(EVENT, key=value) format—never logger.info('msg %s', val)
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
DEBUG level logging should be used for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports do NOT need logging

Files:

  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/task_execution.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/core/task_transitions.py
  • src/ai_company/core/enums.py
  • src/ai_company/core/task.py
  • src/ai_company/engine/agent_engine.py
{src/**/*.py,tests/**/*.py,src/**/*.yaml,src/**/*.yml,tests/**/*.yaml,tests/**/*.yml,examples/**/*.yaml,examples/**/*.yml}

📄 CodeRabbit inference engine (CLAUDE.md)

NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) .claude/ skill/agent files, (3) third-party import paths/module names (e.g. litellm.types.llms.openai). Tests must use test-provider, test-small-001, etc.

Files:

  • src/ai_company/observability/events/execution.py
  • tests/unit/engine/test_recovery.py
  • src/ai_company/engine/__init__.py
  • tests/unit/core/test_enums.py
  • src/ai_company/engine/task_execution.py
  • src/ai_company/engine/recovery.py
  • tests/unit/core/test_task.py
  • tests/unit/engine/test_agent_engine_errors.py
  • src/ai_company/core/task_transitions.py
  • src/ai_company/core/enums.py
  • tests/unit/engine/test_task_execution.py
  • src/ai_company/core/task.py
  • tests/integration/engine/test_crash_recovery.py
  • tests/unit/core/test_task_transitions.py
  • tests/unit/engine/test_agent_engine_lifecycle.py
  • src/ai_company/engine/agent_engine.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Mark unit tests with @pytest.mark.unit, integration tests with @pytest.mark.integration, e2e tests with @pytest.mark.e2e, and slow tests with @pytest.mark.slow
Use asyncio_mode = 'auto' for pytest async tests—no manual @pytest.mark.asyncio needed
Set a 30-second timeout per test

Files:

  • tests/unit/engine/test_recovery.py
  • tests/unit/core/test_enums.py
  • tests/unit/core/test_task.py
  • tests/unit/engine/test_agent_engine_errors.py
  • tests/unit/engine/test_task_execution.py
  • tests/integration/engine/test_crash_recovery.py
  • tests/unit/core/test_task_transitions.py
  • tests/unit/engine/test_agent_engine_lifecycle.py
src/ai_company/{providers,engine}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

RetryExhaustedError signals that all retries failed—the engine layer catches this to trigger fallback chains

Files:

  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/task_execution.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/agent_engine.py
🧠 Learnings (2)
📚 Learning: 2026-03-06T21:51:55.175Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to src/**/*.py : Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g. `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/ai_company/observability/events/execution.py
📚 Learning: 2026-03-06T21:51:55.175Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to src/ai_company/{providers,engine}/**/*.py : `RetryExhaustedError` signals that all retries failed—the engine layer catches this to trigger fallback chains

Applied to files:

  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/recovery.py
  • tests/integration/engine/test_crash_recovery.py
  • DESIGN_SPEC.md
  • src/ai_company/engine/agent_engine.py
🧬 Code graph analysis (13)
tests/unit/engine/test_recovery.py (5)
src/ai_company/core/enums.py (2)
  • TaskStatus (122-146)
  • TaskType (149-157)
src/ai_company/core/task.py (1)
  • Task (38-212)
src/ai_company/core/agent.py (1)
  • AgentIdentity (177-235)
src/ai_company/engine/recovery.py (7)
  • RecoveryResult (29-61)
  • RecoveryStrategy (65-94)
  • get_strategy_type (92-94)
  • get_strategy_type (164-166)
  • recover (73-90)
  • recover (108-162)
  • can_reassign (59-61)
tests/unit/engine/conftest.py (2)
  • sample_agent_with_personality (52-79)
  • sample_task_with_criteria (95-116)
src/ai_company/engine/__init__.py (1)
src/ai_company/engine/recovery.py (3)
  • FailAndReassignStrategy (97-166)
  • RecoveryResult (29-61)
  • RecoveryStrategy (65-94)
tests/unit/core/test_enums.py (1)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
src/ai_company/engine/task_execution.py (1)
src/ai_company/core/task.py (1)
  • Task (38-212)
src/ai_company/engine/recovery.py (4)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
src/ai_company/engine/context.py (3)
  • AgentContext (87-307)
  • AgentContextSnapshot (41-84)
  • to_snapshot (280-302)
src/ai_company/engine/task_execution.py (2)
  • TaskExecution (60-246)
  • with_transition (135-193)
src/ai_company/core/task.py (1)
  • with_transition (179-212)
tests/unit/core/test_task.py (2)
tests/integration/engine/test_crash_recovery.py (1)
  • _make_task (96-112)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
tests/unit/engine/test_agent_engine_errors.py (5)
src/ai_company/engine/recovery.py (6)
  • FailAndReassignStrategy (97-166)
  • RecoveryResult (29-61)
  • recover (73-90)
  • recover (108-162)
  • get_strategy_type (92-94)
  • get_strategy_type (164-166)
src/ai_company/providers/enums.py (1)
  • FinishReason (15-22)
src/ai_company/core/task.py (1)
  • Task (38-212)
src/ai_company/engine/agent_engine.py (1)
  • AgentEngine (75-843)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/core/task_transitions.py (1)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
tests/unit/engine/test_task_execution.py (4)
tests/unit/engine/conftest.py (1)
  • sample_task_with_criteria (95-116)
src/ai_company/core/task.py (2)
  • Task (38-212)
  • with_transition (179-212)
src/ai_company/engine/task_execution.py (4)
  • TaskExecution (60-246)
  • from_task (112-133)
  • with_transition (135-193)
  • is_terminal (244-246)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
tests/integration/engine/test_crash_recovery.py (3)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
src/ai_company/engine/loop_protocol.py (1)
  • TerminationReason (24-30)
src/ai_company/engine/task_execution.py (2)
  • TaskExecution (60-246)
  • from_task (112-133)
tests/unit/core/test_task_transitions.py (2)
src/ai_company/core/task_transitions.py (1)
  • validate_transition (65-99)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
tests/unit/engine/test_agent_engine_lifecycle.py (1)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
src/ai_company/engine/agent_engine.py (4)
src/ai_company/engine/recovery.py (4)
  • FailAndReassignStrategy (97-166)
  • RecoveryStrategy (65-94)
  • recover (73-90)
  • recover (108-162)
src/ai_company/core/enums.py (1)
  • TaskStatus (122-146)
src/ai_company/engine/loop_protocol.py (1)
  • TerminationReason (24-30)
src/ai_company/engine/run_result.py (1)
  • termination_reason (64-66)
🪛 markdownlint-cli2 (0.21.0)
DESIGN_SPEC.md

[warning] 721-721: Blank line inside blockquote

(MD028, no-blanks-blockquote)

🔇 Additional comments (2)
src/ai_company/engine/agent_engine.py (2)

607-608: ⚠️ Potential issue | 🔴 Critical

Fix syntax error in multiple exception handlers—use parenthesized syntax for Python 3.14.

except MemoryError, RecursionError: is a SyntaxError in Python 3.x and will prevent the module from importing. This pattern occurs at 5 locations (lines 197, 282, 607, 736, 826) and must be changed to except (MemoryError, RecursionError): throughout the file.

Minimal fix example
-        except MemoryError, RecursionError:
+        except (MemoryError, RecursionError):
             raise
⛔ Skipped due to learnings
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to **/*.py : Use `except A, B:` syntax (no parentheses) for exception handling on Python 3.14—ruff enforces this
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to **/*.py : Handle errors explicitly; never silently swallow errors in Python code
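
For context on why this finding was skipped: PEP 758, new in Python 3.14, allows `except A, B:` without parentheses when there is no `as` clause, so the repository's form parses on the project's target version, while on Python 3.13 and earlier it is a `SyntaxError`. The parenthesized form below behaves the same and parses on every Python 3 release. `run_guarded` is a hypothetical helper used only to show the clause.

```python
from typing import Any, Callable


def run_guarded(action: Callable[[], Any]) -> Any:
    """Run ``action``; let fatal errors propagate, report anything else."""
    try:
        return action()
    except (MemoryError, RecursionError):
        # Parenthesized tuple: equivalent to `except MemoryError, RecursionError:`
        # on Python 3.14+, and also valid on older interpreters.
        raise
    except Exception as exc:
        return f"handled: {type(exc).__name__}"
```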

609-616: Recovery strategy exceptions are designed to be handled gracefully, not re-raised.

The RecoveryStrategy.recover() protocol documents return type as RecoveryResult with no declared exceptions. Recovery strategies are defensive handlers invoked after the execution loop completes—they are not part of the provider-layer retry/fallback chain. The generic exception handler correctly logs and returns the original result, allowing recovery failures to not block task finalization.

> Likely an incorrect or invalid review comment.
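
The non-propagating behavior described above might look roughly like the following. This is a sketch, not the engine's `_apply_recovery`: `ExecutionResult`, `apply_recovery`, and the event name are hypothetical, and the standard-library logger stands in for the project's `get_logger`.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger(__name__)  # stand-in for the project's get_logger


@dataclass(frozen=True)
class ExecutionResult:
    """Stand-in for the engine's execution result."""

    error_message: str
    recovered: bool = False


def apply_recovery(
    result: ExecutionResult,
    recover: Callable[[ExecutionResult], ExecutionResult],
) -> ExecutionResult:
    """Apply a recovery callable; a failing strategy never masks ``result``."""
    try:
        return recover(result)
    except (MemoryError, RecursionError):
        # Fatal errors are always re-raised, as the PR summary states.
        raise
    except Exception:
        # Log with traceback, then hand back the original error result.
        logger.exception("execution.recovery.failed")  # hypothetical event name
        return result
```

The design choice is that a bug in a strategy degrades to "no recovery applied" rather than replacing the task's original error with an unrelated one.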

Comment on lines +720 to +721
> **Non-terminal states:** BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and CANCELLED are terminal states with no outgoing transitions.

⚠️ Potential issue | 🟡 Minor

Remove the blank line inside this blockquote.

Line 721 triggers markdownlint MD028 (no-blanks-blockquote). Keep the blockquote contiguous to avoid the lint failure.

Comment on lines +2281 to 2283
│ │ ├── recovery.py # Crash recovery strategies (RecoveryStrategy protocol)
│ │ ├── run_result.py # AgentRunResult outcome model
│ │ ├── agent_engine.py # Agent execution engine

⚠️ Potential issue | 🟡 Minor

Update the run_result.py description to include RecoveryResult.

The new recovery.py entry is documented here, but run_result.py is still described only as AgentRunResult outcome model. With this PR adding RecoveryResult there as well, the project-structure map is now stale.

Comment on lines +265 to +270
if execution_result.termination_reason == TerminationReason.ERROR:
execution_result = await self._apply_recovery(
execution_result,
agent_id,
task_id,
)

⚠️ Potential issue | 🔴 Critical

Recovery retries will never exhaust through the public run() path.

This code now relies on task_execution.retry_count, but run() still rebuilds execution state via AgentContext.from_identity(..., task=task), and that path creates TaskExecution.from_task(task) with retry_count=0. After a FAILED -> ASSIGNED reassignment, the next AgentEngine.run() starts back at zero, so can_reassign will stay True indefinitely unless callers bypass the public API and hand-roll a TaskExecution. Persist or thread the retry count across the reassignment boundary before shipping this.

Also applies to: 575-606

…iewers

- Extract cost recording to engine/cost_recording.py (agent_engine.py 861→759 lines)
- Fix raise exc from None → raise exc from build_exc (preserve exception chain)
- Upgrade _apply_recovery logger.warning → logger.exception (preserve traceback)
- RecoveryResult.error_message: str → NotBlankStr (project convention)
- FailAndReassignStrategy.STRATEGY_TYPE: Final[str] (immutability intent)
- Update DESIGN_SPEC §6.1 diagram: add ASSIGNED→FAILED, separate BLOCKED/CANCELLED
- Update DESIGN_SPEC §6.6: document RecoveryStrategy protocol methods/fields,
  fix "execution loop" → "AgentEngine.run()", list AgentContextSnapshot fields
- Add cost_recording.py to DESIGN_SPEC §15.3 project structure
- Fix MD028 blank line in blockquote
- Add pytestmark timeout(30) to 3 test files missing it
- Fix test-model-001 → test-small-001 (vendor-agnostic convention)
- Add test: RecoveryResult frozen model immutability
- Add test: Task.with_transition(FAILED→ASSIGNED) with validator
- Add test: RecursionError in recovery propagates
- Update docstrings: recovery.py, agent_engine.py, task_transitions.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Aureliolo Aureliolo merged commit e6e91ed into main Mar 7, 2026
7 checks passed
@Aureliolo Aureliolo deleted the feat/crash-recovery branch March 7, 2026 08:53
original_error=error_msg,
)
-            raise exc from None
+            raise exc from build_exc

raise exc from build_exc reverses the exception causality.

Here, exc is the original exception (passed as the method's parameter), while build_exc is a secondary exception that occurred while handling exc inside the try block. Using raise X from Y sets X.__cause__ = Y, which asserts that Y caused X — but here, exc happened first and build_exc arose secondarily when attempting recovery.

The previous raise exc from None was correct: it suppresses the misleading implicit exception chain and surfaces only the original failure, which is what callers need to diagnose the root cause. The build_exc is already captured in the structured log on line 738, so no debugging information is lost.
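
The attribute-level effect of the two forms can be checked directly. The helpers below are hypothetical and only reproduce the shape of the situation: an original exception is re-raised from inside a handler for a secondary failure.

```python
from typing import Optional, Tuple


def reraise_original(original: Exception, *, chain: bool) -> None:
    """Re-raise ``original`` after a secondary failure inside the handler."""
    try:
        raise RuntimeError("building the error result failed")
    except RuntimeError as build_exc:
        if chain:
            # Sets original.__cause__ = build_exc ("build_exc caused original").
            raise original from build_exc
        # Sets original.__cause__ = None and hides the implicit context.
        raise original from None


def chain_info(*, chain: bool) -> Tuple[Optional[str], bool, Optional[str]]:
    """Return (cause type, __suppress_context__, context type) of the re-raise."""
    try:
        reraise_original(ValueError("original failure"), chain=chain)
    except ValueError as exc:
        cause = type(exc.__cause__).__name__ if exc.__cause__ else None
        context = type(exc.__context__).__name__ if exc.__context__ else None
        return cause, exc.__suppress_context__, context
    raise AssertionError("unreachable: reraise_original always raises")
```

In both forms `__context__` still records the secondary error; what differs is `__cause__`, which a rendered traceback reports as "The above exception was the direct cause of the following exception" when it is set.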

Suggested change
-            raise exc from build_exc
+            raise exc from None

Aureliolo added a commit that referenced this pull request Mar 10, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Aureliolo added a commit that referenced this pull request Mar 11, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.0...v0.1.0) (2026-03-11)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add mandatory JWT + API key authentication
([#256](#256))
([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable output scan response policies
([#263](#263))
([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence
([#279](#279))
([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format
([#286](#286))
([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2
([#271](#271))
([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0
([#273](#273))
([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0
([#272](#272))
([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0
([#270](#270))
([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0
([#274](#274))
([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0
([#275](#275))
([29dd16c](29dd16c))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1
([#282](#282))
([2f4703d](2f4703d))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Aureliolo added a commit that referenced this pull request Mar 30, 2026
Add per-sink enable/disable, level/format overrides, rotation config,
and custom sink creation -- all hot-reloadable without restart via
ObservabilitySettingsSubscriber. Also fix MSW postMessage origin check
(CodeQL alert #149, CWE-020/CWE-940).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Aureliolo added a commit that referenced this pull request Mar 30, 2026
## Summary

- Add per-sink enable/disable, level/format overrides, rotation config,
and custom sink creation -- all hot-reloadable without restart via
`ObservabilitySettingsSubscriber`
- Two new JSON settings (`sink_overrides`, `custom_sinks`) in the
observability namespace with full validation (console sink protection,
path traversal prevention, duplicate path detection)
- Extend `configure_logging` and `build_handler` to support custom
routing overrides for user-defined sinks
- Fix MSW `postMessage` handler missing origin check (CodeQL alert #149,
CWE-020/CWE-940)

## Test plan

- [x] 44 unit tests for `sink_config_builder` (overrides, custom sinks,
validation, combined, edge cases)
- [x] 16 unit tests for `ObservabilitySettingsSubscriber` (protocol,
rebuild, error handling, namespace guard, idempotency)
- [x] 6 integration tests for hot reload (disable sink, level change,
custom sink, routing filters, module-level loggers, message
preservation)
- [x] All 616 observability + settings tests pass
- [x] ruff lint + format clean
- [x] mypy strict pass
- [x] Pre-reviewed by 4 agents, 13 findings addressed

Closes #564

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Implement crash recovery with fail-and-reassign strategy (DESIGN_SPEC §6.6)