feat: implement crash recovery with fail-and-reassign strategy #149
Add crash recovery per DESIGN_SPEC §6.6:
- Add FAILED status to TaskStatus enum (non-terminal, can reassign)
- Add IN_PROGRESS → FAILED and FAILED → ASSIGNED transitions
- Add max_retries field to Task model (default=1)
- Add retry_count field to TaskExecution runtime state
- Create RecoveryStrategy protocol + FailAndReassignStrategy default
- Wire recovery into AgentEngine (_apply_recovery) for all error paths
- Recovery captures redacted AgentContextSnapshot (no message contents)
- Recovery is async (forward-compatible with checkpoint recovery M4/M5)
- Recovery failure is swallowed (never blocks error result)
- Pass recovery_strategy=None to opt out

Closes #129

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-reviewed by 9 agents, 17 findings addressed:

Bug fix:
- Add ASSIGNED→FAILED to valid transitions (fixes early setup failures leaving tasks stuck in ASSIGNED status)

Type improvements:
- RecoveryResult.strategy_type: str → NotBlankStr
- RecoveryResult.can_reassign: stored Field → @computed_field
- Extract "fail_reassign" to FailAndReassignStrategy.STRATEGY_TYPE constant

Documentation:
- DESIGN_SPEC §6.1: update lifecycle diagram with FAILED status
- DESIGN_SPEC §6.2: add max_retries to task definition
- DESIGN_SPEC §6.6: fix "tool call history" wording, add M3 limitation note
- DESIGN_SPEC §15.3: add recovery.py to project structure
- Task/TaskExecution/TaskStatus docstrings: add missing fields

Tests:
- Add TaskStatus.FAILED value assertion
- Add ASSIGNED→FAILED valid transition test
- Add FAILED completed_at/is_terminal verification
- Add MemoryError propagation test for _apply_recovery
- Fix import block formatting in test_recovery.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
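The transition rules listed in the two commit messages can be sketched as a small lookup table. This is a minimal illustration only, not the contents of `task_transitions.py`: the enum string values, the `VALID_TRANSITIONS` name, and the helper functions are assumptions, and the map is deliberately partial (only the edges named in this PR, plus the `ASSIGNED → IN_PROGRESS` step shown in the sequence diagrams).

```python
from enum import Enum


class TaskStatus(Enum):
    """Subset of statuses named in this PR; string values are assumed."""

    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Hypothetical, partial transition map: only edges mentioned in this PR.
VALID_TRANSITIONS: dict[TaskStatus, frozenset[TaskStatus]] = {
    # ASSIGNED -> FAILED covers early setup failures (the bug fix above).
    TaskStatus.ASSIGNED: frozenset({TaskStatus.IN_PROGRESS, TaskStatus.FAILED}),
    TaskStatus.IN_PROGRESS: frozenset({TaskStatus.FAILED}),
    # FAILED is non-terminal: the task can go back to ASSIGNED for a retry.
    TaskStatus.FAILED: frozenset({TaskStatus.ASSIGNED}),
    # Terminal states have no outgoing transitions.
    TaskStatus.COMPLETED: frozenset(),
    TaskStatus.CANCELLED: frozenset(),
}


def is_terminal(status: TaskStatus) -> bool:
    """A status is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]


def validate_transition(current: TaskStatus, target: TaskStatus) -> None:
    """Raise ValueError if the move from current to target is not allowed."""
    if target not in VALID_TRANSITIONS[current]:
        msg = f"Invalid transition: {current.name} -> {target.name}"
        raise ValueError(msg)
```

Keeping `FAILED` out of the terminal set is what lets a later component move the task back to `ASSIGNED` without bypassing validation.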
Dependency Review
✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.
Scanned Files
None
Caution
Review failed
The pull request is closed.
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (11)
📝 Walkthrough
Implements crash recovery (MVP fail-and-reassign): adds non-terminal TaskStatus.FAILED, per-task max_retries, TaskExecution.retry_count, a pluggable RecoveryStrategy with FailAndReassignStrategy, integrates recovery into AgentEngine, captures redacted context snapshots, and adds events and tests exercising retry/reassignment eligibility.
Sequence Diagram(s)
sequenceDiagram
participant Agent as AgentEngine
participant Provider as Provider
participant Recovery as RecoveryStrategy
participant TaskExec as TaskExecution
Agent->>Provider: execute task
Provider-->>Agent: error/exception
Agent->>Agent: _handle_fatal_error()
Agent->>Recovery: _apply_recovery(task_execution, error, context)
activate Recovery
Recovery->>TaskExec: capture redacted context snapshot
Recovery->>TaskExec: set status -> FAILED, record error_message
Recovery-->>Agent: RecoveryResult{can_reassign, context_snapshot}
deactivate Recovery
alt can_reassign
Agent-->>Agent: task available for reassignment (retry_count < max_retries)
else cannot_reassign
Agent-->>Agent: task remains FAILED (no reassignment)
end
Agent-->>Caller: return AgentRunResult (ERROR)
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the system's resilience by introducing a robust crash recovery mechanism. It allows the system to gracefully handle unexpected task failures by transitioning them to a new
Highlights
Code Review
This pull request introduces a robust crash recovery mechanism using a 'fail-and-reassign' strategy. The changes are comprehensive, including updates to the task lifecycle, new data models, a recovery strategy protocol, and integration into the agent engine. The design specification has been updated accordingly, and the changes are well-covered by new unit and integration tests. My review found one critical syntax issue that will prevent the code from running and a suggestion to improve import handling to avoid potential circular dependencies. Overall, this is a solid implementation of a critical feature.
except MemoryError, RecursionError:
    raise
src/ai_company/engine/recovery.py
Outdated
from typing import Protocol, runtime_checkable

from pydantic import BaseModel, ConfigDict, Field, computed_field

from ai_company.core.enums import TaskStatus
from ai_company.core.types import NotBlankStr  # noqa: TC001
from ai_company.engine.context import AgentContext, AgentContextSnapshot  # noqa: TC001
from ai_company.engine.task_execution import TaskExecution  # noqa: TC001
AgentContext is only used as a type hint for a function argument. To avoid potential runtime circular dependencies, it's best practice to place such imports inside a TYPE_CHECKING block. This change separates the runtime-needed AgentContextSnapshot from the type-only AgentContext. The circular import was likely caused by AgentContext, so the noqa suppression on the AgentContextSnapshot import can also be removed.
from typing import Protocol, runtime_checkable, TYPE_CHECKING

from pydantic import BaseModel, ConfigDict, Field, computed_field

from ai_company.core.enums import TaskStatus
from ai_company.core.types import NotBlankStr  # noqa: TC001
from ai_company.engine.context import AgentContextSnapshot
from ai_company.engine.task_execution import TaskExecution  # noqa: TC001

if TYPE_CHECKING:
    from ai_company.engine.context import AgentContext
Greptile Summary
This PR implements crash recovery for
Key verified changes:
One issue found:
Confidence Score: 4/5
Sequence Diagram
sequenceDiagram
participant Caller
participant AgentEngine
participant ReactLoop
participant FailAndReassignStrategy
participant TaskExecution
Caller->>AgentEngine: run(identity, task)
AgentEngine->>AgentEngine: _prepare_context() → IN_PROGRESS
AgentEngine->>ReactLoop: execute(context)
alt Loop raises exception
ReactLoop-->>AgentEngine: raises Exception
AgentEngine->>AgentEngine: _handle_fatal_error(ctx, exc)
AgentEngine->>FailAndReassignStrategy: recover(task_execution, error_msg, ctx)
FailAndReassignStrategy->>TaskExecution: with_transition(FAILED, reason)
TaskExecution-->>FailAndReassignStrategy: failed_execution
FailAndReassignStrategy-->>AgentEngine: RecoveryResult(can_reassign, FAILED)
AgentEngine-->>Caller: AgentRunResult(ERROR, task=FAILED)
else Loop returns TerminationReason.ERROR
ReactLoop-->>AgentEngine: ExecutionResult(ERROR)
AgentEngine->>AgentEngine: _apply_post_execution_transitions() → no-op
AgentEngine->>FailAndReassignStrategy: recover(task_execution, error_msg, ctx)
FailAndReassignStrategy->>TaskExecution: with_transition(FAILED, reason)
TaskExecution-->>FailAndReassignStrategy: failed_execution
FailAndReassignStrategy-->>AgentEngine: RecoveryResult(can_reassign, FAILED)
AgentEngine-->>Caller: AgentRunResult(ERROR, task=FAILED)
end
Note over Caller,TaskExecution: Task Router (M4) checks can_reassign<br/>and creates TaskExecution.from_task(task, retry_count+1)
Last reviewed commit: b25d85a
Pull request overview
Implements crash recovery in the execution engine via a pluggable RecoveryStrategy (defaulting to a fail-and-reassign approach), adds a new non-terminal FAILED task status, and introduces retry tracking (max_retries / retry_count) to support safe reassignment flows.
Changes:
- Add `TaskStatus.FAILED` plus updated valid transition map to allow failing and reassignment (`FAILED → ASSIGNED`).
- Introduce crash recovery framework (`RecoveryStrategy`, `FailAndReassignStrategy`, `RecoveryResult`) and integrate it into `AgentEngine` error paths with new observability events.
- Add retry configuration/telemetry fields (`Task.max_retries`, `TaskExecution.retry_count`) and expand unit/integration tests + update DESIGN_SPEC sections.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/unit/engine/test_task_execution.py | Adds unit coverage for TaskExecution.retry_count defaults and transition behavior with FAILED. |
| tests/unit/engine/test_recovery.py | New unit tests for recovery protocol/strategy behavior and recovery logging events. |
| tests/unit/engine/test_agent_engine_lifecycle.py | Updates lifecycle expectation: error outcomes transition task execution to FAILED. |
| tests/unit/engine/test_agent_engine_errors.py | Adds engine recovery tests (strategy invocation, swallow failure, opt-out, timeout, MemoryError propagation). |
| tests/unit/core/test_task_transitions.py | Extends transition validation tests for FAILED allowed/invalid edges and non-terminal behavior. |
| tests/unit/core/test_task.py | Adds tests for FAILED assignment rules and Task.max_retries validation. |
| tests/unit/core/test_enums.py | Updates enum member count/value checks to include TaskStatus.FAILED. |
| tests/integration/engine/test_crash_recovery.py | New integration test ensuring provider crash results in FAILED status and retry gating checks. |
| src/ai_company/observability/events/execution.py | Adds structured recovery event constants (execution.recovery.*). |
| src/ai_company/engine/task_execution.py | Adds retry_count field and plumbs it into TaskExecution.from_task(). |
| src/ai_company/engine/recovery.py | New recovery protocol + default FailAndReassignStrategy, RecoveryResult model, and snapshot/error logging. |
| src/ai_company/engine/agent_engine.py | Integrates recovery on TerminationReason.ERROR and fatal-error path; adds configurable strategy. |
| src/ai_company/engine/__init__.py | Re-exports recovery types/strategy as part of engine public API. |
| src/ai_company/core/task_transitions.py | Updates lifecycle transition map and docs to include FAILED. |
| src/ai_company/core/task.py | Adds max_retries field and relaxes assignee requirements for FAILED. |
| src/ai_company/core/enums.py | Adds FAILED to TaskStatus and updates lifecycle summary docs. |
| DESIGN_SPEC.md | Updates lifecycle diagram and recovery spec sections to reflect implemented crash recovery and retry limits. |
pytestmark = pytest.mark.timeout(30)
pytestmark = pytest.mark.timeout(30) is unique within tests/unit/engine (other unit-engine modules don’t set a module-level timeout). To keep markers consistent, consider removing the module-level timeout here or applying the same pattern across the unit/engine suite if timeouts are now desired.
pytestmark = pytest.mark.timeout(30)
model_config = ConfigDict(frozen=True)

task_execution: TaskExecution = Field(
    description="Updated execution with FAILED status",
)
strategy_type: NotBlankStr = Field(
    description="Identifier of the recovery strategy used",
)
context_snapshot: AgentContextSnapshot = Field(
    description="Redacted context snapshot (no message contents)",
)
RecoveryResult is declared as a frozen Pydantic model, but there’s no corresponding test_frozen coverage for it (the repo commonly asserts immutability for every ConfigDict(frozen=True) model). Adding a small unit test that field reassignment raises ValidationError would align with existing patterns and prevent accidental regression to mutability.
     @classmethod
-    def from_task(cls, task: Task) -> TaskExecution:
+    def from_task(
+        cls,
+        task: Task,
+        *,
+        retry_count: int = 0,
+    ) -> TaskExecution:
         """Create a fresh execution from a task definition.

         Args:
             task: The frozen task to wrap.
+            retry_count: Number of previous failure-reassignment cycles.

         Returns:
             New ``TaskExecution`` with status matching the task.
         """
-        execution = cls(task=task, status=task.status)
+        execution = cls(task=task, status=task.status, retry_count=retry_count)
TaskExecution.from_task() now accepts retry_count, but the only call site (AgentContext.from_identity) always uses the default, so engine-driven executions can’t start with a non-zero retry count. That makes max_retries/can_reassign effectively ignore previous retries unless callers bypass AgentEngine.run() and build TaskExecution manually. Consider plumbing an optional retry_count through AgentEngine.run() / AgentContext.from_identity() (or persisting it on the task) so retries can be executed via the public API.
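To make the concern concrete, here is a minimal sketch of how a caller could carry `retry_count` across attempts. It rests on assumptions: stdlib dataclasses stand in for the Pydantic models, `can_reassign` is placed on `TaskExecution` for brevity (the PR exposes it on `RecoveryResult`), and `next_attempt` is a hypothetical helper representing whatever component performs the reassignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Simplified stand-in for the frozen task definition."""

    id: str
    max_retries: int = 1


@dataclass(frozen=True)
class TaskExecution:
    task: Task
    retry_count: int = 0

    @classmethod
    def from_task(cls, task: Task, *, retry_count: int = 0) -> "TaskExecution":
        return cls(task=task, retry_count=retry_count)

    @property
    def can_reassign(self) -> bool:
        # Eligibility rule from the walkthrough: retry_count < max_retries.
        return self.retry_count < self.task.max_retries


def next_attempt(failed: TaskExecution) -> TaskExecution:
    """Hypothetical helper: build the next attempt, preserving the count."""
    if not failed.can_reassign:
        msg = f"Task {failed.task.id} exhausted its retries"
        raise RuntimeError(msg)
    # Passing retry_count explicitly is the step the review says is missing
    # when executions are always rebuilt with the default of 0.
    return TaskExecution.from_task(failed.task, retry_count=failed.retry_count + 1)
```

With the default `max_retries=1`, the first failure is eligible for reassignment and the second is not, but only if the incremented count actually reaches the rebuilt execution.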
_DEFAULT_RECOVERY_STRATEGY = FailAndReassignStrategy()
"""Module-level default instance for the recovery strategy."""
Using a module-level _DEFAULT_RECOVERY_STRATEGY instance as the default argument means all AgentEngine instances share the same strategy object. This is safe only as long as the strategy remains strictly stateless; if it later gains config/state (metrics, caches, injected services), that shared instance can cause cross-engine coupling. Prefer defaulting to None and instantiating FailAndReassignStrategy() inside __init__ when no strategy is provided (while still allowing None to explicitly disable recovery via a separate sentinel).
-_DEFAULT_RECOVERY_STRATEGY = FailAndReassignStrategy()
-"""Module-level default instance for the recovery strategy."""
+def get_default_recovery_strategy() -> RecoveryStrategy:
+    """Factory for the default recovery strategy.
+
+    Returns a new ``FailAndReassignStrategy`` instance to avoid sharing
+    mutable state across multiple ``AgentEngine`` instances.
+    """
+    return FailAndReassignStrategy()
+
+
+# Backwards-compatible alias: evaluate to a fresh default strategy when used.
+_DEFAULT_RECOVERY_STRATEGY: RecoveryStrategy = get_default_recovery_strategy()
+"""Module-level default for the recovery strategy.
+
+Note: this is created via ``get_default_recovery_strategy()`` to avoid
+sharing a single mutable instance across multiple engines.
+"""
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/ai_company/engine/task_execution.py (1)
97-127: ⚠️ Potential issue | 🟠 Major
Retry/reassignment can now produce invalid `Task` snapshots.
With `retry_count` enabling the new `FAILED -> ASSIGNED` flow, `TaskExecution` can represent a reassigned attempt while `self.task.assigned_to` is still `None`, which `src/ai_company/core/task.py` allows for `FAILED` but not for `ASSIGNED`. `to_task_snapshot()` later uses `self.task.model_copy(update={"status": self.status})`, so it can silently emit an invalid assigned task without running `Task` validators. Please carry the reassigned assignee in execution state, or rebuild snapshots via `Task.model_validate(...)` / `Task.with_transition(...)` with the required overrides.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/task_execution.py` around lines 97 - 127, The TaskExecution.from_task + to_task_snapshot flow can emit an invalid ASSIGNED Task because from_task only copies Task into execution while retry_count-driven FAILED->ASSIGNED transitions may require a non-None assignee; update TaskExecution to carry the reassigned assignee (e.g., add an assigned_to field on TaskExecution) and modify from_task to accept/record that assignee when creating a reassigned attempt, and update to_task_snapshot to rebuild a validated Task snapshot by calling Task.model_validate(...) or Task.with_transition(...) (instead of Task.model_copy) supplying both status=self.status and the carried assigned_to so validators run and an ASSIGNED Task is never emitted with a None assignee.
DESIGN_SPEC.md (1)
697-720: ⚠️ Potential issue | 🟠 Major
Split `BLOCKED` from `CANCELLED` in the lifecycle diagram.
Line 714 still renders `BLOCKED / CANCELLED` as a shared terminal sink, which contradicts Line 720 and §6.6 where `BLOCKED` is non-terminal and returns to `ASSIGNED`. Please add the `BLOCKED → ASSIGNED` path or separate the two states so the state machine stays unambiguous.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DESIGN_SPEC.md` around lines 697 - 720, The ASCII lifecycle diagram currently combines "BLOCKED / CANCELLED" into one terminal sink which contradicts §6.6; update the diagram to separate BLOCKED and CANCELLED as distinct nodes and add an arrow from BLOCKED back to ASSIGNED labeled "unblocked" (or "blocked / cancelled" as appropriate), keeping CANCELLED as a terminal node with no outgoing transitions and ensuring FAILED remains non-terminal returning to ASSIGNED; target the diagram block containing the states "ASSIGNED", "IN_PROGRESS", "IN_REVIEW", "COMPLETED", "BLOCKED", and "CANCELLED" and adjust the ASCII art arrows so BLOCKED → ASSIGNED is explicit and CANCELLED stays terminal.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 973-975: The wording should explicitly place the recovery catch
point on the AgentEngine boundary: update the sentence that currently reads
“outermost `try/except` in the execution loop” to say something like “the
outermost `try/except` at the AgentEngine boundary (e.g., the top-level
try/catch in AgentEngine.run / execute loop)”; keep the rest of the step
describing logging an ERROR with a redacted AgentContext snapshot (turn count,
accumulated cost — exclude message contents) so responsibility is unambiguously
assigned to AgentEngine.
- Around line 2281-2283: Update the project-structure map entry for
run_result.py to reflect the added RecoveryResult type: change the short
description from "AgentRunResult outcome model" to something like
"AgentRunResult and RecoveryResult outcome models" so it documents both
AgentRunResult and RecoveryResult in run_result.py; ensure you reference
run_result.py and the symbols AgentRunResult and RecoveryResult in the updated
line.
- Around line 720-721: The blockquote containing "**Non-terminal states:**
BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when
unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and
CANCELLED are terminal states with no outgoing transitions." contains an extra
blank line; remove that blank line so the blockquote lines are contiguous (no
empty line inside the quote) to satisfy markdownlint MD028.
In `@src/ai_company/engine/agent_engine.py`:
- Around line 265-270: The retry_count on TaskExecution is being reset to zero
when AgentEngine.run rebuilds state via AgentContext.from_identity which calls
TaskExecution.from_task, so recovery/reassignment logic using
task_execution.retry_count (and can_reassign) never exhausts; fix by persisting
or passing the current retry_count into the rebuilt TaskExecution: update
AgentContext.from_identity (and any call sites in AgentEngine.run and the second
block around lines 575-606) to accept and propagate the existing
TaskExecution.retry_count (or the TaskExecution object) instead of constructing
a fresh TaskExecution.from_task(task) with retry_count=0, and ensure
AgentEngine._apply_recovery increments and persists retry_count back to the
task/context so subsequent run() invocations see the updated value.
In `@tests/integration/engine/test_crash_recovery.py`:
- Around line 89-92: The ModelConfig instance uses a non-canonical model_id
"test-model-001"; update the ModelConfig(..., model_id=...) to use the approved
test alias "test-small-001" (keep provider="test-provider") so the fixture
conforms to repo test-data conventions and allowed aliases.
In `@tests/unit/engine/test_agent_engine_errors.py`:
- Around line 362-364: Add a 30s pytest timeout mark for the new
TestAgentEngineRecovery test class so each test is limited to 30 seconds; either
add a class decorator `@pytest.mark.timeout(30)` above the TestAgentEngineRecovery
class or add a module-level pytestmark = pytest.mark.timeout(30) at top of the
file, and ensure pytest is imported if not already (referencing
TestAgentEngineRecovery to locate the class).
In `@tests/unit/engine/test_task_execution.py`:
- Around line 85-114: Add the required 30-second pytest timeout marker at the
class (or module) level so each test in the TestTaskExecutionRetryCount block
gets the repo-required timeout; locate the TestTaskExecutionRetryCount class and
decorate it with pytest.mark.timeout(30) (or add the same marker at the top of
the module) to enforce the per-test 30s guard for methods like
test_retry_count_default_zero, test_from_task_with_retry_count,
test_retry_count_preserved_on_transition, and
test_failed_transition_not_terminal.
---
Outside diff comments:
In `@DESIGN_SPEC.md`:
- Around line 697-720: The ASCII lifecycle diagram currently combines "BLOCKED /
CANCELLED" into one terminal sink which contradicts §6.6; update the diagram to
separate BLOCKED and CANCELLED as distinct nodes and add an arrow from BLOCKED
back to ASSIGNED labeled "unblocked" (or "blocked / cancelled" as appropriate),
keeping CANCELLED as a terminal node with no outgoing transitions and ensuring
FAILED remains non-terminal returning to ASSIGNED; target the diagram block
containing the states "ASSIGNED", "IN_PROGRESS", "IN_REVIEW", "COMPLETED",
"BLOCKED", and "CANCELLED" and adjust the ASCII art arrows so BLOCKED → ASSIGNED
is explicit and CANCELLED stays terminal.
In `@src/ai_company/engine/task_execution.py`:
- Around line 97-127: The TaskExecution.from_task + to_task_snapshot flow can
emit an invalid ASSIGNED Task because from_task only copies Task into execution
while retry_count-driven FAILED->ASSIGNED transitions may require a non-None
assignee; update TaskExecution to carry the reassigned assignee (e.g., add an
assigned_to field on TaskExecution) and modify from_task to accept/record that
assignee when creating a reassigned attempt, and update to_task_snapshot to
rebuild a validated Task snapshot by calling Task.model_validate(...) or
Task.with_transition(...) (instead of Task.model_copy) supplying both
status=self.status and the carried assigned_to so validators run and an ASSIGNED
Task is never emitted with a None assignee.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 598115f0-a607-4eab-9205-e5f38a107f77
📒 Files selected for processing (17)
- DESIGN_SPEC.md
- src/ai_company/core/enums.py
- src/ai_company/core/task.py
- src/ai_company/core/task_transitions.py
- src/ai_company/engine/__init__.py
- src/ai_company/engine/agent_engine.py
- src/ai_company/engine/recovery.py
- src/ai_company/engine/task_execution.py
- src/ai_company/observability/events/execution.py
- tests/integration/engine/test_crash_recovery.py
- tests/unit/core/test_enums.py
- tests/unit/core/test_task.py
- tests/unit/core/test_task_transitions.py
- tests/unit/engine/test_agent_engine_errors.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- tests/unit/engine/test_recovery.py
- tests/unit/engine/test_task_execution.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Python 3.14+ with PEP 649 native lazy annotations
Do NOT use `from __future__ import annotations`; Python 3.14 has PEP 649
Use `except A, B:` syntax (no parentheses) for exception handling on Python 3.14; ruff enforces this
Add type hints to all public functions in Python; mypy strict mode is enforced
Use Google-style docstrings on all public classes and functions; ruff D rules enforce this
Create new objects instead of mutating existing ones; use `copy.deepcopy()` at construction for non-Pydantic internal collections and `MappingProxyType` wrapping for read-only enforcement
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with `BaseModel`, `model_validator`, `computed_field`, and `ConfigDict`
Use `@computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`)
Use `NotBlankStr` (from `core.types`) for all identifier/name fields, including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants, instead of manual whitespace validators
Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare `create_task`
Enforce line length of 88 characters (ruff enforces this)
Functions should be less than 50 lines, files less than 800 lines
Handle errors explicitly; never silently swallow errors in Python code
Validate at system boundaries (user input, external APIs, config files)
Files:
- src/ai_company/observability/events/execution.py
- tests/unit/engine/test_recovery.py
- src/ai_company/engine/__init__.py
- tests/unit/core/test_enums.py
- src/ai_company/engine/task_execution.py
- src/ai_company/engine/recovery.py
- tests/unit/core/test_task.py
- tests/unit/engine/test_agent_engine_errors.py
- src/ai_company/core/task_transitions.py
- src/ai_company/core/enums.py
- tests/unit/engine/test_task_execution.py
- src/ai_company/core/task.py
- tests/integration/engine/test_crash_recovery.py
- tests/unit/core/test_task_transitions.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- src/ai_company/engine/agent_engine.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: Every module with business logic MUST import `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
Never use `import logging`, `logging.getLogger()`, or `print()` in application code
Always use `logger` as the variable name for loggers (not `_logger`, not `log`)
Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g. `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Always use structured logging with `logger.info(EVENT, key=value)` format, never `logger.info('msg %s', val)`
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
DEBUG level logging should be used for object creation, internal flow, entry/exit of key functions
Pure data models, enums, and re-exports do NOT need logging
Files:
- src/ai_company/observability/events/execution.py
- src/ai_company/engine/__init__.py
- src/ai_company/engine/task_execution.py
- src/ai_company/engine/recovery.py
- src/ai_company/core/task_transitions.py
- src/ai_company/core/enums.py
- src/ai_company/core/task.py
- src/ai_company/engine/agent_engine.py
{src/**/*.py,tests/**/*.py,src/**/*.yaml,src/**/*.yml,tests/**/*.yaml,tests/**/*.yml,examples/**/*.yaml,examples/**/*.yml}
📄 CodeRabbit inference engine (CLAUDE.md)
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) `.claude/` skill/agent files, (3) third-party import paths/module names (e.g. `litellm.types.llms.openai`). Tests must use `test-provider`, `test-small-001`, etc.
Files:
- src/ai_company/observability/events/execution.py
- tests/unit/engine/test_recovery.py
- src/ai_company/engine/__init__.py
- tests/unit/core/test_enums.py
- src/ai_company/engine/task_execution.py
- src/ai_company/engine/recovery.py
- tests/unit/core/test_task.py
- tests/unit/engine/test_agent_engine_errors.py
- src/ai_company/core/task_transitions.py
- src/ai_company/core/enums.py
- tests/unit/engine/test_task_execution.py
- src/ai_company/core/task.py
- tests/integration/engine/test_crash_recovery.py
- tests/unit/core/test_task_transitions.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- src/ai_company/engine/agent_engine.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Mark unit tests with `@pytest.mark.unit`, integration tests with `@pytest.mark.integration`, e2e tests with `@pytest.mark.e2e`, and slow tests with `@pytest.mark.slow`
Use `asyncio_mode = 'auto'` for pytest async tests; no manual `@pytest.mark.asyncio` needed
Set a 30-second timeout per test
Files:
- tests/unit/engine/test_recovery.py
- tests/unit/core/test_enums.py
- tests/unit/core/test_task.py
- tests/unit/engine/test_agent_engine_errors.py
- tests/unit/engine/test_task_execution.py
- tests/integration/engine/test_crash_recovery.py
- tests/unit/core/test_task_transitions.py
- tests/unit/engine/test_agent_engine_lifecycle.py
src/ai_company/{providers,engine}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`RetryExhaustedError` signals that all retries failed; the engine layer catches this to trigger fallback chains
Files:
- src/ai_company/engine/__init__.py
- src/ai_company/engine/task_execution.py
- src/ai_company/engine/recovery.py
- src/ai_company/engine/agent_engine.py
🧠 Learnings (2)
📚 Learning: 2026-03-06T21:51:55.175Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to src/**/*.py : Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g. `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/observability/events/execution.py
📚 Learning: 2026-03-06T21:51:55.175Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to src/ai_company/{providers,engine}/**/*.py : `RetryExhaustedError` signals that all retries failed—the engine layer catches this to trigger fallback chains
Applied to files:
- src/ai_company/engine/__init__.py
- src/ai_company/engine/recovery.py
- tests/integration/engine/test_crash_recovery.py
- DESIGN_SPEC.md
- src/ai_company/engine/agent_engine.py
🧬 Code graph analysis (13)
tests/unit/engine/test_recovery.py (5)
  src/ai_company/core/enums.py (2): TaskStatus(122-146), TaskType(149-157)
  src/ai_company/core/task.py (1): Task(38-212)
  src/ai_company/core/agent.py (1): AgentIdentity(177-235)
  src/ai_company/engine/recovery.py (7): RecoveryResult(29-61), RecoveryStrategy(65-94), get_strategy_type(92-94), get_strategy_type(164-166), recover(73-90), recover(108-162), can_reassign(59-61)
  tests/unit/engine/conftest.py (2): sample_agent_with_personality(52-79), sample_task_with_criteria(95-116)
src/ai_company/engine/__init__.py (1)
  src/ai_company/engine/recovery.py (3): FailAndReassignStrategy(97-166), RecoveryResult(29-61), RecoveryStrategy(65-94)
tests/unit/core/test_enums.py (1)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
src/ai_company/engine/task_execution.py (1)
  src/ai_company/core/task.py (1): Task(38-212)
src/ai_company/engine/recovery.py (4)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
  src/ai_company/engine/context.py (3): AgentContext(87-307), AgentContextSnapshot(41-84), to_snapshot(280-302)
  src/ai_company/engine/task_execution.py (2): TaskExecution(60-246), with_transition(135-193)
  src/ai_company/core/task.py (1): with_transition(179-212)
tests/unit/core/test_task.py (2)
  tests/integration/engine/test_crash_recovery.py (1): _make_task(96-112)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
tests/unit/engine/test_agent_engine_errors.py (5)
  src/ai_company/engine/recovery.py (6): FailAndReassignStrategy(97-166), RecoveryResult(29-61), recover(73-90), recover(108-162), get_strategy_type(92-94), get_strategy_type(164-166)
  src/ai_company/providers/enums.py (1): FinishReason(15-22)
  src/ai_company/core/task.py (1): Task(38-212)
  src/ai_company/engine/agent_engine.py (1): AgentEngine(75-843)
  src/ai_company/engine/context.py (1): AgentContext(87-307)
src/ai_company/core/task_transitions.py (1)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
tests/unit/engine/test_task_execution.py (4)
  tests/unit/engine/conftest.py (1): sample_task_with_criteria(95-116)
  src/ai_company/core/task.py (2): Task(38-212), with_transition(179-212)
  src/ai_company/engine/task_execution.py (4): TaskExecution(60-246), from_task(112-133), with_transition(135-193), is_terminal(244-246)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
tests/integration/engine/test_crash_recovery.py (3)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
  src/ai_company/engine/loop_protocol.py (1): TerminationReason(24-30)
  src/ai_company/engine/task_execution.py (2): TaskExecution(60-246), from_task(112-133)
tests/unit/core/test_task_transitions.py (2)
  src/ai_company/core/task_transitions.py (1): validate_transition(65-99)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
tests/unit/engine/test_agent_engine_lifecycle.py (1)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
src/ai_company/engine/agent_engine.py (4)
  src/ai_company/engine/recovery.py (4): FailAndReassignStrategy(97-166), RecoveryStrategy(65-94), recover(73-90), recover(108-162)
  src/ai_company/core/enums.py (1): TaskStatus(122-146)
  src/ai_company/engine/loop_protocol.py (1): TerminationReason(24-30)
  src/ai_company/engine/run_result.py (1): termination_reason(64-66)
🪛 markdownlint-cli2 (0.21.0)
DESIGN_SPEC.md
[warning] 721-721: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🔇 Additional comments (2)
src/ai_company/engine/agent_engine.py (2)
607-608: ⚠️ Potential issue | 🔴 Critical
Fix syntax error in multiple exception handlers: use parenthesized syntax for Python 3.14.
`except MemoryError, RecursionError:` is a SyntaxError in Python 3.x and will prevent the module from importing. This pattern occurs at 5 locations (lines 197, 282, 607, 736, 826) and must be changed to `except (MemoryError, RecursionError):` throughout the file.
Minimal fix example
- except MemoryError, RecursionError:
+ except (MemoryError, RecursionError):
      raise
⛔ Skipped due to learnings
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to **/*.py : Use `except A, B:` syntax (no parentheses) for exception handling on Python 3.14; ruff enforces this
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to **/*.py : Handle errors explicitly; never silently swallow errors in Python code
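The finding and the learning that overrides it differ only in which Python version they assume. Before Python 3.14 the unparenthesized form is a SyntaxError in Python 3. PEP 758 (Python 3.14) allows `except A, B:` without parentheses when there is no `as` clause. The sketch below uses the parenthesized form because it parses on every Python 3 release; `reraise_fatal` is a hypothetical helper, not code from the PR.

```python
def reraise_fatal(action):
    """Run action(); let interpreter-level failures propagate, report the rest."""
    try:
        return action()
    except (MemoryError, RecursionError):
        # Parenthesized tuple: accepted by every Python 3 release.
        # On Python 3.14+, PEP 758 also accepts the form without parentheses
        # as long as the clause has no `as` target.
        raise
    except Exception as exc:
        return f"handled: {exc}"
```

A project pinned to Python 3.14 can follow either style; the choice is a lint convention there, not a correctness issue.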
609-616: Recovery strategy exceptions are designed to be handled gracefully, not re-raised.
The `RecoveryStrategy.recover()` protocol documents its return type as `RecoveryResult` with no declared exceptions. Recovery strategies are defensive handlers invoked after the execution loop completes; they are not part of the provider-layer retry/fallback chain. The generic exception handler correctly logs and returns the original result, so recovery failures do not block task finalization.
> Likely an incorrect or invalid review comment.
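The behaviour described here (recovery failures are logged and swallowed, while `MemoryError` and `RecursionError` still propagate) can be sketched roughly as follows. This is a simplified illustration under assumptions: `apply_recovery` and its two-argument signature are hypothetical, and the real `_apply_recovery` works with `RecoveryResult` and execution state rather than returning the strategy's value directly.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def apply_recovery(result, strategy):
    """Return the recovered result, or the original result if recovery itself fails."""
    if strategy is None:
        # Opt-out path: no recovery strategy configured.
        return result
    try:
        return await strategy.recover(result)
    except (MemoryError, RecursionError):
        # Interpreter-level failures are never swallowed.
        raise
    except Exception:
        # Log with traceback, then hand back the original error result
        # so a broken strategy cannot block task finalization.
        logger.exception("recovery failed; returning original result")
        return result
```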
| > **Non-terminal states:** BLOCKED and FAILED are non-terminal — BLOCKED returns to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6). COMPLETED and CANCELLED are terminal states with no outgoing transitions. | ||
|
|
Remove the blank line inside this blockquote.
Line 721 triggers markdownlint MD028 (no-blanks-blockquote). Keep the blockquote contiguous to avoid the lint failure.
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 721-721: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DESIGN_SPEC.md` around lines 720 - 721, The blockquote containing
"**Non-terminal states:** BLOCKED and FAILED are non-terminal — BLOCKED returns
to ASSIGNED when unblocked, FAILED returns to ASSIGNED for retry (see §6.6).
COMPLETED and CANCELLED are terminal states with no outgoing transitions."
contains an extra blank line; remove that blank line so the blockquote lines are
contiguous (no empty line inside the quote) to satisfy markdownlint MD028.
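For reference, the lifecycle rules quoted in this comment can be expressed as a transition table. The sketch below is an assumption-laden illustration: it includes only the statuses and transitions named in this PR discussion, the enum values are invented, and the real `validate_transition` in `task_transitions.py` may differ in signature and error type.

```python
from enum import Enum


class TaskStatus(Enum):
    # Values are placeholders; only the member names come from the discussion.
    ASSIGNED = "assigned"
    IN_PROGRESS = "in_progress"
    BLOCKED = "blocked"
    FAILED = "failed"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Partial table: only transitions mentioned in this PR are listed.
VALID_TRANSITIONS = {
    TaskStatus.ASSIGNED: {TaskStatus.FAILED},      # early setup failure
    TaskStatus.IN_PROGRESS: {TaskStatus.FAILED},   # crash during execution
    TaskStatus.BLOCKED: {TaskStatus.ASSIGNED},     # unblocked
    TaskStatus.FAILED: {TaskStatus.ASSIGNED},      # reassigned for retry
    TaskStatus.COMPLETED: set(),                   # terminal
    TaskStatus.CANCELLED: set(),                   # terminal
}


def is_terminal(status):
    """A status is terminal when it has no outgoing transitions."""
    return not VALID_TRANSITIONS[status]


def validate_transition(current, target):
    """Raise ValueError if the table does not allow current -> target."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.name} -> {target.name}")
```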
│ │ ├── recovery.py # Crash recovery strategies (RecoveryStrategy protocol)
│ │ ├── run_result.py # AgentRunResult outcome model
│ │ ├── agent_engine.py # Agent execution engine
Update the run_result.py description to include RecoveryResult.
The new recovery.py entry is documented here, but run_result.py is still described only as AgentRunResult outcome model. With this PR adding RecoveryResult there as well, the project-structure map is now stale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@DESIGN_SPEC.md` around lines 2281 - 2283, Update the project-structure map
entry for run_result.py to reflect the added RecoveryResult type: change the
short description from "AgentRunResult outcome model" to something like
"AgentRunResult and RecoveryResult outcome models" so it documents both
AgentRunResult and RecoveryResult in run_result.py; ensure you reference
run_result.py and the symbols AgentRunResult and RecoveryResult in the updated
line.
if execution_result.termination_reason == TerminationReason.ERROR:
    execution_result = await self._apply_recovery(
        execution_result,
        agent_id,
        task_id,
    )
Recovery retries will never exhaust through the public run() path.
This code now relies on task_execution.retry_count, but run() still rebuilds execution state via AgentContext.from_identity(..., task=task), and that path creates TaskExecution.from_task(task) with retry_count=0. After a FAILED -> ASSIGNED reassignment, the next AgentEngine.run() starts back at zero, so can_reassign will stay True indefinitely unless callers bypass the public API and hand-roll a TaskExecution. Persist or thread the retry count across the reassignment boundary before shipping this.
Also applies to: 575-606
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/agent_engine.py` around lines 265 - 270, The
retry_count on TaskExecution is being reset to zero when AgentEngine.run
rebuilds state via AgentContext.from_identity which calls
TaskExecution.from_task, so recovery/reassignment logic using
task_execution.retry_count (and can_reassign) never exhausts; fix by persisting
or passing the current retry_count into the rebuilt TaskExecution: update
AgentContext.from_identity (and any call sites in AgentEngine.run and the second
block around lines 575-606) to accept and propagate the existing
TaskExecution.retry_count (or the TaskExecution object) instead of constructing
a fresh TaskExecution.from_task(task) with retry_count=0, and ensure
AgentEngine._apply_recovery increments and persists retry_count back to the
task/context so subsequent run() invocations see the updated value.
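The reset described in this comment can be reproduced with a small model. The sketch below is illustrative only: the dataclasses stand in for the project's Pydantic models, the rule `retry_count < max_retries` is an assumption about how `can_reassign` is computed, and the `retry_count` parameter on `from_task` and the `after_failure` helper are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Task:
    id: str
    max_retries: int = 1


@dataclass(frozen=True)
class TaskExecution:
    task: Task
    retry_count: int = 0

    @classmethod
    def from_task(cls, task, retry_count=0):
        # Hypothetical fix: accept the carried-over count instead of
        # always starting from zero.
        return cls(task=task, retry_count=retry_count)

    @property
    def can_reassign(self):
        # Assumed rule: retries remain while the count is below the limit.
        return self.retry_count < self.task.max_retries

    def after_failure(self):
        # Frozen model: return a new instance with the count incremented.
        return replace(self, retry_count=self.retry_count + 1)


task = Task(id="task-1", max_retries=1)
failed = TaskExecution.from_task(task).after_failure()

# Rebuilding from the task alone forgets the failure.
rebuilt = TaskExecution.from_task(task)
# Threading the count across the reassignment boundary preserves it.
threaded = TaskExecution.from_task(task, retry_count=failed.retry_count)
```

With these assumptions, `rebuilt.can_reassign` stays true after every failure, while `threaded.can_reassign` becomes false once the single allowed retry is used.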
…iewers
- Extract cost recording to engine/cost_recording.py (agent_engine.py 861→759 lines)
- Fix raise exc from None → raise exc from build_exc (preserve exception chain)
- Upgrade _apply_recovery logger.warning → logger.exception (preserve traceback)
- RecoveryResult.error_message: str → NotBlankStr (project convention)
- FailAndReassignStrategy.STRATEGY_TYPE: Final[str] (immutability intent)
- Update DESIGN_SPEC §6.1 diagram: add ASSIGNED→FAILED, separate BLOCKED/CANCELLED
- Update DESIGN_SPEC §6.6: document RecoveryStrategy protocol methods/fields, fix "execution loop" → "AgentEngine.run()", list AgentContextSnapshot fields
- Add cost_recording.py to DESIGN_SPEC §15.3 project structure
- Fix MD028 blank line in blockquote
- Add pytestmark timeout(30) to 3 test files missing it
- Fix test-model-001 → test-small-001 (vendor-agnostic convention)
- Add test: RecoveryResult frozen model immutability
- Add test: Task.with_transition(FAILED→ASSIGNED) with validator
- Add test: RecursionError in recovery propagates
- Update docstrings: recovery.py, agent_engine.py, task_transitions.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
  original_error=error_msg,
  )
- raise exc from None
+ raise exc from build_exc
raise exc from build_exc reverses the exception causality.
Here, exc is the original exception (passed as the method's parameter), while build_exc is a secondary exception that occurred while handling exc inside the try block. Using raise X from Y sets X.__cause__ = Y, which asserts that Y caused X — but here, exc happened first and build_exc arose secondarily when attempting recovery.
The previous raise exc from None was correct: it suppresses the misleading implicit exception chain and surfaces only the original failure, which is what callers need to diagnose the root cause. The build_exc is already captured in the structured log on line 738, so no debugging information is lost.
- raise exc from build_exc
+ raise exc from None
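The difference between the two forms is visible on the exception object itself. The following standalone sketch uses a hypothetical helper, `reraise_original`, that mimics the shape of the handler under review: a secondary error occurs while handling an original exception, and the original is then re-raised either chained or with chaining suppressed.

```python
def reraise_original(exc, *, chain):
    """Re-raise exc after a secondary failure, with or without explicit chaining."""
    try:
        # Stand-in for the step that failed while building the error result.
        raise RuntimeError("building the error result failed")
    except RuntimeError as build_exc:
        if chain:
            # Sets exc.__cause__ = build_exc: tracebacks read as
            # "build_exc was the direct cause of exc".
            raise exc from build_exc
        # Sets exc.__cause__ = None and exc.__suppress_context__ = True:
        # tracebacks show only the original failure.
        raise exc from None


def inspect(chain):
    """Return (__cause__, __suppress_context__) as seen by the caller."""
    original = ValueError("provider call failed")
    try:
        reraise_original(original, chain=chain)
    except ValueError as caught:
        return caught.__cause__, caught.__suppress_context__
```

Which form is preferable depends on whether the secondary failure should appear in the caller's traceback; the language semantics only determine what the traceback claims about cause and effect.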
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ai_company/engine/agent_engine.py
Line: 741
Comment:
`raise exc from build_exc` reverses the exception causality.
Here, `exc` is the original exception (passed as the method's parameter), while `build_exc` is a *secondary* exception that occurred *while handling* `exc` inside the try block. Using `raise X from Y` sets `X.__cause__ = Y`, which asserts that `Y` *caused* `X` — but here, `exc` happened first and `build_exc` arose secondarily when attempting recovery.
The previous `raise exc from None` was correct: it suppresses the misleading implicit exception chain and surfaces only the original failure, which is what callers need to diagnose the root cause. The `build_exc` is already captured in the structured log on line 738, so no debugging information is lost.
```suggestion
raise exc from None
```
How can I resolve this? If you propose a fix, please make it concise.🤖 I have created a release *beep* *boop* --- ## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) 
([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement 
meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs 
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, 
contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, 
loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
🤖 I have created a release *beep* *boop* --- ## [0.1.0](v0.0.0...v0.1.0) (2026-03-11) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add mandatory JWT + API key authentication ([#256](#256)) ([c279cfe](c279cfe)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable output scan response policies ([#263](#263)) ([b9907e8](b9907e8)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive 
trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement AuditRepository for security audit log persistence ([#279](#279)) ([94bc29f](94bc29f)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry 
([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into 
observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * resolve circular imports, bump litellm, fix release tag format ([#286](#286)) ([a6659b5](a6659b5)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### 
Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * bump anchore/scan-action from 6.5.1 to 7.3.2 ([#271](#271)) ([80a1c15](80a1c15)) * bump docker/build-push-action from 6.19.2 to 7.0.0 ([#273](#273)) ([dd0219e](dd0219e)) * bump docker/login-action from 3.7.0 to 4.0.0 ([#272](#272)) ([33d6238](33d6238)) * bump docker/metadata-action from 5.10.0 to 6.0.0 ([#270](#270)) ([baee04e](baee04e)) * bump docker/setup-buildx-action from 3.12.0 to 4.0.0 ([#274](#274)) ([5fc06f7](5fc06f7)) * bump sigstore/cosign-installer from 3.9.1 to 4.1.0 ([#275](#275)) ([29dd16c](29dd16c)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree 
management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * **main:** release ai-company 0.1.1 ([#282](#282)) ([2f4703d](2f4703d)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Add per-sink enable/disable, level/format overrides, rotation config, and custom sink creation -- all hot-reloadable without restart via ObservabilitySettingsSubscriber. Also fix MSW postMessage origin check (CodeQL alert #149, CWE-020/CWE-940). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Add per-sink enable/disable, level/format overrides, rotation config, and custom sink creation -- all hot-reloadable without restart via `ObservabilitySettingsSubscriber`
- Two new JSON settings (`sink_overrides`, `custom_sinks`) in the observability namespace with full validation (console sink protection, path traversal prevention, duplicate path detection)
- Extend `configure_logging` and `build_handler` to support custom routing overrides for user-defined sinks
- Fix MSW `postMessage` handler missing origin check (CodeQL alert #149, CWE-020/CWE-940)

## Test plan
- [x] 44 unit tests for `sink_config_builder` (overrides, custom sinks, validation, combined, edge cases)
- [x] 16 unit tests for `ObservabilitySettingsSubscriber` (protocol, rebuild, error handling, namespace guard, idempotency)
- [x] 6 integration tests for hot reload (disable sink, level change, custom sink, routing filters, module-level loggers, message preservation)
- [x] All 616 observability + settings tests pass
- [x] ruff lint + format clean
- [x] mypy strict pass
- [x] Pre-reviewed by 4 agents, 13 findings addressed

Closes #564

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- `FailAndReassignStrategy` per DESIGN_SPEC §6.6 (closes #129)
- `FAILED` as a non-terminal `TaskStatus` with transitions: `IN_PROGRESS → FAILED`, `ASSIGNED → FAILED`, `FAILED → ASSIGNED`
- `RecoveryStrategy` protocol and `FailAndReassignStrategy` default implementation
- `RecoveryResult` with `@computed_field can_reassign` (`retry_count < max_retries`)
- `Task.max_retries` (default=1, ge=0) and `TaskExecution.retry_count` fields
- `AgentEngine`: both normal error path and fatal error path

Key design decisions
- `ASSIGNED → FAILED` transition added (not just `IN_PROGRESS → FAILED`) to handle early setup failures before the task reaches `IN_PROGRESS`
- `can_reassign` is a `@computed_field` (not stored) per project convention for derived values
- `strategy_type` uses `NotBlankStr` per project convention for identifier fields
- `MemoryError`/`RecursionError` always re-raised through recovery

Test plan
- `FailAndReassignStrategy` (happy path, max retries exceeded, zero retries, snapshot redaction, error capture, logging events)
- `RecoveryStrategy` protocol conformance
- `TaskExecution.retry_count` (default, factory, transition preservation, FAILED non-terminal)
- `Task.max_retries` (default, custom, zero, negative rejected)
- `AgentEngine` recovery (provider error → FAILED, strategy invocation, failure swallowed, None bypass, timeout trigger, custom strategy, MemoryError propagation)

Closes #129
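For readers skimming the description, here is a minimal sketch of the three transitions this PR adds and of the `can_reassign` rule (`retry_count < max_retries`). It is not the code in this PR: it uses only the standard library instead of the project's Pydantic models, the enum lists only the statuses named here, and `RECOVERY_TRANSITIONS`, `is_recovery_transition`, and `RecoveryResultSketch` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TaskStatus(Enum):
    # Only the statuses named in this PR description; the real enum has more
    # members, and its string values are not assumed here.
    ASSIGNED = auto()
    IN_PROGRESS = auto()
    FAILED = auto()


# Hypothetical table holding just the transitions this PR adds.
RECOVERY_TRANSITIONS = frozenset(
    {
        (TaskStatus.IN_PROGRESS, TaskStatus.FAILED),  # failure during execution
        (TaskStatus.ASSIGNED, TaskStatus.FAILED),     # early setup failure
        (TaskStatus.FAILED, TaskStatus.ASSIGNED),     # reassignment (FAILED is non-terminal)
    }
)


def is_recovery_transition(src: TaskStatus, dst: TaskStatus) -> bool:
    """Return True if src -> dst is one of the transitions added by this PR."""
    return (src, dst) in RECOVERY_TRANSITIONS


@dataclass(frozen=True)
class RecoveryResultSketch:
    """Stand-in for RecoveryResult; a property replaces @computed_field."""

    retry_count: int
    max_retries: int = 1  # mirrors the Task.max_retries default stated above

    @property
    def can_reassign(self) -> bool:
        # Derived on access rather than stored, as the description states.
        return self.retry_count < self.max_retries
```

With the default `max_retries=1`, a task that has not yet been retried can be reassigned once; `max_retries=0` disables reassignment entirely.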