Skip to content

feat: implement checkpoint recovery strategy#367

Merged
Aureliolo merged 9 commits intomainfrom
feat/checkpoint-recovery
Mar 14, 2026
Merged

feat: implement checkpoint recovery strategy#367
Aureliolo merged 9 commits intomainfrom
feat/checkpoint-recovery

Conversation

@Aureliolo
Copy link
Copy Markdown
Owner

Summary

  • Implement checkpoint recovery strategy (§6.6 Strategy 2) — per-turn AgentContext persistence, heartbeat-based failure detection, resume from last checkpoint with reconciliation, max resume attempts with fallback to fail-and-reassign
  • Add Checkpoint, Heartbeat, CheckpointConfig frozen Pydantic models, CheckpointRecoveryStrategy with async-safe resume counter, CheckpointCallback type alias, and make_checkpoint_callback factory
  • Add CheckpointRepository and HeartbeatRepository protocols with SQLite implementations, V6 migration (checkpoints + heartbeats tables with indexes)
  • Integrate checkpoint callback into ReactLoop and PlanExecuteLoop, resume path in AgentEngine with completion_config and effective_autonomy threading
  • Add 16 checkpoint event constants across events/checkpoint, events/execution, and events/persistence

Closes #201

Post-PR Review

Reviewed by 15 local agents + 5 external reviewers (Copilot, Gemini, Greptile, CodeRabbit, Socket). Two review rounds:

Round 1: 46 findings addressed from 15 agents + 5 external reviewers
Round 2: 8 additional findings addressed from CodeRabbit + Greptile re-review

Key fixes across both rounds:

  • Critical bug: execution_id mismatch causing checkpoint cleanup to never work (Copilot, Greptile, 4 agents)
  • BudgetEnforcer bypass: Resume path now uses enforcer when configured (CodeRabbit)
  • Orphaned rows: Cleanup on max-resume fallback path (Greptile)
  • Heartbeat cleanup: Delete heartbeat alongside checkpoints after resume (Greptile)
  • Conditional cleanup: Skip cleanup on ERROR termination to preserve resume data (CodeRabbit)
  • Lock scope: Fallback delegation moved outside async lock (3 agents)
  • UTC normalization: Heartbeat timestamps normalized for correct comparisons (persistence-reviewer, CodeRabbit)
  • Function/file extraction: _resume_from_checkpoint (118→35 lines), recover() (108→30 lines), new checkpoint/resume.py module
  • Docs drift: 7 documentation fixes across CLAUDE.md, design spec, README

Test plan

  • All 7451+ unit tests pass
  • 94%+ coverage (80% minimum)
  • mypy strict: 0 errors
  • ruff lint + format: clean
  • Pre-commit hooks: all pass
  • Verify checkpoint save/load roundtrip in integration environment
  • Verify resume from checkpoint produces correct agent behavior
  • Verify max_resume_attempts fallback to fail-and-reassign

🤖 Generated with Claude Code

Adds the Checkpoint Recovery strategy (Engine design §6.6, Strategy 2)
which persists AgentContext snapshots after each completed turn and
resumes from the last checkpoint on crash, preserving progress. After
max_resume_attempts failures, falls back to fail-and-reassign.

New modules:
- engine/checkpoint/ — models, callback factory, strategy
- persistence/sqlite/ — checkpoint and heartbeat repositories
- observability/events/checkpoint.py — lifecycle event constants

Key changes:
- RecoveryResult extended with checkpoint_context_json, resume_attempt,
  can_resume fields (backward-compatible defaults)
- ReactLoop and PlanExecuteLoop accept optional checkpoint_callback
- AgentEngine wires checkpoint repos, builds callback-enabled loops,
  and resumes from checkpoint on recoverable errors
- SQLite schema V6 adds checkpoints and heartbeats tables
- PersistenceBackend protocol gains checkpoints/heartbeats properties
- 100 new unit tests, full suite green at 94% coverage
Pre-reviewed by 15 agents, 42 findings addressed:

- Fix missing completion_config and effective_autonomy on resume path
- Add MemoryError/RecursionError re-raise in strategy checkpoint load
- Add asyncio.Lock to protect _resume_counts from TOCTOU race
- Clean up resume counts in fallback path to prevent memory leak
- Validate checkpoint_repo and heartbeat_repo are provided together
- Log warning for unsupported loop types in checkpoint injection
- Skip heartbeat update when checkpoint save fails (prevent limbo)
- Guard against turn-zero checkpoint via explicit check
- Sanitize error message in LLM reconciliation ChatMessage
- Replace assert with explicit RuntimeError for checkpoint_context_json
- Add cross-field validator for checkpoint_context_json/resume_attempt
- Clean up checkpoints after successful resume completion
- Thread completion_config and effective_autonomy through recovery chain
- Extract duplicate checkpoint callback blocks into helper method
- Add public config property to PlanExecuteLoop (remove SLF001 noqa)
- Use EXECUTION_CHECKPOINT_CALLBACK_FAILED event for error paths
- Add structured error info to callback exception logging
- Use NotBlankStr for repo params and callback_factory identifiers
- Move cursor.rowcount read before commit in repos
- Fix HeartbeatRepository.get_stale to accept AwareDatetime
- Drop DESC from composite index (portability for SQLite < 3.47)
- Document intentional FK absence in migrations
- Document heartbeat_interval_seconds as reserved for future use
- Update CLAUDE.md package structure and logging event docs
- Update PersistenceBackend and RecoveryResult docstrings
- Fix various docstring accuracy issues across modules
- Add FakePersistenceBackend checkpoint/heartbeat repos in test conftest
- Add protocol compliance assertions for CheckpointRepository/HeartbeatRepository
- Add test_checkpoint_events_exist assertions
- Fix _FakeCostRecordRepository.aggregate missing task_id (pre-existing)
- Verify checkpoint/heartbeat model content in callback factory tests
- Fix test_strategy.py return type annotation
Critical fixes:
- execution_id mismatch: use checkpoint_ctx.execution_id for cleanup
  instead of fabricated f"{agent_id}:{task_id}" (Copilot, Greptile,
  code-reviewer, python-reviewer, silent-failure, security)
- PEP 758 except syntax in models.py
- Extract _resume_from_checkpoint (118→35 lines) into helpers
- Extract recover() (108→30 lines) into smaller methods
- Move _make_loop_with_callback to checkpoint/resume.py
- Create engine/checkpoint/resume.py for resume orchestration helpers

Major fixes:
- Include actual error_message in reconciliation (not hardcoded text)
- Move _delegate_to_fallback outside lock scope in strategy
- Clean up heartbeat after successful resume (Greptile)
- Conditional checkpoint cleanup — skip on ERROR (CodeRabbit)
- Fix EXECUTION_LOOP_STARTED → EXECUTION_LOOP_START in CLAUDE.md
- Fix PARALLEL_EXECUTION_STARTED → PARALLEL_GROUP_START in CLAUDE.md
- Add 3 missing checkpoint fields to RecoveryResult table in design spec
- Remove non-existent storage field from YAML config example
- Update reconciliation description to match implementation
- Make clear_resume_count async with lock for thread safety
- Normalize heartbeat timestamps to UTC for correct comparisons
- Narrow except Exception to PersistenceError in strategy
- Split make_checkpoint_callback (92→3 functions under 50 lines)

Medium fixes:
- Use CHECKPOINT_DELETE_FAILED event constant for cleanup warnings
- Add exc_info=True on cleanup warning logs
- Log before raising RuntimeError on None checkpoint_context_json
- Type clear_resume_count parameter as NotBlankStr
- Add JSON validation to RecoveryResult.checkpoint_context_json
- Bound _resume_counts dict with _MAX_TRACKED_EXECUTIONS eviction
- Update RecoveryStrategy protocol docstring for checkpoint recovery
- Add checkpoint recovery to README capabilities
- Update ReactLoop checkpoint_callback docstring
- FakeCheckpointRepository.get_latest validates no-filter ValueError
- Add test for RecoveryResult._validate_checkpoint_consistency
- Add CHECKPOINT_UNSUPPORTED_LOOP event constant

Minor fixes:
- Fix checkpoint __init__.py docstring (configurable intervals)
- Fix callback.py type alias docstring (may skip)
- Fix callback_factory docstring (turn-0 skip mention)
- FakeHeartbeatRepository.get_stale sorts by timestamp
- Add turn=0 parametrize cases to test_callback_factory
- Tighten context_json validator to reject arrays/primitives
- Add composite (task_id, turn_number) index
- Add CHECK (turn_number >= 0) constraint to DDL
- Remove duplicate test_schema_version from test_migrations_v2
- Use SCHEMA_VERSION constant in test_migrations_v2
CodeRabbit (5 new actionable comments on commit 6a8cfa7):
- Resume path now uses BudgetEnforcer when configured (was bypassing
  enforcer-specific features like per-agent allocation, auto-downgrade)
- Log deserialization errors before re-raising in resume.py
- Use HEARTBEAT_DELETE_FAILED event constant for heartbeat cleanup
  failures (was incorrectly using CHECKPOINT_DELETE_FAILED)
- Fix reconciliation bullet in design spec (was inconsistent with
  implementation description below it)
- Add comment explaining checkpoint placement before tool execution
  (intentional: preserves LLM response on mid-tool crash)

Greptile (3 findings, confidence 3/5):
- Clean up orphaned checkpoint/heartbeat rows on max-resume fallback
  path (delegate_to_fallback now calls cleanup_after_resume)
- Add heartbeat_repo to CheckpointRecoveryStrategy for fallback cleanup
- Document in-memory resume counter limitation with TODO comment
  (persisting counter across restarts is a planned improvement)
- Wrong heartbeat event constant — already fixed above
Copilot AI review requested due to automatic review settings March 14, 2026 08:10
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 14, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 14, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a checkpoint-based crash-recovery subsystem: per-turn AgentContext persistence and heartbeats, loop hooks and callbacks for checkpointing, a CheckpointRecoveryStrategy with resume attempts and fallback, SQLite repos + migration v6, engine wiring for resume flows, observability events, tests, and docs updates.

Changes

Cohort / File(s) Summary
Checkpoint core
src/ai_company/engine/checkpoint/__init__.py, src/ai_company/engine/checkpoint/models.py, src/ai_company/engine/checkpoint/callback.py, src/ai_company/engine/checkpoint/callback_factory.py, src/ai_company/engine/checkpoint/resume.py, src/ai_company/engine/checkpoint/strategy.py
New checkpoint subsystem: immutable models (Checkpoint, Heartbeat, CheckpointConfig), CheckpointCallback type and factory, loop-injection and deserialize/reconcile helpers, cleanup helpers, and CheckpointRecoveryStrategy with per-execution resume counting, reservation, and fallback delegation.
Engine integration
src/ai_company/engine/agent_engine.py, src/ai_company/engine/__init__.py, src/ai_company/engine/react_loop.py, src/ai_company/engine/plan_execute_loop.py
AgentEngine constructor/run signatures extended to accept checkpoint_repo, heartbeat_repo, and checkpoint_config; run/_execute/_post_execution pipeline threaded with completion_config and effective_autonomy; resume-from-checkpoint workflow and loop callback wiring added; ReactLoop and PlanExecuteLoop accept and invoke checkpoint callbacks.
Recovery & API models
src/ai_company/engine/recovery.py, docs/design/engine.md
RecoveryResult gains checkpoint_context_json, resume_attempt, and computed can_resume plus validation; RecoveryStrategy adds finalize hook; design doc updated to reflect checkpoint resume semantics and new fields.
Persistence protocol & implementations
src/ai_company/persistence/protocol.py, src/ai_company/persistence/repositories.py, src/ai_company/persistence/sqlite/backend.py, src/ai_company/persistence/sqlite/checkpoint_repo.py, src/ai_company/persistence/sqlite/heartbeat_repo.py, src/ai_company/persistence/sqlite/migrations.py, src/ai_company/persistence/sqlite/__init__.py
PersistenceBackend exposes checkpoints and heartbeats repositories; new repository protocols and SQLite implementations added; schema bumped to v6 with DDL for checkpoints and heartbeats; backend wired to create/expose repos.
Observability events
src/ai_company/observability/events/checkpoint.py, src/ai_company/observability/events/execution.py, src/ai_company/observability/events/persistence.py
New structured event constants for checkpoint lifecycle, heartbeat lifecycle, loop integration, recovery flow, execution resume/checkpoint events, and persistence-level checkpoint/heartbeat events.
Tests & fixtures
tests/unit/api/conftest.py, tests/unit/persistence/test_protocol.py, tests/unit/engine/checkpoint/*, tests/unit/persistence/sqlite/*, tests/unit/observability/test_events.py, tests/unit/engine/*, tests/unit/core/conftest.py
Extensive unit tests and fake in-memory checkpoint/heartbeat repos added: models, callback factory, resume helpers, recovery strategy, SQLite repos, migrations v6; fixtures updated to expose checkpoint/heartbeat repos and TaskFactory parent_task_id.
Docs & README
README.md, CLAUDE.md, docs/design/engine.md
Docs updated: README mentions persistence→engine edge and crash recovery; CLAUDE.md logging guidance augmented with explicit import and structured kwargs examples; design doc updated for checkpoint recovery details and new RecoveryResult fields.

Sequence Diagram(s)

mermaid
sequenceDiagram
rect rgba(30,144,255,0.5)
participant Engine
participant Loop
participant CheckpointRepo
participant HeartbeatRepo
participant Observability
end
Note over Engine,Loop: Normal execution with per-turn checkpointing
Engine->>Loop: start execution (with checkpoint_callback)
Loop->>Loop: run turn -> produce AgentContext
Loop->>CheckpointRepo: save checkpoint (context_json, turn)
alt save success
CheckpointRepo-->>HeartbeatRepo: upsert heartbeat (last_heartbeat_at)
HeartbeatRepo-->>Observability: emit HEARTBEAT_UPDATED
CheckpointRepo-->>Observability: emit CHECKPOINT_SAVED
else save failure
CheckpointRepo-->>Observability: emit CHECKPOINT_SAVE_FAILED
end
Note over Engine,CheckpointRepo: On crash detection
Engine->>CheckpointRepo: get_latest(execution_id, task_id)
alt checkpoint found and attempts left
CheckpointRepo-->>Engine: checkpoint (context_json, turn)
Engine->>Loop: deserialize_and_reconcile(context_json + system message)
Loop->>Loop: resume execution from checkpoint
Engine->>CheckpointRepo: cleanup_checkpoint_artifacts(execution_id)
else no checkpoint or exhausted
Engine->>RecoveryStrategy: delegate fallback (fail-and-reassign)
end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

  • Issue 201 — Implements the Checkpoint Recovery strategy (per-turn AgentContext persistence, heartbeat detection, resume-from-checkpoint, max-attempts and fallback), directly matching this PR's objectives.

Possibly related PRs

  • PR 331 — Modifies AgentEngine.run signature and threads effective_autonomy; overlaps with this PR's run/signature changes.
  • PR 197 — Also extends engine execution flow and persistence-related APIs (parked/parked_contexts vs. checkpoint repos), touching the same core files.
  • PR 328 — Alters AgentEngine constructor and execution wiring; related to integrating runtime collaborators (checkpoint/heartbeat).
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat: implement checkpoint recovery strategy' clearly and accurately summarizes the main feature added in this PR—the checkpoint recovery strategy implementation.
Description check ✅ Passed The PR description is comprehensive and directly related to the changeset, detailing the checkpoint recovery strategy implementation, key fixes, test results, and remaining verification tasks.
Linked Issues check ✅ Passed The PR fully addresses the scope defined in issue #201: implements per-turn context persistence, heartbeat detection, checkpoint resume with reconciliation, max resume attempts, and RecoveryStrategy compliance with persistence backends and engine loop integration.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the checkpoint recovery strategy objective; no unrelated modifications detected. Updates to documentation and test infrastructure support the core checkpoint feature.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/checkpoint-recovery
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch feat/checkpoint-recovery
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's resilience by introducing a robust checkpoint recovery strategy. Agents can now automatically recover from crashes by resuming from their last saved state, preserving valuable progress on long-running tasks. This feature incorporates periodic state persistence, liveness detection, and a configurable retry mechanism, ensuring greater operational stability and a graceful fallback to task re-assignment when recovery limits are met.

Highlights

  • Checkpoint Recovery Strategy Implemented: A per-turn AgentContext persistence mechanism was introduced, allowing agents to resume execution from the last saved state after a crash. This includes heartbeat-based failure detection and a configurable number of resume attempts before falling back to task re-assignment.
  • New Data Models and Repositories: New Pydantic models for Checkpoint, Heartbeat, and CheckpointConfig were added. Corresponding CheckpointRepository and HeartbeatRepository protocols were defined, along with their SQLite implementations.
  • Execution Loop Integration: The checkpoint callback has been integrated into both the ReactLoop and PlanExecuteLoop. The AgentEngine now includes a comprehensive resume path to handle state deserialization and reconciliation.
  • Database Schema Update: A new V6 database migration was added to create dedicated 'checkpoints' and 'heartbeats' tables, complete with necessary indexes, to support the persistence of recovery data.
  • Enhanced Observability: Sixteen new event constants were introduced across the 'events/checkpoint', 'events/execution', and 'events/persistence' modules to provide detailed logging and monitoring of the checkpointing and recovery processes.
Changelog
  • CLAUDE.md
    • Updated the description of the 'engine/' module to reflect the new checkpoint recovery capabilities.
    • Added new event constants for checkpoint and persistence events to the logging guidelines.
  • README.md
    • Modified the "Agent Orchestration" section to highlight the new crash recovery feature.
    • Updated the architecture diagram to show 'Persistence' feeding into 'Engine'.
  • docs/design/engine.md
    • Expanded the 'RecoveryResult' model with 'checkpoint_context_json', 'resume_attempt', and 'can_resume' fields.
    • Removed the "Planned" warning for the Checkpoint Recovery strategy, indicating its implementation.
    • Refined the description of checkpoint configuration and the reconciliation process during resume.
  • src/ai_company/engine/init.py
    • Exported new checkpoint-related modules and classes, including 'Checkpoint', 'Heartbeat', 'CheckpointConfig', 'CheckpointRecoveryStrategy', and 'make_checkpoint_callback'.
  • src/ai_company/engine/agent_engine.py
    • Introduced parameters for 'checkpoint_repo', 'heartbeat_repo', and 'checkpoint_config' in the 'AgentEngine' constructor.
    • Implemented the core logic for resuming from checkpoints, including deserialization, reconciliation, and cleanup.
    • Modified execution flow to integrate checkpoint callbacks into loops and handle recovery results.
  • src/ai_company/engine/checkpoint/init.py
    • Added the checkpoint module's public API.
  • src/ai_company/engine/checkpoint/callback.py
    • Defined the 'CheckpointCallback' type alias.
  • src/ai_company/engine/checkpoint/callback_factory.py
    • Created a factory for checkpoint persistence callbacks.
  • src/ai_company/engine/checkpoint/models.py
    • Defined Pydantic models for 'Checkpoint', 'Heartbeat', and 'CheckpointConfig'.
  • src/ai_company/engine/checkpoint/resume.py
    • Added helper functions for checkpoint deserialization, reconciliation, loop integration, and cleanup.
  • src/ai_company/engine/checkpoint/strategy.py
    • Implemented the 'CheckpointRecoveryStrategy' for managing resume attempts and fallback.
  • src/ai_company/engine/plan_execute_loop.py
    • Modified the 'PlanExecuteLoop' to support checkpoint callbacks.
  • src/ai_company/engine/react_loop.py
    • Modified the 'ReactLoop' to support checkpoint callbacks.
  • src/ai_company/engine/recovery.py
    • Updated 'RecoveryResult' with checkpoint-specific fields and validation.
  • src/ai_company/observability/events/checkpoint.py
    • Added new event constants for checkpoint and heartbeat operations.
  • src/ai_company/observability/events/execution.py
    • Added new event constants for checkpoint callback and resume events.
  • src/ai_company/observability/events/persistence.py
    • Added new event constants for checkpoint and heartbeat persistence.
  • src/ai_company/persistence/protocol.py
    • Updated 'PersistenceBackend' to include checkpoint and heartbeat repositories.
  • src/ai_company/persistence/repositories.py
    • Defined new 'CheckpointRepository' and 'HeartbeatRepository' protocols.
  • src/ai_company/persistence/sqlite/init.py
    • Imported and exposed new SQLite repository implementations.
  • src/ai_company/persistence/sqlite/backend.py
    • Integrated new SQLite checkpoint and heartbeat repositories.
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
    • Implemented SQLite persistence for checkpoints.
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
    • Implemented SQLite persistence for heartbeats.
  • src/ai_company/persistence/sqlite/migrations.py
    • Updated schema version to 6 and added SQL for new checkpoint and heartbeat tables.
  • tests/unit/api/conftest.py
    • Added fake implementations for checkpoint and heartbeat repositories for testing.
  • tests/unit/engine/checkpoint/test_callback_factory.py
    • Added unit tests for the checkpoint callback factory.
  • tests/unit/engine/checkpoint/test_models.py
    • Added unit tests for checkpoint and heartbeat Pydantic models.
  • tests/unit/engine/checkpoint/test_strategy.py
    • Added unit tests for the checkpoint recovery strategy.
  • tests/unit/engine/test_recovery_checkpoint_fields.py
    • Added unit tests for new checkpoint-related fields in 'RecoveryResult'.
  • tests/unit/observability/test_events.py
    • Updated event tests to include new checkpoint events.
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
    • Added unit tests for the SQLite checkpoint repository.
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
    • Added unit tests for the SQLite heartbeat repository.
  • tests/unit/persistence/sqlite/test_migrations_v6.py
    • Added unit tests for the V6 database migration.
  • tests/unit/persistence/test_migrations_v2.py
    • Updated schema version assertion in migration tests.
  • tests/unit/persistence/test_protocol.py
    • Updated fake repositories and protocol tests to include checkpoint and heartbeat interfaces.
Activity
  • The pull request was reviewed by 15 local agents and 5 external reviewers (Copilot, Gemini, Greptile, CodeRabbit, Socket).
  • Two rounds of reviews were conducted, addressing a total of 54 findings (46 in Round 1, 8 in Round 2).
  • Key fixes included resolving an execution_id mismatch, ensuring BudgetEnforcer bypass was addressed, implementing cleanup for orphaned rows and heartbeats, conditional cleanup on error termination, correcting lock scope, and normalizing UTC timestamps.
  • Significant refactoring involved extracting _resume_from_checkpoint and recover() functions into a new checkpoint/resume.py module, reducing their line counts.
  • Seven documentation fixes were applied across CLAUDE.md, the design specification, and README.md.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Aureliolo Aureliolo temporarily deployed to cloudflare-preview March 14, 2026 08:10 — with GitHub Actions Inactive
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/ai_company/engine/agent_engine.py (1)

1089-1093: ⚠️ Potential issue | 🟡 Minor

Missing completion_config and effective_autonomy in fatal error recovery path.

_build_error_execution calls _apply_recovery without completion_config and effective_autonomy. If checkpoint recovery is triggered during fatal error handling, the resumed execution won't have proper configuration threading.

This is low-impact since fatal errors typically occur before checkpoints exist.

🛡️ Proposed fix

Update _handle_fatal_error and _build_error_execution signatures to accept and forward these parameters:

     async def _build_error_execution(  # noqa: PLR0913
         self,
         identity: AgentIdentity,
         task: Task,
         agent_id: str,
         task_id: str,
         error_msg: str,
         ctx: AgentContext | None,
+        *,
+        completion_config: CompletionConfig | None = None,
+        effective_autonomy: EffectiveAutonomy | None = None,
     ) -> ExecutionResult:
         """Create an error ``ExecutionResult`` and apply recovery."""
         error_ctx = ctx or AgentContext.from_identity(identity, task=task)
         error_execution = ExecutionResult(
             context=error_ctx,
             termination_reason=TerminationReason.ERROR,
             error_message=error_msg,
         )
         return await self._apply_recovery(
             error_execution,
             agent_id,
             task_id,
+            completion_config=completion_config,
+            effective_autonomy=effective_autonomy,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 1089 - 1093, The
fatal-error recovery path is missing completion_config and effective_autonomy
and thus _build_error_execution currently calls _apply_recovery without them;
update the signatures of _handle_fatal_error and _build_error_execution to
accept completion_config and effective_autonomy, and forward those parameters
into the call to _apply_recovery (and any intermediate calls) so that when
checkpoint recovery is applied the resumed execution receives the correct
completion_config and effective_autonomy values; ensure parameter names match
existing usage and are passed unchanged through to _apply_recovery.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 674-739: The _resume_from_checkpoint function is over the 50-line
limit; extract the core resume logic into a smaller helper (e.g.,
_reconstruct_and_run_resume) that performs deserialize_and_reconcile(...) and
awaits _execute_resumed_loop(...), preserves the existing try/except behavior
(re-raising MemoryError/RecursionError, logging other exceptions with
EXECUTION_RESUME_FAILED and agent_id/task_id), and returns the ExecutionResult
along with checkpoint_ctx.execution_id so _resume_from_checkpoint can call await
self._finalize_resume(...) and return the result; ensure logger.info and the
initial checkpoint null check remain in _resume_from_checkpoint and that
signature/exception semantics of _execute_resumed_loop and _finalize_resume are
unchanged.

In `@src/ai_company/engine/checkpoint/resume.py`:
- Around line 38-86: In deserialize_and_reconcile, change the exception logging
to use a distinct error event constant instead of
CHECKPOINT_RECOVERY_RECONCILIATION: replace the logger.exception call in the
except ValueError block to log CHECKPOINT_RECOVERY_DESERIALIZE_FAILED (including
agent_id, task_id and the error message) and keep the success path logging
(logger.debug) using CHECKPOINT_RECOVERY_RECONCILIATION; ensure
CHECKPOINT_RECOVERY_DESERIALIZE_FAILED is defined/ imported where constants live
so log tooling can distinguish deserialize failures from successful
reconciliation events.

In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 222-228: The current FIFO eviction on self._resume_counts can drop
an actively-recovering execution; change to LRU eviction by turning
_resume_counts into an order-preserving structure and updating access order
whenever an execution_id is touched. Concretely, replace the plain dict use of
_resume_counts with an OrderedDict (or maintain last_access timestamps) and in
the code paths that read/increment _resume_counts (the block updating
self._resume_counts[execution_id] and the eviction check guarded by
_MAX_TRACKED_EXECUTIONS) call move_to_end(execution_id) (or update the
timestamp) so the eviction removes the least-recently-used key instead of the
oldest-inserted key. Ensure the eviction still pops from the front when len(...)
> _MAX_TRACKED_EXECUTIONS.
- Around line 231-241: The code reads _resume_counts in _build_resume_result
which races with updates made inside the lock by _reserve_resume_attempt; modify
_reserve_resume_attempt to return the reserved resume count and change recover
to capture that return value and pass it as an explicit resume_attempt argument
into _build_resume_result (update _build_resume_result signature to accept
resume_attempt instead of reading _resume_counts itself) so the metadata uses
the locked, returned value rather than reading _resume_counts outside the lock.

In `@tests/unit/persistence/sqlite/test_migrations_v6.py`:
- Around line 100-105: The test's expected index set in test_migrations_v6.py is
missing the checkpoint index "idx_cp_task_turn"; update the expected set (used
with the variable indexes) to include "idx_cp_task_turn" so the assertion
(assert expected.issubset(indexes)) verifies all checkpoint indexes created by
the v6 migration.

---

Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 1089-1093: The fatal-error recovery path is missing
completion_config and effective_autonomy and thus _build_error_execution
currently calls _apply_recovery without them; update the signatures of
_handle_fatal_error and _build_error_execution to accept completion_config and
effective_autonomy, and forward those parameters into the call to
_apply_recovery (and any intermediate calls) so that when checkpoint recovery is
applied the resumed execution receives the correct completion_config and
effective_autonomy values; ensure parameter names match existing usage and are
passed unchanged through to _apply_recovery.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6555c608-4472-4e35-a2e7-74f236039a47

📥 Commits

Reviewing files that changed from the base of the PR and between 494013f and 838670e.

📒 Files selected for processing (36)
  • CLAUDE.md
  • README.md
  • docs/design/engine.md
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/observability/events/persistence.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/persistence/sqlite/__init__.py
  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/persistence/sqlite/migrations.py
  • tests/unit/api/conftest.py
  • tests/unit/engine/checkpoint/__init__.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/engine/checkpoint/test_models.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/observability/test_events.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • tests/unit/persistence/test_migrations_v2.py
  • tests/unit/persistence/test_protocol.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Agent
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax: except A, B: (no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001. Tests must use test-provider, test-small-001, etc.

Files:

  • tests/unit/engine/checkpoint/test_models.py
  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • tests/unit/observability/test_events.py
  • src/ai_company/persistence/sqlite/__init__.py
  • tests/unit/persistence/test_migrations_v2.py
  • src/ai_company/observability/events/checkpoint.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/checkpoint/resume.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/persistence/sqlite/migrations.py
  • tests/unit/api/conftest.py
  • src/ai_company/engine/checkpoint/__init__.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • src/ai_company/persistence/repositories.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • src/ai_company/observability/events/persistence.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/checkpoint/strategy.py
  • tests/unit/persistence/test_protocol.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed.
Test timeout: 30 seconds per test.
Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/engine/checkpoint/test_models.py
  • tests/unit/observability/test_events.py
  • tests/unit/persistence/test_migrations_v2.py
  • tests/unit/engine/checkpoint/test_callback_factory.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • tests/unit/api/conftest.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/persistence/test_protocol.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction plus MappingProxyType wrapping for read-only enforcement.
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with BaseModel, model_validator, computed_field, ConfigDict. Use @computed_field for derived values instead of storing redundant fields (e.g., TokenUsage.total_tokens).
Use NotBlankStr from core.types for all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER use import logging, logging.getLogger(), or print() in application code.
Always use logger as the variable name for loggers (not _logger, not log).
Event names must always use constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import dire...

Files:

  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/persistence/sqlite/__init__.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/persistence/sqlite/migrations.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/observability/events/persistence.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/checkpoint/strategy.py
src/ai_company/**/[!_]*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)

Files:

  • src/ai_company/persistence/sqlite/backend.py
  • src/ai_company/persistence/sqlite/heartbeat_repo.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/checkpoint/callback.py
  • src/ai_company/engine/checkpoint/callback_factory.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/checkpoint/models.py
  • src/ai_company/persistence/protocol.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/persistence/sqlite/migrations.py
  • src/ai_company/persistence/repositories.py
  • src/ai_company/observability/events/persistence.py
  • src/ai_company/persistence/sqlite/checkpoint_repo.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/checkpoint/strategy.py
docs/design/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

When approved deviations occur, update the relevant docs/design/ page to reflect the new reality.

Files:

  • docs/design/engine.md
docs/**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

Documentation source in docs/ built with Zensical. Design spec in docs/design/ (7 pages). Architecture in docs/architecture/. Roadmap in docs/roadmap/. Security in docs/security.md.

Files:

  • docs/design/engine.md
🧠 Learnings (9)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/ai_company/observability/events/execution.py
  • tests/unit/observability/test_events.py
  • CLAUDE.md
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/observability/events/persistence.py
  • src/ai_company/engine/plan_execute_loop.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/ai_company/**/[!_]*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • CLAUDE.md
  • src/ai_company/engine/plan_execute_loop.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Structured logging: always use `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : NEVER use `import logging`, `logging.getLogger()`, or `print()` in application code.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All error paths must log at WARNING or ERROR with context before raising.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Always use `logger` as the variable name for loggers (not `_logger`, not `log`).

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : DEBUG logging for object creation, internal flow, and entry/exit of key functions.

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Pure data models, enums, and re-exports do NOT need logging.

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (22)
tests/unit/engine/checkpoint/test_models.py (1)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-65)
  • CheckpointConfig (88-120)
  • Heartbeat (68-85)
src/ai_company/persistence/sqlite/heartbeat_repo.py (4)
src/ai_company/engine/checkpoint/models.py (1)
  • Heartbeat (68-85)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/persistence/errors.py (1)
  • QueryError (33-34)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • _row_to_model (159-174)
src/ai_company/engine/checkpoint/callback.py (1)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/persistence/sqlite/__init__.py (2)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • SQLiteCheckpointRepository (27-174)
src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
  • SQLiteHeartbeatRepository (27-168)
tests/unit/persistence/test_migrations_v2.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
  • get_user_version (261-265)
tests/unit/persistence/sqlite/conftest.py (1)
  • memory_db (15-22)
tests/unit/engine/checkpoint/test_callback_factory.py (5)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (88-120)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • from_identity (140-171)
src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
  • save (37-67)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
tests/unit/persistence/sqlite/test_migrations_v6.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
  • run_migrations (338-406)
tests/unit/persistence/test_migrations_v2.py (1)
  • memory_db (21-26)
src/ai_company/engine/checkpoint/models.py (2)
src/ai_company/tools/base.py (1)
  • description (138-140)
src/ai_company/engine/parallel_models.py (2)
  • agent_id (79-81)
  • task_id (87-89)
src/ai_company/persistence/protocol.py (4)
src/ai_company/persistence/repositories.py (3)
  • CheckpointRepository (484-533)
  • CostRecordRepository (111-163)
  • HeartbeatRepository (537-595)
tests/unit/api/conftest.py (2)
  • checkpoints (477-478)
  • heartbeats (481-482)
tests/unit/persistence/test_protocol.py (2)
  • checkpoints (309-310)
  • heartbeats (313-314)
src/ai_company/persistence/sqlite/backend.py (2)
  • checkpoints (373-379)
  • heartbeats (382-388)
src/ai_company/engine/react_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • _checkpoint_callback (63-76)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
tests/unit/persistence/sqlite/test_checkpoint_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
  • Checkpoint (25-65)
src/ai_company/persistence/sqlite/checkpoint_repo.py (4)
  • SQLiteCheckpointRepository (27-174)
  • save (37-66)
  • get_latest (68-132)
  • delete_by_execution (134-157)
tests/unit/persistence/sqlite/conftest.py (1)
  • migrated_db (26-34)
src/ai_company/engine/__init__.py (3)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-65)
  • CheckpointConfig (88-120)
  • Heartbeat (68-85)
src/ai_company/engine/checkpoint/strategy.py (1)
  • CheckpointRecoveryStrategy (45-284)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/recovery.py (1)
src/ai_company/engine/task_execution.py (1)
  • TaskExecution (60-231)
src/ai_company/persistence/sqlite/migrations.py (2)
tests/unit/persistence/sqlite/test_user_repo.py (1)
  • db (22-28)
tests/unit/hr/test_persistence.py (1)
  • db (29-35)
tests/unit/api/conftest.py (2)
src/ai_company/engine/checkpoint/models.py (2)
  • Checkpoint (25-65)
  • Heartbeat (68-85)
src/ai_company/persistence/protocol.py (2)
  • checkpoints (146-148)
  • heartbeats (151-153)
src/ai_company/engine/checkpoint/__init__.py (4)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-65)
  • CheckpointConfig (88-120)
  • Heartbeat (68-85)
src/ai_company/engine/checkpoint/resume.py (2)
  • cleanup_after_resume (129-172)
  • deserialize_and_reconcile (38-86)
src/ai_company/engine/checkpoint/strategy.py (1)
  • CheckpointRecoveryStrategy (45-284)
tests/unit/engine/test_recovery_checkpoint_fields.py (2)
src/ai_company/engine/recovery.py (5)
  • FailAndReassignStrategy (150-220)
  • RecoveryResult (31-112)
  • can_resume (110-112)
  • recover (124-143)
  • recover (162-216)
src/ai_company/engine/checkpoint/strategy.py (1)
  • recover (83-143)
src/ai_company/persistence/repositories.py (4)
src/ai_company/engine/checkpoint/models.py (2)
  • Checkpoint (25-65)
  • Heartbeat (68-85)
src/ai_company/persistence/sqlite/checkpoint_repo.py (2)
  • get_latest (68-132)
  • delete_by_execution (134-157)
src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
  • get (69-94)
  • get_stale (96-127)
  • delete (129-151)
src/ai_company/persistence/sqlite/parked_context_repo.py (1)
  • get (65-91)
src/ai_company/engine/agent_engine.py (7)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (88-120)
src/ai_company/engine/checkpoint/resume.py (3)
  • cleanup_after_resume (129-172)
  • deserialize_and_reconcile (38-86)
  • make_loop_with_callback (89-126)
src/ai_company/engine/checkpoint/strategy.py (2)
  • CheckpointRecoveryStrategy (45-284)
  • clear_resume_count (149-162)
src/ai_company/engine/recovery.py (3)
  • RecoveryResult (31-112)
  • RecoveryStrategy (116-147)
  • can_resume (110-112)
src/ai_company/persistence/repositories.py (2)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
src/ai_company/engine/loop_protocol.py (3)
  • ExecutionLoop (151-189)
  • ExecutionResult (79-140)
  • execute (159-185)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/engine/plan_execute_loop.py (3)
src/ai_company/engine/plan_models.py (1)
  • PlanExecuteConfig (91-118)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • _checkpoint_callback (63-76)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/engine/checkpoint/strategy.py (8)
src/ai_company/engine/checkpoint/models.py (1)
  • Checkpoint (25-65)
src/ai_company/engine/checkpoint/resume.py (1)
  • cleanup_after_resume (129-172)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • to_snapshot (280-302)
src/ai_company/engine/recovery.py (6)
  • FailAndReassignStrategy (150-220)
  • RecoveryResult (31-112)
  • recover (124-143)
  • recover (162-216)
  • get_strategy_type (145-147)
  • get_strategy_type (218-220)
src/ai_company/persistence/errors.py (1)
  • PersistenceError (8-9)
src/ai_company/persistence/repositories.py (8)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
  • get_latest (498-519)
  • get (59-71)
  • get (218-230)
  • get (346-358)
  • get (426-438)
  • get (551-563)
src/ai_company/engine/parallel_models.py (1)
  • task_id (87-89)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
tests/unit/persistence/test_protocol.py (2)
src/ai_company/persistence/repositories.py (24)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
  • save (48-57)
  • save (114-123)
  • save (170-180)
  • save (207-216)
  • save (284-294)
  • save (335-344)
  • save (415-424)
  • save (487-496)
  • save (540-549)
  • get_latest (498-519)
  • delete_by_execution (521-533)
  • get (59-71)
  • get (218-230)
  • get (346-358)
  • get (426-438)
  • get (551-563)
  • get_stale (565-581)
  • delete (95-107)
  • delete (260-272)
  • delete (396-408)
  • delete (468-480)
  • delete (583-595)
src/ai_company/engine/checkpoint/models.py (2)
  • Checkpoint (25-65)
  • Heartbeat (68-85)
🪛 LanguageTool
CLAUDE.md

[style] ~154-~154: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...

(EG_NO_COMMA)

🔇 Additional comments (64)
src/ai_company/engine/checkpoint/callback.py (1)

13-14: Type alias is clear and correctly scoped.

CheckpointCallback at Line 13 matches the intended async callback contract and keeps checkpoint integration points strongly typed.

src/ai_company/persistence/sqlite/migrations.py (1)

26-26: V6 migration wiring is consistent and complete.

Line 26, Lines 224-256, and Lines 320-335 correctly align schema version, DDL, and migration registration for checkpoints/heartbeats.

Also applies to: 224-256, 320-335

tests/unit/persistence/test_migrations_v2.py (1)

58-58: Good replacement of hardcoded schema version.

Line 58 now tracks the canonical constant, which prevents brittle test failures on future migration bumps.

src/ai_company/observability/events/persistence.py (1)

152-176: Event constant additions look consistent.

The new checkpoint and heartbeat constants are well-named and complete for save/query/delete/deserialization paths.

README.md (1)

34-35: Docs update matches the implemented architecture.

The capability and diagram updates correctly communicate checkpoint-based crash recovery and persistence↔engine interaction.

Also applies to: 114-114

src/ai_company/persistence/sqlite/__init__.py (1)

7-12: Re-export surface is correctly extended.

The new repository symbols are imported and exposed in __all__ consistently.

Also applies to: 26-29

CLAUDE.md (1)

107-107: Guideline updates are aligned with the checkpoint recovery work.

Both changed lines accurately reflect the new engine recovery surface and observability-event usage conventions.

Also applies to: 154-154

tests/unit/persistence/sqlite/test_migrations_v6.py (1)

1-85: The rest of the V6 migration coverage is solid.

Table existence/columns, heartbeat index presence, and idempotency checks are all well-structured.

Also applies to: 107-135

tests/unit/observability/test_events.py (1)

11-15: LGTM!

The new checkpoint event imports and test coverage follow established patterns in this file. Event constants are correctly imported from the domain-specific module (ai_company.observability.events.checkpoint), and the test verifies the expected string values while updating the domain module discovery set.

Also applies to: 227-227, 254-258

src/ai_company/persistence/protocol.py (1)

145-153: LGTM!

The new checkpoints and heartbeats property accessors follow the established pattern for other repository accessors in the protocol. Docstrings are consistent with existing properties, and the protocol update aligns with the concrete implementations in SQLitePersistenceBackend and test fakes.

tests/unit/engine/checkpoint/test_callback_factory.py (1)

1-384: LGTM!

Comprehensive test coverage for the checkpoint callback factory. Tests properly cover:

  • Boundary turn logic with parametrized cases
  • Heartbeat update behavior (saved on checkpoint, skipped on skip)
  • Error handling: swallowing of general exceptions, propagation of MemoryError/RecursionError
  • Checkpoint failure skipping heartbeat to avoid inconsistent state

Test structure follows conventions with proper markers and timeout configuration.

src/ai_company/persistence/sqlite/backend.py (1)

32-37: LGTM!

The new checkpoints and heartbeats repository integration follows the established patterns exactly:

  • Private attributes initialized to None
  • State clearing in _clear_state
  • Repository instantiation in _create_repositories
  • Public accessors using _require_connected for connectivity guards

Consistent with all other repository wiring in this backend.

Also applies to: 88-89, 104-105, 170-171, 372-388

tests/unit/persistence/sqlite/test_heartbeat_repo.py (1)

1-194: LGTM!

Thorough test coverage for SQLiteHeartbeatRepository covering all protocol methods:

  • Save/get roundtrip with field verification
  • Missing execution returns None
  • Upsert behavior updates existing records
  • get_stale filtering and ordering by timestamp
  • Delete semantics with isolation verification

Test structure follows conventions with proper markers and fixture usage.

src/ai_company/engine/react_loop.py (1)

209-226: LGTM!

The checkpoint callback integration is well-designed:

  • Clear documentation explains the intentional timing (after LLM response, before tool execution) for crash recovery
  • PEP 758 except syntax used correctly for Python 3.14+
  • Error handling appropriately re-raises MemoryError/RecursionError while swallowing and logging other exceptions
  • Structured logging with EXECUTION_CHECKPOINT_CALLBACK_FAILED event constant
tests/unit/engine/test_recovery_checkpoint_fields.py (1)

1-184: LGTM!

Solid test coverage for RecoveryResult checkpoint-related fields:

  • can_resume computed field logic verification
  • Consistency validation between checkpoint_context_json and resume_attempt
  • Default value verification for resume_attempt
  • Backward compatibility ensuring FailAndReassignStrategy produces expected non-resumable results

Tests properly use fixtures and follow conventions.

src/ai_company/observability/events/execution.py (1)

67-75: LGTM!

New execution event constants for checkpoint and resume flows follow the established patterns:

  • Final[str] type annotations
  • Domain.subject.qualifier naming convention
  • Logical grouping with descriptive comment header

Constants are properly consumed by ReactLoop and PlanExecuteLoop for observability.

tests/unit/engine/checkpoint/test_models.py (1)

1-272: LGTM!

Comprehensive test coverage for the checkpoint models including creation, immutability, validation, and constraint enforcement. Test structure is clean with well-designed helpers and proper use of @pytest.mark.unit markers on all test classes.

tests/unit/persistence/sqlite/test_checkpoint_repo.py (1)

1-238: LGTM!

Thorough test coverage for the SQLite checkpoint repository including roundtrip persistence, ordering by turn_number, filtering by execution_id/task_id, upsert behavior, and deletion. Good use of explicit checkpoint IDs in tests to ensure deterministic assertions.

src/ai_company/engine/__init__.py (1)

31-38: LGTM!

Clean re-export of the checkpoint subsystem's public API. The __all__ entries are properly sorted alphabetically and the imports correctly reference the checkpoint subpackage.

Also applies to: 219-222, 258-258, 348-348

src/ai_company/persistence/sqlite/heartbeat_repo.py (1)

1-168: LGTM!

Well-structured repository implementation that mirrors the checkpoint repository pattern. Proper UTC normalization for timestamp comparisons, comprehensive error handling with structured logging, and clean separation of concerns with the _row_to_model helper.

src/ai_company/engine/checkpoint/callback_factory.py (1)

1-133: LGTM!

Clean factory implementation with proper closure capture. The best-effort persistence pattern (log but don't propagate errors) is appropriate for checkpoint callbacks that shouldn't crash the execution loop. Good design decision to only save heartbeat after successful checkpoint save.

src/ai_company/engine/checkpoint/__init__.py (1)

1-29: LGTM!

Clean public API surface for the checkpoint subsystem. All necessary components are properly re-exported with an alphabetically sorted __all__.

tests/unit/engine/checkpoint/test_strategy.py (1)

1-439: LGTM!

Comprehensive test coverage for CheckpointRecoveryStrategy including protocol compliance, resume paths with checkpoint, fallback scenarios (no checkpoint, max attempts exhausted, zero attempts, repository errors), custom fallback injection, and per-execution counter independence. Well-structured with clear test organization.

src/ai_company/engine/recovery.py (2)

62-92: LGTM!

Well-designed consistency validation between checkpoint_context_json and resume_attempt. The validator correctly enforces that both fields must be set together (for resume) or both at defaults (for non-resume). JSON validation ensures the checkpoint context is a valid object when provided.


106-112: LGTM!

Clean computed_field implementation for can_resume that derives from the presence of checkpoint context, complementing the existing can_reassign pattern.

docs/design/engine.md (2)

511-514: LGTM! New RecoveryResult fields are well-documented.

The new checkpoint recovery fields (checkpoint_context_json, resume_attempt, can_resume) are clearly documented with appropriate types and descriptions. The computed property semantics for can_resume align with the implementation.


565-579: Documentation correctly reflects the checkpoint recovery implementation.

The updates accurately describe:

  • Storage backend delegation to the injected CheckpointRepository
  • Reconciliation via system message informing the agent of the resume point and triggering error
  • Deferred richer reconciliation (workspace change detection) for future iterations

This aligns well with the PR objectives and implementation.

src/ai_company/persistence/sqlite/checkpoint_repo.py (4)

37-66: LGTM! Save method correctly implements upsert with proper error handling.

The save method properly:

  • Uses INSERT OR REPLACE for upsert semantics
  • Serializes checkpoint via model_dump(mode="json")
  • Logs error before raising QueryError
  • Logs success at DEBUG level after commit

68-132: LGTM! Well-structured get_latest with proper validation and parameterization.

The method correctly:

  • Validates that at least one filter is provided
  • Builds WHERE clause from hardcoded column names (safe from injection)
  • Uses parameterized queries for values
  • Orders by turn_number DESC LIMIT 1 to get the latest checkpoint
  • Handles not-found and error cases appropriately

134-157: LGTM! Delete method with proper cleanup and logging.

The delete_by_execution method correctly returns the count of deleted rows and logs appropriately. The conditional logging for count > 0 avoids noise for no-op deletions.


159-174: LGTM! Row-to-model conversion with proper error handling.

The _row_to_model helper correctly catches ValidationError and wraps it in QueryError with appropriate logging. This provides clear error semantics for callers.

src/ai_company/engine/plan_execute_loop.py (4)

86-97: LGTM! Constructor properly accepts optional checkpoint callback.

The constructor cleanly integrates the checkpoint callback as an optional parameter and exposes the config via a property. This maintains backward compatibility while enabling checkpoint persistence.


574-575: LGTM! Checkpoint callback invoked after planner turn completion.

Checkpointing after the planner response ensures the plan is persisted before step execution begins. This is the correct placement for crash recovery.


711-712: LGTM! Checkpoint callback invoked after step turn completion.

Checkpointing after each step turn ensures incremental progress is preserved. This aligns with the "persist after each completed turn" design documented in the PR objectives.


768-793: LGTM! Robust checkpoint callback invocation with proper error isolation.

The implementation correctly:

  • Short-circuits if no callback is configured
  • Re-raises MemoryError and RecursionError (non-recoverable)
  • Logs and swallows all other exceptions to prevent checkpoint failures from interrupting execution
  • Uses PEP 758 except syntax (Python 3.14+) as required by coding guidelines
src/ai_company/engine/checkpoint/resume.py (2)

89-126: LGTM! Loop instantiation correctly preserves immutability.

The function creates new loop instances with the callback injected rather than mutating existing loops. This follows the frozen model pattern used throughout the codebase. The warning for unsupported loop types provides appropriate observability.


129-172: LGTM! Best-effort cleanup with robust error handling.

The cleanup_after_resume function correctly:

  • Handles None repositories gracefully
  • Logs successes at DEBUG level
  • Catches all exceptions and logs at WARNING without propagating
  • Uses exc_info=True to include stack traces for debugging

This ensures cleanup failures never interrupt the main execution flow.

tests/unit/api/conftest.py (3)

338-372: LGTM! FakeCheckpointRepository correctly implements the protocol.

The implementation properly:

  • Validates that at least one filter is provided in get_latest
  • Returns the checkpoint with highest turn_number using max()
  • Returns delete count from delete_by_execution

This mirrors the CheckpointRepository protocol defined in src/ai_company/persistence/repositories.py.


374-395: LGTM! FakeHeartbeatRepository correctly implements the protocol.

The implementation properly:

  • Uses execution_id as the unique key for storage
  • Returns stale heartbeats sorted by last_heartbeat_at
  • Returns boolean from delete indicating whether removal occurred

This aligns with the HeartbeatRepository protocol.


411-412: LGTM! FakePersistenceBackend correctly exposes new repositories.

The new _checkpoints and _heartbeats_repo attributes are initialized in the constructor and exposed via properties matching the protocol surface (checkpoints, heartbeats).

Also applies to: 476-482

src/ai_company/engine/checkpoint/models.py (3)

25-65: LGTM! Checkpoint model is well-designed with proper validation.

The model correctly:

  • Uses frozen=True for immutability
  • Validates context_json is a valid JSON object (not primitive/array)
  • Uses NotBlankStr for all identifier fields
  • Has timezone-aware datetime fields with AwareDatetime
  • Uses lambda factories for default values to avoid mutable default issues

68-85: LGTM! Heartbeat model is minimal and well-structured.

The model correctly captures the essential liveness signal fields with proper types. The execution_id serves as the logical primary key for upsert operations.


88-120: LGTM! CheckpointConfig has sensible defaults and validation.

The configuration model correctly:

  • Defaults persist_every_n_turns to 1 (checkpoint every turn)
  • Documents that heartbeat_interval_seconds is reserved for future use
  • Uses ge=0 for max_resume_attempts to allow disabling resume
src/ai_company/observability/events/checkpoint.py (1)

1-29: LGTM! Comprehensive and well-organized event constants.

The module provides complete coverage of checkpoint/heartbeat lifecycle events with consistent naming conventions (checkpoint.*, heartbeat.*). The categorization (lifecycle, heartbeat, loop integration, recovery flow) aids discoverability and aligns with coding guidelines for domain-specific event modules.

src/ai_company/persistence/repositories.py (3)

25-26: LGTM! TYPE_CHECKING guard properly avoids circular imports.

Using TYPE_CHECKING for the Checkpoint and Heartbeat imports prevents runtime circular dependencies while enabling static type checking.


483-533: LGTM! CheckpointRepository protocol is well-defined.

The protocol correctly specifies:

  • save: Upsert semantics (insert or replace by ID)
  • get_latest: Requires at least one filter, returns highest turn_number
  • delete_by_execution: Returns count of deleted checkpoints

Docstrings include proper Raises documentation for PersistenceError and ValueError.


536-595: LGTM! HeartbeatRepository protocol is well-defined.

The protocol correctly specifies:

  • save: Upsert by execution_id
  • get: Single heartbeat lookup
  • get_stale: Threshold-based query for detecting unresponsive agents
  • delete: Boolean return for deletion status

This enables the heartbeat-based failure detection described in the PR objectives.

tests/unit/persistence/test_protocol.py (5)

214-227: LGTM!

The _FakeCheckpointRepository correctly implements the CheckpointRepository protocol methods with appropriate signatures and return types. The use of str instead of NotBlankStr is acceptable for test fakes due to structural subtyping.


230-244: LGTM!

The _FakeHeartbeatRepository properly implements all required protocol methods: save, get, get_stale, and delete. Return types align with the protocol specification.


76-82: LGTM!

The aggregate method signature correctly adds the optional task_id parameter to match the updated CostRecordRepository protocol.


308-314: LGTM!

The _FakeBackend correctly exposes the new checkpoints and heartbeats repository properties, maintaining consistency with the PersistenceBackend protocol.


368-372: LGTM!

Good addition of protocol compliance tests for the new fake repositories, consistent with the existing test pattern in this file.

src/ai_company/engine/checkpoint/strategy.py (5)

1-39: LGTM!

The module docstring, imports, and logger setup follow the project conventions. Event constants are properly imported from the domain-specific observability module.


45-82: LGTM!

The class docstring and __init__ are well-structured. The in-memory resume counter with the documented limitation (resets on process restart) is appropriately noted. Using asyncio.Lock for async-safe access is correct.


83-143: LGTM!

The recover method is well-structured with clear flow: load checkpoint → check resume attempts → either delegate to fallback or build resume result. Proper logging at each decision point.


166-200: LGTM!

The _load_latest_checkpoint helper correctly handles PersistenceError by logging and returning None, while re-raising MemoryError and RecursionError as non-recoverable. The PEP 758 except syntax is correctly applied.


263-284: LGTM!

The _delegate_to_fallback method correctly cleans up orphaned checkpoint/heartbeat rows before delegating to the fallback strategy. The cleanup is best-effort (errors logged but not propagated) as documented in cleanup_after_resume.

src/ai_company/engine/agent_engine.py (7)

18-24: LGTM!

Imports for checkpoint models and resume helpers are correctly organized. The imports support the new checkpoint recovery integration.


165-181: LGTM!

Good validation ensuring checkpoint_repo and heartbeat_repo are provided together or both omitted. This prevents misconfiguration. Default CheckpointConfig() when none provided is a sensible fallback.


494-507: LGTM!

The _make_loop_with_callback helper cleanly delegates to make_loop_with_callback from the resume module, maintaining separation of concerns.


617-672: LGTM!

The _apply_recovery method correctly threads completion_config and effective_autonomy through to the checkpoint resume path. The can_resume check properly gates the new resume flow.


741-776: LGTM!

The _execute_resumed_loop correctly reconstructs the budget checker from the checkpoint context and creates a fresh tool invoker. The approach mirrors the main _execute path appropriately.


778-804: LGTM!

The _finalize_resume method correctly:

  1. Only cleans up on non-error termination (preserving data for potential re-resume on failure)
  2. Type-checks for CheckpointRecoveryStrategy before calling clear_resume_count
  3. Performs best-effort cleanup via cleanup_after_resume

287-294: LGTM!

Correct use of PEP 758 except syntax for re-raising non-recoverable errors.

Comment on lines +222 to +228
self._resume_counts[execution_id] = resume_count + 1

# Evict oldest entries when the dict grows too large
if len(self._resume_counts) > _MAX_TRACKED_EXECUTIONS:
oldest = next(iter(self._resume_counts))
self._resume_counts.pop(oldest, None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

FIFO eviction may prematurely remove active executions.

The eviction uses insertion order (next(iter(...))), which may remove an execution that is still actively recovering while keeping idle ones. Consider LRU eviction or document this limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/checkpoint/strategy.py` around lines 222 - 228, The
current FIFO eviction on self._resume_counts can drop an actively-recovering
execution; change to LRU eviction by turning _resume_counts into an
order-preserving structure and updating access order whenever an execution_id is
touched. Concretely, replace the plain dict use of _resume_counts with an
OrderedDict (or maintain last_access timestamps) and in the code paths that
read/increment _resume_counts (the block updating
self._resume_counts[execution_id] and the eviction check guarded by
_MAX_TRACKED_EXECUTIONS) call move_to_end(execution_id) (or update the
timestamp) so the eviction removes the least-recently-used key instead of the
oldest-inserted key. Ensure the eviction still pops from the front when len(...)
> _MAX_TRACKED_EXECUTIONS.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the “Checkpoint Recovery” crash-recovery strategy by persisting AgentContext checkpoints + execution heartbeats, adding repository protocols and SQLite storage, and integrating per-turn checkpoint callbacks plus a resume path into the engine.

Changes:

  • Added checkpoint/heartbeat domain models, callback factory, resume helpers, and CheckpointRecoveryStrategy with max-attempt fallback.
  • Extended persistence layer with checkpoint/heartbeat repository protocols, SQLite implementations, and schema v6 migration + tests.
  • Integrated checkpoint callback injection into execution loops and added engine resume flow + observability events/docs.

Reviewed changes

Copilot reviewed 35 out of 36 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/unit/persistence/test_protocol.py Extends protocol conformance fakes to include checkpoint/heartbeat repos.
tests/unit/persistence/test_migrations_v2.py Makes schema version assertions resilient to future bumps.
tests/unit/persistence/sqlite/test_migrations_v6.py Adds tests for v6 checkpoints/heartbeats tables + indexes + idempotency.
tests/unit/persistence/sqlite/test_heartbeat_repo.py Adds unit tests for SQLite heartbeat repository behavior.
tests/unit/persistence/sqlite/test_checkpoint_repo.py Adds unit tests for SQLite checkpoint repository behavior.
tests/unit/observability/test_events.py Ensures checkpoint event module is discovered and key constants exist.
tests/unit/engine/test_recovery_checkpoint_fields.py Tests RecoveryResult checkpoint fields (can_resume, resume_attempt) and validation rules.
tests/unit/engine/checkpoint/test_strategy.py Tests checkpoint recovery strategy behavior, fallback, and counter semantics.
tests/unit/engine/checkpoint/test_models.py Tests Pydantic checkpoint/heartbeat/config models and validation/frozen behavior.
tests/unit/engine/checkpoint/test_callback_factory.py Tests checkpoint callback persistence boundary logic and error handling.
tests/unit/engine/checkpoint/init.py Adds checkpoint test package marker.
tests/unit/api/conftest.py Adds in-memory fake checkpoint/heartbeat repos to API test persistence backend.
src/ai_company/persistence/sqlite/migrations.py Bumps schema to v6 and adds checkpoints/heartbeats DDL + indexes.
src/ai_company/persistence/sqlite/heartbeat_repo.py Implements SQLite heartbeat persistence (upsert/get/get_stale/delete).
src/ai_company/persistence/sqlite/checkpoint_repo.py Implements SQLite checkpoint persistence (upsert/get_latest/delete_by_execution).
src/ai_company/persistence/sqlite/backend.py Wires checkpoint/heartbeat repos into SQLite backend lifecycle and properties.
src/ai_company/persistence/sqlite/init.py Exports new SQLite checkpoint/heartbeat repositories.
src/ai_company/persistence/repositories.py Adds CheckpointRepository and HeartbeatRepository protocols.
src/ai_company/persistence/protocol.py Extends PersistenceBackend protocol with checkpoint/heartbeat repositories.
src/ai_company/observability/events/persistence.py Adds persistence event constants for checkpoint/heartbeat operations.
src/ai_company/observability/events/execution.py Adds execution-level events for checkpoint callback and resume lifecycle.
src/ai_company/observability/events/checkpoint.py Introduces checkpoint-domain event constants.
src/ai_company/engine/recovery.py Extends RecoveryResult with checkpoint resume fields + consistency validation.
src/ai_company/engine/react_loop.py Injects optional per-turn checkpoint callback into ReAct loop.
src/ai_company/engine/plan_execute_loop.py Injects optional per-turn checkpoint callback into Plan-and-Execute loop.
src/ai_company/engine/checkpoint/strategy.py Adds CheckpointRecoveryStrategy with in-memory async-locked resume attempt counter.
src/ai_company/engine/checkpoint/resume.py Adds resume helpers: deserialize+reconcile, loop callback injection, cleanup helpers.
src/ai_company/engine/checkpoint/models.py Adds frozen Pydantic models: Checkpoint, Heartbeat, CheckpointConfig.
src/ai_company/engine/checkpoint/callback_factory.py Adds factory to produce best-effort per-turn checkpoint+heartbeat persistence callback.
src/ai_company/engine/checkpoint/callback.py Adds CheckpointCallback type alias.
src/ai_company/engine/checkpoint/init.py Exposes checkpoint package public API.
src/ai_company/engine/agent_engine.py Adds checkpoint plumbing + resume orchestration into the engine recovery flow.
src/ai_company/engine/init.py Re-exports checkpoint-related engine APIs.
docs/design/engine.md Updates engine design docs to reflect implemented checkpoint recovery.
README.md Mentions crash recovery (checkpoint resume) and updates architecture diagram.
CLAUDE.md Updates repo layout + logging event guidance to include checkpoint recovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 414 to +418
execution_result = await self._apply_recovery(
execution_result,
agent_id,
task_id,
completion_config=completion_config,
Comment on lines +688 to +689
approach as ``_execute``); timeout is not applied because
the original wall-clock deadline is no longer available.
) -> RecoveryResult:
"""Build a resumable ``RecoveryResult``."""
execution_id = context.execution_id
resume_attempt = self._resume_counts.get(execution_id, 1)
Comment on lines +275 to +279
await cleanup_after_resume(
self._checkpoint_repo,
self._heartbeat_repo,
context.execution_id,
)
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR implements §6.6 Strategy 2 (checkpoint recovery) — per-turn AgentContext persistence, heartbeat liveness tracking, and resume-from-checkpoint on crash with a max_resume_attempts fallback to fail-and-reassign. The persistence layer (SQLite repos, V6 migration), model definitions, callback factory, and loop integrations are all well-implemented. Four issues were found in the orchestration layer.

  • Double finalize() / cleanup on success path — when a resumed execution completes without error, both _finalize_resume and _post_execution_pipeline invoke _recovery_strategy.finalize() and cleanup_checkpoint_artifacts() for the same execution_id. Both built-in operations are idempotent, but calling a protocol hook twice violates the RecoveryStrategy.finalize() contract and could silently corrupt state in future or third-party strategy implementations.
  • max_resume_attempts > 1 is unreachable within a single run() call_apply_recovery is called at most once per _post_execution_pipeline invocation; if the resumed execution also fails, _finalize_resume returns the error result directly without re-entering recovery. The _resume_counts counter therefore cannot exceed 1 within a single call, making the guard against infinite resume loops effectively inactive for the documented multi-attempt use case.
  • make_loop_with_callback reconstructs loops by position, not by clone — the factory instantiates a fresh ReactLoop(checkpoint_callback=callback) rather than cloning the original, silently dropping any future constructor fields that are not explicitly forwarded.
  • Original crash result bypasses error classificationclassify_execution_errors runs on the final execution_result after _apply_recovery returns; if recovery leads to a successful resume, the original failure's error taxonomy is never recorded.

Confidence Score: 2/5

  • The PR introduces correctness issues in the recovery orchestration that affect the reliability of the feature and the correctness of the RecoveryStrategy protocol contract.
  • Persistence, models, migrations, and loop integrations are solid, but the orchestration layer in agent_engine.py and strategy.py has two logic bugs (double finalize, unreachable max_resume_attempts) and two behavioural gaps (loop-state loss on clone, missing classification of the original error). None of these cause data loss today, but the double-finalize silently violates the protocol contract and the max_resume_attempts guard is a no-op, which is the primary safety mechanism this feature is supposed to provide.
  • Pay close attention to src/ai_company/engine/agent_engine.py (_finalize_resume vs _post_execution_pipeline cleanup overlap) and src/ai_company/engine/checkpoint/strategy.py (max_resume_attempts counter reachability).

Important Files Changed

Filename Overview
src/ai_company/engine/agent_engine.py Core integration point for checkpoint recovery. Two logic issues found: (1) double invocation of _recovery_strategy.finalize() and cleanup_checkpoint_artifacts() on successful resume (both _finalize_resume and _post_execution_pipeline run cleanup); (2) the original crash result's error classification is silently skipped. Resume path threading of completion_config and effective_autonomy is correctly propagated.
src/ai_company/engine/checkpoint/strategy.py CheckpointRecoveryStrategy with async-safe resume counter. Logic issue: max_resume_attempts > 1 is unreachable within a single run() call — recover() can only be invoked once per pipeline invocation, so the counter never exceeds 1. The FIFO eviction, _MAX_TRACKED_EXECUTIONS guard, and fallback delegation are otherwise correct.
src/ai_company/engine/checkpoint/resume.py Standalone resume helpers (deserialize, loop construction, cleanup). Logic issue: make_loop_with_callback reconstructs loop instances rather than cloning them, silently dropping any future constructor fields. Cleanup helpers are well-structured and best-effort with correct error handling.
src/ai_company/engine/checkpoint/models.py Frozen Pydantic models for Checkpoint, Heartbeat, and CheckpointConfig. context_json validation, UTC-aware timestamps, and sensible defaults are all correct. No issues found.
src/ai_company/engine/checkpoint/callback_factory.py Checkpoint callback closure factory. Turn-0 skip, N-turn cadence, sequential heartbeat-after-checkpoint, and best-effort error suppression are all correctly implemented. PEP 758 except MemoryError, RecursionError: syntax used correctly throughout.
src/ai_company/engine/recovery.py Extended RecoveryResult with checkpoint_context_json, resume_attempt, and can_resume computed field. New finalize() protocol method and FailAndReassignStrategy no-op implementation are clean. Validator consistency check between checkpoint_context_json and resume_attempt is correct.
src/ai_company/persistence/sqlite/checkpoint_repo.py SQLite checkpoint repository. Dynamic WHERE clause built from hardcoded column names only (noted with # ruff: noqa: S608), parameterised values, upsert via INSERT OR REPLACE, and correct commit + error wrapping. No issues found.
src/ai_company/persistence/sqlite/heartbeat_repo.py SQLite heartbeat repository. UTC normalisation in save() and get_stale() ensures correct lexicographic ISO-8601 comparisons. Upsert, delete, and stale-query paths are all correct. No issues found.
src/ai_company/persistence/sqlite/migrations.py V6 migration adds checkpoints and heartbeats tables with appropriate indexes. The compound (execution_id, turn_number) index correctly supports ORDER BY turn_number DESC LIMIT 1. Ascending-index portability note for SQLite < 3.47 is accurate. No issues found.
src/ai_company/engine/react_loop.py Checkpoint callback injected after LLM response recording but before tool execution, with clear rationale in comments. PEP 758 exception syntax and best-effort error handling are correct. No issues found.
src/ai_company/engine/plan_execute_loop.py Callback injected at two turn-completion sites (planning and step execution). config property added for callback factory consumption. Consistent error handling pattern with ReactLoop. No issues found.
src/ai_company/persistence/repositories.py New CheckpointRepository and HeartbeatRepository protocols with runtime_checkable. Checkpoint/Heartbeat imports correctly placed under TYPE_CHECKING. get_latest filter requirement and get_stale threshold semantics are well-documented. No issues found.

Comments Outside Diff (4)

  1. src/ai_company/engine/agent_engine.py, line 854-870 (link)

    Double finalize() and cleanup on successful resume

    When the resumed execution completes without an error, _finalize_resume already calls both _recovery_strategy.finalize(execution_id) and cleanup_checkpoint_artifacts(...). Control then returns to _post_execution_pipeline, which runs the same two calls a second time (lines 463–472):

    if execution_result.termination_reason != TerminationReason.ERROR:
        exec_id = execution_result.context.execution_id
        if self._recovery_strategy is not None:
            await self._recovery_strategy.finalize(exec_id)   # already called
        await cleanup_checkpoint_artifacts(...)               # already called

    For the built-in CheckpointRecoveryStrategy, both operations are idempotent, so there's no observable failure today. However, the RecoveryStrategy.finalize() protocol is documented as a "post-resume cleanup hook" — calling it twice violates that contract and could silently corrupt state in any third-party or future strategy that tracks finalization (e.g. one that marks an execution as fully closed in an external system).

    The simplest fix is to remove the duplicate logic from _finalize_resume and let _post_execution_pipeline's existing block handle cleanup unconditionally, or guard one of the two call sites with a flag. For example, have _finalize_resume return a boolean indicating whether cleanup already ran:

    # _finalize_resume: skip inline cleanup, return result + did_cleanup flag
    # _post_execution_pipeline: skip the block if did_cleanup is True

    Alternatively, centralise all cleanup in _finalize_resume and remove the _post_execution_pipeline block entirely — the comment there already says "Normal completions … bypass recovery entirely, so cleanup runs here", but that comment is no longer accurate once the resume path is active.

  2. src/ai_company/engine/checkpoint/strategy.py, line 87-110 (link)

    max_resume_attempts > 1 is unreachable within a single run() call

    The _resume_counts counter is incremented by _reserve_resume_attempt each time recover() is called for a given execution_id. However, _apply_recovery is invoked at most once per _post_execution_pipeline call (which itself is called once per _executerun() invocation). If the resumed execution also fails, _finalize_resume returns the error result directly back through _apply_recovery without re-entering _post_execution_pipeline — so recover() is never called a second time for the same execution_id within the same run() call.

    For the counter to ever reach 2, the outer system would need to retry with the same execution_id, which doesn't happen in normal usage (each run() call creates a fresh AgentContext with a new UUID).

    The practical effect is that max_resume_attempts=2 (the default) and max_resume_attempts=1 give identical behaviour. The safeguard against infinite resume loops is therefore not active.

    The inline note acknowledges that persisting the counter is a planned improvement, but the current in-memory design also can't provide the stated "max N attempts per lifetime" guarantee for multi-attempt scenarios. Consider either:

    • Documenting that the current default of 2 is effectively 1 per run() call until counter persistence is implemented, or
    • Wiring a second recovery pass inside _finalize_resume (call _apply_recovery on the error result before returning it) so the counter can actually increment toward the limit within a single invocation.
  3. src/ai_company/engine/checkpoint/resume.py, line 97-120 (link)

    Resumed loop loses all state from the original ReactLoop/PlanExecuteLoop instance

    make_loop_with_callback creates a brand-new loop instance rather than cloning or wrapping the original:

    if isinstance(loop, ReactLoop):
        return ReactLoop(checkpoint_callback=callback)

    For ReactLoop this is currently safe — the only stored state is _checkpoint_callback. But for PlanExecuteLoop, config is preserved:

    return PlanExecuteLoop(config=loop.config, checkpoint_callback=callback)

    If either loop class ever gains additional constructor parameters (e.g. a tool_filter, a step_limit, a custom replanning_policy), this factory will silently drop those fields on the resumed execution, and the resume will run with different behaviour than the original run. The pattern of "copy-constructor minus one field" is fragile.

    A more robust approach would be to add a with_checkpoint_callback(cb) method to each loop that returns a copy of itself with only the callback replaced — making the forwarding explicit and compiler-checkable when new fields are added.

  4. src/ai_company/engine/agent_engine.py, line 793-820 (link)

    Error classification skipped for the resumed execution

    _finalize_resume records costs and applies post-execution transitions for the resumed result, but it does not call classify_execution_errors. The original _post_execution_pipeline runs classification on the final execution_result only after _apply_recovery returns, which means the result that gets classified is the resumed result — not the original crash result. That part looks correct.

    However, if the resumed execution itself terminates with ERROR, _finalize_resume returns the error result directly. At that point _post_execution_pipeline skips classification because the resumed-error result bypasses the if self._error_taxonomy_config is not None block (it was already executed earlier in the pipeline on the original error result... wait, actually, looking at the code order in _post_execution_pipeline:

    execution_result = await self._apply_recovery(...)  # may return resumed-error
    # Sync post-recovery status ...
    # Non-ERROR cleanup ...
    # Classification runs on execution_result here ← resumed-error IS classified

    Classification does run on the final execution_result in _post_execution_pipeline. So if the resumed execution produces a non-ERROR result, it is classified correctly. But if it produces an ERROR result, the error result from _finalize_resume eventually becomes execution_result in _post_execution_pipeline, and classification does run on it — so this appears fine.

    The actual concern is narrower: the original crash result (pre-resume) is never classified. If the original failure had a classifiable error taxonomy entry, that classification is silently lost. Depending on whether classification results are used for observability/dashboards, this could mean the original failure reason goes untracked.

    Consider calling classify_execution_errors on the original execution_result before entering the recovery block, or at minimum logging a note that the original error taxonomy was not recorded.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/ai_company/engine/agent_engine.py
Line: 854-870

Comment:
**Double `finalize()` and cleanup on successful resume**

When the resumed execution completes without an error, `_finalize_resume` already calls both `_recovery_strategy.finalize(execution_id)` and `cleanup_checkpoint_artifacts(...)`. Control then returns to `_post_execution_pipeline`, which runs the same two calls a second time (lines 463–472):

```python
if execution_result.termination_reason != TerminationReason.ERROR:
    exec_id = execution_result.context.execution_id
    if self._recovery_strategy is not None:
        await self._recovery_strategy.finalize(exec_id)   # already called
    await cleanup_checkpoint_artifacts(...)               # already called
```

For the built-in `CheckpointRecoveryStrategy`, both operations are idempotent, so there's no observable failure today. However, the `RecoveryStrategy.finalize()` protocol is documented as a "post-resume cleanup hook" — calling it twice violates that contract and could silently corrupt state in any third-party or future strategy that tracks finalization (e.g. one that marks an execution as fully closed in an external system).

The simplest fix is to remove the duplicate logic from `_finalize_resume` and let `_post_execution_pipeline`'s existing block handle cleanup unconditionally, or guard one of the two call sites with a flag. For example, have `_finalize_resume` return a boolean indicating whether cleanup already ran:

```python
# _finalize_resume: skip inline cleanup, return result + did_cleanup flag
# _post_execution_pipeline: skip the block if did_cleanup is True
```

Alternatively, centralise all cleanup in `_finalize_resume` and remove the `_post_execution_pipeline` block entirely — the comment there already says "Normal completions … bypass recovery entirely, so cleanup runs here", but that comment is no longer accurate once the resume path is active.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/ai_company/engine/checkpoint/strategy.py
Line: 87-110

Comment:
**`max_resume_attempts` > 1 is unreachable within a single `run()` call**

The `_resume_counts` counter is incremented by `_reserve_resume_attempt` each time `recover()` is called for a given `execution_id`. However, `_apply_recovery` is invoked at most once per `_post_execution_pipeline` call (which itself is called once per `_execute``run()` invocation). If the resumed execution also fails, `_finalize_resume` returns the error result directly back through `_apply_recovery` without re-entering `_post_execution_pipeline` — so `recover()` is never called a second time for the same `execution_id` within the same `run()` call.

For the counter to ever reach 2, the outer system would need to retry with the same `execution_id`, which doesn't happen in normal usage (each `run()` call creates a fresh `AgentContext` with a new UUID).

The practical effect is that `max_resume_attempts=2` (the default) and `max_resume_attempts=1` give identical behaviour. The safeguard against infinite resume loops is therefore not active.

The inline note acknowledges that persisting the counter is a planned improvement, but the current in-memory design also can't provide the stated "max N attempts per lifetime" guarantee for multi-attempt scenarios. Consider either:
- Documenting that the current default of `2` is effectively `1` per `run()` call until counter persistence is implemented, or
- Wiring a second recovery pass inside `_finalize_resume` (call `_apply_recovery` on the error result before returning it) so the counter can actually increment toward the limit within a single invocation.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/ai_company/engine/checkpoint/resume.py
Line: 97-120

Comment:
**Resumed loop loses all state from the original `ReactLoop`/`PlanExecuteLoop` instance**

`make_loop_with_callback` creates a brand-new loop instance rather than cloning or wrapping the original:

```python
if isinstance(loop, ReactLoop):
    return ReactLoop(checkpoint_callback=callback)
```

For `ReactLoop` this is currently safe — the only stored state is `_checkpoint_callback`. But for `PlanExecuteLoop`, config is preserved:

```python
return PlanExecuteLoop(config=loop.config, checkpoint_callback=callback)
```

If either loop class ever gains additional constructor parameters (e.g. a `tool_filter`, a `step_limit`, a custom `replanning_policy`), this factory will silently drop those fields on the resumed execution, and the resume will run with different behaviour than the original run. The pattern of "copy-constructor minus one field" is fragile.

A more robust approach would be to add a `with_checkpoint_callback(cb)` method to each loop that returns a copy of itself with only the callback replaced — making the forwarding explicit and compiler-checkable when new fields are added.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/ai_company/engine/agent_engine.py
Line: 793-820

Comment:
**Error classification skipped for the resumed execution**

`_finalize_resume` records costs and applies post-execution transitions for the resumed result, but it does not call `classify_execution_errors`. The original `_post_execution_pipeline` runs classification on the final `execution_result` only after `_apply_recovery` returns, which means the result that gets classified is the resumed result — not the original crash result. That part looks correct.

However, if the resumed execution itself terminates with `ERROR`, `_finalize_resume` returns the error result directly. At that point `_post_execution_pipeline` skips classification because the resumed-error result bypasses the `if self._error_taxonomy_config is not None` block (it was already executed earlier in the pipeline on the original error result... wait, actually, looking at the code order in `_post_execution_pipeline`:

```python
execution_result = await self._apply_recovery(...)  # may return resumed-error
# Sync post-recovery status ...
# Non-ERROR cleanup ...
# Classification runs on execution_result here ← resumed-error IS classified
```

Classification does run on the final `execution_result` in `_post_execution_pipeline`. So if the resumed execution produces a non-ERROR result, it is classified correctly. But if it produces an ERROR result, the error result from `_finalize_resume` eventually becomes `execution_result` in `_post_execution_pipeline`, and classification does run on it — so this appears fine.

The actual concern is narrower: the **original** crash result (pre-resume) is never classified. If the original failure had a classifiable error taxonomy entry, that classification is silently lost. Depending on whether classification results are used for observability/dashboards, this could mean the original failure reason goes untracked.

Consider calling `classify_execution_errors` on the original `execution_result` before entering the recovery block, or at minimum logging a note that the original error taxonomy was not recorded.

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 1352a8f

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a robust checkpoint recovery strategy. The implementation is comprehensive, covering data models, persistence layers with migrations, integration into the agent engine and execution loops, and extensive testing. The changes are well-structured into a new checkpoint module, promoting modularity. The use of a CheckpointRecoveryStrategy that falls back to the existing FailAndReassignStrategy is a solid design choice.

My review includes one medium-severity suggestion regarding dependency inversion in the AgentEngine to further improve the design by decoupling the engine from concrete strategy implementations.

Comment on lines +792 to +804
if result.termination_reason != TerminationReason.ERROR:
if isinstance(
self._recovery_strategy,
CheckpointRecoveryStrategy,
):
await self._recovery_strategy.clear_resume_count(
execution_id,
)
await cleanup_after_resume(
self._checkpoint_repo,
self._heartbeat_repo,
execution_id,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The AgentEngine is checking for a specific implementation of RecoveryStrategy (CheckpointRecoveryStrategy) and calling checkpoint-specific cleanup functions (clear_resume_count, cleanup_after_resume). This creates a tight coupling between the engine and a concrete strategy, violating the dependency inversion principle. The engine should ideally interact with recovery mechanisms only through the abstract RecoveryStrategy protocol.

A better approach would be to make the strategy responsible for its own cleanup. You could extend the RecoveryStrategy protocol with a finalize(execution_id: str) method.

  • CheckpointRecoveryStrategy would implement finalize to call both clear_resume_count and cleanup_after_resume.
  • Other strategies like FailAndReassignStrategy would have a no-op implementation.

This would simplify _finalize_resume to just if self._recovery_strategy: await self._recovery_strategy.finalize(execution_id), removing implementation-specific logic from the engine and improving maintainability.

Fix all 10 findings from CodeRabbit, Copilot, Gemini, and Greptile:
- Thread completion_config/effective_autonomy through fatal error path
- Extract _reconstruct_and_run_resume to keep functions under 50 lines
- Add CHECKPOINT_RECOVERY_DESERIALIZE_FAILED event for distinct logging
- Fix race condition: _reserve_resume_attempt returns tuple (flag, count)
- Add finalize() to RecoveryStrategy protocol (removes isinstance check)
- Rename cleanup_after_resume to cleanup_checkpoint_artifacts
- Add idx_cp_task_turn to migration test assertion
- Document cost recording and timeout policies

Add 31 new tests (94→125) covering resume.py (was 0%), strategy edge
cases (finalize, fallback cleanup, MemoryError/RecursionError propagation),
and RecoveryResult validator edge cases.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
src/ai_company/engine/agent_engine.py (1)

680-745: 🧹 Nitpick | 🔵 Trivial

Function still exceeds 50-line limit.

The _resume_from_checkpoint method spans 66 lines. While _reconstruct_and_run_resume was extracted as suggested in past reviews, the function still exceeds the guideline. Consider extracting the initial validation/guard clause (lines 702-710) into a separate helper.

As per coding guidelines: "Functions must be less than 50 lines."

♻️ Suggested extraction of validation guard
+    def _validate_checkpoint_json(
+        self,
+        recovery_result: RecoveryResult,
+        agent_id: str,
+        task_id: str,
+    ) -> str:
+        """Validate and return checkpoint_context_json or raise."""
+        if recovery_result.checkpoint_context_json is None:
+            logger.error(
+                EXECUTION_RESUME_FAILED,
+                agent_id=agent_id,
+                task_id=task_id,
+                error="checkpoint_context_json is None but can_resume was True",
+            )
+            msg = "checkpoint_context_json is None but can_resume was True"
+            raise RuntimeError(msg)
+        return recovery_result.checkpoint_context_json
+
     async def _resume_from_checkpoint(
         ...
     ) -> ExecutionResult:
-        if recovery_result.checkpoint_context_json is None:
-            logger.error(
-                EXECUTION_RESUME_FAILED,
-                agent_id=agent_id,
-                task_id=task_id,
-                error="checkpoint_context_json is None but can_resume was True",
-            )
-            msg = "checkpoint_context_json is None but can_resume was True"
-            raise RuntimeError(msg)
+        checkpoint_json = self._validate_checkpoint_json(
+            recovery_result, agent_id, task_id
+        )

         logger.info(...)

         try:
             result, execution_id = await self._reconstruct_and_run_resume(
-                recovery_result.checkpoint_context_json,
+                checkpoint_json,
                 ...
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 680 - 745, Extract the
initial validation/guard logic in _resume_from_checkpoint (the
checkpoint_context_json is None branch that logs EXECUTION_RESUME_FAILED,
constructs msg and raises RuntimeError) into a small helper function (e.g.,
_validate_resume_preconditions or _ensure_checkpoint_context) and call it at the
top of _resume_from_checkpoint; ensure the helper takes recovery_result,
agent_id and task_id, performs the same logger.error call and raises the same
RuntimeError when checkpoint_context_json is None, and update any references in
_resume_from_checkpoint to use that helper so the method body falls under the
50-line guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/engine/checkpoint/__init__.py`:
- Around line 14-28: The module exports cleanup_checkpoint_artifacts and
deserialize_and_reconcile from ai_company.engine.checkpoint.resume but omits
make_loop_with_callback; import make_loop_with_callback from
ai_company.engine.checkpoint.resume in
src/ai_company/engine/checkpoint/__init__.py and add "make_loop_with_callback"
to the __all__ list so it is part of the public API (refer to the resume.py
symbol make_loop_with_callback and the existing imports
cleanup_checkpoint_artifacts, deserialize_and_reconcile and the __all__ list).

In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 177-189: The current try/except around
self._checkpoint_repo.get_latest in the method swallows only PersistenceError
(re-raises MemoryError/RecursionError) so other errors (e.g., ValueError) will
propagate; update the handler to either (A) catch broad repository errors by
changing the except block to except Exception as e: (but continue to re-raise
MemoryError and RecursionError) and log via
logger.exception(CHECKPOINT_LOAD_FAILED, execution_id=execution_id,
task_id=task_id, error=e) then return None, or (B) if propagation is intended,
add a concise inline comment above the try explaining that only PersistenceError
is handled and all other exceptions should propagate; reference
_checkpoint_repo.get_latest, CHECKPOINT_LOAD_FAILED, and logger.exception when
making the change.

In `@src/ai_company/engine/recovery.py`:
- Around line 229-231: The finalize method in recovery.py currently contains
only a docstring which is a valid no-op but unclear; update the async def
finalize(self, execution_id: str) -> None: implementation to include an explicit
no-op statement (e.g., add a single pass or return None) after the docstring to
make the intent explicit and improve readability for future maintainers.

In `@tests/unit/engine/checkpoint/test_resume.py`:
- Around line 127-152: The tests currently rely on fragile substring matches of
Pydantic's error text when calling deserialize_and_reconcile; change them to
assert the exception type more robustly by either removing the match argument
and just using pytest.raises(ValueError) or (preferred) import
pydantic.ValidationError and replace pytest.raises(ValueError, match="validation
error") with pytest.raises(ValidationError) in the three tests
(test_invalid_json_raises_value_error, test_empty_string_raises,
test_valid_json_but_wrong_schema_raises) so they assert the concrete validation
exception instead of matching error text.

---

Duplicate comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 680-745: Extract the initial validation/guard logic in
_resume_from_checkpoint (the checkpoint_context_json is None branch that logs
EXECUTION_RESUME_FAILED, constructs msg and raises RuntimeError) into a small
helper function (e.g., _validate_resume_preconditions or
_ensure_checkpoint_context) and call it at the top of _resume_from_checkpoint;
ensure the helper takes recovery_result, agent_id and task_id, performs the same
logger.error call and raises the same RuntimeError when checkpoint_context_json
is None, and update any references in _resume_from_checkpoint to use that helper
so the method body falls under the 50-line guideline.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ef5542ed-39b0-4a1d-a8d9-afa4f3faada4

📥 Commits

Reviewing files that changed from the base of the PR and between 838670e and 0a63453.

📒 Files selected for processing (10)
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/observability/events/checkpoint.py
  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
  • GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax: except A, B: (no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001. Tests must use test-provider, test-small-001, etc.

Files:

  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/__init__.py
  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • src/ai_company/engine/recovery.py
  • tests/unit/engine/checkpoint/test_resume.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/observability/events/checkpoint.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
  • src/ai_company/engine/agent_engine.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction plus MappingProxyType wrapping for read-only enforcement.
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with BaseModel, model_validator, computed_field, ConfigDict. Use @computed_field for derived values instead of storing redundant fields (e.g., TokenUsage.total_tokens).
Use NotBlankStr from core.types for all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER use import logging, logging.getLogger(), or print() in application code.
Always use logger as the variable name for loggers (not _logger, not log).
Event names must always use constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import dire...

Files:

  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/agent_engine.py
src/ai_company/**/[!_]*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)

Files:

  • src/ai_company/engine/checkpoint/resume.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/observability/events/checkpoint.py
  • src/ai_company/engine/agent_engine.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed.
Test timeout: 30 seconds per test.
Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/engine/test_recovery_checkpoint_fields.py
  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/engine/checkpoint/test_strategy.py
  • tests/unit/persistence/sqlite/test_migrations_v6.py
🧠 Learnings (1)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/ai_company/observability/events/checkpoint.py
🧬 Code graph analysis (6)
src/ai_company/engine/checkpoint/resume.py (6)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (88-120)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • with_message (173-182)
src/ai_company/providers/enums.py (1)
  • MessageRole (6-12)
src/ai_company/engine/loop_protocol.py (1)
  • ExecutionLoop (151-189)
src/ai_company/persistence/repositories.py (2)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
src/ai_company/engine/checkpoint/__init__.py (4)
src/ai_company/engine/checkpoint/callback_factory.py (1)
  • make_checkpoint_callback (34-133)
src/ai_company/engine/checkpoint/models.py (3)
  • Checkpoint (25-65)
  • CheckpointConfig (88-120)
  • Heartbeat (68-85)
src/ai_company/engine/checkpoint/resume.py (2)
  • cleanup_checkpoint_artifacts (130-176)
  • deserialize_and_reconcile (39-87)
src/ai_company/engine/checkpoint/strategy.py (1)
  • CheckpointRecoveryStrategy (45-297)
src/ai_company/engine/recovery.py (2)
src/ai_company/engine/task_execution.py (1)
  • TaskExecution (60-231)
src/ai_company/engine/checkpoint/strategy.py (1)
  • finalize (146-148)
tests/unit/engine/checkpoint/test_resume.py (3)
src/ai_company/engine/checkpoint/resume.py (3)
  • cleanup_checkpoint_artifacts (130-176)
  • deserialize_and_reconcile (39-87)
  • make_loop_with_callback (90-127)
src/ai_company/providers/enums.py (1)
  • MessageRole (6-12)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • from_identity (140-171)
src/ai_company/engine/checkpoint/strategy.py (9)
src/ai_company/engine/checkpoint/models.py (1)
  • Checkpoint (25-65)
src/ai_company/engine/checkpoint/resume.py (1)
  • cleanup_checkpoint_artifacts (130-176)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • to_snapshot (280-302)
src/ai_company/engine/recovery.py (4)
  • FailAndReassignStrategy (161-234)
  • RecoveryResult (31-112)
  • recover (124-143)
  • recover (173-227)
src/ai_company/engine/task_execution.py (1)
  • TaskExecution (60-231)
src/ai_company/persistence/errors.py (1)
  • PersistenceError (8-9)
src/ai_company/persistence/repositories.py (8)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
  • get_latest (498-519)
  • get (59-71)
  • get (218-230)
  • get (346-358)
  • get (426-438)
  • get (551-563)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
  • get_latest (68-132)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
tests/unit/engine/checkpoint/test_strategy.py (3)
src/ai_company/engine/checkpoint/strategy.py (5)
  • CheckpointRecoveryStrategy (45-297)
  • get_strategy_type (150-152)
  • recover (83-144)
  • clear_resume_count (154-167)
  • finalize (146-148)
src/ai_company/engine/recovery.py (9)
  • FailAndReassignStrategy (161-234)
  • RecoveryResult (31-112)
  • RecoveryStrategy (116-158)
  • get_strategy_type (156-158)
  • get_strategy_type (232-234)
  • recover (124-143)
  • recover (173-227)
  • finalize (145-154)
  • finalize (229-230)
src/ai_company/persistence/errors.py (1)
  • QueryError (33-34)
🔇 Additional comments (29)
src/ai_company/engine/checkpoint/resume.py (4)

1-36: LGTM!

The module is well-structured with proper logger setup, type hints, and docstrings. The event constants are correctly imported from the domain-specific module.


39-87: LGTM — past review comment addressed.

The function now correctly uses distinct event constants: CHECKPOINT_RECOVERY_DESERIALIZE_FAILED for the error path (line 65) and CHECKPOINT_RECOVERY_RECONCILIATION for the success path (line 82). This enables proper log analysis to distinguish failures from successful reconciliations.


130-176: LGTM!

The cleanup function follows best-effort semantics correctly: errors are logged at WARNING level with context but never propagated. The function handles None repositories gracefully and uses appropriate event constants for each cleanup outcome.


115-116: No action needed—ReactLoop constructor has only checkpoint_callback parameter.

ReactLoop's __init__ method accepts only checkpoint_callback: CheckpointCallback | None = None. Unlike PlanExecuteLoop, ReactLoop has no config parameter, so there are no other constructor parameters to preserve. The code correctly instantiates ReactLoop with its sole parameter.

			> Likely an incorrect or invalid review comment.
src/ai_company/observability/events/checkpoint.py (1)

1-32: LGTM!

The event constants are well-organized into logical groups with consistent naming patterns following the domain.action convention. The CHECKPOINT_RECOVERY_DESERIALIZE_FAILED constant (lines 30-32) properly supports distinguishing deserialization failures from successful reconciliations in log analysis.

tests/unit/persistence/sqlite/test_migrations_v6.py (2)

91-106: LGTM — past review comment addressed.

The expected index set now correctly includes idx_cp_task_turn (line 104), ensuring comprehensive regression protection for all checkpoint indexes created by the V6 migration.


1-136: LGTM!

Comprehensive test coverage for the V6 migration including schema version validation, table creation, column definitions, index verification, and idempotency. Test markers and timeout are properly configured.

tests/unit/engine/checkpoint/test_resume.py (1)

1-319: LGTM!

Comprehensive test suite covering happy paths, error handling, and edge cases for all three resume helper functions. The tests properly verify:

  • Context deserialization and reconciliation message injection
  • Loop callback injection for supported and unsupported loop types
  • Best-effort cleanup behavior with error swallowing
tests/unit/engine/test_recovery_checkpoint_fields.py (1)

1-282: LGTM!

Thorough test coverage for the new RecoveryResult checkpoint-related fields:

  • can_resume computation based on checkpoint_context_json presence
  • Consistency validation between checkpoint_context_json and resume_attempt
  • JSON structure validation (must be object, not array or primitive)
  • Backward compatibility with FailAndReassignStrategy
  • finalize() idempotency
tests/unit/engine/checkpoint/test_strategy.py (1)

1-601: LGTM!

Excellent test coverage for CheckpointRecoveryStrategy including:

  • Protocol compliance and strategy type verification
  • Resume-from-checkpoint flows with valid checkpoints
  • Fallback behavior (no checkpoint, max attempts exhausted, repo errors)
  • Resume counter semantics (per-execution tracking, clearing, independence)
  • finalize() behavior and idempotency
  • Cleanup invocation on fallback paths
  • Exception propagation for MemoryError and RecursionError

The tests are well-organized into focused classes with clear docstrings.

src/ai_company/engine/recovery.py (3)

62-92: LGTM!

The new checkpoint-related fields and validation logic are well-implemented:

  • checkpoint_context_json and resume_attempt fields with appropriate defaults
  • _validate_checkpoint_consistency properly enforces that both fields must be set together or both at defaults
  • JSON validation ensures the checkpoint context is a valid JSON object

106-112: LGTM!

The can_resume computed field correctly derives its value from the presence of checkpoint_context_json, providing a clean API for callers to check resume eligibility.


145-154: LGTM!

The finalize() method is a well-designed addition to the protocol, enabling post-resume cleanup hooks. The docstring clearly explains the semantics (called after non-ERROR termination).

src/ai_company/engine/checkpoint/strategy.py (7)

1-42: LGTM! Module setup follows guidelines.

Logger is correctly initialized with get_logger(__name__), imports are well-organized, and the safety bound constant is appropriately documented.


45-81: LGTM! Class initialization is well-structured.

Good use of asyncio.Lock for thread-safety of the resume counter, and the NOTE comment clearly documents the known limitation about counter persistence across process restarts.


83-144: LGTM! The recover method addresses the previously identified race condition.

The method now correctly captures resume_attempt from _reserve_resume_attempt and passes it to _build_resume_result, ensuring the metadata uses the locked value rather than reading _resume_counts outside the lock.


146-167: LGTM! Finalization methods are correctly implemented.

Proper use of the async lock when clearing resume counts, and the methods are well-documented.


207-243: LGTM! Race condition fix and documented eviction strategy.

The method now returns (should_fallback, resume_count) tuple, resolving the previously identified race condition. The FIFO eviction limitation is appropriately documented in the inline comment (lines 235-238), explaining why it's acceptable given the expected cardinality of crashed executions.


245-274: LGTM! Clean implementation of resume result builder.

Correctly accepts resume_attempt as a parameter rather than reading from _resume_counts, ensuring consistency with the locked value from _reserve_resume_attempt.


276-297: LGTM! Proper cleanup before fallback delegation.

The method correctly cleans up checkpoint artifacts before delegating to the fallback strategy, preventing orphaned database rows when the resume path won't be taken.

src/ai_company/engine/agent_engine.py (9)

18-23: LGTM! Checkpoint-related imports are well-organized.

The new imports for checkpoint models and resume helpers are properly grouped and follow the existing import structure.


172-180: LGTM! Good defensive validation for repository configuration.

The XOR check ensures checkpoint_repo and heartbeat_repo are either both provided or both omitted, preventing inconsistent configuration states.


395-399: Helpful comment explaining cost recording timing.

The comment clearly documents the intentional design decision to record costs before recovery, preventing confusion about why resumed executions have separate cost tracking.


500-513: LGTM! Clean delegation to the resume module.

Simple and focused helper that wraps the loop with checkpoint callback configuration.


652-661: LGTM! Checkpoint resume path integration.

The checkpoint resume logic is cleanly integrated into the recovery flow, properly threading completion_config and effective_autonomy through to _resume_from_checkpoint.


747-776: LGTM! Good extraction of core resume logic.

This helper cleanly encapsulates the deserialize-and-execute flow, returning the execution_id needed for finalization.


778-813: LGTM! Resumed loop execution with proper budget handling.

Correctly handles the optional task_execution case and constructs the appropriate budget checker.


815-836: LGTM! Proper conditional cleanup after resume.

The conditional cleanup (only on non-ERROR termination) correctly preserves checkpoint data for potential subsequent resume attempts when the resumed execution itself fails.


286-293: LGTM! Consistent PEP 758 except syntax.

The except MemoryError, RecursionError: syntax (without parentheses) is correctly used throughout for non-recoverable errors.

- Add content None check in test_resume.py reconciliation assertion
- Add type: ignore[comparison-overlap] for CustomLoop identity check
- Add finalize() to CustomRecovery in test_agent_engine_errors.py
Copilot AI review requested due to automatic review settings March 14, 2026 09:01
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview March 14, 2026 09:02 — with GitHub Actions Inactive
Set parent_task_id=None in TaskFactory to prevent polyfactory from
randomly generating matching id and parent_task_id values, which
triggers the "cannot be its own parent" validator.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/unit/engine/checkpoint/test_resume.py (1)

128-153: ⚠️ Potential issue | 🟡 Minor

Make invalid-payload tests less brittle and parameterized.

Line 129, Line 138, and Line 147 depend on a fragile validation-message substring and duplicate the same structure across three tests.

♻️ Proposed refactor
-    def test_invalid_json_raises_value_error(self) -> None:
-        with pytest.raises(ValueError, match="validation error"):
-            deserialize_and_reconcile(
-                checkpoint_json="{not valid json}",
-                error_message="crash",
-                agent_id="agent-1",
-                task_id="task-1",
-            )
-
-    def test_empty_string_raises(self) -> None:
-        with pytest.raises(ValueError, match="validation error"):
-            deserialize_and_reconcile(
-                checkpoint_json="",
-                error_message="crash",
-                agent_id="agent-1",
-                task_id="task-1",
-            )
-
-    def test_valid_json_but_wrong_schema_raises(self) -> None:
-        with pytest.raises(ValueError, match="validation error"):
-            deserialize_and_reconcile(
-                checkpoint_json='{"not": "an AgentContext"}',
-                error_message="crash",
-                agent_id="agent-1",
-                task_id="task-1",
-            )
+    `@pytest.mark.parametrize`(
+        "checkpoint_json",
+        ["{not valid json}", "", '{"not": "an AgentContext"}'],
+        ids=["invalid_json", "empty_string", "wrong_schema"],
+    )
+    def test_invalid_payload_raises_value_error(self, checkpoint_json: str) -> None:
+        with pytest.raises(ValueError):
+            deserialize_and_reconcile(
+                checkpoint_json=checkpoint_json,
+                error_message="crash",
+                agent_id="agent-1",
+                task_id="task-1",
+            )
#!/bin/bash
# Verify fragile match usage and inspect deserialize_and_reconcile error behavior.
rg -n 'pytest\.raises\(ValueError,\s*match="validation error"\)' tests/unit/engine/checkpoint/test_resume.py
rg -n -C4 'def deserialize_and_reconcile|model_validate_json|raise ValueError|except ValueError' src/ai_company/engine/checkpoint/resume.py

As per coding guidelines, Prefer @pytest.mark.parametrize for testing similar cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/engine/checkpoint/test_resume.py` around lines 128 - 153, Tests
are brittle and duplicated because they assert a specific substring "validation
error" in the ValueError; replace the three tests with a single parametrized
test using pytest.mark.parametrize over the invalid checkpoint_json values and
assert only that deserialize_and_reconcile raises ValueError (or use a more
robust regex like r"validation" if a message check is required). Locate the
tests around test_invalid_json_raises_value_error / test_empty_string_raises /
test_valid_json_but_wrong_schema_raises and change them to a single parametrized
test that calls deserialize_and_reconcile(checkpoint_json=...,
error_message="crash", agent_id="agent-1", task_id="task-1") inside a with
pytest.raises(ValueError) block (or with pytest.raises(ValueError,
match=r"validation") for a looser match).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/unit/engine/checkpoint/test_resume.py`:
- Around line 128-153: Tests are brittle and duplicated because they assert a
specific substring "validation error" in the ValueError; replace the three tests
with a single parametrized test using pytest.mark.parametrize over the invalid
checkpoint_json values and assert only that deserialize_and_reconcile raises
ValueError (or use a more robust regex like r"validation" if a message check is
required). Locate the tests around test_invalid_json_raises_value_error /
test_empty_string_raises / test_valid_json_but_wrong_schema_raises and change
them to a single parametrized test that calls
deserialize_and_reconcile(checkpoint_json=..., error_message="crash",
agent_id="agent-1", task_id="task-1") inside a with pytest.raises(ValueError)
block (or with pytest.raises(ValueError, match=r"validation") for a looser
match).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: ee499e95-a5d7-43c1-8ed0-5795d5dad175

📥 Commits

Reviewing files that changed from the base of the PR and between 0a63453 and 9c5b5ae.

📒 Files selected for processing (2)
  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/engine/test_agent_engine_errors.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Agent
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax: except A, B: (no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001. Tests must use test-provider, test-small-001, etc.

Files:

  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/engine/test_agent_engine_errors.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed.
Test timeout: 30 seconds per test.
Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/engine/test_agent_engine_errors.py
🧬 Code graph analysis (2)
tests/unit/engine/checkpoint/test_resume.py (7)
src/ai_company/engine/checkpoint/models.py (1)
  • CheckpointConfig (88-120)
src/ai_company/security/timeout/park_service.py (1)
  • resume (108-144)
src/ai_company/engine/checkpoint/resume.py (3)
  • cleanup_checkpoint_artifacts (130-176)
  • deserialize_and_reconcile (39-87)
  • make_loop_with_callback (90-127)
src/ai_company/providers/enums.py (1)
  • MessageRole (6-12)
src/ai_company/core/agent.py (1)
  • AgentIdentity (266-342)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • from_identity (140-171)
src/ai_company/engine/parallel_models.py (2)
  • agent_id (79-81)
  • task_id (87-89)
tests/unit/engine/test_agent_engine_errors.py (1)
src/ai_company/engine/recovery.py (2)
  • finalize (145-154)
  • finalize (229-230)
🔇 Additional comments (4)
tests/unit/engine/test_agent_engine_errors.py (1)

556-558: Good recovery-protocol alignment in the test double.

Adding finalize() keeps CustomRecovery compatible with the expanded recovery lifecycle interface.

tests/unit/engine/checkpoint/test_resume.py (3)

58-122: Strong happy-path coverage for deserialization and reconciliation behavior.

These tests validate type restoration, reconciliation injection, and turn-count preservation clearly.


161-254: Loop callback injection tests are well-scoped.

Good coverage for no-op behavior, supported loop injection, config preservation, and unsupported loop passthrough.


261-320: Cleanup behavior tests are solid and practical.

Happy-path plus error-swallowing scenarios are covered and align with best-effort cleanup semantics.

@Aureliolo Aureliolo temporarily deployed to cloudflare-preview March 14, 2026 09:05 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements checkpoint-based crash recovery (“Strategy 2”) across the engine and persistence layers, enabling per-turn AgentContext checkpointing, heartbeat tracking, and resuming from the latest checkpoint with a reconciliation message.

Changes:

  • Added checkpoint/heartbeat domain models, callback factory, resume helpers, and CheckpointRecoveryStrategy with max-resume attempts + fallback.
  • Extended persistence protocols and SQLite backend/migrations to store checkpoints + heartbeats (schema v6) and added repository implementations.
  • Integrated optional checkpoint callback injection into execution loops and added new observability event constants + docs/tests.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/unit/persistence/test_protocol.py Adds fake protocol implementations for checkpoint/heartbeat repos.
tests/unit/persistence/test_migrations_v2.py Updates schema version assertion to use SCHEMA_VERSION.
tests/unit/persistence/sqlite/test_migrations_v6.py New tests validating v6 tables/indexes + idempotency.
tests/unit/persistence/sqlite/test_heartbeat_repo.py New unit tests for SQLiteHeartbeatRepository.
tests/unit/persistence/sqlite/test_checkpoint_repo.py New unit tests for SQLiteCheckpointRepository.
tests/unit/observability/test_events.py Registers/checks new checkpoint events module and constants.
tests/unit/engine/test_recovery_checkpoint_fields.py Tests new RecoveryResult resume fields/validation.
tests/unit/engine/test_agent_engine_errors.py Updates recovery-strategy test double to implement finalize().
tests/unit/engine/checkpoint/test_strategy.py New tests for CheckpointRecoveryStrategy behavior/counters/cleanup.
tests/unit/engine/checkpoint/test_resume.py New tests for resume helpers (deserialize/reconcile/loop injection/cleanup).
tests/unit/engine/checkpoint/test_models.py New tests for checkpoint/heartbeat/config Pydantic models.
tests/unit/engine/checkpoint/test_callback_factory.py New tests for make_checkpoint_callback() persistence/skip/error handling.
tests/unit/engine/checkpoint/init.py Establishes checkpoint test package.
tests/unit/api/conftest.py Adds in-memory fake checkpoint/heartbeat repos to fake persistence backend.
src/ai_company/persistence/sqlite/migrations.py Bumps schema to v6; adds checkpoints/heartbeats tables + indexes.
src/ai_company/persistence/sqlite/heartbeat_repo.py Implements SQLite heartbeat persistence (upsert/query stale/delete).
src/ai_company/persistence/sqlite/checkpoint_repo.py Implements SQLite checkpoint persistence (save/get_latest/delete_by_execution).
src/ai_company/persistence/sqlite/backend.py Wires checkpoint/heartbeat repositories into SQLite backend.
src/ai_company/persistence/sqlite/init.py Exports new SQLite checkpoint/heartbeat repositories.
src/ai_company/persistence/repositories.py Adds CheckpointRepository and HeartbeatRepository protocols.
src/ai_company/persistence/protocol.py Extends PersistenceBackend protocol with checkpoints/heartbeats properties.
src/ai_company/observability/events/persistence.py Adds persistence event constants for checkpoint/heartbeat repos.
src/ai_company/observability/events/execution.py Adds execution-level checkpoint callback + resume event constants.
src/ai_company/observability/events/checkpoint.py New checkpoint domain event constants module.
src/ai_company/engine/recovery.py Extends RecoveryResult for resume; adds finalize() to strategy protocol.
src/ai_company/engine/react_loop.py Injects optional checkpoint callback after turn completion (pre-tools).
src/ai_company/engine/plan_execute_loop.py Adds optional checkpoint callback support + invocation helper.
src/ai_company/engine/checkpoint/strategy.py New CheckpointRecoveryStrategy with resume counters + fallback behavior.
src/ai_company/engine/checkpoint/resume.py New helpers: deserialize+reconcile, loop injection, best-effort cleanup.
src/ai_company/engine/checkpoint/models.py New frozen Pydantic models: Checkpoint, Heartbeat, CheckpointConfig.
src/ai_company/engine/checkpoint/callback_factory.py New factory producing checkpoint+heartbeat persistence callback closure.
src/ai_company/engine/checkpoint/callback.py Adds CheckpointCallback type alias.
src/ai_company/engine/checkpoint/init.py Exposes checkpoint subpackage public API.
src/ai_company/engine/agent_engine.py Integrates resume flow + loop callback injection; threads configs through recovery.
src/ai_company/engine/init.py Exports checkpoint-related symbols from engine package.
docs/design/engine.md Updates engine crash recovery docs to reflect implemented checkpoint strategy.
README.md Mentions crash recovery + adds Persistence→Engine relationship in architecture diagram.
CLAUDE.md Updates repo/module overview and event-constant guidance with checkpoint domain.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +176 to +189
"""Load the latest checkpoint, returning ``None`` on failure."""
try:
checkpoint = await self._checkpoint_repo.get_latest(
execution_id=execution_id,
)
except MemoryError, RecursionError:
raise
except PersistenceError:
logger.exception(
CHECKPOINT_LOAD_FAILED,
execution_id=execution_id,
task_id=task_id,
)
return None
Comment on lines +397 to +399
# led to the error) should be tracked. The resumed execution
# will have its own cost recording when it completes through
# the normal pipeline.
Comment on lines +775 to +786
"""Invoke the checkpoint callback if configured.

Errors are logged but never propagated — checkpointing must
not interrupt execution.
"""
if self._checkpoint_callback is None:
return
try:
await self._checkpoint_callback(ctx)
except MemoryError, RecursionError:
raise
except Exception as exc:
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 14, 2026

Codecov Report

❌ Patch coverage is 89.03846% with 57 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.78%. Comparing base (494013f) to head (1352a8f).
⚠️ Report is 2 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/ai_company/engine/agent_engine.py 36.66% 36 Missing and 2 partials ⚠️
src/ai_company/engine/plan_execute_loop.py 56.25% 6 Missing and 1 partial ⚠️
src/ai_company/engine/react_loop.py 22.22% 6 Missing and 1 partial ⚠️
src/ai_company/engine/checkpoint/strategy.py 96.05% 2 Missing and 1 partial ⚠️
src/ai_company/persistence/sqlite/backend.py 85.71% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #367      +/-   ##
==========================================
- Coverage   93.90%   93.78%   -0.13%     
==========================================
  Files         447      456       +9     
  Lines       20819    21330     +511     
  Branches     2011     2046      +35     
==========================================
+ Hits        19551    20005     +454     
- Misses        981     1033      +52     
- Partials      287      292       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…erage

Critical: _finalize_resume now calls record_execution_costs and
apply_post_execution_transitions on the resumed ExecutionResult,
fixing the bug where resumed executions silently skipped cost
tracking and task state transitions. Identity threaded through
_apply_recovery → _resume_from_checkpoint → _finalize_resume.

- Remove dead EXECUTION_CHECKPOINT_CALLBACK constant (never imported)
- Add 9 repo error path tests (QueryError wrapping for save/get/delete)
- Consolidate 3 brittle deserialize tests into 1 parametrized test
- Export make_loop_with_callback from checkpoint __init__.py
- Extract _validate_checkpoint_json to keep function under 50 lines
- Add clarifying docstring for PersistenceError-only handling
- Fix FailAndReassignStrategy.finalize no-op clarity
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/ai_company/engine/agent_engine.py (3)

412-426: ⚠️ Potential issue | 🔴 Critical

Do not route wall-clock timeouts into checkpoint resume.

Line 567-571 turns an expired timeout_seconds into TerminationReason.ERROR, and this block then feeds it into recovery. _resume_from_checkpoint() explicitly drops the deadline, so a run that already timed out can continue well past the caller's wall-clock limit. Please distinguish deadline expiry from ordinary execution errors and short-circuit checkpoint recovery for that case.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 412 - 426, The code
currently treats any TerminationReason.ERROR the same and sends the run into
_apply_recovery (which may call _resume_from_checkpoint and drop the wall-clock
deadline); change the logic so that if the failure is a wall-clock timeout (the
expiration of timeout_seconds / a deadline expiry recorded on execution_result)
you short-circuit and do NOT call _apply_recovery or resume from
checkpoint—return or propagate the timeout execution_result instead; otherwise
continue to call _apply_recovery as before. Inspect execution_result for the
timeout/deadline marker (e.g., a termination_reason value indicating timeout or
a deadline_expired/timed_out flag on execution_result/context) and use that to
decide.

655-680: ⚠️ Potential issue | 🟠 Major

Resume-path exceptions should not fall back to the stale pre-recovery result.

If deserialize_and_reconcile() or the resumed loop raises, this handler logs the failure and returns the original execution_result. That result still carries the old task_execution, so a bad checkpoint or resume failure can leave the task in its stale pre-recovery state instead of failing/reassigning it. The malformed-checkpoint cases already covered in tests/unit/engine/checkpoint/test_resume.py make this reachable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 655 - 680, The current
except Exception handler in agent_engine.py logs recovery/resume failures and
then returns the original stale execution_result (which still contains the
pre-recovery task_execution), allowing a bad checkpoint/resume to leave the task
in a stale state; instead, after logging the error you should not return the old
execution_result—re-raise the exception (or raise a new contextual exception) so
the caller can mark the task failed/reassign it. Concretely: in the except
Exception as exc block that references EXECUTION_RECOVERY_FAILED, agent_id,
task_id, change the final line from "return execution_result" to "raise" (or
"raise RuntimeError(...) from exc" including agent_id/task_id), keeping the
existing logger.exception call and the separate re-raises for
MemoryError/RecursionError; this ensures failures from
deserialize_and_reconcile, _resume_from_checkpoint, or any recovery logic do not
silently fall back to a stale task_execution.

427-448: ⚠️ Potential issue | 🟠 Major

Skip the second TaskEngine sync after a successful resume.

_finalize_resume() already calls apply_post_execution_transitions() on Line 847-852, and src/ai_company/engine/task_sync.py shows that helper performs the TaskEngine sync itself. When recovery returns a non-ERROR result, this block issues another sync_to_task_engine() for the same state change.

💡 Minimal guard
-            if (
-                ctx.task_execution is not None
-                and pre_recovery_status is not None
-                and ctx.task_execution.status != pre_recovery_status
-            ):
+            if (
+                execution_result.termination_reason == TerminationReason.ERROR
+                and ctx.task_execution is not None
+                and pre_recovery_status is not None
+                and ctx.task_execution.status != pre_recovery_status
+            ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/agent_engine.py` around lines 427 - 448, This block is
causing a duplicate TaskEngine sync after a successful resume because
_finalize_resume() already calls apply_post_execution_transitions() which
performs the sync; guard the second sync so it only runs when recovery actually
failed (e.g., check execution_result.recovery_result == RecoveryResult.ERROR or
equivalent) before calling sync_to_task_engine with ctx.task_execution;
reference the variables execution_result, ctx.task_execution,
pre_recovery_status, sync_to_task_engine, _finalize_resume(), and
apply_post_execution_transitions() to find and update the conditional that wraps
the sync_to_task_engine call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 412-426: The code currently treats any TerminationReason.ERROR the
same and sends the run into _apply_recovery (which may call
_resume_from_checkpoint and drop the wall-clock deadline); change the logic so
that if the failure is a wall-clock timeout (the expiration of timeout_seconds /
a deadline expiry recorded on execution_result) you short-circuit and do NOT
call _apply_recovery or resume from checkpoint—return or propagate the timeout
execution_result instead; otherwise continue to call _apply_recovery as before.
Inspect execution_result for the timeout/deadline marker (e.g., a
termination_reason value indicating timeout or a deadline_expired/timed_out flag
on execution_result/context) and use that to decide.
- Around line 655-680: The current except Exception handler in agent_engine.py
logs recovery/resume failures and then returns the original stale
execution_result (which still contains the pre-recovery task_execution),
allowing a bad checkpoint/resume to leave the task in a stale state; instead,
after logging the error you should not return the old execution_result—re-raise
the exception (or raise a new contextual exception) so the caller can mark the
task failed/reassign it. Concretely: in the except Exception as exc block that
references EXECUTION_RECOVERY_FAILED, agent_id, task_id, change the final line
from "return execution_result" to "raise" (or "raise RuntimeError(...) from exc"
including agent_id/task_id), keeping the existing logger.exception call and the
separate re-raises for MemoryError/RecursionError; this ensures failures from
deserialize_and_reconcile, _resume_from_checkpoint, or any recovery logic do not
silently fall back to a stale task_execution.
- Around line 427-448: This block is causing a duplicate TaskEngine sync after a
successful resume because _finalize_resume() already calls
apply_post_execution_transitions() which performs the sync; guard the second
sync so it only runs when recovery actually failed (e.g., check
execution_result.recovery_result == RecoveryResult.ERROR or equivalent) before
calling sync_to_task_engine with ctx.task_execution; reference the variables
execution_result, ctx.task_execution, pre_recovery_status, sync_to_task_engine,
_finalize_resume(), and apply_post_execution_transitions() to find and update
the conditional that wraps the sync_to_task_engine call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f926dbc2-15da-49bf-b7fe-d06923a5dd25

📥 Commits

Reviewing files that changed from the base of the PR and between 69c57a8 and ae1abdb.

📒 Files selected for processing (8)
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/observability/events/execution.py
  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build Backend
  • GitHub Check: Greptile Review
  • GitHub Check: Test (Python 3.14)
  • GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax: except A, B: (no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001. Tests must use test-provider, test-small-001, etc.

Files:

  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/recovery.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/agent_engine.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Tests must use markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: use asyncio_mode = "auto" — no manual @pytest.mark.asyncio needed.
Test timeout: 30 seconds per test.
Prefer @pytest.mark.parametrize for testing similar cases.

Files:

  • tests/unit/engine/checkpoint/test_resume.py
  • tests/unit/persistence/sqlite/test_checkpoint_repo.py
  • tests/unit/persistence/sqlite/test_heartbeat_repo.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction plus MappingProxyType wrapping for read-only enforcement.
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and use copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 with BaseModel, model_validator, computed_field, ConfigDict. Use @computed_field for derived values instead of storing redundant fields (e.g., TokenUsage.total_tokens).
Use NotBlankStr from core.types for all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER use import logging, logging.getLogger(), or print() in application code.
Always use logger as the variable name for loggers (not _logger, not log).
Event names must always use constants from domain-specific modules under ai_company.observability.events (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget). Import dire...

Files:

  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/checkpoint/__init__.py
  • src/ai_company/engine/agent_engine.py
src/ai_company/**/[!_]*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)

Files:

  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/recovery.py
  • src/ai_company/engine/checkpoint/strategy.py
  • src/ai_company/engine/agent_engine.py
🧠 Learnings (2)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`

Applied to files:

  • src/ai_company/observability/events/execution.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow exceptions.

Applied to files:

  • src/ai_company/engine/checkpoint/strategy.py
🧬 Code graph analysis (2)
tests/unit/engine/checkpoint/test_resume.py (1)
src/ai_company/engine/checkpoint/resume.py (3)
  • cleanup_checkpoint_artifacts (130-176)
  • deserialize_and_reconcile (39-87)
  • make_loop_with_callback (90-127)
src/ai_company/engine/agent_engine.py (5)
src/ai_company/engine/checkpoint/resume.py (3)
  • cleanup_checkpoint_artifacts (130-176)
  • deserialize_and_reconcile (39-87)
  • make_loop_with_callback (90-127)
src/ai_company/persistence/repositories.py (2)
  • CheckpointRepository (484-533)
  • HeartbeatRepository (537-595)
src/ai_company/engine/loop_protocol.py (4)
  • ExecutionLoop (151-189)
  • ExecutionResult (79-140)
  • execute (159-185)
  • TerminationReason (28-36)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/engine/task_sync.py (1)
  • apply_post_execution_transitions (171-227)
🔇 Additional comments (1)
src/ai_company/engine/recovery.py (1)

72-112: Nice invariant on resume metadata.

The post-validation plus can_resume computed field makes the checkpoint-resume path much harder to misuse downstream.

Normal completions (COMPLETED, MAX_TURNS, BUDGET_EXHAUSTED, SHUTDOWN,
PARKED) bypassed the recovery block entirely, leaving checkpoint and
heartbeat rows orphaned in SQLite indefinitely. Now _post_execution_pipeline
calls cleanup_checkpoint_artifacts and recovery_strategy.finalize on
all non-ERROR terminations.
Copilot AI review requested due to automatic review settings March 14, 2026 09:32
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview March 14, 2026 09:33 — with GitHub Actions Inactive
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements the “Checkpoint Recovery” crash-recovery strategy across the engine and persistence layers, enabling per-turn AgentContext checkpointing with heartbeat-based failure detection and resume-from-checkpoint execution flow.

Changes:

  • Added checkpoint/heartbeat Pydantic models, callback factory + loop injection, and a CheckpointRecoveryStrategy with resume attempt tracking and fallback.
  • Extended persistence protocols + SQLite backend with V6 migrations and new checkpoint/heartbeat repositories.
  • Wired resume flow into AgentEngine, updated execution loops to invoke checkpoint callbacks, and expanded observability event constants + unit tests.

Reviewed changes

Copilot reviewed 38 out of 39 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/persistence/test_protocol.py Extends protocol conformance tests for new checkpoint/heartbeat repositories.
tests/unit/persistence/test_migrations_v2.py Updates migration assertions to use SCHEMA_VERSION.
tests/unit/persistence/sqlite/test_migrations_v6.py Adds coverage for V6 schema migration (tables + indexes + idempotency).
tests/unit/persistence/sqlite/test_heartbeat_repo.py Adds SQLiteHeartbeatRepository unit tests.
tests/unit/persistence/sqlite/test_checkpoint_repo.py Adds SQLiteCheckpointRepository unit tests.
tests/unit/observability/test_events.py Ensures checkpoint events module is discovered and constants exist.
tests/unit/engine/test_recovery_checkpoint_fields.py Validates new RecoveryResult checkpoint-related fields and invariants.
tests/unit/engine/test_agent_engine_errors.py Updates test strategy stub to satisfy new finalize() protocol method.
tests/unit/engine/checkpoint/test_strategy.py Adds unit tests for CheckpointRecoveryStrategy behavior and counter logic.
tests/unit/engine/checkpoint/test_resume.py Tests resume helpers (deserialize/reconcile, callback injection, cleanup).
tests/unit/engine/checkpoint/test_models.py Adds tests for checkpoint/heartbeat/config models and validation.
tests/unit/engine/checkpoint/test_callback_factory.py Adds tests for callback persistence frequency, heartbeat updates, error handling.
tests/unit/engine/checkpoint/init.py Introduces checkpoint test package marker.
tests/unit/core/conftest.py Updates Task factory to include parent_task_id.
tests/unit/api/conftest.py Adds in-memory checkpoint/heartbeat repos to fake persistence backend for API tests.
src/ai_company/persistence/sqlite/migrations.py Bumps schema to V6; adds checkpoints + heartbeats tables and indexes.
src/ai_company/persistence/sqlite/heartbeat_repo.py Implements SQLiteHeartbeatRepository with UTC normalization and structured logging.
src/ai_company/persistence/sqlite/checkpoint_repo.py Implements SQLiteCheckpointRepository with filtered “latest” queries and logging.
src/ai_company/persistence/sqlite/backend.py Wires new repositories into SQLitePersistenceBackend lifecycle/properties.
src/ai_company/persistence/sqlite/init.py Exports new SQLite repositories.
src/ai_company/persistence/repositories.py Adds CheckpointRepository and HeartbeatRepository protocols.
src/ai_company/persistence/protocol.py Extends PersistenceBackend protocol with checkpoint/heartbeat repositories.
src/ai_company/observability/events/persistence.py Adds persistence event constants for checkpoints/heartbeats.
src/ai_company/observability/events/execution.py Adds execution-level events for checkpoint callback and resume flow.
src/ai_company/observability/events/checkpoint.py Introduces checkpoint-domain event constants module.
src/ai_company/engine/recovery.py Extends RecoveryResult with checkpoint resume fields + validation; adds finalize() to protocol.
src/ai_company/engine/react_loop.py Adds optional checkpoint callback invocation per turn with guarded error logging.
src/ai_company/engine/plan_execute_loop.py Adds optional checkpoint callback invocation for planner/step turns; exposes config property.
src/ai_company/engine/checkpoint/strategy.py Implements CheckpointRecoveryStrategy with resume attempt tracking + fallback cleanup.
src/ai_company/engine/checkpoint/resume.py Adds helpers for deserialization/reconciliation, loop injection, and artifact cleanup.
src/ai_company/engine/checkpoint/models.py Adds Checkpoint, Heartbeat, and CheckpointConfig frozen models + validation.
src/ai_company/engine/checkpoint/callback_factory.py Adds make_checkpoint_callback() to persist checkpoints + heartbeats best-effort.
src/ai_company/engine/checkpoint/callback.py Defines the CheckpointCallback type alias.
src/ai_company/engine/checkpoint/init.py Exposes checkpoint module public API.
src/ai_company/engine/agent_engine.py Integrates checkpoint callback injection, resume-from-checkpoint path, and cleanup/finalization threading.
src/ai_company/engine/init.py Re-exports checkpoint-related engine symbols.
docs/design/engine.md Updates design docs to reflect implemented checkpoint recovery strategy and fields.
README.md Updates overview/diagram to reflect persistence→engine integration + crash recovery.
CLAUDE.md Updates repo structure and observability event constant guidance to include checkpoint.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +61 to +66
try:
checkpoint_ctx = AgentContext.model_validate_json(checkpoint_json)
except ValueError:
logger.exception(
CHECKPOINT_RECOVERY_DESERIALIZE_FAILED,
agent_id=agent_id,
Comment on lines +454 to +458
if execution_result.termination_reason != TerminationReason.ERROR:
exec_id = execution_result.context.execution_id
if self._recovery_strategy is not None:
await self._recovery_strategy.finalize(exec_id)
await cleanup_checkpoint_artifacts(
@Aureliolo Aureliolo merged commit f886838 into main Mar 14, 2026
32 of 34 checks passed
@Aureliolo Aureliolo deleted the feat/checkpoint-recovery branch March 14, 2026 09:38
@Aureliolo Aureliolo temporarily deployed to cloudflare-preview March 14, 2026 09:38 — with GitHub Actions Inactive
Aureliolo added a commit that referenced this pull request Mar 14, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.1.4](v0.1.3...v0.1.4)
(2026-03-14)


### Features

* add approval workflow gates to TaskEngine
([#387](#387))
([2db968a](2db968a))
* implement checkpoint recovery strategy
([#367](#367))
([f886838](f886838))


### CI/CD

* add npm and pre-commit ecosystems to Dependabot
([#369](#369))
([54e5fe7](54e5fe7))
* bump actions/setup-node from 4.4.0 to 6.3.0
([#360](#360))
([2db5105](2db5105))
* bump github/codeql-action from 3.32.6 to 4.32.6
([#361](#361))
([ce766e8](ce766e8))
* group major dependabot bumps per ecosystem
([#388](#388))
([3c43aef](3c43aef))


### Maintenance

* bump @vitejs/plugin-vue from 5.2.4 to 6.0.5 in /web
([#382](#382))
([d7054ee](d7054ee))
* bump @vue/tsconfig from 0.7.0 to 0.9.0 in /web in the minor-and-patch
group across 1 directory
([#371](#371))
([64fa08b](64fa08b))
* bump astro from 5.18.1 to 6.0.4 in /site
([#376](#376))
([d349317](d349317))
* bump https://github.com/astral-sh/ruff-pre-commit from v0.15.5 to
0.15.6 ([#372](#372))
([dcacb2e](dcacb2e))
* bump https://github.com/gitleaks/gitleaks from v8.24.3 to 8.30.1
([#375](#375))
([a18e6ed](a18e6ed))
* bump https://github.com/hadolint/hadolint from v2.12.0 to 2.14.0
([#373](#373))
([47b906b](47b906b))
* bump https://github.com/pre-commit/pre-commit-hooks from v5.0.0 to
6.0.0 ([#374](#374))
([1926555](1926555))
* bump litellm from 1.82.1 to 1.82.2 in the minor-and-patch group
([#385](#385))
([fa4f7b7](fa4f7b7))
* bump node from 22-alpine to 25-alpine in /docker/web
([#359](#359))
([8d56cd3](8d56cd3))
* bump node from 22-slim to 25-slim in /docker/sandbox
([#358](#358))
([3de8748](3de8748))
* bump pinia from 2.3.1 to 3.0.4 in /web
([#381](#381))
([c78dcc2](c78dcc2))
* bump the major group across 1 directory with 9 updates
([#389](#389))
([9fa621b](9fa621b))
* bump the minor-and-patch group with 2 updates
([#362](#362))
([6ede2ce](6ede2ce))
* bump vue-router from 4.6.4 to 5.0.3 in /web
([#378](#378))
([6c60f6c](6c60f6c))
* expand review skills to 18 smart conditional agents
([#364](#364))
([494013f](494013f))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: implement checkpoint recovery strategy

2 participants