feat: implement checkpoint recovery strategy#367
Conversation
Adds the Checkpoint Recovery strategy (Engine design §6.6, Strategy 2) which persists AgentContext snapshots after each completed turn and resumes from the last checkpoint on crash, preserving progress. After max_resume_attempts failures, falls back to fail-and-reassign. New modules: - engine/checkpoint/ — models, callback factory, strategy - persistence/sqlite/ — checkpoint and heartbeat repositories - observability/events/checkpoint.py — lifecycle event constants Key changes: - RecoveryResult extended with checkpoint_context_json, resume_attempt, can_resume fields (backward-compatible defaults) - ReactLoop and PlanExecuteLoop accept optional checkpoint_callback - AgentEngine wires checkpoint repos, builds callback-enabled loops, and resumes from checkpoint on recoverable errors - SQLite schema V6 adds checkpoints and heartbeats tables - PersistenceBackend protocol gains checkpoints/heartbeats properties - 100 new unit tests, full suite green at 94% coverage
Pre-reviewed by 15 agents, 42 findings addressed: - Fix missing completion_config and effective_autonomy on resume path - Add MemoryError/RecursionError re-raise in strategy checkpoint load - Add asyncio.Lock to protect _resume_counts from TOCTOU race - Clean up resume counts in fallback path to prevent memory leak - Validate checkpoint_repo and heartbeat_repo are provided together - Log warning for unsupported loop types in checkpoint injection - Skip heartbeat update when checkpoint save fails (prevent limbo) - Guard against turn-zero checkpoint via explicit check - Sanitize error message in LLM reconciliation ChatMessage - Replace assert with explicit RuntimeError for checkpoint_context_json - Add cross-field validator for checkpoint_context_json/resume_attempt - Clean up checkpoints after successful resume completion - Thread completion_config and effective_autonomy through recovery chain - Extract duplicate checkpoint callback blocks into helper method - Add public config property to PlanExecuteLoop (remove SLF001 noqa) - Use EXECUTION_CHECKPOINT_CALLBACK_FAILED event for error paths - Add structured error info to callback exception logging - Use NotBlankStr for repo params and callback_factory identifiers - Move cursor.rowcount read before commit in repos - Fix HeartbeatRepository.get_stale to accept AwareDatetime - Drop DESC from composite index (portability for SQLite < 3.47) - Document intentional FK absence in migrations - Document heartbeat_interval_seconds as reserved for future use - Update CLAUDE.md package structure and logging event docs - Update PersistenceBackend and RecoveryResult docstrings - Fix various docstring accuracy issues across modules - Add FakePersistenceBackend checkpoint/heartbeat repos in test conftest - Add protocol compliance assertions for CheckpointRepository/HeartbeatRepository - Add test_checkpoint_events_exist assertions - Fix _FakeCostRecordRepository.aggregate missing task_id (pre-existing) - Verify checkpoint/heartbeat model content in callback factory tests - Fix test_strategy.py return type annotation
Critical fixes:
- execution_id mismatch: use checkpoint_ctx.execution_id for cleanup
instead of fabricated f"{agent_id}:{task_id}" (Copilot, Greptile,
code-reviewer, python-reviewer, silent-failure, security)
- PEP 758 except syntax in models.py
- Extract _resume_from_checkpoint (118→35 lines) into helpers
- Extract recover() (108→30 lines) into smaller methods
- Move _make_loop_with_callback to checkpoint/resume.py
- Create engine/checkpoint/resume.py for resume orchestration helpers
Major fixes:
- Include actual error_message in reconciliation (not hardcoded text)
- Move _delegate_to_fallback outside lock scope in strategy
- Clean up heartbeat after successful resume (Greptile)
- Conditional checkpoint cleanup — skip on ERROR (CodeRabbit)
- Fix EXECUTION_LOOP_STARTED → EXECUTION_LOOP_START in CLAUDE.md
- Fix PARALLEL_EXECUTION_STARTED → PARALLEL_GROUP_START in CLAUDE.md
- Add 3 missing checkpoint fields to RecoveryResult table in design spec
- Remove non-existent storage field from YAML config example
- Update reconciliation description to match implementation
- Make clear_resume_count async with lock for thread safety
- Normalize heartbeat timestamps to UTC for correct comparisons
- Narrow except Exception to PersistenceError in strategy
- Split make_checkpoint_callback (92→3 functions under 50 lines)
Medium fixes:
- Use CHECKPOINT_DELETE_FAILED event constant for cleanup warnings
- Add exc_info=True on cleanup warning logs
- Log before raising RuntimeError on None checkpoint_context_json
- Type clear_resume_count parameter as NotBlankStr
- Add JSON validation to RecoveryResult.checkpoint_context_json
- Bound _resume_counts dict with _MAX_TRACKED_EXECUTIONS eviction
- Update RecoveryStrategy protocol docstring for checkpoint recovery
- Add checkpoint recovery to README capabilities
- Update ReactLoop checkpoint_callback docstring
- FakeCheckpointRepository.get_latest validates no-filter ValueError
- Add test for RecoveryResult._validate_checkpoint_consistency
- Add CHECKPOINT_UNSUPPORTED_LOOP event constant
Minor fixes:
- Fix checkpoint __init__.py docstring (configurable intervals)
- Fix callback.py type alias docstring (may skip)
- Fix callback_factory docstring (turn-0 skip mention)
- FakeHeartbeatRepository.get_stale sorts by timestamp
- Add turn=0 parametrize cases to test_callback_factory
- Tighten context_json validator to reject arrays/primitives
- Add composite (task_id, turn_number) index
- Add CHECK (turn_number >= 0) constraint to DDL
- Remove duplicate test_schema_version from test_migrations_v2
- Use SCHEMA_VERSION constant in test_migrations_v2
CodeRabbit (5 new actionable comments on commit 6a8cfa7): - Resume path now uses BudgetEnforcer when configured (was bypassing enforcer-specific features like per-agent allocation, auto-downgrade) - Log deserialization errors before re-raising in resume.py - Use HEARTBEAT_DELETE_FAILED event constant for heartbeat cleanup failures (was incorrectly using CHECKPOINT_DELETE_FAILED) - Fix reconciliation bullet in design spec (was inconsistent with implementation description below it) - Add comment explaining checkpoint placement before tool execution (intentional: preserves LLM response on mid-tool crash) Greptile (3 findings, confidence 3/5): - Clean up orphaned checkpoint/heartbeat rows on max-resume fallback path (delegate_to_fallback now calls cleanup_after_resume) - Add heartbeat_repo to CheckpointRecoveryStrategy for fallback cleanup - Document in-memory resume counter limitation with TODO comment (persisting counter across restarts is a planned improvement) - Wrong heartbeat event constant — already fixed above
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a checkpoint-based crash-recovery subsystem: per-turn AgentContext persistence and heartbeats, loop hooks and callbacks for checkpointing, a CheckpointRecoveryStrategy with resume attempts and fallback, SQLite repos + migration v6, engine wiring for resume flows, observability events, tests, and docs updates. Changes
Sequence Diagram(s)mermaid Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
📝 Coding Plan
Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the system's resilience by introducing a robust checkpoint recovery strategy. Agents can now automatically recover from crashes by resuming from their last saved state, preserving valuable progress on long-running tasks. This feature incorporates periodic state persistence, liveness detection, and a configurable retry mechanism, ensuring greater operational stability and a graceful fallback to task re-assignment when recovery limits are met. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/ai_company/engine/agent_engine.py (1)
1089-1093:⚠️ Potential issue | 🟡 MinorMissing
completion_configandeffective_autonomyin fatal error recovery path.
_build_error_executioncalls_apply_recoverywithoutcompletion_configandeffective_autonomy. If checkpoint recovery is triggered during fatal error handling, the resumed execution won't have proper configuration threading.This is low-impact since fatal errors typically occur before checkpoints exist.
🛡️ Proposed fix
Update
_handle_fatal_errorand_build_error_executionsignatures to accept and forward these parameters:async def _build_error_execution( # noqa: PLR0913 self, identity: AgentIdentity, task: Task, agent_id: str, task_id: str, error_msg: str, ctx: AgentContext | None, + *, + completion_config: CompletionConfig | None = None, + effective_autonomy: EffectiveAutonomy | None = None, ) -> ExecutionResult: """Create an error ``ExecutionResult`` and apply recovery.""" error_ctx = ctx or AgentContext.from_identity(identity, task=task) error_execution = ExecutionResult( context=error_ctx, termination_reason=TerminationReason.ERROR, error_message=error_msg, ) return await self._apply_recovery( error_execution, agent_id, task_id, + completion_config=completion_config, + effective_autonomy=effective_autonomy, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 1089 - 1093, The fatal-error recovery path is missing completion_config and effective_autonomy and thus _build_error_execution currently calls _apply_recovery without them; update the signatures of _handle_fatal_error and _build_error_execution to accept completion_config and effective_autonomy, and forward those parameters into the call to _apply_recovery (and any intermediate calls) so that when checkpoint recovery is applied the resumed execution receives the correct completion_config and effective_autonomy values; ensure parameter names match existing usage and are passed unchanged through to _apply_recovery.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 674-739: The _resume_from_checkpoint function is over the 50-line
limit; extract the core resume logic into a smaller helper (e.g.,
_reconstruct_and_run_resume) that performs deserialize_and_reconcile(...) and
awaits _execute_resumed_loop(...), preserves the existing try/except behavior
(re-raising MemoryError/RecursionError, logging other exceptions with
EXECUTION_RESUME_FAILED and agent_id/task_id), and returns the ExecutionResult
along with checkpoint_ctx.execution_id so _resume_from_checkpoint can call await
self._finalize_resume(...) and return the result; ensure logger.info and the
initial checkpoint null check remain in _resume_from_checkpoint and that
signature/exception semantics of _execute_resumed_loop and _finalize_resume are
unchanged.
In `@src/ai_company/engine/checkpoint/resume.py`:
- Around line 38-86: In deserialize_and_reconcile, change the exception logging
to use a distinct error event constant instead of
CHECKPOINT_RECOVERY_RECONCILIATION: replace the logger.exception call in the
except ValueError block to log CHECKPOINT_RECOVERY_DESERIALIZE_FAILED (including
agent_id, task_id and the error message) and keep the success path logging
(logger.debug) using CHECKPOINT_RECOVERY_RECONCILIATION; ensure
CHECKPOINT_RECOVERY_DESERIALIZE_FAILED is defined/ imported where constants live
so log tooling can distinguish deserialize failures from successful
reconciliation events.
In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 222-228: The current FIFO eviction on self._resume_counts can drop
an actively-recovering execution; change to LRU eviction by turning
_resume_counts into an order-preserving structure and updating access order
whenever an execution_id is touched. Concretely, replace the plain dict use of
_resume_counts with an OrderedDict (or maintain last_access timestamps) and in
the code paths that read/increment _resume_counts (the block updating
self._resume_counts[execution_id] and the eviction check guarded by
_MAX_TRACKED_EXECUTIONS) call move_to_end(execution_id) (or update the
timestamp) so the eviction removes the least-recently-used key instead of the
oldest-inserted key. Ensure the eviction still pops from the front when len(...)
> _MAX_TRACKED_EXECUTIONS.
- Around line 231-241: The code reads _resume_counts in _build_resume_result
which races with updates made inside the lock by _reserve_resume_attempt; modify
_reserve_resume_attempt to return the reserved resume count and change recover
to capture that return value and pass it as an explicit resume_attempt argument
into _build_resume_result (update _build_resume_result signature to accept
resume_attempt instead of reading _resume_counts itself) so the metadata uses
the locked, returned value rather than reading _resume_counts outside the lock.
In `@tests/unit/persistence/sqlite/test_migrations_v6.py`:
- Around line 100-105: The test's expected index set in test_migrations_v6.py is
missing the checkpoint index "idx_cp_task_turn"; update the expected set (used
with the variable indexes) to include "idx_cp_task_turn" so the assertion
(assert expected.issubset(indexes)) verifies all checkpoint indexes created by
the v6 migration.
---
Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 1089-1093: The fatal-error recovery path is missing
completion_config and effective_autonomy and thus _build_error_execution
currently calls _apply_recovery without them; update the signatures of
_handle_fatal_error and _build_error_execution to accept completion_config and
effective_autonomy, and forward those parameters into the call to
_apply_recovery (and any intermediate calls) so that when checkpoint recovery is
applied the resumed execution receives the correct completion_config and
effective_autonomy values; ensure parameter names match existing usage and are
passed unchanged through to _apply_recovery.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6555c608-4472-4e35-a2e7-74f236039a47
📒 Files selected for processing (36)
CLAUDE.mdREADME.mddocs/design/engine.mdsrc/ai_company/engine/__init__.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/recovery.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/observability/events/execution.pysrc/ai_company/observability/events/persistence.pysrc/ai_company/persistence/protocol.pysrc/ai_company/persistence/repositories.pysrc/ai_company/persistence/sqlite/__init__.pysrc/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/persistence/sqlite/migrations.pytests/unit/api/conftest.pytests/unit/engine/checkpoint/__init__.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/engine/checkpoint/test_models.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/observability/test_events.pytests/unit/persistence/sqlite/test_checkpoint_repo.pytests/unit/persistence/sqlite/test_heartbeat_repo.pytests/unit/persistence/sqlite/test_migrations_v6.pytests/unit/persistence/test_migrations_v2.pytests/unit/persistence/test_protocol.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Agent
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
- GitHub Check: Test (Python 3.14)
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (6)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT usefrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax:except A, B:(no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names:example-provider,example-large-001,example-medium-001,example-small-001. Tests must usetest-provider,test-small-001, etc.
Files:
tests/unit/engine/checkpoint/test_models.pysrc/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pytests/unit/observability/test_events.pysrc/ai_company/persistence/sqlite/__init__.pytests/unit/persistence/test_migrations_v2.pysrc/ai_company/observability/events/checkpoint.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/persistence/sqlite/test_migrations_v6.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/persistence/protocol.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/checkpoint/resume.pytests/unit/persistence/sqlite/test_checkpoint_repo.pysrc/ai_company/engine/__init__.pysrc/ai_company/engine/recovery.pysrc/ai_company/persistence/sqlite/migrations.pytests/unit/api/conftest.pysrc/ai_company/engine/checkpoint/__init__.pytests/unit/engine/test_recovery_checkpoint_fields.pysrc/ai_company/persistence/repositories.pytests/unit/persistence/sqlite/test_heartbeat_repo.pytests/unit/engine/checkpoint/test_strategy.pysrc/ai_company/observability/events/persistence.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/checkpoint/strategy.pytests/unit/persistence/test_protocol.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Tests must use markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: useasyncio_mode = "auto"— no manual@pytest.mark.asyncioneeded.
Test timeout: 30 seconds per test.
Prefer@pytest.mark.parametrizefor testing similar cases.
Files:
tests/unit/engine/checkpoint/test_models.pytests/unit/observability/test_events.pytests/unit/persistence/test_migrations_v2.pytests/unit/engine/checkpoint/test_callback_factory.pytests/unit/persistence/sqlite/test_migrations_v6.pytests/unit/persistence/sqlite/test_checkpoint_repo.pytests/unit/api/conftest.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/persistence/sqlite/test_heartbeat_repo.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/persistence/test_protocol.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction plusMappingProxyTypewrapping for read-only enforcement.
Fordict/listfields in frozen Pydantic models, rely onfrozen=Truefor field reassignment prevention and usecopy.deepcopy()at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 withBaseModel,model_validator,computed_field,ConfigDict. Use@computed_fieldfor derived values instead of storing redundant fields (e.g.,TokenUsage.total_tokens).
UseNotBlankStrfromcore.typesfor all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over barecreate_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER useimport logging,logging.getLogger(), orprint()in application code.
Always useloggeras the variable name for loggers (not_logger, notlog).
Event names must always use constants from domain-specific modules underai_company.observability.events(e.g.,PROVIDER_CALL_STARTfromevents.provider,BUDGET_RECORD_ADDEDfromevents.budget). Import dire...
Files:
src/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/persistence/sqlite/__init__.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/persistence/protocol.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/__init__.pysrc/ai_company/engine/recovery.pysrc/ai_company/persistence/sqlite/migrations.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/persistence/repositories.pysrc/ai_company/observability/events/persistence.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/checkpoint/strategy.py
src/ai_company/**/[!_]*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Every module with business logic MUST have:
from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Files:
src/ai_company/persistence/sqlite/backend.pysrc/ai_company/persistence/sqlite/heartbeat_repo.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/checkpoint/callback.pysrc/ai_company/engine/checkpoint/callback_factory.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/checkpoint/models.pysrc/ai_company/persistence/protocol.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/recovery.pysrc/ai_company/persistence/sqlite/migrations.pysrc/ai_company/persistence/repositories.pysrc/ai_company/observability/events/persistence.pysrc/ai_company/persistence/sqlite/checkpoint_repo.pysrc/ai_company/engine/agent_engine.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/checkpoint/strategy.py
docs/design/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
When approved deviations occur, update the relevant
docs/design/page to reflect the new reality.
Files:
docs/design/engine.md
docs/**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
Documentation source in
docs/built with Zensical. Design spec indocs/design/(7 pages). Architecture indocs/architecture/. Roadmap indocs/roadmap/. Security indocs/security.md.
Files:
docs/design/engine.md
🧠 Learnings (9)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/observability/events/execution.pytests/unit/observability/test_events.pyCLAUDE.mdsrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/react_loop.pysrc/ai_company/observability/events/persistence.pysrc/ai_company/engine/plan_execute_loop.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/ai_company/**/[!_]*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
CLAUDE.mdsrc/ai_company/engine/plan_execute_loop.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Structured logging: always use `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All state transitions must log at INFO level.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : NEVER use `import logging`, `logging.getLogger()`, or `print()` in application code.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : All error paths must log at WARNING or ERROR with context before raising.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Always use `logger` as the variable name for loggers (not `_logger`, not `log`).
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : DEBUG logging for object creation, internal flow, and entry/exit of key functions.
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Pure data models, enums, and re-exports do NOT need logging.
Applied to files:
CLAUDE.md
🧬 Code graph analysis (22)
tests/unit/engine/checkpoint/test_models.py (1)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-65)CheckpointConfig(88-120)Heartbeat(68-85)
src/ai_company/persistence/sqlite/heartbeat_repo.py (4)
src/ai_company/engine/checkpoint/models.py (1)
Heartbeat(68-85)src/ai_company/observability/_logger.py (1)
get_logger(8-28)src/ai_company/persistence/errors.py (1)
QueryError(33-34)src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
_row_to_model(159-174)
src/ai_company/engine/checkpoint/callback.py (1)
src/ai_company/engine/context.py (1)
AgentContext(87-307)
src/ai_company/persistence/sqlite/__init__.py (2)
src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
SQLiteCheckpointRepository(27-174)src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
SQLiteHeartbeatRepository(27-168)
tests/unit/persistence/test_migrations_v2.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
get_user_version(261-265)tests/unit/persistence/sqlite/conftest.py (1)
memory_db(15-22)
tests/unit/engine/checkpoint/test_callback_factory.py (5)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(88-120)src/ai_company/engine/context.py (2)
AgentContext(87-307)from_identity(140-171)src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
save(37-67)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
tests/unit/persistence/sqlite/test_migrations_v6.py (2)
src/ai_company/persistence/sqlite/migrations.py (1)
run_migrations(338-406)tests/unit/persistence/test_migrations_v2.py (1)
memory_db(21-26)
src/ai_company/engine/checkpoint/models.py (2)
src/ai_company/tools/base.py (1)
description(138-140)src/ai_company/engine/parallel_models.py (2)
agent_id(79-81)task_id(87-89)
src/ai_company/persistence/protocol.py (4)
src/ai_company/persistence/repositories.py (3)
CheckpointRepository(484-533)CostRecordRepository(111-163)HeartbeatRepository(537-595)tests/unit/api/conftest.py (2)
checkpoints(477-478)heartbeats(481-482)tests/unit/persistence/test_protocol.py (2)
checkpoints(309-310)heartbeats(313-314)src/ai_company/persistence/sqlite/backend.py (2)
checkpoints(373-379)heartbeats(382-388)
src/ai_company/engine/react_loop.py (2)
src/ai_company/engine/checkpoint/callback_factory.py (1)
_checkpoint_callback(63-76)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
tests/unit/persistence/sqlite/test_checkpoint_repo.py (3)
src/ai_company/engine/checkpoint/models.py (1)
Checkpoint(25-65)src/ai_company/persistence/sqlite/checkpoint_repo.py (4)
SQLiteCheckpointRepository(27-174)save(37-66)get_latest(68-132)delete_by_execution(134-157)tests/unit/persistence/sqlite/conftest.py (1)
migrated_db(26-34)
src/ai_company/engine/__init__.py (3)
src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-65)CheckpointConfig(88-120)Heartbeat(68-85)src/ai_company/engine/checkpoint/strategy.py (1)
CheckpointRecoveryStrategy(45-284)src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)
src/ai_company/engine/recovery.py (1)
src/ai_company/engine/task_execution.py (1)
TaskExecution(60-231)
src/ai_company/persistence/sqlite/migrations.py (2)
tests/unit/persistence/sqlite/test_user_repo.py (1)
db(22-28)tests/unit/hr/test_persistence.py (1)
db(29-35)
tests/unit/api/conftest.py (2)
src/ai_company/engine/checkpoint/models.py (2)
Checkpoint(25-65)Heartbeat(68-85)src/ai_company/persistence/protocol.py (2)
checkpoints(146-148)heartbeats(151-153)
src/ai_company/engine/checkpoint/__init__.py (4)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-65)CheckpointConfig(88-120)Heartbeat(68-85)src/ai_company/engine/checkpoint/resume.py (2)
cleanup_after_resume(129-172)deserialize_and_reconcile(38-86)src/ai_company/engine/checkpoint/strategy.py (1)
CheckpointRecoveryStrategy(45-284)
tests/unit/engine/test_recovery_checkpoint_fields.py (2)
src/ai_company/engine/recovery.py (5)
FailAndReassignStrategy(150-220)RecoveryResult(31-112)can_resume(110-112)recover(124-143)recover(162-216)src/ai_company/engine/checkpoint/strategy.py (1)
recover(83-143)
src/ai_company/persistence/repositories.py (4)
src/ai_company/engine/checkpoint/models.py (2)
Checkpoint(25-65)Heartbeat(68-85)src/ai_company/persistence/sqlite/checkpoint_repo.py (2)
get_latest(68-132)delete_by_execution(134-157)src/ai_company/persistence/sqlite/heartbeat_repo.py (3)
get(69-94)get_stale(96-127)delete(129-151)src/ai_company/persistence/sqlite/parked_context_repo.py (1)
get(65-91)
src/ai_company/engine/agent_engine.py (7)
src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(88-120)src/ai_company/engine/checkpoint/resume.py (3)
cleanup_after_resume(129-172)deserialize_and_reconcile(38-86)make_loop_with_callback(89-126)src/ai_company/engine/checkpoint/strategy.py (2)
CheckpointRecoveryStrategy(45-284)clear_resume_count(149-162)src/ai_company/engine/recovery.py (3)
RecoveryResult(31-112)RecoveryStrategy(116-147)can_resume(110-112)src/ai_company/persistence/repositories.py (2)
CheckpointRepository(484-533)HeartbeatRepository(537-595)src/ai_company/engine/loop_protocol.py (3)
ExecutionLoop(151-189)ExecutionResult(79-140)execute(159-185)src/ai_company/engine/context.py (1)
AgentContext(87-307)
src/ai_company/engine/plan_execute_loop.py (3)
src/ai_company/engine/plan_models.py (1)
PlanExecuteConfig(91-118)src/ai_company/engine/checkpoint/callback_factory.py (1)
_checkpoint_callback(63-76)src/ai_company/engine/context.py (1)
AgentContext(87-307)
src/ai_company/engine/checkpoint/strategy.py (8)
src/ai_company/engine/checkpoint/models.py (1)
Checkpoint(25-65)src/ai_company/engine/checkpoint/resume.py (1)
cleanup_after_resume(129-172)src/ai_company/engine/context.py (2)
AgentContext(87-307)to_snapshot(280-302)src/ai_company/engine/recovery.py (6)
FailAndReassignStrategy(150-220)RecoveryResult(31-112)recover(124-143)recover(162-216)get_strategy_type(145-147)get_strategy_type(218-220)src/ai_company/persistence/errors.py (1)
PersistenceError(8-9)src/ai_company/persistence/repositories.py (8)
CheckpointRepository(484-533)HeartbeatRepository(537-595)get_latest(498-519)get(59-71)get(218-230)get(346-358)get(426-438)get(551-563)src/ai_company/engine/parallel_models.py (1)
task_id(87-89)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
tests/unit/persistence/test_protocol.py (2)
src/ai_company/persistence/repositories.py (24)
CheckpointRepository(484-533)HeartbeatRepository(537-595)save(48-57)save(114-123)save(170-180)save(207-216)save(284-294)save(335-344)save(415-424)save(487-496)save(540-549)get_latest(498-519)delete_by_execution(521-533)get(59-71)get(218-230)get(346-358)get(426-438)get(551-563)get_stale(565-581)delete(95-107)delete(260-272)delete(396-408)delete(468-480)delete(583-595)src/ai_company/engine/checkpoint/models.py (2)
Checkpoint(25-65)Heartbeat(68-85)
🪛 LanguageTool
CLAUDE.md
[style] ~154-~154: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...
(EG_NO_COMMA)
🔇 Additional comments (64)
src/ai_company/engine/checkpoint/callback.py (1)
13-14: Type alias is clear and correctly scoped.
CheckpointCallbackat Line 13 matches the intended async callback contract and keeps checkpoint integration points strongly typed.src/ai_company/persistence/sqlite/migrations.py (1)
26-26: V6 migration wiring is consistent and complete.Line 26, Lines 224-256, and Lines 320-335 correctly align schema version, DDL, and migration registration for checkpoints/heartbeats.
Also applies to: 224-256, 320-335
tests/unit/persistence/test_migrations_v2.py (1)
58-58: Good replacement of hardcoded schema version.Line 58 now tracks the canonical constant, which prevents brittle test failures on future migration bumps.
src/ai_company/observability/events/persistence.py (1)
152-176: Event constant additions look consistent.The new checkpoint and heartbeat constants are well-named and complete for save/query/delete/deserialization paths.
README.md (1)
34-35: Docs update matches the implemented architecture.The capability and diagram updates correctly communicate checkpoint-based crash recovery and persistence↔engine interaction.
Also applies to: 114-114
src/ai_company/persistence/sqlite/__init__.py (1)
7-12: Re-export surface is correctly extended.The new repository symbols are imported and exposed in
__all__consistently.Also applies to: 26-29
CLAUDE.md (1)
107-107: Guideline updates are aligned with the checkpoint recovery work.Both changed lines accurately reflect the new engine recovery surface and observability-event usage conventions.
Also applies to: 154-154
tests/unit/persistence/sqlite/test_migrations_v6.py (1)
1-85: The rest of the V6 migration coverage is solid.Table existence/columns, heartbeat index presence, and idempotency checks are all well-structured.
Also applies to: 107-135
tests/unit/observability/test_events.py (1)
11-15: LGTM!The new checkpoint event imports and test coverage follow established patterns in this file. Event constants are correctly imported from the domain-specific module (
ai_company.observability.events.checkpoint), and the test verifies the expected string values while updating the domain module discovery set.Also applies to: 227-227, 254-258
src/ai_company/persistence/protocol.py (1)
145-153: LGTM!The new
checkpointsandheartbeatsproperty accessors follow the established pattern for other repository accessors in the protocol. Docstrings are consistent with existing properties, and the protocol update aligns with the concrete implementations inSQLitePersistenceBackendand test fakes.tests/unit/engine/checkpoint/test_callback_factory.py (1)
1-384: LGTM!Comprehensive test coverage for the checkpoint callback factory. Tests properly cover:
- Boundary turn logic with parametrized cases
- Heartbeat update behavior (saved on checkpoint, skipped on skip)
- Error handling: swallowing of general exceptions, propagation of
MemoryError/RecursionError- Checkpoint failure skipping heartbeat to avoid inconsistent state
Test structure follows conventions with proper markers and timeout configuration.
src/ai_company/persistence/sqlite/backend.py (1)
32-37: LGTM!The new
checkpointsandheartbeatsrepository integration follows the established patterns exactly:
- Private attributes initialized to
None- State clearing in
_clear_state- Repository instantiation in
_create_repositories- Public accessors using
_require_connectedfor connectivity guardsConsistent with all other repository wiring in this backend.
Also applies to: 88-89, 104-105, 170-171, 372-388
tests/unit/persistence/sqlite/test_heartbeat_repo.py (1)
1-194: LGTM!Thorough test coverage for
SQLiteHeartbeatRepositorycovering all protocol methods:
- Save/get roundtrip with field verification
- Missing execution returns
None- Upsert behavior updates existing records
get_stalefiltering and ordering by timestamp- Delete semantics with isolation verification
Test structure follows conventions with proper markers and fixture usage.
src/ai_company/engine/react_loop.py (1)
209-226: LGTM!The checkpoint callback integration is well-designed:
- Clear documentation explains the intentional timing (after LLM response, before tool execution) for crash recovery
- PEP 758 except syntax used correctly for Python 3.14+
- Error handling appropriately re-raises
MemoryError/RecursionErrorwhile swallowing and logging other exceptions- Structured logging with
EXECUTION_CHECKPOINT_CALLBACK_FAILEDevent constanttests/unit/engine/test_recovery_checkpoint_fields.py (1)
1-184: LGTM!Solid test coverage for
RecoveryResultcheckpoint-related fields:
can_resumecomputed field logic verification- Consistency validation between
checkpoint_context_jsonandresume_attempt- Default value verification for
resume_attempt- Backward compatibility ensuring
FailAndReassignStrategyproduces expected non-resumable resultsTests properly use fixtures and follow conventions.
src/ai_company/observability/events/execution.py (1)
67-75: LGTM!New execution event constants for checkpoint and resume flows follow the established patterns:
Final[str]type annotations- Domain.subject.qualifier naming convention
- Logical grouping with descriptive comment header
Constants are properly consumed by
ReactLoopandPlanExecuteLoopfor observability.tests/unit/engine/checkpoint/test_models.py (1)
1-272: LGTM!Comprehensive test coverage for the checkpoint models including creation, immutability, validation, and constraint enforcement. Test structure is clean with well-designed helpers and proper use of
@pytest.mark.unitmarkers on all test classes.tests/unit/persistence/sqlite/test_checkpoint_repo.py (1)
1-238: LGTM!Thorough test coverage for the SQLite checkpoint repository including roundtrip persistence, ordering by turn_number, filtering by execution_id/task_id, upsert behavior, and deletion. Good use of explicit checkpoint IDs in tests to ensure deterministic assertions.
src/ai_company/engine/__init__.py (1)
31-38: LGTM!Clean re-export of the checkpoint subsystem's public API. The
__all__entries are properly sorted alphabetically and the imports correctly reference the checkpoint subpackage.Also applies to: 219-222, 258-258, 348-348
src/ai_company/persistence/sqlite/heartbeat_repo.py (1)
1-168: LGTM!Well-structured repository implementation that mirrors the checkpoint repository pattern. Proper UTC normalization for timestamp comparisons, comprehensive error handling with structured logging, and clean separation of concerns with the
_row_to_modelhelper.src/ai_company/engine/checkpoint/callback_factory.py (1)
1-133: LGTM!Clean factory implementation with proper closure capture. The best-effort persistence pattern (log but don't propagate errors) is appropriate for checkpoint callbacks that shouldn't crash the execution loop. Good design decision to only save heartbeat after successful checkpoint save.
src/ai_company/engine/checkpoint/__init__.py (1)
1-29: LGTM!Clean public API surface for the checkpoint subsystem. All necessary components are properly re-exported with an alphabetically sorted
__all__.tests/unit/engine/checkpoint/test_strategy.py (1)
1-439: LGTM!Comprehensive test coverage for
CheckpointRecoveryStrategyincluding protocol compliance, resume paths with checkpoint, fallback scenarios (no checkpoint, max attempts exhausted, zero attempts, repository errors), custom fallback injection, and per-execution counter independence. Well-structured with clear test organization.src/ai_company/engine/recovery.py (2)
62-92: LGTM!Well-designed consistency validation between
checkpoint_context_jsonandresume_attempt. The validator correctly enforces that both fields must be set together (for resume) or both at defaults (for non-resume). JSON validation ensures the checkpoint context is a valid object when provided.
106-112: LGTM!Clean
computed_fieldimplementation forcan_resumethat derives from the presence of checkpoint context, complementing the existingcan_reassignpattern.docs/design/engine.md (2)
511-514: LGTM! New RecoveryResult fields are well-documented.The new checkpoint recovery fields (
checkpoint_context_json,resume_attempt,can_resume) are clearly documented with appropriate types and descriptions. The computed property semantics forcan_resumealign with the implementation.
565-579: Documentation correctly reflects the checkpoint recovery implementation.The updates accurately describe:
- Storage backend delegation to the injected
CheckpointRepository- Reconciliation via system message informing the agent of the resume point and triggering error
- Deferred richer reconciliation (workspace change detection) for future iterations
This aligns well with the PR objectives and implementation.
src/ai_company/persistence/sqlite/checkpoint_repo.py (4)
37-66: LGTM! Save method correctly implements upsert with proper error handling.The
savemethod properly:
- Uses
INSERT OR REPLACEfor upsert semantics- Serializes checkpoint via
model_dump(mode="json")- Logs error before raising
QueryError- Logs success at DEBUG level after commit
68-132: LGTM! Well-structured get_latest with proper validation and parameterization.The method correctly:
- Validates that at least one filter is provided
- Builds WHERE clause from hardcoded column names (safe from injection)
- Uses parameterized queries for values
- Orders by
turn_number DESC LIMIT 1to get the latest checkpoint- Handles not-found and error cases appropriately
134-157: LGTM! Delete method with proper cleanup and logging.The
delete_by_executionmethod correctly returns the count of deleted rows and logs appropriately. The conditional logging forcount > 0avoids noise for no-op deletions.
159-174: LGTM! Row-to-model conversion with proper error handling.The
_row_to_modelhelper correctly catchesValidationErrorand wraps it inQueryErrorwith appropriate logging. This provides clear error semantics for callers.src/ai_company/engine/plan_execute_loop.py (4)
86-97: LGTM! Constructor properly accepts optional checkpoint callback.The constructor cleanly integrates the checkpoint callback as an optional parameter and exposes the config via a property. This maintains backward compatibility while enabling checkpoint persistence.
574-575: LGTM! Checkpoint callback invoked after planner turn completion.Checkpointing after the planner response ensures the plan is persisted before step execution begins. This is the correct placement for crash recovery.
711-712: LGTM! Checkpoint callback invoked after step turn completion.Checkpointing after each step turn ensures incremental progress is preserved. This aligns with the "persist after each completed turn" design documented in the PR objectives.
768-793: LGTM! Robust checkpoint callback invocation with proper error isolation.The implementation correctly:
- Short-circuits if no callback is configured
- Re-raises
MemoryErrorandRecursionError(non-recoverable)- Logs and swallows all other exceptions to prevent checkpoint failures from interrupting execution
- Uses PEP 758
exceptsyntax (Python 3.14+) as required by coding guidelinessrc/ai_company/engine/checkpoint/resume.py (2)
89-126: LGTM! Loop instantiation correctly preserves immutability.The function creates new loop instances with the callback injected rather than mutating existing loops. This follows the frozen model pattern used throughout the codebase. The warning for unsupported loop types provides appropriate observability.
129-172: LGTM! Best-effort cleanup with robust error handling.The
cleanup_after_resumefunction correctly:
- Handles
Nonerepositories gracefully- Logs successes at DEBUG level
- Catches all exceptions and logs at WARNING without propagating
- Uses
exc_info=Trueto include stack traces for debuggingThis ensures cleanup failures never interrupt the main execution flow.
tests/unit/api/conftest.py (3)
338-372: LGTM! FakeCheckpointRepository correctly implements the protocol.The implementation properly:
- Validates that at least one filter is provided in
get_latest- Returns the checkpoint with highest
turn_numberusingmax()- Returns delete count from
delete_by_executionThis mirrors the
CheckpointRepositoryprotocol defined insrc/ai_company/persistence/repositories.py.
374-395: LGTM! FakeHeartbeatRepository correctly implements the protocol.The implementation properly:
- Uses
execution_idas the unique key for storage- Returns stale heartbeats sorted by
last_heartbeat_at- Returns boolean from
deleteindicating whether removal occurredThis aligns with the
HeartbeatRepositoryprotocol.
411-412: LGTM! FakePersistenceBackend correctly exposes new repositories.The new
_checkpointsand_heartbeats_repoattributes are initialized in the constructor and exposed via properties matching the protocol surface (checkpoints,heartbeats).Also applies to: 476-482
src/ai_company/engine/checkpoint/models.py (3)
25-65: LGTM! Checkpoint model is well-designed with proper validation.The model correctly:
- Uses
frozen=Truefor immutability- Validates
context_jsonis a valid JSON object (not primitive/array)- Uses
NotBlankStrfor all identifier fields- Has timezone-aware datetime fields with
AwareDatetime- Uses lambda factories for default values to avoid mutable default issues
68-85: LGTM! Heartbeat model is minimal and well-structured.The model correctly captures the essential liveness signal fields with proper types. The
execution_idserves as the logical primary key for upsert operations.
88-120: LGTM! CheckpointConfig has sensible defaults and validation.The configuration model correctly:
- Defaults
persist_every_n_turnsto 1 (checkpoint every turn)- Documents that
heartbeat_interval_secondsis reserved for future use- Uses
ge=0formax_resume_attemptsto allow disabling resumesrc/ai_company/observability/events/checkpoint.py (1)
1-29: LGTM! Comprehensive and well-organized event constants.The module provides complete coverage of checkpoint/heartbeat lifecycle events with consistent naming conventions (
checkpoint.*,heartbeat.*). The categorization (lifecycle, heartbeat, loop integration, recovery flow) aids discoverability and aligns with coding guidelines for domain-specific event modules.src/ai_company/persistence/repositories.py (3)
25-26: LGTM! TYPE_CHECKING guard properly avoids circular imports.Using
TYPE_CHECKINGfor theCheckpointandHeartbeatimports prevents runtime circular dependencies while enabling static type checking.
483-533: LGTM! CheckpointRepository protocol is well-defined.The protocol correctly specifies:
save: Upsert semantics (insert or replace by ID)get_latest: Requires at least one filter, returns highest turn_numberdelete_by_execution: Returns count of deleted checkpointsDocstrings include proper
Raisesdocumentation forPersistenceErrorandValueError.
536-595: LGTM! HeartbeatRepository protocol is well-defined.The protocol correctly specifies:
save: Upsert by execution_idget: Single heartbeat lookupget_stale: Threshold-based query for detecting unresponsive agentsdelete: Boolean return for deletion statusThis enables the heartbeat-based failure detection described in the PR objectives.
tests/unit/persistence/test_protocol.py (5)
214-227: LGTM!The
_FakeCheckpointRepositorycorrectly implements theCheckpointRepositoryprotocol methods with appropriate signatures and return types. The use ofstrinstead ofNotBlankStris acceptable for test fakes due to structural subtyping.
230-244: LGTM!The
_FakeHeartbeatRepositoryproperly implements all required protocol methods:save,get,get_stale, anddelete. Return types align with the protocol specification.
76-82: LGTM!The
aggregatemethod signature correctly adds the optionaltask_idparameter to match the updatedCostRecordRepositoryprotocol.
308-314: LGTM!The
_FakeBackendcorrectly exposes the newcheckpointsandheartbeatsrepository properties, maintaining consistency with thePersistenceBackendprotocol.
368-372: LGTM!Good addition of protocol compliance tests for the new fake repositories, consistent with the existing test pattern in this file.
src/ai_company/engine/checkpoint/strategy.py (5)
1-39: LGTM!The module docstring, imports, and logger setup follow the project conventions. Event constants are properly imported from the domain-specific observability module.
45-82: LGTM!The class docstring and
__init__are well-structured. The in-memory resume counter with the documented limitation (resets on process restart) is appropriately noted. Usingasyncio.Lockfor async-safe access is correct.
83-143: LGTM!The
recovermethod is well-structured with clear flow: load checkpoint → check resume attempts → either delegate to fallback or build resume result. Proper logging at each decision point.
166-200: LGTM!The
_load_latest_checkpointhelper correctly handlesPersistenceErrorby logging and returningNone, while re-raisingMemoryErrorandRecursionErroras non-recoverable. The PEP 758 except syntax is correctly applied.
263-284: LGTM!The
_delegate_to_fallbackmethod correctly cleans up orphaned checkpoint/heartbeat rows before delegating to the fallback strategy. The cleanup is best-effort (errors logged but not propagated) as documented incleanup_after_resume.src/ai_company/engine/agent_engine.py (7)
18-24: LGTM!Imports for checkpoint models and resume helpers are correctly organized. The imports support the new checkpoint recovery integration.
165-181: LGTM!Good validation ensuring
checkpoint_repoandheartbeat_repoare provided together or both omitted. This prevents misconfiguration. DefaultCheckpointConfig()when none provided is a sensible fallback.
494-507: LGTM!The
_make_loop_with_callbackhelper cleanly delegates tomake_loop_with_callbackfrom the resume module, maintaining separation of concerns.
617-672: LGTM!The
_apply_recoverymethod correctly threadscompletion_configandeffective_autonomythrough to the checkpoint resume path. Thecan_resumecheck properly gates the new resume flow.
741-776: LGTM!The
_execute_resumed_loopcorrectly reconstructs the budget checker from the checkpoint context and creates a fresh tool invoker. The approach mirrors the main_executepath appropriately.
778-804: LGTM!The
_finalize_resumemethod correctly:
- Only cleans up on non-error termination (preserving data for potential re-resume on failure)
- Type-checks for
CheckpointRecoveryStrategybefore callingclear_resume_count- Performs best-effort cleanup via
cleanup_after_resume
287-294: LGTM!Correct use of PEP 758 except syntax for re-raising non-recoverable errors.
| self._resume_counts[execution_id] = resume_count + 1 | ||
|
|
||
| # Evict oldest entries when the dict grows too large | ||
| if len(self._resume_counts) > _MAX_TRACKED_EXECUTIONS: | ||
| oldest = next(iter(self._resume_counts)) | ||
| self._resume_counts.pop(oldest, None) | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
FIFO eviction may prematurely remove active executions.
The eviction uses insertion order (next(iter(...))), which may remove an execution that is still actively recovering while keeping idle ones. Consider LRU eviction or document this limitation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/checkpoint/strategy.py` around lines 222 - 228, The
current FIFO eviction on self._resume_counts can drop an actively-recovering
execution; change to LRU eviction by turning _resume_counts into an
order-preserving structure and updating access order whenever an execution_id is
touched. Concretely, replace the plain dict use of _resume_counts with an
OrderedDict (or maintain last_access timestamps) and in the code paths that
read/increment _resume_counts (the block updating
self._resume_counts[execution_id] and the eviction check guarded by
_MAX_TRACKED_EXECUTIONS) call move_to_end(execution_id) (or update the
timestamp) so the eviction removes the least-recently-used key instead of the
oldest-inserted key. Ensure the eviction still pops from the front when len(...)
> _MAX_TRACKED_EXECUTIONS.
There was a problem hiding this comment.
Pull request overview
Implements the “Checkpoint Recovery” crash-recovery strategy by persisting AgentContext checkpoints + execution heartbeats, adding repository protocols and SQLite storage, and integrating per-turn checkpoint callbacks plus a resume path into the engine.
Changes:
- Added checkpoint/heartbeat domain models, callback factory, resume helpers, and
CheckpointRecoveryStrategywith max-attempt fallback. - Extended persistence layer with checkpoint/heartbeat repository protocols, SQLite implementations, and schema v6 migration + tests.
- Integrated checkpoint callback injection into execution loops and added engine resume flow + observability events/docs.
Reviewed changes
Copilot reviewed 35 out of 36 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/persistence/test_protocol.py | Extends protocol conformance fakes to include checkpoint/heartbeat repos. |
| tests/unit/persistence/test_migrations_v2.py | Makes schema version assertions resilient to future bumps. |
| tests/unit/persistence/sqlite/test_migrations_v6.py | Adds tests for v6 checkpoints/heartbeats tables + indexes + idempotency. |
| tests/unit/persistence/sqlite/test_heartbeat_repo.py | Adds unit tests for SQLite heartbeat repository behavior. |
| tests/unit/persistence/sqlite/test_checkpoint_repo.py | Adds unit tests for SQLite checkpoint repository behavior. |
| tests/unit/observability/test_events.py | Ensures checkpoint event module is discovered and key constants exist. |
| tests/unit/engine/test_recovery_checkpoint_fields.py | Tests RecoveryResult checkpoint fields (can_resume, resume_attempt) and validation rules. |
| tests/unit/engine/checkpoint/test_strategy.py | Tests checkpoint recovery strategy behavior, fallback, and counter semantics. |
| tests/unit/engine/checkpoint/test_models.py | Tests Pydantic checkpoint/heartbeat/config models and validation/frozen behavior. |
| tests/unit/engine/checkpoint/test_callback_factory.py | Tests checkpoint callback persistence boundary logic and error handling. |
| tests/unit/engine/checkpoint/init.py | Adds checkpoint test package marker. |
| tests/unit/api/conftest.py | Adds in-memory fake checkpoint/heartbeat repos to API test persistence backend. |
| src/ai_company/persistence/sqlite/migrations.py | Bumps schema to v6 and adds checkpoints/heartbeats DDL + indexes. |
| src/ai_company/persistence/sqlite/heartbeat_repo.py | Implements SQLite heartbeat persistence (upsert/get/get_stale/delete). |
| src/ai_company/persistence/sqlite/checkpoint_repo.py | Implements SQLite checkpoint persistence (upsert/get_latest/delete_by_execution). |
| src/ai_company/persistence/sqlite/backend.py | Wires checkpoint/heartbeat repos into SQLite backend lifecycle and properties. |
| src/ai_company/persistence/sqlite/init.py | Exports new SQLite checkpoint/heartbeat repositories. |
| src/ai_company/persistence/repositories.py | Adds CheckpointRepository and HeartbeatRepository protocols. |
| src/ai_company/persistence/protocol.py | Extends PersistenceBackend protocol with checkpoint/heartbeat repositories. |
| src/ai_company/observability/events/persistence.py | Adds persistence event constants for checkpoint/heartbeat operations. |
| src/ai_company/observability/events/execution.py | Adds execution-level events for checkpoint callback and resume lifecycle. |
| src/ai_company/observability/events/checkpoint.py | Introduces checkpoint-domain event constants. |
| src/ai_company/engine/recovery.py | Extends RecoveryResult with checkpoint resume fields + consistency validation. |
| src/ai_company/engine/react_loop.py | Injects optional per-turn checkpoint callback into ReAct loop. |
| src/ai_company/engine/plan_execute_loop.py | Injects optional per-turn checkpoint callback into Plan-and-Execute loop. |
| src/ai_company/engine/checkpoint/strategy.py | Adds CheckpointRecoveryStrategy with in-memory async-locked resume attempt counter. |
| src/ai_company/engine/checkpoint/resume.py | Adds resume helpers: deserialize+reconcile, loop callback injection, cleanup helpers. |
| src/ai_company/engine/checkpoint/models.py | Adds frozen Pydantic models: Checkpoint, Heartbeat, CheckpointConfig. |
| src/ai_company/engine/checkpoint/callback_factory.py | Adds factory to produce best-effort per-turn checkpoint+heartbeat persistence callback. |
| src/ai_company/engine/checkpoint/callback.py | Adds CheckpointCallback type alias. |
| src/ai_company/engine/checkpoint/init.py | Exposes checkpoint package public API. |
| src/ai_company/engine/agent_engine.py | Adds checkpoint plumbing + resume orchestration into the engine recovery flow. |
| src/ai_company/engine/init.py | Re-exports checkpoint-related engine APIs. |
| docs/design/engine.md | Updates engine design docs to reflect implemented checkpoint recovery. |
| README.md | Mentions crash recovery (checkpoint resume) and updates architecture diagram. |
| CLAUDE.md | Updates repo layout + logging event guidance to include checkpoint recovery. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| execution_result = await self._apply_recovery( | ||
| execution_result, | ||
| agent_id, | ||
| task_id, | ||
| completion_config=completion_config, |
| approach as ``_execute``); timeout is not applied because | ||
| the original wall-clock deadline is no longer available. |
| ) -> RecoveryResult: | ||
| """Build a resumable ``RecoveryResult``.""" | ||
| execution_id = context.execution_id | ||
| resume_attempt = self._resume_counts.get(execution_id, 1) |
| await cleanup_after_resume( | ||
| self._checkpoint_repo, | ||
| self._heartbeat_repo, | ||
| context.execution_id, | ||
| ) |
Greptile SummaryThis PR implements §6.6 Strategy 2 (checkpoint recovery) — per-turn
Confidence Score: 2/5
Important Files Changed
|
There was a problem hiding this comment.
Code Review
This pull request introduces a robust checkpoint recovery strategy. The implementation is comprehensive, covering data models, persistence layers with migrations, integration into the agent engine and execution loops, and extensive testing. The changes are well-structured into a new checkpoint module, promoting modularity. The use of a CheckpointRecoveryStrategy that falls back to the existing FailAndReassignStrategy is a solid design choice.
My review includes one medium-severity suggestion regarding dependency inversion in the AgentEngine to further improve the design by decoupling the engine from concrete strategy implementations.
| if result.termination_reason != TerminationReason.ERROR: | ||
| if isinstance( | ||
| self._recovery_strategy, | ||
| CheckpointRecoveryStrategy, | ||
| ): | ||
| await self._recovery_strategy.clear_resume_count( | ||
| execution_id, | ||
| ) | ||
| await cleanup_after_resume( | ||
| self._checkpoint_repo, | ||
| self._heartbeat_repo, | ||
| execution_id, | ||
| ) |
There was a problem hiding this comment.
The AgentEngine is checking for a specific implementation of RecoveryStrategy (CheckpointRecoveryStrategy) and calling checkpoint-specific cleanup functions (clear_resume_count, cleanup_after_resume). This creates a tight coupling between the engine and a concrete strategy, violating the dependency inversion principle. The engine should ideally interact with recovery mechanisms only through the abstract RecoveryStrategy protocol.
A better approach would be to make the strategy responsible for its own cleanup. You could extend the RecoveryStrategy protocol with a finalize(execution_id: str) method.
CheckpointRecoveryStrategywould implementfinalizeto call bothclear_resume_countandcleanup_after_resume.- Other strategies like
FailAndReassignStrategywould have a no-op implementation.
This would simplify _finalize_resume to just if self._recovery_strategy: await self._recovery_strategy.finalize(execution_id), removing implementation-specific logic from the engine and improving maintainability.
Fix all 10 findings from CodeRabbit, Copilot, Gemini, and Greptile: - Thread completion_config/effective_autonomy through fatal error path - Extract _reconstruct_and_run_resume to keep functions under 50 lines - Add CHECKPOINT_RECOVERY_DESERIALIZE_FAILED event for distinct logging - Fix race condition: _reserve_resume_attempt returns tuple (flag, count) - Add finalize() to RecoveryStrategy protocol (removes isinstance check) - Rename cleanup_after_resume to cleanup_checkpoint_artifacts - Add idx_cp_task_turn to migration test assertion - Document cost recording and timeout policies Add 31 new tests (94→125) covering resume.py (was 0%), strategy edge cases (finalize, fallback cleanup, MemoryError/RecursionError propagation), and RecoveryResult validator edge cases.
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
src/ai_company/engine/agent_engine.py (1)
680-745: 🧹 Nitpick | 🔵 TrivialFunction still exceeds 50-line limit.
The
_resume_from_checkpointmethod spans 66 lines. While_reconstruct_and_run_resumewas extracted as suggested in past reviews, the function still exceeds the guideline. Consider extracting the initial validation/guard clause (lines 702-710) into a separate helper.As per coding guidelines: "Functions must be less than 50 lines."
♻️ Suggested extraction of validation guard
+ def _validate_checkpoint_json( + self, + recovery_result: RecoveryResult, + agent_id: str, + task_id: str, + ) -> str: + """Validate and return checkpoint_context_json or raise.""" + if recovery_result.checkpoint_context_json is None: + logger.error( + EXECUTION_RESUME_FAILED, + agent_id=agent_id, + task_id=task_id, + error="checkpoint_context_json is None but can_resume was True", + ) + msg = "checkpoint_context_json is None but can_resume was True" + raise RuntimeError(msg) + return recovery_result.checkpoint_context_json + async def _resume_from_checkpoint( ... ) -> ExecutionResult: - if recovery_result.checkpoint_context_json is None: - logger.error( - EXECUTION_RESUME_FAILED, - agent_id=agent_id, - task_id=task_id, - error="checkpoint_context_json is None but can_resume was True", - ) - msg = "checkpoint_context_json is None but can_resume was True" - raise RuntimeError(msg) + checkpoint_json = self._validate_checkpoint_json( + recovery_result, agent_id, task_id + ) logger.info(...) try: result, execution_id = await self._reconstruct_and_run_resume( - recovery_result.checkpoint_context_json, + checkpoint_json, ... )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 680 - 745, Extract the initial validation/guard logic in _resume_from_checkpoint (the checkpoint_context_json is None branch that logs EXECUTION_RESUME_FAILED, constructs msg and raises RuntimeError) into a small helper function (e.g., _validate_resume_preconditions or _ensure_checkpoint_context) and call it at the top of _resume_from_checkpoint; ensure the helper takes recovery_result, agent_id and task_id, performs the same logger.error call and raises the same RuntimeError when checkpoint_context_json is None, and update any references in _resume_from_checkpoint to use that helper so the method body falls under the 50-line guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/engine/checkpoint/__init__.py`:
- Around line 14-28: The module exports cleanup_checkpoint_artifacts and
deserialize_and_reconcile from ai_company.engine.checkpoint.resume but omits
make_loop_with_callback; import make_loop_with_callback from
ai_company.engine.checkpoint.resume in
src/ai_company/engine/checkpoint/__init__.py and add "make_loop_with_callback"
to the __all__ list so it is part of the public API (refer to the resume.py
symbol make_loop_with_callback and the existing imports
cleanup_checkpoint_artifacts, deserialize_and_reconcile and the __all__ list).
In `@src/ai_company/engine/checkpoint/strategy.py`:
- Around line 177-189: The current try/except around
self._checkpoint_repo.get_latest in the method swallows only PersistenceError
(re-raises MemoryError/RecursionError) so other errors (e.g., ValueError) will
propagate; update the handler to either (A) catch broad repository errors by
changing the except block to except Exception as e: (but continue to re-raise
MemoryError and RecursionError) and log via
logger.exception(CHECKPOINT_LOAD_FAILED, execution_id=execution_id,
task_id=task_id, error=e) then return None, or (B) if propagation is intended,
add a concise inline comment above the try explaining that only PersistenceError
is handled and all other exceptions should propagate; reference
_checkpoint_repo.get_latest, CHECKPOINT_LOAD_FAILED, and logger.exception when
making the change.
In `@src/ai_company/engine/recovery.py`:
- Around line 229-231: The finalize method in recovery.py currently contains
only a docstring which is a valid no-op but unclear; update the async def
finalize(self, execution_id: str) -> None: implementation to include an explicit
no-op statement (e.g., add a single pass or return None) after the docstring to
make the intent explicit and improve readability for future maintainers.
In `@tests/unit/engine/checkpoint/test_resume.py`:
- Around line 127-152: The tests currently rely on fragile substring matches of
Pydantic's error text when calling deserialize_and_reconcile; change them to
assert the exception type more robustly by either removing the match argument
and just using pytest.raises(ValueError) or (preferred) import
pydantic.ValidationError and replace pytest.raises(ValueError, match="validation
error") with pytest.raises(ValidationError) in the three tests
(test_invalid_json_raises_value_error, test_empty_string_raises,
test_valid_json_but_wrong_schema_raises) so they assert the concrete validation
exception instead of matching error text.
---
Duplicate comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 680-745: Extract the initial validation/guard logic in
_resume_from_checkpoint (the checkpoint_context_json is None branch that logs
EXECUTION_RESUME_FAILED, constructs msg and raises RuntimeError) into a small
helper function (e.g., _validate_resume_preconditions or
_ensure_checkpoint_context) and call it at the top of _resume_from_checkpoint;
ensure the helper takes recovery_result, agent_id and task_id, performs the same
logger.error call and raises the same RuntimeError when checkpoint_context_json
is None, and update any references in _resume_from_checkpoint to use that helper
so the method body falls under the 50-line guideline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ef5542ed-39b0-4a1d-a8d9-afa4f3faada4
📒 Files selected for processing (10)
src/ai_company/engine/agent_engine.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/recovery.pysrc/ai_company/observability/events/checkpoint.pytests/unit/engine/checkpoint/test_resume.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/persistence/sqlite/test_migrations_v6.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
- GitHub Check: Test (Python 3.14)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT usefrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax:except A, B:(no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names:example-provider,example-large-001,example-medium-001,example-small-001. Tests must usetest-provider,test-small-001, etc.
Files:
src/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/__init__.pytests/unit/engine/test_recovery_checkpoint_fields.pysrc/ai_company/engine/recovery.pytests/unit/engine/checkpoint/test_resume.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/observability/events/checkpoint.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/persistence/sqlite/test_migrations_v6.pysrc/ai_company/engine/agent_engine.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction plusMappingProxyTypewrapping for read-only enforcement.
Fordict/listfields in frozen Pydantic models, rely onfrozen=Truefor field reassignment prevention and usecopy.deepcopy()at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 withBaseModel,model_validator,computed_field,ConfigDict. Use@computed_fieldfor derived values instead of storing redundant fields (e.g.,TokenUsage.total_tokens).
UseNotBlankStrfromcore.typesfor all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over barecreate_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER useimport logging,logging.getLogger(), orprint()in application code.
Always useloggeras the variable name for loggers (not_logger, notlog).
Event names must always use constants from domain-specific modules underai_company.observability.events(e.g.,PROVIDER_CALL_STARTfromevents.provider,BUDGET_RECORD_ADDEDfromevents.budget). Import dire...
Files:
src/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/agent_engine.py
src/ai_company/**/[!_]*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Every module with business logic MUST have:
from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Files:
src/ai_company/engine/checkpoint/resume.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/observability/events/checkpoint.pysrc/ai_company/engine/agent_engine.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Tests must use markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: useasyncio_mode = "auto"— no manual@pytest.mark.asyncioneeded.
Test timeout: 30 seconds per test.
Prefer@pytest.mark.parametrizefor testing similar cases.
Files:
tests/unit/engine/test_recovery_checkpoint_fields.pytests/unit/engine/checkpoint/test_resume.pytests/unit/engine/checkpoint/test_strategy.pytests/unit/persistence/sqlite/test_migrations_v6.py
🧠 Learnings (1)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/observability/events/checkpoint.py
🧬 Code graph analysis (6)
src/ai_company/engine/checkpoint/resume.py (6)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(88-120)src/ai_company/engine/context.py (2)
AgentContext(87-307)with_message(173-182)src/ai_company/providers/enums.py (1)
MessageRole(6-12)src/ai_company/engine/loop_protocol.py (1)
ExecutionLoop(151-189)src/ai_company/persistence/repositories.py (2)
CheckpointRepository(484-533)HeartbeatRepository(537-595)
src/ai_company/engine/checkpoint/__init__.py (4)
src/ai_company/engine/checkpoint/callback_factory.py (1)
make_checkpoint_callback(34-133)src/ai_company/engine/checkpoint/models.py (3)
Checkpoint(25-65)CheckpointConfig(88-120)Heartbeat(68-85)src/ai_company/engine/checkpoint/resume.py (2)
cleanup_checkpoint_artifacts(130-176)deserialize_and_reconcile(39-87)src/ai_company/engine/checkpoint/strategy.py (1)
CheckpointRecoveryStrategy(45-297)
src/ai_company/engine/recovery.py (2)
src/ai_company/engine/task_execution.py (1)
TaskExecution(60-231)src/ai_company/engine/checkpoint/strategy.py (1)
finalize(146-148)
tests/unit/engine/checkpoint/test_resume.py (3)
src/ai_company/engine/checkpoint/resume.py (3)
cleanup_checkpoint_artifacts(130-176)deserialize_and_reconcile(39-87)make_loop_with_callback(90-127)src/ai_company/providers/enums.py (1)
MessageRole(6-12)src/ai_company/engine/context.py (2)
AgentContext(87-307)from_identity(140-171)
src/ai_company/engine/checkpoint/strategy.py (9)
src/ai_company/engine/checkpoint/models.py (1)
Checkpoint(25-65)src/ai_company/engine/checkpoint/resume.py (1)
cleanup_checkpoint_artifacts(130-176)src/ai_company/engine/context.py (2)
AgentContext(87-307)to_snapshot(280-302)src/ai_company/engine/recovery.py (4)
FailAndReassignStrategy(161-234)RecoveryResult(31-112)recover(124-143)recover(173-227)src/ai_company/engine/task_execution.py (1)
TaskExecution(60-231)src/ai_company/persistence/errors.py (1)
PersistenceError(8-9)src/ai_company/persistence/repositories.py (8)
CheckpointRepository(484-533)HeartbeatRepository(537-595)get_latest(498-519)get(59-71)get(218-230)get(346-358)get(426-438)get(551-563)src/ai_company/persistence/sqlite/checkpoint_repo.py (1)
get_latest(68-132)src/ai_company/memory/errors.py (1)
MemoryError(13-14)
tests/unit/engine/checkpoint/test_strategy.py (3)
src/ai_company/engine/checkpoint/strategy.py (5)
CheckpointRecoveryStrategy(45-297)get_strategy_type(150-152)recover(83-144)clear_resume_count(154-167)finalize(146-148)src/ai_company/engine/recovery.py (9)
FailAndReassignStrategy(161-234)RecoveryResult(31-112)RecoveryStrategy(116-158)get_strategy_type(156-158)get_strategy_type(232-234)recover(124-143)recover(173-227)finalize(145-154)finalize(229-230)src/ai_company/persistence/errors.py (1)
QueryError(33-34)
🔇 Additional comments (29)
src/ai_company/engine/checkpoint/resume.py (4)
1-36: LGTM!The module is well-structured with proper logger setup, type hints, and docstrings. The event constants are correctly imported from the domain-specific module.
39-87: LGTM — past review comment addressed.The function now correctly uses distinct event constants:
CHECKPOINT_RECOVERY_DESERIALIZE_FAILEDfor the error path (line 65) andCHECKPOINT_RECOVERY_RECONCILIATIONfor the success path (line 82). This enables proper log analysis to distinguish failures from successful reconciliations.
130-176: LGTM!The cleanup function follows best-effort semantics correctly: errors are logged at WARNING level with context but never propagated. The function handles
Nonerepositories gracefully and uses appropriate event constants for each cleanup outcome.
115-116: No action needed—ReactLoop constructor has onlycheckpoint_callbackparameter.ReactLoop's
__init__method accepts onlycheckpoint_callback: CheckpointCallback | None = None. Unlike PlanExecuteLoop, ReactLoop has noconfigparameter, so there are no other constructor parameters to preserve. The code correctly instantiates ReactLoop with its sole parameter.> Likely an incorrect or invalid review comment.src/ai_company/observability/events/checkpoint.py (1)
1-32: LGTM!The event constants are well-organized into logical groups with consistent naming patterns following the
domain.actionconvention. TheCHECKPOINT_RECOVERY_DESERIALIZE_FAILEDconstant (lines 30-32) properly supports distinguishing deserialization failures from successful reconciliations in log analysis.tests/unit/persistence/sqlite/test_migrations_v6.py (2)
91-106: LGTM — past review comment addressed.The expected index set now correctly includes
idx_cp_task_turn(line 104), ensuring comprehensive regression protection for all checkpoint indexes created by the V6 migration.
1-136: LGTM!Comprehensive test coverage for the V6 migration including schema version validation, table creation, column definitions, index verification, and idempotency. Test markers and timeout are properly configured.
tests/unit/engine/checkpoint/test_resume.py (1)
1-319: LGTM!Comprehensive test suite covering happy paths, error handling, and edge cases for all three resume helper functions. The tests properly verify:
- Context deserialization and reconciliation message injection
- Loop callback injection for supported and unsupported loop types
- Best-effort cleanup behavior with error swallowing
tests/unit/engine/test_recovery_checkpoint_fields.py (1)
1-282: LGTM!Thorough test coverage for the new
RecoveryResultcheckpoint-related fields:
can_resumecomputation based oncheckpoint_context_jsonpresence- Consistency validation between
checkpoint_context_jsonandresume_attempt- JSON structure validation (must be object, not array or primitive)
- Backward compatibility with
FailAndReassignStrategyfinalize()idempotencytests/unit/engine/checkpoint/test_strategy.py (1)
1-601: LGTM!Excellent test coverage for
CheckpointRecoveryStrategyincluding:
- Protocol compliance and strategy type verification
- Resume-from-checkpoint flows with valid checkpoints
- Fallback behavior (no checkpoint, max attempts exhausted, repo errors)
- Resume counter semantics (per-execution tracking, clearing, independence)
finalize()behavior and idempotency- Cleanup invocation on fallback paths
- Exception propagation for
MemoryErrorandRecursionErrorThe tests are well-organized into focused classes with clear docstrings.
src/ai_company/engine/recovery.py (3)
62-92: LGTM!The new checkpoint-related fields and validation logic are well-implemented:
checkpoint_context_jsonandresume_attemptfields with appropriate defaults_validate_checkpoint_consistencyproperly enforces that both fields must be set together or both at defaults- JSON validation ensures the checkpoint context is a valid JSON object
106-112: LGTM!The
can_resumecomputed field correctly derives its value from the presence ofcheckpoint_context_json, providing a clean API for callers to check resume eligibility.
145-154: LGTM!The
finalize()method is a well-designed addition to the protocol, enabling post-resume cleanup hooks. The docstring clearly explains the semantics (called after non-ERROR termination).src/ai_company/engine/checkpoint/strategy.py (7)
1-42: LGTM! Module setup follows guidelines.Logger is correctly initialized with
get_logger(__name__), imports are well-organized, and the safety bound constant is appropriately documented.
45-81: LGTM! Class initialization is well-structured.Good use of
asyncio.Lockfor thread-safety of the resume counter, and the NOTE comment clearly documents the known limitation about counter persistence across process restarts.
83-144: LGTM! The recover method addresses the previously identified race condition.The method now correctly captures
resume_attemptfrom_reserve_resume_attemptand passes it to_build_resume_result, ensuring the metadata uses the locked value rather than reading_resume_countsoutside the lock.
146-167: LGTM! Finalization methods are correctly implemented.Proper use of the async lock when clearing resume counts, and the methods are well-documented.
207-243: LGTM! Race condition fix and documented eviction strategy.The method now returns
(should_fallback, resume_count)tuple, resolving the previously identified race condition. The FIFO eviction limitation is appropriately documented in the inline comment (lines 235-238), explaining why it's acceptable given the expected cardinality of crashed executions.
245-274: LGTM! Clean implementation of resume result builder.Correctly accepts
resume_attemptas a parameter rather than reading from_resume_counts, ensuring consistency with the locked value from_reserve_resume_attempt.
276-297: LGTM! Proper cleanup before fallback delegation.The method correctly cleans up checkpoint artifacts before delegating to the fallback strategy, preventing orphaned database rows when the resume path won't be taken.
src/ai_company/engine/agent_engine.py (9)
18-23: LGTM! Checkpoint-related imports are well-organized.The new imports for checkpoint models and resume helpers are properly grouped and follow the existing import structure.
172-180: LGTM! Good defensive validation for repository configuration.The XOR check ensures checkpoint_repo and heartbeat_repo are either both provided or both omitted, preventing inconsistent configuration states.
395-399: Helpful comment explaining cost recording timing.The comment clearly documents the intentional design decision to record costs before recovery, preventing confusion about why resumed executions have separate cost tracking.
500-513: LGTM! Clean delegation to the resume module.Simple and focused helper that wraps the loop with checkpoint callback configuration.
652-661: LGTM! Checkpoint resume path integration.The checkpoint resume logic is cleanly integrated into the recovery flow, properly threading
completion_configandeffective_autonomythrough to_resume_from_checkpoint.
747-776: LGTM! Good extraction of core resume logic.This helper cleanly encapsulates the deserialize-and-execute flow, returning the execution_id needed for finalization.
778-813: LGTM! Resumed loop execution with proper budget handling.Correctly handles the optional task_execution case and constructs the appropriate budget checker.
815-836: LGTM! Proper conditional cleanup after resume.The conditional cleanup (only on non-ERROR termination) correctly preserves checkpoint data for potential subsequent resume attempts when the resumed execution itself fails.
286-293: LGTM! Consistent PEP 758 except syntax.The
except MemoryError, RecursionError:syntax (without parentheses) is correctly used throughout for non-recoverable errors.
- Add content None check in test_resume.py reconciliation assertion - Add type: ignore[comparison-overlap] for CustomLoop identity check - Add finalize() to CustomRecovery in test_agent_engine_errors.py
Set parent_task_id=None in TaskFactory to prevent polyfactory from randomly generating matching id and parent_task_id values, which triggers the "cannot be its own parent" validator.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
tests/unit/engine/checkpoint/test_resume.py (1)
128-153:⚠️ Potential issue | 🟡 MinorMake invalid-payload tests less brittle and parameterized.
Line 129, Line 138, and Line 147 depend on a fragile validation-message substring and duplicate the same structure across three tests.
♻️ Proposed refactor
- def test_invalid_json_raises_value_error(self) -> None: - with pytest.raises(ValueError, match="validation error"): - deserialize_and_reconcile( - checkpoint_json="{not valid json}", - error_message="crash", - agent_id="agent-1", - task_id="task-1", - ) - - def test_empty_string_raises(self) -> None: - with pytest.raises(ValueError, match="validation error"): - deserialize_and_reconcile( - checkpoint_json="", - error_message="crash", - agent_id="agent-1", - task_id="task-1", - ) - - def test_valid_json_but_wrong_schema_raises(self) -> None: - with pytest.raises(ValueError, match="validation error"): - deserialize_and_reconcile( - checkpoint_json='{"not": "an AgentContext"}', - error_message="crash", - agent_id="agent-1", - task_id="task-1", - ) + `@pytest.mark.parametrize`( + "checkpoint_json", + ["{not valid json}", "", '{"not": "an AgentContext"}'], + ids=["invalid_json", "empty_string", "wrong_schema"], + ) + def test_invalid_payload_raises_value_error(self, checkpoint_json: str) -> None: + with pytest.raises(ValueError): + deserialize_and_reconcile( + checkpoint_json=checkpoint_json, + error_message="crash", + agent_id="agent-1", + task_id="task-1", + )#!/bin/bash # Verify fragile match usage and inspect deserialize_and_reconcile error behavior. rg -n 'pytest\.raises\(ValueError,\s*match="validation error"\)' tests/unit/engine/checkpoint/test_resume.py rg -n -C4 'def deserialize_and_reconcile|model_validate_json|raise ValueError|except ValueError' src/ai_company/engine/checkpoint/resume.pyAs per coding guidelines,
Prefer@pytest.mark.parametrizefor testing similar cases.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/engine/checkpoint/test_resume.py` around lines 128 - 153, Tests are brittle and duplicated because they assert a specific substring "validation error" in the ValueError; replace the three tests with a single parametrized test using pytest.mark.parametrize over the invalid checkpoint_json values and assert only that deserialize_and_reconcile raises ValueError (or use a more robust regex like r"validation" if a message check is required). Locate the tests around test_invalid_json_raises_value_error / test_empty_string_raises / test_valid_json_but_wrong_schema_raises and change them to a single parametrized test that calls deserialize_and_reconcile(checkpoint_json=..., error_message="crash", agent_id="agent-1", task_id="task-1") inside a with pytest.raises(ValueError) block (or with pytest.raises(ValueError, match=r"validation") for a looser match).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@tests/unit/engine/checkpoint/test_resume.py`:
- Around line 128-153: Tests are brittle and duplicated because they assert a
specific substring "validation error" in the ValueError; replace the three tests
with a single parametrized test using pytest.mark.parametrize over the invalid
checkpoint_json values and assert only that deserialize_and_reconcile raises
ValueError (or use a more robust regex like r"validation" if a message check is
required). Locate the tests around test_invalid_json_raises_value_error /
test_empty_string_raises / test_valid_json_but_wrong_schema_raises and change
them to a single parametrized test that calls
deserialize_and_reconcile(checkpoint_json=..., error_message="crash",
agent_id="agent-1", task_id="task-1") inside a with pytest.raises(ValueError)
block (or with pytest.raises(ValueError, match=r"validation") for a looser
match).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: ee499e95-a5d7-43c1-8ed0-5795d5dad175
📒 Files selected for processing (2)
tests/unit/engine/checkpoint/test_resume.pytests/unit/engine/test_agent_engine_errors.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Agent
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
- GitHub Check: Test (Python 3.14)
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT usefrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax:except A, B:(no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names:example-provider,example-large-001,example-medium-001,example-small-001. Tests must usetest-provider,test-small-001, etc.
Files:
tests/unit/engine/checkpoint/test_resume.pytests/unit/engine/test_agent_engine_errors.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Tests must use markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: useasyncio_mode = "auto"— no manual@pytest.mark.asyncioneeded.
Test timeout: 30 seconds per test.
Prefer@pytest.mark.parametrizefor testing similar cases.
Files:
tests/unit/engine/checkpoint/test_resume.pytests/unit/engine/test_agent_engine_errors.py
🧬 Code graph analysis (2)
tests/unit/engine/checkpoint/test_resume.py (7)
src/ai_company/engine/checkpoint/models.py (1)
CheckpointConfig(88-120)src/ai_company/security/timeout/park_service.py (1)
resume(108-144)src/ai_company/engine/checkpoint/resume.py (3)
cleanup_checkpoint_artifacts(130-176)deserialize_and_reconcile(39-87)make_loop_with_callback(90-127)src/ai_company/providers/enums.py (1)
MessageRole(6-12)src/ai_company/core/agent.py (1)
AgentIdentity(266-342)src/ai_company/engine/context.py (2)
AgentContext(87-307)from_identity(140-171)src/ai_company/engine/parallel_models.py (2)
agent_id(79-81)task_id(87-89)
tests/unit/engine/test_agent_engine_errors.py (1)
src/ai_company/engine/recovery.py (2)
finalize(145-154)finalize(229-230)
🔇 Additional comments (4)
tests/unit/engine/test_agent_engine_errors.py (1)
556-558: Good recovery-protocol alignment in the test double.Adding
finalize()keepsCustomRecoverycompatible with the expanded recovery lifecycle interface.tests/unit/engine/checkpoint/test_resume.py (3)
58-122: Strong happy-path coverage for deserialization and reconciliation behavior.These tests validate type restoration, reconciliation injection, and turn-count preservation clearly.
161-254: Loop callback injection tests are well-scoped.Good coverage for no-op behavior, supported loop injection, config preservation, and unsupported loop passthrough.
261-320: Cleanup behavior tests are solid and practical.Happy-path plus error-swallowing scenarios are covered and align with best-effort cleanup semantics.
There was a problem hiding this comment.
Pull request overview
Implements checkpoint-based crash recovery (“Strategy 2”) across the engine and persistence layers, enabling per-turn AgentContext checkpointing, heartbeat tracking, and resuming from the latest checkpoint with a reconciliation message.
Changes:
- Added checkpoint/heartbeat domain models, callback factory, resume helpers, and
CheckpointRecoveryStrategywith max-resume attempts + fallback. - Extended persistence protocols and SQLite backend/migrations to store checkpoints + heartbeats (schema v6) and added repository implementations.
- Integrated optional checkpoint callback injection into execution loops and added new observability event constants + docs/tests.
Reviewed changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/persistence/test_protocol.py | Adds fake protocol implementations for checkpoint/heartbeat repos. |
| tests/unit/persistence/test_migrations_v2.py | Updates schema version assertion to use SCHEMA_VERSION. |
| tests/unit/persistence/sqlite/test_migrations_v6.py | New tests validating v6 tables/indexes + idempotency. |
| tests/unit/persistence/sqlite/test_heartbeat_repo.py | New unit tests for SQLiteHeartbeatRepository. |
| tests/unit/persistence/sqlite/test_checkpoint_repo.py | New unit tests for SQLiteCheckpointRepository. |
| tests/unit/observability/test_events.py | Registers/checks new checkpoint events module and constants. |
| tests/unit/engine/test_recovery_checkpoint_fields.py | Tests new RecoveryResult resume fields/validation. |
| tests/unit/engine/test_agent_engine_errors.py | Updates recovery-strategy test double to implement finalize(). |
| tests/unit/engine/checkpoint/test_strategy.py | New tests for CheckpointRecoveryStrategy behavior/counters/cleanup. |
| tests/unit/engine/checkpoint/test_resume.py | New tests for resume helpers (deserialize/reconcile/loop injection/cleanup). |
| tests/unit/engine/checkpoint/test_models.py | New tests for checkpoint/heartbeat/config Pydantic models. |
| tests/unit/engine/checkpoint/test_callback_factory.py | New tests for make_checkpoint_callback() persistence/skip/error handling. |
| tests/unit/engine/checkpoint/init.py | Establishes checkpoint test package. |
| tests/unit/api/conftest.py | Adds in-memory fake checkpoint/heartbeat repos to fake persistence backend. |
| src/ai_company/persistence/sqlite/migrations.py | Bumps schema to v6; adds checkpoints/heartbeats tables + indexes. |
| src/ai_company/persistence/sqlite/heartbeat_repo.py | Implements SQLite heartbeat persistence (upsert/query stale/delete). |
| src/ai_company/persistence/sqlite/checkpoint_repo.py | Implements SQLite checkpoint persistence (save/get_latest/delete_by_execution). |
| src/ai_company/persistence/sqlite/backend.py | Wires checkpoint/heartbeat repositories into SQLite backend. |
| src/ai_company/persistence/sqlite/init.py | Exports new SQLite checkpoint/heartbeat repositories. |
| src/ai_company/persistence/repositories.py | Adds CheckpointRepository and HeartbeatRepository protocols. |
| src/ai_company/persistence/protocol.py | Extends PersistenceBackend protocol with checkpoints/heartbeats properties. |
| src/ai_company/observability/events/persistence.py | Adds persistence event constants for checkpoint/heartbeat repos. |
| src/ai_company/observability/events/execution.py | Adds execution-level checkpoint callback + resume event constants. |
| src/ai_company/observability/events/checkpoint.py | New checkpoint domain event constants module. |
| src/ai_company/engine/recovery.py | Extends RecoveryResult for resume; adds finalize() to strategy protocol. |
| src/ai_company/engine/react_loop.py | Injects optional checkpoint callback after turn completion (pre-tools). |
| src/ai_company/engine/plan_execute_loop.py | Adds optional checkpoint callback support + invocation helper. |
| src/ai_company/engine/checkpoint/strategy.py | New CheckpointRecoveryStrategy with resume counters + fallback behavior. |
| src/ai_company/engine/checkpoint/resume.py | New helpers: deserialize+reconcile, loop injection, best-effort cleanup. |
| src/ai_company/engine/checkpoint/models.py | New frozen Pydantic models: Checkpoint, Heartbeat, CheckpointConfig. |
| src/ai_company/engine/checkpoint/callback_factory.py | New factory producing checkpoint+heartbeat persistence callback closure. |
| src/ai_company/engine/checkpoint/callback.py | Adds CheckpointCallback type alias. |
| src/ai_company/engine/checkpoint/init.py | Exposes checkpoint subpackage public API. |
| src/ai_company/engine/agent_engine.py | Integrates resume flow + loop callback injection; threads configs through recovery. |
| src/ai_company/engine/init.py | Exports checkpoint-related symbols from engine package. |
| docs/design/engine.md | Updates engine crash recovery docs to reflect implemented checkpoint strategy. |
| README.md | Mentions crash recovery + adds Persistence→Engine relationship in architecture diagram. |
| CLAUDE.md | Updates repo/module overview and event-constant guidance with checkpoint domain. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| """Load the latest checkpoint, returning ``None`` on failure.""" | ||
| try: | ||
| checkpoint = await self._checkpoint_repo.get_latest( | ||
| execution_id=execution_id, | ||
| ) | ||
| except MemoryError, RecursionError: | ||
| raise | ||
| except PersistenceError: | ||
| logger.exception( | ||
| CHECKPOINT_LOAD_FAILED, | ||
| execution_id=execution_id, | ||
| task_id=task_id, | ||
| ) | ||
| return None |
| # led to the error) should be tracked. The resumed execution | ||
| # will have its own cost recording when it completes through | ||
| # the normal pipeline. |
| """Invoke the checkpoint callback if configured. | ||
|
|
||
| Errors are logged but never propagated — checkpointing must | ||
| not interrupt execution. | ||
| """ | ||
| if self._checkpoint_callback is None: | ||
| return | ||
| try: | ||
| await self._checkpoint_callback(ctx) | ||
| except MemoryError, RecursionError: | ||
| raise | ||
| except Exception as exc: |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #367 +/- ##
==========================================
- Coverage 93.90% 93.78% -0.13%
==========================================
Files 447 456 +9
Lines 20819 21330 +511
Branches 2011 2046 +35
==========================================
+ Hits 19551 20005 +454
- Misses 981 1033 +52
- Partials 287 292 +5 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…erage Critical: _finalize_resume now calls record_execution_costs and apply_post_execution_transitions on the resumed ExecutionResult, fixing the bug where resumed executions silently skipped cost tracking and task state transitions. Identity threaded through _apply_recovery → _resume_from_checkpoint → _finalize_resume. - Remove dead EXECUTION_CHECKPOINT_CALLBACK constant (never imported) - Add 9 repo error path tests (QueryError wrapping for save/get/delete) - Consolidate 3 brittle deserialize tests into 1 parametrized test - Export make_loop_with_callback from checkpoint __init__.py - Extract _validate_checkpoint_json to keep function under 50 lines - Add clarifying docstring for PersistenceError-only handling - Fix FailAndReassignStrategy.finalize no-op clarity
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/ai_company/engine/agent_engine.py (3)
412-426:⚠️ Potential issue | 🔴 CriticalDo not route wall-clock timeouts into checkpoint resume.
Line 567-571 turns an expired
timeout_secondsintoTerminationReason.ERROR, and this block then feeds it into recovery._resume_from_checkpoint()explicitly drops the deadline, so a run that already timed out can continue well past the caller's wall-clock limit. Please distinguish deadline expiry from ordinary execution errors and short-circuit checkpoint recovery for that case.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 412 - 426, The code currently treats any TerminationReason.ERROR the same and sends the run into _apply_recovery (which may call _resume_from_checkpoint and drop the wall-clock deadline); change the logic so that if the failure is a wall-clock timeout (the expiration of timeout_seconds / a deadline expiry recorded on execution_result) you short-circuit and do NOT call _apply_recovery or resume from checkpoint—return or propagate the timeout execution_result instead; otherwise continue to call _apply_recovery as before. Inspect execution_result for the timeout/deadline marker (e.g., a termination_reason value indicating timeout or a deadline_expired/timed_out flag on execution_result/context) and use that to decide.
655-680:⚠️ Potential issue | 🟠 MajorResume-path exceptions should not fall back to the stale pre-recovery result.
If
deserialize_and_reconcile()or the resumed loop raises, this handler logs the failure and returns the originalexecution_result. That result still carries the oldtask_execution, so a bad checkpoint or resume failure can leave the task in its stale pre-recovery state instead of failing/reassigning it. The malformed-checkpoint cases already covered intests/unit/engine/checkpoint/test_resume.pymake this reachable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 655 - 680, The current except Exception handler in agent_engine.py logs recovery/resume failures and then returns the original stale execution_result (which still contains the pre-recovery task_execution), allowing a bad checkpoint/resume to leave the task in a stale state; instead, after logging the error you should not return the old execution_result—re-raise the exception (or raise a new contextual exception) so the caller can mark the task failed/reassign it. Concretely: in the except Exception as exc block that references EXECUTION_RECOVERY_FAILED, agent_id, task_id, change the final line from "return execution_result" to "raise" (or "raise RuntimeError(...) from exc" including agent_id/task_id), keeping the existing logger.exception call and the separate re-raises for MemoryError/RecursionError; this ensures failures from deserialize_and_reconcile, _resume_from_checkpoint, or any recovery logic do not silently fall back to a stale task_execution.
427-448:⚠️ Potential issue | 🟠 MajorSkip the second TaskEngine sync after a successful resume.
_finalize_resume()already callsapply_post_execution_transitions()on Line 847-852, andsrc/ai_company/engine/task_sync.pyshows that helper performs the TaskEngine sync itself. When recovery returns a non-ERRORresult, this block issues anothersync_to_task_engine()for the same state change.💡 Minimal guard
- if ( - ctx.task_execution is not None - and pre_recovery_status is not None - and ctx.task_execution.status != pre_recovery_status - ): + if ( + execution_result.termination_reason == TerminationReason.ERROR + and ctx.task_execution is not None + and pre_recovery_status is not None + and ctx.task_execution.status != pre_recovery_status + ):🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 427 - 448, This block is causing a duplicate TaskEngine sync after a successful resume because _finalize_resume() already calls apply_post_execution_transitions() which performs the sync; guard the second sync so it only runs when recovery actually failed (e.g., check execution_result.recovery_result == RecoveryResult.ERROR or equivalent) before calling sync_to_task_engine with ctx.task_execution; reference the variables execution_result, ctx.task_execution, pre_recovery_status, sync_to_task_engine, _finalize_resume(), and apply_post_execution_transitions() to find and update the conditional that wraps the sync_to_task_engine call.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 412-426: The code currently treats any TerminationReason.ERROR the
same and sends the run into _apply_recovery (which may call
_resume_from_checkpoint and drop the wall-clock deadline); change the logic so
that if the failure is a wall-clock timeout (the expiration of timeout_seconds /
a deadline expiry recorded on execution_result) you short-circuit and do NOT
call _apply_recovery or resume from checkpoint—return or propagate the timeout
execution_result instead; otherwise continue to call _apply_recovery as before.
Inspect execution_result for the timeout/deadline marker (e.g., a
termination_reason value indicating timeout or a deadline_expired/timed_out flag
on execution_result/context) and use that to decide.
- Around line 655-680: The current except Exception handler in agent_engine.py
logs recovery/resume failures and then returns the original stale
execution_result (which still contains the pre-recovery task_execution),
allowing a bad checkpoint/resume to leave the task in a stale state; instead,
after logging the error you should not return the old execution_result—re-raise
the exception (or raise a new contextual exception) so the caller can mark the
task failed/reassign it. Concretely: in the except Exception as exc block that
references EXECUTION_RECOVERY_FAILED, agent_id, task_id, change the final line
from "return execution_result" to "raise" (or "raise RuntimeError(...) from exc"
including agent_id/task_id), keeping the existing logger.exception call and the
separate re-raises for MemoryError/RecursionError; this ensures failures from
deserialize_and_reconcile, _resume_from_checkpoint, or any recovery logic do not
silently fall back to a stale task_execution.
- Around line 427-448: This block is causing a duplicate TaskEngine sync after a
successful resume because _finalize_resume() already calls
apply_post_execution_transitions() which performs the sync; guard the second
sync so it only runs when recovery actually failed (e.g., check
execution_result.recovery_result == RecoveryResult.ERROR or equivalent) before
calling sync_to_task_engine with ctx.task_execution; reference the variables
execution_result, ctx.task_execution, pre_recovery_status, sync_to_task_engine,
_finalize_resume(), and apply_post_execution_transitions() to find and update
the conditional that wraps the sync_to_task_engine call.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f926dbc2-15da-49bf-b7fe-d06923a5dd25
📒 Files selected for processing (8)
src/ai_company/engine/agent_engine.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/recovery.pysrc/ai_company/observability/events/execution.pytests/unit/engine/checkpoint/test_resume.pytests/unit/persistence/sqlite/test_checkpoint_repo.pytests/unit/persistence/sqlite/test_heartbeat_repo.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build Backend
- GitHub Check: Greptile Review
- GitHub Check: Test (Python 3.14)
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Python version 3.14+ with PEP 649 native lazy annotations required.
Do NOT usefrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations.
Use PEP 758 except syntax:except A, B:(no parentheses) — ruff enforces this on Python 3.14.
Line length must be 88 characters, enforced by ruff.
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names:example-provider,example-large-001,example-medium-001,example-small-001. Tests must usetest-provider,test-small-001, etc.
Files:
tests/unit/engine/checkpoint/test_resume.pytests/unit/persistence/sqlite/test_checkpoint_repo.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/recovery.pytests/unit/persistence/sqlite/test_heartbeat_repo.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/agent_engine.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Tests must use markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Test coverage minimum: 80% (enforced in CI).
Async tests: useasyncio_mode = "auto"— no manual@pytest.mark.asyncioneeded.
Test timeout: 30 seconds per test.
Prefer@pytest.mark.parametrizefor testing similar cases.
Files:
tests/unit/engine/checkpoint/test_resume.pytests/unit/persistence/sqlite/test_checkpoint_repo.pytests/unit/persistence/sqlite/test_heartbeat_repo.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/**/*.py: All public functions must have type hints. Enforce via mypy strict mode.
Google-style docstrings required on public classes and functions, enforced by ruff D rules.
Create new objects instead of mutating existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction plusMappingProxyTypewrapping for read-only enforcement.
Fordict/listfields in frozen Pydantic models, rely onfrozen=Truefor field reassignment prevention and usecopy.deepcopy()at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization).
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
Use Pydantic v2 withBaseModel,model_validator,computed_field,ConfigDict. Use@computed_fieldfor derived values instead of storing redundant fields (e.g.,TokenUsage.total_tokens).
UseNotBlankStrfromcore.typesfor all identifier/name fields in Pydantic models, including optional (NotBlankStr | None) and tuple (tuple[NotBlankStr, ...]) variants, instead of manual whitespace validators.
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over barecreate_task.
Functions must be less than 50 lines; files must be less than 800 lines.
Handle errors explicitly, never silently swallow exceptions.
Validate at system boundaries: user input, external APIs, config files.
NEVER useimport logging,logging.getLogger(), orprint()in application code.
Always useloggeras the variable name for loggers (not_logger, notlog).
Event names must always use constants from domain-specific modules underai_company.observability.events(e.g.,PROVIDER_CALL_STARTfromevents.provider,BUDGET_RECORD_ADDEDfromevents.budget). Import dire...
Files:
src/ai_company/observability/events/execution.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/checkpoint/__init__.pysrc/ai_company/engine/agent_engine.py
src/ai_company/**/[!_]*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Every module with business logic MUST have:
from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Files:
src/ai_company/observability/events/execution.pysrc/ai_company/engine/recovery.pysrc/ai_company/engine/checkpoint/strategy.pysrc/ai_company/engine/agent_engine.py
🧠 Learnings (2)
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Event names must always use constants from domain-specific modules under `ai_company.observability.events` (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/observability/events/execution.py
📚 Learning: 2026-03-13T21:03:58.906Z
Learnt from: CR
Repo: Aureliolo/synthorg PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-13T21:03:58.906Z
Learning: Applies to src/**/*.py : Handle errors explicitly, never silently swallow exceptions.
Applied to files:
src/ai_company/engine/checkpoint/strategy.py
🧬 Code graph analysis (2)
tests/unit/engine/checkpoint/test_resume.py (1)
src/ai_company/engine/checkpoint/resume.py (3)
cleanup_checkpoint_artifacts(130-176)deserialize_and_reconcile(39-87)make_loop_with_callback(90-127)
src/ai_company/engine/agent_engine.py (5)
src/ai_company/engine/checkpoint/resume.py (3)
cleanup_checkpoint_artifacts(130-176)deserialize_and_reconcile(39-87)make_loop_with_callback(90-127)src/ai_company/persistence/repositories.py (2)
CheckpointRepository(484-533)HeartbeatRepository(537-595)src/ai_company/engine/loop_protocol.py (4)
ExecutionLoop(151-189)ExecutionResult(79-140)execute(159-185)TerminationReason(28-36)src/ai_company/engine/context.py (1)
AgentContext(87-307)src/ai_company/engine/task_sync.py (1)
apply_post_execution_transitions(171-227)
🔇 Additional comments (1)
src/ai_company/engine/recovery.py (1)
72-112: Nice invariant on resume metadata.The post-validation plus
can_resumecomputed field makes the checkpoint-resume path much harder to misuse downstream.
Normal completions (COMPLETED, MAX_TURNS, BUDGET_EXHAUSTED, SHUTDOWN, PARKED) bypassed the recovery block entirely, leaving checkpoint and heartbeat rows orphaned in SQLite indefinitely. Now _post_execution_pipeline calls cleanup_checkpoint_artifacts and recovery_strategy.finalize on all non-ERROR terminations.
There was a problem hiding this comment.
Pull request overview
Implements the “Checkpoint Recovery” crash-recovery strategy across the engine and persistence layers, enabling per-turn AgentContext checkpointing with heartbeat-based failure detection and resume-from-checkpoint execution flow.
Changes:
- Added checkpoint/heartbeat Pydantic models, callback factory + loop injection, and a
CheckpointRecoveryStrategywith resume attempt tracking and fallback. - Extended persistence protocols + SQLite backend with V6 migrations and new checkpoint/heartbeat repositories.
- Wired resume flow into
AgentEngine, updated execution loops to invoke checkpoint callbacks, and expanded observability event constants + unit tests.
Reviewed changes
Copilot reviewed 38 out of 39 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/persistence/test_protocol.py | Extends protocol conformance tests for new checkpoint/heartbeat repositories. |
| tests/unit/persistence/test_migrations_v2.py | Updates migration assertions to use SCHEMA_VERSION. |
| tests/unit/persistence/sqlite/test_migrations_v6.py | Adds coverage for V6 schema migration (tables + indexes + idempotency). |
| tests/unit/persistence/sqlite/test_heartbeat_repo.py | Adds SQLiteHeartbeatRepository unit tests. |
| tests/unit/persistence/sqlite/test_checkpoint_repo.py | Adds SQLiteCheckpointRepository unit tests. |
| tests/unit/observability/test_events.py | Ensures checkpoint events module is discovered and constants exist. |
| tests/unit/engine/test_recovery_checkpoint_fields.py | Validates new RecoveryResult checkpoint-related fields and invariants. |
| tests/unit/engine/test_agent_engine_errors.py | Updates test strategy stub to satisfy new finalize() protocol method. |
| tests/unit/engine/checkpoint/test_strategy.py | Adds unit tests for CheckpointRecoveryStrategy behavior and counter logic. |
| tests/unit/engine/checkpoint/test_resume.py | Tests resume helpers (deserialize/reconcile, callback injection, cleanup). |
| tests/unit/engine/checkpoint/test_models.py | Adds tests for checkpoint/heartbeat/config models and validation. |
| tests/unit/engine/checkpoint/test_callback_factory.py | Adds tests for callback persistence frequency, heartbeat updates, error handling. |
| tests/unit/engine/checkpoint/init.py | Introduces checkpoint test package marker. |
| tests/unit/core/conftest.py | Updates Task factory to include parent_task_id. |
| tests/unit/api/conftest.py | Adds in-memory checkpoint/heartbeat repos to fake persistence backend for API tests. |
| src/ai_company/persistence/sqlite/migrations.py | Bumps schema to V6; adds checkpoints + heartbeats tables and indexes. |
| src/ai_company/persistence/sqlite/heartbeat_repo.py | Implements SQLiteHeartbeatRepository with UTC normalization and structured logging. |
| src/ai_company/persistence/sqlite/checkpoint_repo.py | Implements SQLiteCheckpointRepository with filtered “latest” queries and logging. |
| src/ai_company/persistence/sqlite/backend.py | Wires new repositories into SQLitePersistenceBackend lifecycle/properties. |
| src/ai_company/persistence/sqlite/init.py | Exports new SQLite repositories. |
| src/ai_company/persistence/repositories.py | Adds CheckpointRepository and HeartbeatRepository protocols. |
| src/ai_company/persistence/protocol.py | Extends PersistenceBackend protocol with checkpoint/heartbeat repositories. |
| src/ai_company/observability/events/persistence.py | Adds persistence event constants for checkpoints/heartbeats. |
| src/ai_company/observability/events/execution.py | Adds execution-level events for checkpoint callback and resume flow. |
| src/ai_company/observability/events/checkpoint.py | Introduces checkpoint-domain event constants module. |
| src/ai_company/engine/recovery.py | Extends RecoveryResult with checkpoint resume fields + validation; adds finalize() to protocol. |
| src/ai_company/engine/react_loop.py | Adds optional checkpoint callback invocation per turn with guarded error logging. |
| src/ai_company/engine/plan_execute_loop.py | Adds optional checkpoint callback invocation for planner/step turns; exposes config property. |
| src/ai_company/engine/checkpoint/strategy.py | Implements CheckpointRecoveryStrategy with resume attempt tracking + fallback cleanup. |
| src/ai_company/engine/checkpoint/resume.py | Adds helpers for deserialization/reconciliation, loop injection, and artifact cleanup. |
| src/ai_company/engine/checkpoint/models.py | Adds Checkpoint, Heartbeat, and CheckpointConfig frozen models + validation. |
| src/ai_company/engine/checkpoint/callback_factory.py | Adds make_checkpoint_callback() to persist checkpoints + heartbeats best-effort. |
| src/ai_company/engine/checkpoint/callback.py | Defines the CheckpointCallback type alias. |
| src/ai_company/engine/checkpoint/init.py | Exposes checkpoint module public API. |
| src/ai_company/engine/agent_engine.py | Integrates checkpoint callback injection, resume-from-checkpoint path, and cleanup/finalization threading. |
| src/ai_company/engine/init.py | Re-exports checkpoint-related engine symbols. |
| docs/design/engine.md | Updates design docs to reflect implemented checkpoint recovery strategy and fields. |
| README.md | Updates overview/diagram to reflect persistence→engine integration + crash recovery. |
| CLAUDE.md | Updates repo structure and observability event constant guidance to include checkpoint. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| try: | ||
| checkpoint_ctx = AgentContext.model_validate_json(checkpoint_json) | ||
| except ValueError: | ||
| logger.exception( | ||
| CHECKPOINT_RECOVERY_DESERIALIZE_FAILED, | ||
| agent_id=agent_id, |
| if execution_result.termination_reason != TerminationReason.ERROR: | ||
| exec_id = execution_result.context.execution_id | ||
| if self._recovery_strategy is not None: | ||
| await self._recovery_strategy.finalize(exec_id) | ||
| await cleanup_checkpoint_artifacts( |
🤖 I have created a release *beep* *boop* --- ## [0.1.4](v0.1.3...v0.1.4) (2026-03-14) ### Features * add approval workflow gates to TaskEngine ([#387](#387)) ([2db968a](2db968a)) * implement checkpoint recovery strategy ([#367](#367)) ([f886838](f886838)) ### CI/CD * add npm and pre-commit ecosystems to Dependabot ([#369](#369)) ([54e5fe7](54e5fe7)) * bump actions/setup-node from 4.4.0 to 6.3.0 ([#360](#360)) ([2db5105](2db5105)) * bump github/codeql-action from 3.32.6 to 4.32.6 ([#361](#361)) ([ce766e8](ce766e8)) * group major dependabot bumps per ecosystem ([#388](#388)) ([3c43aef](3c43aef)) ### Maintenance * bump @vitejs/plugin-vue from 5.2.4 to 6.0.5 in /web ([#382](#382)) ([d7054ee](d7054ee)) * bump @vue/tsconfig from 0.7.0 to 0.9.0 in /web in the minor-and-patch group across 1 directory ([#371](#371)) ([64fa08b](64fa08b)) * bump astro from 5.18.1 to 6.0.4 in /site ([#376](#376)) ([d349317](d349317)) * bump https://github.com/astral-sh/ruff-pre-commit from v0.15.5 to 0.15.6 ([#372](#372)) ([dcacb2e](dcacb2e)) * bump https://github.com/gitleaks/gitleaks from v8.24.3 to 8.30.1 ([#375](#375)) ([a18e6ed](a18e6ed)) * bump https://github.com/hadolint/hadolint from v2.12.0 to 2.14.0 ([#373](#373)) ([47b906b](47b906b)) * bump https://github.com/pre-commit/pre-commit-hooks from v5.0.0 to 6.0.0 ([#374](#374)) ([1926555](1926555)) * bump litellm from 1.82.1 to 1.82.2 in the minor-and-patch group ([#385](#385)) ([fa4f7b7](fa4f7b7)) * bump node from 22-alpine to 25-alpine in /docker/web ([#359](#359)) ([8d56cd3](8d56cd3)) * bump node from 22-slim to 25-slim in /docker/sandbox ([#358](#358)) ([3de8748](3de8748)) * bump pinia from 2.3.1 to 3.0.4 in /web ([#381](#381)) ([c78dcc2](c78dcc2)) * bump the major group across 1 directory with 9 updates ([#389](#389)) ([9fa621b](9fa621b)) * bump the minor-and-patch group with 2 updates ([#362](#362)) ([6ede2ce](6ede2ce)) * bump vue-router from 4.6.4 to 5.0.3 in /web ([#378](#378)) ([6c60f6c](6c60f6c)) * expand review skills to 18 smart conditional agents ([#364](#364)) ([494013f](494013f)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Summary
AgentContextpersistence, heartbeat-based failure detection, resume from last checkpoint with reconciliation, max resume attempts with fallback to fail-and-reassignCheckpoint,Heartbeat,CheckpointConfigfrozen Pydantic models,CheckpointRecoveryStrategywith async-safe resume counter,CheckpointCallbacktype alias, andmake_checkpoint_callbackfactoryCheckpointRepositoryandHeartbeatRepositoryprotocols with SQLite implementations, V6 migration (checkpoints + heartbeats tables with indexes)ReactLoopandPlanExecuteLoop, resume path inAgentEnginewithcompletion_configandeffective_autonomythreadingevents/checkpoint,events/execution, andevents/persistenceCloses #201
Post-PR Review
Reviewed by 15 local agents + 5 external reviewers (Copilot, Gemini, Greptile, CodeRabbit, Socket). Two review rounds:
Round 1: 46 findings addressed from 15 agents + 5 external reviewers
Round 2: 8 additional findings addressed from CodeRabbit + Greptile re-review
Key fixes across both rounds:
execution_idmismatch causing checkpoint cleanup to never work (Copilot, Greptile, 4 agents)_resume_from_checkpoint(118→35 lines),recover()(108→30 lines), newcheckpoint/resume.pymoduleTest plan
🤖 Generated with Claude Code