feat: implement single-task execution lifecycle (#21) (#144)
Conversation
- Add post-execution task transitions: COMPLETED runs auto-complete via two-hop IN_PROGRESS → IN_REVIEW → COMPLETED; non-completion reasons (MAX_TURNS, BUDGET_EXHAUSTED, ERROR) leave the task IN_PROGRESS
- Add TaskCompletionMetrics model (engine/metrics.py) for proxy overhead metrics per DESIGN_SPEC §10.5: turns_per_task, tokens_per_task, cost_per_task, duration_seconds
- Add completion_summary computed field on AgentRunResult (last assistant message content)
- Add wall-clock timeout via asyncio.wait_for() with timeout_seconds parameter on AgentEngine.run()
- Add EXECUTION_ENGINE_TASK_METRICS and EXECUTION_ENGINE_TIMEOUT event constants
- Log TaskCompletionMetrics at INFO on every run completion
- Update existing tests for COMPLETED final status
- Add comprehensive tests: post-execution transitions (7 tests), timeout (3 tests), metrics (9 tests), completion_summary (5 tests), full lifecycle integration test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Guard `except TimeoutError` to re-raise non-wall-clock timeouts
- Wrap post-execution transitions in try/except to protect successful results from bookkeeping failures
- Move duration snapshot after cost recording + transitions for accurate wall-clock measurement
- Remove redundant `execution_result` param from `_log_completion`
- Change `raise exc from None` to `raise exc from build_exc` to preserve both exceptions in the traceback chain
- Update `timeout_seconds` and `_apply_post_execution_transitions` docstrings for accuracy
- Add TODO(M4) marker for auto-complete scaffolding
- Split test_agent_engine.py (1042 lines) into two files under 800
- Update DESIGN_SPEC.md: §6.5 run() signature, pipeline steps, constants count, computed fields; §10.5 metric sources and TaskCompletionMetrics model; §15.3 add metrics.py

Pre-reviewed by 9 agents, 18 findings addressed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
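The two-hop auto-complete described in the first bullet above can be sketched as a pure function. This is a minimal illustration, not the project's actual `_apply_post_execution_transitions`: the enum members mirror the commit message, while the helper name and return shape are made up for clarity.

```python
from enum import Enum


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    IN_REVIEW = "in_review"
    COMPLETED = "completed"


class TerminationReason(Enum):
    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"
    ERROR = "error"


def post_execution_path(reason: TerminationReason) -> list[TaskStatus]:
    """Status path after the loop returns: two-hop on success, no-op otherwise.

    COMPLETED runs walk IN_PROGRESS -> IN_REVIEW -> COMPLETED; any other
    termination reason leaves the task parked at IN_PROGRESS.
    """
    path = [TaskStatus.IN_PROGRESS]
    if reason is TerminationReason.COMPLETED:
        path += [TaskStatus.IN_REVIEW, TaskStatus.COMPLETED]
    return path
```

Modeling the path as a list keeps both hops visible, which matters because each transition is logged individually at INFO per the project's logging rules.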
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
⚠️ Caution: Review failed. The pull request is closed.

ℹ️ Recent review info — ⚙️ Run configuration: Organization UI · Review profile: ASSERTIVE · Plan: Pro · Run ID:

📒 Files selected for processing (7)
📝 Walkthrough (Summary by CodeRabbit)

Adds task-level timeout and post-execution state transitions, exposes per-task completion metrics and completion_summary on run results, introduces TaskCompletionMetrics and timeout/metrics observability events, and scaffolds a ReAct loop and recovery strategy while updating tests to assert the new lifecycle and metrics.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant AgentEngine
    participant ReActLoop as ReAct Loop
    participant Observability
    participant TaskStore as Task State Store
    Client->>AgentEngine: run(identity, task, timeout_seconds?)
    AgentEngine->>TaskStore: validate assignment & set IN_PROGRESS
    AgentEngine->>ReActLoop: start execution loop (async)
    alt timeout provided
        ReActLoop-->>AgentEngine: running...
        AgentEngine->>ReActLoop: await with timeout (asyncio.wait_for)
        Note right of AgentEngine: on timeout -> cancel loop
        AgentEngine->>Observability: emit EXECUTION_ENGINE_TIMEOUT
        AgentEngine->>AgentEngine: build ERROR ExecutionResult
    else completes or errors
        ReActLoop-->>AgentEngine: ExecutionResult (COMPLETED/ERROR/OTHER)
    end
    AgentEngine->>AgentEngine: _apply_post_execution_transitions(result)
    AgentEngine->>TaskStore: transition IN_PROGRESS→IN_REVIEW→COMPLETED (if COMPLETED)
    AgentEngine->>Observability: emit EXECUTION_ENGINE_TASK_METRICS (TaskCompletionMetrics.from_run_result)
    AgentEngine-->>Client: return AgentRunResult (includes completion_summary, metrics)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the

Highlights
Changelog
Activity
Greptile Summary

This PR implements the full single-task execution lifecycle for

Key findings:
Confidence Score: 3/5
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant AgentEngine
    participant _run_loop_with_timeout
    participant ExecutionLoop
    participant _record_costs
    participant _apply_post_execution_transitions
    participant _log_completion
    Caller->>AgentEngine: run(identity, task, timeout_seconds?)
    AgentEngine->>AgentEngine: _validate_run_inputs()
    AgentEngine->>AgentEngine: _validate_agent()
    AgentEngine->>AgentEngine: _validate_task()
    AgentEngine->>AgentEngine: _prepare_context() → ASSIGNED→IN_PROGRESS
    AgentEngine->>_run_loop_with_timeout: await (ctx, timeout_seconds)
    alt timeout_seconds is None
        _run_loop_with_timeout->>ExecutionLoop: await execute()
        ExecutionLoop-->>_run_loop_with_timeout: ExecutionResult (with turns)
    else timeout_seconds set
        _run_loop_with_timeout->>ExecutionLoop: asyncio.create_task(execute())
        _run_loop_with_timeout->>_run_loop_with_timeout: asyncio.wait({loop_task}, timeout)
        alt loop finishes in time
            ExecutionLoop-->>_run_loop_with_timeout: ExecutionResult (with turns)
        else wall-clock timeout fires
            _run_loop_with_timeout->>ExecutionLoop: loop_task.cancel()
            Note over _run_loop_with_timeout: Returns ExecutionResult(ctx=pre-exec ctx)<br/>⚠️ partial turns/cost are lost
        end
    end
    _run_loop_with_timeout-->>AgentEngine: ExecutionResult
    AgentEngine->>_record_costs: await (execution_result)
    _record_costs-->>AgentEngine: (costs recorded per turn, or nothing if empty)
    AgentEngine->>_apply_post_execution_transitions: (execution_result)
    alt TerminationReason.COMPLETED
        _apply_post_execution_transitions->>_apply_post_execution_transitions: IN_PROGRESS→IN_REVIEW→COMPLETED
    else other reason
        _apply_post_execution_transitions->>_apply_post_execution_transitions: no-op
    end
    _apply_post_execution_transitions-->>AgentEngine: ExecutionResult (updated ctx)
    AgentEngine->>AgentEngine: build AgentRunResult(duration_seconds)
    AgentEngine->>_log_completion: (result, duration)
    _log_completion->>_log_completion: TaskCompletionMetrics.from_run_result()
    _log_completion-->>AgentEngine: (metrics logged)
    AgentEngine-->>Caller: AgentRunResult
```
Last reviewed commit: 73b7769
```python
        start=start,
        timeout_seconds=timeout_seconds,
    )
except MemoryError, RecursionError:
```
Python 2 except comma syntax — not valid Python 3
except MemoryError, RecursionError: is Python 2 syntax and is a SyntaxError in Python 3. Python 2 parsed this as "catch MemoryError, bind it to the name RecursionError" — it does NOT catch both exception types. The Python 3 form to catch multiple exceptions requires parentheses. This same pattern appears on lines 584 and 669 as well.
```diff
- except MemoryError, RecursionError:
+ except (MemoryError, RecursionError):
```
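For context: on Python 3.13 and earlier, the parenthesized tuple is the only valid multi-exception form, while the repo guidelines quoted later on this page target Python 3.14, where PEP 758 also permits the bare comma form (when no `as` clause is used). A quick check of the portable parenthesized form, with an illustrative helper name:

```python
def classify(exc: Exception) -> str:
    """One handler catches two exception types via the parenthesized tuple."""
    try:
        raise exc
    except (MemoryError, RecursionError):
        # Both types land here; a Python 2-style bare comma instead bound
        # the second name as the `as` target and caught only the first type.
        return "resource-exhaustion"
    except Exception:
        return "other"
```

The parenthesized form is safe on every Python 3 version, which is why review bots flag the bare comma even when the target interpreter accepts it.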
Path: `src/ai_company/engine/agent_engine.py`, line 192.
Pull request overview
Implements the full single-task execution lifecycle in AgentEngine.run() (validation → prompt/context setup → execution loop with optional wall-clock timeout → cost recording → post-execution transitions), and adds per-run result summarization and proxy overhead metrics to support issue #21 and DESIGN_SPEC alignment.
Changes:
- Add `timeout_seconds` support (via `asyncio.wait_for`) and post-execution task transitions (IN_PROGRESS → IN_REVIEW → COMPLETED on success) in `AgentEngine`.
- Introduce `TaskCompletionMetrics` model with `from_run_result()` factory; log new execution events for timeout + task metrics.
- Extend `AgentRunResult` with `completion_summary` and update/add unit + integration tests reflecting the new lifecycle behavior.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/engine/test_run_result.py | Adds unit coverage for AgentRunResult.completion_summary behavior (assistant-only, skips tool-call-only/empty content). |
| tests/unit/engine/test_metrics.py | Adds unit tests for TaskCompletionMetrics construction, validation, and extraction from AgentRunResult. |
| tests/unit/engine/test_agent_engine_lifecycle.py | Adds unit tests for post-execution transitions, timeout behavior, and metrics computability. |
| tests/unit/engine/test_agent_engine.py | Updates existing unit tests to reflect auto-completion transitions and context expectations. |
| tests/integration/engine/test_agent_engine_integration.py | Updates existing integration assertion and adds a full lifecycle integration test (ASSIGNED → COMPLETED) plus metrics/summary checks. |
| src/ai_company/observability/events/execution.py | Adds new engine event constants for task metrics and timeout. |
| src/ai_company/engine/run_result.py | Adds computed completion_summary derived from the last assistant message with non-empty content. |
| src/ai_company/engine/metrics.py | Introduces TaskCompletionMetrics frozen model and from_run_result() factory. |
| src/ai_company/engine/agent_engine.py | Implements timeout support, post-execution transitions, and logs task metrics; updates completion logging. |
| src/ai_company/engine/init.py | Re-exports TaskCompletionMetrics from the engine package. |
| DESIGN_SPEC.md | Updates spec sections to reflect new run() signature, pipeline steps, computed fields, metrics model, and repo structure. |
```python
    )
else:
    execution_result = await coro
except TimeoutError:
```
asyncio.wait_for() raises asyncio.TimeoutError, but this handler catches the broad built-in TimeoutError. If the underlying loop/provider/tooling raises its own TimeoutError while timeout_seconds is set, it will be misclassified as a wall-clock timeout and the original error context will be lost. Prefer catching asyncio.TimeoutError (or catching around only the wait_for call) so non-wall-clock timeouts propagate into the normal fatal-error path/logging.
```diff
- except TimeoutError:
+ except asyncio.TimeoutError:
```
```python
metrics = TaskCompletionMetrics.from_run_result(result)
logger.info(
    EXECUTION_ENGINE_TASK_METRICS,
    agent_id=agent_id,
    task_id=task_id,
```
EXECUTION_ENGINE_TASK_METRICS is logged unconditionally, even when the run ends with TerminationReason.ERROR / MAX_TURNS / BUDGET_EXHAUSTED. Given the naming (TaskCompletionMetrics) and spec wording (“logged at task completion”), this will emit misleading “completion” metrics for incomplete tasks. Consider either (a) only logging this event when result.is_success/termination_reason == COMPLETED, or (b) include termination_reason in the metrics event (and update naming/docs) to make it clear these are per-run metrics.
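Both options from the comment can be sketched as small pure helpers. The names and the event-field shape below are illustrative; only the gating logic is the point.

```python
from enum import Enum


class TerminationReason(Enum):
    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"
    ERROR = "error"


def should_emit_completion_metrics(reason: TerminationReason) -> bool:
    """Option (a): emit the task-metrics event only for genuinely completed runs."""
    return reason is TerminationReason.COMPLETED


def metrics_event_fields(reason: TerminationReason) -> dict[str, str]:
    """Option (b): always emit, but tag the event so dashboards can filter."""
    return {"termination_reason": reason.value}
```

Option (b) preserves per-run overhead data for failed runs too, at the cost of a slightly misleading event name unless it is renamed.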
Code Review
This pull request implements the single-task execution lifecycle within the AgentEngine, introducing features like wall-clock timeouts, post-execution task status transitions, and completion metrics. However, a high-severity security issue was identified: the run method lacks an ownership check to ensure that the agent executing the task is the one assigned to it, which could allow unauthorized agents to execute and transition tasks. It is recommended to add a check to verify that task.assigned_to matches the agent's ID before proceeding with execution. Additionally, a critical syntax issue prevents the code from running in Python 3, and there's a potential portability issue with exception handling.
```python
        start=start,
        timeout_seconds=timeout_seconds,
    )
except MemoryError, RecursionError:
```
```python
async def run(  # noqa: PLR0913
    self,
    *,
    identity: AgentIdentity,
    task: Task,
    completion_config: CompletionConfig | None = None,
    max_turns: int = DEFAULT_MAX_TURNS,
    memory_messages: tuple[ChatMessage, ...] = (),
    timeout_seconds: float | None = None,
) -> AgentRunResult:
```
The run method orchestrates the agent execution lifecycle but fails to verify that the provided agent is actually assigned to the task. While it validates the agent's status and the task's status, it does not check the assigned_to field of the task against the agent's ID. This allows any active agent to execute any task that is in an ASSIGNED or IN_PROGRESS state, potentially leading to unauthorized access to task details and unauthorized state transitions. Although this check might be performed at a higher level, as the 'Top-level orchestrator', the AgentEngine should enforce this authorization boundary defensively.
```python
    )
else:
    execution_result = await coro
except TimeoutError:
```
asyncio.wait_for raises asyncio.TimeoutError. While this is an alias for the built-in TimeoutError in Python 3.11+, using except TimeoutError: will not catch the exception on older Python 3 versions. For better portability, it's recommended to explicitly catch asyncio.TimeoutError.
```python
except asyncio.TimeoutError:
```
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/ai_company/engine/agent_engine.py (1)
97-211: 🛠️ Refactor suggestion | 🟠 Major

Split `run()` and `_execute()` back under the 50-line cap. The timeout/telemetry additions leave both lifecycle methods responsible for validation, context prep, timeout orchestration, cost recording, transitions, and error translation in one flow. Please extract the timeout wrapper and post-processing helpers before this gets any harder to reason about safely.
As per coding guidelines "Functions should be less than 50 lines, files less than 800 lines".
Also applies to: 213-289
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/agent_engine.py` around lines 97 - 211, The run() method and _execute() are too large—extract the timeout orchestration and post-execution processing into separate helper functions so both run() and _execute() are each under 50 lines; keep validation (calls to _validate_agent and _validate_task) and context preparation (call to _prepare_context) in run(), have run() call a new timeout_wrapper that invokes the core execution loop in _execute_core (move the existing loop logic from _execute to _execute_core), and factor out cost/transition/telemetry/post-processing into a helper (e.g., _finalize_run) used by _execute_core and by the fatal-error path; update signatures of _execute/_execute_core/_finalize_run to accept ctx, system_prompt, start, timeout_seconds and ensure _handle_fatal_error is retained for exceptions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 237-265: The try/except around asyncio.wait_for is conflating
internal TimeoutError raised by self._loop.execute with wall-clock timeouts;
change to create a Task from the coroutine returned by self._loop.execute (e.g.
task = asyncio.create_task(coro)) and use asyncio.wait({task},
timeout=timeout_seconds) to get (done, pending); if pending is non-empty treat
it as the engine boundary timeout (log EXECUTION_ENGINE_TIMEOUT with
agent_id/task_id, compute duration from start, cancel the task and handle
cleanup), otherwise call task.result() to retrieve the execution_result and let
any inner exceptions propagate normally; keep references to timeout_seconds,
start, EXECUTION_ENGINE_TIMEOUT, self._loop.execute, and execution_result in the
updated flow.
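The `create_task` + `asyncio.wait` pattern described in the prompt can be sketched generically. A task still pending after the wait signals the engine-boundary timeout, while a finished task re-raises any inner exception, including an inner `TimeoutError`, from `task.result()`. All names below are illustrative:

```python
import asyncio


async def wait_with_boundary_timeout(coro, timeout_seconds: float):
    """Distinguish the engine's wall-clock expiry from an inner TimeoutError."""
    task = asyncio.ensure_future(coro)
    done, pending = await asyncio.wait({task}, timeout=timeout_seconds)
    if pending:
        # asyncio.wait never raises on timeout; a pending task means the
        # boundary expired, so cancel the loop and report the timeout.
        task.cancel()
        return ("boundary-timeout", None)
    # Finished task: result() returns the value or re-raises the inner
    # exception unchanged, so inner timeouts are not misclassified.
    return ("done", task.result())
```

Compared with `asyncio.wait_for`, this keeps a reference to the loop task, which also leaves room for reading checkpointed partial state after cancellation.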
In `@tests/unit/engine/test_agent_engine_lifecycle.py`:
- Around line 7-25: This test module is missing the repo-required 30s timeout
mark; add a module-level pytest mark by defining pytestmark =
pytest.mark.timeout(30) at top-level in
tests/unit/engine/test_agent_engine_lifecycle.py (near the existing import of
pytest) so all async lifecycle tests (e.g., those using AgentEngine,
AgentContext, ExecutionResult, TerminationReason) inherit the 30-second timeout.
---
Outside diff comments:
In `@src/ai_company/engine/agent_engine.py`:
- Around line 97-211: The run() method and _execute() are too large—extract the
timeout orchestration and post-execution processing into separate helper
functions so both run() and _execute() are each under 50 lines; keep validation
(calls to _validate_agent and _validate_task) and context preparation (call to
_prepare_context) in run(), have run() call a new timeout_wrapper that invokes
the core execution loop in _execute_core (move the existing loop logic from
_execute to _execute_core), and factor out
cost/transition/telemetry/post-processing into a helper (e.g., _finalize_run)
used by _execute_core and by the fatal-error path; update signatures of
_execute/_execute_core/_finalize_run to accept ctx, system_prompt, start,
timeout_seconds and ensure _handle_fatal_error is retained for exceptions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f9b49081-783e-4ff3-bf9b-a6ef344d7605
📒 Files selected for processing (11)
- DESIGN_SPEC.md
- src/ai_company/engine/__init__.py
- src/ai_company/engine/agent_engine.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/run_result.py
- src/ai_company/observability/events/execution.py
- tests/integration/engine/test_agent_engine_integration.py
- tests/unit/engine/test_agent_engine.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- tests/unit/engine/test_metrics.py
- tests/unit/engine/test_run_result.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`**/*.py`:
- Use Python 3.14+ with PEP 649 native lazy annotations
- Do NOT use `from __future__ import annotations` — Python 3.14 has PEP 649
- Use `except A, B:` syntax (no parentheses) for exception handling on Python 3.14 — ruff enforces this
- Add type hints to all public functions in Python; mypy strict mode is enforced
- Use Google-style docstrings on all public classes and functions — ruff D rules enforce this
- Create new objects instead of mutating existing ones; use `copy.deepcopy()` at construction for non-Pydantic internal collections and `MappingProxyType` wrapping for read-only enforcement
- Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (using `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model.
- Use Pydantic v2 with `BaseModel`, `model_validator`, `computed_field`, and `ConfigDict`
- Use `@computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`)
- Use `NotBlankStr` (from `core.types`) for all identifier/name fields — including optional (`NotBlankStr | None`) and tuple (`tuple[NotBlankStr, ...]`) variants — instead of manual whitespace validators
- Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (e.g. multiple tool invocations, parallel agent calls); prefer structured concurrency over bare `create_task`
- Enforce line length of 88 characters (ruff enforces this)
- Functions should be less than 50 lines, files less than 800 lines
- Handle errors explicitly; never silently swallow errors in Python code
- Validate at system boundaries (user input, external APIs, config files)
Files:
- src/ai_company/observability/events/execution.py
- tests/unit/engine/test_run_result.py
- src/ai_company/engine/run_result.py
- tests/unit/engine/test_agent_engine.py
- src/ai_company/engine/__init__.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- tests/unit/engine/test_metrics.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/agent_engine.py
- tests/integration/engine/test_agent_engine_integration.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`src/**/*.py`:
- Every module with business logic MUST import `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
- Never use `import logging`, `logging.getLogger()`, or `print()` in application code
- Always use `logger` as the variable name for loggers (not `_logger`, not `log`)
- Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g. `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
- Always use structured logging with `logger.info(EVENT, key=value)` format — never `logger.info('msg %s', val)`
- All error paths must log at WARNING or ERROR with context before raising
- All state transitions must log at INFO level
- DEBUG level logging should be used for object creation, internal flow, entry/exit of key functions
- Pure data models, enums, and re-exports do NOT need logging
Files:
- src/ai_company/observability/events/execution.py
- src/ai_company/engine/run_result.py
- src/ai_company/engine/__init__.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/agent_engine.py
{src/**/*.py,tests/**/*.py,src/**/*.yaml,src/**/*.yml,tests/**/*.yaml,tests/**/*.yml,examples/**/*.yaml,examples/**/*.yml}
📄 CodeRabbit inference engine (CLAUDE.md)
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) .claude/skill/agent files, (3) third-party import paths/module names (e.g. `litellm.types.llms.openai`). Tests must use `test-provider`, `test-small-001`, etc.
Files:
- src/ai_company/observability/events/execution.py
- tests/unit/engine/test_run_result.py
- src/ai_company/engine/run_result.py
- tests/unit/engine/test_agent_engine.py
- src/ai_company/engine/__init__.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- tests/unit/engine/test_metrics.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/agent_engine.py
- tests/integration/engine/test_agent_engine_integration.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/**/*.py`:
- Mark unit tests with `@pytest.mark.unit`, integration tests with `@pytest.mark.integration`, e2e tests with `@pytest.mark.e2e`, and slow tests with `@pytest.mark.slow`
- Use `asyncio_mode = 'auto'` for pytest async tests — no manual `@pytest.mark.asyncio` needed
- Set a 30-second timeout per test
Files:
- tests/unit/engine/test_run_result.py
- tests/unit/engine/test_agent_engine.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- tests/unit/engine/test_metrics.py
- tests/integration/engine/test_agent_engine_integration.py
src/ai_company/{providers,engine}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`RetryExhaustedError` signals that all retries failed — the engine layer catches this to trigger fallback chains
Files:
- src/ai_company/engine/run_result.py
- src/ai_company/engine/__init__.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/agent_engine.py
🧠 Learnings (3)
📚 Learning: 2026-03-06T21:51:55.175Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-06T21:51:55.175Z
Learning: Applies to src/**/*.py : Use event name constants from domain-specific modules under `ai_company.observability.events` (e.g. `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`). Import directly: `from ai_company.observability.events.<domain> import EVENT_CONSTANT`
Applied to files:
src/ai_company/observability/events/execution.py
📚 Learning: 2026-01-24T09:54:45.426Z
Learnt from: CR
Repo: Aureliolo/story-factory PR: 0
File: .github/instructions/agents.instructions.md:0-0
Timestamp: 2026-01-24T09:54:45.426Z
Learning: Applies to agents/test*.py : Agent tests should cover: successful generation with valid output, handling malformed LLM responses, error conditions (network errors, timeouts), output format validation, and integration with story state
Applied to files:
- tests/unit/engine/test_run_result.py
- tests/unit/engine/test_agent_engine.py
- tests/unit/engine/test_agent_engine_lifecycle.py
- tests/unit/engine/test_metrics.py
- tests/integration/engine/test_agent_engine_integration.py
📚 Learning: 2026-02-26T17:43:50.902Z
Learnt from: CR
Repo: Aureliolo/story-factory PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-02-26T17:43:50.902Z
Learning: Applies to src/agents/**/*.py : Agents must extend `BaseAgent`, use retry logic, and implement configurable timeout via settings.
Applied to files:
- tests/unit/engine/test_agent_engine_lifecycle.py
- DESIGN_SPEC.md
- src/ai_company/engine/agent_engine.py
🔇 Additional comments (13)
src/ai_company/observability/events/execution.py (1)
35-36: LGTM!The new event constants follow the established naming convention and integrate well with the existing
EXECUTION_ENGINE_*event family.src/ai_company/engine/__init__.py (1)
29-29: LGTM!The
TaskCompletionMetricsimport and export are correctly integrated into the public API, maintaining alphabetical ordering in__all__.Also applies to: 62-62
src/ai_company/engine/run_result.py (1)
87-104: LGTM!The
completion_summarycomputed field correctly handles all edge cases:
- Skips `None` content (tool-call-only messages)
- Skips empty string content
- Returns the last qualifying assistant message or `None`

The reverse iteration is efficient for finding the most recent message.
tests/unit/engine/test_agent_engine.py (2)
118-121: LGTM!Test assertions and comments correctly updated to reflect the new auto-completion behavior where successful runs transition through
ASSIGNED → IN_PROGRESS → IN_REVIEW → COMPLETED.Also applies to: 145-148
538-541: LGTM!Mock contexts now correctly simulate the
IN_PROGRESSstate that_prepare_contextestablishes before handing control to the execution loop. This ensures the mock behavior aligns with real engine execution.Also applies to: 655-660
tests/unit/engine/test_run_result.py (2)
428-452: LGTM!The
_make_result_with_messageshelper provides a clean way to construct test fixtures with specific conversation content, enabling focused testing of thecompletion_summarycomputed field.
455-495: LGTM!Comprehensive test coverage for
completion_summary:
- Returns last assistant content when present
- Returns
Nonefor no assistant messages- Returns
Nonefor empty conversation- Skips tool-call-only messages (
content=None)- Skips empty string content
tests/integration/engine/test_agent_engine_integration.py (2)
205-208: LGTM!Existing test updated to verify the auto-completion path where successful runs transition to
COMPLETED.
211-296: LGTM!Comprehensive integration test validating the full task lifecycle:
- Verifies all three transitions (
ASSIGNED → IN_PROGRESS → IN_REVIEW → COMPLETED)- Confirms
completed_attimestamp is set- Validates
completion_summaryis non-empty- Ensures
TaskCompletionMetricscan be computed with positive valuesThis provides excellent end-to-end coverage for the single-task execution lifecycle feature.
src/ai_company/engine/metrics.py (1)
1-75: LGTM!Well-designed frozen Pydantic model following established patterns:
- Uses
NotBlankStrfor identifier fields as per coding guidelines- Proper
ge=0constraints on numeric fields- Clean factory method
from_run_resultfor extraction fromAgentRunResult- Good use of
TYPE_CHECKINGto avoid circular importtests/unit/engine/test_metrics.py (2)
19-100: LGTM!Comprehensive unit tests for
TaskCompletionMetricsconstruction and validation:
- Valid construction with all fields
task_id=Nonehandling- Frozen immutability enforcement
- Zero value acceptance
- Negative value rejection for
turns_per_taskandtokens_per_task- Blank
agent_idrejection viaNotBlankStr
102-194: LGTM!The
from_run_resultfactory method tests thoroughly validate extraction fromAgentRunResult:
- Correctly extracts
task_id,agent_id,turns_per_task,tokens_per_task,cost_per_task, andduration_seconds- Handles zero-turns edge case
- Uses a well-designed helper method for fixture construction
src/ai_company/engine/agent_engine.py (1)
266-273: Timeout results currently discard partial progress.The fallback at lines 266-273 reconstructs the result from the pre-loop
ctx, leavingturnsat theExecutionResultdefault of(). This means_record_costs()has no turn data to persist in the timeout path. Sinceasyncio.wait_for()cancels the underlying task on timeout, timed-out work will be undercounted unless the loop itself catchesCancelledErrorand returns a checkpointedExecutionResultwith accumulated context before this branch executes.Verify that explicit cancellation/checkpoint logic exists in
react_loop.py(or the applicable timeout handler) to preserve the latestAgentContextand turn data when the outer timeout fires. If absent, this path will always record zero cost/tokens for timed-out runs.
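The checkpoint pattern the comment asks to verify would look roughly like this inside a loop: accumulate state on a shared object, catch the cancellation, and re-raise. All names here are illustrative, not the project's `react_loop.py`:

```python
import asyncio


class Checkpoint:
    """Illustrative shared holder the engine could read after cancelling the loop."""

    def __init__(self) -> None:
        self.turns: list[str] = []


async def loop_body(checkpoint: Checkpoint) -> None:
    """Accumulate per-turn state on the shared checkpoint so it survives cancel."""
    try:
        while True:
            checkpoint.turns.append("turn")
            await asyncio.sleep(0.01)
    except asyncio.CancelledError:
        # Partial turns are already on the checkpoint, so the timeout path
        # can still record their cost; re-raise per the asyncio contract.
        raise


async def cancel_after(delay: float) -> Checkpoint:
    checkpoint = Checkpoint()
    task = asyncio.create_task(loop_body(checkpoint))
    await asyncio.sleep(delay)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return checkpoint
```

Without something equivalent, the timeout branch really does reconstruct its result from the pre-execution context and undercounts the cancelled work.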
```python
import pytest

from ai_company.core.agent import AgentIdentity  # noqa: TC001
from ai_company.core.enums import TaskStatus
from ai_company.core.task import Task  # noqa: TC001
from ai_company.engine.agent_engine import AgentEngine
from ai_company.engine.context import AgentContext
from ai_company.engine.loop_protocol import (
    ExecutionResult,
    TerminationReason,
)

if TYPE_CHECKING:
    from .conftest import MockCompletionProvider

from .conftest import make_completion_response as _make_completion_response


@pytest.mark.unit
```
🛠️ Refactor suggestion | 🟠 Major
Add the mandatory 30-second timeout mark to this module.
These async lifecycle tests are new, but none of them carries the repo-standard timeout. A module-level timeout mark is probably the least repetitive way to keep a hung provider/loop mock from stalling the suite.
♻️ Suggested change

```diff
 import pytest
+pytestmark = pytest.mark.timeout(30)
+
 from ai_company.core.agent import AgentIdentity  # noqa: TC001
```

As per coding guidelines "Set a 30-second timeout per test".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/engine/test_agent_engine_lifecycle.py` around lines 7-25: this
test module is missing the repo-required 30s timeout mark; add a module-level
pytest mark by defining pytestmark = pytest.mark.timeout(30) at top-level in
tests/unit/engine/test_agent_engine_lifecycle.py (near the existing import of
pytest) so all async lifecycle tests (e.g., those using AgentEngine,
AgentContext, ExecutionResult, TerminationReason) inherit the 30-second timeout.
…t, Gemini, and greptile

- Add produced_artifacts field to AgentRunResult (#1)
- Wrap _log_completion in try/except to preserve valid results (#2)
- Add test for inner TimeoutError propagation without engine timeout (#3)
- Extract _run_loop_with_timeout from _execute (50-line limit) (#4)
- Extract _validate_run_inputs from run() (50-line limit) (#5)
- Rename metrics docstrings from "completed task" to "agent run" + add termination_reason to metrics event (#6)
- Fix raise exc from build_exc chain direction (#7)
- Replace asyncio.wait_for with asyncio.wait for timeout disambiguation (#8)
- Add test for _apply_post_execution_transitions failure resilience (#9)
- Add test for timeout cost recording behavior (#10)
- Fix hardcoded from_status in transition logs (#11)
- Add agent-task ownership check in _validate_task (#12)
- Split test_invalid_timeout_raises into two test methods (#13)
- Add negative validation tests for cost_per_task/duration_seconds (#14)
- Add test_blank_task_id_rejected (#15)
- Update _execute docstring to mention timeout, transitions, metrics (#16)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5. **Transition task** — `ASSIGNED` → `IN_PROGRESS` (pass-through if already `IN_PROGRESS`).
6. **Prepare tools and budget** — creates `ToolInvoker` from registry and `BudgetChecker` from task budget limit.
7. **Delegate to loop** — calls `ExecutionLoop.execute()` with context, provider, tool invoker, budget checker, and completion config. If `timeout_seconds` is set, wraps the call in `asyncio.wait_for`; on expiry the run returns with `TerminationReason.ERROR` but cost recording and post-execution processing still occur.
**DESIGN_SPEC references `asyncio.wait_for` but implementation uses `asyncio.wait`**

The spec description (line 928) states:

> If `timeout_seconds` is set, wraps the call in `asyncio.wait_for`…

However, the actual implementation in `_run_loop_with_timeout` (lines 278–300) deliberately uses `asyncio.wait` instead, with clear rationale: this prevents conflating an internal `TimeoutError` from the loop with the engine's wall-clock deadline (see the method's docstring at lines 280–282).

The implementation choice is correct and well-reasoned; the spec just needs to match:

```suggestion
7. **Delegate to loop** — calls `ExecutionLoop.execute()` with context, provider, tool invoker, budget checker, and completion config. If `timeout_seconds` is set, wraps the call in `asyncio.wait` (not `asyncio.wait_for`, to avoid conflating internal `TimeoutError` with the engine's wall-clock deadline); on expiry the run returns with `TerminationReason.ERROR` but cost recording and post-execution processing still occur.
```
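The disambiguation the implementation is after can be sketched as follows: with `asyncio.wait`, wall-clock expiry shows up as the task still being pending, while a `TimeoutError` raised *inside* the loop surfaces through `task.exception()`. This is an assumed shape, not the PR's `_run_loop_with_timeout`:

```python
import asyncio


async def loop_raising_internal_timeout() -> str:
    # An internal TimeoutError (e.g. a provider call timing out),
    # distinct from the engine's wall-clock deadline.
    raise TimeoutError("provider timed out")


async def run_with_deadline(coro, deadline: float):
    task = asyncio.ensure_future(coro)
    _done, pending = await asyncio.wait({task}, timeout=deadline)
    if task in pending:
        # Deadline expired: the task never finished. Cancel it and report
        # a wall-clock timeout — unambiguously not an internal error.
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass
        return ("wall_clock_timeout", None)
    # Task finished before the deadline; any internal exception
    # (including TimeoutError) is visible here without ambiguity.
    exc = task.exception()
    return ("internal_error", exc) if exc else ("ok", task.result())


kind, payload = asyncio.run(run_with_deadline(loop_raising_internal_timeout(), 1.0))
kind2, _ = asyncio.run(run_with_deadline(asyncio.sleep(10), 0.05))
```

By contrast, `asyncio.wait_for` raises `TimeoutError` in both situations, which is exactly the conflation the docstring warns about.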
```python
    produced_artifacts: tuple[Artifact, ...] = Field(
        default=(),
        description="Artifacts produced during execution",
    )
```
produced_artifacts field is declared but never populated
The produced_artifacts field defaults to an empty tuple and is not passed a non-empty value anywhere in AgentRunResult construction — neither in the normal execution path (_execute, line 248–254) nor in the error handler (_handle_fatal_error, line 736–741). The ExecutionResult returned from ExecutionLoop.execute() has no artifacts field to forward either.
Callers inspecting result.produced_artifacts will always receive an empty tuple, which could cause silent confusion about whether artifacts were actually extracted. Either mark this as an explicit TODO (e.g., # TODO(M?): populate from loop artifacts extraction logic) to signal it's intentional scaffolding, or defer adding the field until the extraction logic exists.
```python
        return ExecutionResult(
            context=ctx,
            termination_reason=TerminationReason.ERROR,
            error_message=error_msg,
        )
```
Timeout path creates ExecutionResult with zero turns, silently losing partial execution state
When the wall-clock deadline expires, _run_loop_with_timeout constructs a new ExecutionResult with context=ctx (the pre-execution context passed in at line 270), termination_reason=TerminationReason.ERROR, and error_message. The turns field defaults to an empty tuple.
However, the cancelled loop_task may have completed partial execution — accumulating turns, tokens, and cost data in the context before the timeout fired. By creating a fresh ExecutionResult from the pre-execution context, all that partial state is dropped irretrievably.
The subsequent _record_costs call (line 240) then iterates over an empty turns tuple and records nothing. Any real spend incurred during the partial run is silently lost. Consequently, TaskCompletionMetrics reports zero turns, zero tokens, and zero cost even when the agent ran for multiple turns before the timeout.
If partial state loss is intentional (because the loop was forcibly cancelled and state may be unreliable), this behavior should be documented explicitly so operators understand that pre-timeout costs will not appear in billing records. If partial results should be preserved, the ExecutionLoop may need to expose a snapshot of the partial context upon cancellation so it can be included in the result.
🤖 I have created a release *beep* *boop*

---

## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10)

### Features

* add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267)
* add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173)
* add design specification, license, and project setup ([8669a09](8669a09))
* add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139)
* add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779))
* add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b))
* design unified provider interface ([#86](#86)) ([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134))
* implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b))
* implement built-in file system tools ([#18](#18)) ([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496))
* implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2))
* implement core entity and role system models ([#69](#69)) ([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f))
* implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515))
* implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe))
* implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca))
* implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2))
* implement model routing engine ([#99](#99)) ([d3c250b](d3c250b))
* implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3))
* implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85))
* implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30)
* implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1))
* implement tool permission checking ([#16](#16)) ([833c190](833c190))
* implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout ([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749))
* wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617))

### Bug Fixes

* address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929))
* address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169)
* enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a))
* pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861))

### Performance

* harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188)

### Refactoring

* adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9))
* harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89))

### Documentation

* add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6))
* finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee))

### Tests

* add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4))
* add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4))

### CI/CD

* add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758))
* bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247))
* bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517))
* harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af))

### Maintenance

* add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337))
* add design spec context loading to research-link skill ([8ef9685](8ef9685))
* add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46))
* fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5))
* pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.0...v0.1.0) (2026-03-11)

### Features

* add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267)
* add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173)
* add design specification, license, and project setup ([8669a09](8669a09))
* add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139)
* add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39))
* add mandatory JWT + API key authentication ([#256](#256)) ([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4))
* add pluggable output scan response policies ([#263](#263)) ([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779))
* add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b))
* design unified provider interface ([#86](#86)) ([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134))
* implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence ([#279](#279)) ([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b))
* implement built-in file system tools ([#18](#18)) ([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496))
* implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2))
* implement core entity and role system models ([#69](#69)) ([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f))
* implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515))
* implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe))
* implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca))
* implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2))
* implement model routing engine ([#99](#99)) ([d3c250b](d3c250b))
* implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3))
* implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85))
* implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30)
* implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1))
* implement tool permission checking ([#16](#16)) ([833c190](833c190))
* implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout ([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749))
* wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617))

### Bug Fixes

* address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929))
* address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169)
* enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a))
* pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format ([#286](#286)) ([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861))

### Performance

* harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188)

### Refactoring

* adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9))
* harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89))

### Documentation

* add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6))
* finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee))

### Tests

* add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4))
* add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4))

### CI/CD

* add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758))
* bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247))
* bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2 ([#271](#271)) ([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0 ([#273](#273)) ([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0 ([#272](#272)) ([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0 ([#270](#270)) ([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0 ([#274](#274)) ([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0 ([#275](#275)) ([29dd16c](29dd16c))
* harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af))

### Maintenance

* add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337))
* add design spec context loading to research-link skill ([8ef9685](8ef9685))
* add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46))
* fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1 ([#282](#282)) ([2f4703d](2f4703d))
* pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Summary

- `run()` orchestrates identity validation, system prompt construction, context preparation, execution loop delegation (with optional wall-clock timeout via `asyncio.wait_for`), cost recording, and post-execution task transitions (`ASSIGNED` → `IN_PROGRESS` → `IN_REVIEW` → `COMPLETED`)
- `ExecutionResult` wrapped with engine metadata and computed fields (`termination_reason`, `total_turns`, `total_cost_usd`, `is_success`, `completion_summary`)
- `TaskCompletionMetrics` model (`turns_per_task`, `tokens_per_task`, `cost_per_task`, `duration_seconds`) with `from_run_result()` factory, logged at task completion
- `EXECUTION_ENGINE_TASK_METRICS` and `EXECUTION_ENGINE_TIMEOUT` event constants (12 total engine events)
- DESIGN_SPEC.md updates: §6.5 `run()` signature, pipeline steps, constants count, computed fields; §10.5 metric sources and `TaskCompletionMetrics` model; §15.3 project structure

Pre-PR Review Fixes
Pre-reviewed by 9 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, docs-consistency). 18 findings addressed:
- Guard `except TimeoutError` to re-raise non-wall-clock timeouts (silent-failure-hunter)
- Change `raise exc from None` to `raise exc from build_exc` to preserve both exceptions in traceback chain (python-reviewer)
- Remove redundant `execution_result` param from `_log_completion` (python-reviewer)
- Update `timeout_seconds` and `_apply_post_execution_transitions` docstrings (comment-analyzer)
- Add `TODO(M4)` marker for auto-complete scaffolding (type-design-analyzer)
- Split `test_agent_engine.py` (1042→733 lines) into two files under 800 (code-reviewer)

Test Plan
Closes #21