test: add e2e single agent integration tests (#24) #156

Merged
Aureliolo merged 3 commits into main from test/e2e-single-agent
Mar 7, 2026

@Aureliolo
Owner

Summary

  • Add end-to-end tests validating the core single-agent execution pipeline: engine → execution loop → real tools → cost tracking → task lifecycle
  • 4 test scenarios: file tool agent (write to disk), text-only completion, permission denial recovery, max-turns exhaustion
  • Test infrastructure: ScriptedProvider (mock provider with sequential response playback), factory helpers (make_e2e_identity, make_e2e_task, make_tool_call_response, make_text_response)
  • Pre-PR review fixes (7 agents, 10 findings addressed):
    • Add bounds check with descriptive error in ScriptedProvider.complete()
    • Fix docstrings for accuracy (execution loop, file tools, real LLM placeholder)
    • Add is_error is False assertion on success-path tool result
    • Clarify MAX_TURNS inline comment with full transition rule
    • Add SHUTDOWN to DESIGN_SPEC TerminationReason enum listing
    • Add ShutdownChecker to DESIGN_SPEC ExecutionLoop.execute() docs
    • Add e2e test command to CLAUDE.md Quick Commands

Closes #24

Test plan

  • uv run ruff check src/ tests/ — lint clean
  • uv run ruff format src/ tests/ — format clean
  • uv run mypy src/ tests/ — type-check clean (281 files)
  • uv run pytest tests/ -n auto --cov=ai_company --cov-fail-under=80 — 2476 passed, 96.36% coverage
  • Pre-commit hooks pass (trailing whitespace, ruff, gitleaks, commitizen)
  • Pre-reviewed by 7 agents: code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, docs-consistency

Aureliolo and others added 2 commits March 7, 2026 15:25
Validate the core MVP hypothesis: a single agent can complete a real
task end-to-end through the full execution pipeline (engine, ReAct loop,
real tools, cost tracking, task lifecycle).

Four scenarios: file tool agent (real filesystem I/O), text-only agent,
permission denied recovery (CUSTOM access level), and max turns
exhaustion. Plus a gated real LLM smoke test placeholder.

Closes #24

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pre-reviewed by 7 agents, 10 findings addressed:
- Add bounds check with descriptive error in ScriptedProvider
- Fix docstrings for accuracy (execution loop, file tools, real LLM)
- Add is_error assertion on success-path tool result
- Clarify MAX_TURNS comment with full transition rule
- Add SHUTDOWN to DESIGN_SPEC TerminationReason enum listing
- Add ShutdownChecker to DESIGN_SPEC ExecutionLoop docs
- Add e2e test command to CLAUDE.md Quick Commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 14:37
@github-actions
Contributor

github-actions bot commented Mar 7, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Mar 7, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 48bd900f-89bb-42dc-9a6e-90bded525e8a

📥 Commits

Reviewing files that changed from the base of the PR and between 8d1cbf4 and ccdecda.

📒 Files selected for processing (3)
  • DESIGN_SPEC.md
  • tests/e2e/conftest.py
  • tests/e2e/test_single_agent_e2e.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added cooperative shutdown mechanism for execution loop control with explicit shutdown termination handling.
  • Tests

    • Added a comprehensive end-to-end test suite for single-agent workflows, including scripted response mocking, workspace fixtures, tool-call and text-response helpers, and cost tracking validations.
  • Chores

    • Updated Quick Commands snippet to include a command for running only e2e tests.

Walkthrough

Adds end-to-end test infrastructure and a comprehensive single-agent e2e test suite; introduces test fixtures and a ScriptedProvider mock. Updates design spec to add a cooperative ShutdownChecker parameter and a SHUTDOWN termination reason to the execution loop protocol. Also adds a Quick Commands e2e snippet.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Documentation & Quick Commands | CLAUDE.md | Adds a Quick Commands snippet to run only e2e tests (uv run pytest tests/ -m e2e). |
| Design / Execution Loop API | DESIGN_SPEC.md, engine/loop_protocol.py (signature & enums) | Extends ExecutionLoop.execute(...) to accept an optional ShutdownChecker, adds SHUTDOWN to TerminationReason, and documents cooperative shutdown semantics and post-execution transition rules. |
| E2E Test Fixtures & Helpers | tests/e2e/conftest.py | Adds ScriptedProvider mock, workspace fixture, identity/task builders (make_e2e_identity, make_e2e_task), response builders (make_tool_call_response, make_text_response), and test constants for deterministic e2e tests. |
| E2E Test Cases | tests/e2e/test_single_agent_e2e.py | Introduces comprehensive single-agent e2e tests covering file-tool workflow, text-only responses, permission-denied recovery, max-iterations termination, cost tracking assertions, and a gated real-LLM integration scaffold. |

Sequence Diagram(s)

sequenceDiagram
    participant TestRunner as Test Runner
    participant AgentEngine as AgentEngine
    participant Provider as ScriptedProvider
    participant ToolRegistry as ToolRegistry
    participant FileTool as FileTools
    participant CostTracker as CostTracker

    TestRunner->>AgentEngine: start execution (Task, Identity)
    AgentEngine->>Provider: request completion (turn N)
    Provider-->>AgentEngine: CompletionResponse (tool_call or text)
    alt tool_call
        AgentEngine->>ToolRegistry: resolve tool call
        ToolRegistry->>FileTool: run tool (e.g., WriteFile)
        FileTool-->>AgentEngine: tool result
    end
    AgentEngine->>Provider: request completion (turn N+1)
    Provider-->>AgentEngine: CompletionResponse (STOP / final)
    AgentEngine->>CostTracker: record usage/costs
    AgentEngine->>AgentEngine: update task state & termination reason
    AgentEngine-->>TestRunner: ExecutionResult (status, metrics, conversation)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding end-to-end integration tests for single-agent workflows. |
| Description check | ✅ Passed | The description comprehensively details the changes, test infrastructure, pre-PR fixes, test plan, and verification results, all related to the e2e test additions. |
| Linked Issues check | ✅ Passed | All four test scenarios from issue #24 acceptance criteria are implemented: file tools (test_single_agent_writes_to_file), text-only (test_single_agent_text_only_completion), permission denial (test_single_agent_permission_denied_recovery), max iterations (test_single_agent_max_turns_exhausted). Cost tracking and status transitions are validated. Mocked ScriptedProvider avoids real API calls. Optional real LLM test placeholder included. |
| Out of Scope Changes check | ✅ Passed | Design spec documentation updates (DESIGN_SPEC.md) align with PR description's stated fixes and support the test infrastructure. CLAUDE.md e2e command addition supports test execution. All changes directly support the e2e testing objectives. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 93.33% which is sufficient. The required threshold is 80.00%. |

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's testing foundation by introducing a dedicated suite of end-to-end tests for the single-agent execution flow. These tests ensure the core agent functionality, from task processing to tool interaction and lifecycle management, behaves as expected under various conditions, thereby increasing confidence in the system's reliability and correctness.

Highlights

  • End-to-End Test Suite: Introduced a comprehensive suite of end-to-end tests to validate the single-agent execution pipeline, covering engine, execution loop, real tools, cost tracking, and task lifecycle.
  • Diverse Test Scenarios: Implemented four distinct test scenarios: an agent using file tools, a text-only completion agent, an agent recovering from permission denial, and an agent hitting the maximum turn limit.
  • Robust Test Infrastructure: Developed ScriptedProvider for sequential response playback (including a bounds check for calls) and utility functions (make_e2e_identity, make_e2e_task, make_tool_call_response, make_text_response) to facilitate E2E test setup.
  • Design Specification Updates: Updated DESIGN_SPEC.md to include ShutdownChecker in the ExecutionLoop.execute() signature and added SHUTDOWN to the TerminationReason enum, along with documenting the ShutdownChecker.
  • Documentation and Code Improvements: Applied several pre-PR review fixes, such as clarifying docstrings for accuracy, adding an is_error is False assertion on success-path tool results, and clarifying the MAX_TURNS inline comment.
  • E2E Test Command: Added a new uv run pytest tests/ -m e2e command to CLAUDE.md for easily running end-to-end tests.
Changelog
  • CLAUDE.md
    • Added a command to run e2e tests.
  • DESIGN_SPEC.md
    • Updated the execute method signature to include an optional ShutdownChecker.
    • Expanded the TerminationReason enum with a SHUTDOWN state.
    • Documented the ShutdownChecker callback type.
  • tests/e2e/conftest.py
    • Introduced ScriptedProvider for mocking LLM responses in tests.
    • Added e2e_workspace fixture for isolated file system operations.
    • Provided helper functions for creating AgentIdentity, Task, and CompletionResponse objects for E2E tests.
  • tests/e2e/test_single_agent_e2e.py
    • Added a new file containing end-to-end tests for single-agent scenarios.
    • Implemented tests for file tool usage, text-only completion, permission denial recovery, and max turns exhaustion.
Activity
  • Verified linting, formatting, and type-checking are clean.
  • Confirmed all pytest suites passed, achieving 96.36% code coverage.
  • Ensured pre-commit hooks ran successfully.
  • The changes were pre-reviewed by 7 different AI agents.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable suite of end-to-end tests for the single-agent execution pipeline. The tests are well-designed, covering various scenarios including successful tool use, text-only completion, permission denial recovery, and turn exhaustion. The testing infrastructure, including the ScriptedProvider and factory helpers, is robust and will make future e2e testing easier. The documentation updates are also accurate. I have one minor suggestion to improve the clarity of one of the new tests.

"""Agent writes a file to disk, then completes with a summary."""
write_tool = WriteFileTool(workspace_root=e2e_workspace)
read_tool = ReadFileTool(workspace_root=e2e_workspace)
registry = ToolRegistry([write_tool, read_tool])

medium

The read_tool is registered here but is not used in this test scenario. The scripted agent behavior only involves the write_file tool. To improve clarity and remove unnecessary setup, you can remove read_tool from the registry. You can also remove its initialization on the preceding line.

Suggested change
-registry = ToolRegistry([write_tool, read_tool])
+registry = ToolRegistry([write_tool])

@greptile-apps

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR adds end-to-end integration tests that validate the full single-agent execution pipeline — from AgentEngine through the ReActLoop, real file system tools, cost tracking, and task lifecycle transitions — using a ScriptedProvider mock that plays back pre-defined LLM responses sequentially. It also updates DESIGN_SPEC.md to document SHUTDOWN as a TerminationReason and ShutdownChecker as an execute() parameter, and adds the e2e pytest marker command to CLAUDE.md.

Four test scenarios: file tool write (real disk I/O), text-only single-turn completion, permission denial recovery, and MAX_TURNS exhaustion — each asserting result fields, task lifecycle transitions, conversation structure, and cost tracking consistency.

ScriptedProvider: Clean sequential mock with correctly-ordered bounds check (index check before append), and call_count / received_messages tracking for post-test inspection.

Minor gap: TestPermissionDeniedRecovery omits the result.termination_reason assertion present in all other three test classes, leaving a small coverage hole.
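
The ScriptedProvider behaviour summarized above (sequential playback, bounds check before recording the call) could look roughly like this. It is a hedged sketch only: the real class returns CompletionResponse objects and takes richer message types, and the exception type and message wording here are assumptions, since the PR states only that the error is descriptive.

```python
import asyncio


class ScriptedProvider:
    """Plays back pre-built responses in order (illustrative sketch only)."""

    def __init__(self, responses: list[str]) -> None:
        self._responses = list(responses)
        self.call_count = 0
        self.received_messages: list[list[str]] = []

    async def complete(self, messages: list[str], model: str) -> str:
        # Bounds check first, so an over-long run fails loudly without
        # recording a call that was never answered.
        if self.call_count >= len(self._responses):
            msg = (
                f"ScriptedProvider exhausted: call {self.call_count + 1} requested "
                f"but only {len(self._responses)} responses were scripted"
            )
            raise IndexError(msg)  # exception type is an assumption
        self.received_messages.append(list(messages))
        response = self._responses[self.call_count]
        self.call_count += 1
        return response
```

Checking the index before appending keeps `received_messages` and `call_count` in step, so post-test inspection never sees a message list with no matching response.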

Confidence Score: 5/5

  • This PR is safe to merge — it adds tests and documentation with no changes to production source code.
  • All changes are confined to test infrastructure (tests/e2e/) and documentation (CLAUDE.md, DESIGN_SPEC.md). No production code is modified. The identified finding is a straightforward test-completeness improvement (missing assertion in one test method to match the pattern in others) with no functional impact. The PR description confirms full CI passage: lint, type-check (strict, 281 files), 2476 tests at 96.36% coverage, and pre-commit hooks.
  • No files require special attention — the finding is in test_single_agent_e2e.py and is a non-blocking style improvement.

Sequence Diagram

sequenceDiagram
    participant Test
    participant Engine as AgentEngine
    participant ReactLoop as ReActLoop
    participant SP as ScriptedProvider
    participant TI as ToolInvoker
    participant CT as CostTracker
    Test->>Engine: run(identity, task, max_turns)
    Engine->>ReactLoop: execute(context, provider, tool_invoker)
    ReactLoop->>SP: complete(messages, model)
    SP-->>ReactLoop: CompletionResponse TOOL_USE
    ReactLoop->>TI: invoke(tool_call)
    TI-->>ReactLoop: ToolResult
    ReactLoop->>SP: complete(messages, model)
    SP-->>ReactLoop: CompletionResponse STOP
    ReactLoop-->>Engine: ExecutionResult COMPLETED
    Engine->>CT: record(TokenUsage)
    Engine->>Engine: task transition to COMPLETED
    Engine-->>Test: AgentRunResult
    Test->>Test: assert result, filesystem, lifecycle, costs

Last reviewed commit: ccdecda

Copilot AI left a comment

Pull request overview

Adds an end-to-end (e2e) test suite that exercises the single-agent execution pipeline (engine → loop → real tools → cost tracking → task lifecycle), plus small doc updates to reflect the new testing workflow and termination reasons.

Changes:

  • Introduce 4 e2e scenarios (file write, text-only completion, permission denial recovery, max-turns exhaustion) using real file tools.
  • Add e2e test infrastructure (ScriptedProvider + factory helpers) under tests/e2e/.
  • Update DESIGN_SPEC.md and CLAUDE.md to reflect shutdown termination/docs and add an e2e pytest command.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| tests/e2e/test_single_agent_e2e.py | New end-to-end scenarios validating the full single-agent execution pipeline. |
| tests/e2e/conftest.py | New e2e fixtures + scripted completion provider + response factory helpers. |
| DESIGN_SPEC.md | Documentation update: include SHUTDOWN termination and ShutdownChecker in loop API docs. |
| CLAUDE.md | Add quick command for running e2e tests via pytest marker. |

Comment on lines +9 to +15
from typing import TYPE_CHECKING

import pytest

if TYPE_CHECKING:
    from pathlib import Path

Copilot AI Mar 7, 2026

Path is only imported under TYPE_CHECKING, but this repo’s tests commonly import Path at runtime because pytest evaluates annotations (see e.g. tests/unit/tools/git/conftest.py:5). With the current pattern, anything that resolves annotations (pytest/plugins/typing.get_type_hints) can raise NameError: Path is not defined. Import Path at runtime (optionally with # noqa: TC003) and drop the TYPE_CHECKING block here.

Suggested change
-from typing import TYPE_CHECKING
-import pytest
-if TYPE_CHECKING:
-    from pathlib import Path
+from pathlib import Path
+import pytest
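
The failure mode this comment warns about can be reproduced in isolation. The sketch below is illustrative only: `workspace_size` and `resolve_annotations` are hypothetical names, the annotation is written as a string so the example behaves the same across Python versions, and whether pytest or a given plugin actually resolves annotations in this repository is not verified here.

```python
from typing import TYPE_CHECKING, get_type_hints

if TYPE_CHECKING:
    # Only visible to static type checkers; never executed at runtime.
    from pathlib import Path


def workspace_size(root: "Path") -> int:
    """Hypothetical helper whose annotation names a type-checking-only import."""
    return len(str(root))


def resolve_annotations() -> str:
    """Resolve annotations the way an introspecting tool might."""
    try:
        get_type_hints(workspace_size)
    except NameError as exc:
        return f"NameError: {exc}"
    return "resolved"
```

Calling `workspace_size` works fine; only code that resolves the annotation, as `typing.get_type_hints` does here, hits the missing name.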

Copilot uses AI. Check for mistakes.
Comment on lines +34 to +37
if TYPE_CHECKING:
    from collections.abc import AsyncIterator
    from pathlib import Path

Copilot AI Mar 7, 2026

AsyncIterator/Path are imported only under TYPE_CHECKING, but this repo’s test suite imports annotation types at runtime because pytest evaluates them (see tests/unit/tools/git/conftest.py:5). Keeping these imports type-checking-only risks NameError if annotations are resolved. Import Path/AsyncIterator at runtime (optionally with # noqa: TC003) and remove the TYPE_CHECKING block.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Line 37: Update the e2e quick command string that currently reads "uv run
pytest tests/ -m e2e" to include pytest-xdist parallelism by adding "-n auto" so
it matches the full-suite and docs; locate and modify the command in CLAUDE.md
(the e2e shortcut line) to read the same invocation with "-n auto" appended.

In `@DESIGN_SPEC.md`:
- Around line 823-831: The spec currently lists TerminationReason.SHUTDOWN but
the orchestrator pipeline only treats COMPLETED as changing task state; update
the orchestrator and AgentEngine/execute documentation so that when execute(...)
returns an ExecutionResult with TerminationReason.SHUTDOWN the orchestrator
transitions the task state to INTERRUPTED (same as §6.7 requires) instead of
leaving it IN_PROGRESS; specifically, thread TerminationReason.SHUTDOWN through
the orchestrator's task state transition logic (the code/docs describing how
ExecutionResult is handled), and update any place that lists only COMPLETED as
state-changing to include SHUTDOWN -> INTERRUPTED so ShutdownChecker,
ExecutionResult, and Task state transition behavior are consistent.

In `@tests/e2e/conftest.py`:
- Around line 154-172: The fixture make_tool_call_response currently builds a
CompletionResponse with content="" which misrepresents pure tool-use turns;
update make_tool_call_response to pass content=None to CompletionResponse (leave
finish_reason=FinishReason.TOOL_USE, usage, model=_TEST_MODEL, and tool_calls
as-is) so the test accurately simulates tool-only assistant responses and
surfaces code paths that treat None differently from an empty string.

In `@tests/e2e/test_single_agent_e2e.py`:
- Around line 372-389: The test
TestRealLLMIntegration.test_real_provider_text_completion unconditionally calls
pytest.skip(), so the REAL_LLM_TEST path is never exercised; replace the
unconditional skip with an env-gated minimal smoke path: read REAL_LLM_PROVIDER
(or similar env vars) and if missing call pytest.skip(), otherwise construct a
minimal provider/client using those env vars inside
test_real_provider_text_completion, perform a simple text completion/request via
the existing LLM client or agent helper (e.g., create the client, call its
complete/generate method), and assert on a non-empty/valid response; keep the
test slow/timeout markers but ensure the new logic only runs when
REAL_LLM_TEST=1 and required provider env vars are present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 566ae37d-54ca-48e7-bb40-84393643f656

📥 Commits

Reviewing files that changed from the base of the PR and between d1fe1fb and 8d1cbf4.

📒 Files selected for processing (4)
  • CLAUDE.md
  • DESIGN_SPEC.md
  • tests/e2e/conftest.py
  • tests/e2e/test_single_agent_e2e.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Do not use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax without parentheses for exception handling — ruff enforces PEP 758 on Python 3.14
All public functions and classes must have type hints with strict mypy compliance
Use Google-style docstrings on all public classes and functions — enforced by ruff D rules
Every module with business logic must include: from ai_company.observability import get_logger then logger = get_logger(__name__)
Never use import logging, logging.getLogger(), or print() in application code — use the project logger instead
Always use logger as the variable name for loggers — not _logger or log
Use event name constants from ai_company.observability.events.<domain> instead of string literals for log events
Use structured logging format: logger.info(EVENT, key=value) — never use string formatting like logger.info('msg %s', val)
All error paths must log at WARNING or ERROR with context before raising exceptions
All state transitions must be logged at INFO level
Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
Create new objects instead of mutating existing ones — never mutate objects
For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction and wrap with MappingProxyType for read-only enforcement
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Never mix static config fields with mutable runtime fields in a single Pydantic model
Use NotBlankStr from core.types for all identifier and name fields in Pydantic models, including optional and tuple variants, instead of manual whitespace validators
Use @computed_field in Pydantic models for derived values instead of storing and validating redundant fields
Use model_copy(update=...) for evolving runtime state in Pydantic models
Use `copy.deepcop...

Files:

  • tests/e2e/test_single_agent_e2e.py
  • tests/e2e/conftest.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Mark all tests with appropriate markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow
Tests must not use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) — use generic names: example-provider, example-large-001, example-medium-001, example-small-001, or test-provider, test-small-001
Prefer @pytest.mark.parametrize for testing similar cases
Default async pytest mode is asyncio_mode = 'auto' — no manual @pytest.mark.asyncio needed
Test timeout is 30 seconds per test
Use pytest-xdist parallelism via -n auto for test execution

Files:

  • tests/e2e/test_single_agent_e2e.py
  • tests/e2e/conftest.py
🧠 Learnings (3)
📚 Learning: 2026-03-07T13:46:59.933Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Dependency groups: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen). Install with `uv sync`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-07T13:46:59.933Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Applies to tests/**/*.py : Mark all tests with appropriate markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, or `pytest.mark.slow`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-07T13:46:59.933Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Applies to tests/**/*.py : Use `pytest-xdist` parallelism via `-n auto` for test execution

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (1)
tests/e2e/conftest.py (2)
src/ai_company/providers/enums.py (1)
  • FinishReason (15-22)
src/ai_company/providers/models.py (5)
  • CompletionResponse (257-306)
  • StreamChunk (309-387)
  • TokenUsage (12-35)
  • ToolCall (96-119)
  • ToolDefinition (64-93)

uv run mypy src/ tests/ # type-check (strict)
uv run pytest tests/ -m unit # unit tests only
uv run pytest tests/ -m integration # integration tests only
uv run pytest tests/ -m e2e # e2e tests only

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if CLAUDE.md exists and view the context around line 37
if [ -f CLAUDE.md ]; then
  echo "=== CLAUDE.md exists, viewing lines 30-45 ==="
  sed -n '30,45p' CLAUDE.md
fi

Repository: Aureliolo/ai-company

Length of output: 811


🏁 Script executed:

# Search for pytest configuration files
echo "=== Looking for pytest configuration ==="
fd -e 'pyproject.toml|pytest.ini|tox.ini|setup.cfg' --type f | head -20

Repository: Aureliolo/ai-company

Length of output: 104


🏁 Script executed:

# Check for pytest-xdist references in the codebase
echo "=== Searching for pytest-xdist references ==="
rg 'pytest.*-n|xdist|auto' --type-list | head -5
rg 'pytest.*-n|xdist|auto' -i

Repository: Aureliolo/ai-company

Length of output: 29935


Add -n auto to the e2e quick command for parallelism consistency.

The e2e test shortcut omits pytest-xdist parallelism, while the full suite command and repo documentation both use -n auto. All test execution should be consistent.

Proposed fix
-uv run pytest tests/ -m e2e                # e2e tests only
+uv run pytest tests/ -m e2e -n auto        # e2e tests only

Comment on lines +372 to +389
@pytest.mark.slow
@pytest.mark.timeout(60)
@pytest.mark.skipif(
    os.environ.get("REAL_LLM_TEST") != "1",
    reason="Set REAL_LLM_TEST=1 to run real LLM integration test",
)
class TestRealLLMIntegration:
    """Optional smoke test with a real LLM provider.

    Skipped unless REAL_LLM_TEST=1 is set; not expected to run in CI.
    """

    async def test_real_provider_text_completion(self) -> None:
        """Minimal text-only task with a real provider.

        Placeholder — replace the skip with real provider setup when ready.
        """
        pytest.skip("Real LLM test placeholder — configure a real provider")

⚠️ Potential issue | 🟠 Major

The manual real-LLM path is still unreachable.

Even when REAL_LLM_TEST=1 is set, this class never runs a real smoke path because the only test unconditionally calls pytest.skip(). That misses the linked objective of having an optional manual real-provider run. Either wire a minimal env-driven provider here or drop the claim until the smoke path actually exists.

I can help sketch a minimal env-gated smoke test that keeps CI isolated but makes the manual path real.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/test_single_agent_e2e.py` around lines 372 - 389, The test
TestRealLLMIntegration.test_real_provider_text_completion unconditionally calls
pytest.skip(), so the REAL_LLM_TEST path is never exercised; replace the
unconditional skip with an env-gated minimal smoke path: read REAL_LLM_PROVIDER
(or similar env vars) and if missing call pytest.skip(), otherwise construct a
minimal provider/client using those env vars inside
test_real_provider_text_completion, perform a simple text completion/request via
the existing LLM client or agent helper (e.g., create the client, call its
complete/generate method), and assert on a non-empty/valid response; keep the
test slow/timeout markers but ensure the new logic only runs when
REAL_LLM_TEST=1 and required provider env vars are present.
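The env-gating described in this comment could look roughly like the sketch below. It is only an illustration: `REAL_LLM_TEST` and its skip message come from the PR, while `REAL_LLM_PROVIDER`, `REAL_LLM_API_KEY`, and the helper `real_llm_skip_reason` are hypothetical names chosen for the example, not identifiers from the repository.

```python
import os
from collections.abc import Mapping
from typing import Optional

# Hypothetical variable names for provider settings (assumptions, not from the PR).
REQUIRED_VARS = ("REAL_LLM_PROVIDER", "REAL_LLM_API_KEY")


def real_llm_skip_reason(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a skip message, or None when the real-provider test may run."""
    # Opt-in flag taken from the PR: the test stays off unless explicitly enabled.
    if env.get("REAL_LLM_TEST") != "1":
        return "Set REAL_LLM_TEST=1 to run real LLM integration test"
    # Report every missing setting at once so the skip message is actionable.
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        return "Missing provider settings: " + ", ".join(missing)
    return None
```

Inside the test body, one would presumably call `reason = real_llm_skip_reason()` and pass a non-`None` result to `pytest.skip(reason)`, so the skip becomes conditional and the manual path is reachable once the settings are present.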

… reviewers

- Move received_messages.append() after bounds check in ScriptedProvider (conftest.py)
- Fix double-skip on real LLM test — now env-gated with actionable skip message
- Document SHUTDOWN→INTERRUPTED and ERROR→recovery transitions in DESIGN_SPEC §6.5
- Use content=None for tool-only responses in make_tool_call_response
- Rename TestMaxIterationsExhausted → TestMaxTurnsExhausted (consistent terminology)
- Remove unused read_tool from TestFileToolAgent registry
- Add min conversation length assertion in text-only test
- Add file existence assertions in max-turns test
- Add isinstance protocol assertion for ScriptedProvider
- Improve complete() and stream() docstrings in ScriptedProvider

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
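
The first bullet of the commit message above (recording the messages only after the bounds check) can be illustrated with a small sketch. This is not the `ScriptedProvider` from `conftest.py`: the class name `ScriptedProviderSketch`, the use of plain strings for messages and responses, and the choice of `IndexError` are all assumptions made to keep the example self-contained.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class ScriptedProviderSketch:
    """Hypothetical stand-in that plays back canned responses in order."""

    responses: list[str]
    received_messages: list[list[str]] = field(default_factory=list)
    _index: int = 0

    async def complete(self, messages: list[str]) -> str:
        # Bounds check first: an exhausted script fails before any state is recorded.
        if self._index >= len(self.responses):
            raise IndexError(
                f"script exhausted: call {self._index + 1} requested, "
                f"only {len(self.responses)} responses scripted"
            )
        # Only calls that will actually be answered are recorded.
        self.received_messages.append(list(messages))
        response = self.responses[self._index]
        self._index += 1
        return response
```

With this ordering, `received_messages` always has the same length as the number of responses served, which keeps assertions on the recorded conversation meaningful even after an over-run.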
@Aureliolo Aureliolo merged commit f566fb4 into main Mar 7, 2026
7 checks passed
@Aureliolo Aureliolo deleted the test/e2e-single-agent branch March 7, 2026 14:49
Comment on lines +244 to +246
# Agent recovered successfully
assert result.is_success is True
assert result.total_turns == 2

Missing termination_reason assertion.

TestPermissionDeniedRecovery checks result.is_success is True but omits result.termination_reason, even though all three other test classes (TestFileToolAgent, TestTextOnlyAgent, TestMaxTurnsExhausted) explicitly assert both. This creates a coverage gap: a scenario where the engine erroneously returns TerminationReason.MAX_TURNS or TerminationReason.BUDGET_EXHAUSTED (while setting is_success=True) would not be caught.

Suggested change

```suggestion
        # Agent recovered successfully
        assert result.is_success is True
        assert result.termination_reason == TerminationReason.COMPLETED
        assert result.total_turns == 2
```
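
To make the coverage gap concrete, here is a minimal sketch of an inconsistent result that slips past the original assertions but is caught once `termination_reason` is checked. The `TerminationReason` member names appear in this review; the enum values, the `FakeResult` dataclass, and both helper functions are hypothetical stand-ins for illustration, not the project's actual types.

```python
from dataclasses import dataclass
from enum import Enum


class TerminationReason(Enum):
    # Member names come from the review comment; the string values are assumed.
    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass(frozen=True)
class FakeResult:
    """Hypothetical stand-in for the engine's run result."""

    is_success: bool
    termination_reason: TerminationReason
    total_turns: int


def check_success_only(result: FakeResult) -> None:
    # Mirrors the original assertions in TestPermissionDeniedRecovery.
    assert result.is_success is True
    assert result.total_turns == 2


def check_with_reason(result: FakeResult) -> None:
    # Mirrors the suggested change: also pin the termination reason.
    assert result.is_success is True
    assert result.termination_reason == TerminationReason.COMPLETED
    assert result.total_turns == 2
```

A result such as `FakeResult(True, TerminationReason.MAX_TURNS, 2)` passes `check_success_only` without complaint, whereas `check_with_reason` raises `AssertionError`, which is the regression the reviewer wants the test to catch.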

Aureliolo added a commit that referenced this pull request Mar 10, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.1.1](ai-company-v0.1.0...ai-company-v0.1.1)
(2026-03-10)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Aureliolo added a commit that referenced this pull request Mar 11, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.1.0](v0.0.0...v0.1.0)
(2026-03-11)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add mandatory JWT + API key authentication
([#256](#256))
([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable output scan response policies
([#263](#263))
([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence
([#279](#279))
([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format
([#286](#286))
([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2
([#271](#271))
([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0
([#273](#273))
([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0
([#272](#272))
([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0
([#270](#270))
([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0
([#274](#274))
([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0
([#275](#275))
([29dd16c](29dd16c))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1
([#282](#282))
([2f4703d](2f4703d))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

End-to-end integration test: single agent receives and completes a task

2 participants