test: add e2e single agent integration tests (#24) by Aureliolo · Pull Request #156 · Aureliolo/synthorg

Aureliolo · 2026-03-07T14:37:00Z

Summary

- Add end-to-end tests validating the core single-agent execution pipeline: engine → execution loop → real tools → cost tracking → task lifecycle
- 4 test scenarios: file tool agent (write to disk), text-only completion, permission denial recovery, max-turns exhaustion
- Test infrastructure: ScriptedProvider (mock provider with sequential response playback), factory helpers (make_e2e_identity, make_e2e_task, make_tool_call_response, make_text_response)
- Pre-PR review fixes (7 agents, 10 findings addressed):
  - Add bounds check with descriptive error in ScriptedProvider.complete()
  - Fix docstrings for accuracy (execution loop, file tools, real LLM placeholder)
  - Add is_error is False assertion on success-path tool result
  - Clarify MAX_TURNS inline comment with full transition rule
  - Add SHUTDOWN to DESIGN_SPEC TerminationReason enum listing
  - Add ShutdownChecker to DESIGN_SPEC ExecutionLoop.execute() docs
  - Add e2e test command to CLAUDE.md Quick Commands

Closes #24

Test plan

uv run ruff check src/ tests/ — lint clean
uv run ruff format src/ tests/ — format clean
uv run mypy src/ tests/ — type-check clean (281 files)
uv run pytest tests/ -n auto --cov=ai_company --cov-fail-under=80 — 2476 passed, 96.36% coverage
Pre-commit hooks pass (trailing whitespace, ruff, gitleaks, commitizen)
Pre-reviewed by 7 agents: code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, docs-consistency

Validate the core MVP hypothesis: a single agent can complete a real task end-to-end through the full execution pipeline (engine, ReAct loop, real tools, cost tracking, task lifecycle). Four scenarios: file tool agent (real filesystem I/O), text-only agent, permission denied recovery (CUSTOM access level), and max turns exhaustion. Plus a gated real LLM smoke test placeholder.

Closes #24

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pre-reviewed by 7 agents, 10 findings addressed:
- Add bounds check with descriptive error in ScriptedProvider
- Fix docstrings for accuracy (execution loop, file tools, real LLM)
- Add is_error assertion on success-path tool result
- Clarify MAX_TURNS comment with full transition rule
- Add SHUTDOWN to DESIGN_SPEC TerminationReason enum listing
- Add ShutdownChecker to DESIGN_SPEC ExecutionLoop docs
- Add e2e test command to CLAUDE.md Quick Commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-03-07T14:37:13Z

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

coderabbitai · 2026-03-07T14:37:18Z

Caution

Review failed

The pull request is closed.


📥 Commits

Reviewing files that changed from the base of the PR and between 8d1cbf4 and ccdecda.

📒 Files selected for processing (3)

DESIGN_SPEC.md
tests/e2e/conftest.py
tests/e2e/test_single_agent_e2e.py

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added cooperative shutdown mechanism for execution loop control with explicit shutdown termination handling.
Tests
- Added a comprehensive end-to-end test suite for single-agent workflows, including scripted response mocking, workspace fixtures, tool-call and text-response helpers, and cost tracking validations.
Chores
- Updated Quick Commands snippet to include a command for running only e2e tests.

Walkthrough

Adds end-to-end test infrastructure and a comprehensive single-agent e2e test suite; introduces test fixtures and a ScriptedProvider mock. Updates design spec to add a cooperative ShutdownChecker parameter and a SHUTDOWN termination reason to the execution loop protocol. Also adds a Quick Commands e2e snippet.

Changes

Cohort / File(s)	Summary
Documentation & Quick Commands `CLAUDE.md`	Adds a Quick Commands snippet to run only e2e tests (`uv run pytest tests/ -m e2e`).
Design / Execution Loop API `DESIGN_SPEC.md`, `engine/loop_protocol.py` (signature & enums)	Extends ExecutionLoop.execute(...) to accept an optional `ShutdownChecker`, adds `SHUTDOWN` to `TerminationReason`, and documents cooperative shutdown semantics and post-execution transition rules.
E2E Test Fixtures & Helpers `tests/e2e/conftest.py`	Adds `ScriptedProvider` mock, workspace fixture, identity/task builders (`make_e2e_identity`, `make_e2e_task`), response builders (`make_tool_call_response`, `make_text_response`), and test constants for deterministic e2e tests.
E2E Test Cases `tests/e2e/test_single_agent_e2e.py`	Introduces comprehensive single-agent e2e tests covering file-tool workflow, text-only responses, permission-denied recovery, max-iterations termination, cost tracking assertions, and a gated real-LLM integration scaffold.

Sequence Diagram(s)

sequenceDiagram
    participant TestRunner as Test Runner
    participant AgentEngine as AgentEngine
    participant Provider as ScriptedProvider
    participant ToolRegistry as ToolRegistry
    participant FileTool as FileTools
    participant CostTracker as CostTracker

    TestRunner->>AgentEngine: start execution (Task, Identity)
    AgentEngine->>Provider: request completion (turn N)
    Provider-->>AgentEngine: CompletionResponse (tool_call or text)
    alt tool_call
        AgentEngine->>ToolRegistry: resolve tool call
        ToolRegistry->>FileTool: run tool (e.g., WriteFile)
        FileTool-->>AgentEngine: tool result
    end
    AgentEngine->>Provider: request completion (turn N+1)
    Provider-->>AgentEngine: CompletionResponse (STOP / final)
    AgentEngine->>CostTracker: record usage/costs
    AgentEngine->>AgentEngine: update task state & termination reason
    AgentEngine-->>TestRunner: ExecutionResult (status, metrics, conversation)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

test: add e2e single agent integration tests (#24) #156 — Adds ScriptedProvider and e2e tests and updates ExecutionLoop.execute signature with ShutdownChecker and SHUTDOWN, closely matching these changes.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely summarizes the main change: adding end-to-end integration tests for single-agent workflows.
Description check	✅ Passed	The description comprehensively details the changes, test infrastructure, pre-PR fixes, test plan, and verification results, all related to the e2e test additions.
Linked Issues check	✅ Passed	All four test scenarios from issue `#24` acceptance criteria are implemented: file tools (test_single_agent_writes_to_file), text-only (test_single_agent_text_only_completion), permission denial (test_single_agent_permission_denied_recovery), max iterations (test_single_agent_max_turns_exhausted). Cost tracking and status transitions are validated. Mocked ScriptedProvider avoids real API calls. Optional real LLM test placeholder included.
Out of Scope Changes check	✅ Passed	Design spec documentation updates (DESIGN_SPEC.md) align with PR description's stated fixes and support the test infrastructure. CLAUDE.md e2e command addition supports test execution. All changes directly support the e2e testing objectives.
Docstring Coverage	✅ Passed	Docstring coverage is 93.33% which is sufficient. The required threshold is 80.00%.


gemini-code-assist · 2026-03-07T14:37:19Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the project's testing foundation by introducing a dedicated suite of end-to-end tests for the single-agent execution flow. These tests ensure the core agent functionality, from task processing to tool interaction and lifecycle management, behaves as expected under various conditions, thereby increasing confidence in the system's reliability and correctness.

Highlights

End-to-End Test Suite: Introduced a comprehensive suite of end-to-end tests to validate the single-agent execution pipeline, covering engine, execution loop, real tools, cost tracking, and task lifecycle.
Diverse Test Scenarios: Implemented four distinct test scenarios: an agent using file tools, a text-only completion agent, an agent recovering from permission denial, and an agent hitting the maximum turn limit.
Robust Test Infrastructure: Developed ScriptedProvider for sequential response playback (including a bounds check for calls) and utility functions (make_e2e_identity, make_e2e_task, make_tool_call_response, make_text_response) to facilitate E2E test setup.
Design Specification Updates: Updated DESIGN_SPEC.md to include ShutdownChecker in the ExecutionLoop.execute() signature and added SHUTDOWN to the TerminationReason enum, along with documenting the ShutdownChecker.
Documentation and Code Improvements: Applied several pre-PR review fixes, such as clarifying docstrings for accuracy, adding an is_error is False assertion on success-path tool results, and clarifying the MAX_TURNS inline comment.
E2E Test Command: Added a new uv run pytest tests/ -m e2e command to CLAUDE.md for easily running end-to-end tests.

Changelog

CLAUDE.md
- Added a command to run e2e tests.
DESIGN_SPEC.md
- Updated the execute method signature to include an optional ShutdownChecker.
- Expanded the TerminationReason enum with a SHUTDOWN state.
- Documented the ShutdownChecker callback type.
tests/e2e/conftest.py
- Introduced ScriptedProvider for mocking LLM responses in tests.
- Added e2e_workspace fixture for isolated file system operations.
- Provided helper functions for creating AgentIdentity, Task, and CompletionResponse objects for E2E tests.
tests/e2e/test_single_agent_e2e.py
- Added a new file containing end-to-end tests for single-agent scenarios.
- Implemented tests for file tool usage, text-only completion, permission denial recovery, and max turns exhaustion.

Activity

Verified linting, formatting, and type-checking are clean.
Confirmed all pytest suites passed, achieving 96.36% code coverage.
Ensured pre-commit hooks ran successfully.
The changes were pre-reviewed by 7 different AI agents.


¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable suite of end-to-end tests for the single-agent execution pipeline. The tests are well-designed, covering various scenarios including successful tool use, text-only completion, permission denial recovery, and turn exhaustion. The testing infrastructure, including the ScriptedProvider and factory helpers, is robust and will make future e2e testing easier. The documentation updates are also accurate. I have one minor suggestion to improve the clarity of one of the new tests.

gemini-code-assist · 2026-03-07T14:38:25Z

tests/e2e/test_single_agent_e2e.py

+        """Agent writes a file to disk, then completes with a summary."""
+        write_tool = WriteFileTool(workspace_root=e2e_workspace)
+        read_tool = ReadFileTool(workspace_root=e2e_workspace)
+        registry = ToolRegistry([write_tool, read_tool])


The read_tool is registered here but is not used in this test scenario. The scripted agent behavior only involves the write_file tool. To improve clarity and remove unnecessary setup, you can remove read_tool from the registry. You can also remove its initialization on the preceding line.

Suggested change

-        registry = ToolRegistry([write_tool, read_tool])
+        registry = ToolRegistry([write_tool])

greptile-apps · 2026-03-07T14:40:21Z

Greptile Summary

This PR adds end-to-end integration tests that validate the full single-agent execution pipeline — from AgentEngine through the ReActLoop, real file system tools, cost tracking, and task lifecycle transitions — using a ScriptedProvider mock that plays back pre-defined LLM responses sequentially. It also updates DESIGN_SPEC.md to document SHUTDOWN as a TerminationReason and ShutdownChecker as an execute() parameter, and adds the e2e pytest marker command to CLAUDE.md.

Four test scenarios: file tool write (real disk I/O), text-only single-turn completion, permission denial recovery, and MAX_TURNS exhaustion — each asserting result fields, task lifecycle transitions, conversation structure, and cost tracking consistency.

ScriptedProvider: Clean sequential mock with correctly-ordered bounds check (index check before append), and call_count / received_messages tracking for post-test inspection.

Minor gap: TestPermissionDeniedRecovery omits the result.termination_reason assertion that the other three test classes include, leaving a small coverage hole.

Confidence Score: 5/5

This PR is safe to merge — it adds tests and documentation with no changes to production source code.
All changes are confined to test infrastructure (tests/e2e/) and documentation (CLAUDE.md, DESIGN_SPEC.md). No production code is modified. The identified finding is a straightforward test-completeness improvement (missing assertion in one test method to match the pattern in others) with no functional impact. The PR description confirms full CI passage: lint, type-check (strict, 281 files), 2476 tests at 96.36% coverage, and pre-commit hooks.
No files require special attention — the finding is in test_single_agent_e2e.py and is a non-blocking style improvement.

Sequence Diagram

sequenceDiagram
    participant Test
    participant Engine as AgentEngine
    participant ReactLoop as ReActLoop
    participant SP as ScriptedProvider
    participant TI as ToolInvoker
    participant CT as CostTracker
    Test->>Engine: run(identity, task, max_turns)
    Engine->>ReactLoop: execute(context, provider, tool_invoker)
    ReactLoop->>SP: complete(messages, model)
    SP-->>ReactLoop: CompletionResponse TOOL_USE
    ReactLoop->>TI: invoke(tool_call)
    TI-->>ReactLoop: ToolResult
    ReactLoop->>SP: complete(messages, model)
    SP-->>ReactLoop: CompletionResponse STOP
    ReactLoop-->>Engine: ExecutionResult COMPLETED
    Engine->>CT: record(TokenUsage)
    Engine->>Engine: task transition to COMPLETED
    Engine-->>Test: AgentRunResult
    Test->>Test: assert result, filesystem, lifecycle, costs

_{Last reviewed commit: ccdecda}


Copilot

Pull request overview

Adds an end-to-end (e2e) test suite that exercises the single-agent execution pipeline (engine → loop → real tools → cost tracking → task lifecycle), plus small doc updates to reflect the new testing workflow and termination reasons.

Changes:

Introduce 4 e2e scenarios (file write, text-only completion, permission denial recovery, max-turns exhaustion) using real file tools.
Add e2e test infrastructure (ScriptedProvider + factory helpers) under tests/e2e/.
Update DESIGN_SPEC.md and CLAUDE.md to reflect shutdown termination/docs and add an e2e pytest command.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
tests/e2e/test_single_agent_e2e.py	New end-to-end scenarios validating the full single-agent execution pipeline.
tests/e2e/conftest.py	New e2e fixtures + scripted completion provider + response factory helpers.
DESIGN_SPEC.md	Documentation update: include `SHUTDOWN` termination and `ShutdownChecker` in loop API docs.
CLAUDE.md	Add quick command for running e2e tests via pytest marker.


Copilot · 2026-03-07T14:40:50Z

tests/e2e/test_single_agent_e2e.py

+from typing import TYPE_CHECKING
+
+import pytest
+
+if TYPE_CHECKING:
+    from pathlib import Path
+


Path is only imported under TYPE_CHECKING, but this repo’s tests commonly import Path at runtime because pytest evaluates annotations (see e.g. tests/unit/tools/git/conftest.py:5). With the current pattern, anything that resolves annotations (pytest/plugins/typing.get_type_hints) can raise NameError: Path is not defined. Import Path at runtime (optionally with # noqa: TC003) and drop the TYPE_CHECKING block here.

Suggested change

-from typing import TYPE_CHECKING
-
-import pytest
-
-if TYPE_CHECKING:
-    from pathlib import Path
+from pathlib import Path
+
+import pytest

Copilot · 2026-03-07T14:40:51Z

tests/e2e/conftest.py

+if TYPE_CHECKING:
+    from collections.abc import AsyncIterator
+    from pathlib import Path
+


AsyncIterator/Path are imported only under TYPE_CHECKING, but this repo’s test suite imports annotation types at runtime because pytest evaluates them (see tests/unit/tools/git/conftest.py:5). Keeping these imports type-checking-only risks NameError if annotations are resolved. Import Path/AsyncIterator at runtime (optionally with # noqa: TC003) and remove the TYPE_CHECKING block.
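The failure mode raised in these two comments can be reproduced in isolation. The sketch below is not taken from the PR: `read_workspace` is a hypothetical function, and the annotation is written as a string so the behaviour is the same on Python versions without lazy annotations.

```python
import typing
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Visible to type checkers only; the name does not exist at runtime.
    from pathlib import Path


def read_workspace(root: "Path") -> None:
    """Hypothetical function whose annotation needs Path to be resolvable."""


# Anything that resolves annotations hits the missing name.
try:
    typing.get_type_hints(read_workspace)
    resolved_before = True
except NameError:
    resolved_before = False

# A runtime import makes the same call succeed.
from pathlib import Path  # noqa: E402

hints = typing.get_type_hints(read_workspace)
```

Whether pytest itself resolves these annotations depends on the plugins in use, so the sketch only demonstrates the mechanism, not that the PR's tests would fail.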

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@CLAUDE.md`:
- Line 37: Update the e2e quick command string that currently reads "uv run
pytest tests/ -m e2e" to include pytest-xdist parallelism by adding "-n auto" so
it matches the full-suite and docs; locate and modify the command in CLAUDE.md
(the e2e shortcut line) to read the same invocation with "-n auto" appended.

In `@DESIGN_SPEC.md`:
- Around line 823-831: The spec currently lists TerminationReason.SHUTDOWN but
the orchestrator pipeline only treats COMPLETED as changing task state; update
the orchestrator and AgentEngine/execute documentation so that when execute(...)
returns an ExecutionResult with TerminationReason.SHUTDOWN the orchestrator
transitions the task state to INTERRUPTED (same as §6.7 requires) instead of
leaving it IN_PROGRESS; specifically, thread TerminationReason.SHUTDOWN through
the orchestrator's task state transition logic (the code/docs describing how
ExecutionResult is handled), and update any place that lists only COMPLETED as
state-changing to include SHUTDOWN -> INTERRUPTED so ShutdownChecker,
ExecutionResult, and Task state transition behavior are consistent.

In `@tests/e2e/conftest.py`:
- Around line 154-172: The fixture make_tool_call_response currently builds a
CompletionResponse with content="" which misrepresents pure tool-use turns;
update make_tool_call_response to pass content=None to CompletionResponse (leave
finish_reason=FinishReason.TOOL_USE, usage, model=_TEST_MODEL, and tool_calls
as-is) so the test accurately simulates tool-only assistant responses and
surfaces code paths that treat None differently from an empty string.

In `@tests/e2e/test_single_agent_e2e.py`:
- Around line 372-389: The test
TestRealLLMIntegration.test_real_provider_text_completion unconditionally calls
pytest.skip(), so the REAL_LLM_TEST path is never exercised; replace the
unconditional skip with an env-gated minimal smoke path: read REAL_LLM_PROVIDER
(or similar env vars) and if missing call pytest.skip(), otherwise construct a
minimal provider/client using those env vars inside
test_real_provider_text_completion, perform a simple text completion/request via
the existing LLM client or agent helper (e.g., create the client, call its
complete/generate method), and assert on a non-empty/valid response; keep the
test slow/timeout markers but ensure the new logic only runs when
REAL_LLM_TEST=1 and required provider env vars are present.
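The `DESIGN_SPEC.md` finding above asks for `SHUTDOWN` to drive a task-state transition alongside `COMPLETED`. One way to picture that rule is a small lookup table. This is a sketch under assumptions: `TaskStatus`, `next_task_status`, and the choice to leave every other reason unchanged are illustrative and not confirmed by the spec text quoted in this thread.

```python
from enum import Enum


class TerminationReason(Enum):
    """Reasons named in this PR thread (illustrative subset)."""

    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"
    SHUTDOWN = "shutdown"


class TaskStatus(Enum):
    """Hypothetical task states used for this sketch."""

    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    INTERRUPTED = "interrupted"


# Only the reasons listed here change the task state.
_TRANSITIONS: dict[TerminationReason, TaskStatus] = {
    TerminationReason.COMPLETED: TaskStatus.COMPLETED,
    TerminationReason.SHUTDOWN: TaskStatus.INTERRUPTED,
}


def next_task_status(current: TaskStatus, reason: TerminationReason) -> TaskStatus:
    """Return the post-execution state, leaving unmapped reasons untouched."""
    return _TRANSITIONS.get(reason, current)
```

Keeping the mapping in one table makes it easy to see which reasons are state-changing, which is the inconsistency the comment points at.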


📥 Commits

Reviewing files that changed from the base of the PR and between d1fe1fb and 8d1cbf4.

📒 Files selected for processing (4)

CLAUDE.md
DESIGN_SPEC.md
tests/e2e/conftest.py
tests/e2e/test_single_agent_e2e.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: Agent
GitHub Check: Greptile Review

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Do not use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax without parentheses for exception handling — ruff enforces PEP 758 on Python 3.14
All public functions and classes must have type hints with strict mypy compliance
Use Google-style docstrings on all public classes and functions — enforced by ruff D rules
Every module with business logic must include: from ai_company.observability import get_logger then logger = get_logger(__name__)
Never use import logging, logging.getLogger(), or print() in application code — use the project logger instead
Always use logger as the variable name for loggers — not _logger or log
Use event name constants from ai_company.observability.events.<domain> instead of string literals for log events
Use structured logging format: logger.info(EVENT, key=value) — never use string formatting like logger.info('msg %s', val)
All error paths must log at WARNING or ERROR with context before raising exceptions
All state transitions must be logged at INFO level
Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
Create new objects instead of mutating existing ones — never mutate objects
For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction and wrap with MappingProxyType for read-only enforcement
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Never mix static config fields with mutable runtime fields in a single Pydantic model
Use NotBlankStr from core.types for all identifier and name fields in Pydantic models, including optional and tuple variants, instead of manual whitespace validators
Use @computed_field in Pydantic models for derived values instead of storing and validating redundant fields
Use model_copy(update=...) for evolving runtime state in Pydantic models
Use `copy.deepcop...

Files:

tests/e2e/test_single_agent_e2e.py
tests/e2e/conftest.py

tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Mark all tests with appropriate markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow
Tests must not use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) — use generic names: example-provider, example-large-001, example-medium-001, example-small-001, or test-provider, test-small-001
Prefer @pytest.mark.parametrize for testing similar cases
Default async pytest mode is asyncio_mode = 'auto' — no manual @pytest.mark.asyncio needed
Test timeout is 30 seconds per test
Use pytest-xdist parallelism via -n auto for test execution

Files:

tests/e2e/test_single_agent_e2e.py
tests/e2e/conftest.py

🧠 Learnings (3)

📚 Learning: 2026-03-07T13:46:59.933Z

Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Dependency groups: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen). Install with `uv sync`

Applied to files:

CLAUDE.md

📚 Learning: 2026-03-07T13:46:59.933Z

Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Applies to tests/**/*.py : Mark all tests with appropriate markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, or `pytest.mark.slow`

Applied to files:

CLAUDE.md

📚 Learning: 2026-03-07T13:46:59.933Z

Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Applies to tests/**/*.py : Use `pytest-xdist` parallelism via `-n auto` for test execution

Applied to files:

CLAUDE.md

🧬 Code graph analysis (1)

tests/e2e/conftest.py (2)

src/ai_company/providers/enums.py (1)

FinishReason (15-22)

src/ai_company/providers/models.py (5)

CompletionResponse (257-306)

StreamChunk (309-387)

TokenUsage (12-35)

ToolCall (96-119)

ToolDefinition (64-93)

coderabbitai · 2026-03-07T14:42:24Z

CLAUDE.md

 uv run mypy src/ tests/                    # type-check (strict)
 uv run pytest tests/ -m unit               # unit tests only
 uv run pytest tests/ -m integration        # integration tests only
+uv run pytest tests/ -m e2e                # e2e tests only


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if CLAUDE.md exists and view the context around line 37
if [ -f CLAUDE.md ]; then
  echo "=== CLAUDE.md exists, viewing lines 30-45 ==="
  sed -n '30,45p' CLAUDE.md
fi

Repository: Aureliolo/ai-company

Length of output: 811

🏁 Script executed:

# Search for pytest configuration files
echo "=== Looking for pytest configuration ==="
fd -e 'pyproject.toml|pytest.ini|tox.ini|setup.cfg' --type f | head -20

Repository: Aureliolo/ai-company

Length of output: 104

🏁 Script executed:

# Check for pytest-xdist references in the codebase
echo "=== Searching for pytest-xdist references ==="
rg 'pytest.*-n|xdist|auto' --type-list | head -5
rg 'pytest.*-n|xdist|auto' -i

Repository: Aureliolo/ai-company

Length of output: 29935

Add -n auto to the e2e quick command for parallelism consistency.

The e2e test shortcut omits pytest-xdist parallelism, while the full suite command and repo documentation both use -n auto. All test execution should be consistent.

Proposed fix

-uv run pytest tests/ -m e2e                # e2e tests only
+uv run pytest tests/ -m e2e -n auto        # e2e tests only



coderabbitai · 2026-03-07T14:42:24Z

tests/e2e/test_single_agent_e2e.py

+@pytest.mark.slow
+@pytest.mark.timeout(60)
+@pytest.mark.skipif(
+    os.environ.get("REAL_LLM_TEST") != "1",
+    reason="Set REAL_LLM_TEST=1 to run real LLM integration test",
+)
+class TestRealLLMIntegration:
+    """Optional smoke test with a real LLM provider.
+
+    Skipped unless REAL_LLM_TEST=1 is set; not expected to run in CI.
+    """
+
+    async def test_real_provider_text_completion(self) -> None:
+        """Minimal text-only task with a real provider.
+
+        Placeholder — replace the skip with real provider setup when ready.
+        """
+        pytest.skip("Real LLM test placeholder — configure a real provider")


⚠️ Potential issue | 🟠 Major

The manual real-LLM path is still unreachable.

Even when REAL_LLM_TEST=1 is set, this class never runs a real smoke path because the only test unconditionally calls pytest.skip(). That misses the linked objective of having an optional manual real-provider run. Either wire a minimal env-driven provider here or drop the claim until the smoke path actually exists.

I can help sketch a minimal env-gated smoke test that keeps CI isolated but makes the manual path real.


… reviewers

- Move received_messages.append() after bounds check in ScriptedProvider (conftest.py)
- Fix double-skip on real LLM test (now env-gated with actionable skip message)
- Document SHUTDOWN→INTERRUPTED and ERROR→recovery transitions in DESIGN_SPEC §6.5
- Use content=None for tool-only responses in make_tool_call_response
- Rename TestMaxIterationsExhausted → TestMaxTurnsExhausted (consistent terminology)
- Remove unused read_tool from TestFileToolAgent registry
- Add min conversation length assertion in text-only test
- Add file existence assertions in max-turns test
- Add isinstance protocol assertion for ScriptedProvider
- Improve complete() and stream() docstrings in ScriptedProvider

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

greptile-apps · 2026-03-07T14:55:03Z

tests/e2e/test_single_agent_e2e.py

+        # Agent recovered successfully
+        assert result.is_success is True
+        assert result.total_turns == 2


Missing termination_reason assertion.

TestPermissionDeniedRecovery checks result.is_success is True but omits result.termination_reason, even though all three other test classes (TestFileToolAgent, TestTextOnlyAgent, TestMaxTurnsExhausted) explicitly assert both. This creates a coverage gap: a scenario where the engine erroneously returns TerminationReason.MAX_TURNS or TerminationReason.BUDGET_EXHAUSTED (while setting is_success=True) would not be caught.

Suggested change

-        # Agent recovered successfully
-        assert result.is_success is True
-        assert result.total_turns == 2
+        # Agent recovered successfully
+        assert result.is_success is True
+        assert result.termination_reason == TerminationReason.COMPLETED
+        assert result.total_turns == 2
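To make the coverage gap concrete, here is a small self-contained sketch. The `TerminationReason` enum and `FakeResult` dataclass below are stand-ins invented for illustration (the member values and field layout are assumptions, not the project's real types); only the three asserted attributes come from the review comment.

```python
from dataclasses import dataclass
from enum import Enum


class TerminationReason(Enum):  # stand-in for the project's enum
    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass(frozen=True)
class FakeResult:  # hypothetical result shape with only the asserted fields
    is_success: bool
    termination_reason: TerminationReason
    total_turns: int


def check_recovered(result: FakeResult, expected_turns: int) -> None:
    """Assert success, the termination reason, and the turn count together."""
    assert result.is_success is True
    assert result.termination_reason == TerminationReason.COMPLETED
    assert result.total_turns == expected_turns
```

An inconsistent result such as `FakeResult(True, TerminationReason.MAX_TURNS, 2)` passes the two original assertions but is rejected once the termination reason is checked as well.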


🤖 I have created a release *beep* *boop* --- ## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider 
resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * 
implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs 
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, 
contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, 
loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

🤖 I have created a release *beep* *boop* --- ## [0.1.0](v0.0.0...v0.1.0) (2026-03-11) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add mandatory JWT + API key authentication ([#256](#256)) ([c279cfe](c279cfe)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable output scan response policies ([#263](#263)) ([b9907e8](b9907e8)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive 
trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement AuditRepository for security audit log persistence ([#279](#279)) ([94bc29f](94bc29f)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry 
([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into 
observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * resolve circular imports, bump litellm, fix release tag format ([#286](#286)) ([a6659b5](a6659b5)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### 
Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * bump anchore/scan-action from 6.5.1 to 7.3.2 ([#271](#271)) ([80a1c15](80a1c15)) * bump docker/build-push-action from 6.19.2 to 7.0.0 ([#273](#273)) ([dd0219e](dd0219e)) * bump docker/login-action from 3.7.0 to 4.0.0 ([#272](#272)) ([33d6238](33d6238)) * bump docker/metadata-action from 5.10.0 to 6.0.0 ([#270](#270)) ([baee04e](baee04e)) * bump docker/setup-buildx-action from 3.12.0 to 4.0.0 ([#274](#274)) ([5fc06f7](5fc06f7)) * bump sigstore/cosign-installer from 3.9.1 to 4.1.0 ([#275](#275)) ([29dd16c](29dd16c)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree 
management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * **main:** release ai-company 0.1.1 ([#282](#282)) ([2f4703d](2f4703d)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>

Aureliolo and others added 2 commits March 7, 2026 15:25

Copilot AI review requested due to automatic review settings March 7, 2026 14:37

Copilot started reviewing on behalf of Aureliolo March 7, 2026 14:37

gemini-code-assist bot reviewed Mar 7, 2026

greptile-apps bot reviewed Mar 7, 2026

tests/e2e/conftest.py (outdated, resolved)

tests/e2e/test_single_agent_e2e.py (outdated, resolved)

Copilot AI reviewed Mar 7, 2026

coderabbitai bot reviewed Mar 7, 2026

Aureliolo merged commit f566fb4 into main Mar 7, 2026
7 checks passed

Aureliolo deleted the test/e2e-single-agent branch March 7, 2026 14:49

greptile-apps bot reviewed Mar 7, 2026

Aureliolo mentioned this pull request Mar 10, 2026

chore(main): release ai-company 0.1.1 #282

Merged

Aureliolo mentioned this pull request Mar 10, 2026

chore(main): release 0.1.0 #283

Merged

This was referenced Mar 15, 2026

chore(main): release 0.2.4 #431

Merged

chore(main): release 0.2.0 #442

Closed

chore(main): release 0.2.5 #447

Merged

chore(main): release 0.2.0 #460

Closed

chore(main): release 0.2.0 #471

Closed

	registry = ToolRegistry([write_tool, read_tool])
	registry = ToolRegistry([write_tool])

	uv run pytest tests/ -m e2e # e2e tests only
	uv run pytest tests/ -m e2e -n auto # e2e tests only


Greptile Summary

Confidence Score: 5/5


2 participants
