test: add e2e single agent integration tests (#24)#156
Validate the core MVP hypothesis: a single agent can complete a real task end-to-end through the full execution pipeline (engine, ReAct loop, real tools, cost tracking, task lifecycle).
Four scenarios: file tool agent (real filesystem I/O), text-only agent, permission denied recovery (CUSTOM access level), and max turns exhaustion. Plus a gated real LLM smoke test placeholder.
Closes #24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-reviewed by 7 agents, 10 findings addressed:
- Add bounds check with descriptive error in ScriptedProvider
- Fix docstrings for accuracy (execution loop, file tools, real LLM)
- Add is_error assertion on success-path tool result
- Clarify MAX_TURNS comment with full transition rule
- Add SHUTDOWN to DESIGN_SPEC TerminationReason enum listing
- Add ShutdownChecker to DESIGN_SPEC ExecutionLoop docs
- Add e2e test command to CLAUDE.md Quick Commands
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
📝 Walkthrough (summary by CodeRabbit)
Adds end-to-end test infrastructure and a comprehensive single-agent e2e test suite; introduces test fixtures and a ScriptedProvider mock. Updates the design spec to add a cooperative ShutdownChecker parameter and a SHUTDOWN termination reason to the execution loop protocol. Also adds a Quick Commands e2e snippet.
Sequence Diagram(s)
sequenceDiagram
participant TestRunner as Test Runner
participant AgentEngine as AgentEngine
participant Provider as ScriptedProvider
participant ToolRegistry as ToolRegistry
participant FileTool as FileTools
participant CostTracker as CostTracker
TestRunner->>AgentEngine: start execution (Task, Identity)
AgentEngine->>Provider: request completion (turn N)
Provider-->>AgentEngine: CompletionResponse (tool_call or text)
alt tool_call
AgentEngine->>ToolRegistry: resolve tool call
ToolRegistry->>FileTool: run tool (e.g., WriteFile)
FileTool-->>AgentEngine: tool result
end
AgentEngine->>Provider: request completion (turn N+1)
Provider-->>AgentEngine: CompletionResponse (STOP / final)
AgentEngine->>CostTracker: record usage/costs
AgentEngine->>AgentEngine: update task state & termination reason
AgentEngine-->>TestRunner: ExecutionResult (status, metrics, conversation)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed
Summary of Changes (Gemini Code Assist)
This pull request adds a dedicated suite of end-to-end tests for the single-agent execution flow. The tests check that core agent functionality, from task processing to tool interaction and lifecycle management, behaves as expected under various conditions.
Code Review
This pull request introduces a valuable suite of end-to-end tests for the single-agent execution pipeline. The tests are well-designed, covering various scenarios including successful tool use, text-only completion, permission denial recovery, and turn exhaustion. The testing infrastructure, including the ScriptedProvider and factory helpers, is robust and will make future e2e testing easier. The documentation updates are also accurate. I have one minor suggestion to improve the clarity of one of the new tests.
tests/e2e/test_single_agent_e2e.py
Outdated
    """Agent writes a file to disk, then completes with a summary."""
    write_tool = WriteFileTool(workspace_root=e2e_workspace)
    read_tool = ReadFileTool(workspace_root=e2e_workspace)
    registry = ToolRegistry([write_tool, read_tool])
The read_tool is registered here but is not used in this test scenario. The scripted agent behavior only involves the write_file tool. To improve clarity and remove unnecessary setup, you can remove read_tool from the registry. You can also remove its initialization on the preceding line.
    -registry = ToolRegistry([write_tool, read_tool])
    +registry = ToolRegistry([write_tool])
Greptile Summary
This PR adds end-to-end integration tests that validate the full single-agent execution pipeline. Four test scenarios: file tool write (real disk I/O), text-only single-turn completion, permission denial recovery, and max-turns exhaustion.
Minor gap: Confidence Score: 5/5
Sequence Diagram
sequenceDiagram
participant Test
participant Engine as AgentEngine
participant ReactLoop as ReActLoop
participant SP as ScriptedProvider
participant TI as ToolInvoker
participant CT as CostTracker
Test->>Engine: run(identity, task, max_turns)
Engine->>ReactLoop: execute(context, provider, tool_invoker)
ReactLoop->>SP: complete(messages, model)
SP-->>ReactLoop: CompletionResponse TOOL_USE
ReactLoop->>TI: invoke(tool_call)
TI-->>ReactLoop: ToolResult
ReactLoop->>SP: complete(messages, model)
SP-->>ReactLoop: CompletionResponse STOP
ReactLoop-->>Engine: ExecutionResult COMPLETED
Engine->>CT: record(TokenUsage)
Engine->>Engine: task transition to COMPLETED
Engine-->>Test: AgentRunResult
Test->>Test: assert result, filesystem, lifecycle, costs
Last reviewed commit: ccdecda
Pull request overview
Adds an end-to-end (e2e) test suite that exercises the single-agent execution pipeline (engine → loop → real tools → cost tracking → task lifecycle), plus small doc updates to reflect the new testing workflow and termination reasons.
Changes:
- Introduce 4 e2e scenarios (file write, text-only completion, permission denial recovery, max-turns exhaustion) using real file tools.
- Add e2e test infrastructure (`ScriptedProvider` + factory helpers) under `tests/e2e/`.
- Update `DESIGN_SPEC.md` and `CLAUDE.md` to reflect shutdown termination/docs and add an e2e pytest command.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/e2e/test_single_agent_e2e.py | New end-to-end scenarios validating the full single-agent execution pipeline. |
| tests/e2e/conftest.py | New e2e fixtures + scripted completion provider + response factory helpers. |
| DESIGN_SPEC.md | Documentation update: include SHUTDOWN termination and ShutdownChecker in loop API docs. |
| CLAUDE.md | Add quick command for running e2e tests via pytest marker. |
    from typing import TYPE_CHECKING

    import pytest

    if TYPE_CHECKING:
        from pathlib import Path
Path is only imported under TYPE_CHECKING, but this repo’s tests commonly import Path at runtime because pytest evaluates annotations (see e.g. tests/unit/tools/git/conftest.py:5). With the current pattern, anything that resolves annotations (pytest/plugins/typing.get_type_hints) can raise NameError: Path is not defined. Import Path at runtime (optionally with # noqa: TC003) and drop the TYPE_CHECKING block here.
    -from typing import TYPE_CHECKING
    -import pytest
    -if TYPE_CHECKING:
    -    from pathlib import Path
    +from pathlib import Path
    +import pytest
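The failure mode the reviewer describes can be reproduced in isolation. The sketch below is a standalone illustration, not code from this repository: `read_config` and `hints_resolve` are hypothetical names, and the annotation is written as a string so the module imports cleanly on any recent Python version.

```python
from typing import TYPE_CHECKING, get_type_hints

if TYPE_CHECKING:
    # Visible to static type checkers only; never imported at runtime.
    from pathlib import Path


def read_config(path: "Path") -> str:
    """Hypothetical function annotated with a type-checking-only name."""
    return str(path)


def hints_resolve(func) -> bool:
    """Return True if the function's annotations can be resolved at runtime."""
    try:
        get_type_hints(func)
    except NameError:
        # "Path" is missing from the module globals, so resolution fails.
        return False
    return True


resolved = hints_resolve(read_config)
```

Calling the function works either way; only tools that resolve annotations at runtime hit the `NameError`, which is why a runtime import avoids the problem.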
    if TYPE_CHECKING:
        from collections.abc import AsyncIterator
        from pathlib import Path
AsyncIterator/Path are imported only under TYPE_CHECKING, but this repo’s test suite imports annotation types at runtime because pytest evaluates them (see tests/unit/tools/git/conftest.py:5). Keeping these imports type-checking-only risks NameError if annotations are resolved. Import Path/AsyncIterator at runtime (optionally with # noqa: TC003) and remove the TYPE_CHECKING block.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@CLAUDE.md`:
- Line 37: Update the e2e quick command string that currently reads "uv run
pytest tests/ -m e2e" to include pytest-xdist parallelism by adding "-n auto" so
it matches the full-suite and docs; locate and modify the command in CLAUDE.md
(the e2e shortcut line) to read the same invocation with "-n auto" appended.
In `@DESIGN_SPEC.md`:
- Around line 823-831: The spec currently lists TerminationReason.SHUTDOWN but
the orchestrator pipeline only treats COMPLETED as changing task state; update
the orchestrator and AgentEngine/execute documentation so that when execute(...)
returns an ExecutionResult with TerminationReason.SHUTDOWN the orchestrator
transitions the task state to INTERRUPTED (same as §6.7 requires) instead of
leaving it IN_PROGRESS; specifically, thread TerminationReason.SHUTDOWN through
the orchestrator's task state transition logic (the code/docs describing how
ExecutionResult is handled), and update any place that lists only COMPLETED as
state-changing to include SHUTDOWN -> INTERRUPTED so ShutdownChecker,
ExecutionResult, and Task state transition behavior are consistent.
In `@tests/e2e/conftest.py`:
- Around line 154-172: The fixture make_tool_call_response currently builds a
CompletionResponse with content="" which misrepresents pure tool-use turns;
update make_tool_call_response to pass content=None to CompletionResponse (leave
finish_reason=FinishReason.TOOL_USE, usage, model=_TEST_MODEL, and tool_calls
as-is) so the test accurately simulates tool-only assistant responses and
surfaces code paths that treat None differently from an empty string.
In `@tests/e2e/test_single_agent_e2e.py`:
- Around line 372-389: The test
TestRealLLMIntegration.test_real_provider_text_completion unconditionally calls
pytest.skip(), so the REAL_LLM_TEST path is never exercised; replace the
unconditional skip with an env-gated minimal smoke path: read REAL_LLM_PROVIDER
(or similar env vars) and if missing call pytest.skip(), otherwise construct a
minimal provider/client using those env vars inside
test_real_provider_text_completion, perform a simple text completion/request via
the existing LLM client or agent helper (e.g., create the client, call its
complete/generate method), and assert on a non-empty/valid response; keep the
test slow/timeout markers but ensure the new logic only runs when
REAL_LLM_TEST=1 and required provider env vars are present.
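The DESIGN_SPEC finding above asks for a consistent mapping from termination reason to task state. A minimal sketch of that rule follows; the enum members and the `next_task_status` helper are assumptions for illustration, and only the two transitions named in the review (COMPLETED and SHUTDOWN) are modelled.

```python
from enum import Enum


class TerminationReason(Enum):
    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"
    SHUTDOWN = "shutdown"
    ERROR = "error"


class TaskStatus(Enum):
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    INTERRUPTED = "interrupted"


# Only these reasons change the task state in this sketch.
_TRANSITIONS = {
    TerminationReason.COMPLETED: TaskStatus.COMPLETED,
    TerminationReason.SHUTDOWN: TaskStatus.INTERRUPTED,
}


def next_task_status(current: TaskStatus, reason: TerminationReason) -> TaskStatus:
    """Return the task status after a run ends; unmapped reasons keep it."""
    return _TRANSITIONS.get(reason, current)


after_shutdown = next_task_status(TaskStatus.IN_PROGRESS, TerminationReason.SHUTDOWN)
after_max_turns = next_task_status(TaskStatus.IN_PROGRESS, TerminationReason.MAX_TURNS)
```

Keeping the rule in one table makes it harder for the spec and the orchestrator to drift apart. Error recovery is deliberately left out here, since this page does not describe how it works.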
📒 Files selected for processing (4): CLAUDE.md, DESIGN_SPEC.md, tests/e2e/conftest.py, tests/e2e/test_single_agent_e2e.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`**/*.py`:
- Do not use `from __future__ import annotations`; Python 3.14 has PEP 649 native lazy annotations
- Use `except A, B:` syntax without parentheses for exception handling; ruff enforces PEP 758 on Python 3.14
- All public functions and classes must have type hints with strict mypy compliance
- Use Google-style docstrings on all public classes and functions (enforced by ruff D rules)
- Every module with business logic must include: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
- Never use `import logging`, `logging.getLogger()`, or `print()` in application code; use the project logger instead
- Always use `logger` as the variable name for loggers, not `_logger` or `log`
- Use event name constants from `ai_company.observability.events.<domain>` instead of string literals for log events
- Use structured logging format: `logger.info(EVENT, key=value)`; never use string formatting like `logger.info('msg %s', val)`
- All error paths must log at WARNING or ERROR with context before raising exceptions
- All state transitions must be logged at INFO level
- Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
- Create new objects instead of mutating existing ones; never mutate objects
- For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction and wrap with `MappingProxyType` for read-only enforcement
- Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
- Never mix static config fields with mutable runtime fields in a single Pydantic model
- Use `NotBlankStr` from `core.types` for all identifier and name fields in Pydantic models, including optional and tuple variants, instead of manual whitespace validators
- Use `@computed_field` in Pydantic models for derived values instead of storing and validating redundant fields
- Use `model_copy(update=...)` for evolving runtime state in Pydantic models
- Use `copy.deepcop...
Files:
- `tests/e2e/test_single_agent_e2e.py`
- `tests/e2e/conftest.py`
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/**/*.py`:
- Mark all tests with appropriate markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow`
- Tests must not use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.); use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, or `test-provider`, `test-small-001`
- Prefer `@pytest.mark.parametrize` for testing similar cases
- Default async pytest mode is `asyncio_mode = 'auto'`; no manual `@pytest.mark.asyncio` needed
- Test timeout is 30 seconds per test
- Use `pytest-xdist` parallelism via `-n auto` for test execution
Files:
- `tests/e2e/test_single_agent_e2e.py`
- `tests/e2e/conftest.py`
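One of the rules quoted above (deep-copy at construction, then wrap in `MappingProxyType`) can be illustrated with the standard library alone. This is a sketch under assumptions: `ReadOnlyRegistry` and its `entries` property are hypothetical names, not the project's registry classes.

```python
import copy
from types import MappingProxyType


class ReadOnlyRegistry:
    """Illustrative registry that cannot be changed after construction."""

    def __init__(self, entries: dict[str, dict[str, str]]) -> None:
        # The deep copy isolates the registry from later changes to the
        # caller's dict; the proxy blocks adding, replacing, or deleting keys.
        self._entries = MappingProxyType(copy.deepcopy(entries))

    @property
    def entries(self) -> MappingProxyType:
        return self._entries


source = {"write_file": {"access": "write"}}
registry = ReadOnlyRegistry(source)

# Mutating the caller's dict afterwards does not leak into the registry.
source["write_file"]["access"] = "none"

try:
    registry.entries["read_file"] = {"access": "read"}
    write_blocked = False
except TypeError:
    # mappingproxy objects do not support item assignment.
    write_blocked = True
```

The proxy is only a shallow guard: nested values are still ordinary dicts, so a stricter version would need immutable values as well.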
🧠 Learnings (3)
📚 Learning: 2026-03-07T13:46:59.933Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Dependency groups: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen). Install with `uv sync`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-07T13:46:59.933Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Applies to tests/**/*.py : Mark all tests with appropriate markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, or `pytest.mark.slow`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-07T13:46:59.933Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T13:46:59.933Z
Learning: Applies to tests/**/*.py : Use `pytest-xdist` parallelism via `-n auto` for test execution
Applied to files:
CLAUDE.md
🧬 Code graph analysis (1)
tests/e2e/conftest.py (2)
- src/ai_company/providers/enums.py (1): FinishReason (15-22)
- src/ai_company/providers/models.py (5): CompletionResponse (257-306), StreamChunk (309-387), TokenUsage (12-35), ToolCall (96-119), ToolDefinition (64-93)
    uv run mypy src/ tests/                # type-check (strict)
    uv run pytest tests/ -m unit           # unit tests only
    uv run pytest tests/ -m integration    # integration tests only
    uv run pytest tests/ -m e2e            # e2e tests only
🧩 Analysis chain
🏁 Script executed:
# Check if CLAUDE.md exists and view the context around line 37
if [ -f CLAUDE.md ]; then
echo "=== CLAUDE.md exists, viewing lines 30-45 ==="
sed -n '30,45p' CLAUDE.md
fi
Repository: Aureliolo/ai-company
Length of output: 811
🏁 Script executed:
# Search for pytest configuration files
echo "=== Looking for pytest configuration ==="
fd -e 'pyproject.toml|pytest.ini|tox.ini|setup.cfg' --type f | head -20
Repository: Aureliolo/ai-company
Length of output: 104
🏁 Script executed:
# Check for pytest-xdist references in the codebase
echo "=== Searching for pytest-xdist references ==="
rg 'pytest.*-n|xdist|auto' --type-list | head -5
rg 'pytest.*-n|xdist|auto' -i
Repository: Aureliolo/ai-company
Length of output: 29935
Add -n auto to the e2e quick command for parallelism consistency.
The e2e test shortcut omits pytest-xdist parallelism, while the full suite command and repo documentation both use -n auto. All test execution should be consistent.
Proposed fix
    -uv run pytest tests/ -m e2e            # e2e tests only
    +uv run pytest tests/ -m e2e -n auto    # e2e tests only
tests/e2e/test_single_agent_e2e.py
Outdated
    @pytest.mark.slow
    @pytest.mark.timeout(60)
    @pytest.mark.skipif(
        os.environ.get("REAL_LLM_TEST") != "1",
        reason="Set REAL_LLM_TEST=1 to run real LLM integration test",
    )
    class TestRealLLMIntegration:
        """Optional smoke test with a real LLM provider.

        Skipped unless REAL_LLM_TEST=1 is set; not expected to run in CI.
        """

        async def test_real_provider_text_completion(self) -> None:
            """Minimal text-only task with a real provider.

            Placeholder — replace the skip with real provider setup when ready.
            """
            pytest.skip("Real LLM test placeholder — configure a real provider")
The manual real-LLM path is still unreachable.
Even when REAL_LLM_TEST=1 is set, this class never runs a real smoke path because the only test unconditionally calls pytest.skip(). That misses the linked objective of having an optional manual real-provider run. Either wire a minimal env-driven provider here or drop the claim until the smoke path actually exists.
I can help sketch a minimal env-gated smoke test that keeps CI isolated but makes the manual path real.
… reviewers
- Move received_messages.append() after bounds check in ScriptedProvider (conftest.py)
- Fix double-skip on real LLM test; now env-gated with actionable skip message
- Document SHUTDOWN→INTERRUPTED and ERROR→recovery transitions in DESIGN_SPEC §6.5
- Use content=None for tool-only responses in make_tool_call_response
- Rename TestMaxIterationsExhausted → TestMaxTurnsExhausted (consistent terminology)
- Remove unused read_tool from TestFileToolAgent registry
- Add min conversation length assertion in text-only test
- Add file existence assertions in max-turns test
- Add isinstance protocol assertion for ScriptedProvider
- Improve complete() and stream() docstrings in ScriptedProvider
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
    # Agent recovered successfully
    assert result.is_success is True
    assert result.total_turns == 2
Missing termination_reason assertion.
TestPermissionDeniedRecovery checks result.is_success is True but omits result.termination_reason, even though all three other test classes (TestFileToolAgent, TestTextOnlyAgent, TestMaxTurnsExhausted) explicitly assert both. This creates a coverage gap: a scenario where the engine erroneously returns TerminationReason.MAX_TURNS or TerminationReason.BUDGET_EXHAUSTED (while setting is_success=True) would not be caught.
     # Agent recovered successfully
     assert result.is_success is True
    +assert result.termination_reason == TerminationReason.COMPLETED
     assert result.total_turns == 2
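The coverage gap can be shown with a toy result object. This is an illustrative sketch: `RunResult` and `check_recovered` are hypothetical stand-ins, and the inconsistent result is constructed by hand to mimic the engine bug the reviewer has in mind.

```python
from dataclasses import dataclass
from enum import Enum


class TerminationReason(Enum):
    COMPLETED = "completed"
    MAX_TURNS = "max_turns"
    BUDGET_EXHAUSTED = "budget_exhausted"


@dataclass(frozen=True)
class RunResult:
    """Hypothetical subset of the fields the e2e tests inspect."""

    is_success: bool
    termination_reason: TerminationReason
    total_turns: int


def check_recovered(result: RunResult, expected_turns: int) -> list[str]:
    """Collect every mismatch instead of stopping at the first one."""
    problems = []
    if not result.is_success:
        problems.append("is_success is False")
    if result.termination_reason is not TerminationReason.COMPLETED:
        problems.append(f"unexpected reason: {result.termination_reason.name}")
    if result.total_turns != expected_turns:
        problems.append(f"expected {expected_turns} turns, got {result.total_turns}")
    return problems


good = RunResult(True, TerminationReason.COMPLETED, 2)
# Inconsistent result: flagged as success but terminated by the turn limit.
inconsistent = RunResult(True, TerminationReason.MAX_TURNS, 2)

good_problems = check_recovered(good, expected_turns=2)
inconsistent_problems = check_recovered(inconsistent, expected_turns=2)
```

Checking only `is_success` and `total_turns` would accept both results; the extra comparison is what separates them.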
Released in [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10) by Release Please, listed under Tests: add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)).
Summary
* Add `ScriptedProvider` (mock provider with sequential response playback) and factory helpers (`make_e2e_identity`, `make_e2e_task`, `make_tool_call_response`, `make_text_response`)
* Add bounds check with descriptive error in `ScriptedProvider.complete()`
* Add `is_error is False` assertion on success-path tool result
* Add `SHUTDOWN` to DESIGN_SPEC `TerminationReason` enum listing
* Add `ShutdownChecker` to DESIGN_SPEC `ExecutionLoop.execute()` docs
* Add `e2e` test command to CLAUDE.md Quick Commands

Closes #24
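For readers unfamiliar with the pattern, the sequential playback and bounds check described for `ScriptedProvider` can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the synchronous `complete()` signature, the plain-string responses, the `call_count` field, and the choice of `IndexError` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ScriptedProvider:
    """Illustrative mock: replays pre-scripted responses in call order.

    Hypothetical simplification; the real test double in this PR
    may be async and return richer response objects.
    """

    responses: list[str]
    call_count: int = 0

    def complete(self, messages: list[dict[str, str]]) -> str:
        # Bounds check: fail with a descriptive message rather than a
        # bare index error when the test makes more calls than scripted.
        if self.call_count >= len(self.responses):
            raise IndexError(
                f"ScriptedProvider exhausted: call {self.call_count + 1} "
                f"requested but only {len(self.responses)} responses scripted"
            )
        response = self.responses[self.call_count]
        self.call_count += 1
        return response
```

A test would script one response per expected turn (for example a tool call followed by a final text answer) and could then assert on `call_count` to confirm the loop made exactly the expected number of provider calls.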
Test plan
* `uv run ruff check src/ tests/`: lint clean
* `uv run ruff format src/ tests/`: format clean
* `uv run mypy src/ tests/`: type-check clean (281 files)
* `uv run pytest tests/ -n auto --cov=ai_company --cov-fail-under=80`: 2476 passed, 96.36% coverage