
feat: implement engine extensions — Plan-and-Execute loop and call categorization#159

Merged
Aureliolo merged 5 commits into main from feat/engine-extensions
Mar 7, 2026

Conversation

@Aureliolo
Owner

Summary

  • Plan-and-Execute loop (plan_execute_loop.py): Full implementation of the Plan-and-Execute execution strategy behind the ExecutionLoop protocol, with plan generation, step-by-step execution, replanning on failure, and 3 plan parsers (JSON, numbered list, text)
  • LLM call categorization (call_category.py, category_analytics.py): LLMCallCategory enum (PRODUCTIVE/COORDINATION/SYSTEM) with per-category cost/token breakdowns and orchestration overhead ratio with alert levels
  • Coordination metrics (coordination_metrics.py, coordination_config.py): Data models for delegation success rate, message efficiency, redundancy rate, and coordination efficiency — using @computed_field for derived values
  • Loop helpers (loop_helpers.py): Shared utilities extracted from ReactLoop — token estimation, response error checking, turn building, tool call clearing
  • Plan models (plan_models.py): ExecutionPlan and PlanStep frozen Pydantic models
  • Cost recording (cost_recording.py): Per-turn cost recording helper with category propagation
  • DESIGN_SPEC.md: Updated §6.5 (MVP callout), §10.5 (current state), §15.3 (new files), §15.5 (adopted conventions)
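As a rough illustration of the categorization piece above, here is a minimal sketch of the category enum and a per-category cost aggregation. The enum member names come from the PR summary; the str-valued enum, the record shape, and the helper name are assumptions, not the PR's actual `build_category_breakdown` implementation:

```python
from enum import Enum


class LLMCallCategory(str, Enum):
    """Member names from the PR summary; str values are an assumption."""

    PRODUCTIVE = "productive"
    COORDINATION = "coordination"
    SYSTEM = "system"


def breakdown_by_category(records: list[dict]) -> dict[LLMCallCategory, float]:
    """Aggregate cost per category (illustrative stand-in for the PR's analytics)."""
    totals = {category: 0.0 for category in LLMCallCategory}
    for record in records:
        # Uncategorized records are treated as SYSTEM here; the PR may differ.
        category = LLMCallCategory(record.get("call_category", "system"))
        totals[category] += record["cost_usd"]
    return totals
```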

Key design decisions

  • ExecutionLoop is a @runtime_checkable Protocol satisfied by both ReactLoop and PlanExecuteLoop
  • metadata: dict[str, object] (not Any) for type safety in loop results
  • Format string injection prevention: f-strings instead of .format() with LLM content
  • Functions decomposed to <50 lines per CLAUDE.md (e.g. execute() from 198→45 lines)
  • @computed_field for all derivable metric values (coordination efficiency, message efficiency)
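The protocol decision in the first bullet can be sketched as follows; the `execute()` signature and the stub loop bodies here are illustrative, not the PR's actual code:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class ExecutionLoop(Protocol):
    """Structural interface both loops satisfy; signature is an assumption."""

    def execute(self, context: object) -> dict[str, object]: ...


class ReactLoop:
    def execute(self, context: object) -> dict[str, object]:
        return {"strategy": "react"}


class PlanExecuteLoop:
    def execute(self, context: object) -> dict[str, object]:
        return {"strategy": "plan-execute"}


# @runtime_checkable lets isinstance() verify method presence (not signatures),
# so either loop can be passed where an ExecutionLoop is expected.
assert isinstance(ReactLoop(), ExecutionLoop)
assert isinstance(PlanExecuteLoop(), ExecutionLoop)
```

Note that `runtime_checkable` isinstance checks only confirm the method exists; static checking (mypy strict, per the test plan) is what enforces the full signature.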

Test plan

  • 2652 tests pass (0 failures, 6 skipped — platform-specific)
  • 95.67% coverage (≥80% threshold)
  • mypy strict: 0 errors across 296 files
  • ruff lint + format: all clean
  • New test_cost_recording.py (8 tests): noop, per-turn recording, zero-cost skip, free-tier, category propagation, exception handling, BaseException propagation
  • test_plan_execute_loop.py: plan generation, multi-step, replanning, max replans, budget/turn limits, content filter handling, metadata structure
  • test_plan_models.py, test_loop_helpers.py, test_call_category.py, test_category_analytics.py, test_coordination_metrics.py, test_coordination_config.py

Review coverage

Pre-reviewed by 8 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, resilience-audit, docs-consistency). 36 findings addressed, 2 polish simplifications applied.

Closes #134, closes #135

Aureliolo and others added 2 commits March 7, 2026 17:00
…tegorization (#134, #135)

Workstream A — Plan-and-Execute Loop (#134):
- Extract shared loop helpers from ReactLoop into loop_helpers.py
- Add plan data models (StepStatus, PlanStep, ExecutionPlan, PlanExecuteConfig)
- Implement PlanExecuteLoop with two-phase execution (plan → execute steps),
  mini-ReAct sub-loops per step, re-planning on failure, and JSON/text parsing
- Add plan event constants for observability

Workstream B — Call Categorization & Coordination Metrics (#135):
- Add LLMCallCategory enum (productive/coordination/system) and OrchestrationAlertLevel
- Add call_category field to CostRecord and TurnRecord with full propagation
- Implement category analytics (CategoryBreakdown, OrchestrationRatio)
- Implement coordination metrics (efficiency, overhead, error amplification,
  message density, redundancy rate) as pure computation functions
- Add CoordinationMetricsConfig and OrchestrationAlertThresholds
- Add CostTracker.get_category_breakdown() and get_orchestration_ratio() queries
- Add coordination_metrics to RootConfig and defaults

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…computed fields

Pre-reviewed by 8 agents, 36 findings addressed:
- Decompose oversized functions (plan_execute_loop.py: 198→45 lines)
- Fix format string injection vulnerability (.format() with LLM content)
- Fix step numbering bug in _parse_text_plan
- Convert coordination metrics to @computed_field pattern
- Add boundary constraints to metric input fields
- Narrow except clauses (TypeError masks bugs)
- Add structured logging to silent failure paths
- Replace dict[str, Any] with dict[str, object] in metadata
- Use NotBlankStr for identifier fields
- Add shared clear_last_turn_tool_calls helper
- Update DESIGN_SPEC.md §6.5, §10.5, §15.3, §15.5
- Add test_cost_recording.py (8 tests)
- Fix mypy errors in test files (TYPE_CHECKING, isinstance guards)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
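The `@computed_field` conversion mentioned in the commit above might look roughly like this; the model name matches the PR's `CoordinationEfficiency`, but the fields and formula are illustrative assumptions:

```python
from pydantic import BaseModel, computed_field


class CoordinationEfficiency(BaseModel):
    """Sketch only: field names and the efficiency formula are assumptions."""

    productive_cost_usd: float
    coordination_cost_usd: float

    @computed_field  # derived value: never stored, never passed in, always consistent
    @property
    def efficiency(self) -> float:
        total = self.productive_cost_usd + self.coordination_cost_usd
        return self.productive_cost_usd / total if total else 0.0
```

Because computed fields are included in serialization, `model_dump()` on this model emits `efficiency` alongside the stored inputs, which is the main benefit over a plain `@property`.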
Copilot AI review requested due to automatic review settings March 7, 2026 16:31
@github-actions
Contributor

github-actions bot commented Mar 7, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Mar 7, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6265227f-c7f0-422a-bbc8-2251081f70e7

📥 Commits

Reviewing files that changed from the base of the PR and between dae600c and b52b1e9.

📒 Files selected for processing (9)
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/plan_parsing.py
  • src/ai_company/engine/react_loop.py
  • tests/unit/budget/test_call_category.py
  • tests/unit/budget/test_category_analytics.py
  • tests/unit/budget/test_coordination_config.py
  • tests/unit/budget/test_coordination_metrics.py
  • tests/unit/engine/test_plan_execute_loop.py
  • tests/unit/engine/test_plan_parsing.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Plan-and-execute loop for structured multi-step task execution.
    • LLM call categorization (productive, coordination, system) with per-category cost/token breakdowns.
    • Orchestration metrics, alerting, and coordination analytics with queryable breakdowns and ratio checks.
    • Cost tracker queries for category breakdowns and orchestration ratios; observability events for plan and budget insights.
  • Documentation

    • Design/spec updated to reflect current loop implementations and analytics roadmap.
  • Tests

    • Extensive unit tests for analytics, configs, metrics, plan models, plan execution, and loop helpers.

Walkthrough

Adds Plan‑and‑Execute loop (planning + per‑step execution with replanning), shared loop helpers, LLM call categorization, coordination metrics and orchestration analytics, new plan models/parsing, CostRecord/tracker category APIs, ReactLoop refactor to use helpers, observability events, config/schema updates, and extensive unit tests.

Changes

  • Design spec (DESIGN_SPEC.md): Updated to reflect implemented ReAct + Plan‑and‑Execute loops and expanded M4 analytics/coordination details.
  • Budget exports (src/ai_company/budget/__init__.py): Expanded package exports to expose new enums, analytics models, and coordination metric types.
  • Call categorization & alerts (src/ai_company/budget/call_category.py): New enums LLMCallCategory and OrchestrationAlertLevel.
  • Category analytics (src/ai_company/budget/category_analytics.py): Added CategoryBreakdown, OrchestrationRatio, build_category_breakdown, compute_orchestration_ratio (aggregation, rounding, alert mapping).
  • Coordination config (src/ai_company/budget/coordination_config.py): Pydantic configs for metric names, error taxonomy, orchestration thresholds, and CoordinationMetricsConfig with validators.
  • Coordination metrics (src/ai_company/budget/coordination_metrics.py): Implemented CoordinationEfficiency, CoordinationOverhead, ErrorAmplification, MessageDensity, RedundancyRate and compute_* helpers with validations.
  • CostRecord & tracker (src/ai_company/budget/cost_record.py, src/ai_company/budget/tracker.py): Added optional call_category to CostRecord; CostTracker gains get_category_breakdown and get_orchestration_ratio; record filtering accepts task_id.
  • Config defaults & schema (src/ai_company/config/defaults.py, src/ai_company/config/schema.py): Added coordination_metrics default and RootConfig field.
  • Observability events (src/ai_company/observability/events/budget.py, .../execution.py): New budget category/orchestration event constants and execution plan event constants.
  • Loop helpers & protocol (src/ai_company/engine/loop_helpers.py, src/ai_company/engine/loop_protocol.py): New stateless helpers for shutdown/budget/provider/tool flow; TurnRecord gains call_category; ExecutionResult deep‑copies metadata.
  • ReactLoop refactor (src/ai_company/engine/react_loop.py): Refactored to use shared helpers; ReAct turns labeled LLMCallCategory.PRODUCTIVE; exports helper APIs.
  • Plan models & parsing (src/ai_company/engine/plan_models.py, src/ai_company/engine/plan_parsing.py): Immutable PlanStep/ExecutionPlan/PlanExecuteConfig models and JSON/text plan parsing utilities.
  • PlanExecuteLoop (src/ai_company/engine/plan_execute_loop.py): New PlanExecuteLoop implementing planning, per‑step execution, replanning, termination handling, and observability hooks.
  • Engine cost recording (src/ai_company/engine/cost_recording.py): Threaded call_category into per‑turn cost submission; refined zero‑cost skip logic.
  • Tests, budget & config (tests/unit/budget/*, tests/unit/config/conftest.py): Extensive tests for enums, category analytics, coordination configs/metrics, CostRecord categorization; test config factory updated.
  • Tests, engine (tests/unit/engine/*): Comprehensive tests for loop helpers, cost recording, plan models/parsing, PlanExecuteLoop scenarios, and protocol conformance; test harness records model strings.
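The JSON plan-parsing utility listed above could look roughly like this minimal sketch; the bracket-extraction heuristic and the step field names are assumptions, not the PR's plan_parsing.py implementation:

```python
import json


def parse_json_plan(text: str) -> list[dict[str, object]]:
    """Extract plan steps from an LLM reply that may wrap a JSON array in prose."""
    # Heuristic (assumed): take the outermost [...] span and parse it as JSON.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in plan text")
    raw_steps = json.loads(text[start : end + 1])
    return [
        {"step_number": i + 1, "description": str(step.get("description", ""))}
        for i, step in enumerate(raw_steps)
    ]
```

A reply like `'Here is the plan:\n[{"description": "search docs"}]'` would yield one numbered step; the numbered-list and plain-text parsers mentioned in the summary would be separate fallbacks.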

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Agent
    participant PlanExec as "PlanExecuteLoop"
    participant Provider
    participant ToolInv as "ToolInvoker"
    participant Tracker as "CostTracker"

    User->>Agent: submit task
    Agent->>PlanExec: execute(context, provider, tool_invoker)
    PlanExec->>Provider: planner call (generate ExecutionPlan)
    Provider-->>PlanExec: plan (ExecutionPlan)
    loop for each PlanStep
        PlanExec->>Provider: executor call (mini‑ReAct for step)
        Provider-->>PlanExec: CompletionResponse
        alt tool invocation requested
            PlanExec->>ToolInv: invoke tool(s)
            ToolInv-->>PlanExec: tool results
        end
        PlanExec->>Tracker: record turn cost (with call_category)
        alt step failed and replans remaining
            PlanExec->>Provider: replan request
            Provider-->>PlanExec: revised ExecutionPlan
        end
    end
    PlanExec-->>Agent: ExecutionResult (with plans/replans metadata)
    Agent->>Tracker: get_category_breakdown / get_orchestration_ratio
    Tracker-->>Agent: CategoryBreakdown / OrchestrationRatio
    Agent-->>User: final response + metrics

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 28.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check ✅ Passed: The title accurately reflects the main changes: implementing Plan-and-Execute loop and call categorization as the primary objectives.
  • Description check ✅ Passed: The description comprehensively documents the PR scope, key implementations, design decisions, testing results, and linked issues with clear technical depth.
  • Linked Issues check ✅ Passed: The PR fulfills all major acceptance criteria from #134 (Plan-and-Execute loop, replanning, max_replans config, observability, unit/integration tests) and #135 (call categorization, orchestration ratio with alerts, coordination metrics suite with all 5 metrics, analytics queries, comprehensive testing).
  • Out of Scope Changes check ✅ Passed: All changes are within scope of the linked issues: plan-execute loop implementation, call categorization infrastructure, coordination metrics, helper utilities, cost recording, config updates, and comprehensive testing. No unrelated changes detected.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the agent execution engine by introducing the Plan-and-Execute loop, a new strategy for task decomposition and execution. It also lays the groundwork for advanced LLM call analytics and multi-agent coordination metrics, providing deeper insights into agent performance and cost. The changes improve modularity, extend configurability, and update documentation to reflect these new capabilities.

Highlights

  • Plan-and-Execute Loop Implementation: Introduced a full implementation of the Plan-and-Execute execution strategy, including plan generation, step-by-step execution, replanning on failure, and support for JSON, numbered list, and plain text plan parsers.
  • LLM Call Categorization and Analytics: Added an LLMCallCategory enum (PRODUCTIVE/COORDINATION/SYSTEM) to categorize LLM API calls, along with data models and functions for per-category cost/token breakdowns and orchestration overhead ratio calculation with alert levels.
  • Coordination Metrics Data Models: Implemented data models for key coordination metrics such as delegation success rate, message efficiency, redundancy rate, and coordination efficiency, utilizing @computed_field for derived values.
  • Refactored Loop Helpers: Extracted shared utility functions from ReactLoop into a new loop_helpers.py module, covering token estimation, response error checking, turn building, and tool call clearing, to promote code reuse and modularity.
  • Dedicated Plan Models: Defined new frozen Pydantic models, ExecutionPlan and PlanStep, to formally represent the structure of execution plans.
  • Enhanced Cost Recording: Updated the cost recording mechanism to include category propagation for each turn, ensuring more granular cost tracking and analytics.
  • DESIGN_SPEC.md Updates: Revised the design specification document to reflect the current state of execution loops, LLM call analytics, new files, and adopted conventions.
Changelog
  • DESIGN_SPEC.md
    • Updated section 6.5 to reflect the implementation of both ReAct and Plan-and-Execute loops.
    • Updated section 10.5 to include call categorization and coordination metric data models as current state.
    • Added a new note in section 10.5 detailing the current state of call categorization and coordination metric data models.
    • Updated section 15.5 to mark LLM call analytics data models as adopted.
  • src/ai_company/budget/__init__.py
    • Added imports for new budget-related modules: call_category, category_analytics, coordination_config, and coordination_metrics.
    • Exported new classes and enums from the budget sub-package in __all__.
  • src/ai_company/budget/call_category.py
    • Added new file call_category.py defining LLMCallCategory (PRODUCTIVE, COORDINATION, SYSTEM) and OrchestrationAlertLevel enums.
  • src/ai_company/budget/category_analytics.py
    • Added new file category_analytics.py defining CategoryBreakdown and OrchestrationRatio models.
    • Implemented build_category_breakdown function to aggregate cost records by category.
    • Implemented compute_orchestration_ratio function to calculate orchestration overhead and alert levels.
  • src/ai_company/budget/coordination_config.py
    • Added new file coordination_config.py defining configuration models for coordination metrics.
    • Introduced CoordinationMetricName, ErrorCategory, ErrorTaxonomyConfig, OrchestrationAlertThresholds, and CoordinationMetricsConfig.
  • src/ai_company/budget/coordination_metrics.py
    • Added new file coordination_metrics.py defining models and computation functions for five coordination metrics: Efficiency, Overhead, Error Amplification, Message Density, and Redundancy Rate.
  • src/ai_company/budget/cost_record.py
    • Added an optional call_category field to the CostRecord model to tag LLM calls.
  • src/ai_company/budget/tracker.py
    • Added get_category_breakdown method to CostTracker to retrieve categorized cost data.
    • Added get_orchestration_ratio method to CostTracker to compute and log orchestration overhead.
    • Updated _filter_records to include task_id as a filtering option.
    • Added new observability event constants for budget category breakdown and orchestration ratio queries/alerts.
  • src/ai_company/config/defaults.py
    • Added coordination_metrics to the default configuration dictionary.
  • src/ai_company/config/schema.py
    • Imported CoordinationMetricsConfig and added it as a field to the RootConfig schema.
  • src/ai_company/engine/__init__.py
    • Added imports for PlanExecuteLoop, ExecutionPlan, PlanExecuteConfig, PlanStep, and StepStatus.
    • Exported new classes and enums from the engine sub-package in __all__.
  • src/ai_company/engine/cost_recording.py
    • Updated the module docstring to reflect its new role in handling per-turn cost recording.
    • Modified the condition for skipping zero-cost turns to be more precise (cost_usd == 0.0).
    • Added call_category to the CostRecord creation, propagating it from the TurnRecord.
  • src/ai_company/engine/loop_helpers.py
    • Added new file loop_helpers.py containing shared stateless helper functions for ExecutionLoop implementations.
    • Functions include check_shutdown, check_budget, call_provider, check_response_errors, execute_tool_calls, clear_last_turn_tool_calls, get_tool_definitions, response_to_message, make_turn_record, and build_result.
  • src/ai_company/engine/loop_protocol.py
    • Added LLMCallCategory import.
    • Added an optional call_category field to the TurnRecord model.
    • Changed the type hint for metadata in ExecutionResult from dict[str, Any] to dict[str, object] for improved type safety.
  • src/ai_company/engine/plan_execute_loop.py
    • Added new file plan_execute_loop.py implementing the Plan-and-Execute execution loop.
    • Includes logic for planning, step execution, replanning on failure, and various plan parsing strategies.
  • src/ai_company/engine/plan_models.py
    • Added new file plan_models.py defining StepStatus enum, PlanStep model, ExecutionPlan model, and PlanExecuteConfig model.
  • src/ai_company/engine/react_loop.py
    • Refactored ReactLoop to utilize the new shared helper functions from loop_helpers.py, removing duplicated logic.
    • Removed internal helper methods that were moved to loop_helpers.py.
  • src/ai_company/observability/events/budget.py
    • Added new event constants for budget category breakdown queries (BUDGET_CATEGORY_BREAKDOWN_QUERIED) and orchestration ratio queries/alerts (BUDGET_ORCHESTRATION_RATIO_QUERIED, BUDGET_ORCHESTRATION_RATIO_ALERT).
  • src/ai_company/observability/events/execution.py
    • Added new event constants for plan-and-execute loop events, including plan creation, step start/complete/failed, replanning, and parse errors.
  • tests/unit/budget/test_call_category.py
    • Added new unit tests for LLMCallCategory and OrchestrationAlertLevel enums.
  • tests/unit/budget/test_category_analytics.py
    • Added new unit tests for build_category_breakdown and compute_orchestration_ratio functions.
    • Added tests for CostTracker methods related to category breakdown and orchestration ratio.
  • tests/unit/budget/test_coordination_config.py
    • Added new unit tests for CoordinationMetricName, ErrorCategory, ErrorTaxonomyConfig, OrchestrationAlertThresholds, and CoordinationMetricsConfig models.
  • tests/unit/budget/test_coordination_metrics.py
    • Added new unit tests for compute_efficiency, compute_overhead, compute_error_amplification, compute_message_density, and compute_redundancy_rate functions, and the CoordinationMetrics model.
  • tests/unit/budget/test_cost_record.py
    • Added unit tests to verify the call_category field in CostRecord, including default value, assignment, and JSON roundtrip.
  • tests/unit/config/conftest.py
    • Updated RootConfigFactory to include CoordinationMetricsConfig for test data generation.
  • tests/unit/engine/test_cost_recording.py
    • Added new unit tests for the record_execution_costs function, covering various scenarios like no tracker, multiple turns, zero-cost skips, category propagation, and exception handling.
  • tests/unit/engine/test_loop_helpers.py
    • Added new unit tests for all helper functions extracted into loop_helpers.py, ensuring their correctness and error handling.
  • tests/unit/engine/test_loop_protocol.py
    • Added unit tests for the call_category field in TurnRecord.
    • Extended protocol conformance tests to include PlanExecuteLoop.
  • tests/unit/engine/test_plan_execute_loop.py
    • Added new comprehensive unit tests for the PlanExecuteLoop, covering single/multi-step execution, tool usage, replanning, budget/shutdown handling, model tiering, and plan parsing.
  • tests/unit/engine/test_plan_models.py
    • Added new unit tests for StepStatus, PlanStep, ExecutionPlan, and PlanExecuteConfig models, validating their structure, constraints, and immutability.
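The orchestration overhead ratio with alert levels, described in several of the entries above, can be sketched as a pure function; the alert-level names echo the PR's OrchestrationAlertLevel, but the threshold defaults and the function signature are assumptions:

```python
from enum import Enum


class OrchestrationAlertLevel(str, Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def compute_orchestration_ratio(
    coordination_cost_usd: float,
    total_cost_usd: float,
    warning_threshold: float = 0.3,   # assumed default
    critical_threshold: float = 0.5,  # assumed default
) -> tuple[float, OrchestrationAlertLevel]:
    """Share of spend going to coordination overhead, mapped to an alert level."""
    ratio = coordination_cost_usd / total_cost_usd if total_cost_usd else 0.0
    if ratio >= critical_threshold:
        return ratio, OrchestrationAlertLevel.CRITICAL
    if ratio >= warning_threshold:
        return ratio, OrchestrationAlertLevel.WARNING
    return ratio, OrchestrationAlertLevel.OK
```

In the PR, the thresholds presumably come from OrchestrationAlertThresholds in coordination_config.py rather than hard-coded defaults.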
Activity
  • The pull request has been pre-reviewed by 8 automated agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, resilience-audit, docs-consistency).
  • 36 findings identified by the automated agents have been addressed.
  • 2 polish simplifications suggested by the automated agents have been applied.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Plan-and-Execute execution loop, moving it from a planned M4+ feature to M3, alongside significant enhancements to LLM call analytics. The changes include new data models and computation functions for categorizing LLM calls (productive, coordination, system) and calculating orchestration overhead ratios, with corresponding configuration options and query methods added to the CostTracker. The core execution loops (ReAct and the new Plan-and-Execute) now leverage shared helper functions for common tasks like budget checking, shutdown handling, and turn recording, and CostRecord and TurnRecord models are updated to include an optional call_category field. Review comments highlight a potential indirect prompt injection vulnerability in the PlanExecuteLoop due to unsanitized LLM-generated content in f-strings, recommend moving an LLMCallCategory import to module level to prevent NameError, point out an invalid Python 3 except syntax in loop_helpers.py and react_loop.py that needs to be updated to a tuple, and suggest categorizing planning calls as SYSTEM and step execution calls as PRODUCTIVE for accurate analytics.

Comment on lines +580 to +584
instruction = (
    f"Execute step {step.step_number}: {step.description}\n"
    f"Expected outcome: {step.expected_outcome}\n"
    f"When done, respond with a summary of what you accomplished."
)
Contributor


security-high high

The PlanExecuteLoop implementation is vulnerable to indirect prompt injection. It takes step descriptions and expected outcomes directly from LLM-generated plans (which can be influenced by external data from tool outputs) and embeds them into subsequent prompts using f-strings without any sanitization or structural separation. An attacker could craft tool outputs that, when summarized into a plan step, inject malicious instructions that the agent will then execute in the next turn. It is recommended to use clear delimiters and explicit instructions to the LLM to treat the content as data.

        instruction = (
            f"Execute the following step {step.step_number}:\n"
            f"<step_description>\n{step.description}\n</step_description>\n"
            f"Expected outcome:\n<expected_outcome>\n{step.expected_outcome}\n</expected_outcome>\n"
            f"When done, respond with a summary of what you accomplished."
        )

)

if TYPE_CHECKING:
    from ai_company.budget.call_category import LLMCallCategory
Contributor


high

The type hint LLMCallCategory in make_turn_record's signature may cause a NameError at runtime if from __future__ import annotations is not used, as it's only imported within a TYPE_CHECKING block. To make the code more robust, please move this import to the module level.

return None
try:
    shutting_down = shutdown_checker()
except MemoryError, RecursionError:
Contributor


security-medium medium

The except syntax except MemoryError, RecursionError: is invalid Python 3 syntax and will cause a SyntaxError. More critically, this improper exception handling fails to correctly catch RecursionError as intended, leading to critical errors being swallowed. This pattern should be corrected to use a tuple for multiple exceptions, as shown in the suggestion. This issue is present in multiple places in this file (lines 63, 108, 180, 288).

    except (MemoryError, RecursionError):

if isinstance(response, ExecutionResult):
    return response

turns.append(make_turn_record(turn_number, response))
Contributor


medium

The planning and re-planning calls are a form of framework overhead and should be categorized as SYSTEM calls for accurate analytics. This PR introduces call categorization, and this seems to be a missed spot.

To fix this, you'll also need to import LLMCallCategory at the top of the file:

from ai_company.budget.call_category import LLMCallCategory
        turns.append(make_turn_record(turn_number, response, call_category=LLMCallCategory.SYSTEM))

if isinstance(response, ExecutionResult):
    return response

turns.append(make_turn_record(turn_number, response))
Contributor


medium

The LLM calls during step execution are direct task work and should be categorized as PRODUCTIVE for analytics. This seems to be a missed spot for the new call categorization feature.

        turns.append(make_turn_record(turn_number, response, call_category=LLMCallCategory.PRODUCTIVE))

return response

turns.append(_make_turn_record(turn_number, response))
turns.append(make_turn_record(turn_number, response))
Contributor


medium

The LLM calls within the ReAct loop are direct task work and should be categorized as PRODUCTIVE for analytics. This seems to be a missed spot for the new call categorization feature.

To fix this, you'll also need to import LLMCallCategory at the top of the file:

from ai_company.budget.call_category import LLMCallCategory
Suggested change
turns.append(make_turn_record(turn_number, response))
turns.append(make_turn_record(turn_number, response, call_category=LLMCallCategory.PRODUCTIVE))

@greptile-apps

greptile-apps bot commented Mar 7, 2026

Greptile Summary

This PR implements two major engine extensions: a full Plan-and-Execute execution loop (PlanExecuteLoop) behind the existing ExecutionLoop protocol, and an LLM call categorization system (PRODUCTIVE/COORDINATION/SYSTEM) with per-category cost/token analytics and orchestration overhead alerting. It also extracts shared loop utilities into loop_helpers.py, adds coordination metric data models, and wires call_category through TurnRecord → CostRecord → CostTracker.

The implementation is well-engineered overall: the phase separation (planning → step execution → replanning) is clean and easy to follow, the @runtime_checkable Protocol satisfies both ReactLoop and PlanExecuteLoop, frozen Pydantic models are used consistently, and format-string injection is correctly prevented via f-strings. The PR adheres to CLAUDE.md conventions including PEP 758 except A, B: syntax and @computed_field for all derived values.

Key findings:

  • Private constants imported cross-module (_PLANNING_PROMPT, _REPLAN_JSON_EXAMPLE in plan_execute_loop.py): These should be made public in plan_parsing.py since they are actively consumed externally, violating Python's _-prefix encapsulation convention.
  • Error paths don't log before raising in coordination_metrics.py: All five compute_* functions raise ValueError without a preceding logger.warning(...) call, violating CLAUDE.md's explicit rule that all error paths must log at WARNING or ERROR before raising. The module-level logger is consequently dead code.
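To illustrate the @computed_field convention these findings refer to, a frozen metric model with a derived value might look like the following sketch; the class and field names are hypothetical, not the PR's actual models:

```python
from pydantic import BaseModel, ConfigDict, computed_field


class MessageEfficiencySketch(BaseModel):
    """Hypothetical frozen coordination-metric model."""

    model_config = ConfigDict(frozen=True)

    useful_messages: int
    total_messages: int

    @computed_field  # derived value, per the repo convention
    @property
    def efficiency(self) -> float:
        """Share of messages that carried useful content."""
        if self.total_messages == 0:
            return 0.0
        return self.useful_messages / self.total_messages
```

Because efficiency is a computed field rather than a stored one, it can never drift out of sync with its inputs, and it still appears in model_dump() output.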

Confidence Score: 4/5

  • Safe to merge after addressing the two style findings; no logic bugs or security issues found.
  • The core Plan-and-Execute loop logic is sound: budget/shutdown checks are performed at every decision boundary, replanning correctly handles step failures, and the turn/cost recording wires call_category all the way through. The two findings are both style/convention issues (private constant export and missing logger calls on error paths) that don't affect correctness or runtime behavior. The test coverage (2652 tests, 95.67% coverage, mypy strict) provides strong confidence.
  • src/ai_company/budget/coordination_metrics.py (missing logger calls in error paths); src/ai_company/engine/plan_execute_loop.py and src/ai_company/engine/plan_parsing.py (private constant visibility)

Comments Outside Diff (2)

  1. src/ai_company/engine/plan_execute_loop.py, line 62-66 (link)

    The constants _PLANNING_PROMPT and _REPLAN_JSON_EXAMPLE are imported from plan_parsing.py where their _ prefix signals module-private intent per Python convention. However, these constants are actively used externally in this module, which violates encapsulation. Since they are part of the public API surface of plan_parsing.py, they should be made public by removing the leading underscore.

    Update the imports here to use the public names, e.g. from ai_company.engine.plan_parsing import PLANNING_PROMPT, REPLAN_JSON_EXAMPLE.

    And rename them in plan_parsing.py (remove the _ prefix from lines 24 and 43).

  2. src/ai_company/budget/coordination_metrics.py, line 199-227 (link)

    CLAUDE.md mandates: "All error paths must log at WARNING or ERROR with context before raising." Every compute_* function in this module raises ValueError without logging. The module-level logger (line 18) is consequently dead code.

    For example, compute_efficiency should emit a logger.warning(...) with the offending inputs before raising its ValueError.

    Apply the same pattern to compute_overhead (line 249), compute_error_amplification (line 275), compute_message_density (line 301), and the validation loops in compute_redundancy_rate (lines 326, 330).
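A sketch of the log-before-raise pattern, using a stdlib logger as a stand-in (the real module would use get_logger from ai_company.observability and an event-name constant; the event string and parameter names here are illustrative):

```python
import logging

logger = logging.getLogger(__name__)
# In the repo this would be structured logging with an event constant:
#   logger.warning(EVENT_NAME, key=value)


def compute_efficiency(productive_tokens: int, total_tokens: int) -> float:
    """Log the offending inputs at WARNING before raising, per CLAUDE.md."""
    if total_tokens <= 0:
        logger.warning(
            "coordination_efficiency_invalid_input: "
            "productive_tokens=%d total_tokens=%d",
            productive_tokens,
            total_tokens,
        )
        raise ValueError("total_tokens must be positive")
    return productive_tokens / total_tokens
```

The WARNING line preserves the context (which function, which inputs) even when a caller swallows or re-wraps the ValueError.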

Last reviewed commit: b52b1e9


Copilot AI left a comment


Pull request overview

This PR implements two major features: the Plan-and-Execute execution loop (Loop 2) and LLM call categorization with coordination metrics. The Plan-and-Execute loop follows the ExecutionLoop protocol, decomposing tasks into steps via LLM planning then executing each step with a mini-ReAct sub-loop, with re-planning on failure. The call categorization adds LLMCallCategory (PRODUCTIVE/COORDINATION/SYSTEM) to TurnRecord and CostRecord, with per-category analytics, orchestration ratio computation, and five coordination metrics (efficiency, overhead, error amplification, message density, redundancy rate).

Changes:

  • New Plan-and-Execute loop (plan_execute_loop.py, plan_models.py) with shared helpers extracted from ReactLoop into loop_helpers.py, and cost_recording.py updates for category propagation
  • New budget modules (call_category.py, category_analytics.py, coordination_metrics.py, coordination_config.py) with CostTracker query methods and RootConfig integration
  • Comprehensive test coverage across all new modules with observability event constants and DESIGN_SPEC documentation updates

Reviewed changes

Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/ai_company/engine/plan_execute_loop.py New Plan-and-Execute loop with two-phase execution, replanning, and plan parsing
src/ai_company/engine/plan_models.py Frozen Pydantic models for plan steps, execution plans, and loop config
src/ai_company/engine/loop_helpers.py Shared stateless helpers extracted from ReactLoop for reuse
src/ai_company/engine/react_loop.py Refactored to use shared loop_helpers instead of inline methods
src/ai_company/engine/loop_protocol.py Added call_category to TurnRecord, changed metadata type to dict[str, object]
src/ai_company/engine/cost_recording.py Propagates call_category from TurnRecord to CostRecord
src/ai_company/engine/__init__.py Re-exports new plan-execute types
src/ai_company/budget/call_category.py LLMCallCategory and OrchestrationAlertLevel enums
src/ai_company/budget/category_analytics.py CategoryBreakdown, OrchestrationRatio, and pure computation functions
src/ai_company/budget/coordination_metrics.py Five coordination metric models with @computed_field
src/ai_company/budget/coordination_config.py Config models for metric collection and alert thresholds
src/ai_company/budget/cost_record.py Added call_category field
src/ai_company/budget/tracker.py New get_category_breakdown and get_orchestration_ratio query methods, task_id filter
src/ai_company/budget/__init__.py Re-exports new budget types
src/ai_company/config/schema.py Added coordination_metrics field to RootConfig
src/ai_company/config/defaults.py Added coordination_metrics default
src/ai_company/observability/events/execution.py Plan-and-execute event constants
src/ai_company/observability/events/budget.py Category analytics event constants
DESIGN_SPEC.md Updated §6.5, §10.5, §15.3, §15.5 for new features
tests/unit/engine/test_plan_execute_loop.py Comprehensive tests for Plan-and-Execute loop
tests/unit/engine/test_plan_models.py Tests for plan data models
tests/unit/engine/test_loop_helpers.py Tests for shared loop helpers
tests/unit/engine/test_loop_protocol.py Extended with call_category and PlanExecuteLoop conformance tests
tests/unit/engine/test_cost_recording.py Tests for cost recording with category propagation
tests/unit/budget/test_call_category.py Tests for call categorization enums
tests/unit/budget/test_category_analytics.py Tests for category breakdown and orchestration ratio
tests/unit/budget/test_coordination_metrics.py Tests for five coordination metrics
tests/unit/budget/test_coordination_config.py Tests for coordination config models
tests/unit/budget/test_cost_record.py Extended with call_category tests
tests/unit/config/conftest.py Updated factory for coordination_metrics field


Comment on lines +435 to +439
# Should terminate (either MAX_TURNS or COMPLETED for 1st step)
assert result.termination_reason in (
TerminationReason.MAX_TURNS,
TerminationReason.COMPLETED,
)

Copilot AI Mar 7, 2026


This test asserts result.termination_reason in (TerminationReason.MAX_TURNS, TerminationReason.COMPLETED), which makes the test non-deterministic in its expected outcome. With max_turns=2, the plan consumes 1 turn and step 1 consumes 1 turn, reaching the limit. Step 2 and 3 would not execute. The test should assert a specific outcome (likely MAX_TURNS since 3 steps can't complete in 2 turns, or COMPLETED if the first step completing is sufficient). This ambiguity weakens the test's ability to catch regressions.

Suggested change
# Should terminate (either MAX_TURNS or COMPLETED for 1st step)
assert result.termination_reason in (
TerminationReason.MAX_TURNS,
TerminationReason.COMPLETED,
)
# Should terminate because the maximum number of turns was reached
assert result.termination_reason == TerminationReason.MAX_TURNS

Copilot uses AI. Check for mistakes.
Comment on lines +226 to +241
async def _run_steps( # noqa: PLR0913
self,
ctx: AgentContext,
provider: CompletionProvider,
executor_model: str,
config: CompletionConfig,
tool_defs: list[ToolDefinition] | None,
tool_invoker: ToolInvoker | None,
plan: ExecutionPlan,
turns: list[TurnRecord],
all_plans: list[ExecutionPlan],
replans_used: int,
planner_model: str,
budget_checker: BudgetChecker | None,
shutdown_checker: ShutdownChecker | None,
) -> ExecutionResult:

Copilot AI Mar 7, 2026


_run_steps accepts 13 positional parameters (plus self), making it one of the most heavily parameterized methods in the codebase. While # noqa: PLR0913 suppresses the linting warning, this many parameters make the method harder to maintain and test. Consider grouping related parameters into a lightweight dataclass or named tuple (e.g., a _StepExecutionContext containing provider, executor_model, config, tool_defs, tool_invoker, budget_checker, shutdown_checker) to reduce the parameter count.

Comment on lines +471 to +472
# Verify different models were used (check recorded call count)
assert provider.call_count == 2

Copilot AI Mar 7, 2026


The test_different_models_for_phases test only verifies provider.call_count == 2 but doesn't verify that different models were actually used for the planning vs execution phase. The MockCompletionProvider doesn't record the model parameter passed to each complete() call. This means the test doesn't actually validate the model tiering behavior it claims to test. Consider extending MockCompletionProvider to record models or adding a model-recording assertion here.

Suggested change
# Verify different models were used (check recorded call count)
assert provider.call_count == 2
# Verify provider was called once for planning and once for execution
assert provider.call_count == 2
# Verify the configured planner and executor models were actually used
assert getattr(provider, "called_models", None) == [
"test-planner-001",
"test-executor-001",
]


import pytest

from ai_company.core.agent import AgentIdentity # noqa: TC001

Copilot AI Mar 7, 2026


The AgentIdentity import on line 8 uses # noqa: TC001 to suppress the "move to TYPE_CHECKING" linting rule, which is the correct convention for test files where pytest evaluates annotations at runtime. However, AgentIdentity is only used as a type annotation for the sample_agent_with_personality fixture parameter on line 411 (test_max_turns_during_step). In tests that use sample_agent_context, the fixture already provides AgentContext. Consider whether this import is actually needed at runtime — if it's only used in type annotations that pytest doesn't evaluate (e.g., fixture parameter types), it could remain in TYPE_CHECKING.
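The pattern the reviewer is weighing looks like this: the guarded import runs only for static type checkers, never at runtime, so it is safe as long as nothing evaluates the annotation at runtime (the function below is a stand-alone illustration, not code from the PR):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by mypy/pyright; invisible at runtime. Anything that
    # forces runtime evaluation of the annotation would then fail.
    from ai_company.core.agent import AgentIdentity


def describe(agent: "AgentIdentity") -> str:
    """String annotation keeps this safe despite the guarded import."""
    return repr(agent)
```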


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 17

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
src/ai_company/engine/cost_recording.py (1)

68-75: ⚠️ Potential issue | 🟠 Major

A single identity.model cannot describe every turn anymore.

The new planner/executor overrides mean one ExecutionResult can contain calls from different models, but every CostRecord here is still stamped with identity.model.provider / identity.model.model_id. That will mislabel per-call audit data as soon as a non-default planner or executor model is used. The actual model metadata needs to travel with each turn and be recorded here instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/cost_recording.py` around lines 68 - 75, CostRecord
entries are being stamped with identity.model.provider and
identity.model.model_id even though ExecutionResult turns may come from
different models; update the record creation in CostRecord construction to use
the model metadata attached to each turn (e.g., turn.model.provider and
turn.model.model_id), falling back to
identity.model.provider/identity.model.model_id only if the turn lacks its own
model info, so each CostRecord accurately reflects the per-call model used by
ExecutionResult/turn.
src/ai_company/engine/react_loop.py (1)

230-249: ⚠️ Potential issue | 🟠 Major

Return a non-success termination reason for truncated completions.

FinishReason.MAX_TOKENS means the provider stopped because it ran out of generation budget, not because the task completed. Returning TerminationReason.COMPLETED here hides truncation from callers and can suppress fallback or retry behavior.

🛠️ Suggested fix
         if response.finish_reason == FinishReason.MAX_TOKENS:
             logger.warning(
                 EXECUTION_LOOP_TERMINATED,
                 execution_id=ctx.execution_id,
-                reason=TerminationReason.COMPLETED.value,
+                reason=TerminationReason.MAX_TURNS.value,
                 turns=len(turns),
                 truncated=True,
             )
+            return build_result(
+                ctx,
+                TerminationReason.MAX_TURNS,
+                turns,
+            )
         else:
             logger.info(
                 EXECUTION_LOOP_TERMINATED,
                 execution_id=ctx.execution_id,
                 reason=TerminationReason.COMPLETED.value,
                 turns=len(turns),
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/react_loop.py` around lines 230 - 249, The code treats
FinishReason.MAX_TOKENS as a successful completion; change it to return and log
a non-success termination (e.g., TerminationReason.TRUNCATED or another
appropriate non-success value) so callers can detect truncation and trigger
fallback/retry. Specifically, in the branch checking response.finish_reason ==
FinishReason.MAX_TOKENS update the logger.warning call's reason field to the
non-success TerminationReason and pass that same TerminationReason (instead of
TerminationReason.COMPLETED) into build_result(ctx, ..., turns), keeping
truncated=True and execution/context metadata (ctx.execution_id, turns) intact.
DESIGN_SPEC.md (1)

1567-1575: ⚠️ Potential issue | 🟡 Minor

Fix the orchestration-ratio formula in §10.5.

The implementation and surrounding prose use (coordination + system) / total, but this sentence says coordination / total. That mismatch changes the meaning of the metric and the alert thresholds readers will infer from it.
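As a worked example of the formula the spec should state (the token counts are made up for illustration):

```python
def orchestration_ratio(productive: int, coordination: int, system: int) -> float:
    """Overhead ratio as implemented: (coordination + system) / total."""
    total = productive + coordination + system
    if total == 0:
        return 0.0  # no traffic yet: treat overhead as zero
    return (coordination + system) / total


# 700 productive + 200 coordination + 100 system tokens
ratio = orchestration_ratio(700, 200, 100)  # 300 / 1000 = 0.3
```

Under the coordination-only definition the same run would report 0.2, which is why the prose mismatch would shift readers' interpretation of the alert thresholds.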

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DESIGN_SPEC.md` around lines 1567 - 1575, The prose in §10.5 is inconsistent
with the implementation: the orchestration ratio used in code is (coordination +
system) / total but the text defines it as coordination / total. Update the
prose and any alert guidance in §10.5 to define the orchestration ratio as
(coordination + system) / total (and adjust explanatory text about
thresholds/alerts accordingly), referencing the CostRecord categories
(`coordination` and `system`) so readers understand the formula matches the
implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/budget/category_analytics.py`:
- Around line 116-135: OrchestrationRatio is dropping the uncategorized_tokens
bucket from its public model so callers cannot tell if a zero ratio was due to
true zero overhead or uncategorized traffic; add an uncategorized_tokens: int =
Field(ge=0, description="Uncategorized tokens") attribute to the
OrchestrationRatio model (and mirror the same change in the similar
model/section referenced around lines 204-233) so the denominator and token
breakdown are exposed; update any constructor/factory that builds
OrchestrationRatio to populate this new field from the computed
uncategorized_tokens value.
- Around line 8-20: This module lacks the observability hook: add "from
ai_company.observability import get_logger" at the top and declare "logger =
get_logger(__name__)" so the file-level logger is available; then instrument the
existing business logic functions build_category_breakdown and
compute_orchestration_ratio to emit events using the domain event constants from
ai_company.observability.events (import the appropriate event names) rather than
raw strings, ensuring calls to logger.info/debug/error use those constants and
that logging occurs at entry/exit and on error paths; place the imports
alongside other top-level imports and add the logger declaration near them so
all functions in this file (including lines ~138-247) use the repo-wide
observability contract.

In `@src/ai_company/budget/coordination_metrics.py`:
- Around line 86-90: Update the error-rate field validators and the computation
to enforce a max of 1.0: change the Pydantic Field on error_rate_mas to include
le=1.0 (keep ge=0.0) and change error_rate_sas to use ge=0.0, le=1.0 (instead of
gt=0), and then in compute_error_amplification() defensively clamp both inputs
to min(value, 1.0) before doing any math so out-of-range inputs can't produce
nonsensical amplification ratios; reference error_rate_mas, error_rate_sas, and
compute_error_amplification when making the changes.
- Around line 1-12: Add logging to this business-logic module by importing
get_logger and initializing logger = get_logger(__name__), then update every
invalid-input branch in the coordination metric helpers (functions efficiency,
overhead, error_amplification, message_density, redundancy_rate) to call
logger.warning or logger.error with contextual details (input values and reason)
immediately before raising ValueError; ensure the import line "from
ai_company.observability import get_logger" is added and that each bad-input
path logs at WARNING/ERROR level with a clear message including the function
name and offending parameters.
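A sketch of the defensive clamp described above; the function and parameter names follow the prompt's wording, and the real signature may differ:

```python
def compute_error_amplification(
    error_rate_mas: float, error_rate_sas: float
) -> float:
    """Clamp both rates into [0.0, 1.0] before forming the ratio."""
    mas = min(max(error_rate_mas, 0.0), 1.0)
    sas = min(max(error_rate_sas, 0.0), 1.0)
    if sas == 0.0:
        raise ValueError("error_rate_sas must be > 0 to form a ratio")
    return mas / sas
```

The clamp complements the Field(ge=0.0, le=1.0) constraint: Pydantic rejects bad model construction, while the clamp keeps the pure function safe when called directly.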

In `@src/ai_company/budget/tracker.py`:
- Around line 243-244: The public query entry points that accept agent_id and
task_id currently allow blank/whitespace strings which then propagate into
_filter_records; change the parameter types from str | None to
core.types.NotBlankStr | None for all identifier/name parameters (e.g., the
methods taking agent_id: str | None, task_id: str | None) and add an upfront
normalization/validation step that rejects or normalizes purely-blank values
(trim and treat "" as None or raise) before calling _filter_records; apply the
same change to the other affected signatures/paths mentioned (the other method
overloads that accept task_id/agent_id) so blank-only inputs are validated at
the system boundary.

In `@src/ai_company/engine/cost_recording.py`:
- Around line 76-78: The code is currently persisting turns even when
turn.call_category is None; modify the recording path in cost_recording.py so
that before creating/saving the record (the block that sets
call_category=turn.call_category and timestamp=datetime.now(UTC)) you check if
turn.call_category is None and handle it instead of writing: either log an
error/warning with context (e.g., include turn id or session) and skip
recording, or raise an exception to fail fast; update the surrounding function
that constructs the record to early-return/raise when call_category is missing
so no uncategorized cost rows are persisted.

In `@src/ai_company/engine/loop_helpers.py`:
- Around line 258-274: In the branch where tool_invoker is None (the if
tool_invoker is None check), clear the recorded tool calls before logging the
error and returning so the turn history and ExecutionResult.total_tool_calls
aren’t inflated; specifically, empty response.tool_calls (or call
tool_calls_made.clear() if that list variable is used elsewhere) immediately
before the logger.error(...) / build_result(...) sequence that returns
TerminationReason.ERROR. Ensure the clearing happens before the error log and
return so no stale tool call entries remain.

In `@src/ai_company/engine/plan_execute_loop.py`:
- Line 531: The TurnRecord created by make_turn_record is missing call_category,
so planner/replanner turns are uncategorized; update the make_turn_record calls
that append to turns (the ones in the planner/replanner code path and the
per-step executor path) to pass call_category="SYSTEM" for planning/replanning
turns and call_category="PRODUCTIVE" for per-step execution turns so TurnRecord
instances are correctly categorized for orchestration breakdowns.
- Around line 274-279: The code updates an immutable ExecutionPlan via
_update_step_status but never writes the replacement back into all_plans, so
later serialization/_finalize and _handle_step_failure operate on stale objects;
update every site that calls _update_step_status (e.g., the block guarded by
step_ok and the other uses around the ranges that call _update_step_status and
_handle_step_failure) to assign the returned ExecutionPlan into the original
all_plans entry (use the same index/key you used when appending the original
plan), and ensure any local variable `plan` is also replaced with that returned
object before proceeding to _finalize() or returning an ExecutionResult so all
subsequent logic operates on the new immutable instance.
- Around line 667-676: Replace using response.finish_reason as the success check
with an explicit step result/status from the LLM response (e.g.,
response.step_result or response.success) so that "STOP" vs "MAX_TOKENS" are
treated only as generation end signals and not as step success; keep the
existing logger.warning(EXECUTION_PLAN_STEP_COMPLETE, ..., truncated=True) when
response.finish_reason == FinishReason.MAX_TOKENS but determine success by
calling/inspecting the explicit field used by self._assess_step_success (or by
adapting _assess_step_success to accept the explicit step result), and apply the
same change to the other identical block around lines 870-882.
- Around line 531-535: The code calls response_to_message(response)
unconditionally which raises when CompletionResponse finished with finish_reason
'error' or 'content_filter' (no content/tool calls); change the flow in the
block that uses make_turn_record and ctx.with_turn_completed to first detect
when response.finish_reason is 'error' or 'content_filter' or when response has
no content and no tool calls, and in that case: (1) avoid calling
response_to_message(), (2) create and return a clean ExecutionResult
representing planner failure (including response.finish_reason and any
error/details from CompletionResponse), and (3) still call
ctx.with_turn_completed(response.usage, None) or otherwise mark the turn
completed without an assistant ChatMessage; keep references to make_turn_record,
ctx.with_turn_completed, response_to_message, CompletionResponse, and
ExecutionResult so the fix is easy to locate.
- Around line 105-913: The module and several long methods violate
size/complexity rules: split parsing and replanning orchestration into separate
helper modules/classes so execute() and _run_steps() fall under 50 lines and the
file is <800 lines. Extract the plan parsing logic (_parse_plan,
_parse_json_plan, _parse_text_plan, _data_to_plan, _extract_task_summary) into a
new PlanParser class/module and replace calls to self._parse_* with a PlanParser
instance; likewise extract planner orchestration and replan creation
(_generate_plan, _call_planner, _replan, _handle_step_failure) into a
PlannerOrchestrator helper that exposes methods used by execute()/_run_steps;
update PlanExecuteLoop to delegate to these helpers so its execute() and
_run_steps() only orchestrate high-level control flow.

In `@src/ai_company/engine/react_loop.py`:
- Around line 24-35: make_turn_record is currently called without a category in
react_loop.py so ReAct turns are recorded as uncategorized; update the calls to
make_turn_record in this module (both ReAct branch and the other occurrence) to
pass an explicit call_category (e.g., call_category="productive" or "react") so
ReAct turns are counted in the per-category breakdowns.

In `@tests/unit/engine/test_cost_recording.py`:
- Around line 141-155: Add a third case for the COORDINATION category to the
call-category propagation test by converting test_call_category_propagated into
a parametrized test (use pytest.mark.parametrize) over call_category values
including LLMCallCategory.PRODUCTIVE, LLMCallCategory.SYSTEM, and
LLMCallCategory.COORDINATION; for each param, construct the turn via
_turn(call_category=...) and run record_execution_costs with the same
_FakeTracker, then assert the tracker.records entry preserves the passed
call_category (ensure you reference the existing test_call_category_propagated,
record_execution_costs, _turn, and _FakeTracker symbols).

In `@tests/unit/engine/test_plan_execute_loop.py`:
- Around line 470-472: The test currently only checks provider.call_count == 2
which doesn't guarantee different model IDs were used; instead inspect the
provider's recorded calls (e.g., provider.recorded_calls or provider.calls) and
assert that the first call (planner invocation) used model id "test-planner-001"
and the second call (executor invocation) used model id "test-executor-001",
while keeping the existing assertion that result.termination_reason ==
TerminationReason.COMPLETED; update the assertions to check the request payloads
or model_id fields of the two recorded provider calls rather than just the call
count.
- Around line 435-439: The test currently allows both
TerminationReason.MAX_TURNS and TerminationReason.COMPLETED for
result.termination_reason, which masks a false positive when running with
max_turns=2 against a 3-step plan; change the assertion to only accept
TerminationReason.MAX_TURNS (e.g., assert result.termination_reason ==
TerminationReason.MAX_TURNS) so the test correctly verifies max-turn termination
for the given setup and keep references to result.termination_reason,
TerminationReason.MAX_TURNS, TerminationReason.COMPLETED and the max_turns=2 /
3-step plan in your reasoning while editing.
- Around line 267-297: The test uses FinishReason.ERROR which is intercepted by
check_response_errors() before reaching _handle_step_failure(), so max_replans=0
never gets exercised; change the mocked provider sequence so the second response
is a non-error CompletionResponse (e.g., FinishReason.FINISHED) whose content
decodes into a step result that causes _handle_step_failure() to run (i.e., a
successful LLM completion that represents a failed step execution), then keep
PlanExecuteLoop(PlanExecuteConfig(max_replans=0)) and assert the
exhausted-replan behavior (TerminationReason.ERROR and loop_type) to exercise
the max-replans branch.

---

Outside diff comments:
In `@DESIGN_SPEC.md`:
- Around line 1567-1575: The prose in §10.5 is inconsistent with the
implementation: the orchestration ratio used in code is (coordination + system)
/ total but the text defines it as coordination / total. Update the prose and
any alert guidance in §10.5 to define the orchestration ratio as (coordination +
system) / total (and adjust explanatory text about thresholds/alerts
accordingly), referencing the CostRecord categories (`coordination` and
`system`) so readers understand the formula matches the implementation.

In `@src/ai_company/engine/cost_recording.py`:
- Around line 68-75: CostRecord entries are being stamped with
identity.model.provider and identity.model.model_id even though ExecutionResult
turns may come from different models; update the record creation in CostRecord
construction to use the model metadata attached to each turn (e.g.,
turn.model.provider and turn.model.model_id), falling back to
identity.model.provider/identity.model.model_id only if the turn lacks its own
model info, so each CostRecord accurately reflects the per-call model used by
ExecutionResult/turn.

In `@src/ai_company/engine/react_loop.py`:
- Around line 230-249: The code treats FinishReason.MAX_TOKENS as a successful
completion; change it to return and log a non-success termination (e.g.,
TerminationReason.TRUNCATED or another appropriate non-success value) so callers
can detect truncation and trigger fallback/retry. Specifically, in the branch
checking response.finish_reason == FinishReason.MAX_TOKENS update the
logger.warning call's reason field to the non-success TerminationReason and pass
that same TerminationReason (instead of TerminationReason.COMPLETED) into
build_result(ctx, ..., turns), keeping truncated=True and execution/context
metadata (ctx.execution_id, turns) intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 23586307-b157-457d-952a-2634bdb9f6ca

📥 Commits

Reviewing files that changed from the base of the PR and between f566fb4 and 1696235.

📒 Files selected for processing (30)
  • DESIGN_SPEC.md
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/call_category.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/budget/coordination_config.py
  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/cost_record.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/config/defaults.py
  • src/ai_company/config/schema.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/cost_recording.py
  • src/ai_company/engine/loop_helpers.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/plan_models.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/execution.py
  • tests/unit/budget/test_call_category.py
  • tests/unit/budget/test_category_analytics.py
  • tests/unit/budget/test_coordination_config.py
  • tests/unit/budget/test_coordination_metrics.py
  • tests/unit/budget/test_cost_record.py
  • tests/unit/config/conftest.py
  • tests/unit/engine/test_cost_recording.py
  • tests/unit/engine/test_loop_helpers.py
  • tests/unit/engine/test_loop_protocol.py
  • tests/unit/engine/test_plan_execute_loop.py
  • tests/unit/engine/test_plan_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649
Use except A, B: syntax (no parentheses) for exception handling — ruff enforces this on Python 3.14
Include type hints on all public functions; enforce with mypy strict mode
Use Google-style docstrings on all public classes and functions — enforced by ruff D rules
Use immutability: create new objects, never mutate existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction + MappingProxyType wrapping. For dict/list fields in frozen Pydantic models, rely on frozen=True and copy.deepcopy() at system boundaries
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict). Use @computed_field for derived values; use NotBlankStr from core.types for all identifier/name fields
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code; prefer structured concurrency over bare create_task
Maintain line length of 88 characters — enforced by ruff
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate input at system boundaries: user input, external APIs, and config files

Files:

  • src/ai_company/engine/cost_recording.py
  • src/ai_company/engine/__init__.py
  • tests/unit/budget/test_cost_record.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/observability/events/budget.py
  • tests/unit/config/conftest.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/budget/call_category.py
  • tests/unit/budget/test_call_category.py
  • tests/unit/engine/test_loop_protocol.py
  • src/ai_company/engine/plan_models.py
  • src/ai_company/budget/coordination_config.py
  • tests/unit/budget/test_coordination_metrics.py
  • src/ai_company/config/defaults.py
  • src/ai_company/engine/react_loop.py
  • tests/unit/engine/test_plan_execute_loop.py
  • tests/unit/engine/test_plan_models.py
  • tests/unit/engine/test_loop_helpers.py
  • src/ai_company/engine/loop_protocol.py
  • tests/unit/budget/test_category_analytics.py
  • src/ai_company/config/schema.py
  • tests/unit/engine/test_cost_recording.py
  • tests/unit/budget/test_coordination_config.py
  • src/ai_company/engine/loop_helpers.py
  • src/ai_company/budget/cost_record.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/engine/plan_execute_loop.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in application code — use get_logger() from observability module
Use variable name logger (not _logger, not log) for logging instances
Always use event name constants from ai_company.observability.events domain-specific modules; import directly (e.g., from ai_company.observability.events.<domain> import EVENT_CONSTANT)
Use structured logging format: logger.info(EVENT, key=value) — never logger.info("msg %s", val)
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level for object creation, internal flow, and entry/exit of key functions
Pure data models, enums, and re-exports do NOT need logging

Files:

  • src/ai_company/engine/cost_recording.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/budget/call_category.py
  • src/ai_company/engine/plan_models.py
  • src/ai_company/budget/coordination_config.py
  • src/ai_company/config/defaults.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/config/schema.py
  • src/ai_company/engine/loop_helpers.py
  • src/ai_company/budget/cost_record.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/engine/plan_execute_loop.py
src/ai_company/{providers,engine}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

RetryExhaustedError signals that all retries failed — the engine layer catches this to trigger fallback chains

Files:

  • src/ai_company/engine/cost_recording.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/engine/plan_models.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/engine/loop_helpers.py
  • src/ai_company/engine/plan_execute_loop.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples — use generic names like example-provider, example-large-001, large/medium/small as aliases, or test-provider, test-small-001 in tests

Files:

  • src/ai_company/engine/cost_recording.py
  • src/ai_company/engine/__init__.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/budget/call_category.py
  • src/ai_company/engine/plan_models.py
  • src/ai_company/budget/coordination_config.py
  • src/ai_company/config/defaults.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/config/schema.py
  • src/ai_company/engine/loop_helpers.py
  • src/ai_company/budget/cost_record.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/engine/plan_execute_loop.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Maintain 80% minimum code coverage — enforced in CI
Use asyncio_mode = "auto" in pytest configuration — no manual @pytest.mark.asyncio needed
Set 30-second timeout per test
Use pytest-xdist via -n auto for parallel test execution
Prefer @pytest.mark.parametrize for testing similar cases

Files:

  • tests/unit/budget/test_cost_record.py
  • tests/unit/config/conftest.py
  • tests/unit/budget/test_call_category.py
  • tests/unit/engine/test_loop_protocol.py
  • tests/unit/budget/test_coordination_metrics.py
  • tests/unit/engine/test_plan_execute_loop.py
  • tests/unit/engine/test_plan_models.py
  • tests/unit/engine/test_loop_helpers.py
  • tests/unit/budget/test_category_analytics.py
  • tests/unit/engine/test_cost_recording.py
  • tests/unit/budget/test_coordination_config.py
🧠 Learnings (5)
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from `ai_company.observability.events` domain-specific modules; import directly (e.g., `from ai_company.observability.events.<domain> import EVENT_CONSTANT`)

Applied to files:

  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/budget/tracker.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves

Applied to files:

  • tests/unit/engine/test_plan_models.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Use Pydantic v2 (`BaseModel`, `model_validator`, `computed_field`, `ConfigDict`). Use `computed_field` for derived values; use `NotBlankStr` from `core.types` for all identifier/name fields

Applied to files:

  • src/ai_company/config/schema.py
  • src/ai_company/budget/cost_record.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • src/ai_company/budget/tracker.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Never use `import logging` / `logging.getLogger()` / `print()` in application code — use `get_logger()` from observability module

Applied to files:

  • src/ai_company/budget/tracker.py
🧬 Code graph analysis (19)
src/ai_company/engine/__init__.py (2)
src/ai_company/engine/plan_execute_loop.py (1)
  • PlanExecuteLoop (105-913)
src/ai_company/engine/plan_models.py (4)
  • ExecutionPlan (53-88)
  • PlanExecuteConfig (91-118)
  • PlanStep (25-50)
  • StepStatus (15-22)
tests/unit/budget/test_cost_record.py (3)
src/ai_company/budget/call_category.py (1)
  • LLMCallCategory (11-25)
tests/unit/engine/test_loop_protocol.py (4)
  • test_call_category_none_default (95-103)
  • test_call_category_productive (105-114)
  • test_call_category_coordination (116-125)
  • test_call_category_system (127-136)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
src/ai_company/budget/__init__.py (4)
src/ai_company/budget/call_category.py (2)
  • LLMCallCategory (11-25)
  • OrchestrationAlertLevel (28-45)
src/ai_company/budget/category_analytics.py (2)
  • CategoryBreakdown (28-107)
  • OrchestrationRatio (110-135)
src/ai_company/budget/coordination_config.py (5)
  • CoordinationMetricName (13-20)
  • CoordinationMetricsConfig (95-129)
  • ErrorCategory (23-29)
  • ErrorTaxonomyConfig (32-49)
  • OrchestrationAlertThresholds (52-92)
src/ai_company/budget/coordination_metrics.py (6)
  • CoordinationEfficiency (17-45)
  • CoordinationMetrics (155-189)
  • CoordinationOverhead (48-70)
  • ErrorAmplification (73-98)
  • MessageDensity (101-129)
  • RedundancyRate (132-152)
tests/unit/config/conftest.py (1)
src/ai_company/budget/coordination_config.py (1)
  • CoordinationMetricsConfig (95-129)
tests/unit/budget/test_call_category.py (1)
src/ai_company/budget/call_category.py (2)
  • LLMCallCategory (11-25)
  • OrchestrationAlertLevel (28-45)
tests/unit/engine/test_loop_protocol.py (6)
src/ai_company/budget/call_category.py (1)
  • LLMCallCategory (11-25)
src/ai_company/engine/context.py (1)
  • AgentContext (87-307)
src/ai_company/engine/loop_protocol.py (3)
  • ExecutionLoop (139-177)
  • ExecutionResult (77-128)
  • get_loop_type (175-177)
src/ai_company/engine/plan_execute_loop.py (2)
  • PlanExecuteLoop (105-913)
  • get_loop_type (115-117)
src/ai_company/providers/enums.py (1)
  • FinishReason (15-22)
src/ai_company/engine/react_loop.py (1)
  • get_loop_type (62-64)
tests/unit/budget/test_coordination_metrics.py (1)
src/ai_company/budget/coordination_metrics.py (15)
  • CoordinationEfficiency (17-45)
  • CoordinationMetrics (155-189)
  • CoordinationOverhead (48-70)
  • ErrorAmplification (73-98)
  • MessageDensity (101-129)
  • RedundancyRate (132-152)
  • compute_efficiency (195-221)
  • compute_error_amplification (250-273)
  • compute_message_density (276-299)
  • compute_overhead (224-247)
  • compute_redundancy_rate (302-329)
  • value (43-45)
  • value (96-98)
  • value (127-129)
  • value_percent (68-70)
src/ai_company/engine/react_loop.py (1)
src/ai_company/engine/loop_helpers.py (10)
  • build_result (369-384)
  • call_provider (138-195)
  • check_budget (89-135)
  • check_response_errors (198-232)
  • check_shutdown (44-86)
  • clear_last_turn_tool_calls (318-329)
  • execute_tool_calls (235-315)
  • get_tool_definitions (332-339)
  • make_turn_record (351-366)
  • response_to_message (342-348)
tests/unit/engine/test_plan_execute_loop.py (3)
src/ai_company/engine/loop_protocol.py (4)
  • TerminationReason (27-34)
  • execute (147-173)
  • ExecutionLoop (139-177)
  • get_loop_type (175-177)
src/ai_company/engine/plan_execute_loop.py (3)
  • PlanExecuteLoop (105-913)
  • execute (119-197)
  • get_loop_type (115-117)
src/ai_company/engine/plan_models.py (1)
  • PlanExecuteConfig (91-118)
tests/unit/engine/test_plan_models.py (1)
src/ai_company/engine/plan_models.py (4)
  • ExecutionPlan (53-88)
  • PlanExecuteConfig (91-118)
  • PlanStep (25-50)
  • StepStatus (15-22)
tests/unit/engine/test_loop_helpers.py (1)
src/ai_company/engine/loop_helpers.py (9)
  • build_result (369-384)
  • call_provider (138-195)
  • check_budget (89-135)
  • check_response_errors (198-232)
  • check_shutdown (44-86)
  • execute_tool_calls (235-315)
  • get_tool_definitions (332-339)
  • make_turn_record (351-366)
  • response_to_message (342-348)
src/ai_company/engine/loop_protocol.py (2)
src/ai_company/budget/call_category.py (1)
  • LLMCallCategory (11-25)
src/ai_company/tools/base.py (1)
  • description (115-117)
tests/unit/budget/test_category_analytics.py (5)
src/ai_company/budget/call_category.py (2)
  • LLMCallCategory (11-25)
  • OrchestrationAlertLevel (28-45)
src/ai_company/budget/category_analytics.py (3)
  • CategoryBreakdown (28-107)
  • build_category_breakdown (138-183)
  • compute_orchestration_ratio (186-233)
src/ai_company/budget/coordination_config.py (1)
  • OrchestrationAlertThresholds (52-92)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
src/ai_company/budget/tracker.py (2)
  • CostTracker (67-412)
  • record (98-111)
src/ai_company/config/schema.py (1)
src/ai_company/budget/coordination_config.py (1)
  • CoordinationMetricsConfig (95-129)
tests/unit/budget/test_coordination_config.py (1)
src/ai_company/budget/coordination_config.py (5)
  • CoordinationMetricName (13-20)
  • CoordinationMetricsConfig (95-129)
  • ErrorCategory (23-29)
  • ErrorTaxonomyConfig (32-49)
  • OrchestrationAlertThresholds (52-92)
src/ai_company/engine/loop_helpers.py (5)
src/ai_company/providers/enums.py (2)
  • FinishReason (15-22)
  • MessageRole (6-12)
src/ai_company/providers/models.py (5)
  • ChatMessage (138-210)
  • CompletionConfig (213-254)
  • CompletionResponse (257-306)
  • ToolDefinition (64-93)
  • add_token_usage (46-61)
src/ai_company/engine/loop_protocol.py (3)
  • ExecutionResult (77-128)
  • TerminationReason (27-34)
  • TurnRecord (37-74)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • with_message (173-182)
src/ai_company/providers/protocol.py (1)
  • CompletionProvider (21-80)
src/ai_company/budget/cost_record.py (1)
src/ai_company/budget/call_category.py (1)
  • LLMCallCategory (11-25)
src/ai_company/budget/tracker.py (4)
src/ai_company/budget/call_category.py (1)
  • OrchestrationAlertLevel (28-45)
src/ai_company/budget/category_analytics.py (4)
  • CategoryBreakdown (28-107)
  • OrchestrationRatio (110-135)
  • build_category_breakdown (138-183)
  • compute_orchestration_ratio (186-233)
src/ai_company/budget/coordination_config.py (1)
  • OrchestrationAlertThresholds (52-92)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
src/ai_company/engine/plan_execute_loop.py (6)
src/ai_company/providers/models.py (3)
  • ChatMessage (138-210)
  • CompletionConfig (213-254)
  • CompletionResponse (257-306)
src/ai_company/engine/loop_protocol.py (4)
  • ExecutionResult (77-128)
  • TerminationReason (27-34)
  • TurnRecord (37-74)
  • execute (147-173)
src/ai_company/engine/plan_models.py (3)
  • ExecutionPlan (53-88)
  • PlanExecuteConfig (91-118)
  • PlanStep (25-50)
src/ai_company/engine/context.py (4)
  • AgentContext (87-307)
  • has_turns_remaining (305-307)
  • with_message (173-182)
  • with_turn_completed (184-233)
src/ai_company/providers/protocol.py (1)
  • CompletionProvider (21-80)
src/ai_company/tools/invoker.py (1)
  • ToolInvoker (61-489)
🔇 Additional comments (3)
tests/unit/config/conftest.py (1)

9-9: Good fixture alignment.

Adding coordination_metrics here keeps the shared RootConfigFactory representative of the new root config surface without extra per-test setup.

Also applies to: 73-73

tests/unit/engine/test_loop_protocol.py (1)

6-6: Nice coverage of the new loop protocol surface.

These additions cover both the new TurnRecord.call_category field and PlanExecuteLoop's ExecutionLoop contract.

Also applies to: 14-15, 95-136, 294-300

tests/unit/engine/test_plan_models.py (1)

14-314: Good coverage of the immutable plan models.

Creation, validation, frozen behavior, copy-on-write updates, and JSON round-trip are all exercised here, which should keep the new plan model contracts stable.

Based on learnings, "Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves".

Comment on lines +8 to +20
import math
from typing import TYPE_CHECKING

from pydantic import BaseModel, ConfigDict, Field

from ai_company.budget.call_category import (
    LLMCallCategory,
    OrchestrationAlertLevel,
)
from ai_company.budget.coordination_config import (
    OrchestrationAlertThresholds,
)
from ai_company.constants import BUDGET_ROUNDING_PRECISION

🛠️ Refactor suggestion | 🟠 Major

Add the required observability hook to this analytics module.

build_category_breakdown() and compute_orchestration_ratio() are budget business logic, but this file never declares logger = get_logger(__name__). That leaves direct callers of these helpers invisible in logs and breaks the repo-wide observability contract.

As per coding guidelines src/ai_company/**/*.py: Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__); always use event name constants from ai_company.observability.events domain-specific modules.

Also applies to: 138-247

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/category_analytics.py` around lines 8 - 20, This module
lacks the observability hook: add "from ai_company.observability import
get_logger" at the top and declare "logger = get_logger(__name__)" so the
file-level logger is available; then instrument the existing business logic
functions build_category_breakdown and compute_orchestration_ratio to emit
events using the domain event constants from ai_company.observability.events
(import the appropriate event names) rather than raw strings, ensuring calls to
logger.info/debug/error use those constants and that logging occurs at
entry/exit and on error paths; place the imports alongside other top-level
imports and add the logger declaration near them so all functions in this file
(including lines ~138-247) use the repo-wide observability contract.
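The pattern the prompt asks for can be sketched in isolation. `get_logger`, the event constant, and the simplified helper below are stand-ins for the project's real `ai_company.observability` API and analytics code, which are not shown in this excerpt; only the module-top hook and the structured `logger.debug(EVENT, key=value)` call shape are the point.

```python
# Stand-in for the repo's structured logger; the real one comes from
# ai_company.observability.get_logger. Keyword args mimic structured logging.
class _StructuredLogger:
    def __init__(self, name: str) -> None:
        self.name = name
        self.events: list[tuple[str, dict]] = []

    def debug(self, event: str, **context) -> None:
        self.events.append((event, context))


def get_logger(name: str) -> _StructuredLogger:
    return _StructuredLogger(name)


# Hypothetical event constant; the real one would live in an
# ai_company.observability.events domain module.
CATEGORY_BREAKDOWN_BUILT = "budget.category_breakdown_built"

logger = get_logger(__name__)


def build_category_breakdown(token_counts: dict[str, int]) -> dict[str, int]:
    """Simplified stand-in for the real analytics helper."""
    breakdown = dict(token_counts)
    # Structured logging: event constant first, context as keyword args,
    # never printf-style interpolation of values into the message.
    logger.debug(CATEGORY_BREAKDOWN_BUILT, total=sum(breakdown.values()))
    return breakdown
```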

Comment on lines +116 to +135
    Attributes:
        ratio: Orchestration ratio (0.0-1.0).
        alert_level: Alert level based on ratio thresholds.
        total_tokens: Total tokens across all categories (includes
            uncategorized tokens in the denominator).
        productive_tokens: Productive category tokens.
        coordination_tokens: Coordination category tokens.
        system_tokens: System category tokens.
    """

    model_config = ConfigDict(frozen=True)

    ratio: float = Field(ge=0.0, le=1.0, description="Orchestration ratio")
    alert_level: OrchestrationAlertLevel = Field(
        description="Alert level for orchestration overhead",
    )
    total_tokens: int = Field(ge=0, description="Total tokens")
    productive_tokens: int = Field(ge=0, description="Productive tokens")
    coordination_tokens: int = Field(ge=0, description="Coordination tokens")
    system_tokens: int = Field(ge=0, description="System tokens")

⚠️ Potential issue | 🟠 Major

Carry uncategorized tokens through OrchestrationRatio.

The denominator includes uncategorized_tokens, but the returned model drops that bucket. A caller can therefore see ratio=0.0 with total_tokens>0 and no way to tell whether overhead was truly zero or whether the traffic was simply uncategorized.

Possible fix
 class OrchestrationRatio(BaseModel):
@@
     total_tokens: int = Field(ge=0, description="Total tokens")
     productive_tokens: int = Field(ge=0, description="Productive tokens")
     coordination_tokens: int = Field(ge=0, description="Coordination tokens")
     system_tokens: int = Field(ge=0, description="System tokens")
+    uncategorized_tokens: int = Field(
+        ge=0,
+        description="Uncategorized tokens",
+    )
@@
     return OrchestrationRatio(
         ratio=round(ratio, BUDGET_ROUNDING_PRECISION),
         alert_level=alert,
         total_tokens=total,
         productive_tokens=breakdown.productive_tokens,
         coordination_tokens=breakdown.coordination_tokens,
         system_tokens=breakdown.system_tokens,
+        uncategorized_tokens=breakdown.uncategorized_tokens,
     )

Also applies to: 204-233

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/category_analytics.py` around lines 116 - 135,
OrchestrationRatio is dropping the uncategorized_tokens bucket from its public
model so callers cannot tell if a zero ratio was due to true zero overhead or
uncategorized traffic; add an uncategorized_tokens: int = Field(ge=0,
description="Uncategorized tokens") attribute to the OrchestrationRatio model
(and mirror the same change in the similar model/section referenced around lines
204-233) so the denominator and token breakdown are exposed; update any
constructor/factory that builds OrchestrationRatio to populate this new field
from the computed uncategorized_tokens value.

Comment on lines +1 to +12
"""Coordination metrics for multi-agent system tuning.

Pure computation functions for five coordination metrics defined in
DESIGN_SPEC (Coordination Metrics): efficiency, overhead, error
amplification, message density, and redundancy rate.
"""

import statistics
from typing import TYPE_CHECKING

from pydantic import BaseModel, ConfigDict, Field, computed_field


🛠️ Refactor suggestion | 🟠 Major

Add the required logger and log-before-raise on these helpers.

This new module contains business logic, but it never defines logger, and every invalid-input branch raises ValueError without emitting warning/error context first. Import get_logger, initialize logger, and log each bad-input path before raising. As per coding guidelines, "Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)" and "All error paths must log at WARNING or ERROR with context before raising."

Also applies to: 214-216, 241-243, 267-269, 293-295, 318-324

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/coordination_metrics.py` around lines 1 - 12, Add
logging to this business-logic module by importing get_logger and initializing
logger = get_logger(__name__), then update every invalid-input branch in the
coordination metric helpers (functions efficiency, overhead,
error_amplification, message_density, redundancy_rate) to call logger.warning or
logger.error with contextual details (input values and reason) immediately
before raising ValueError; ensure the import line "from ai_company.observability
import get_logger" is added and that each bad-input path logs at WARNING/ERROR
level with a clear message including the function name and offending parameters.
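The log-before-raise shape the prompt describes can be sketched on one metric helper. The logger stub, event name, and the simplified `compute_redundancy_rate` below are illustrative stand-ins, not the repo's actual API:

```python
# Minimal stub standing in for the structured logger from
# ai_company.observability; it records warnings so the pattern is testable.
class _StubLogger:
    def __init__(self) -> None:
        self.warnings: list[tuple[str, dict]] = []

    def warning(self, event: str, **context) -> None:
        self.warnings.append((event, context))


logger = _StubLogger()


def compute_redundancy_rate(duplicate_messages: int, total_messages: int) -> float:
    """Fraction of messages that were redundant (illustrative helper)."""
    if total_messages <= 0:
        # Guideline: log at WARNING with context *before* raising, so the
        # offending input is visible even if a caller swallows the exception.
        logger.warning(
            "coordination_metric_invalid_input",
            metric="redundancy_rate",
            total_messages=total_messages,
        )
        raise ValueError("total_messages must be positive")
    return duplicate_messages / total_messages
```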

Comment on lines +86 to +90
    error_rate_mas: float = Field(
        ge=0.0,
        description="Multi-agent error rate",
    )
    error_rate_sas: float = Field(gt=0, description="Single-agent error rate")

⚠️ Potential issue | 🟠 Major

Cap error-rate inputs at 1.0.

Both fields are documented as rates, but values above 1.0 currently validate and compute_error_amplification() forwards them straight through. That produces meaningless amplification ratios instead of rejecting bad inputs.

Suggested fix
     error_rate_mas: float = Field(
         ge=0.0,
+        le=1.0,
         description="Multi-agent error rate",
     )
-    error_rate_sas: float = Field(gt=0, description="Single-agent error rate")
+    error_rate_sas: float = Field(
+        gt=0,
+        le=1.0,
+        description="Single-agent error rate",
+    )

Also applies to: 250-273

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/coordination_metrics.py` around lines 86 - 90, Update
the error-rate field validators and the computation to enforce a max of 1.0:
change the Pydantic Field on error_rate_mas to include le=1.0 (keep ge=0.0) and
change error_rate_sas to use ge=0.0, le=1.0 (instead of gt=0), and then in
compute_error_amplification() defensively clamp both inputs to min(value, 1.0)
before doing any math so out-of-range inputs can't produce nonsensical
amplification ratios; reference error_rate_mas, error_rate_sas, and
compute_error_amplification when making the changes.
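The defensive-clamp half of the prompt can be sketched without Pydantic. The function below is an illustrative stand-in for `compute_error_amplification`, not the repo's implementation:

```python
def compute_error_amplification(error_rate_mas: float, error_rate_sas: float) -> float:
    """Ratio of multi-agent to single-agent error rates (illustrative)."""
    if error_rate_sas <= 0.0:
        raise ValueError("error_rate_sas must be positive")
    # Clamp both rates into [0.0, 1.0] so out-of-range inputs cannot
    # produce nonsensical amplification ratios.
    mas = min(max(error_rate_mas, 0.0), 1.0)
    sas = min(error_rate_sas, 1.0)
    return mas / sas
```

With the clamp in place, a bogus multi-agent rate of 5.0 is treated as 1.0 instead of inflating the ratio fivefold.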

Comment on lines +243 to +244
        agent_id: str | None = None,
        task_id: str | None = None,

⚠️ Potential issue | 🟡 Minor

Reject blank task_id filters before querying.

The new task_id path flows straight into _filter_records(), so " " is treated as a real identifier and quietly returns an empty breakdown/ratio instead of surfacing bad input. Validate or normalize blank values up front on these public query methods.

As per coding guidelines **/*.py: Use NotBlankStr from core.types for all identifier/name fields, and validate input at system boundaries.

Also applies to: 283-284, 437-449

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/tracker.py` around lines 243 - 244, The public query
entry points that accept agent_id and task_id currently allow blank/whitespace
strings which then propagate into _filter_records; change the parameter types
from str | None to core.types.NotBlankStr | None for all identifier/name
parameters (e.g., the methods taking agent_id: str | None, task_id: str | None)
and add an upfront normalization/validation step that rejects or normalizes
purely-blank values (trim and treat "" as None or raise) before calling
_filter_records; apply the same change to the other affected signatures/paths
mentioned (the other method overloads that accept task_id/agent_id) so
blank-only inputs are validated at the system boundary.

Comment on lines +141 to +155
    async def test_call_category_propagated(self) -> None:
        turns = (
            _turn(call_category=LLMCallCategory.PRODUCTIVE),
            _turn(turn_number=2, call_category=LLMCallCategory.SYSTEM),
        )
        tracker = _FakeTracker()
        await record_execution_costs(
            _result(turns),
            _identity(),
            "agent-1",
            "task-1",
            tracker=tracker,  # type: ignore[arg-type]
        )
        assert tracker.records[0].call_category == LLMCallCategory.PRODUCTIVE
        assert tracker.records[1].call_category == LLMCallCategory.SYSTEM

⚠️ Potential issue | 🟡 Minor

Cover the coordination category in the propagation test.

This is the only end-to-end assertion in this suite for call_category persistence, and it skips LLMCallCategory.COORDINATION, which is the category the new orchestration metrics depend on most.

♻️ Suggested test shape
-    async def test_call_category_propagated(self) -> None:
-        turns = (
-            _turn(call_category=LLMCallCategory.PRODUCTIVE),
-            _turn(turn_number=2, call_category=LLMCallCategory.SYSTEM),
-        )
+    @pytest.mark.parametrize(
+        "category",
+        (
+            LLMCallCategory.PRODUCTIVE,
+            LLMCallCategory.COORDINATION,
+            LLMCallCategory.SYSTEM,
+        ),
+    )
+    async def test_call_category_propagated(
+        self,
+        category: LLMCallCategory,
+    ) -> None:
+        turns = (_turn(call_category=category),)
         tracker = _FakeTracker()
         await record_execution_costs(
             _result(turns),
             _identity(),
             "agent-1",
             "task-1",
             tracker=tracker,  # type: ignore[arg-type]
         )
-        assert tracker.records[0].call_category == LLMCallCategory.PRODUCTIVE
-        assert tracker.records[1].call_category == LLMCallCategory.SYSTEM
+        assert tracker.records[0].call_category == category

As per coding guidelines, "Prefer @pytest.mark.parametrize for testing similar cases".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    async def test_call_category_propagated(self) -> None:
        turns = (
            _turn(call_category=LLMCallCategory.PRODUCTIVE),
            _turn(turn_number=2, call_category=LLMCallCategory.SYSTEM),
        )
        tracker = _FakeTracker()
        await record_execution_costs(
            _result(turns),
            _identity(),
            "agent-1",
            "task-1",
            tracker=tracker,  # type: ignore[arg-type]
        )
        assert tracker.records[0].call_category == LLMCallCategory.PRODUCTIVE
        assert tracker.records[1].call_category == LLMCallCategory.SYSTEM
    @pytest.mark.parametrize(
        "category",
        (
            LLMCallCategory.PRODUCTIVE,
            LLMCallCategory.COORDINATION,
            LLMCallCategory.SYSTEM,
        ),
    )
    async def test_call_category_propagated(
        self,
        category: LLMCallCategory,
    ) -> None:
        turns = (_turn(call_category=category),)
        tracker = _FakeTracker()
        await record_execution_costs(
            _result(turns),
            _identity(),
            "agent-1",
            "task-1",
            tracker=tracker,  # type: ignore[arg-type]
        )
        assert tracker.records[0].call_category == category
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/engine/test_cost_recording.py` around lines 141 - 155, Add a third
case for the COORDINATION category to the call-category propagation test by
converting test_call_category_propagated into a parametrized test (use
pytest.mark.parametrize) over call_category values including
LLMCallCategory.PRODUCTIVE, LLMCallCategory.SYSTEM, and
LLMCallCategory.COORDINATION; for each param, construct the turn via
_turn(call_category=...) and run record_execution_costs with the same
_FakeTracker, then assert the tracker.records entry preserves the passed
call_category (ensure you reference the existing test_call_category_propagated,
record_execution_costs, _turn, and _FakeTracker symbols).

Aureliolo and others added 2 commits March 7, 2026 17:59
…, and Greptile

Source fixes:
- Add check_response_errors after planner LLM call (CRITICAL)
- Add shutdown/budget checks before replan LLM call (CRITICAL)
- Extract plan_parsing.py from plan_execute_loop.py (<800 lines)
- Decompose _run_steps via _attempt_replan extraction
- Tag planning turns as SYSTEM, step turns as PRODUCTIVE
- Add XML delimiters for LLM-generated content in step instructions
- Set StepStatus.IN_PROGRESS at step start
- Use EXECUTION_PLAN_STEP_TRUNCATED event for MAX_TOKENS
- Decompose _run_step_turn into _handle_step_completion/_handle_step_tool_calls
- Fix docstrings for OrchestrationAlertLevel thresholds
- Add allow_inf_nan=False to CostRecord, CategoryBreakdown, OrchestrationRatio
- Add cross-field validation to OrchestrationRatio
- Add unique-category validators to ErrorTaxonomyConfig, CoordinationMetricsConfig
- Deep-copy ExecutionResult metadata at construction boundary
- Fix ReactLoop class docstring and add call categorization
- Narrow _parse_json_plan exception to JSONDecodeError only
- Add debug logging to silent return-None paths in plan parsing
- Fix Raises docstrings for compute_efficiency/compute_overhead
- Move CompletionResponse import to TYPE_CHECKING in plan_parsing

Test fixes:
- Add test for successful replan path
- Rewrite test_max_replans_exhausted to exercise actual replan exhaustion
- Add multi-step with tools integration test
- Add ReAct vs Plan-and-Execute comparison test
- Add tests for clear_last_turn_tool_calls
- Add tests for make_budget_checker factory
- Verify model IDs in test_different_models_for_phases
- Tighten test_max_turns_during_step to expect only MAX_TURNS
- Add check_budget MemoryError/RecursionError propagation tests
- Add model recording to MockCompletionProvider

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n no-invoker

- Add logger to category_analytics.py (CLAUDE.md: business logic needs logger)
- Add logger to coordination_metrics.py (CLAUDE.md: business logic needs logger)
- Clear tool_calls_made on turn record when no invoker is available
  (consistency with shutdown handling — tools were never executed)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 7, 2026 17:02

Copilot AI left a comment


Pull request overview

Copilot reviewed 32 out of 32 changed files in this pull request and generated 6 comments.



Comment on lines +24 to +54
_PLANNING_PROMPT = """\
You are a planning agent. Analyze the task and create a step-by-step \
execution plan. Return your plan as a JSON object with this exact schema:

```json
{
"steps": [
{
"step_number": 1,
"description": "What to do in this step",
"expected_outcome": "What should result from this step"
}
]
}
```

Each step should be concrete, actionable, and independently verifiable. \
Return ONLY the JSON object, no other text."""

_REPLAN_JSON_EXAMPLE = """\
```json
{
"steps": [
{
"step_number": 1,
"description": "What to do in this step",
"expected_outcome": "What should result from this step"
}
]
}
```"""

Copilot AI Mar 7, 2026


_PLANNING_PROMPT and _REPLAN_JSON_EXAMPLE use leading-underscore naming, which conventionally signals module-private names. However, they are imported and used by plan_execute_loop.py. Since these constants are part of the cross-module API, they should be renamed to PLANNING_PROMPT and REPLAN_JSON_EXAMPLE (without the leading underscore) to signal that they are intended for use outside plan_parsing.py.

Copilot uses AI. Check for mistakes.
import pytest

from ai_company.budget.call_category import LLMCallCategory, OrchestrationAlertLevel


Copilot AI Mar 7, 2026


All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker, which is a deviation from the established convention in this directory.

Suggested change
pytestmark = pytest.mark.timeout(30)

Comment on lines +1 to +3
"""Tests for category-based analytics."""

from datetime import UTC, datetime

Copilot AI Mar 7, 2026


All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker.

"""Tests for coordination metrics computations."""

import pytest


Copilot AI Mar 7, 2026


All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker.

Suggested change
pytestmark = pytest.mark.timeout(30)


import pytest
from pydantic import ValidationError


Copilot AI Mar 7, 2026


All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker.

Suggested change
pytestmark = pytest.mark.timeout(30)

Comment on lines +57 to +249
def parse_plan(
response: CompletionResponse,
execution_id: str,
task_summary: str,
*,
revision_number: int = 0,
) -> ExecutionPlan | None:
"""Parse an ExecutionPlan from LLM response content.

Tries JSON extraction first (with markdown code fence stripping),
then falls back to structured text parsing.

Args:
response: LLM completion response.
execution_id: Execution ID for logging.
task_summary: Brief summary of the task being planned.
revision_number: Plan revision counter (0 = original).

Returns:
Parsed ``ExecutionPlan``, or ``None`` on failure.
"""
content = response.content or ""
if not content.strip():
logger.warning(
EXECUTION_PLAN_PARSE_ERROR,
execution_id=execution_id,
reason="empty LLM response content",
)
return None

plan = _parse_json_plan(content, task_summary, revision_number)
if plan is not None:
return plan

plan = _parse_text_plan(content, task_summary, revision_number)
if plan is not None:
return plan

logger.warning(
EXECUTION_PLAN_PARSE_ERROR,
execution_id=execution_id,
content_length=len(content),
content_snippet=content[:200],
)
return None


def _parse_json_plan(
content: str,
task_summary: str,
revision_number: int,
) -> ExecutionPlan | None:
"""Try to extract a JSON plan from the content."""
json_str = content.strip()
fence_match = re.search(
r"```(?:json)?\s*\n?(.*?)```",
json_str,
re.DOTALL,
)
if fence_match:
json_str = fence_match.group(1).strip()

try:
data = json.loads(json_str)
except json.JSONDecodeError as exc:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="json",
error=str(exc),
)
return None

return _data_to_plan(data, task_summary, revision_number)


def _parse_text_plan(
content: str,
task_summary: str,
revision_number: int,
) -> ExecutionPlan | None:
"""Fallback: extract steps from numbered text lines."""
step_pattern = re.compile(
r"(?:^|\n)\s*(\d+)\.\s+(.+?)(?=\n\s*\d+\.|\Z)",
re.DOTALL,
)
matches = step_pattern.findall(content)
if not matches:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="text_fallback",
reason="no numbered steps found",
)
return None

steps: list[PlanStep] = []
for _, desc in matches:
desc_clean = desc.strip()
if not desc_clean:
continue
steps.append(
PlanStep(
step_number=len(steps) + 1,
description=desc_clean,
expected_outcome=desc_clean,
)
)

if not steps:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="text_fallback",
reason="all descriptions empty after stripping",
)
return None

try:
return ExecutionPlan(
steps=tuple(steps),
revision_number=revision_number,
original_task_summary=task_summary,
)
except ValueError as exc:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="text_fallback",
error=str(exc),
)
return None


def _data_to_plan(
data: object,
task_summary: str,
revision_number: int,
) -> ExecutionPlan | None:
"""Convert parsed JSON data to an ExecutionPlan."""
if not isinstance(data, dict):
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="json_data",
reason="top-level value is not a dict",
data_type=type(data).__name__,
)
return None

raw_steps = data.get("steps")
if not isinstance(raw_steps, list) or not raw_steps:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="json_data",
reason="missing or empty 'steps' list",
)
return None

steps: list[PlanStep] = []
for i, raw_step in enumerate(raw_steps, start=1):
if not isinstance(raw_step, dict):
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="json_data",
reason=f"step {i} is not a dict",
)
return None
desc = raw_step.get("description", "")
outcome = raw_step.get("expected_outcome", desc)
if not desc:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="json_data",
reason=f"step {i} has no description",
)
return None
steps.append(
PlanStep(
step_number=i,
description=str(desc),
expected_outcome=str(outcome),
)
)

try:
return ExecutionPlan(
steps=tuple(steps),
revision_number=revision_number,
original_task_summary=task_summary,
)
except ValueError as exc:
logger.debug(
EXECUTION_PLAN_PARSE_ERROR,
parser="json_data",
error=str(exc),
)
return None
Copy link

Copilot AI Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/ai_company/engine/plan_parsing.py (249 lines) has no dedicated test file, unlike every other source file in the engine package which has a corresponding test_*.py file. While some parsing paths are exercised indirectly through test_plan_execute_loop.py, the individual parsing functions (_parse_json_plan, _parse_text_plan, _data_to_plan) and their edge cases (e.g., non-dict top-level JSON, steps with missing descriptions, empty descriptions after stripping) aren't directly tested. A test_plan_parsing.py file should be added for parity with the rest of the package.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (6)
src/ai_company/engine/plan_execute_loop.py (2)

221-225: ⚠️ Potential issue | 🟠 Major

all_plans goes stale after step-status updates.

_update_step_status() returns a replacement plan, but all_plans keeps the older instance. _finalize() serializes all_plans, so the emitted plan history will show stale statuses after steps move to IN_PROGRESS, COMPLETED, or FAILED.

🧩 Suggested fix
             plan = self._update_step_status(
                 plan,
                 step_idx,
                 StepStatus.IN_PROGRESS,
             )
+            all_plans[-1] = plan
             logger.info(
                 EXECUTION_PLAN_STEP_START,
                 execution_id=ctx.execution_id,
                 step_number=step.step_number,
                 description=step.description,
@@
                 plan = self._update_step_status(
                     plan,
                     step_idx,
                     StepStatus.COMPLETED,
                 )
+                all_plans[-1] = plan
                 logger.info(
                     EXECUTION_PLAN_STEP_COMPLETE,
                     execution_id=ctx.execution_id,
                     step_number=step.step_number,
                 )
@@
-        plan = self._update_step_status(plan, step_idx, StepStatus.FAILED)
+        plan = self._update_step_status(plan, step_idx, StepStatus.FAILED)
+        all_plans[-1] = plan
         logger.warning(
             EXECUTION_PLAN_STEP_FAILED,
             execution_id=ctx.execution_id,
             step_number=step.step_number,
         )

Also applies to: 252-256, 315-316, 777-787

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/plan_execute_loop.py` around lines 221 - 225, The plan
object returned by _update_step_status is not propagated back into the all_plans
list, so serialized history in _finalize uses stale plan instances; update the
entry in all_plans with the returned plan whenever you call _update_step_status
(e.g., inside PlanExecuteLoop where plan = self._update_step_status(...) is
used), replacing the old all_plans[plan_index] (or the element found by plan.id)
with the new plan; apply the same replacement behavior for all other call sites
mentioned (around lines 252-256, 315-316, and 777-787) so all_plans always holds
the latest plan instances before _finalize serializes them.
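The drift is a direct consequence of the frozen-model convention: _update_step_status must return a replacement rather than mutate in place, so forgetting to write that replacement back leaves the list holding the old instance. A stand-in frozen dataclass (`Plan` and `update_step_status` are illustrative names, not the PR's models) shows the failure and the one-line fix:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Plan:
    """Stand-in for the frozen ExecutionPlan model."""
    status: str

def update_step_status(plan: Plan, status: str) -> Plan:
    # Frozen models are never mutated; callers receive a new instance.
    return replace(plan, status=status)

all_plans = [Plan(status="pending")]
plan = all_plans[-1]

plan = update_step_status(plan, "in_progress")
assert all_plans[-1].status == "pending"    # the drift: list still holds the old plan

all_plans[-1] = plan                        # the fix: write the replacement back
assert all_plans[-1].status == "in_progress"
```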

599-608: ⚠️ Potential issue | 🟠 Major

finish_reason is not a step-success signal.

STOP / MAX_TOKENS only mean generation ended. A reply like “I couldn't complete this step because …” still returns True here, so the step is marked completed and the replan path never runs. Have the executor emit an explicit success/failure marker, or a tiny JSON status, and key _assess_step_success() off that instead.

Also applies to: 709-724, 761-774

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/plan_execute_loop.py` around lines 599 - 608, The
current logic treats the LLM response finish_reason as a success signal; instead
modify the executor and success-checking logic so the model must emit an
explicit success marker (e.g. a tiny JSON like {"success": true/false, "note":
"..."} appended to its reply) and have _assess_step_success() parse that marker
from the assistant content to decide success or failure; update the instruction
templates built around the step execution (the strings at the sites shown around
the instruction construction) to require that explicit JSON status at the end,
and change the places that currently rely on finish_reason (including the
response handling paths referenced by _assess_step_success() and the call sites
around the ranges you noted) to call _assess_step_success() and act on its
boolean result instead of finish_reason.
src/ai_company/budget/coordination_metrics.py (2)

86-90: ⚠️ Potential issue | 🟠 Major

Reject error rates above 1.0.

Both fields are documented as rates, but values like 1.5 still validate and produce meaningless amplification factors. Add an upper bound to both inputs and keep the explicit zero-divisor guard for error_rate_sas.

📏 Suggested fix
     error_rate_mas: float = Field(
         ge=0.0,
+        le=1.0,
         description="Multi-agent error rate",
     )
-    error_rate_sas: float = Field(gt=0, description="Single-agent error rate")
+    error_rate_sas: float = Field(
+        gt=0,
+        le=1.0,
+        description="Single-agent error rate",
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/coordination_metrics.py` around lines 86 - 90, Add an
upper bound of 1.0 to both rate fields to reject values >1.0: update the Field()
for error_rate_mas to include le=1.0 (keeping ge=0.0) and update the Field() for
error_rate_sas to include le=1.0 while preserving its gt=0 guard; reference the
existing symbols error_rate_mas and error_rate_sas and modify their Field(...)
constraints accordingly.

8-12: 🛠️ Refactor suggestion | 🟠 Major

Add the required logger and log-before-raise on invalid inputs.

This is budget business logic, but the module still raises ValueError in several branches without a logger or event constants. That makes bad metric inputs hard to diagnose and violates the repo observability contract.

As per coding guidelines src/ai_company/**/*.py: Every module with business logic MUST have from ai_company.observability import get_logger then logger = get_logger(__name__), and all error paths must log at WARNING or ERROR with context before raising.

Also applies to: 216-218, 245-247, 271-273, 297-299, 322-328

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/coordination_metrics.py` around lines 8 - 12, Add
observability: import get_logger from ai_company.observability and instantiate
logger = get_logger(__name__) at top of
src/ai_company/budget/coordination_metrics.py, then before every ValueError
raise in the module (including the branches referenced around lines 216-218,
245-247, 271-273, 297-299, 322-328) emit a logger.warning or logger.error with
contextual fields (function name, input values, and a repo event constant) and
include a named event constant from ai_company.observability.events (e.g.,
BAD_METRIC_INPUT) so that each error path logs context before raising the
ValueError.
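A sketch of the log-before-raise shape the guideline asks for, folding in the le=1.0 upper bound from the previous comment. Stdlib logging and a local string constant stand in for the repo's get_logger and events module, and the real code would use structured key=value kwargs rather than %-formatting:

```python
import logging

logger = logging.getLogger(__name__)  # stand-in for get_logger(__name__)

BAD_METRIC_INPUT = "bad_metric_input"  # stand-in for an events-module constant

def compute_error_amplification(error_rate_mas: float, error_rate_sas: float) -> float:
    """Compute MAS/SAS error amplification, logging context before raising."""
    if not 0.0 <= error_rate_mas <= 1.0 or not 0.0 < error_rate_sas <= 1.0:
        # Error path logs at WARNING with full context BEFORE raising.
        logger.warning(
            "%s function=%s error_rate_mas=%s error_rate_sas=%s",
            BAD_METRIC_INPUT,
            "compute_error_amplification",
            error_rate_mas,
            error_rate_sas,
        )
        raise ValueError("error rates must lie in [0, 1], with sas strictly > 0")
    return error_rate_mas / error_rate_sas

assert compute_error_amplification(0.2, 0.1) == 2.0
try:
    compute_error_amplification(1.5, 0.1)  # rate > 1.0 is now rejected
except ValueError:
    pass
```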
src/ai_company/budget/category_analytics.py (2)

8-20: 🛠️ Refactor suggestion | 🟠 Major

Add the required observability hook to this analytics module.

build_category_breakdown() and compute_orchestration_ratio() are budget business logic, but this file still has no logger or budget event constants. That breaks the repo observability contract for direct callers of these helpers.

As per coding guidelines src/ai_company/**/*.py: Every module with business logic MUST have from ai_company.observability import get_logger then logger = get_logger(__name__); use event name constants from ai_company.observability.events.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/category_analytics.py` around lines 8 - 20, This module
lacks the observability hook and event constants; add "from
ai_company.observability import get_logger" and "from
ai_company.observability.events import <appropriate budget event constants>" at
the top, then create "logger = get_logger(__name__)" and emit the relevant
budget events from build_category_breakdown and compute_orchestration_ratio (use
the imported event name constants when logging/emitting) so callers follow the
repo observability contract.

116-149: ⚠️ Potential issue | 🟠 Major

Expose uncategorized_tokens on OrchestrationRatio.

total_tokens includes uncategorized traffic, but the returned model drops that bucket. Callers can therefore see ratio=0.0 with non-zero totals and no way to distinguish real zero overhead from missing categorization.

🧭 Suggested fix
 class OrchestrationRatio(BaseModel):
@@
     productive_tokens: int = Field(ge=0, description="Productive tokens")
     coordination_tokens: int = Field(ge=0, description="Coordination tokens")
     system_tokens: int = Field(ge=0, description="System tokens")
+    uncategorized_tokens: int = Field(
+        ge=0,
+        description="Uncategorized tokens",
+    )

     @model_validator(mode="after")
     def _validate_token_consistency(self) -> Self:
-        """Ensure total_tokens >= sum of category tokens."""
+        """Ensure total_tokens matches the exposed category buckets."""
         category_sum = (
-            self.productive_tokens + self.coordination_tokens + self.system_tokens
+            self.productive_tokens
+            + self.coordination_tokens
+            + self.system_tokens
+            + self.uncategorized_tokens
         )
-        if self.total_tokens < category_sum:
+        if self.total_tokens != category_sum:
             msg = (
-                f"total_tokens ({self.total_tokens}) must be >= "
+                f"total_tokens ({self.total_tokens}) must equal "
                 f"sum of category tokens ({category_sum})"
             )
             raise ValueError(msg)
         return self
@@
             total_tokens=0,
             productive_tokens=0,
             coordination_tokens=0,
             system_tokens=0,
+            uncategorized_tokens=0,
         )
@@
         total_tokens=total,
         productive_tokens=breakdown.productive_tokens,
         coordination_tokens=breakdown.coordination_tokens,
         system_tokens=breakdown.system_tokens,
+        uncategorized_tokens=breakdown.uncategorized_tokens,
     )

Also applies to: 225-247

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/category_analytics.py` around lines 116 - 149, Add an
uncategorized_tokens Field to the OrchestrationRatio model and include it in the
token consistency check and docstring: add "uncategorized_tokens: int =
Field(ge=0, description='Uncategorized tokens')" to the model fields, update the
class docstring to document uncategorized_tokens, and modify
_validate_token_consistency to compute category_sum = self.productive_tokens +
self.coordination_tokens + self.system_tokens + self.uncategorized_tokens so
total_tokens >= category_sum; apply the same change to the second analogous
block noted (lines ~225-247).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/engine/plan_execute_loop.py`:
- Around line 685-707: After updating ctx with ctx.with_turn_completed (so
accumulated_cost includes this turn), immediately call the budget check
(check_budget(ctx) or the existing budget-checking helper used elsewhere) and if
it reports budget exhausted, do not execute any tools: log the exhaustion and
return the same path as a completed step (i.e., call and return
self._handle_step_completion(ctx, response, turn_number)) instead of calling
_handle_step_tool_calls; this ensures no side-effecting tool calls run after
budget is exhausted.
- Around line 140-150: The planning-phase early returns bypass self._finalize(),
so modify execute() to call self._finalize(...) before returning when
_run_planning_phase(...) yields an ExecutionResult; specifically, after
detecting isinstance(plan_result, ExecutionResult) call
self._finalize(loop_type="planning", plan=... ) with the same plan metadata used
by step-phase exits (e.g., the plan object from planner_model or plan_result)
and any required finalization args (shutdown/budget info), then return
plan_result.

In `@src/ai_company/engine/plan_parsing.py`:
- Around line 95-100: The logger call that currently includes raw planner output
via content_snippet=content[:200] must be changed to avoid logging user/model
data; in the code around the logger.warning call in plan_parsing.py (the
EXECUTION_PLAN_PARSE_ERROR log using execution_id and content), remove the
content_snippet field and instead log a non-sensitive fingerprint (e.g., compute
a SHA-256 hex digest of content and include content_hash) while keeping
content_length and execution_id; update the logging call to include content_hash
(or similar) and/or other structured metadata but not any substring of content.
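The suggested fingerprint is a couple of lines; here is a sketch of the replacement log fields, where `content_fingerprint` is an illustrative helper name rather than anything in the PR:

```python
import hashlib

def content_fingerprint(content: str) -> dict[str, object]:
    """Loggable metadata for an unparseable planner reply: a SHA-256
    fingerprint plus the length, instead of a raw content_snippet."""
    return {
        "content_length": len(content),
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }

fields = content_fingerprint("not a plan")
assert fields["content_length"] == 10
assert len(fields["content_hash"]) == 64  # hex digest; no raw content leaked
```

Two failures with identical content still correlate via the hash, while the model output itself never reaches the logs.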

In `@src/ai_company/engine/react_loop.py`:
- Around line 201-204: The shutdown ExecutionResult is built before we clear
tool calls, so after calling clear_last_turn_tool_calls(turns) you must rebuild
the shutdown result to reflect the mutated turns; i.e., call check_shutdown(ctx,
shutdown_checker, turns) again (or otherwise reconstruct the ExecutionResult
from the updated turns) so the returned value from the shutdown branch
accurately reports no-executed tool calls (referencing check_shutdown,
clear_last_turn_tool_calls, TurnRecord, ExecutionResult, and the turns
variable).

In `@tests/unit/engine/test_plan_execute_loop.py`:
- Around line 191-199: Add assertions that validate the category of each turn in
the happy-path test: verify the first turn in result.turns (the planning turn)
has category SYSTEM and that subsequent execution turns have category
PRODUCTIVE; keep existing checks (termination_reason, turn count, metadata) and
insert checks referencing result.turns (e.g., result.turns[0].category and each
turn in result.turns[1:]) so a regression that mis-categorizes planning vs.
execution will fail.

---

Duplicate comments:
In `@src/ai_company/budget/category_analytics.py`:
- Around line 8-20: This module lacks the observability hook and event
constants; add "from ai_company.observability import get_logger" and "from
ai_company.observability.events import <appropriate budget event constants>" at
the top, then create "logger = get_logger(__name__)" and emit the relevant
budget events from build_category_breakdown and compute_orchestration_ratio (use
the imported event name constants when logging/emitting) so callers follow the
repo observability contract.
- Around line 116-149: Add an uncategorized_tokens Field to the
OrchestrationRatio model and include it in the token consistency check and
docstring: add "uncategorized_tokens: int = Field(ge=0,
description='Uncategorized tokens')" to the model fields, update the class
docstring to document uncategorized_tokens, and modify
_validate_token_consistency to compute category_sum = self.productive_tokens +
self.coordination_tokens + self.system_tokens + self.uncategorized_tokens so
total_tokens >= category_sum; apply the same change to the second analogous
block noted (lines ~225-247).

In `@src/ai_company/budget/coordination_metrics.py`:
- Around line 86-90: Add an upper bound of 1.0 to both rate fields to reject
values >1.0: update the Field() for error_rate_mas to include le=1.0 (keeping
ge=0.0) and update the Field() for error_rate_sas to include le=1.0 while
preserving its gt=0 guard; reference the existing symbols error_rate_mas and
error_rate_sas and modify their Field(...) constraints accordingly.
- Around line 8-12: Add observability: import get_logger from
ai_company.observability and instantiate logger = get_logger(__name__) at top of
src/ai_company/budget/coordination_metrics.py, then before every ValueError
raise in the module (including the branches referenced around lines 216-218,
245-247, 271-273, 297-299, 322-328) emit a logger.warning or logger.error with
contextual fields (function name, input values, and a repo event constant) and
include a named event constant from ai_company.observability.events (e.g.,
BAD_METRIC_INPUT) so that each error path logs context before raising the
ValueError.

In `@src/ai_company/engine/plan_execute_loop.py`:
- Around line 221-225: The plan object returned by _update_step_status is not
propagated back into the all_plans list, so serialized history in _finalize uses
stale plan instances; update the entry in all_plans with the returned plan
whenever you call _update_step_status (e.g., inside PlanExecuteLoop where plan =
self._update_step_status(...) is used), replacing the old all_plans[plan_index]
(or the element found by plan.id) with the new plan; apply the same replacement
behavior for all other call sites mentioned (around lines 252-256, 315-316, and
777-787) so all_plans always holds the latest plan instances before _finalize
serializes them.
- Around line 599-608: The current logic treats the LLM response finish_reason
as a success signal; instead modify the executor and success-checking logic so
the model must emit an explicit success marker (e.g. a tiny JSON like
{"success": true/false, "note": "..."} appended to its reply) and have
_assess_step_success() parse that marker from the assistant content to decide
success or failure; update the instruction templates built around the step
execution (the strings at the sites shown around the instruction construction)
to require that explicit JSON status at the end, and change the places that
currently rely on finish_reason (including the response handling paths
referenced by _assess_step_success() and the call sites around the ranges you
noted) to call _assess_step_success() and act on its boolean result instead of
finish_reason.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f944f2e3-8aab-48be-af91-f36f4b51323f

📥 Commits

Reviewing files that changed from the base of the PR and between 1696235 and c493a90.

📒 Files selected for processing (14)
  • src/ai_company/budget/call_category.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/budget/coordination_config.py
  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/cost_record.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/engine/plan_parsing.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/observability/events/execution.py
  • tests/unit/engine/conftest.py
  • tests/unit/engine/test_loop_helpers.py
  • tests/unit/engine/test_loop_protocol.py
  • tests/unit/engine/test_plan_execute_loop.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649
Use except A, B: syntax (no parentheses) for exception handling — ruff enforces this on Python 3.14
Include type hints on all public functions; enforce with mypy strict mode
Use Google-style docstrings on all public classes and functions — enforced by ruff D rules
Use immutability: create new objects, never mutate existing ones. For non-Pydantic internal collections, use copy.deepcopy() at construction + MappingProxyType wrapping. For dict/list fields in frozen Pydantic models, rely on frozen=True and copy.deepcopy() at system boundaries
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict). Use @computed_field for derived values; use NotBlankStr from core.types for all identifier/name fields
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code; prefer structured concurrency over bare create_task
Maintain line length of 88 characters — enforced by ruff
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate input at system boundaries: user input, external APIs, and config files

Files:

  • src/ai_company/budget/coordination_metrics.py
  • tests/unit/engine/test_loop_helpers.py
  • src/ai_company/budget/call_category.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/engine/plan_parsing.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/budget/coordination_config.py
  • tests/unit/engine/test_loop_protocol.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/react_loop.py
  • tests/unit/engine/conftest.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/budget/cost_record.py
  • tests/unit/engine/test_plan_execute_loop.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)
Never use import logging / logging.getLogger() / print() in application code — use get_logger() from observability module
Use variable name logger (not _logger, not log) for logging instances
Always use event name constants from ai_company.observability.events domain-specific modules; import directly (e.g., from ai_company.observability.events.<domain> import EVENT_CONSTANT)
Use structured logging format: logger.info(EVENT, key=value) — never logger.info("msg %s", val)
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level for object creation, internal flow, and entry/exit of key functions
Pure data models, enums, and re-exports do NOT need logging

Files:

  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/call_category.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/engine/plan_parsing.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/budget/coordination_config.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/budget/cost_record.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples — use generic names like example-provider, example-large-001, large/medium/small as aliases, or test-provider, test-small-001 in tests

Files:

  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/call_category.py
  • src/ai_company/budget/category_analytics.py
  • src/ai_company/engine/plan_parsing.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/budget/coordination_config.py
  • src/ai_company/observability/events/execution.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/plan_execute_loop.py
  • src/ai_company/budget/cost_record.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow
Maintain 80% minimum code coverage — enforced in CI
Use asyncio_mode = "auto" in pytest configuration — no manual @pytest.mark.asyncio needed
Set 30-second timeout per test
Use pytest-xdist via -n auto for parallel test execution
Prefer @pytest.mark.parametrize for testing similar cases
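A short sketch of the test conventions above (marker plus `parametrize`; the test body and values are illustrative, not from the repo). With `asyncio_mode = "auto"` configured, async tests would need no extra decorator:

```python
# Hypothetical unit test following the conventions: a pytest marker and
# @pytest.mark.parametrize for similar cases instead of copy-pasted tests.
import pytest


@pytest.mark.unit
@pytest.mark.parametrize(
    ("prompt_tokens", "completion_tokens", "expected_total"),
    [
        (10, 5, 15),
        (0, 0, 0),
        (100, 1, 101),
    ],
)
def test_token_totals(prompt_tokens: int, completion_tokens: int, expected_total: int) -> None:
    # Markers only attach metadata; the function stays directly callable.
    assert prompt_tokens + completion_tokens == expected_total
```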

Files:

  • tests/unit/engine/test_loop_helpers.py
  • tests/unit/engine/test_loop_protocol.py
  • tests/unit/engine/conftest.py
  • tests/unit/engine/test_plan_execute_loop.py
src/ai_company/{providers,engine}/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

RetryExhaustedError signals that all retries failed — the engine layer catches this to trigger fallback chains
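The contract can be illustrated with a hedged sketch (all names besides `RetryExhaustedError` are hypothetical): the provider layer raises once its retries are spent, and the engine layer catches it to walk a fallback chain.

```python
# Hedged sketch of the RetryExhaustedError contract; the engine-side helper
# and provider callables here are illustrative, not the project's real API.
class RetryExhaustedError(Exception):
    """Raised by the provider layer when all retries have failed."""


def complete_with_fallback(providers, prompt):
    """Try each provider in order; move to the next only on RetryExhaustedError."""
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except RetryExhaustedError as exc:
            last_error = exc  # this provider is exhausted; try the next in the chain
    raise last_error  # whole chain exhausted: surface the final failure


def flaky(prompt):
    raise RetryExhaustedError("primary down")


def stable(prompt):
    return f"ok:{prompt}"


result = complete_with_fallback([flaky, stable], "hello")
```

Other exception types deliberately propagate unchanged, so only the "all retries failed" signal triggers fallback.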

Files:

  • src/ai_company/engine/plan_parsing.py
  • src/ai_company/engine/loop_protocol.py
  • src/ai_company/engine/react_loop.py
  • src/ai_company/engine/plan_execute_loop.py
🧠 Learnings (9)
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`

Applied to files:

  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Never use `import logging` / `logging.getLogger()` / `print()` in application code — use `get_logger()` from observability module

Applied to files:

  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : All error paths must log at WARNING or ERROR with context before raising

Applied to files:

  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Use variable name `logger` (not `_logger`, not `log`) for logging instances

Applied to files:

  • src/ai_company/budget/coordination_metrics.py
  • src/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Use structured logging format: `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`

Applied to files:

  • src/ai_company/budget/coordination_metrics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from `ai_company.observability.events` domain-specific modules; import directly (e.g., `from ai_company.observability.events.<domain> import EVENT_CONSTANT`)

Applied to files:

  • src/ai_company/budget/category_analytics.py
  • src/ai_company/observability/events/execution.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Keep functions under 50 lines and files under 800 lines

Applied to files:

  • src/ai_company/engine/plan_execute_loop.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Use Pydantic v2 (`BaseModel`, `model_validator`, `computed_field`, `ConfigDict`). Use `computed_field` for derived values; use `NotBlankStr` from `core.types` for all identifier/name fields

Applied to files:

  • src/ai_company/budget/cost_record.py
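The `computed_field` convention can be sketched against the `TokenUsage` pattern mentioned elsewhere in this PR (field names assumed; the real model lives in `src/ai_company/providers/models.py`):

```python
# Sketch of the Pydantic v2 convention above: a frozen model whose derived
# value is a computed_field rather than a stored field. Field names assumed.
from pydantic import BaseModel, ConfigDict, computed_field


class TokenUsage(BaseModel):
    model_config = ConfigDict(frozen=True)

    prompt_tokens: int
    completion_tokens: int

    @computed_field  # included in model_dump()/serialization automatically
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


usage = TokenUsage(prompt_tokens=12, completion_tokens=8)
```

Because the total is derived, it can never drift out of sync with the two inputs, which is the motivation for the `make TokenUsage.total_tokens a computed field` change referenced in the changelog below.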
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/{providers,engine}/**/*.py : `RetryExhaustedError` signals that all retries failed — the engine layer catches this to trigger fallback chains

Applied to files:

  • tests/unit/engine/test_plan_execute_loop.py
🧬 Code graph analysis (4)
tests/unit/engine/test_loop_helpers.py (7)
src/ai_company/engine/context.py (2)
  • AgentContext (87-307)
  • with_message (173-182)
src/ai_company/engine/loop_helpers.py (1)
  • build_result (369-384)
src/ai_company/engine/loop_protocol.py (4)
  • TerminationReason (28-35)
  • TurnRecord (38-75)
  • execute (154-180)
  • ExecutionResult (78-135)
src/ai_company/providers/enums.py (2)
  • FinishReason (15-22)
  • MessageRole (6-12)
src/ai_company/providers/models.py (5)
  • CompletionConfig (213-254)
  • CompletionResponse (257-306)
  • TokenUsage (12-35)
  • ToolCall (96-119)
  • ChatMessage (138-210)
src/ai_company/tools/base.py (4)
  • BaseTool (56-161)
  • ToolExecutionResult (24-53)
  • description (115-117)
  • category (110-112)
src/ai_company/tools/registry.py (1)
  • ToolRegistry (30-122)
src/ai_company/budget/category_analytics.py (3)
src/ai_company/budget/call_category.py (2)
  • LLMCallCategory (11-25)
  • OrchestrationAlertLevel (28-45)
src/ai_company/budget/coordination_config.py (1)
  • OrchestrationAlertThresholds (60-100)
src/ai_company/budget/cost_record.py (2)
  • CostRecord (15-56)
  • _validate_token_consistency (51-56)
src/ai_company/engine/loop_protocol.py (1)
src/ai_company/budget/call_category.py (1)
  • LLMCallCategory (11-25)
src/ai_company/budget/cost_record.py (1)
src/ai_company/budget/call_category.py (1)
  • LLMCallCategory (11-25)
🔇 Additional comments (2)
tests/unit/engine/conftest.py (2)

191-206: Defensive exposure of recorded models looks good.

Initializing _recorded_models and returning a copied list from recorded_models matches the existing config-tracking pattern and keeps the test double encapsulated.


220-222: Model capture is wired into complete() correctly.

Appending model at call time ensures the recorded sequence stays aligned with call_count and recorded_configs, which makes multi-call assertions straightforward.

Comment on lines +685 to +707
ctx = ctx.with_turn_completed(
response.usage,
response_to_message(response),
)
logger.info(
EXECUTION_LOOP_TURN_COMPLETE,
execution_id=ctx.execution_id,
turn=turn_number,
finish_reason=response.finish_reason.value,
tool_call_count=len(response.tool_calls),
)

if not response.tool_calls:
return self._handle_step_completion(ctx, response, turn_number)

return await self._handle_step_tool_calls(
ctx,
tool_invoker,
response,
turn_number,
turns,
shutdown_checker,
)
⚠️ Potential issue | 🟠 Major

Budget exhaustion can still execute tools.

After ctx.with_turn_completed(...), this turn's cost is already in ctx.accumulated_cost. If that response exhausted the budget and also requested tools, we still execute them before the next check_budget(), which can fire side-effecting tools after the hard stop.

💸 Suggested fix
         if not response.tool_calls:
             return self._handle_step_completion(ctx, response, turn_number)
 
+        budget_result = check_budget(ctx, budget_checker, turns)
+        if budget_result is not None:
+            clear_last_turn_tool_calls(turns)
+            return budget_result
+
         return await self._handle_step_tool_calls(
             ctx,
             tool_invoker,
             response,
             turn_number,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/plan_execute_loop.py` around lines 685 - 707: after updating ctx with ctx.with_turn_completed (so accumulated_cost includes this turn), immediately run the budget check (check_budget(ctx) or the budget-checking helper used elsewhere). If it reports the budget exhausted, do not execute any tools: log the exhaustion and return via the same path as a completed step (call and return self._handle_step_completion(ctx, response, turn_number)) instead of calling _handle_step_tool_calls. This ensures no side-effecting tool calls run after the budget is exhausted.

Comment on lines +95 to +100
logger.warning(
EXECUTION_PLAN_PARSE_ERROR,
execution_id=execution_id,
content_length=len(content),
content_snippet=content[:200],
)
⚠️ Potential issue | 🟠 Major

Don't log raw planner output on parse errors.

content_snippet=content[:200] writes arbitrary model/task output into warning logs. On plan failures that can leak user data, secrets, or code into retained logs; keep structured metadata like length/hash instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/plan_parsing.py` around lines 95 - 100: the EXECUTION_PLAN_PARSE_ERROR warning currently includes raw planner output via content_snippet=content[:200]. Remove the content_snippet field and log a non-sensitive fingerprint instead: compute a SHA-256 hex digest of content and include it as content_hash, keeping content_length and execution_id. The call may carry other structured metadata, but no substring of content.
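The suggested fix amounts to replacing the raw snippet with a fingerprint. A hedged sketch (the helper and field names are hypothetical; only the hash-instead-of-snippet idea comes from the review):

```python
# Hedged sketch of the fix: build structured log fields for a plan parse
# error without including any substring of the planner output.
import hashlib


def parse_error_fields(content: str, execution_id: str) -> dict:
    return {
        "execution_id": execution_id,
        "content_length": len(content),
        # A SHA-256 digest lets two failures be correlated across log lines
        # without retaining any of the (possibly sensitive) content itself.
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    }


fields = parse_error_fields("not valid json", "exec-123")
```

The digest still lets operators tell "same bad output, retried" apart from "different failure" while keeping the logs free of model or user data.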

Copilot feedback:
- Add pytestmark = pytest.mark.timeout(30) to 4 budget test files
- Add dedicated test_plan_parsing.py for plan parsing edge cases

CodeRabbit feedback:
- Wrap planning-phase early returns in _finalize() for metadata consistency
- Fix clear_last_turn_tool_calls race: rebuild result with cleaned turns
  (both ReactLoop and PlanExecuteLoop _handle_step_tool_calls)
- Remove raw content snippet from parse_plan warning log (security)
- Assert call categories (SYSTEM/PRODUCTIVE) in happy-path test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Aureliolo Aureliolo merged commit 9b2699f into main Mar 7, 2026
10 checks passed
@Aureliolo Aureliolo deleted the feat/engine-extensions branch March 7, 2026 17:17
Aureliolo added a commit that referenced this pull request Mar 10, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Aureliolo added a commit that referenced this pull request Mar 11, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.0...v0.1.0) (2026-03-11)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add mandatory JWT + API key authentication
([#256](#256))
([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable output scan response policies
([#263](#263))
([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence
([#279](#279))
([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format
([#286](#286))
([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2
([#271](#271))
([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0
([#273](#273))
([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0
([#272](#272))
([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0
([#270](#270))
([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0
([#274](#274))
([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0
([#275](#275))
([29dd16c](29dd16c))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1
([#282](#282))
([2f4703d](2f4703d))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>