feat: implement engine extensions — Plan-and-Execute loop and call categorization#159
feat: implement engine extensions — Plan-and-Execute loop and call categorization#159
Conversation
…tegorization (#134, #135) Workstream A — Plan-and-Execute Loop (#134): - Extract shared loop helpers from ReactLoop into loop_helpers.py - Add plan data models (StepStatus, PlanStep, ExecutionPlan, PlanExecuteConfig) - Implement PlanExecuteLoop with two-phase execution (plan → execute steps), mini-ReAct sub-loops per step, re-planning on failure, and JSON/text parsing - Add plan event constants for observability Workstream B — Call Categorization & Coordination Metrics (#135): - Add LLMCallCategory enum (productive/coordination/system) and OrchestrationAlertLevel - Add call_category field to CostRecord and TurnRecord with full propagation - Implement category analytics (CategoryBreakdown, OrchestrationRatio) - Implement coordination metrics (efficiency, overhead, error amplification, message density, redundancy rate) as pure computation functions - Add CoordinationMetricsConfig and OrchestrationAlertThresholds - Add CostTracker.get_category_breakdown() and get_orchestration_ratio() queries - Add coordination_metrics to RootConfig and defaults Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…computed fields Pre-reviewed by 8 agents, 36 findings addressed: - Decompose oversized functions (plan_execute_loop.py: 198→45 lines) - Fix format string injection vulnerability (.format() with LLM content) - Fix step numbering bug in _parse_text_plan - Convert coordination metrics to @computed_field pattern - Add boundary constraints to metric input fields - Narrow except clauses (TypeError masks bugs) - Add structured logging to silent failure paths - Replace dict[str, Any] with dict[str, object] in metadata - Use NotBlankStr for identifier fields - Add shared clear_last_turn_tool_calls helper - Update DESIGN_SPEC.md §6.5, §10.5, §15.3, §15.5 - Add test_cost_recording.py (8 tests) - Fix mypy errors in test files (TYPE_CHECKING, isinstance guards) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
|
Caution Review failedThe pull request is closed. ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (9)
📝 WalkthroughSummary by CodeRabbit
WalkthroughAdds Plan‑and‑Execute loop (planning + per‑step execution with replanning), shared loop helpers, LLM call categorization, coordination metrics and orchestration analytics, new plan models/parsing, CostRecord/tracker category APIs, ReactLoop refactor to use helpers, observability events, config/schema updates, and extensive unit tests. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Agent
participant PlanExec as "PlanExecuteLoop"
participant Provider
participant ToolInv as "ToolInvoker"
participant Tracker as "CostTracker"
User->>Agent: submit task
Agent->>PlanExec: execute(context, provider, tool_invoker)
PlanExec->>Provider: planner call (generate ExecutionPlan)
Provider-->>PlanExec: plan (ExecutionPlan)
loop for each PlanStep
PlanExec->>Provider: executor call (mini‑ReAct for step)
Provider-->>PlanExec: CompletionResponse
alt tool invocation requested
PlanExec->>ToolInv: invoke tool(s)
ToolInv-->>PlanExec: tool results
end
PlanExec->>Tracker: record turn cost (with call_category)
alt step failed and replans remaining
PlanExec->>Provider: replan request
Provider-->>PlanExec: revised ExecutionPlan
end
end
PlanExec-->>Agent: ExecutionResult (with plans/replans metadata)
Agent->>Tracker: get_category_breakdown / get_orchestration_ratio
Tracker-->>Agent: CategoryBreakdown / OrchestrationRatio
Agent-->>User: final response + metrics
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
✨ Simplify code
Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the agent execution engine by introducing the Plan-and-Execute loop, a new strategy for task decomposition and execution. It also lays the groundwork for advanced LLM call analytics and multi-agent coordination metrics, providing deeper insights into agent performance and cost. The changes improve modularity, extend configurability, and update documentation to reflect these new capabilities. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces the Plan-and-Execute execution loop, moving it from a planned M4+ feature to M3, alongside significant enhancements to LLM call analytics. The changes include new data models and computation functions for categorizing LLM calls (productive, coordination, system) and calculating orchestration overhead ratios, with corresponding configuration options and query methods added to the CostTracker. The core execution loops (ReAct and the new Plan-and-Execute) now leverage shared helper functions for common tasks like budget checking, shutdown handling, and turn recording, and CostRecord and TurnRecord models are updated to include an optional call_category field. Review comments highlight a potential indirect prompt injection vulnerability in the PlanExecuteLoop due to unsanitized LLM-generated content in f-strings, recommend moving a LLMCallCategory import to module level to prevent NameError, point out an invalid Python 3 except syntax in loop_helpers.py and react_loop.py that needs to be updated to a tuple, and suggest categorizing planning calls as SYSTEM and step execution calls as PRODUCTIVE for accurate analytics.
| instruction = ( | ||
| f"Execute step {step.step_number}: {step.description}\n" | ||
| f"Expected outcome: {step.expected_outcome}\n" | ||
| f"When done, respond with a summary of what you accomplished." | ||
| ) |
There was a problem hiding this comment.
The PlanExecuteLoop implementation is vulnerable to indirect prompt injection. It takes step descriptions and expected outcomes directly from LLM-generated plans (which can be influenced by external data from tool outputs) and embeds them into subsequent prompts using f-strings without any sanitization or structural separation. An attacker could craft tool outputs that, when summarized into a plan step, inject malicious instructions that the agent will then execute in the next turn. It is recommended to use clear delimiters and explicit instructions to the LLM to treat the content as data.
instruction = (
f"Execute the following step {step.step_number}:\n"
f"<step_description>\n{step.description}\n</step_description>\n"
f"Expected outcome:\n<expected_outcome>\n{step.expected_outcome}\n</expected_outcome>\n"
f"When done, respond with a summary of what you accomplished."
)| ) | ||
|
|
||
| if TYPE_CHECKING: | ||
| from ai_company.budget.call_category import LLMCallCategory |
There was a problem hiding this comment.
| return None | ||
| try: | ||
| shutting_down = shutdown_checker() | ||
| except MemoryError, RecursionError: |
There was a problem hiding this comment.
The except syntax except MemoryError, RecursionError: is invalid Python 3 syntax and will cause a SyntaxError. More critically, this improper exception handling fails to correctly catch RecursionError as intended, leading to critical errors being swallowed. This pattern should be corrected to use a tuple for multiple exceptions, as shown in the suggestion. This issue is present in multiple places in this file (lines 63, 108, 180, 288).
except (MemoryError, RecursionError):| if isinstance(response, ExecutionResult): | ||
| return response | ||
|
|
||
| turns.append(make_turn_record(turn_number, response)) |
There was a problem hiding this comment.
The planning and re-planning calls are a form of framework overhead and should be categorized as SYSTEM calls for accurate analytics. This PR introduces call categorization, and this seems to be a missed spot.
To fix this, you'll also need to import LLMCallCategory at the top of the file:
from ai_company.budget.call_category import LLMCallCategory turns.append(make_turn_record(turn_number, response, call_category=LLMCallCategory.SYSTEM))| if isinstance(response, ExecutionResult): | ||
| return response | ||
|
|
||
| turns.append(make_turn_record(turn_number, response)) |
There was a problem hiding this comment.
src/ai_company/engine/react_loop.py
Outdated
| return response | ||
|
|
||
| turns.append(_make_turn_record(turn_number, response)) | ||
| turns.append(make_turn_record(turn_number, response)) |
There was a problem hiding this comment.
The LLM calls within the ReAct loop are direct task work and should be categorized as PRODUCTIVE for analytics. This seems to be a missed spot for the new call categorization feature.
To fix this, you'll also need to import LLMCallCategory at the top of the file:
from ai_company.budget.call_category import LLMCallCategory| turns.append(make_turn_record(turn_number, response)) | |
| turns.append(make_turn_record(turn_number, response, call_category=LLMCallCategory.PRODUCTIVE)) |
Greptile SummaryThis PR implements two major engine extensions: a full Plan-and-Execute execution loop ( The implementation is well-engineered overall: the phase separation (planning → step execution → replanning) is clean and easy to follow, the Key findings:
Confidence Score: 4/5
|
There was a problem hiding this comment.
Pull request overview
This PR implements two major features: the Plan-and-Execute execution loop (Loop 2) and LLM call categorization with coordination metrics. The Plan-and-Execute loop follows the ExecutionLoop protocol, decomposing tasks into steps via LLM planning then executing each step with a mini-ReAct sub-loop, with re-planning on failure. The call categorization adds LLMCallCategory (PRODUCTIVE/COORDINATION/SYSTEM) to TurnRecord and CostRecord, with per-category analytics, orchestration ratio computation, and five coordination metrics (efficiency, overhead, error amplification, message density, redundancy rate).
Changes:
- New Plan-and-Execute loop (
plan_execute_loop.py,plan_models.py) with shared helpers extracted from ReactLoop intoloop_helpers.py, andcost_recording.pyupdates for category propagation - New budget modules (
call_category.py,category_analytics.py,coordination_metrics.py,coordination_config.py) withCostTrackerquery methods andRootConfigintegration - Comprehensive test coverage across all new modules with observability event constants and DESIGN_SPEC documentation updates
Reviewed changes
Copilot reviewed 30 out of 30 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
src/ai_company/engine/plan_execute_loop.py |
New Plan-and-Execute loop with two-phase execution, replanning, and plan parsing |
src/ai_company/engine/plan_models.py |
Frozen Pydantic models for plan steps, execution plans, and loop config |
src/ai_company/engine/loop_helpers.py |
Shared stateless helpers extracted from ReactLoop for reuse |
src/ai_company/engine/react_loop.py |
Refactored to use shared loop_helpers instead of inline methods |
src/ai_company/engine/loop_protocol.py |
Added call_category to TurnRecord, changed metadata type to dict[str, object] |
src/ai_company/engine/cost_recording.py |
Propagates call_category from TurnRecord to CostRecord |
src/ai_company/engine/__init__.py |
Re-exports new plan-execute types |
src/ai_company/budget/call_category.py |
LLMCallCategory and OrchestrationAlertLevel enums |
src/ai_company/budget/category_analytics.py |
CategoryBreakdown, OrchestrationRatio, and pure computation functions |
src/ai_company/budget/coordination_metrics.py |
Five coordination metric models with @computed_field |
src/ai_company/budget/coordination_config.py |
Config models for metric collection and alert thresholds |
src/ai_company/budget/cost_record.py |
Added call_category field |
src/ai_company/budget/tracker.py |
New get_category_breakdown and get_orchestration_ratio query methods, task_id filter |
src/ai_company/budget/__init__.py |
Re-exports new budget types |
src/ai_company/config/schema.py |
Added coordination_metrics field to RootConfig |
src/ai_company/config/defaults.py |
Added coordination_metrics default |
src/ai_company/observability/events/execution.py |
Plan-and-execute event constants |
src/ai_company/observability/events/budget.py |
Category analytics event constants |
DESIGN_SPEC.md |
Updated §6.5, §10.5, §15.3, §15.5 for new features |
tests/unit/engine/test_plan_execute_loop.py |
Comprehensive tests for Plan-and-Execute loop |
tests/unit/engine/test_plan_models.py |
Tests for plan data models |
tests/unit/engine/test_loop_helpers.py |
Tests for shared loop helpers |
tests/unit/engine/test_loop_protocol.py |
Extended with call_category and PlanExecuteLoop conformance tests |
tests/unit/engine/test_cost_recording.py |
Tests for cost recording with category propagation |
tests/unit/budget/test_call_category.py |
Tests for call categorization enums |
tests/unit/budget/test_category_analytics.py |
Tests for category breakdown and orchestration ratio |
tests/unit/budget/test_coordination_metrics.py |
Tests for five coordination metrics |
tests/unit/budget/test_coordination_config.py |
Tests for coordination config models |
tests/unit/budget/test_cost_record.py |
Extended with call_category tests |
tests/unit/config/conftest.py |
Updated factory for coordination_metrics field |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Should terminate (either MAX_TURNS or COMPLETED for 1st step) | ||
| assert result.termination_reason in ( | ||
| TerminationReason.MAX_TURNS, | ||
| TerminationReason.COMPLETED, | ||
| ) |
There was a problem hiding this comment.
This test asserts result.termination_reason in (TerminationReason.MAX_TURNS, TerminationReason.COMPLETED), which makes the test non-deterministic in its expected outcome. With max_turns=2, the plan consumes 1 turn and step 1 consumes 1 turn, reaching the limit. Step 2 and 3 would not execute. The test should assert a specific outcome (likely MAX_TURNS since 3 steps can't complete in 2 turns, or COMPLETED if the first step completing is sufficient). This ambiguity weakens the test's ability to catch regressions.
| # Should terminate (either MAX_TURNS or COMPLETED for 1st step) | |
| assert result.termination_reason in ( | |
| TerminationReason.MAX_TURNS, | |
| TerminationReason.COMPLETED, | |
| ) | |
| # Should terminate because the maximum number of turns was reached | |
| assert result.termination_reason == TerminationReason.MAX_TURNS |
| async def _run_steps( # noqa: PLR0913 | ||
| self, | ||
| ctx: AgentContext, | ||
| provider: CompletionProvider, | ||
| executor_model: str, | ||
| config: CompletionConfig, | ||
| tool_defs: list[ToolDefinition] | None, | ||
| tool_invoker: ToolInvoker | None, | ||
| plan: ExecutionPlan, | ||
| turns: list[TurnRecord], | ||
| all_plans: list[ExecutionPlan], | ||
| replans_used: int, | ||
| planner_model: str, | ||
| budget_checker: BudgetChecker | None, | ||
| shutdown_checker: ShutdownChecker | None, | ||
| ) -> ExecutionResult: |
There was a problem hiding this comment.
_run_steps accepts 13 positional parameters (plus self), making it one of the most heavily parameterized methods in the codebase. While # noqa: PLR0913 suppresses the linting warning, this many parameters make the method harder to maintain and test. Consider grouping related parameters into a lightweight dataclass or named tuple (e.g., a _StepExecutionContext containing provider, executor_model, config, tool_defs, tool_invoker, budget_checker, shutdown_checker) to reduce the parameter count.
| # Verify different models were used (check recorded call count) | ||
| assert provider.call_count == 2 |
There was a problem hiding this comment.
The test_different_models_for_phases test only verifies provider.call_count == 2 but doesn't verify that different models were actually used for the planning vs execution phase. The MockCompletionProvider doesn't record the model parameter passed to each complete() call. This means the test doesn't actually validate the model tiering behavior it claims to test. Consider extending MockCompletionProvider to record models or adding a model-recording assertion here.
| # Verify different models were used (check recorded call count) | |
| assert provider.call_count == 2 | |
| # Verify provider was called once for planning and once for execution | |
| assert provider.call_count == 2 | |
| # Verify the configured planner and executor models were actually used | |
| assert getattr(provider, "called_models", None) == [ | |
| "test-planner-001", | |
| "test-executor-001", | |
| ] |
|
|
||
| import pytest | ||
|
|
||
| from ai_company.core.agent import AgentIdentity # noqa: TC001 |
There was a problem hiding this comment.
The AgentIdentity import on line 8 uses # noqa: TC001 to suppress the "move to TYPE_CHECKING" linting rule, which is the correct convention for test files where pytest evaluates annotations at runtime. However, AgentIdentity is only used as a type annotation for the sample_agent_with_personality fixture parameter on line 411 (test_max_turns_during_step). In tests that use sample_agent_context, the fixture already provides AgentContext. Consider whether this import is actually needed at runtime — if it's only used in type annotations that pytest doesn't evaluate (e.g., fixture parameter types), it could remain in TYPE_CHECKING.
There was a problem hiding this comment.
Actionable comments posted: 17
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/ai_company/engine/cost_recording.py (1)
68-75:⚠️ Potential issue | 🟠 MajorA single
identity.modelcannot describe every turn anymore.The new planner/executor overrides mean one
ExecutionResultcan contain calls from different models, but everyCostRecordhere is still stamped withidentity.model.provider/identity.model.model_id. That will mislabel per-call audit data as soon as a non-default planner or executor model is used. The actual model metadata needs to travel with each turn and be recorded here instead.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/cost_recording.py` around lines 68 - 75, CostRecord entries are being stamped with identity.model.provider and identity.model.model_id even though ExecutionResult turns may come from different models; update the record creation in CostRecord construction to use the model metadata attached to each turn (e.g., turn.model.provider and turn.model.model_id), falling back to identity.model.provider/identity.model.model_id only if the turn lacks its own model info, so each CostRecord accurately reflects the per-call model used by ExecutionResult/turn.src/ai_company/engine/react_loop.py (1)
230-249:⚠️ Potential issue | 🟠 MajorReturn a non-success termination reason for truncated completions.
FinishReason.MAX_TOKENSmeans the provider stopped because it ran out of generation budget, not because the task completed. ReturningTerminationReason.COMPLETEDhere hides truncation from callers and can suppress fallback or retry behavior.🛠️ Suggested fix
if response.finish_reason == FinishReason.MAX_TOKENS: logger.warning( EXECUTION_LOOP_TERMINATED, execution_id=ctx.execution_id, - reason=TerminationReason.COMPLETED.value, + reason=TerminationReason.MAX_TURNS.value, turns=len(turns), truncated=True, ) + return build_result( + ctx, + TerminationReason.MAX_TURNS, + turns, + ) else: logger.info( EXECUTION_LOOP_TERMINATED, execution_id=ctx.execution_id, reason=TerminationReason.COMPLETED.value, turns=len(turns), )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/react_loop.py` around lines 230 - 249, The code treats FinishReason.MAX_TOKENS as a successful completion; change it to return and log a non-success termination (e.g., TerminationReason.TRUNCATED or another appropriate non-success value) so callers can detect truncation and trigger fallback/retry. Specifically, in the branch checking response.finish_reason == FinishReason.MAX_TOKENS update the logger.warning call's reason field to the non-success TerminationReason and pass that same TerminationReason (instead of TerminationReason.COMPLETED) into build_result(ctx, ..., turns), keeping truncated=True and execution/context metadata (ctx.execution_id, turns) intact.DESIGN_SPEC.md (1)
1567-1575:⚠️ Potential issue | 🟡 MinorFix the orchestration-ratio formula in §10.5.
The implementation and surrounding prose use
(coordination + system) / total, but this sentence sayscoordination / total. That mismatch changes the meaning of the metric and the alert thresholds readers will infer from it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DESIGN_SPEC.md` around lines 1567 - 1575, The prose in §10.5 is inconsistent with the implementation: the orchestration ratio used in code is (coordination + system) / total but the text defines it as coordination / total. Update the prose and any alert guidance in §10.5 to define the orchestration ratio as (coordination + system) / total (and adjust explanatory text about thresholds/alerts accordingly), referencing the CostRecord categories (`coordination` and `system`) so readers understand the formula matches the implementation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/budget/category_analytics.py`:
- Around line 116-135: OrchestrationRatio is dropping the uncategorized_tokens
bucket from its public model so callers cannot tell if a zero ratio was due to
true zero overhead or uncategorized traffic; add an uncategorized_tokens: int =
Field(ge=0, description="Uncategorized tokens") attribute to the
OrchestrationRatio model (and mirror the same change in the similar
model/section referenced around lines 204-233) so the denominator and token
breakdown are exposed; update any constructor/factory that builds
OrchestrationRatio to populate this new field from the computed
uncategorized_tokens value.
- Around line 8-20: This module lacks the observability hook: add "from
ai_company.observability import get_logger" at the top and declare "logger =
get_logger(__name__)" so the file-level logger is available; then instrument the
existing business logic functions build_category_breakdown and
compute_orchestration_ratio to emit events using the domain event constants from
ai_company.observability.events (import the appropriate event names) rather than
raw strings, ensuring calls to logger.info/debug/error use those constants and
that logging occurs at entry/exit and on error paths; place the imports
alongside other top-level imports and add the logger declaration near them so
all functions in this file (including lines ~138-247) use the repo-wide
observability contract.
In `@src/ai_company/budget/coordination_metrics.py`:
- Around line 86-90: Update the error-rate field validators and the computation
to enforce a max of 1.0: change the Pydantic Field on error_rate_mas to include
le=1.0 (keep ge=0.0) and change error_rate_sas to use ge=0.0, le=1.0 (instead of
gt=0), and then in compute_error_amplification() defensively clamp both inputs
to min(value, 1.0) before doing any math so out-of-range inputs can't produce
nonsensical amplification ratios; reference error_rate_mas, error_rate_sas, and
compute_error_amplification when making the changes.
- Around line 1-12: Add logging to this business-logic module by importing
get_logger and initializing logger = get_logger(__name__), then update every
invalid-input branch in the coordination metric helpers (functions efficiency,
overhead, error_amplification, message_density, redundancy_rate) to call
logger.warning or logger.error with contextual details (input values and reason)
immediately before raising ValueError; ensure the import line "from
ai_company.observability import get_logger" is added and that each bad-input
path logs at WARNING/ERROR level with a clear message including the function
name and offending parameters.
In `@src/ai_company/budget/tracker.py`:
- Around line 243-244: The public query entry points that accept agent_id and
task_id currently allow blank/whitespace strings which then propagate into
_filter_records; change the parameter types from str | None to
core.types.NotBlankStr | None for all identifier/name parameters (e.g., the
methods taking agent_id: str | None, task_id: str | None) and add an upfront
normalization/validation step that rejects or normalizes purely-blank values
(trim and treat "" as None or raise) before calling _filter_records; apply the
same change to the other affected signatures/paths mentioned (the other method
overloads that accept task_id/agent_id) so blank-only inputs are validated at
the system boundary.
In `@src/ai_company/engine/cost_recording.py`:
- Around line 76-78: The code is currently persisting turns even when
turn.call_category is None; modify the recording path in cost_recording.py so
that before creating/saving the record (the block that sets
call_category=turn.call_category and timestamp=datetime.now(UTC)) you check if
turn.call_category is None and handle it instead of writing: either log an
error/warning with context (e.g., include turn id or session) and skip
recording, or raise an exception to fail fast; update the surrounding function
that constructs the record to early-return/raise when call_category is missing
so no uncategorized cost rows are persisted.
In `@src/ai_company/engine/loop_helpers.py`:
- Around line 258-274: In the branch where tool_invoker is None (the if
tool_invoker is None check), clear the recorded tool calls before logging the
error and returning so the turn history and ExecutionResult.total_tool_calls
aren’t inflated; specifically, empty response.tool_calls (or call
tool_calls_made.clear() if that list variable is used elsewhere) immediately
before the logger.error(...) / build_result(...) sequence that returns
TerminationReason.ERROR. Ensure the clearing happens before the error log and
return so no stale tool call entries remain.
In `@src/ai_company/engine/plan_execute_loop.py`:
- Line 531: The TurnRecord created by make_turn_record is missing call_category,
so planner/replanner turns are uncategorized; update the make_turn_record calls
that append to turns (the ones in the planner/replanner code path and the
per-step executor path) to pass call_category="SYSTEM" for planning/replanning
turns and call_category="PRODUCTIVE" for per-step execution turns so TurnRecord
instances are correctly categorized for orchestration breakdowns.
- Around line 274-279: The code updates an immutable ExecutionPlan via
_update_step_status but never writes the replacement back into all_plans, so
later serialization/_finalize and _handle_step_failure operate on stale objects;
update every site that calls _update_step_status (e.g., the block guarded by
step_ok and the other uses around the ranges that call _update_step_status and
_handle_step_failure) to assign the returned ExecutionPlan into the original
all_plans entry (use the same index/key you used when appending the original
plan), and ensure any local variable `plan` is also replaced with that returned
object before proceeding to _finalize() or returning an ExecutionResult so all
subsequent logic operates on the new immutable instance.
- Around line 667-676: Replace using response.finish_reason as the success check
with an explicit step result/status from the LLM response (e.g.,
response.step_result or response.success) so that "STOP" vs "MAX_TOKENS" are
treated only as generation end signals and not as step success; keep the
existing logger.warning(EXECUTION_PLAN_STEP_COMPLETE, ..., truncated=True) when
response.finish_reason == FinishReason.MAX_TOKENS but determine success by
calling/inspecting the explicit field used by self._assess_step_success (or by
adapting _assess_step_success to accept the explicit step result), and apply the
same change to the other identical block around lines 870-882.
- Around line 531-535: The code calls response_to_message(response)
unconditionally which raises when CompletionResponse finished with finish_reason
'error' or 'content_filter' (no content/tool calls); change the flow in the
block that uses make_turn_record and ctx.with_turn_completed to first detect
when response.finish_reason is 'error' or 'content_filter' or when response has
no content and no tool calls, and in that case: (1) avoid calling
response_to_message(), (2) create and return a clean ExecutionResult
representing planner failure (including response.finish_reason and any
error/details from CompletionResponse), and (3) still call
ctx.with_turn_completed(response.usage, None) or otherwise mark the turn
completed without an assistant ChatMessage; keep references to make_turn_record,
ctx.with_turn_completed, response_to_message, CompletionResponse, and
ExecutionResult so the fix is easy to locate.
- Around line 105-913: The module and several long methods violate
size/complexity rules: split parsing and replanning orchestration into separate
helper modules/classes so execute() and _run_steps() fall under 50 lines and the
file is <800 lines. Extract the plan parsing logic (_parse_plan,
_parse_json_plan, _parse_text_plan, _data_to_plan, _extract_task_summary) into a
new PlanParser class/module and replace calls to self._parse_* with a PlanParser
instance; likewise extract planner orchestration and replan creation
(_generate_plan, _call_planner, _replan, _handle_step_failure) into a
PlannerOrchestrator helper that exposes methods used by execute()/_run_steps;
update PlanExecuteLoop to delegate to these helpers so its execute() and
_run_steps() only orchestrate high-level control flow.
In `@src/ai_company/engine/react_loop.py`:
- Around line 24-35: make_turn_record is currently called without a category in
react_loop.py so ReAct turns are recorded as uncategorized; update the calls to
make_turn_record in this module (both ReAct branch and the other occurrence) to
pass an explicit call_category (e.g., call_category="productive" or "react") so
ReAct turns are counted in the per-category breakdowns.
In `@tests/unit/engine/test_cost_recording.py`:
- Around line 141-155: Add a third case for the COORDINATION category to the
call-category propagation test by converting test_call_category_propagated into
a parametrized test (use pytest.mark.parametrize) over call_category values
including LLMCallCategory.PRODUCTIVE, LLMCallCategory.SYSTEM, and
LLMCallCategory.COORDINATION; for each param, construct the turn via
_turn(call_category=...) and run record_execution_costs with the same
_FakeTracker, then assert the tracker.records entry preserves the passed
call_category (ensure you reference the existing test_call_category_propagated,
record_execution_costs, _turn, and _FakeTracker symbols).
In `@tests/unit/engine/test_plan_execute_loop.py`:
- Around line 470-472: The test currently only checks provider.call_count == 2
which doesn't guarantee different model IDs were used; instead inspect the
provider's recorded calls (e.g., provider.recorded_calls or provider.calls) and
assert that the first call (planner invocation) used model id "test-planner-001"
and the second call (executor invocation) used model id "test-executor-001",
while keeping the existing assertion that result.termination_reason ==
TerminationReason.COMPLETED; update the assertions to check the request payloads
or model_id fields of the two recorded provider calls rather than just the call
count.
- Around line 435-439: The test currently allows both
TerminationReason.MAX_TURNS and TerminationReason.COMPLETED for
result.termination_reason, which masks a false positive when running with
max_turns=2 against a 3-step plan; change the assertion to only accept
TerminationReason.MAX_TURNS (e.g., assert result.termination_reason ==
TerminationReason.MAX_TURNS) so the test correctly verifies max-turn termination
for the given setup and keep references to result.termination_reason,
TerminationReason.MAX_TURNS, TerminationReason.COMPLETED and the max_turns=2 /
3-step plan in your reasoning while editing.
- Around line 267-297: The test uses FinishReason.ERROR which is intercepted by
check_response_errors() before reaching _handle_step_failure(), so max_replans=0
never gets exercised; change the mocked provider sequence so the second response
is a non-error CompletionResponse (e.g., FinishReason.FINISHED) whose content
decodes into a step result that causes _handle_step_failure() to run (i.e., a
successful LLM completion that represents a failed step execution), then keep
PlanExecuteLoop(PlanExecuteConfig(max_replans=0)) and assert the
exhausted-replan behavior (TerminationReason.ERROR and loop_type) to exercise
the max-replans branch.
---
Outside diff comments:
In `@DESIGN_SPEC.md`:
- Around line 1567-1575: The prose in §10.5 is inconsistent with the
implementation: the orchestration ratio used in code is (coordination + system)
/ total but the text defines it as coordination / total. Update the prose and
any alert guidance in §10.5 to define the orchestration ratio as (coordination +
system) / total (and adjust explanatory text about thresholds/alerts
accordingly), referencing the CostRecord categories (`coordination` and
`system`) so readers understand the formula matches the implementation.
In `@src/ai_company/engine/cost_recording.py`:
- Around line 68-75: CostRecord entries are being stamped with
identity.model.provider and identity.model.model_id even though ExecutionResult
turns may come from different models; update the record creation in CostRecord
construction to use the model metadata attached to each turn (e.g.,
turn.model.provider and turn.model.model_id), falling back to
identity.model.provider/identity.model.model_id only if the turn lacks its own
model info, so each CostRecord accurately reflects the per-call model used by
ExecutionResult/turn.
In `@src/ai_company/engine/react_loop.py`:
- Around line 230-249: The code treats FinishReason.MAX_TOKENS as a successful
completion; change it to return and log a non-success termination (e.g.,
TerminationReason.TRUNCATED or another appropriate non-success value) so callers
can detect truncation and trigger fallback/retry. Specifically, in the branch
checking response.finish_reason == FinishReason.MAX_TOKENS update the
logger.warning call's reason field to the non-success TerminationReason and pass
that same TerminationReason (instead of TerminationReason.COMPLETED) into
build_result(ctx, ..., turns), keeping truncated=True and execution/context
metadata (ctx.execution_id, turns) intact.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 23586307-b157-457d-952a-2634bdb9f6ca
📒 Files selected for processing (30)
DESIGN_SPEC.mdsrc/ai_company/budget/__init__.pysrc/ai_company/budget/call_category.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/budget/coordination_config.pysrc/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/cost_record.pysrc/ai_company/budget/tracker.pysrc/ai_company/config/defaults.pysrc/ai_company/config/schema.pysrc/ai_company/engine/__init__.pysrc/ai_company/engine/cost_recording.pysrc/ai_company/engine/loop_helpers.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/plan_models.pysrc/ai_company/engine/react_loop.pysrc/ai_company/observability/events/budget.pysrc/ai_company/observability/events/execution.pytests/unit/budget/test_call_category.pytests/unit/budget/test_category_analytics.pytests/unit/budget/test_coordination_config.pytests/unit/budget/test_coordination_metrics.pytests/unit/budget/test_cost_record.pytests/unit/config/conftest.pytests/unit/engine/test_cost_recording.pytests/unit/engine/test_loop_helpers.pytests/unit/engine/test_loop_protocol.pytests/unit/engine/test_plan_execute_loop.pytests/unit/engine/test_plan_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Nofrom __future__ import annotations— Python 3.14 has PEP 649
Useexcept A, B:syntax (no parentheses) for exception handling — ruff enforces this on Python 3.14
Include type hints on all public functions; enforce with mypy strict mode
Use Google-style docstrings on all public classes and functions — enforced by ruff D rules
Use immutability: create new objects, never mutate existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction +MappingProxyTypewrapping. Fordict/listfields in frozen Pydantic models, rely onfrozen=Trueandcopy.deepcopy()at system boundaries
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Use Pydantic v2 (BaseModel,model_validator,computed_field,ConfigDict). Use@computed_fieldfor derived values; useNotBlankStrfromcore.typesfor all identifier/name fields
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code; prefer structured concurrency over barecreate_task
Maintain line length of 88 characters — enforced by ruff
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate input at system boundaries: user input, external APIs, and config files
Files:
src/ai_company/engine/cost_recording.pysrc/ai_company/engine/__init__.pytests/unit/budget/test_cost_record.pysrc/ai_company/budget/__init__.pysrc/ai_company/observability/events/budget.pytests/unit/config/conftest.pysrc/ai_company/observability/events/execution.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/budget/call_category.pytests/unit/budget/test_call_category.pytests/unit/engine/test_loop_protocol.pysrc/ai_company/engine/plan_models.pysrc/ai_company/budget/coordination_config.pytests/unit/budget/test_coordination_metrics.pysrc/ai_company/config/defaults.pysrc/ai_company/engine/react_loop.pytests/unit/engine/test_plan_execute_loop.pytests/unit/engine/test_plan_models.pytests/unit/engine/test_loop_helpers.pysrc/ai_company/engine/loop_protocol.pytests/unit/budget/test_category_analytics.pysrc/ai_company/config/schema.pytests/unit/engine/test_cost_recording.pytests/unit/budget/test_coordination_config.pysrc/ai_company/engine/loop_helpers.pysrc/ai_company/budget/cost_record.pysrc/ai_company/budget/tracker.pysrc/ai_company/budget/coordination_metrics.pysrc/ai_company/engine/plan_execute_loop.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/ai_company/**/*.py: Every module with business logic MUST have:from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Never useimport logging/logging.getLogger()/print()in application code — useget_logger()from observability module
Use variable namelogger(not_logger, notlog) for logging instances
Always use event name constants fromai_company.observability.eventsdomain-specific modules; import directly (e.g.,from ai_company.observability.events.<domain> import EVENT_CONSTANT)
Use structured logging format:logger.info(EVENT, key=value)— neverlogger.info("msg %s", val)
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level for object creation, internal flow, and entry/exit of key functions
Pure data models, enums, and re-exports do NOT need logging
Files:
src/ai_company/engine/cost_recording.pysrc/ai_company/engine/__init__.pysrc/ai_company/budget/__init__.pysrc/ai_company/observability/events/budget.pysrc/ai_company/observability/events/execution.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/budget/call_category.pysrc/ai_company/engine/plan_models.pysrc/ai_company/budget/coordination_config.pysrc/ai_company/config/defaults.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/config/schema.pysrc/ai_company/engine/loop_helpers.pysrc/ai_company/budget/cost_record.pysrc/ai_company/budget/tracker.pysrc/ai_company/budget/coordination_metrics.pysrc/ai_company/engine/plan_execute_loop.py
src/ai_company/{providers,engine}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
RetryExhaustedErrorsignals that all retries failed — the engine layer catches this to trigger fallback chains
Files:
src/ai_company/engine/cost_recording.pysrc/ai_company/engine/__init__.pysrc/ai_company/engine/plan_models.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/engine/loop_helpers.pysrc/ai_company/engine/plan_execute_loop.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples — use generic names like
example-provider,example-large-001,large/medium/smallas aliases, ortest-provider,test-small-001in tests
Files:
src/ai_company/engine/cost_recording.pysrc/ai_company/engine/__init__.pysrc/ai_company/budget/__init__.pysrc/ai_company/observability/events/budget.pysrc/ai_company/observability/events/execution.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/budget/call_category.pysrc/ai_company/engine/plan_models.pysrc/ai_company/budget/coordination_config.pysrc/ai_company/config/defaults.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/config/schema.pysrc/ai_company/engine/loop_helpers.pysrc/ai_company/budget/cost_record.pysrc/ai_company/budget/tracker.pysrc/ai_company/budget/coordination_metrics.pysrc/ai_company/engine/plan_execute_loop.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Maintain 80% minimum code coverage — enforced in CI
Useasyncio_mode = "auto"in pytest configuration — no manual@pytest.mark.asyncioneeded
Set 30-second timeout per test
Usepytest-xdistvia-n autofor parallel test execution
Prefer@pytest.mark.parametrizefor testing similar cases
Files:
tests/unit/budget/test_cost_record.pytests/unit/config/conftest.pytests/unit/budget/test_call_category.pytests/unit/engine/test_loop_protocol.pytests/unit/budget/test_coordination_metrics.pytests/unit/engine/test_plan_execute_loop.pytests/unit/engine/test_plan_models.pytests/unit/engine/test_loop_helpers.pytests/unit/budget/test_category_analytics.pytests/unit/engine/test_cost_recording.pytests/unit/budget/test_coordination_config.py
🧠 Learnings (5)
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from `ai_company.observability.events` domain-specific modules; import directly (e.g., `from ai_company.observability.events.<domain> import EVENT_CONSTANT`)
Applied to files:
src/ai_company/observability/events/budget.pysrc/ai_company/observability/events/execution.pysrc/ai_company/budget/tracker.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Applied to files:
tests/unit/engine/test_plan_models.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Use Pydantic v2 (`BaseModel`, `model_validator`, `computed_field`, `ConfigDict`). Use `computed_field` for derived values; use `NotBlankStr` from `core.types` for all identifier/name fields
Applied to files:
src/ai_company/config/schema.pysrc/ai_company/budget/cost_record.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
src/ai_company/budget/tracker.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Never use `import logging` / `logging.getLogger()` / `print()` in application code — use `get_logger()` from observability module
Applied to files:
src/ai_company/budget/tracker.py
🧬 Code graph analysis (19)
src/ai_company/engine/__init__.py (2)
src/ai_company/engine/plan_execute_loop.py (1)
PlanExecuteLoop(105-913)src/ai_company/engine/plan_models.py (4)
ExecutionPlan(53-88)PlanExecuteConfig(91-118)PlanStep(25-50)StepStatus(15-22)
tests/unit/budget/test_cost_record.py (3)
src/ai_company/budget/call_category.py (1)
LLMCallCategory(11-25)tests/unit/engine/test_loop_protocol.py (4)
test_call_category_none_default(95-103)test_call_category_productive(105-114)test_call_category_coordination(116-125)test_call_category_system(127-136)src/ai_company/budget/cost_record.py (1)
CostRecord(15-56)
src/ai_company/budget/__init__.py (4)
src/ai_company/budget/call_category.py (2)
LLMCallCategory(11-25)OrchestrationAlertLevel(28-45)src/ai_company/budget/category_analytics.py (2)
CategoryBreakdown(28-107)OrchestrationRatio(110-135)src/ai_company/budget/coordination_config.py (5)
CoordinationMetricName(13-20)CoordinationMetricsConfig(95-129)ErrorCategory(23-29)ErrorTaxonomyConfig(32-49)OrchestrationAlertThresholds(52-92)src/ai_company/budget/coordination_metrics.py (6)
CoordinationEfficiency(17-45)CoordinationMetrics(155-189)CoordinationOverhead(48-70)ErrorAmplification(73-98)MessageDensity(101-129)RedundancyRate(132-152)
tests/unit/config/conftest.py (1)
src/ai_company/budget/coordination_config.py (1)
CoordinationMetricsConfig(95-129)
tests/unit/budget/test_call_category.py (1)
src/ai_company/budget/call_category.py (2)
LLMCallCategory(11-25)OrchestrationAlertLevel(28-45)
tests/unit/engine/test_loop_protocol.py (6)
src/ai_company/budget/call_category.py (1)
LLMCallCategory(11-25)src/ai_company/engine/context.py (1)
AgentContext(87-307)src/ai_company/engine/loop_protocol.py (3)
ExecutionLoop(139-177)ExecutionResult(77-128)get_loop_type(175-177)src/ai_company/engine/plan_execute_loop.py (2)
PlanExecuteLoop(105-913)get_loop_type(115-117)src/ai_company/providers/enums.py (1)
FinishReason(15-22)src/ai_company/engine/react_loop.py (1)
get_loop_type(62-64)
tests/unit/budget/test_coordination_metrics.py (1)
src/ai_company/budget/coordination_metrics.py (15)
CoordinationEfficiency(17-45)CoordinationMetrics(155-189)CoordinationOverhead(48-70)ErrorAmplification(73-98)MessageDensity(101-129)RedundancyRate(132-152)compute_efficiency(195-221)compute_error_amplification(250-273)compute_message_density(276-299)compute_overhead(224-247)compute_redundancy_rate(302-329)value(43-45)value(96-98)value(127-129)value_percent(68-70)
src/ai_company/engine/react_loop.py (1)
src/ai_company/engine/loop_helpers.py (10)
build_result(369-384)call_provider(138-195)check_budget(89-135)check_response_errors(198-232)check_shutdown(44-86)clear_last_turn_tool_calls(318-329)execute_tool_calls(235-315)get_tool_definitions(332-339)make_turn_record(351-366)response_to_message(342-348)
tests/unit/engine/test_plan_execute_loop.py (3)
src/ai_company/engine/loop_protocol.py (4)
TerminationReason(27-34)execute(147-173)ExecutionLoop(139-177)get_loop_type(175-177)src/ai_company/engine/plan_execute_loop.py (3)
PlanExecuteLoop(105-913)execute(119-197)get_loop_type(115-117)src/ai_company/engine/plan_models.py (1)
PlanExecuteConfig(91-118)
tests/unit/engine/test_plan_models.py (1)
src/ai_company/engine/plan_models.py (4)
ExecutionPlan(53-88)PlanExecuteConfig(91-118)PlanStep(25-50)StepStatus(15-22)
tests/unit/engine/test_loop_helpers.py (1)
src/ai_company/engine/loop_helpers.py (9)
build_result(369-384)call_provider(138-195)check_budget(89-135)check_response_errors(198-232)check_shutdown(44-86)execute_tool_calls(235-315)get_tool_definitions(332-339)make_turn_record(351-366)response_to_message(342-348)
src/ai_company/engine/loop_protocol.py (2)
src/ai_company/budget/call_category.py (1)
LLMCallCategory(11-25)src/ai_company/tools/base.py (1)
description(115-117)
tests/unit/budget/test_category_analytics.py (5)
src/ai_company/budget/call_category.py (2)
LLMCallCategory(11-25)OrchestrationAlertLevel(28-45)src/ai_company/budget/category_analytics.py (3)
CategoryBreakdown(28-107)build_category_breakdown(138-183)compute_orchestration_ratio(186-233)src/ai_company/budget/coordination_config.py (1)
OrchestrationAlertThresholds(52-92)src/ai_company/budget/cost_record.py (1)
CostRecord(15-56)src/ai_company/budget/tracker.py (2)
CostTracker(67-412)record(98-111)
src/ai_company/config/schema.py (1)
src/ai_company/budget/coordination_config.py (1)
CoordinationMetricsConfig(95-129)
tests/unit/budget/test_coordination_config.py (1)
src/ai_company/budget/coordination_config.py (5)
CoordinationMetricName(13-20)CoordinationMetricsConfig(95-129)ErrorCategory(23-29)ErrorTaxonomyConfig(32-49)OrchestrationAlertThresholds(52-92)
src/ai_company/engine/loop_helpers.py (5)
src/ai_company/providers/enums.py (2)
FinishReason(15-22)MessageRole(6-12)src/ai_company/providers/models.py (5)
ChatMessage(138-210)CompletionConfig(213-254)CompletionResponse(257-306)ToolDefinition(64-93)add_token_usage(46-61)src/ai_company/engine/loop_protocol.py (3)
ExecutionResult(77-128)TerminationReason(27-34)TurnRecord(37-74)src/ai_company/engine/context.py (2)
AgentContext(87-307)with_message(173-182)src/ai_company/providers/protocol.py (1)
CompletionProvider(21-80)
src/ai_company/budget/cost_record.py (1)
src/ai_company/budget/call_category.py (1)
LLMCallCategory(11-25)
src/ai_company/budget/tracker.py (4)
src/ai_company/budget/call_category.py (1)
OrchestrationAlertLevel(28-45)src/ai_company/budget/category_analytics.py (4)
CategoryBreakdown(28-107)OrchestrationRatio(110-135)build_category_breakdown(138-183)compute_orchestration_ratio(186-233)src/ai_company/budget/coordination_config.py (1)
OrchestrationAlertThresholds(52-92)src/ai_company/budget/cost_record.py (1)
CostRecord(15-56)
src/ai_company/engine/plan_execute_loop.py (6)
src/ai_company/providers/models.py (3)
ChatMessage(138-210)CompletionConfig(213-254)CompletionResponse(257-306)src/ai_company/engine/loop_protocol.py (4)
ExecutionResult(77-128)TerminationReason(27-34)TurnRecord(37-74)execute(147-173)src/ai_company/engine/plan_models.py (3)
ExecutionPlan(53-88)PlanExecuteConfig(91-118)PlanStep(25-50)src/ai_company/engine/context.py (4)
AgentContext(87-307)has_turns_remaining(305-307)with_message(173-182)with_turn_completed(184-233)src/ai_company/providers/protocol.py (1)
CompletionProvider(21-80)src/ai_company/tools/invoker.py (1)
ToolInvoker(61-489)
🔇 Additional comments (3)
tests/unit/config/conftest.py (1)
9-9: Good fixture alignment.Adding
coordination_metricshere keeps the sharedRootConfigFactoryrepresentative of the new root config surface without extra per-test setup.Also applies to: 73-73
tests/unit/engine/test_loop_protocol.py (1)
6-6: Nice coverage of the new loop protocol surface.These additions cover both the new
TurnRecord.call_categoryfield andPlanExecuteLoop'sExecutionLoopcontract.Also applies to: 14-15, 95-136, 294-300
tests/unit/engine/test_plan_models.py (1)
14-314: Good coverage of the immutable plan models.Creation, validation, frozen behavior, copy-on-write updates, and JSON round-trip are all exercised here, which should keep the new plan model contracts stable.
Based on learnings, "Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves".
| import math | ||
| from typing import TYPE_CHECKING | ||
|
|
||
| from pydantic import BaseModel, ConfigDict, Field | ||
|
|
||
| from ai_company.budget.call_category import ( | ||
| LLMCallCategory, | ||
| OrchestrationAlertLevel, | ||
| ) | ||
| from ai_company.budget.coordination_config import ( | ||
| OrchestrationAlertThresholds, | ||
| ) | ||
| from ai_company.constants import BUDGET_ROUNDING_PRECISION |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add the required observability hook to this analytics module.
build_category_breakdown() and compute_orchestration_ratio() are budget business logic, but this file never declares logger = get_logger(__name__). That leaves direct callers of these helpers invisible in logs and breaks the repo-wide observability contract.
As per coding guidelines src/ai_company/**/*.py: Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__); always use event name constants from ai_company.observability.events domain-specific modules.
Also applies to: 138-247
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/budget/category_analytics.py` around lines 8 - 20, This module
lacks the observability hook: add "from ai_company.observability import
get_logger" at the top and declare "logger = get_logger(__name__)" so the
file-level logger is available; then instrument the existing business logic
functions build_category_breakdown and compute_orchestration_ratio to emit
events using the domain event constants from ai_company.observability.events
(import the appropriate event names) rather than raw strings, ensuring calls to
logger.info/debug/error use those constants and that logging occurs at
entry/exit and on error paths; place the imports alongside other top-level
imports and add the logger declaration near them so all functions in this file
(including lines ~138-247) use the repo-wide observability contract.
| Attributes: | ||
| ratio: Orchestration ratio (0.0-1.0). | ||
| alert_level: Alert level based on ratio thresholds. | ||
| total_tokens: Total tokens across all categories (includes | ||
| uncategorized tokens in the denominator). | ||
| productive_tokens: Productive category tokens. | ||
| coordination_tokens: Coordination category tokens. | ||
| system_tokens: System category tokens. | ||
| """ | ||
|
|
||
| model_config = ConfigDict(frozen=True) | ||
|
|
||
| ratio: float = Field(ge=0.0, le=1.0, description="Orchestration ratio") | ||
| alert_level: OrchestrationAlertLevel = Field( | ||
| description="Alert level for orchestration overhead", | ||
| ) | ||
| total_tokens: int = Field(ge=0, description="Total tokens") | ||
| productive_tokens: int = Field(ge=0, description="Productive tokens") | ||
| coordination_tokens: int = Field(ge=0, description="Coordination tokens") | ||
| system_tokens: int = Field(ge=0, description="System tokens") |
There was a problem hiding this comment.
Carry uncategorized tokens through OrchestrationRatio.
The denominator includes uncategorized_tokens, but the returned model drops that bucket. A caller can therefore see ratio=0.0 with total_tokens>0 and no way to tell whether overhead was truly zero or whether the traffic was simply uncategorized.
Possible fix
class OrchestrationRatio(BaseModel):
@@
total_tokens: int = Field(ge=0, description="Total tokens")
productive_tokens: int = Field(ge=0, description="Productive tokens")
coordination_tokens: int = Field(ge=0, description="Coordination tokens")
system_tokens: int = Field(ge=0, description="System tokens")
+ uncategorized_tokens: int = Field(
+ ge=0,
+ description="Uncategorized tokens",
+ )
@@
return OrchestrationRatio(
ratio=round(ratio, BUDGET_ROUNDING_PRECISION),
alert_level=alert,
total_tokens=total,
productive_tokens=breakdown.productive_tokens,
coordination_tokens=breakdown.coordination_tokens,
system_tokens=breakdown.system_tokens,
+ uncategorized_tokens=breakdown.uncategorized_tokens,
)Also applies to: 204-233
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/budget/category_analytics.py` around lines 116 - 135,
OrchestrationRatio is dropping the uncategorized_tokens bucket from its public
model so callers cannot tell if a zero ratio was due to true zero overhead or
uncategorized traffic; add an uncategorized_tokens: int = Field(ge=0,
description="Uncategorized tokens") attribute to the OrchestrationRatio model
(and mirror the same change in the similar model/section referenced around lines
204-233) so the denominator and token breakdown are exposed; update any
constructor/factory that builds OrchestrationRatio to populate this new field
from the computed uncategorized_tokens value.
| """Coordination metrics for multi-agent system tuning. | ||
|
|
||
| Pure computation functions for five coordination metrics defined in | ||
| DESIGN_SPEC (Coordination Metrics): efficiency, overhead, error | ||
| amplification, message density, and redundancy rate. | ||
| """ | ||
|
|
||
| import statistics | ||
| from typing import TYPE_CHECKING | ||
|
|
||
| from pydantic import BaseModel, ConfigDict, Field, computed_field | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Add the required logger and log-before-raise on these helpers.
This new module contains business logic, but it never defines logger, and every invalid-input branch raises ValueError without emitting warning/error context first. Import get_logger, initialize logger, and log each bad-input path before raising. As per coding guidelines, "Every module with business logic MUST have: from ai_company.observability import get_logger then logger = get_logger(__name__)" and "All error paths must log at WARNING or ERROR with context before raising."
Also applies to: 214-216, 241-243, 267-269, 293-295, 318-324
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/budget/coordination_metrics.py` around lines 1 - 12, Add
logging to this business-logic module by importing get_logger and initializing
logger = get_logger(__name__), then update every invalid-input branch in the
coordination metric helpers (functions efficiency, overhead,
error_amplification, message_density, redundancy_rate) to call logger.warning or
logger.error with contextual details (input values and reason) immediately
before raising ValueError; ensure the import line "from ai_company.observability
import get_logger" is added and that each bad-input path logs at WARNING/ERROR
level with a clear message including the function name and offending parameters.
| error_rate_mas: float = Field( | ||
| ge=0.0, | ||
| description="Multi-agent error rate", | ||
| ) | ||
| error_rate_sas: float = Field(gt=0, description="Single-agent error rate") |
There was a problem hiding this comment.
Cap error-rate inputs at 1.0.
Both fields are documented as rates, but values above 1.0 currently validate and compute_error_amplification() forwards them straight through. That produces meaningless amplification ratios instead of rejecting bad inputs.
Suggested fix
error_rate_mas: float = Field(
ge=0.0,
+ le=1.0,
description="Multi-agent error rate",
)
- error_rate_sas: float = Field(gt=0, description="Single-agent error rate")
+ error_rate_sas: float = Field(
+ gt=0,
+ le=1.0,
+ description="Single-agent error rate",
+ )Also applies to: 250-273
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/budget/coordination_metrics.py` around lines 86 - 90, Update
the error-rate field validators and the computation to enforce a max of 1.0:
change the Pydantic Field on error_rate_mas to include le=1.0 (keep ge=0.0) and
change error_rate_sas to use ge=0.0, le=1.0 (instead of gt=0), and then in
compute_error_amplification() defensively clamp both inputs to min(value, 1.0)
before doing any math so out-of-range inputs can't produce nonsensical
amplification ratios; reference error_rate_mas, error_rate_sas, and
compute_error_amplification when making the changes.
| agent_id: str | None = None, | ||
| task_id: str | None = None, |
There was a problem hiding this comment.
Reject blank task_id filters before querying.
The new task_id path flows straight into _filter_records(), so " " is treated as a real identifier and quietly returns an empty breakdown/ratio instead of surfacing bad input. Validate or normalize blank values up front on these public query methods.
As per coding guidelines **/*.py: Use NotBlankStr from core.types for all identifier/name fields, and validate input at system boundaries.
Also applies to: 283-284, 437-449
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/budget/tracker.py` around lines 243 - 244, The public query
entry points that accept agent_id and task_id currently allow blank/whitespace
strings which then propagate into _filter_records; change the parameter types
from str | None to core.types.NotBlankStr | None for all identifier/name
parameters (e.g., the methods taking agent_id: str | None, task_id: str | None)
and add an upfront normalization/validation step that rejects or normalizes
purely-blank values (trim and treat "" as None or raise) before calling
_filter_records; apply the same change to the other affected signatures/paths
mentioned (the other method overloads that accept task_id/agent_id) so
blank-only inputs are validated at the system boundary.
| async def test_call_category_propagated(self) -> None: | ||
| turns = ( | ||
| _turn(call_category=LLMCallCategory.PRODUCTIVE), | ||
| _turn(turn_number=2, call_category=LLMCallCategory.SYSTEM), | ||
| ) | ||
| tracker = _FakeTracker() | ||
| await record_execution_costs( | ||
| _result(turns), | ||
| _identity(), | ||
| "agent-1", | ||
| "task-1", | ||
| tracker=tracker, # type: ignore[arg-type] | ||
| ) | ||
| assert tracker.records[0].call_category == LLMCallCategory.PRODUCTIVE | ||
| assert tracker.records[1].call_category == LLMCallCategory.SYSTEM |
There was a problem hiding this comment.
Cover the coordination category in the propagation test.
This is the only end-to-end assertion in this suite for call_category persistence, and it skips LLMCallCategory.COORDINATION, which is the category the new orchestration metrics depend on most.
♻️ Suggested test shape
- async def test_call_category_propagated(self) -> None:
- turns = (
- _turn(call_category=LLMCallCategory.PRODUCTIVE),
- _turn(turn_number=2, call_category=LLMCallCategory.SYSTEM),
- )
+ `@pytest.mark.parametrize`(
+ "category",
+ (
+ LLMCallCategory.PRODUCTIVE,
+ LLMCallCategory.COORDINATION,
+ LLMCallCategory.SYSTEM,
+ ),
+ )
+ async def test_call_category_propagated(
+ self,
+ category: LLMCallCategory,
+ ) -> None:
+ turns = (_turn(call_category=category),)
tracker = _FakeTracker()
await record_execution_costs(
_result(turns),
_identity(),
"agent-1",
"task-1",
tracker=tracker, # type: ignore[arg-type]
)
- assert tracker.records[0].call_category == LLMCallCategory.PRODUCTIVE
- assert tracker.records[1].call_category == LLMCallCategory.SYSTEM
+ assert tracker.records[0].call_category == categoryAs per coding guidelines, "Prefer @pytest.mark.parametrize for testing similar cases".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def test_call_category_propagated(self) -> None: | |
| turns = ( | |
| _turn(call_category=LLMCallCategory.PRODUCTIVE), | |
| _turn(turn_number=2, call_category=LLMCallCategory.SYSTEM), | |
| ) | |
| tracker = _FakeTracker() | |
| await record_execution_costs( | |
| _result(turns), | |
| _identity(), | |
| "agent-1", | |
| "task-1", | |
| tracker=tracker, # type: ignore[arg-type] | |
| ) | |
| assert tracker.records[0].call_category == LLMCallCategory.PRODUCTIVE | |
| assert tracker.records[1].call_category == LLMCallCategory.SYSTEM | |
| `@pytest.mark.parametrize`( | |
| "category", | |
| ( | |
| LLMCallCategory.PRODUCTIVE, | |
| LLMCallCategory.COORDINATION, | |
| LLMCallCategory.SYSTEM, | |
| ), | |
| ) | |
| async def test_call_category_propagated( | |
| self, | |
| category: LLMCallCategory, | |
| ) -> None: | |
| turns = (_turn(call_category=category),) | |
| tracker = _FakeTracker() | |
| await record_execution_costs( | |
| _result(turns), | |
| _identity(), | |
| "agent-1", | |
| "task-1", | |
| tracker=tracker, # type: ignore[arg-type] | |
| ) | |
| assert tracker.records[0].call_category == category |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/engine/test_cost_recording.py` around lines 141 - 155, Add a third
case for the COORDINATION category to the call-category propagation test by
converting test_call_category_propagated into a parametrized test (use
pytest.mark.parametrize) over call_category values including
LLMCallCategory.PRODUCTIVE, LLMCallCategory.SYSTEM, and
LLMCallCategory.COORDINATION; for each param, construct the turn via
_turn(call_category=...) and run record_execution_costs with the same
_FakeTracker, then assert the tracker.records entry preserves the passed
call_category (ensure you reference the existing test_call_category_propagated,
record_execution_costs, _turn, and _FakeTracker symbols).
…, and Greptile Source fixes: - Add check_response_errors after planner LLM call (CRITICAL) - Add shutdown/budget checks before replan LLM call (CRITICAL) - Extract plan_parsing.py from plan_execute_loop.py (<800 lines) - Decompose _run_steps via _attempt_replan extraction - Tag planning turns as SYSTEM, step turns as PRODUCTIVE - Add XML delimiters for LLM-generated content in step instructions - Set StepStatus.IN_PROGRESS at step start - Use EXECUTION_PLAN_STEP_TRUNCATED event for MAX_TOKENS - Decompose _run_step_turn into _handle_step_completion/_handle_step_tool_calls - Fix docstrings for OrchestrationAlertLevel thresholds - Add allow_inf_nan=False to CostRecord, CategoryBreakdown, OrchestrationRatio - Add cross-field validation to OrchestrationRatio - Add unique-category validators to ErrorTaxonomyConfig, CoordinationMetricsConfig - Deep-copy ExecutionResult metadata at construction boundary - Fix ReactLoop class docstring and add call categorization - Narrow _parse_json_plan exception to JSONDecodeError only - Add debug logging to silent return-None paths in plan parsing - Fix Raises docstrings for compute_efficiency/compute_overhead - Move CompletionResponse import to TYPE_CHECKING in plan_parsing Test fixes: - Add test for successful replan path - Rewrite test_max_replans_exhausted to exercise actual replan exhaustion - Add multi-step with tools integration test - Add ReAct vs Plan-and-Execute comparison test - Add tests for clear_last_turn_tool_calls - Add tests for make_budget_checker factory - Verify model IDs in test_different_models_for_phases - Tighten test_max_turns_during_step to expect only MAX_TURNS - Add check_budget MemoryError/RecursionError propagation tests - Add model recording to MockCompletionProvider Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…n no-invoker - Add logger to category_analytics.py (CLAUDE.md: business logic needs logger) - Add logger to coordination_metrics.py (CLAUDE.md: business logic needs logger) - Clear tool_calls_made on turn record when no invoker is available (consistency with shutdown handling — tools were never executed) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 32 out of 32 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _PLANNING_PROMPT = """\ | ||
| You are a planning agent. Analyze the task and create a step-by-step \ | ||
| execution plan. Return your plan as a JSON object with this exact schema: | ||
|
|
||
| ```json | ||
| { | ||
| "steps": [ | ||
| { | ||
| "step_number": 1, | ||
| "description": "What to do in this step", | ||
| "expected_outcome": "What should result from this step" | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| Each step should be concrete, actionable, and independently verifiable. \ | ||
| Return ONLY the JSON object, no other text.""" | ||
|
|
||
| _REPLAN_JSON_EXAMPLE = """\ | ||
| ```json | ||
| { | ||
| "steps": [ | ||
| { | ||
| "step_number": 1, | ||
| "description": "What to do in this step", | ||
| "expected_outcome": "What should result from this step" | ||
| } | ||
| ] | ||
| } | ||
| ```""" |
There was a problem hiding this comment.
_PLANNING_PROMPT and _REPLAN_JSON_EXAMPLE use leading-underscore naming, which conventionally signals module-private names. However, they are imported and used by plan_execute_loop.py. Since these constants are part of the cross-module API, they should be renamed to PLANNING_PROMPT and REPLAN_JSON_EXAMPLE (without the leading underscore) to signal that they are intended for use outside plan_parsing.py.
| import pytest | ||
|
|
||
| from ai_company.budget.call_category import LLMCallCategory, OrchestrationAlertLevel | ||
|
|
There was a problem hiding this comment.
All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker, which is a deviation from the established convention in this directory.
| pytestmark = pytest.mark.timeout(30) |
| """Tests for category-based analytics.""" | ||
|
|
||
| from datetime import UTC, datetime |
There was a problem hiding this comment.
All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker.
| """Tests for coordination metrics computations.""" | ||
|
|
||
| import pytest | ||
|
|
There was a problem hiding this comment.
All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker.
| pytestmark = pytest.mark.timeout(30) |
|
|
||
| import pytest | ||
| from pydantic import ValidationError | ||
|
|
There was a problem hiding this comment.
All other test files in tests/unit/budget/ include pytestmark = pytest.mark.timeout(30) (see test_config.py:18, test_cost_record.py:13, test_enums.py:7, test_hierarchy.py:18, test_spending_summary.py:23, test_tracker.py:20). This new file is missing that module-level marker.
| pytestmark = pytest.mark.timeout(30) |
| def parse_plan( | ||
| response: CompletionResponse, | ||
| execution_id: str, | ||
| task_summary: str, | ||
| *, | ||
| revision_number: int = 0, | ||
| ) -> ExecutionPlan | None: | ||
| """Parse an ExecutionPlan from LLM response content. | ||
|
|
||
| Tries JSON extraction first (with markdown code fence stripping), | ||
| then falls back to structured text parsing. | ||
|
|
||
| Args: | ||
| response: LLM completion response. | ||
| execution_id: Execution ID for logging. | ||
| task_summary: Brief summary of the task being planned. | ||
| revision_number: Plan revision counter (0 = original). | ||
|
|
||
| Returns: | ||
| Parsed ``ExecutionPlan``, or ``None`` on failure. | ||
| """ | ||
| content = response.content or "" | ||
| if not content.strip(): | ||
| logger.warning( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| execution_id=execution_id, | ||
| reason="empty LLM response content", | ||
| ) | ||
| return None | ||
|
|
||
| plan = _parse_json_plan(content, task_summary, revision_number) | ||
| if plan is not None: | ||
| return plan | ||
|
|
||
| plan = _parse_text_plan(content, task_summary, revision_number) | ||
| if plan is not None: | ||
| return plan | ||
|
|
||
| logger.warning( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| execution_id=execution_id, | ||
| content_length=len(content), | ||
| content_snippet=content[:200], | ||
| ) | ||
| return None | ||
|
|
||
|
|
||
| def _parse_json_plan( | ||
| content: str, | ||
| task_summary: str, | ||
| revision_number: int, | ||
| ) -> ExecutionPlan | None: | ||
| """Try to extract a JSON plan from the content.""" | ||
| json_str = content.strip() | ||
| fence_match = re.search( | ||
| r"```(?:json)?\s*\n?(.*?)```", | ||
| json_str, | ||
| re.DOTALL, | ||
| ) | ||
| if fence_match: | ||
| json_str = fence_match.group(1).strip() | ||
|
|
||
| try: | ||
| data = json.loads(json_str) | ||
| except json.JSONDecodeError as exc: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="json", | ||
| error=str(exc), | ||
| ) | ||
| return None | ||
|
|
||
| return _data_to_plan(data, task_summary, revision_number) | ||
|
|
||
|
|
||
| def _parse_text_plan( | ||
| content: str, | ||
| task_summary: str, | ||
| revision_number: int, | ||
| ) -> ExecutionPlan | None: | ||
| """Fallback: extract steps from numbered text lines.""" | ||
| step_pattern = re.compile( | ||
| r"(?:^|\n)\s*(\d+)\.\s+(.+?)(?=\n\s*\d+\.|\Z)", | ||
| re.DOTALL, | ||
| ) | ||
| matches = step_pattern.findall(content) | ||
| if not matches: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="text_fallback", | ||
| reason="no numbered steps found", | ||
| ) | ||
| return None | ||
|
|
||
| steps: list[PlanStep] = [] | ||
| for _, desc in matches: | ||
| desc_clean = desc.strip() | ||
| if not desc_clean: | ||
| continue | ||
| steps.append( | ||
| PlanStep( | ||
| step_number=len(steps) + 1, | ||
| description=desc_clean, | ||
| expected_outcome=desc_clean, | ||
| ) | ||
| ) | ||
|
|
||
| if not steps: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="text_fallback", | ||
| reason="all descriptions empty after stripping", | ||
| ) | ||
| return None | ||
|
|
||
| try: | ||
| return ExecutionPlan( | ||
| steps=tuple(steps), | ||
| revision_number=revision_number, | ||
| original_task_summary=task_summary, | ||
| ) | ||
| except ValueError as exc: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="text_fallback", | ||
| error=str(exc), | ||
| ) | ||
| return None | ||
|
|
||
|
|
||
| def _data_to_plan( | ||
| data: object, | ||
| task_summary: str, | ||
| revision_number: int, | ||
| ) -> ExecutionPlan | None: | ||
| """Convert parsed JSON data to an ExecutionPlan.""" | ||
| if not isinstance(data, dict): | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="json_data", | ||
| reason="top-level value is not a dict", | ||
| data_type=type(data).__name__, | ||
| ) | ||
| return None | ||
|
|
||
| raw_steps = data.get("steps") | ||
| if not isinstance(raw_steps, list) or not raw_steps: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="json_data", | ||
| reason="missing or empty 'steps' list", | ||
| ) | ||
| return None | ||
|
|
||
| steps: list[PlanStep] = [] | ||
| for i, raw_step in enumerate(raw_steps, start=1): | ||
| if not isinstance(raw_step, dict): | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="json_data", | ||
| reason=f"step {i} is not a dict", | ||
| ) | ||
| return None | ||
| desc = raw_step.get("description", "") | ||
| outcome = raw_step.get("expected_outcome", desc) | ||
| if not desc: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="json_data", | ||
| reason=f"step {i} has no description", | ||
| ) | ||
| return None | ||
| steps.append( | ||
| PlanStep( | ||
| step_number=i, | ||
| description=str(desc), | ||
| expected_outcome=str(outcome), | ||
| ) | ||
| ) | ||
|
|
||
| try: | ||
| return ExecutionPlan( | ||
| steps=tuple(steps), | ||
| revision_number=revision_number, | ||
| original_task_summary=task_summary, | ||
| ) | ||
| except ValueError as exc: | ||
| logger.debug( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| parser="json_data", | ||
| error=str(exc), | ||
| ) | ||
| return None |
There was a problem hiding this comment.
src/ai_company/engine/plan_parsing.py (249 lines) has no dedicated test file, unlike every other source file in the engine package which has a corresponding test_*.py file. While some parsing paths are exercised indirectly through test_plan_execute_loop.py, the individual parsing functions (_parse_json_plan, _parse_text_plan, _data_to_plan) and their edge cases (e.g., non-dict top-level JSON, steps with missing descriptions, empty descriptions after stripping) aren't directly tested. A test_plan_parsing.py file should be added for parity with the rest of the package.
There was a problem hiding this comment.
Actionable comments posted: 5
♻️ Duplicate comments (6)
src/ai_company/engine/plan_execute_loop.py (2)
221-225:⚠️ Potential issue | 🟠 Major
all_plansdrifts stale after step-status updates.
_update_step_status()returns a replacement plan, butall_planskeeps the older instance._finalize()serializesall_plans, so the emitted plan history will show stale statuses after steps move toIN_PROGRESS,COMPLETED, orFAILED.🧩 Suggested fix
plan = self._update_step_status( plan, step_idx, StepStatus.IN_PROGRESS, ) + all_plans[-1] = plan logger.info( EXECUTION_PLAN_STEP_START, execution_id=ctx.execution_id, step_number=step.step_number, description=step.description, @@ plan = self._update_step_status( plan, step_idx, StepStatus.COMPLETED, ) + all_plans[-1] = plan logger.info( EXECUTION_PLAN_STEP_COMPLETE, execution_id=ctx.execution_id, step_number=step.step_number, ) @@ - plan = self._update_step_status(plan, step_idx, StepStatus.FAILED) + plan = self._update_step_status(plan, step_idx, StepStatus.FAILED) + all_plans[-1] = plan logger.warning( EXECUTION_PLAN_STEP_FAILED, execution_id=ctx.execution_id, step_number=step.step_number, )Also applies to: 252-256, 315-316, 777-787
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/plan_execute_loop.py` around lines 221 - 225, The plan object returned by _update_step_status is not propagated back into the all_plans list, so serialized history in _finalize uses stale plan instances; update the entry in all_plans with the returned plan whenever you call _update_step_status (e.g., inside PlanExecuteLoop where plan = self._update_step_status(...) is used), replacing the old all_plans[plan_index] (or the element found by plan.id) with the new plan; apply the same replacement behavior for all other call sites mentioned (around lines 252-256, 315-316, and 777-787) so all_plans always holds the latest plan instances before _finalize serializes them.
599-608:⚠️ Potential issue | 🟠 Major
finish_reasonis not a step-success signal.
STOP/MAX_TOKENSonly mean generation ended. A reply like “I couldn't complete this step because …” still returnsTruehere, so the step is marked completed and the replan path never runs. Have the executor emit an explicit success/failure marker, or a tiny JSON status, and key_assess_step_success()off that instead.Also applies to: 709-724, 761-774
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/plan_execute_loop.py` around lines 599 - 608, The current logic treats the LLM response finish_reason as a success signal; instead modify the executor and success-checking logic so the model must emit an explicit success marker (e.g. a tiny JSON like {"success": true/false, "note": "..."} appended to its reply) and have _assess_step_success() parse that marker from the assistant content to decide success or failure; update the instruction templates built around the step execution (the strings at the sites shown around the instruction construction) to require that explicit JSON status at the end, and change the places that currently rely on finish_reason (including the response handling paths referenced by _assess_step_success() and the call sites around the ranges you noted) to call _assess_step_success() and act on its boolean result instead of finish_reason.src/ai_company/budget/coordination_metrics.py (2)
86-90:⚠️ Potential issue | 🟠 MajorReject error rates above
1.0.Both fields are documented as rates, but values like
1.5still validate and produce meaningless amplification factors. Add an upper bound to both inputs and keep the explicit zero-divisor guard forerror_rate_sas.📏 Suggested fix
error_rate_mas: float = Field( ge=0.0, + le=1.0, description="Multi-agent error rate", ) - error_rate_sas: float = Field(gt=0, description="Single-agent error rate") + error_rate_sas: float = Field( + gt=0, + le=1.0, + description="Single-agent error rate", + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/budget/coordination_metrics.py` around lines 86 - 90, Add an upper bound of 1.0 to both rate fields to reject values >1.0: update the Field() for error_rate_mas to include le=1.0 (keeping ge=0.0) and update the Field() for error_rate_sas to include le=1.0 while preserving its gt=0 guard; reference the existing symbols error_rate_mas and error_rate_sas and modify their Field(...) constraints accordingly.
8-12: 🛠️ Refactor suggestion | 🟠 MajorAdd the required logger and log-before-raise on invalid inputs.
This is budget business logic, but the module still raises
ValueErrorin several branches without aloggeror event constants. That makes bad metric inputs hard to diagnose and violates the repo observability contract.As per coding guidelines
src/ai_company/**/*.py: Every module with business logic MUST havefrom ai_company.observability import get_loggerthenlogger = get_logger(__name__), and all error paths must log at WARNING or ERROR with context before raising.Also applies to: 216-218, 245-247, 271-273, 297-299, 322-328
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/budget/coordination_metrics.py` around lines 8 - 12, Add observability: import get_logger from ai_company.observability and instantiate logger = get_logger(__name__) at top of src/ai_company/budget/coordination_metrics.py, then before every ValueError raise in the module (including the branches referenced around lines 216-218, 245-247, 271-273, 297-299, 322-328) emit a logger.warning or logger.error with contextual fields (function name, input values, and a repo event constant) and include a named event constant from ai_company.observability.events (e.g., BAD_METRIC_INPUT) so that each error path logs context before raising the ValueError.src/ai_company/budget/category_analytics.py (2)
8-20: 🛠️ Refactor suggestion | 🟠 MajorAdd the required observability hook to this analytics module.
build_category_breakdown()andcompute_orchestration_ratio()are budget business logic, but this file still has nologgeror budget event constants. That breaks the repo observability contract for direct callers of these helpers.As per coding guidelines
src/ai_company/**/*.py: Every module with business logic MUST havefrom ai_company.observability import get_loggerthenlogger = get_logger(__name__); use event name constants fromai_company.observability.events.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/budget/category_analytics.py` around lines 8 - 20, This module lacks the observability hook and event constants; add "from ai_company.observability import get_logger" and "from ai_company.observability.events import <appropriate budget event constants>" at the top, then create "logger = get_logger(__name__)" and emit the relevant budget events from build_category_breakdown and compute_orchestration_ratio (use the imported event name constants when logging/emitting) so callers follow the repo observability contract.
116-149:⚠️ Potential issue | 🟠 MajorExpose
uncategorized_tokensonOrchestrationRatio.
total_tokensincludes uncategorized traffic, but the returned model drops that bucket. Callers can therefore seeratio=0.0with non-zero totals and no way to distinguish real zero overhead from missing categorization.🧭 Suggested fix
class OrchestrationRatio(BaseModel): @@ productive_tokens: int = Field(ge=0, description="Productive tokens") coordination_tokens: int = Field(ge=0, description="Coordination tokens") system_tokens: int = Field(ge=0, description="System tokens") + uncategorized_tokens: int = Field( + ge=0, + description="Uncategorized tokens", + ) `@model_validator`(mode="after") def _validate_token_consistency(self) -> Self: - """Ensure total_tokens >= sum of category tokens.""" + """Ensure total_tokens matches the exposed category buckets.""" category_sum = ( - self.productive_tokens + self.coordination_tokens + self.system_tokens + self.productive_tokens + + self.coordination_tokens + + self.system_tokens + + self.uncategorized_tokens ) - if self.total_tokens < category_sum: + if self.total_tokens != category_sum: msg = ( - f"total_tokens ({self.total_tokens}) must be >= " + f"total_tokens ({self.total_tokens}) must equal " f"sum of category tokens ({category_sum})" ) raise ValueError(msg) return self @@ total_tokens=0, productive_tokens=0, coordination_tokens=0, system_tokens=0, + uncategorized_tokens=0, ) @@ total_tokens=total, productive_tokens=breakdown.productive_tokens, coordination_tokens=breakdown.coordination_tokens, system_tokens=breakdown.system_tokens, + uncategorized_tokens=breakdown.uncategorized_tokens, )Also applies to: 225-247
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/budget/category_analytics.py` around lines 116 - 149, Add an uncategorized_tokens Field to the OrchestrationRatio model and include it in the token consistency check and docstring: add "uncategorized_tokens: int = Field(ge=0, description='Uncategorized tokens')" to the model fields, update the class docstring to document uncategorized_tokens, and modify _validate_token_consistency to compute category_sum = self.productive_tokens + self.coordination_tokens + self.system_tokens + self.uncategorized_tokens so total_tokens >= category_sum; apply the same change to the second analogous block noted (lines ~225-247).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/engine/plan_execute_loop.py`:
- Around line 685-707: After updating ctx with ctx.with_turn_completed (so
accumulated_cost includes this turn), immediately call the budget check
(check_budget(ctx) or the existing budget-checking helper used elsewhere) and if
it reports budget exhausted, do not execute any tools: log the exhaustion and
return the same path as a completed step (i.e., call and return
self._handle_step_completion(ctx, response, turn_number)) instead of calling
_handle_step_tool_calls; this ensures no side-effecting tool calls run after
budget is exhausted.
- Around line 140-150: The planning-phase early returns bypass self._finalize(),
so modify execute() to call self._finalize(...) before returning when
_run_planning_phase(...) yields an ExecutionResult; specifically, after
detecting isinstance(plan_result, ExecutionResult) call
self._finalize(loop_type="planning", plan=... ) with the same plan metadata used
by step-phase exits (e.g., the plan object from planner_model or plan_result)
and any required finalization args (shutdown/budget info), then return
plan_result.
In `@src/ai_company/engine/plan_parsing.py`:
- Around line 95-100: The logger call that currently includes raw planner output
via content_snippet=content[:200] must be changed to avoid logging user/model
data; in the code around the logger.warning call in plan_parsing.py (the
EXECUTION_PLAN_PARSE_ERROR log using execution_id and content), remove the
content_snippet field and instead log a non-sensitive fingerprint (e.g., compute
a SHA-256 hex digest of content and include content_hash) while keeping
content_length and execution_id; update the logging call to include content_hash
(or similar) and/or other structured metadata but not any substring of content.
In `@src/ai_company/engine/react_loop.py`:
- Around line 201-204: The shutdown ExecutionResult is built before we clear
tool calls, so after calling clear_last_turn_tool_calls(turns) you must rebuild
the shutdown result to reflect the mutated turns; i.e., call check_shutdown(ctx,
shutdown_checker, turns) again (or otherwise reconstruct the ExecutionResult
from the updated turns) so the returned value from the shutdown branch
accurately reports no-executed tool calls (referencing check_shutdown,
clear_last_turn_tool_calls, TurnRecord, ExecutionResult, and the turns
variable).
In `@tests/unit/engine/test_plan_execute_loop.py`:
- Around line 191-199: Add assertions that validate the category of each turn in
the happy-path test: verify the first turn in result.turns (the planning turn)
has category SYSTEM and that subsequent execution turns have category
PRODUCTIVE; keep existing checks (termination_reason, turn count, metadata) and
insert checks referencing result.turns (e.g., result.turns[0].category and each
turn in result.turns[1:]) so a regression that mis-categorizes planning vs.
execution will fail.
---
Duplicate comments:
In `@src/ai_company/budget/category_analytics.py`:
- Around line 8-20: This module lacks the observability hook and event
constants; add "from ai_company.observability import get_logger" and "from
ai_company.observability.events import <appropriate budget event constants>" at
the top, then create "logger = get_logger(__name__)" and emit the relevant
budget events from build_category_breakdown and compute_orchestration_ratio (use
the imported event name constants when logging/emitting) so callers follow the
repo observability contract.
- Around line 116-149: Add an uncategorized_tokens Field to the
OrchestrationRatio model and include it in the token consistency check and
docstring: add "uncategorized_tokens: int = Field(ge=0,
description='Uncategorized tokens')" to the model fields, update the class
docstring to document uncategorized_tokens, and modify
_validate_token_consistency to compute category_sum = self.productive_tokens +
self.coordination_tokens + self.system_tokens + self.uncategorized_tokens so
total_tokens >= category_sum; apply the same change to the second analogous
block noted (lines ~225-247).
In `@src/ai_company/budget/coordination_metrics.py`:
- Around line 86-90: Add an upper bound of 1.0 to both rate fields to reject
values >1.0: update the Field() for error_rate_mas to include le=1.0 (keeping
ge=0.0) and update the Field() for error_rate_sas to include le=1.0 while
preserving its gt=0 guard; reference the existing symbols error_rate_mas and
error_rate_sas and modify their Field(...) constraints accordingly.
- Around line 8-12: Add observability: import get_logger from
ai_company.observability and instantiate logger = get_logger(__name__) at top of
src/ai_company/budget/coordination_metrics.py, then before every ValueError
raise in the module (including the branches referenced around lines 216-218,
245-247, 271-273, 297-299, 322-328) emit a logger.warning or logger.error with
contextual fields (function name, input values, and a repo event constant) and
include a named event constant from ai_company.observability.events (e.g.,
BAD_METRIC_INPUT) so that each error path logs context before raising the
ValueError.
In `@src/ai_company/engine/plan_execute_loop.py`:
- Around line 221-225: The plan object returned by _update_step_status is not
propagated back into the all_plans list, so serialized history in _finalize uses
stale plan instances; update the entry in all_plans with the returned plan
whenever you call _update_step_status (e.g., inside PlanExecuteLoop where plan =
self._update_step_status(...) is used), replacing the old all_plans[plan_index]
(or the element found by plan.id) with the new plan; apply the same replacement
behavior for all other call sites mentioned (around lines 252-256, 315-316, and
777-787) so all_plans always holds the latest plan instances before _finalize
serializes them.
- Around line 599-608: The current logic treats the LLM response finish_reason
as a success signal; instead modify the executor and success-checking logic so
the model must emit an explicit success marker (e.g. a tiny JSON like
{"success": true/false, "note": "..."} appended to its reply) and have
_assess_step_success() parse that marker from the assistant content to decide
success or failure; update the instruction templates built around the step
execution (the strings at the sites shown around the instruction construction)
to require that explicit JSON status at the end, and change the places that
currently rely on finish_reason (including the response handling paths
referenced by _assess_step_success() and the call sites around the ranges you
noted) to call _assess_step_success() and act on its boolean result instead of
finish_reason.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f944f2e3-8aab-48be-af91-f36f4b51323f
📒 Files selected for processing (14)
src/ai_company/budget/call_category.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/budget/coordination_config.pysrc/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/cost_record.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/engine/plan_parsing.pysrc/ai_company/engine/react_loop.pysrc/ai_company/observability/events/execution.pytests/unit/engine/conftest.pytests/unit/engine/test_loop_helpers.pytests/unit/engine/test_loop_protocol.pytests/unit/engine/test_plan_execute_loop.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Nofrom __future__ import annotations— Python 3.14 has PEP 649
Useexcept A, B:syntax (no parentheses) for exception handling — ruff enforces this on Python 3.14
Include type hints on all public functions; enforce with mypy strict mode
Use Google-style docstrings on all public classes and functions — enforced by ruff D rules
Use immutability: create new objects, never mutate existing ones. For non-Pydantic internal collections, usecopy.deepcopy()at construction +MappingProxyTypewrapping. Fordict/listfields in frozen Pydantic models, rely onfrozen=Trueandcopy.deepcopy()at system boundaries
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models for runtime state that evolves
Use Pydantic v2 (BaseModel,model_validator,computed_field,ConfigDict). Use@computed_fieldfor derived values; useNotBlankStrfromcore.typesfor all identifier/name fields
Preferasyncio.TaskGroupfor fan-out/fan-in parallel operations in new code; prefer structured concurrency over barecreate_task
Maintain line length of 88 characters — enforced by ruff
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate input at system boundaries: user input, external APIs, and config files
Files:
src/ai_company/budget/coordination_metrics.pytests/unit/engine/test_loop_helpers.pysrc/ai_company/budget/call_category.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/engine/plan_parsing.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/budget/coordination_config.pytests/unit/engine/test_loop_protocol.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/react_loop.pytests/unit/engine/conftest.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/budget/cost_record.pytests/unit/engine/test_plan_execute_loop.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/ai_company/**/*.py: Every module with business logic MUST have:from ai_company.observability import get_loggerthenlogger = get_logger(__name__)
Never useimport logging/logging.getLogger()/print()in application code — useget_logger()from observability module
Use variable namelogger(not_logger, notlog) for logging instances
Always use event name constants fromai_company.observability.eventsdomain-specific modules; import directly (e.g.,from ai_company.observability.events.<domain> import EVENT_CONSTANT)
Use structured logging format:logger.info(EVENT, key=value)— neverlogger.info("msg %s", val)
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level for object creation, internal flow, and entry/exit of key functions
Pure data models, enums, and re-exports do NOT need logging
Files:
src/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/call_category.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/engine/plan_parsing.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/budget/coordination_config.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/budget/cost_record.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples — use generic names like
example-provider,example-large-001,large/medium/smallas aliases, ortest-provider,test-small-001in tests
Files:
src/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/call_category.pysrc/ai_company/budget/category_analytics.pysrc/ai_company/engine/plan_parsing.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/budget/coordination_config.pysrc/ai_company/observability/events/execution.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/plan_execute_loop.pysrc/ai_company/budget/cost_record.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest markers:@pytest.mark.unit,@pytest.mark.integration,@pytest.mark.e2e,@pytest.mark.slow
Maintain 80% minimum code coverage — enforced in CI
Useasyncio_mode = "auto"in pytest configuration — no manual@pytest.mark.asyncioneeded
Set 30-second timeout per test
Usepytest-xdistvia-n autofor parallel test execution
Prefer@pytest.mark.parametrizefor testing similar cases
Files:
tests/unit/engine/test_loop_helpers.pytests/unit/engine/test_loop_protocol.pytests/unit/engine/conftest.pytests/unit/engine/test_plan_execute_loop.py
src/ai_company/{providers,engine}/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
RetryExhaustedErrorsignals that all retries failed — the engine layer catches this to trigger fallback chains
Files:
src/ai_company/engine/plan_parsing.pysrc/ai_company/engine/loop_protocol.pysrc/ai_company/engine/react_loop.pysrc/ai_company/engine/plan_execute_loop.py
🧠 Learnings (9)
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Every module with business logic MUST have: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`
Applied to files:
src/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Never use `import logging` / `logging.getLogger()` / `print()` in application code — use `get_logger()` from observability module
Applied to files:
src/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : All error paths must log at WARNING or ERROR with context before raising
Applied to files:
src/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Use variable name `logger` (not `_logger`, not `log`) for logging instances
Applied to files:
src/ai_company/budget/coordination_metrics.pysrc/ai_company/budget/category_analytics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Use structured logging format: `logger.info(EVENT, key=value)` — never `logger.info("msg %s", val)`
Applied to files:
src/ai_company/budget/coordination_metrics.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from `ai_company.observability.events` domain-specific modules; import directly (e.g., `from ai_company.observability.events.<domain> import EVENT_CONSTANT`)
Applied to files:
src/ai_company/budget/category_analytics.pysrc/ai_company/observability/events/execution.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Keep functions under 50 lines and files under 800 lines
Applied to files:
src/ai_company/engine/plan_execute_loop.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to **/*.py : Use Pydantic v2 (`BaseModel`, `model_validator`, `computed_field`, `ConfigDict`). Use `computed_field` for derived values; use `NotBlankStr` from `core.types` for all identifier/name fields
Applied to files:
src/ai_company/budget/cost_record.py
📚 Learning: 2026-03-07T14:50:05.694Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-07T14:50:05.694Z
Learning: Applies to src/ai_company/{providers,engine}/**/*.py : `RetryExhaustedError` signals that all retries failed — the engine layer catches this to trigger fallback chains
Applied to files:
tests/unit/engine/test_plan_execute_loop.py
🧬 Code graph analysis (4)
tests/unit/engine/test_loop_helpers.py (7)
src/ai_company/engine/context.py (2)
AgentContext(87-307)with_message(173-182)src/ai_company/engine/loop_helpers.py (1)
build_result(369-384)src/ai_company/engine/loop_protocol.py (4)
TerminationReason(28-35)TurnRecord(38-75)execute(154-180)ExecutionResult(78-135)src/ai_company/providers/enums.py (2)
FinishReason(15-22)MessageRole(6-12)src/ai_company/providers/models.py (5)
CompletionConfig(213-254)CompletionResponse(257-306)TokenUsage(12-35)ToolCall(96-119)ChatMessage(138-210)src/ai_company/tools/base.py (4)
BaseTool(56-161)ToolExecutionResult(24-53)description(115-117)category(110-112)src/ai_company/tools/registry.py (1)
ToolRegistry(30-122)
src/ai_company/budget/category_analytics.py (3)
src/ai_company/budget/call_category.py (2)
LLMCallCategory(11-25)OrchestrationAlertLevel(28-45)src/ai_company/budget/coordination_config.py (1)
OrchestrationAlertThresholds(60-100)src/ai_company/budget/cost_record.py (2)
CostRecord(15-56)_validate_token_consistency(51-56)
src/ai_company/engine/loop_protocol.py (1)
src/ai_company/budget/call_category.py (1)
LLMCallCategory(11-25)
src/ai_company/budget/cost_record.py (1)
src/ai_company/budget/call_category.py (1)
LLMCallCategory(11-25)
🔇 Additional comments (2)
tests/unit/engine/conftest.py (2)
191-206: Defensive exposure of recorded models looks good.Initializing
_recorded_modelsand returning a copied list fromrecorded_modelsmatches the existing config-tracking pattern and keeps the test double encapsulated.
220-222: Model capture is wired intocomplete()correctly.Appending
modelat call time ensures the recorded sequence stays aligned withcall_countandrecorded_configs, which makes multi-call assertions straightforward.
| ctx = ctx.with_turn_completed( | ||
| response.usage, | ||
| response_to_message(response), | ||
| ) | ||
| logger.info( | ||
| EXECUTION_LOOP_TURN_COMPLETE, | ||
| execution_id=ctx.execution_id, | ||
| turn=turn_number, | ||
| finish_reason=response.finish_reason.value, | ||
| tool_call_count=len(response.tool_calls), | ||
| ) | ||
|
|
||
| if not response.tool_calls: | ||
| return self._handle_step_completion(ctx, response, turn_number) | ||
|
|
||
| return await self._handle_step_tool_calls( | ||
| ctx, | ||
| tool_invoker, | ||
| response, | ||
| turn_number, | ||
| turns, | ||
| shutdown_checker, | ||
| ) |
There was a problem hiding this comment.
Budget exhaustion can still execute tools.
After ctx.with_turn_completed(...), this turn's cost is already in ctx.accumulated_cost. If that response exhausted the budget and also requested tools, we still execute them before the next check_budget(), which can fire side-effecting tools after the hard stop.
💸 Suggested fix
if not response.tool_calls:
return self._handle_step_completion(ctx, response, turn_number)
+ budget_result = check_budget(ctx, budget_checker, turns)
+ if budget_result is not None:
+ clear_last_turn_tool_calls(turns)
+ return budget_result
+
return await self._handle_step_tool_calls(
ctx,
tool_invoker,
response,
turn_number,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/plan_execute_loop.py` around lines 685 - 707, After
updating ctx with ctx.with_turn_completed (so accumulated_cost includes this
turn), immediately call the budget check (check_budget(ctx) or the existing
budget-checking helper used elsewhere) and if it reports budget exhausted, do
not execute any tools: log the exhaustion and return the same path as a
completed step (i.e., call and return self._handle_step_completion(ctx,
response, turn_number)) instead of calling _handle_step_tool_calls; this ensures
no side-effecting tool calls run after budget is exhausted.
| logger.warning( | ||
| EXECUTION_PLAN_PARSE_ERROR, | ||
| execution_id=execution_id, | ||
| content_length=len(content), | ||
| content_snippet=content[:200], | ||
| ) |
There was a problem hiding this comment.
Don't log raw planner output on parse errors.
content_snippet=content[:200] writes arbitrary model/task output into warning logs. On plan failures that can leak user data, secrets, or code into retained logs; keep structured metadata like length/hash instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/plan_parsing.py` around lines 95 - 100, The logger call
that currently includes raw planner output via content_snippet=content[:200]
must be changed to avoid logging user/model data; in the code around the
logger.warning call in plan_parsing.py (the EXECUTION_PLAN_PARSE_ERROR log using
execution_id and content), remove the content_snippet field and instead log a
non-sensitive fingerprint (e.g., compute a SHA-256 hex digest of content and
include content_hash) while keeping content_length and execution_id; update the
logging call to include content_hash (or similar) and/or other structured
metadata but not any substring of content.
Copilot feedback: - Add pytestmark = pytest.mark.timeout(30) to 4 budget test files - Add dedicated test_plan_parsing.py for plan parsing edge cases CodeRabbit feedback: - Wrap planning-phase early returns in _finalize() for metadata consistency - Fix clear_last_turn_tool_calls race: rebuild result with cleaned turns (both ReactLoop and PlanExecuteLoop _handle_step_tool_calls) - Remove raw content snippet from parse_plan warning log (security) - Assert call categories (SYSTEM/PRODUCTIVE) in happy-path test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🤖 I have created a release *beep* *boop* --- ## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
🤖 I have created a release *beep* *boop* --- ## [0.1.0](v0.0.0...v0.1.0) (2026-03-11) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add mandatory JWT + API key authentication ([#256](#256)) ([c279cfe](c279cfe)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable output scan response policies ([#263](#263)) ([b9907e8](b9907e8)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement AuditRepository for security audit log persistence ([#279](#279)) ([94bc29f](94bc29f)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * resolve circular imports, bump litellm, fix release tag format ([#286](#286)) ([a6659b5](a6659b5)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * bump anchore/scan-action from 6.5.1 to 7.3.2 ([#271](#271)) ([80a1c15](80a1c15)) * bump docker/build-push-action from 6.19.2 to 7.0.0 ([#273](#273)) ([dd0219e](dd0219e)) * bump docker/login-action from 3.7.0 to 4.0.0 ([#272](#272)) ([33d6238](33d6238)) * bump docker/metadata-action from 5.10.0 to 6.0.0 ([#270](#270)) ([baee04e](baee04e)) * bump docker/setup-buildx-action from 3.12.0 to 4.0.0 ([#274](#274)) ([5fc06f7](5fc06f7)) * bump sigstore/cosign-installer from 3.9.1 to 4.1.0 ([#275](#275)) ([29dd16c](29dd16c)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * **main:** release ai-company 0.1.1 ([#282](#282)) ([2f4703d](2f4703d)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Summary
plan_execute_loop.py): Full implementation of the Plan-and-Execute execution strategy behind theExecutionLoopprotocol, with plan generation, step-by-step execution, replanning on failure, and 3 plan parsers (JSON, numbered list, text)call_category.py,category_analytics.py):LLMCallCategoryenum (PRODUCTIVE/COORDINATION/SYSTEM) with per-category cost/token breakdowns and orchestration overhead ratio with alert levelscoordination_metrics.py,coordination_config.py): Data models for delegation success rate, message efficiency, redundancy rate, and coordination efficiency — using@computed_fieldfor derived valuesloop_helpers.py): Shared utilities extracted from ReactLoop — token estimation, response error checking, turn building, tool call clearingplan_models.py):ExecutionPlanandPlanStepfrozen Pydantic modelscost_recording.py): Per-turn cost recording helper with category propagationKey design decisions
ExecutionLoopis a@runtime_checkable Protocolsatisfied by bothReactLoopandPlanExecuteLoopmetadata: dict[str, object](notAny) for type safety in loop results.format()with LLM contentexecute()from 198→45 lines)@computed_fieldfor all derivable metric values (coordination efficiency, message efficiency)Test plan
test_cost_recording.py(8 tests): noop, per-turn recording, zero-cost skip, free-tier, category propagation, exception handling, BaseException propagationtest_plan_execute_loop.py: plan generation, multi-step, replanning, max replans, budget/turn limits, content filter handling, metadata structuretest_plan_models.py,test_loop_helpers.py,test_call_category.py,test_category_analytics.py,test_coordination_metrics.py,test_coordination_config.pyReview coverage
Pre-reviewed by 8 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, type-design-analyzer, logging-audit, resilience-audit, docs-consistency). 36 findings addressed, 2 polish simplifications applied.
Closes #134, closes #135