
feat: add CFO cost optimization service with anomaly detection, reports, and approval decisions#186

Merged
Aureliolo merged 4 commits into main from feat/cfo-agent on Mar 9, 2026

Conversation

@Aureliolo
Owner

Summary

  • CostOptimizer service (budget/optimizer.py): Spending anomaly detection (spike/zero-baseline), cost efficiency analysis with per-agent ratings, model downgrade recommendations via resolver + downgrade map, and operation approval decisions with configurable auto-deny thresholds
  • Optimizer domain models (budget/optimizer_models.py): Frozen Pydantic models for anomalies, efficiency, downgrades, approvals, and config — using @computed_field for derived values, NotBlankStr for identifiers, cross-field validators
  • ReportGenerator service (budget/reports.py): Multi-dimensional spending reports with task/provider/model breakdowns, period-over-period comparison (computed fields), top-N agent/task rankings with sort-order validators
  • Event constants (events/cfo.py, events/budget.py): 12 CFO event constants + BUDGET_RECORDS_QUERIED
  • CostTracker extension (budget/tracker.py): get_records() query method for analytical consumers
  • Comprehensive test coverage: 65 tests across optimizer, optimizer models, and reports — including parametrized _classify_severity thresholds, computed field verification, validator edge cases, input validation, and downgrade path coverage
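The spike/zero-baseline anomaly detection described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the function name, default thresholds, and the zero-baseline rule are assumptions.

```python
import statistics


def is_spending_anomaly(
    current: float,
    historical: list[float],
    sigma_threshold: float = 3.0,
    spike_factor: float = 2.0,
) -> bool:
    """Hypothetical sketch of spike / zero-baseline anomaly detection."""
    if not historical:
        # Zero-baseline case: any spending by an agent with no history
        # is treated as anomalous.
        return current > 0
    mean = statistics.fmean(historical)
    stddev = statistics.stdev(historical) if len(historical) > 1 else 0.0
    deviation = (current - mean) / stddev if stddev > 0 else 0.0
    # Guarding on stddev > 0 avoids classifying on a degenerate deviation,
    # the kind of edge case called out in the pre-PR review fixes.
    is_sigma_anomaly = stddev > 0 and deviation > sigma_threshold
    is_spike = mean > 0 and current / mean > spike_factor
    return is_sigma_anomaly or is_spike
```

A spend of 100 against a history hovering around 10 trips both the sigma and the spike check; a spend near the historical mean trips neither.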

Closes #46

Pre-PR Review Coverage

  • 9 review agents run: code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, docs-consistency
  • 35 findings addressed (2 CRITICAL, 7 MAJOR, 14 MEDIUM, 12 MINOR)
  • Key fixes: spike severity bug when stddev=0, unit mismatch in savings field name, 4 stored→computed field conversions, double-fetch elimination, explicit input validation on all public methods, comprehensive debug logging

Test plan

  • uv run ruff check src/ tests/ — passes
  • uv run mypy src/ tests/ — passes
  • uv run pytest tests/ -n auto --cov=ai_company --cov-fail-under=80 — 4826 passed, 96.27% coverage
  • Verify CI passes on GitHub

🤖 Generated with Claude Code

…ts, and approval decisions (#46)

Implement CostOptimizer and ReportGenerator domain services backing the
CFO role (DESIGN_SPEC §10.3). CostOptimizer provides spending anomaly
detection (Z-score + spike factor), cost efficiency analysis per agent,
model downgrade recommendations via ModelResolver, and operation
approval/denial based on budget utilization. ReportGenerator produces
multi-dimensional spending reports with task/provider/model breakdowns
and period-over-period comparison. Adds get_records() to CostTracker
for raw record access. 80 new tests, 96% budget module coverage.
…ements

Pre-reviewed by 9 agents, 35 findings addressed.
Copilot AI review requested due to automatic review settings March 9, 2026 13:21
@github-actions
Contributor

github-actions bot commented Mar 9, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Mar 9, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 45f4e043-ec7f-4602-9e4d-25a7aba9d54c

📥 Commits

Reviewing files that changed from the base of the PR and between 69f06c1 and f909c79.

📒 Files selected for processing (9)
  • DESIGN_SPEC.md
  • src/ai_company/budget/_optimizer_helpers.py
  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/reports.py
  • src/ai_company/observability/events/cfo.py
  • tests/unit/budget/conftest.py
  • tests/unit/budget/test_optimizer.py
  • tests/unit/budget/test_optimizer_analysis.py
  • tests/unit/budget/test_optimizer_decisions.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • CFO cost-optimization service: anomaly detection, per-agent efficiency analysis, downgrade suggestions, routing optimizations, and operation approval decisions.
    • Multi-dimensional spending reports with breakdowns by task/provider/model, period comparisons, and top‑N rankings.
    • API to fetch filtered cost records.
  • Documentation

    • Expanded budget enforcement docs and added CFO observability event coverage.
  • Tests

    • Extensive unit tests covering optimizer, models, reports, and record queries.

Walkthrough

Adds a CFO cost-optimization subsystem: CostOptimizer service, ReportGenerator, domain models, internal helpers, an enriched CostTracker query API, new CFO/budget observability events, expanded public exports, and extensive unit tests for detection, analysis, recommendations, reporting, and approval evaluation.

Changes

  • Core Budget Services (src/ai_company/budget/optimizer.py, src/ai_company/budget/reports.py, src/ai_company/budget/_optimizer_helpers.py): New CostOptimizer and ReportGenerator services plus private helpers implementing anomaly detection, efficiency analysis, downgrade/routing recommendations, approval evaluation, and report construction.
  • Domain Models (src/ai_company/budget/optimizer_models.py, src/ai_company/budget/reports.py): Adds frozen Pydantic models and enums for anomalies, efficiency, downgrade and routing suggestions, approval decisions, and report artifacts (TaskSpending, ProviderDistribution, ModelDistribution, PeriodComparison, SpendingReport).
  • Tracker API & Events (src/ai_company/budget/tracker.py, src/ai_company/observability/events/budget.py): Adds CostTracker.get_records(...) to fetch filtered cost records and registers the new observability constant BUDGET_RECORDS_QUERIED.
  • Observability, CFO Domain (src/ai_company/observability/events/cfo.py, tests/unit/observability/test_events.py): Adds CFO-specific event constants (e.g., CFO_REPORT_VALIDATION_ERROR, CFO anomaly/report events) and updates the discovery test to include cfo.
  • Public API Exports (src/ai_company/budget/__init__.py): Exports new services and domain-model symbols (CostOptimizer, optimizer models, report models) by adding imports and extending __all__.
  • Tests & Fixtures (tests/unit/budget/conftest.py, tests/unit/budget/test_optimizer*.py, tests/unit/budget/test_reports.py, tests/unit/budget/test_tracker_get_records.py, tests/unit/observability/test_events.py): Adds fixtures and many unit tests covering initialization, anomaly detection, efficiency analysis, downgrade/routing recommendations, approval logic, report generation, model validation, and get_records behavior.
  • Test Helpers / Factories (tests/unit/budget/conftest.py): Adds factories and helpers for building CostOptimizer, ReportGenerator, ModelResolver, and test model data.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(240,248,255,0.5)
    participant Client as Client Agent
    participant Optimizer as CostOptimizer
    participant Tracker as CostTracker
    participant Resolver as ModelResolver
    participant Logger as EventLogger
    end

    Client->>Optimizer: detect_anomalies(start,end)
    Optimizer->>Tracker: get_records(start,end)
    Tracker-->>Optimizer: CostRecord[]
    Optimizer->>Optimizer: windowing & per-agent analysis
    Optimizer->>Logger: CFO_ANOMALY_DETECTED
    Optimizer-->>Client: AnomalyDetectionResult

    Client->>Optimizer: recommend_downgrades(start,end)
    Optimizer->>Optimizer: analyze_efficiency(start,end)
    Optimizer->>Resolver: resolve candidate models (async)
    Resolver-->>Optimizer: ResolvedModel(s)
    Optimizer->>Logger: CFO_DOWNGRADE_RECOMMENDED
    Optimizer-->>Client: DowngradeAnalysis

    Client->>Optimizer: evaluate_operation(agent_id, cost)
    Optimizer->>Tracker: get_records(month_window)
    Optimizer->>Optimizer: compute budget pressure & projected level
    Optimizer->>Logger: CFO_APPROVAL_EVALUATED
    Optimizer-->>Client: ApprovalDecision
sequenceDiagram
    rect rgba(255,250,240,0.5)
    participant Client as Client Agent
    participant ReportGen as ReportGenerator
    participant Tracker as CostTracker
    participant Aggregator as AggregationLogic
    participant Logger as EventLogger
    end

    Client->>ReportGen: generate_report(start,end,top_n,cmp?)
    ReportGen->>Tracker: get_records(start,end)
    Tracker-->>ReportGen: CostRecord[]
    ReportGen->>Aggregator: build by_task/by_provider/by_model
    opt include_period_comparison
        ReportGen->>Tracker: get_records(prev_start,prev_end)
        Tracker-->>ReportGen: CostRecord[]
        ReportGen->>Aggregator: compute PeriodComparison
    end
    ReportGen->>Logger: CFO_REPORT_GENERATED
    ReportGen-->>Client: SpendingReport
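The PeriodComparison computed in the report flow above might be modeled like this sketch, following the project's frozen-model and @computed_field conventions. Field and model names are assumptions based on the walkthrough, not the actual definitions.

```python
from pydantic import BaseModel, ConfigDict, computed_field


class PeriodComparison(BaseModel):
    """Hypothetical period-over-period comparison artifact."""

    model_config = ConfigDict(frozen=True)

    current_total_usd: float
    previous_total_usd: float

    @computed_field  # derived instead of stored, per project convention
    @property
    def change_pct(self) -> float:
        if self.previous_total_usd == 0:
            # No baseline to compare against; report no change.
            return 0.0
        delta = self.current_total_usd - self.previous_total_usd
        return round(delta / self.previous_total_usd * 100, 2)
```

Deriving change_pct via @computed_field keeps the stored fields minimal and avoids the stored-vs-computed drift that the pre-PR review flagged.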

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 38.37%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title accurately summarizes the main changes: adding a CFO cost optimization service with three core capabilities (anomaly detection, reports, approval decisions).
  • Description check ✅ Passed: The description is detailed and well-structured, covering all major components (CostOptimizer, domain models, ReportGenerator, events, CostTracker extension, and tests), addressing the PR objectives comprehensively.
  • Linked Issues check ✅ Passed: The PR implements all major objectives from issue #46: CFO agent role (monitoring and alerts with anomaly detection), cost optimization (model downgrades, efficiency analysis, routing suggestions), reporting (multi-dimensional reports), integration (approval decisions and CostTracker extension), and comprehensive testing (>80% coverage).
  • Out of Scope Changes check ✅ Passed: All changes are directly aligned with issue #46 objectives: core services (CostOptimizer, ReportGenerator), domain models, event constants, CostTracker extension, and comprehensive test coverage. No extraneous modifications detected.



@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive CFO cost optimization system, enabling the AI company to intelligently manage and reduce operational spending. It provides tools for detecting unusual spending patterns, analyzing agent efficiency, recommending cost-saving model downgrades, and making automated approval decisions for operations based on budget health. This system significantly enhances financial oversight and proactive cost management within the AI agent ecosystem.

Highlights

  • CFO Cost Optimization Service: Introduced the CostOptimizer service for spending anomaly detection, cost efficiency analysis, model downgrade recommendations, and operation approval decisions.
  • New Domain Models: Defined Pydantic models (optimizer_models.py) for anomalies, efficiency, downgrades, approvals, and configuration, utilizing computed fields and cross-field validators.
  • Spending Report Generator: Added the ReportGenerator service for creating multi-dimensional spending reports, including breakdowns by task, provider, model, and period-over-period comparisons.
  • Extended CostTracker: Enhanced the CostTracker with a new get_records() query method to support analytical consumers with filtered cost record retrieval.
  • New Event Constants: Added 12 new CFO-specific event constants and one budget-related event constant (BUDGET_RECORDS_QUERIED) for improved observability.
  • Comprehensive Testing: Included 65 new tests across the optimizer, its models, and the report generator, ensuring robust functionality, validation, and high code coverage.
Changelog
  • CLAUDE.md
    • Updated documentation to reflect the new CFO cost optimization features in the budget/ module.
    • Added CFO_ANOMALY_DETECTED to the example list of event names.
  • DESIGN_SPEC.md
    • Added detailed implementation notes for the CostOptimizer and ReportGenerator services.
    • Updated the directory structure to include new CFO-related files and their descriptions.
  • README.md
    • Updated the "Budget Enforcement" section to highlight the new CostOptimizer CFO service and ReportGenerator for spending reports.
  • src/ai_company/budget/__init__.py
    • Expanded the __init__.py to import and expose the newly added CostOptimizer, ReportGenerator, and their associated Pydantic models.
  • src/ai_company/budget/optimizer.py
    • Added the CostOptimizer service, implementing anomaly detection, efficiency analysis, downgrade recommendations, and operation approval logic.
  • src/ai_company/budget/optimizer_models.py
    • Added Pydantic models for the CostOptimizer domain, including SpendingAnomaly, EfficiencyAnalysis, DowngradeRecommendation, ApprovalDecision, and CostOptimizerConfig.
  • src/ai_company/budget/reports.py
    • Added the ReportGenerator service, providing functionality to create multi-dimensional spending reports with various breakdowns and period comparisons.
  • src/ai_company/budget/tracker.py
    • Updated the module docstring to reflect current persistence plans.
    • Added the BUDGET_RECORDS_QUERIED event.
    • Implemented the get_records method for filtered cost record retrieval.
  • src/ai_company/observability/events/budget.py
    • Added the BUDGET_RECORDS_QUERIED constant for logging when budget records are queried.
  • src/ai_company/observability/events/cfo.py
    • Added a new module containing various CFO-specific event constants for observability.
  • tests/unit/budget/conftest.py
    • Updated test configuration to include new factories and fixtures for CostOptimizerConfig, CostOptimizer, and ReportGenerator for easier testing.
  • tests/unit/budget/test_optimizer.py
    • Added comprehensive unit tests for the CostOptimizer service, covering anomaly detection, efficiency analysis, downgrade recommendations, and approval decisions.
  • tests/unit/budget/test_optimizer_models.py
    • Added unit tests for the Pydantic models defined in optimizer_models.py, verifying their structure, computed fields, and validators.
  • tests/unit/budget/test_reports.py
    • Added unit tests for the ReportGenerator service and its associated report models, ensuring correct report generation and data aggregation.
  • tests/unit/budget/test_tracker_get_records.py
    • Added unit tests specifically for the new CostTracker.get_records method, verifying its filtering and data retrieval capabilities.
  • tests/unit/observability/test_events.py
    • Updated the event domain discovery test to include the new cfo module.
Activity
  • The author, Aureliolo, initiated this feature to add CFO cost optimization capabilities.
  • Extensive pre-PR review coverage was performed by 9 review agents, leading to 35 findings being addressed (2 CRITICAL, 7 MAJOR, 14 MEDIUM, 12 MINOR).
  • Key fixes included resolving a spike severity bug when standard deviation was zero, correcting a unit mismatch in a savings field, converting 4 stored fields to computed fields, eliminating double-fetches, adding explicit input validation, and comprehensive debug logging.
  • The test plan indicates successful execution of ruff check, mypy, and pytest with 96.27% coverage.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the CFO cost optimization service, featuring the CostOptimizer for anomaly detection, efficiency analysis, and downgrade recommendations, and a ReportGenerator for detailed spending reports. The implementation is generally solid, with well-structured code, robust Pydantic models, and comprehensive test coverage. However, two potential Denial of Service (DoS) vectors were identified in the CostOptimizer service: one from inefficient algorithmic complexity in anomaly detection and downgrade recommendations, and another from missing upper-bound validation on the window_count parameter. Addressing these by grouping records by agent once and adding a maximum limit to the number of windows will improve the service's resilience against resource exhaustion attacks. A minor suggestion was also noted to improve code clarity by removing a redundant check.

Comment on lines +151 to +157

    for agent_id in agent_ids:
        window_costs = _compute_window_costs(
            records,
            agent_id,
            window_starts,
            window_duration,
        )

security-medium medium

The detect_anomalies and recommend_downgrades methods exhibit O(N*M) algorithmic complexity, where N is the number of agents and M is the number of cost records. Specifically, detect_anomalies iterates over all unique agents (line 151) and, for each agent, calls _compute_window_costs which iterates over the entire set of records (lines 547-551). Similarly, recommend_downgrades iterates over agents (line 299) and calls _find_most_used_model which also iterates over all records (lines 685-686). An attacker who can populate the CostTracker with a large number of records for many distinct agent IDs could trigger these methods to cause excessive CPU consumption, leading to a Denial of Service (DoS).
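The fix this finding points toward is bucketing records by agent in a single pass, so each agent's analysis works on its own slice instead of rescanning the full record set. A sketch, with a stand-in record type since the real CostRecord shape is not shown here:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Stand-in for a cost record; only agent_id matters for grouping."""
    agent_id: str
    cost_usd: float


def group_records_by_agent(records: list[Record]) -> dict[str, list[Record]]:
    """Bucket records by agent in one O(M) pass over all records."""
    by_agent: defaultdict[str, list[Record]] = defaultdict(list)
    for record in records:
        by_agent[record.agent_id].append(record)
    return dict(by_agent)
```

Per-agent loops then iterate only over `by_agent[agent_id]`, turning the overall cost from O(N*M) into O(M) for the grouping plus O(M) total across all agents.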

Comment on lines +134 to +146

    if window_count < 2:  # noqa: PLR2004
        msg = f"window_count must be >= 2, got {window_count}"
        raise ValueError(msg)

    now = datetime.now(UTC)
    records = await self._cost_tracker.get_records(
        start=start,
        end=end,
    )

    total_duration = end - start
    window_duration = total_duration / window_count
    window_starts = tuple(start + window_duration * i for i in range(window_count))

security-medium medium

The detect_anomalies method accepts a window_count parameter (line 111) that is used to create a tuple of time window starts (line 146). While there is a check to ensure window_count >= 2 (line 134), there is no upper bound validation. A very large value for window_count could lead to excessive memory allocation when creating the window_starts tuple, potentially causing an Out-of-Memory (OOM) condition and crashing the service.
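An upper bound alongside the existing lower-bound check would close this gap. A sketch; the cap value is an assumption to be tuned against realistic usage:

```python
MAX_WINDOW_COUNT = 1_000  # hypothetical cap; tune to realistic analysis needs


def validate_window_count(window_count: int) -> None:
    """Reject window counts outside [2, MAX_WINDOW_COUNT]."""
    if window_count < 2:
        msg = f"window_count must be >= 2, got {window_count}"
        raise ValueError(msg)
    if window_count > MAX_WINDOW_COUNT:
        # Prevents unbounded allocation of the window_starts tuple.
        msg = f"window_count must be <= {MAX_WINDOW_COUNT}, got {window_count}"
        raise ValueError(msg)
```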

# Check sigma threshold
stddev = statistics.stdev(historical) if len(historical) > 1 else 0.0
deviation = (current - mean) / stddev if stddev > 0 else 0.0
is_sigma_anomaly = stddev > 0 and deviation > config.anomaly_sigma_threshold

medium

The stddev > 0 check in this line is redundant. The preceding line ensures that deviation is 0.0 when stddev is 0. Since config.anomaly_sigma_threshold is constrained to be greater than 0, the comparison deviation > config.anomaly_sigma_threshold will correctly evaluate to False in that case. Removing the redundant check simplifies the logic.

Suggested change:

    - is_sigma_anomaly = stddev > 0 and deviation > config.anomaly_sigma_threshold
    + is_sigma_anomaly = deviation > config.anomaly_sigma_threshold


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/budget/optimizer.py`:
- Around line 106-190: The public methods detect_anomalies,
recommend_downgrades, and evaluate_operation are too large and mix validation,
data loading, decision logic, and logging; refactor each into smaller helpers
(e.g., extract validation into _validate_detect_args, data fetch into
_load_records_for_agent or _fetch_scan_records, core decision logic into
_compute_window_costs and _detect_spike_anomaly already exist but move
surrounding orchestration into helpers like _detect_anomalies_for_agent, and
logging into _log_anomaly and _log_scan_summary) so each public method is <50
lines: keep detect_anomalies responsible only for argument checks, calling the
helpers for records loading and per-agent analysis, aggregating results, and
invoking a single summary log; apply the same pattern to recommend_downgrades
and evaluate_operation by splitting validation, data access, business rules, and
logging into clearly named private functions.
- Around line 380-417: The auto-deny check currently compares
approval_auto_deny_alert_level against the current alert_level computed from
used_pct; change it to compute projected_used_pct = round(projected_cost /
cfg.total_monthly * 100, BUDGET_ROUNDING_PRECISION), then call
projected_alert_level = _compute_alert_level(projected_used_pct, cfg) and
compare _ALERT_LEVEL_ORDER[projected_alert_level] >=
_ALERT_LEVEL_ORDER[auto_deny_level]; if true, log the denial (use same logger
fields but include projected_* values) and return an ApprovalDecision denying
the request (similar to the existing block) so the configurable auto-deny
threshold is enforced based on projected usage rather than current usage.
- Around line 333-379: In evaluate_operation, validate the public input
estimated_cost_usd at the top of the function (before any budget logic) and fail
fast on impossible values: if estimated_cost_usd is negative, raise a clear
exception (e.g., ValueError) indicating the invalid estimated_cost_usd and
include the provided value and agent_id for diagnostics; this prevents callers
from increasing budget_remaining_usd by passing negative estimates and keeps the
public boundary robust.

In `@src/ai_company/budget/reports.py`:
- Around line 177-184: Update the tuple element types for top_agents_by_cost and
top_tasks_by_cost to use NotBlankStr for the identifier positions instead of
plain str; locate the Field declarations for top_agents_by_cost and
top_tasks_by_cost in the Reports model and change their type annotations from
tuple[tuple[str, float], ...] to tuple[tuple[NotBlankStr, float], ...], ensuring
any imports include NotBlankStr where these fields are defined.
- Around line 220-228: Add a DEBUG-level log in the __init__ of the class that
accepts CostTracker and BudgetConfig to record object creation and key init
values; update the __init__ method (the constructor with parameters
cost_tracker: CostTracker and budget_config: BudgetConfig) to call the
module/class logger.debug with a concise message that the report object was
created and include non-sensitive identifying info (e.g., id(cost_tracker) or
budget_config.name) to aid tracing.
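The projected-usage auto-deny requested in the first optimizer comment above could be sketched as follows. Level names, thresholds, and helper names are assumptions reconstructed from the comment; the real values come from config.

```python
BUDGET_ROUNDING_PRECISION = 2  # assumed rounding precision

_ALERT_LEVEL_ORDER = {"ok": 0, "warning": 1, "critical": 2, "exceeded": 3}


def _compute_alert_level(used_pct: float) -> str:
    """Hypothetical alert-level thresholds; the real ones are configurable."""
    if used_pct >= 100:
        return "exceeded"
    if used_pct >= 90:
        return "critical"
    if used_pct >= 75:
        return "warning"
    return "ok"


def should_auto_deny(
    spent_usd: float,
    estimated_cost_usd: float,
    total_monthly: float,
    auto_deny_level: str,
) -> bool:
    """Deny when the PROJECTED spend crosses the configured alert level."""
    projected_used_pct = round(
        (spent_usd + estimated_cost_usd) / total_monthly * 100,
        BUDGET_ROUNDING_PRECISION,
    )
    projected_level = _compute_alert_level(projected_used_pct)
    return _ALERT_LEVEL_ORDER[projected_level] >= _ALERT_LEVEL_ORDER[auto_deny_level]
```

The key difference from the current-usage check is that the operation's own estimated cost is included before computing the level, so an operation that would push the budget past the threshold is denied even when current usage is still below it.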

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cb438d54-3b98-4383-a58e-139b7127add4

📥 Commits

Reviewing files that changed from the base of the PR and between 873b0aa and 9048bf8.

📒 Files selected for processing (16)
  • CLAUDE.md
  • DESIGN_SPEC.md
  • README.md
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/reports.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/cfo.py
  • tests/unit/budget/conftest.py
  • tests/unit/budget/test_optimizer.py
  • tests/unit/budget/test_optimizer_models.py
  • tests/unit/budget/test_reports.py
  • tests/unit/budget/test_tracker_get_records.py
  • tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax (without parentheses) per PEP 758 — ruff enforces this on Python 3.14
All public functions must have type hints; use mypy strict mode for type-checking
Use Google-style docstrings on all public classes and functions; enforced by ruff D rules
Create new objects instead of mutating existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves; never mix static config fields with mutable runtime fields in one model
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use @computed_field for derived values instead of storing redundant fields; use NotBlankStr for all identifier/name fields (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Use line length of 88 characters (ruff)

Files:

  • src/ai_company/observability/events/budget.py
  • tests/unit/budget/test_optimizer_models.py
  • src/ai_company/observability/events/cfo.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/optimizer_models.py
  • tests/unit/budget/test_tracker_get_records.py
  • src/ai_company/budget/tracker.py
  • tests/unit/budget/test_optimizer.py
  • tests/unit/budget/conftest.py
  • tests/unit/observability/test_events.py
  • src/ai_company/budget/reports.py
  • tests/unit/budget/test_reports.py
  • src/ai_company/budget/optimizer.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Every module with business logic must import and use get_logger(__name__) from ai_company.observability; never use import logging or logging.getLogger() or print() in application code
Always use 'logger' as the variable name (not '_logger', not 'log')
Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals
Use structured logging with logger.info(EVENT, key=value) — never use logger.info('msg %s', val) string formatting
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions

Files:

  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/cfo.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/budget/reports.py
  • src/ai_company/budget/optimizer.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples; use generic names (example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small aliases)

Files:

  • src/ai_company/observability/events/budget.py
  • src/ai_company/observability/events/cfo.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/tracker.py
  • src/ai_company/budget/reports.py
  • src/ai_company/budget/optimizer.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Mark tests with @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow
Prefer @pytest.mark.parametrize for testing similar cases
In tests, use test-provider, test-small-001, etc. instead of real vendor names

Files:

  • tests/unit/budget/test_optimizer_models.py
  • tests/unit/budget/test_tracker_get_records.py
  • tests/unit/budget/test_optimizer.py
  • tests/unit/budget/conftest.py
  • tests/unit/observability/test_events.py
  • tests/unit/budget/test_reports.py
🧠 Learnings (7)
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Every module with business logic must import and use get_logger(__name__) from ai_company.observability; never use import logging or logging.getLogger() or print() in application code

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Always use 'logger' as the variable name (not '_logger', not 'log')

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals

Applied to files:

  • CLAUDE.md
  • src/ai_company/observability/events/cfo.py
  • DESIGN_SPEC.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Use structured logging with logger.info(EVENT, key=value) — never use logger.info('msg %s', val) string formatting

Applied to files:

  • CLAUDE.md
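The call shape this learning mandates can be sketched with a stub. `StubLogger`/`get_logger` below are stand-ins for the project's `ai_company.observability.get_logger`, which is not available outside the repo, and the event value is illustrative:

```python
from typing import Any


class StubLogger:
    """Minimal stand-in for the project's structlog-style logger."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: list[tuple[str, dict[str, Any]]] = []

    def info(self, event: str, **fields: Any) -> None:
        # Structured call: one event constant plus key=value context fields.
        self.entries.append((event, fields))


def get_logger(name: str) -> StubLogger:
    return StubLogger(name)


# Event name constant, never a string literal at the call site.
CFO_ANOMALY_DETECTED = "cfo.anomaly.detected"

logger = get_logger(__name__)
logger.info(CFO_ANOMALY_DETECTED, agent_id="agent-1", severity="high")
# Never: logger.info("anomaly for %s", agent_id)  <- printf-style is banned
```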
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All error paths must log at WARNING or ERROR with context before raising

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All state transitions must log at INFO level

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (8)
tests/unit/budget/test_optimizer_models.py (1)
src/ai_company/budget/optimizer_models.py (12)
  • AgentEfficiency (142-176)
  • AnomalyDetectionResult (104-136)
  • AnomalySeverity (34-39)
  • AnomalyType (22-31)
  • CostOptimizerConfig (332-381)
  • DowngradeAnalysis (267-290)
  • DowngradeRecommendation (231-264)
  • EfficiencyAnalysis (179-225)
  • EfficiencyRating (42-47)
  • SpendingAnomaly (53-101)
  • cost_per_1k_tokens (169-176)
  • inefficient_agent_count (206-212)
src/ai_company/budget/__init__.py (3)
src/ai_company/budget/optimizer.py (1)
  • CostOptimizer (72-483)
src/ai_company/budget/optimizer_models.py (11)
  • AgentEfficiency (142-176)
  • AnomalyDetectionResult (104-136)
  • AnomalySeverity (34-39)
  • AnomalyType (22-31)
  • ApprovalDecision (296-326)
  • CostOptimizerConfig (332-381)
  • DowngradeAnalysis (267-290)
  • DowngradeRecommendation (231-264)
  • EfficiencyAnalysis (179-225)
  • EfficiencyRating (42-47)
  • SpendingAnomaly (53-101)
src/ai_company/budget/reports.py (6)
  • ModelDistribution (77-98)
  • PeriodComparison (101-141)
  • ProviderDistribution (55-74)
  • ReportGenerator (209-331)
  • SpendingReport (144-203)
  • TaskSpending (37-52)
src/ai_company/budget/optimizer_models.py (1)
src/ai_company/budget/enums.py (1)
  • BudgetAlertLevel (6-16)
tests/unit/budget/test_tracker_get_records.py (2)
src/ai_company/budget/tracker.py (2)
  • get_records (185-225)
  • record (99-112)
tests/unit/budget/conftest.py (1)
  • make_cost_record (286-307)
src/ai_company/budget/tracker.py (1)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
tests/unit/budget/conftest.py (4)
src/ai_company/budget/optimizer.py (1)
  • CostOptimizer (72-483)
src/ai_company/budget/optimizer_models.py (1)
  • CostOptimizerConfig (332-381)
src/ai_company/budget/reports.py (1)
  • ReportGenerator (209-331)
src/ai_company/budget/enforcer.py (1)
  • cost_tracker (90-92)
src/ai_company/budget/reports.py (4)
src/ai_company/budget/spending_summary.py (1)
  • SpendingSummary (102-161)
src/ai_company/budget/config.py (1)
  • BudgetConfig (151-227)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
src/ai_company/budget/tracker.py (3)
  • CostTracker (68-455)
  • build_summary (227-281)
  • get_records (185-225)
tests/unit/budget/test_reports.py (1)
src/ai_company/budget/reports.py (9)
  • ModelDistribution (77-98)
  • PeriodComparison (101-141)
  • ProviderDistribution (55-74)
  • ReportGenerator (209-331)
  • SpendingReport (144-203)
  • TaskSpending (37-52)
  • cost_change_usd (125-130)
  • cost_change_percent (134-141)
  • generate_report (229-306)
🪛 LanguageTool
README.md

[typographical] ~26-~26: To join two clauses or introduce examples, consider using an em dash.
Context: ...n failures - Budget Enforcement (M5) - BudgetEnforcer service with pre-flight...

(DASH_RULE)

CLAUDE.md

[style] ~86-~86: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...

(EG_NO_COMMA)

🔇 Additional comments (40)
src/ai_company/observability/events/budget.py (1)

32-33: LGTM!

The new BUDGET_RECORDS_QUERIED event constant follows the established pattern: Final[str] typing and domain.subject.qualifier naming convention consistent with other budget events.

src/ai_company/budget/tracker.py (1)

185-225: LGTM!

The new get_records() method follows established patterns in this class:

  • Validates time range via _validate_time_range
  • Uses structured logging with event constant at DEBUG level
  • Returns immutable tuple[CostRecord, ...] snapshot
  • Consistent with get_category_breakdown() which also filters by agent_id and task_id
src/ai_company/observability/events/cfo.py (1)

1-15: LGTM!

Well-organized CFO event constants module following established patterns:

  • All constants use Final[str] typing
  • All values follow cfo.subject.qualifier naming convention
  • Comprehensive coverage for optimizer lifecycle, anomaly detection, efficiency analysis, downgrades, approvals, and reports

Based on learnings: these event name constants from ai_company.observability.events.cfo should be used instead of string literals in business logic.

CLAUDE.md (2)

47-47: LGTM!

The budget module description is accurately updated to reflect the new CFO cost optimization capabilities including anomaly detection, efficiency analysis, downgrade recommendations, approval decisions, and spending reports.


86-86: LGTM!

Good addition of CFO_ANOMALY_DETECTED from events.cfo to the event names documentation example, ensuring developers know about the new CFO domain module for observability events.

tests/unit/observability/test_events.py (1)

179-179: LGTM!

Correctly adds "cfo" to the expected domain modules set, ensuring the test validates that the new CFO events module is properly discoverable by pkgutil.

README.md (1)

26-26: LGTM!

The Budget Enforcement description is accurately updated to reflect the new CFO capabilities:

  • CostOptimizer CFO service with anomaly detection, efficiency analysis, downgrade recommendations, and approval decisions
  • ReportGenerator for multi-dimensional spending reports

The formatting is consistent with the rest of the document.

tests/unit/budget/test_optimizer_models.py (9)

1-20: LGTM!

Well-structured test module with proper imports and organization. Test coverage spans all CFO optimizer domain models including enums, data classes, validators, and computed fields.


25-51: LGTM!

Enum tests verify both string values and member counts, ensuring the enum definitions remain stable.


56-109: LGTM!

SpendingAnomaly tests comprehensively cover:

  • Construction with all required fields
  • Frozen model immutability
  • Period ordering validation (period_start must be before period_end)

114-136: LGTM!

AnomalyDetectionResult tests cover empty results and period ordering validation, consistent with the model's constraints.


141-178: LGTM!

AgentEfficiency tests validate:

  • Basic construction
  • Zero-token edge case (cost_per_1k_tokens returns 0.0)
  • Computed field derivation for cost_per_1k_tokens

183-229: LGTM!

EfficiencyAnalysis tests cover empty analysis, computed inefficient_agent_count, and period ordering validation.


234-273: LGTM!

DowngradeRecommendation and DowngradeAnalysis tests verify construction, immutability, and empty analysis handling. Uses test-large-001/test-small-001 per coding guidelines (no real vendor names).


278-315: LGTM!

ApprovalDecision tests cover approved/denied states, alert levels, and optional conditions tuple.


320-395: LGTM!

CostOptimizerConfig tests comprehensively validate:

  • Default values
  • Custom value acceptance
  • Constraint enforcement (sigma > 0, spike_factor > 1, inefficiency_factor > 1, min_anomaly_windows >= 2)
  • Frozen model immutability
  • Validator tests for DowngradeRecommendation (same model rejection, zero savings rejection)
tests/unit/budget/test_reports.py (5)

1-35: LGTM!

Well-organized test module with clean helper functions. The _make_report_generator factory creates fresh CostTracker and ReportGenerator instances for isolated test execution.


40-87: LGTM!

Report model tests verify construction and immutability for TaskSpending, ProviderDistribution, and ModelDistribution. Uses generic provider/model names per coding guidelines.


89-122: LGTM!

PeriodComparison tests comprehensively cover:

  • Cost increase (positive change)
  • Cost decrease (negative change)
  • No previous data (percent is None)
  • Equal periods (zero change)

127-348: LGTM!

ReportGenerator tests provide excellent coverage:

  • Initialization verification
  • Empty/no records scenario
  • Multiple agents/tasks aggregation
  • Provider/model distribution percentages
  • Period comparison (increase, decrease, no prior data, skip)
  • Top-N agents/tasks with proper sorting
  • Input validation (top_n < 1, start after end)

353-398: LGTM!

SpendingReport validator tests verify that top_agents_by_cost and top_tasks_by_cost must be sorted in descending order by cost, with both acceptance and rejection cases.

src/ai_company/budget/optimizer_models.py (10)

1-18: LGTM!

The module docstring follows Google-style, imports are clean, and the file correctly avoids from __future__ import annotations per coding guidelines. The noqa comments for TC001/TC003 are appropriate for runtime Pydantic requirements.


22-48: LGTM!

Enum definitions are clean with appropriate docstrings. Good practice to document that SUSTAINED_HIGH and RATE_INCREASE are reserved for future detection algorithms.


53-102: LGTM!

The SpendingAnomaly model is well-designed with proper constraints (ge=0.0 for non-negative values), NotBlankStr for identifiers, and a cross-field validator ensuring temporal ordering. The edge case for deviation_factor=0.0 when baseline is zero is properly documented.
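The pattern described here (frozen model, non-negative constraints, cross-field period ordering) can be sketched in miniature. Field names echo the PR, but this is an illustrative stand-in, not the project's `SpendingAnomaly`:

```python
from datetime import datetime, timedelta

from pydantic import BaseModel, ConfigDict, Field, model_validator


class PeriodModel(BaseModel):
    """Illustrative frozen model with a temporal-ordering invariant."""

    model_config = ConfigDict(frozen=True)

    period_start: datetime
    period_end: datetime
    current_value: float = Field(ge=0.0)  # non-negative, like the PR's costs

    @model_validator(mode="after")
    def _check_period_order(self) -> "PeriodModel":
        # Cross-field validator: runs after all fields are set.
        if self.period_start >= self.period_end:
            raise ValueError("period_start must be before period_end")
        return self
```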


104-137: LGTM!

The AnomalyDetectionResult model correctly uses an immutable tuple for anomalies with a sensible empty default. The period ordering validator follows the same pattern as SpendingAnomaly, maintaining consistency.


142-177: LGTM!

The AgentEfficiency model correctly uses @computed_field for the derived cost_per_1k_tokens value, handles division by zero gracefully, and applies consistent rounding via BUDGET_ROUNDING_PRECISION.
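The derived-value pattern reads roughly like the sketch below. `ROUNDING` is a stand-in for the project's `BUDGET_ROUNDING_PRECISION` (actual value unknown), and the field names are illustrative:

```python
from pydantic import BaseModel, ConfigDict, computed_field

ROUNDING = 6  # stand-in for BUDGET_ROUNDING_PRECISION


class Efficiency(BaseModel):
    model_config = ConfigDict(frozen=True)

    total_cost_usd: float
    total_tokens: int

    @computed_field  # derived on access and included in serialization, never stored
    @property
    def cost_per_1k_tokens(self) -> float:
        if self.total_tokens == 0:
            return 0.0  # zero-token edge case, per the tests above
        return round(self.total_cost_usd / self.total_tokens * 1000, ROUNDING)
```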


179-226: LGTM!

The EfficiencyAnalysis model properly uses @computed_field for inefficient_agent_count, maintains consistent period ordering validation, and follows the established patterns from other models in this file.


231-265: LGTM!

The DowngradeRecommendation model enforces meaningful recommendations with gt=0.0 for savings and a validator ensuring the current and recommended models differ. Good defensive design.


267-291: LGTM!

The DowngradeAnalysis model is a clean aggregation container with appropriate non-negative constraints.


296-327: LGTM!

The ApprovalDecision model correctly allows negative budget_remaining_usd for over-budget scenarios (well-documented). Good use of tuple[NotBlankStr, ...] for conditions to ensure non-blank approval conditions.


332-381: LGTM!

The CostOptimizerConfig model has well-reasoned constraints: gt=1.0 for factors that must exceed baseline, ge=2 for minimum windows ensuring meaningful statistical comparison, and sensible defaults aligned with typical anomaly detection practices.

src/ai_company/budget/reports.py (9)

1-32: LGTM!

The module follows coding guidelines: uses get_logger(__name__) with logger variable name, imports event constant CFO_REPORT_GENERATED from the events module, and properly uses TYPE_CHECKING for type-only imports.


37-53: LGTM!

The TaskSpending model is clean with appropriate constraints and follows established patterns from the codebase.


55-75: LGTM!

The ProviderDistribution model properly constrains percentage_of_total to the valid range [0.0, 100.0].


77-99: LGTM!

The ModelDistribution model maintains consistency with ProviderDistribution while adding the model-provider relationship.


101-142: LGTM!

The PeriodComparison model correctly uses @computed_field for derived values. The <= 0 check on line 136 is appropriately defensive (even though ge=0.0 constraint ensures non-negative values, it guards against division by zero).


187-204: LGTM!

The ranking validators correctly ensure descending order for both top agents and top tasks, maintaining data integrity.


229-306: LGTM!

The generate_report method validates inputs at the system boundary, uses structured logging with the event constant CFO_REPORT_GENERATED, and follows a clear workflow. Good separation between data fetching, aggregation, and report assembly.


308-331: LGTM!

The period comparison calculation correctly computes the previous period without overlap. The early return when both periods have zero cost avoids generating meaningless comparisons.


337-452: LGTM!

The helper functions are clean and follow best practices:

  • math.fsum for precise float aggregation
  • Consistent use of BUDGET_ROUNDING_PRECISION
  • Deterministic output ordering via sorted()
  • Proper type hints with Sequence for input flexibility
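The `math.fsum` point is worth a two-line illustration: naive `sum` accumulates float rounding error across many small cost values, while `fsum` tracks partial sums exactly:

```python
import math

costs = [0.1] * 10
assert sum(costs) != 1.0        # naive sum drifts to 0.9999999999999999
assert math.fsum(costs) == 1.0  # exact summation
```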


Copilot AI left a comment


Pull request overview

Adds the CFO “CostOptimizer” analytics layer and reporting capabilities on top of the existing budget tracking/enforcement stack, aligning with DESIGN_SPEC §10.3 and extending observability coverage for CFO/budget analytics events.

Changes:

  • Introduces CostOptimizer service + domain models for anomaly detection, efficiency analysis, downgrade recommendations, and operation approval decisions.
  • Adds ReportGenerator service and report models for multi-dimensional spending breakdowns and period-over-period comparisons.
  • Extends CostTracker with a get_records() query API, adds new observability event constants, and adds extensive unit test coverage.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/observability/test_events.py Updates expected domain modules to include cfo events domain.
tests/unit/budget/test_tracker_get_records.py Adds unit tests for new CostTracker.get_records() filtering semantics.
tests/unit/budget/test_reports.py Adds unit tests for ReportGenerator and report model validators/computed fields.
tests/unit/budget/test_optimizer_models.py Adds unit tests for optimizer Pydantic models/enums/validators/computed fields.
tests/unit/budget/test_optimizer.py Adds unit tests for CostOptimizer anomaly detection, efficiency, downgrades, and approvals.
tests/unit/budget/conftest.py Adds fixtures/factories for optimizer + report generator.
src/ai_company/observability/events/cfo.py Introduces CFO event constants for structured logging.
src/ai_company/observability/events/budget.py Adds BUDGET_RECORDS_QUERIED event constant.
src/ai_company/budget/tracker.py Adds get_records() API and logs record queries via new event constant.
src/ai_company/budget/reports.py Implements report models + ReportGenerator service.
src/ai_company/budget/optimizer_models.py Implements frozen optimizer domain models + config.
src/ai_company/budget/optimizer.py Implements CostOptimizer service and pure helper functions.
src/ai_company/budget/__init__.py Re-exports optimizer/report services and models from the budget package.
README.md Updates “Budget Enforcement (M5)” description to include CFO optimizer/reporting.
DESIGN_SPEC.md Documents the new M5 implementation note and updates project tree entries.
CLAUDE.md Updates package structure/logging guidance to include CFO optimizer/events.


Comment on lines +333 to +339
async def evaluate_operation(
self,
*,
agent_id: str,
estimated_cost_usd: float,
now: datetime | None = None,
) -> ApprovalDecision:

Copilot AI Mar 9, 2026


evaluate_operation() accepts estimated_cost_usd without validating that it is non-negative. A negative estimate can reduce projected_cost and incorrectly approve operations (or skip high-cost conditions). Add explicit input validation (e.g., raise ValueError when estimated_cost_usd < 0).
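The guard Copilot suggests could look like the fragment below. Names mirror the PR's `evaluate_operation` parameter, but this is an illustrative sketch, not the project's code; the `math.isnan` check is an extra precaution, since `nan < 0` is `False` and NaN would otherwise slip past the comparison:

```python
import math


def validate_estimated_cost(estimated_cost_usd: float) -> None:
    """Reject estimates that could shrink projected_cost below monthly_cost."""
    if math.isnan(estimated_cost_usd) or estimated_cost_usd < 0:
        raise ValueError(
            f"estimated_cost_usd must be non-negative, got {estimated_cost_usd}"
        )
```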

Comment on lines +625 to +633
severity=severity,
description=(
f"Agent {agent_id!r} spent ${current:.2f} vs "
f"${mean:.2f} baseline ({deviation:.1f} sigma)"
),
current_value=current,
baseline_value=round(mean, BUDGET_ROUNDING_PRECISION),
deviation_factor=round(deviation, BUDGET_ROUNDING_PRECISION),
detected_at=now,

Copilot AI Mar 9, 2026


When stddev == 0 but a spike is detected, the anomaly description still reports "(0.0 sigma)" and deviation_factor is forced to 0.0, which is misleading (sigma deviation is undefined in this case). Consider adjusting the message/fields for the stddev == 0 path (e.g., report spike ratio instead of sigma, and/or make the stored deviation metric consistent with what severity is based on).
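The fix this comment proposes could be sketched as follows: when the baseline stddev is zero, report the spike ratio (which severity was actually based on) instead of an undefined sigma deviation. Names here are illustrative, not the PR's actual helpers:

```python
def describe_spike(current: float, mean: float, stddev: float) -> tuple[str, float]:
    """Return a human-readable deviation description and the metric it is based on."""
    if stddev > 0:
        deviation = (current - mean) / stddev
        return f"{deviation:.1f} sigma above baseline", deviation
    # Flat history: sigma deviation is undefined, so fall back to the spike ratio.
    ratio = current / mean if mean > 0 else 0.0
    return f"{ratio:.1f}x baseline (stddev=0)", ratio
```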


greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR introduces the CostOptimizer and ReportGenerator services — the CFO analytical layer backing DESIGN_SPEC §10.3 — along with their domain models, 14 structured event constants, a new CostTracker.get_records() query method, and 65 tests achieving 96% coverage. The implementation is well-engineered: frozen Pydantic models, @computed_field for derived values, pre-grouped O(N+M) record iteration, and careful use of asyncio.TaskGroup in recommend_downgrades.

Key findings from this review:

  • Missing INFO log on evaluate_operation's enforcement-disabled path (optimizer.py:471): the total_monthly <= 0 early return emits no log event, violating the CLAUDE.md "all state transitions must log at INFO" rule and making this production code path invisible.
  • approval_auto_deny_alert_level = NORMAL footgun (optimizer_models.py:383): the field accepts BudgetAlertLevel.NORMAL, which maps to order 0 in _ALERT_LEVEL_ORDER, causing _check_denial to auto-deny every operation when misconfigured — no validator guards against it.
  • Stale __all__ re-export of private _classify_severity (optimizer.py:751): the comment cites "backwards compatibility with tests" but those tests already import from _optimizer_helpers; the export leaks a module-private helper into the public API surface.
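The NORMAL footgun Greptile describes could be closed with a field validator. `BudgetAlertLevel` and `OptimizerConfig` below are stand-ins (the real enum members and config fields may differ); only the guard pattern is the point:

```python
from enum import Enum

from pydantic import BaseModel, field_validator


class BudgetAlertLevel(str, Enum):
    # Stand-in for the project's enum; real members may differ.
    NORMAL = "normal"
    WARNING = "warning"
    CRITICAL = "critical"
    EXCEEDED = "exceeded"


class OptimizerConfig(BaseModel):
    approval_auto_deny_alert_level: BudgetAlertLevel = BudgetAlertLevel.EXCEEDED

    @field_validator("approval_auto_deny_alert_level")
    @classmethod
    def _reject_normal(cls, v: BudgetAlertLevel) -> BudgetAlertLevel:
        # Auto-denying at NORMAL (order 0) would deny every operation.
        if v is BudgetAlertLevel.NORMAL:
            raise ValueError(
                "auto-deny at NORMAL would deny every operation; use WARNING or higher"
            )
        return v
```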

Confidence Score: 3/5

  • Mostly safe to merge; the missing INFO log is a CLAUDE.md violation and the NORMAL auto-deny footgun is a misconfiguration risk, but neither causes data corruption or breaks existing functionality.
  • The core logic is correct and well-tested (96% coverage, 65 tests). The three new issues are: a missing log on one code path (violating convention but not breaking correctness), a validator gap that enables a dangerous misconfiguration, and an unused private-symbol re-export. The PR also carries forward previously flagged issues — sequential async calls in generate_report and a residual double-snapshot for top_agents — which are unresolved. Together these reduce confidence below 4.
  • src/ai_company/budget/optimizer.py (missing log, stale __all__) and src/ai_company/budget/optimizer_models.py (NORMAL deny footgun) need attention before merge.

Important Files Changed

Filename Overview
src/ai_company/budget/optimizer.py Core CFO analytical service — well-structured, but the budget-enforcement-disabled early return in evaluate_operation silently skips the INFO log mandated by CLAUDE.md, and a stale __all__ exports a private helper.
src/ai_company/budget/optimizer_models.py Frozen Pydantic domain models with good use of @computed_field and cross-field validators; CostOptimizerConfig.approval_auto_deny_alert_level permits NORMAL, silently causing all operations to be auto-denied if misconfigured.
src/ai_company/budget/reports.py Multi-dimensional report generator; distribution percentages are now consistently derived from a single records snapshot, but top_agents_by_cost still draws from the separate summary snapshot, leaving a residual inconsistency between the two rankings.
src/ai_company/budget/_optimizer_helpers.py Pure stateless helper functions correctly extracted for the 800-line limit; _detect_spike_anomaly now properly uses spike_ratio as deviation_factor when stddev == 0, resolving the previously flagged misleading value.
src/ai_company/budget/tracker.py Minimal, clean addition of get_records() query method with proper time-range validation and debug logging via the new BUDGET_RECORDS_QUERIED event constant.
src/ai_company/observability/events/cfo.py New CFO event constants module — 14 Final[str] constants covering all observable state transitions introduced by this PR; correctly follows the domain event pattern.
tests/unit/budget/test_optimizer_decisions.py Thorough decision-path tests for approvals and downgrades; covers negative cost rejection, projected alert level, and enforcement-disabled path — though there is no assertion that the disabled-budget approval is logged at INFO.
tests/unit/budget/test_optimizer_analysis.py Comprehensive anomaly detection and efficiency tests including zero-stddev spike severity regression, zero-baseline spike, and window-count boundary validation.
tests/unit/budget/test_reports.py Reports test coverage looks solid with breakdowns and period comparison; the residual double-snapshot inconsistency between top_agents and top_tasks is not directly tested.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant CostOptimizer
    participant CostTracker
    participant ReportGenerator

    Note over CostOptimizer: detect_anomalies / analyze_efficiency / recommend_downgrades
    Caller->>CostOptimizer: detect_anomalies(start, end, window_count)
    CostOptimizer->>CostTracker: get_records(start, end)
    CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
    CostOptimizer->>CostOptimizer: _group_records_by_agent()
    loop per agent
        CostOptimizer->>CostOptimizer: _compute_window_costs()
        CostOptimizer->>CostOptimizer: _detect_spike_anomaly()
    end
    CostOptimizer-->>Caller: AnomalyDetectionResult

    Note over CostOptimizer: recommend_downgrades (parallel fetch)
    Caller->>CostOptimizer: recommend_downgrades(start, end)
    par asyncio.TaskGroup
        CostOptimizer->>CostTracker: get_records(start, end)
        CostOptimizer->>CostTracker: get_total_cost(billing_period_start)
    end
    CostTracker-->>CostOptimizer: records + budget_pressure
    CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
    CostOptimizer->>CostOptimizer: _build_recommendations()
    CostOptimizer-->>Caller: DowngradeAnalysis

    Note over CostOptimizer: evaluate_operation
    Caller->>CostOptimizer: evaluate_operation(agent_id, estimated_cost_usd)
    alt total_monthly <= 0
        CostOptimizer-->>Caller: ApprovalDecision(approved=True, enforcement_disabled)
    else budget active
        CostOptimizer->>CostTracker: get_total_cost(period_start)
        CostTracker-->>CostOptimizer: monthly_cost
        CostOptimizer->>CostOptimizer: _check_denial(projected_alert)
        alt denied
            CostOptimizer-->>Caller: ApprovalDecision(approved=False)
        else approved
            CostOptimizer->>CostOptimizer: _build_approval_conditions()
            CostOptimizer-->>Caller: ApprovalDecision(approved=True, conditions)
        end
    end

    Note over ReportGenerator: generate_report (sequential — asyncio.TaskGroup pending)
    Caller->>ReportGenerator: generate_report(start, end, top_n)
    ReportGenerator->>CostTracker: get_records(start, end)
    CostTracker-->>ReportGenerator: records snapshot 1
    ReportGenerator->>CostTracker: build_summary(start, end)
    CostTracker-->>ReportGenerator: summary snapshot 2
    ReportGenerator->>ReportGenerator: _build_task/provider/model distributions (from records)
    ReportGenerator->>ReportGenerator: _build_top_agents (from summary ⚠️ different snapshot)
    ReportGenerator-->>Caller: SpendingReport

Last reviewed commit: f909c79


greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR delivers the CFO cost optimization layer for the budget module: a CostOptimizer service (anomaly detection, efficiency analysis, model downgrade recommendations, operation approval), a ReportGenerator service (multi-dimensional spending reports with period comparison), supporting frozen Pydantic domain models, 11 CFO event constants, a new get_records() method on CostTracker, and 65 new tests. The implementation is well-structured and follows project conventions closely, but two functional issues require attention before merge.

Key findings:

  • [Logic — optimizer.py] evaluate_operation lacks input validation for estimated_cost_usd. A negative value reduces projected_cost below monthly_cost, allowing the hard-stop guard to pass incorrectly — a budget-bypass path that contradicts the safety guarantees advertised in the PR description ("explicit input validation on all public methods").
  • [Logic — reports.py] generate_report takes two independent async snapshots (build_summary then get_records). A record added between the two awaits will appear in only one snapshot, causing provider/model percentages to sum to less than 100 % and making top-agents and top-tasks rankings derived from inconsistent data sets.
  • [Style — optimizer.py] When stddev == 0 and a spike is detected, deviation_factor is stored as 0.0 while severity may be HIGH, producing contradictory signals for consumers. Storing the spike_ratio in this path would make the data self-consistent.
  • [Style — optimizer.py] The CFO_APPROVAL_EVALUATED log fires before the ApprovalDecision object is fully constructed; the log should be moved after the object is created to avoid a misleading entry if Pydantic validation raises.
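The double-snapshot issue above suggests a single-snapshot structure: fetch the records once and derive every aggregate (totals, percentages, rankings) from that one tuple, so a record written between awaits cannot skew the denominator. The sketch below uses stand-in types, not the project's `CostRecord` or `ReportGenerator`:

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Illustrative stand-in for the project's CostRecord."""

    provider: str
    cost_usd: float


async def build_report(get_records) -> dict[str, float]:
    records = await get_records()  # single snapshot; no second fetch
    total = sum(r.cost_usd for r in records)
    by_provider: dict[str, float] = {}
    for r in records:
        by_provider[r.provider] = by_provider.get(r.provider, 0.0) + r.cost_usd
    # Percentages computed from the same snapshot always sum to at most 100.
    return {p: (c / total * 100 if total else 0.0) for p, c in by_provider.items()}
```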

Confidence Score: 3/5

  • Not safe to merge as-is — the missing estimated_cost_usd validation creates a budget-bypass path and the double-snapshot in generate_report produces silently inconsistent report data.
  • Two confirmed logic issues exist: (1) a negative estimated_cost_usd passed to evaluate_operation can cause the hard-stop check to pass when it should deny, undermining the core safety contract of the CFO service; (2) the double-fetch in generate_report is a concurrency inconsistency that silently corrupts report percentages. The rest of the implementation — models, event constants, anomaly detection math, downgrade resolution — is solid and well-tested.
  • src/ai_company/budget/optimizer.py (input validation gap in evaluate_operation) and src/ai_company/budget/reports.py (double-snapshot in generate_report)

Important Files Changed

Filename Overview
src/ai_company/budget/optimizer.py New CostOptimizer service (799 lines) with anomaly detection, efficiency analysis, downgrade recommendations, and approval decisions; missing input validation on evaluate_operation (negative cost bypasses hard-stop guard) and a pre-construction approval log
src/ai_company/budget/reports.py New ReportGenerator service; double snapshot in generate_report (build_summary + get_records taken separately) can produce inconsistent provider/model percentages and mismatched top-agents vs top-tasks rankings
src/ai_company/budget/optimizer_models.py Well-structured frozen Pydantic models with appropriate computed fields, cross-field validators, and NotBlankStr identifiers; no issues found
src/ai_company/budget/tracker.py Adds get_records() query method with correct lock/snapshot semantics, filter support, and event logging; no issues found
src/ai_company/observability/events/cfo.py 11 CFO event constants (PR description says 12 — minor discrepancy); all constants follow naming conventions
tests/unit/budget/test_optimizer.py Comprehensive unit tests for CostOptimizer including parametrized severity thresholds, zero-baseline spikes, and downgrade paths; good coverage

Sequence Diagram

sequenceDiagram
    participant Caller
    participant CostOptimizer
    participant ReportGenerator
    participant CostTracker
    participant BudgetConfig

    Note over CostOptimizer: detect_anomalies()
    Caller->>CostOptimizer: detect_anomalies(start, end, window_count)
    CostOptimizer->>CostTracker: get_records(start, end)
    CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
    CostOptimizer->>CostOptimizer: _compute_window_costs() per agent
    CostOptimizer->>CostOptimizer: _detect_spike_anomaly() per agent
    CostOptimizer-->>Caller: AnomalyDetectionResult

    Note over CostOptimizer: recommend_downgrades()
    Caller->>CostOptimizer: recommend_downgrades(start, end)
    CostOptimizer->>CostTracker: get_records(start, end)
    CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
    CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
    CostOptimizer->>CostTracker: get_total_cost(start=period_start)
    CostTracker-->>CostOptimizer: monthly_cost
    CostOptimizer->>BudgetConfig: auto_downgrade.downgrade_map
    CostOptimizer->>CostOptimizer: _build_downgrade_recommendation() per agent
    CostOptimizer-->>Caller: DowngradeAnalysis

    Note over CostOptimizer: evaluate_operation()
    Caller->>CostOptimizer: evaluate_operation(agent_id, estimated_cost_usd)
    CostOptimizer->>CostTracker: get_total_cost(start=period_start)
    CostTracker-->>CostOptimizer: monthly_cost
    CostOptimizer->>CostOptimizer: _compute_alert_level()
    CostOptimizer-->>Caller: ApprovalDecision

    Note over ReportGenerator: generate_report()
    Caller->>ReportGenerator: generate_report(start, end, top_n)
    ReportGenerator->>CostTracker: build_summary(start, end)
    CostTracker-->>ReportGenerator: SpendingSummary (snapshot 1)
    ReportGenerator->>CostTracker: get_records(start, end)
    CostTracker-->>ReportGenerator: tuple[CostRecord, ...] (snapshot 2)
    ReportGenerator->>ReportGenerator: _build_task_spendings()
    ReportGenerator->>ReportGenerator: _build_provider_distribution()
    ReportGenerator->>ReportGenerator: _build_model_distribution()
    ReportGenerator->>CostTracker: build_summary(prev_start, prev_end)
    CostTracker-->>ReportGenerator: prev SpendingSummary
    ReportGenerator-->>Caller: SpendingReport

Last reviewed commit: 9048bf8


greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR implements the CFO cost optimization layer for the budget module, adding CostOptimizer, ReportGenerator, and their domain models as advisory complements to the existing BudgetEnforcer. The new services are well-structured, thoroughly tested (65 tests, 96% coverage), and follow the project's patterns for logging, event constants, and frozen Pydantic models.

Key changes:

  • budget/optimizer.py: CostOptimizer service with spike anomaly detection (sigma + spike-ratio), per-agent efficiency ratings, model downgrade recommendations via ModelResolver, and operation approval evaluation with configurable auto-deny thresholds.
  • budget/optimizer_models.py: Frozen Pydantic models for all CFO domain types, using @computed_field for derived values and model_validator for cross-field invariants.
  • budget/reports.py: ReportGenerator producing multi-dimensional spending reports (task/provider/model breakdowns, period-over-period comparison, top-N rankings).
  • budget/tracker.py: New get_records() query method used by both optimizer and report generator.
  • observability/events/cfo.py: 11 new CFO event constants; BUDGET_RECORDS_QUERIED added to the budget events module.

Issues found:

  • generate_report takes two independent async snapshots (build_summary then get_records). Records added between the two await expressions produce a total_cost denominator that doesn't match the records used for distribution calculations, potentially causing distribution percentages to exceed 100.0 and trigger a Pydantic ValidationError on ProviderDistribution or ModelDistribution.
  • SpendingAnomaly.deviation_factor is documented as "Set to 0.0 when the baseline is zero" but is also 0.0 when historical stddev is zero (identical spending history, non-zero mean) — two semantically distinct situations that consumers may need to distinguish.
  • approval_warn_threshold_usd allows ge=0.0, so setting it to 0 causes the "High-cost" condition to be attached to every approved operation regardless of cost.

Confidence Score: 3/5

  • Mostly safe to merge, but the double-snapshot race in generate_report can produce a Pydantic ValidationError in concurrent use and should be addressed first.
  • The core optimizer logic is correct and well-tested. The main concern is generate_report's two independent async snapshots: in a concurrent async environment this can yield a total_cost denominator that doesn't cover all records used for percentage calculations, potentially violating the le=100.0 Pydantic constraint and raising a ValidationError at runtime. The two style issues (deviation_factor docstring, zero warn threshold) are low-risk but worth fixing for API clarity.
  • src/ai_company/budget/reports.py — double-snapshot inconsistency in generate_report.

Important Files Changed

Filename Overview
src/ai_company/budget/optimizer.py New CostOptimizer service implementing anomaly detection (sigma + spike ratio), efficiency analysis, downgrade recommendations, and operation approval. Logic is well-structured and tested. One style issue: approval_warn_threshold_usd=0 silently attaches a "High-cost" condition to every approved operation.
src/ai_company/budget/optimizer_models.py Frozen Pydantic models for all CFO domain concepts with good use of computed fields, cross-field validators, and strict constraints. Minor documentation inaccuracy in SpendingAnomaly.deviation_factor: the zero-stddev path also sets it to 0.0, but the docstring only documents the zero-baseline case.
src/ai_company/budget/reports.py ReportGenerator service with multi-dimensional breakdowns. Contains a potential race condition: build_summary and get_records take independent async snapshots; records added between the two awaits can cause distribution percentages to exceed 100.0 and trigger a Pydantic ValidationError.
src/ai_company/budget/tracker.py Clean addition of get_records() query method following the existing _snapshot/_filter_records pattern with proper lock usage, logging, and time-range validation.
src/ai_company/observability/events/cfo.py New CFO event constants module with 11 typed Final[str] constants following the established domain-event naming pattern (cfo.*).

Sequence Diagram

sequenceDiagram
    participant Caller
    participant CostOptimizer
    participant ReportGenerator
    participant CostTracker
    participant BudgetConfig
    participant ModelResolver

    Note over Caller,ModelResolver: detect_anomalies()
    Caller->>CostOptimizer: detect_anomalies(start, end, window_count)
    CostOptimizer->>CostTracker: get_records(start, end)
    CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
    CostOptimizer->>CostOptimizer: _compute_window_costs() per agent
    CostOptimizer->>CostOptimizer: _detect_spike_anomaly() per agent
    CostOptimizer-->>Caller: AnomalyDetectionResult

    Note over Caller,ModelResolver: analyze_efficiency()
    Caller->>CostOptimizer: analyze_efficiency(start, end)
    CostOptimizer->>CostTracker: get_records(start, end)
    CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
    CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
    CostOptimizer-->>Caller: EfficiencyAnalysis

    Note over Caller,ModelResolver: recommend_downgrades()
    Caller->>CostOptimizer: recommend_downgrades(start, end)
    CostOptimizer->>CostTracker: get_records(start, end)
    CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
    CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
    CostOptimizer->>CostTracker: get_total_cost(period_start) [budget pressure]
    CostOptimizer->>ModelResolver: resolve_safe(model) + all_models_sorted_by_cost()
    CostOptimizer-->>Caller: DowngradeAnalysis

    Note over Caller,ModelResolver: evaluate_operation()
    Caller->>CostOptimizer: evaluate_operation(agent_id, estimated_cost, now)
    CostOptimizer->>BudgetConfig: read total_monthly, alerts, reset_day
    CostOptimizer->>CostTracker: get_total_cost(period_start)
    CostTracker-->>CostOptimizer: monthly_cost
    CostOptimizer->>CostOptimizer: _compute_alert_level()
    CostOptimizer-->>Caller: ApprovalDecision

    Note over Caller,ModelResolver: generate_report()
    Caller->>ReportGenerator: generate_report(start, end, top_n)
    ReportGenerator->>CostTracker: build_summary(start, end) [snapshot #1]
    CostTracker-->>ReportGenerator: SpendingSummary
    ReportGenerator->>CostTracker: get_records(start, end) [snapshot #2]
    CostTracker-->>ReportGenerator: tuple[CostRecord, ...]
    ReportGenerator->>CostTracker: build_summary(prev_start, prev_end) [period comparison]
    CostTracker-->>ReportGenerator: SpendingSummary
    ReportGenerator-->>Caller: SpendingReport

Comments Outside Diff (2)

  1. src/ai_company/budget/reports.py, line 1628-1635 (link)

    **Inconsistent snapshots between `build_summary` and `get_records`**

    `generate_report` makes two independent `await` calls that each acquire their own lock and take a separate in-memory snapshot. Between the two `await` expressions, the asyncio event loop can interleave other coroutines that call `tracker.record(...)`, meaning `records` could contain more entries than what produced `total_cost` in `summary`.

    This creates a real inconsistency: `total_cost = summary.period.total_cost_usd` is used as the denominator in `_build_provider_distribution` and `_build_model_distribution`. If a new record is added between the two snapshots, an individual provider's aggregated cost could exceed `total_cost`, causing its `percentage_of_total` to exceed 100.0, which will trigger a Pydantic `ValidationError` (`le=100.0` constraint on `ProviderDistribution` and `ModelDistribution`).

    The simplest fix is to derive `total_cost` from the same `records` tuple rather than from `summary`:

    ```python
    records = await self._cost_tracker.get_records(start=start, end=end)
    total_cost = round(math.fsum(r.cost_usd for r in records), BUDGET_ROUNDING_PRECISION)
    ```

    Alternatively, a combined atomic operation (fetch records once, build summary from them, then compute distributions) would eliminate the race entirely.

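The single-snapshot fix described in the comment above can be sketched with stdlib code only. The simplified `CostRecord` dataclass and the `BUDGET_ROUNDING_PRECISION` value below are illustrative stand-ins for the project's types, not its actual definitions:

```python
import math
from dataclasses import dataclass

BUDGET_ROUNDING_PRECISION = 6  # assumed value; the real constant lives in the budget module


@dataclass(frozen=True)
class CostRecord:
    """Simplified stand-in for the project's frozen Pydantic cost record."""

    provider: str
    cost_usd: float


def provider_distribution(records: list[CostRecord]) -> dict[str, float]:
    """Compute percentage-of-total per provider from one records snapshot.

    Because the denominator comes from the same snapshot as the per-provider
    sums, no concurrent write can push a percentage past 100.0.
    """
    total = math.fsum(r.cost_usd for r in records)
    if total == 0.0:
        return {}
    by_provider: dict[str, float] = {}
    for r in records:
        by_provider[r.provider] = by_provider.get(r.provider, 0.0) + r.cost_usd
    return {
        provider: round(cost / total * 100.0, BUDGET_ROUNDING_PRECISION)
        for provider, cost in by_provider.items()
    }


records = [
    CostRecord(provider="example-provider", cost_usd=3.0),
    CostRecord(provider="other-provider", cost_usd=1.0),
]
dist = provider_distribution(records)
```

With a single snapshot, the `le=100.0` constraint cannot be violated by writes that land between two `await` points.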
  2. src/ai_company/budget/optimizer_models.py, line 1064-1067 (link)

    **`deviation_factor` docstring inaccurate for the zero-stddev case**

    The field description says "Set to 0.0 when the baseline is zero (no historical spending)." However, `deviation_factor` is also stored as `0.0` when all historical window values are identical (mean > 0, stddev == 0) — for example, four windows all at `$1.00`. In `_detect_spike_anomaly`, `deviation = (current - mean) / stddev if stddev > 0 else 0.0`, so a constant-baseline spike produces `deviation_factor=0.0` even though the baseline is non-zero.

    A consumer checking `anomaly.deviation_factor == 0.0` to infer "no historical data" would get a false positive in that scenario. The description should be updated to cover both cases.

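The ambiguity is easy to reproduce from the formula quoted above. This is a hedged stdlib sketch of that formula, not the project's actual `_detect_spike_anomaly` code:

```python
import statistics


def deviation_factor(historical: list[float], current: float) -> float:
    # Mirrors the quoted expression:
    #   deviation = (current - mean) / stddev if stddev > 0 else 0.0
    mean = statistics.mean(historical)
    stddev = statistics.stdev(historical)
    return (current - mean) / stddev if stddev > 0 else 0.0


# Constant non-zero baseline: four windows all at $1.00, then a $5.00 spike.
# stddev == 0, so the factor is 0.0 even though the baseline is non-zero.
constant_baseline = deviation_factor([1.0, 1.0, 1.0, 1.0], current=5.0)

# Varying baseline: the same spike yields a genuine sigma deviation.
varying_baseline = deviation_factor([1.0, 2.0, 1.0, 2.0], current=5.0)
```

A consumer treating `0.0` as "no historical data" would misread the first case, which is exactly the false positive the comment describes.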

Last reviewed commit: 9048bf8

- Add routing optimization feature (#1): new suggest_routing_optimizations()
  method, RoutingSuggestion and RoutingOptimizationAnalysis models
- Add negative estimated_cost_usd validation (#2)
- Fix double snapshot in generate_report (#3)
- Fix deviation_factor to use spike_ratio when stddev=0 (#4)
- Convert DowngradeAnalysis.total_estimated_savings_per_1k to @computed_field (#5)
- Change str to NotBlankStr in SpendingReport tuple fields (#6)
- Add window_count upper bound validation (#7)
- Pre-group records by agent for O(N+M) complexity (#8)
- Update DESIGN_SPEC.md implementation snapshot (#9)
- Use projected alert level for auto-deny check (#11)
- Move approval log after ApprovalDecision construction (#12)
- Add ReportGenerator.__init__ debug log + event constant (#13)
- Fix _ALERT_LEVEL_ORDER comment (#14)
- Fix _classify_severity docstring for dual-use (#15)
- Add WARNING logs before ValueError raises (#16)
- Update evaluate_operation docstring (#17)
- Add sort-order validator to EfficiencyAnalysis.agents (#18)
- Add debug log when _find_most_used_model returns None (#19)
- Remove redundant stddev > 0 check in is_sigma_anomaly (#20)
- Document approval_warn_threshold_usd=0.0 behavior (#21)
- Extract helpers to _optimizer_helpers.py to stay under 800-line limit
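The pre-grouping change (#8) above can be sketched as follows; the record shape here is a hypothetical simplification of the project's `CostRecord`:

```python
from collections import defaultdict


def group_records_by_agent(
    records: list[dict[str, object]],
) -> dict[str, list[dict[str, object]]]:
    """Group once in O(N) so each per-agent pass scans only its own subset.

    Overall cost becomes O(N + M) for N records and M agents, instead of
    re-scanning all N records for every agent (O(N * M)).
    """
    by_agent: dict[str, list[dict[str, object]]] = defaultdict(list)
    for record in records:
        by_agent[str(record["agent_id"])].append(record)
    return dict(by_agent)


grouped = group_records_by_agent(
    [
        {"agent_id": "engineer", "cost_usd": 0.5},
        {"agent_id": "cfo", "cost_usd": 0.1},
        {"agent_id": "engineer", "cost_usd": 0.2},
    ]
)
```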

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 1848-1853: The M5 implementation note omits routing optimization;
update the description for CostOptimizer in budget/optimizer.py (the
"CostOptimizer" service) to include routing optimization suggestions alongside
anomaly detection, per-agent efficiency analysis, model downgrade
recommendations (via ModelResolver), and operation approval evaluation, and also
mention that ReportGenerator (budget/reports.py) includes routing-aware
breakdowns in its multi-dimensional spending reports and period-over-period
comparisons so the CFO feature summary remains current.

In `@src/ai_company/budget/_optimizer_helpers.py`:
- Around line 290-300: The code incorrectly accepts a cheaper model solely by
price even if it reduces context size; update the block that calls
_find_cheaper_model (the branch setting target_ref when target_ref is None) to
validate that the returned cheaper model's max_context
(cheaper.model_info.max_context or resolved info via resolver) is >=
current_resolved.model_info.max_context (or the routing-required context), and
if not, treat it as unavailable: log CFO_DOWNGRADE_SKIPPED with reason
"no_cheaper_model_preserving_context" and return None. Apply the same check to
the analogous branch around lines 340-349 so downgrades never pick models that
shrink capability.
- Around line 110-190: The _detect_spike_anomaly function is too large and mixes
validation, zero-baseline handling, threshold evaluation, severity
classification, and SpendingAnomaly construction; refactor by splitting it into
small helpers (e.g., _validate_windows(agent_id, window_costs, config),
_handle_zero_baseline(agent_id, current, now, window_starts, window_duration),
_evaluate_spike_and_sigma(historical, current, config) which returns (is_spike,
is_sigma_anomaly, spike_ratio, deviation, stddev), and
_build_spending_anomaly(agent_id, current, mean, effective_deviation, severity,
now, window_starts, window_duration)). Keep existing behavior and return values
(use _classify_severity for severity, round baseline_value and deviation_factor
per BUDGET_ROUNDING_PRECISION, and preserve SpendingAnomaly fields), then
simplify _detect_spike_anomaly to call these helpers in sequence so the
top-level function is under 50 lines.

In `@src/ai_company/budget/optimizer.py`:
- Around line 338-339: The code repeatedly calls _find_most_used_model(records,
agent.agent_id) and rescans the whole window per agent; instead use the existing
by_agent grouping within suggest_routing_optimizations to avoid O(agent_count ×
record_count). Change the call sites (including the similar block around lines
423-429) to pass only that agent's records (e.g., by_agent[agent.agent_id]) or
refactor _find_most_used_model to accept an agent-specific records list so the
function scans only that subset; update references to most_used_model
accordingly.
- Around line 550-565: The approval path currently uses current values
(used_pct, alert_level) to build conditions, budget_used_percent, and the INFO
log, which misses when the proposed spend crosses thresholds; update the
approval branch that constructs conditions and the
budget_used_percent/alert_level logging to use projected_pct and projected_alert
when projected_alert > alert_level (i.e., crossing into a higher alert),
otherwise keep the current values; reference the computed names projected_pct,
projected_alert, used_pct, alert_level and the helper _compute_alert_level so
you locate the logic that assembles conditions and logs, and apply the same
change in the analogous block around projected_pct/projected_alert at the other
location (lines noted in review).
- Around line 376-378: The recommendation logic currently only checks cost and
max_context and ignores latency; update the candidate filter in the
recommendation function (the code that compares cost and max_context using
estimated_latency_ms from the model resolver) to enforce a latency guard: when
both the source model and candidate expose estimated_latency_ms, skip any
candidate whose estimated_latency_ms exceeds the source estimated_latency_ms
multiplied by a configurable max_latency_ratio (e.g., 1.1) or a hard threshold,
and surface that decision in the returned suggestion metadata; add a small
unit-test or example to cover the case where a cheaper model is rejected due to
higher latency and document the new max_latency_ratio configuration.
- Around line 301-309: The early-return branch that fires when
self._model_resolver is None currently returns DowngradeAnalysis with
budget_pressure_percent=0.0 which is wrong; change it to compute the actual
budget pressure using the same logic used elsewhere (reuse the existing helper
that calculates budget pressure—e.g., compute_budget_pressure /
_calculate_budget_pressure / similar budget pressure function used by this
class) and pass that real value into DowngradeAnalysis while still returning
empty recommendations; keep the CFO_RESOLVER_MISSING warning but replace the
hard-coded 0.0 with the computed budget_pressure_percent.

In `@src/ai_company/budget/reports.py`:
- Around line 272-280: The current code awaits two separate tracker calls
(_cost_tracker.get_records and _cost_tracker.build_summary) which allows
intervening writes to cause summary to drift; instead generate the summary from
the same records snapshot (use the already-fetched variable records to compute
summary) or add/use a tracker helper that accepts a records snapshot (e.g., a
new method like build_summary_from_snapshot(records) on _cost_tracker) and
replace the build_summary call so that summary is derived from records, ensuring
by_task/by_provider/by_model/top_agents_by_cost remain consistent with records.
- Around line 263-268: In the two validation branches where you currently raise
ValueError for "start >= end" and "top_n < 1", add a WARNING-level CFO event log
(using the project's CFO event constant API) that emits the same context message
and includes the values of start, end, and top_n before raising; specifically,
in the branches surrounding the checks for start >= end and top_n < 1 (the
blocks that construct msg and raise ValueError), call the CFO warning/emitter
with the msg and any additional context fields (start.isoformat(),
end.isoformat(), top_n) so the warning is recorded via the CFO event constant
prior to raising the ValueError.
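A minimal sketch of the requested validation logging. The project mandates structured logging via `get_logger` and event constants from the observability package; stdlib `logging` and the event-name string below are substitutions to keep the sketch self-contained:

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

# Hypothetical stand-in for the project's CFO event constant.
CFO_REPORT_INPUT_INVALID = "cfo.report.input_invalid"


def validate_report_inputs(start: datetime, end: datetime, top_n: int) -> None:
    """Log at WARNING with context before raising, per the review guidance."""
    if start >= end:
        logger.warning(
            "%s start=%s end=%s",
            CFO_REPORT_INPUT_INVALID,
            start.isoformat(),
            end.isoformat(),
        )
        raise ValueError("start must be before end")
    if top_n < 1:
        logger.warning("%s top_n=%d", CFO_REPORT_INPUT_INVALID, top_n)
        raise ValueError("top_n must be >= 1")
```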

In `@tests/unit/budget/test_optimizer.py`:
- Around line 645-652: The test test_find_cheaper_model_picks_cheapest never
exercises _find_cheaper_model because recommend_downgrades returns early on
empty data; either seed an inefficient usage record before calling
recommend_downgrades so the _find_cheaper_model path runs and assert the chosen
cheaper model, or rename the test to reflect empty-state behavior. Concretely,
in the test that calls _make_resolver() and _make_optimizer(), add a
fixture/seeded record (matching whatever helper you use to insert records in
tests) representing an inefficient/high-cost model so recommend_downgrades
evaluates downgrades, then assert the returned recommendation target; otherwise
change the test name and expected assertion to indicate it verifies the
empty-data result from recommend_downgrades.
- Around line 1-900: The test module is too large; split it into smaller focused
test files by moving the related test classes into separate modules (e.g.,
tests/unit/budget/test_anomalies.py, test_efficiency.py, test_downgrades.py,
test_approval.py, test_routing.py). Extract shared helpers/constants (_START,
_END, _make_optimizer, _make_resolver, make_cost_record import) into a common
test helper or conftest (e.g., tests/unit/budget/test_helpers.py or reuse
tests/unit/budget/conftest.py) and update imports in each new file; preserve
pytest.mark.unit decorators and keep each test class (TestDetectAnomalies,
TestAnalyzeEfficiency, TestRecommendDowngrades, TestEvaluateOperation,
TestSuggestRoutingOptimizations, TestClassifySeverity, TestInputValidation,
TestEdgeCases) intact when moving so tests and references (CostOptimizer,
CostTracker, CostOptimizerConfig, BudgetConfig, ModelResolver, ResolvedModel,
_classify_severity) still resolve. Ensure no duplicate fixtures/names and run
pytest to verify imports and test discovery.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: f608e87d-6969-44a1-b81d-dca1fd84730f

📥 Commits

Reviewing files that changed from the base of the PR and between 9048bf8 and 69f06c1.

📒 Files selected for processing (9)
  • DESIGN_SPEC.md
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/_optimizer_helpers.py
  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/reports.py
  • src/ai_company/observability/events/cfo.py
  • tests/unit/budget/test_optimizer.py
  • tests/unit/budget/test_optimizer_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Do NOT use from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax (without parentheses) per PEP 758 — ruff enforces this on Python 3.14
All public functions must have type hints; use mypy strict mode for type-checking
Use Google-style docstrings on all public classes and functions; enforced by ruff D rules
Create new objects instead of mutating existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves; never mix static config fields with mutable runtime fields in one model
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use @computed_field for derived values instead of storing redundant fields; use NotBlankStr for all identifier/name fields (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Use line length of 88 characters (ruff)

Files:

  • tests/unit/budget/test_optimizer_models.py
  • tests/unit/budget/test_optimizer.py
  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/_optimizer_helpers.py
  • src/ai_company/observability/events/cfo.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/reports.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Mark tests with @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow
Prefer @pytest.mark.parametrize for testing similar cases
In tests, use test-provider, test-small-001, etc. instead of real vendor names

Files:

  • tests/unit/budget/test_optimizer_models.py
  • tests/unit/budget/test_optimizer.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Every module with business logic must import and use get_logger(name) from ai_company.observability; never use import logging or logging.getLogger() or print() in application code
Always use 'logger' as the variable name (not '_logger', not 'log')
Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals
Use structured logging with logger.info(EVENT, key=value) — never use logger.info('msg %s', val) string formatting
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions

Files:

  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/_optimizer_helpers.py
  • src/ai_company/observability/events/cfo.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/reports.py
src/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples; use generic names (example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small aliases)

Files:

  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/_optimizer_helpers.py
  • src/ai_company/observability/events/cfo.py
  • src/ai_company/budget/optimizer_models.py
  • src/ai_company/budget/__init__.py
  • src/ai_company/budget/reports.py
🧠 Learnings (8)
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals

Applied to files:

  • DESIGN_SPEC.md
  • src/ai_company/observability/events/cfo.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Keep functions under 50 lines and files under 800 lines

Applied to files:

  • src/ai_company/budget/optimizer.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task

Applied to files:

  • src/ai_company/budget/optimizer.py
  • src/ai_company/budget/reports.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All state transitions must log at INFO level

Applied to files:

  • src/ai_company/budget/optimizer.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All error paths must log at WARNING or ERROR with context before raising

Applied to files:

  • src/ai_company/budget/optimizer.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves; never mix static config fields with mutable runtime fields in one model

Applied to files:

  • src/ai_company/budget/optimizer_models.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use computed_field for derived values instead of storing redundant fields; use NotBlankStr for all identifier/name fields (including optional and tuple variants) instead of manual whitespace validators

Applied to files:

  • src/ai_company/budget/reports.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions

Applied to files:

  • src/ai_company/budget/reports.py
🧬 Code graph analysis (6)
tests/unit/budget/test_optimizer.py (10)
src/ai_company/budget/_optimizer_helpers.py (1)
  • _classify_severity (193-205)
src/ai_company/budget/config.py (3)
  • BudgetAlertConfig (15-62)
  • BudgetConfig (151-227)
  • AutoDowngradeConfig (65-148)
src/ai_company/budget/enums.py (1)
  • BudgetAlertLevel (6-16)
src/ai_company/budget/optimizer_models.py (4)
  • AnomalySeverity (34-39)
  • AnomalyType (22-31)
  • CostOptimizerConfig (346-397)
  • EfficiencyRating (42-47)
src/ai_company/budget/tracker.py (2)
  • CostTracker (68-455)
  • record (99-112)
src/ai_company/providers/routing/models.py (1)
  • ResolvedModel (9-52)
src/ai_company/providers/routing/resolver.py (1)
  • ModelResolver (25-205)
tests/unit/budget/conftest.py (2)
  • make_cost_record (286-307)
  • cost_tracker (262-270)
src/ai_company/budget/billing.py (1)
  • billing_period_start (11-45)
tests/unit/budget/test_reports.py (1)
  • test_start_after_end_rejected (344-347)
src/ai_company/budget/optimizer.py (6)
src/ai_company/budget/_optimizer_helpers.py (5)
  • _build_efficiency_from_records (46-91)
  • _classify_severity (193-205)
  • _compute_window_costs (94-107)
  • _find_most_used_model (239-255)
  • _group_records_by_agent (367-374)
src/ai_company/budget/tracker.py (2)
  • get_records (185-225)
  • get_total_cost (114-137)
src/ai_company/budget/billing.py (1)
  • billing_period_start (11-45)
src/ai_company/budget/enums.py (1)
  • BudgetAlertLevel (6-16)
src/ai_company/budget/optimizer_models.py (8)
  • DowngradeAnalysis (276-304)
  • DowngradeRecommendation (240-273)
  • EfficiencyAnalysis (179-234)
  • EfficiencyRating (42-47)
  • inefficient_agent_count (206-212)
  • estimated_savings_per_1k (436-441)
  • total_estimated_savings_per_1k (299-304)
  • total_estimated_savings_per_1k (491-496)
src/ai_company/providers/routing/resolver.py (4)
  • ModelResolver (25-205)
  • all_models (174-177)
  • all_models_sorted_by_cost (179-189)
  • resolve_safe (154-172)
src/ai_company/budget/_optimizer_helpers.py (6)
src/ai_company/budget/enums.py (1)
  • BudgetAlertLevel (6-16)
src/ai_company/budget/optimizer_models.py (9)
  • AgentEfficiency (142-176)
  • AnomalySeverity (34-39)
  • AnomalyType (22-31)
  • DowngradeRecommendation (240-273)
  • EfficiencyAnalysis (179-234)
  • EfficiencyRating (42-47)
  • SpendingAnomaly (53-101)
  • cost_per_1k_tokens (169-176)
  • estimated_savings_per_1k (436-441)
src/ai_company/budget/config.py (1)
  • BudgetConfig (151-227)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
src/ai_company/providers/routing/models.py (2)
  • ResolvedModel (9-52)
  • total_cost_per_1k (50-52)
src/ai_company/providers/routing/resolver.py (4)
  • ModelResolver (25-205)
  • resolve_safe (154-172)
  • all_models (174-177)
  • all_models_sorted_by_cost (179-189)
src/ai_company/budget/optimizer_models.py (1)
src/ai_company/budget/enums.py (1)
  • BudgetAlertLevel (6-16)
src/ai_company/budget/__init__.py (3)
src/ai_company/budget/optimizer.py (1)
  • CostOptimizer (76-665)
src/ai_company/budget/optimizer_models.py (11)
  • AgentEfficiency (142-176)
  • AnomalyDetectionResult (104-136)
  • AnomalySeverity (34-39)
  • AnomalyType (22-31)
  • ApprovalDecision (310-340)
  • CostOptimizerConfig (346-397)
  • DowngradeAnalysis (276-304)
  • EfficiencyAnalysis (179-234)
  • EfficiencyRating (42-47)
  • RoutingOptimizationAnalysis (467-509)
  • SpendingAnomaly (53-101)
src/ai_company/budget/reports.py (6)
  • ModelDistribution (80-101)
  • PeriodComparison (104-144)
  • ProviderDistribution (58-77)
  • ReportGenerator (212-343)
  • SpendingReport (147-206)
  • TaskSpending (40-55)
src/ai_company/budget/reports.py (3)
src/ai_company/budget/spending_summary.py (1)
  • SpendingSummary (102-161)
src/ai_company/budget/cost_record.py (1)
  • CostRecord (15-56)
src/ai_company/budget/tracker.py (3)
  • CostTracker (68-455)
  • get_records (185-225)
  • build_summary (227-281)

Comment on lines +110 to +190
def _detect_spike_anomaly(  # noqa: PLR0913
    agent_id: str,
    window_costs: tuple[float, ...],
    now: datetime,
    window_starts: tuple[datetime, ...],
    window_duration: timedelta,
    config: CostOptimizerConfig,
) -> SpendingAnomaly | None:
    """Detect a spike anomaly for a single agent.

    Returns ``None`` if no anomaly is detected or insufficient data.
    """
    if len(window_costs) < config.min_anomaly_windows:
        logger.debug(
            CFO_INSUFFICIENT_WINDOWS,
            agent_id=agent_id,
            window_count=len(window_costs),
            min_required=config.min_anomaly_windows,
        )
        return None

    historical = window_costs[:-1]
    current = window_costs[-1]

    if current == 0.0:
        return None

    mean = statistics.mean(historical)

    if mean == 0.0:
        # No historical spending -- spike from zero (current > 0 per guard)
        return SpendingAnomaly(
            agent_id=agent_id,
            anomaly_type=AnomalyType.SPIKE,
            severity=AnomalySeverity.HIGH,
            description=(
                f"Agent {agent_id!r} went from $0.00 baseline "
                f"to ${current:.2f} in the latest window"
            ),
            current_value=current,
            baseline_value=0.0,
            deviation_factor=0.0,
            detected_at=now,
            period_start=window_starts[-1],
            period_end=window_starts[-1] + window_duration,
        )

    # Check spike factor (independent of stddev)
    spike_ratio = current / mean
    is_spike = spike_ratio > config.anomaly_spike_factor

    # Check sigma threshold
    stddev = statistics.stdev(historical) if len(historical) > 1 else 0.0
    deviation = (current - mean) / stddev if stddev > 0 else 0.0
    is_sigma_anomaly = deviation > config.anomaly_sigma_threshold

    if not is_spike and not is_sigma_anomaly:
        return None

    # When stddev is zero, use the spike ratio for severity classification
    classification_value = spike_ratio if is_spike and stddev == 0.0 else deviation
    severity = _classify_severity(classification_value)

    # Use spike_ratio as deviation_factor when stddev is zero
    effective_deviation = spike_ratio if stddev == 0.0 else deviation

    return SpendingAnomaly(
        agent_id=agent_id,
        anomaly_type=AnomalyType.SPIKE,
        severity=severity,
        description=(
            f"Agent {agent_id!r} spent ${current:.2f} vs "
            f"${mean:.2f} baseline ({effective_deviation:.1f}x)"
        ),
        current_value=current,
        baseline_value=round(mean, BUDGET_ROUNDING_PRECISION),
        deviation_factor=round(effective_deviation, BUDGET_ROUNDING_PRECISION),
        detected_at=now,
        period_start=window_starts[-1],
        period_end=window_starts[-1] + window_duration,
    )

🛠️ Refactor suggestion | 🟠 Major

Split _detect_spike_anomaly again.

This helper still bundles validation, zero-baseline handling, threshold evaluation, severity mapping, and model construction into one 80+ line block. Breaking those branches into smaller helpers will keep the anomaly logic easier to audit and back under the repo’s function-size limit.

As per coding guidelines, "Keep functions under 50 lines and files under 800 lines".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/budget/_optimizer_helpers.py` around lines 110 - 190, The
_detect_spike_anomaly function is too large and mixes validation, zero-baseline
handling, threshold evaluation, severity classification, and SpendingAnomaly
construction; refactor by splitting it into small helpers (e.g.,
_validate_windows(agent_id, window_costs, config),
_handle_zero_baseline(agent_id, current, now, window_starts, window_duration),
_evaluate_spike_and_sigma(historical, current, config) which returns (is_spike,
is_sigma_anomaly, spike_ratio, deviation, stddev), and
_build_spending_anomaly(agent_id, current, mean, effective_deviation, severity,
now, window_starts, window_duration)). Keep existing behavior and return values
(use _classify_severity for severity, round baseline_value and deviation_factor
per BUDGET_ROUNDING_PRECISION, and preserve SpendingAnomaly fields), then
simplify _detect_spike_anomaly to call these helpers in sequence so the
top-level function is under 50 lines.

Comment on lines +1 to +900
"""Tests for CostOptimizer service."""

from datetime import UTC, datetime, timedelta

import pytest

from ai_company.budget._optimizer_helpers import _classify_severity
from ai_company.budget.config import BudgetAlertConfig, BudgetConfig
from ai_company.budget.enums import BudgetAlertLevel
from ai_company.budget.optimizer import CostOptimizer
from ai_company.budget.optimizer_models import (
    AnomalySeverity,
    AnomalyType,
    CostOptimizerConfig,
    EfficiencyRating,
)
from ai_company.budget.tracker import CostTracker
from ai_company.providers.routing.models import ResolvedModel
from ai_company.providers.routing.resolver import ModelResolver
from tests.unit.budget.conftest import make_cost_record

# ── Helpers ───────────────────────────────────────────────────────

_START = datetime(2026, 2, 1, tzinfo=UTC)
_END = datetime(2026, 3, 1, tzinfo=UTC)


def _make_optimizer(
    *,
    budget_config: BudgetConfig | None = None,
    config: CostOptimizerConfig | None = None,
    model_resolver: ModelResolver | None = None,
) -> tuple[CostOptimizer, CostTracker]:
    """Build a CostOptimizer with a fresh CostTracker."""
    bc = budget_config or BudgetConfig(total_monthly=100.0)
    tracker = CostTracker(budget_config=bc)
    optimizer = CostOptimizer(
        cost_tracker=tracker,
        budget_config=bc,
        config=config,
        model_resolver=model_resolver,
    )
    return optimizer, tracker


def _make_resolver(
    models: list[ResolvedModel] | None = None,
) -> ModelResolver:
    """Build a ModelResolver from a list of ResolvedModel."""
    if models is None:
        models = [
            ResolvedModel(
                provider_name="test-provider",
                model_id="test-large-001",
                alias="large",
                cost_per_1k_input=0.03,
                cost_per_1k_output=0.06,
            ),
            ResolvedModel(
                provider_name="test-provider",
                model_id="test-medium-001",
                alias="medium",
                cost_per_1k_input=0.01,
                cost_per_1k_output=0.02,
            ),
            ResolvedModel(
                provider_name="test-provider",
                model_id="test-small-001",
                alias="small",
                cost_per_1k_input=0.001,
                cost_per_1k_output=0.002,
            ),
        ]
    index: dict[str, ResolvedModel] = {}
    for m in models:
        index[m.model_id] = m
        if m.alias is not None:
            index[m.alias] = m
    return ModelResolver(index)


# ── Init Tests ────────────────────────────────────────────────────


@pytest.mark.unit
class TestInit:
    async def test_defaults(self) -> None:
        optimizer, _ = _make_optimizer()
        assert optimizer._config == CostOptimizerConfig()

    async def test_custom_config(self) -> None:
        cfg = CostOptimizerConfig(anomaly_sigma_threshold=3.0)
        optimizer, _ = _make_optimizer(config=cfg)
        assert optimizer._config.anomaly_sigma_threshold == 3.0


# ── Anomaly Detection Tests ──────────────────────────────────────


@pytest.mark.unit
class TestDetectAnomalies:
    async def test_no_records_empty_result(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert result.anomalies == ()
        assert result.agents_scanned == 0

    async def test_normal_spending_no_anomalies(self) -> None:
        optimizer, tracker = _make_optimizer()
        # Create uniform spending across 5 windows
        window_duration = (_END - _START) / 5
        for i in range(5):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert result.anomalies == ()
        assert result.agents_scanned == 1

    async def test_spike_detected(self) -> None:
        optimizer, tracker = _make_optimizer()
        window_duration = (_END - _START) / 5

        # Normal spending in first 4 windows
        for i in range(4):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        # Spike in last window
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="alice", cost_usd=20.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        anomaly = result.anomalies[0]
        assert anomaly.agent_id == "alice"
        assert anomaly.anomaly_type == AnomalyType.SPIKE
        assert anomaly.current_value == 20.0

    async def test_insufficient_windows_no_false_positive(self) -> None:
        config = CostOptimizerConfig(min_anomaly_windows=5)
        optimizer, tracker = _make_optimizer(config=config)

        # Only 3 windows of data in a 3-window analysis
        window_duration = (_END - _START) / 3
        for i in range(3):
            ts = _START + window_duration * i + timedelta(hours=1)
            cost = 1.0 if i < 2 else 50.0
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=cost, timestamp=ts),
            )

        result = await optimizer.detect_anomalies(
            start=_START,
            end=_END,
            window_count=3,
        )
        assert result.anomalies == ()

    async def test_multiple_agents_only_anomalous_flagged(self) -> None:
        optimizer, tracker = _make_optimizer()
        window_duration = (_END - _START) / 5

        # Alice: uniform spending
        for i in range(5):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        # Bob: spike in last window
        for i in range(4):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="bob", cost_usd=1.0, timestamp=ts),
            )
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="bob", cost_usd=20.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        assert result.anomalies[0].agent_id == "bob"
        assert result.agents_scanned == 2

    async def test_window_count_validation(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match="window_count must be >= 2"):
            await optimizer.detect_anomalies(
                start=_START,
                end=_END,
                window_count=1,
            )

    async def test_spike_from_zero_baseline(self) -> None:
        """Agent with no historical spending that suddenly appears."""
        optimizer, tracker = _make_optimizer(
            config=CostOptimizerConfig(min_anomaly_windows=3),
        )
        window_duration = (_END - _START) / 5

        # No spending in first 4 windows, spending in window 5
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="alice", cost_usd=5.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        anomaly = result.anomalies[0]
        assert anomaly.severity == AnomalySeverity.HIGH
        assert anomaly.baseline_value == 0.0

    async def test_spike_severity_with_zero_stddev(self) -> None:
        """Spike severity uses spike_ratio when stddev is 0."""
        optimizer, tracker = _make_optimizer(
            config=CostOptimizerConfig(
                anomaly_sigma_threshold=2.0,
                anomaly_spike_factor=2.0,
                min_anomaly_windows=3,
            ),
        )
        window_duration = (_END - _START) / 5

        # Identical baseline → stddev=0
        for i in range(4):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        # Spike: 4x baseline → spike_ratio=4.0 → HIGH (>=3.0)
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="alice", cost_usd=4.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        assert result.anomalies[0].severity == AnomalySeverity.HIGH


# ── Efficiency Analysis Tests ─────────────────────────────────────


@pytest.mark.unit
class TestAnalyzeEfficiency:
    async def test_uniform_all_normal(self) -> None:
        optimizer, tracker = _make_optimizer()

        # Same cost/token ratio for all agents
        for agent in ("alice", "bob", "carol"):
            await tracker.record(
                make_cost_record(
                    agent_id=agent,
                    cost_usd=1.0,
                    input_tokens=1000,
                    output_tokens=0,
                    timestamp=_START + timedelta(hours=1),
                ),
            )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert all(
            a.efficiency_rating == EfficiencyRating.NORMAL for a in result.agents
        )
        assert result.inefficient_agent_count == 0

    async def test_one_inefficient(self) -> None:
        optimizer, tracker = _make_optimizer()

        # Alice: cheap (1.0/1000 = 1.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                cost_usd=1.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Bob: expensive (10.0/1000 = 10.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert result.inefficient_agent_count == 1
        # Sorted by cost_per_1k desc
        assert result.agents[0].agent_id == "bob"
        assert result.agents[0].efficiency_rating == EfficiencyRating.INEFFICIENT

    async def test_zero_tokens_handled(self) -> None:
        optimizer, tracker = _make_optimizer()

        await tracker.record(
            make_cost_record(
                agent_id="alice",
                cost_usd=0.0,
                input_tokens=0,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert len(result.agents) == 1
        assert result.agents[0].cost_per_1k_tokens == 0.0
        assert result.agents[0].efficiency_rating == EfficiencyRating.NORMAL

    async def test_efficient_agent_flagged(self) -> None:
        optimizer, tracker = _make_optimizer()

        # Alice: very cheap (0.1/10000 = 0.01 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                cost_usd=0.1,
                input_tokens=10000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Bob: normal (1.0/1000 = 1.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                cost_usd=1.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Carol: normal (1.0/1000 = 1.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="carol",
                cost_usd=1.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        alice = next(a for a in result.agents if a.agent_id == "alice")
        assert alice.efficiency_rating == EfficiencyRating.EFFICIENT

    async def test_empty_records(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert result.agents == ()
        assert result.global_avg_cost_per_1k == 0.0


# ── Downgrade Recommendation Tests ────────────────────────────────


@pytest.mark.unit
class TestRecommendDowngrades:
    async def test_no_resolver_empty_result(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        assert result.recommendations == ()

    async def test_with_downgrade_path(self) -> None:
        from ai_company.budget.config import AutoDowngradeConfig

        resolver = _make_resolver()
        bc = BudgetConfig(
            total_monthly=100.0,
            auto_downgrade=AutoDowngradeConfig(
                enabled=True,
                threshold=80,
                downgrade_map=(("large", "small"),),
            ),
        )
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )

        # Make alice inefficient using large model
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Make bob efficient using small model
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                model="test-small-001",
                cost_usd=0.1,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        assert len(result.recommendations) == 1
        rec = result.recommendations[0]
        assert rec.agent_id == "alice"
        assert rec.current_model == "test-large-001"
        assert rec.recommended_model == "test-small-001"
        assert rec.estimated_savings_per_1k > 0

    async def test_no_cheaper_model_empty(self) -> None:
        """No recommendation when agent already uses cheapest model."""
        resolver = _make_resolver(
            [
                ResolvedModel(
                    provider_name="test-provider",
                    model_id="test-only-001",
                    alias="only",
                    cost_per_1k_input=0.01,
                    cost_per_1k_output=0.02,
                ),
            ]
        )
        bc = BudgetConfig(total_monthly=100.0)
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )

        # Only agent, only model — inefficient by default since it's the only one
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-only-001",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        assert result.recommendations == ()


# ── Evaluate Operation Tests ──────────────────────────────────────


@pytest.mark.unit
class TestEvaluateOperation:
    async def test_healthy_budget_approved(self) -> None:
        optimizer, tracker = _make_optimizer()
        # Spend only 10% of budget
        await tracker.record(
            make_cost_record(cost_usd=10.0, timestamp=_START + timedelta(hours=1)),
        )
        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=0.5,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is True
        assert decision.alert_level == BudgetAlertLevel.NORMAL

    async def test_hard_stop_denied(self) -> None:
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        optimizer, tracker = _make_optimizer(budget_config=bc)

        # Spend 100% of budget
        await tracker.record(
            make_cost_record(cost_usd=100.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=1.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        assert decision.alert_level == BudgetAlertLevel.HARD_STOP

    async def test_would_exceed_budget_denied(self) -> None:
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        optimizer, tracker = _make_optimizer(budget_config=bc)

        # Spend 95% and request 10 more → projected 105% → HARD_STOP
        await tracker.record(
            make_cost_record(cost_usd=95.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=10.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        # With projected alert level, this now triggers auto-deny
        assert "denied" in decision.reason.lower()

    async def test_warning_level_approved_with_conditions(self) -> None:
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        optimizer, tracker = _make_optimizer(budget_config=bc)

        # Spend 80% (warning level)
        await tracker.record(
            make_cost_record(cost_usd=80.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=2.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is True
        assert decision.alert_level == BudgetAlertLevel.WARNING
        assert len(decision.conditions) > 0

    async def test_budget_enforcement_disabled(self) -> None:
        bc = BudgetConfig(total_monthly=0.0)
        optimizer, _ = _make_optimizer(budget_config=bc)

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=100.0,
        )
        assert decision.approved is True
        assert "disabled" in decision.reason.lower()

    async def test_critical_level_auto_deny_with_custom_config(self) -> None:
        """Auto-deny at CRITICAL when configured."""
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        config = CostOptimizerConfig(
            approval_auto_deny_alert_level=BudgetAlertLevel.CRITICAL,
        )
        optimizer, tracker = _make_optimizer(budget_config=bc, config=config)

        # Spend 92% (critical level)
        await tracker.record(
            make_cost_record(cost_usd=92.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=0.01,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        assert decision.alert_level == BudgetAlertLevel.CRITICAL

    async def test_high_cost_condition(self) -> None:
        """High-cost warning condition when estimated cost >= threshold."""
        config = CostOptimizerConfig(approval_warn_threshold_usd=0.5)
        optimizer, _ = _make_optimizer(config=config)

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=1.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is True
        assert any("High-cost" in c for c in decision.conditions)


# ── _classify_severity Tests ─────────────────────────────────────


@pytest.mark.unit
class TestClassifySeverity:
    @pytest.mark.parametrize(
        ("deviation", "expected"),
        [
            (0.0, AnomalySeverity.LOW),
            (1.5, AnomalySeverity.LOW),
            (1.99, AnomalySeverity.LOW),
            (2.0, AnomalySeverity.MEDIUM),
            (2.5, AnomalySeverity.MEDIUM),
            (2.99, AnomalySeverity.MEDIUM),
            (3.0, AnomalySeverity.HIGH),
            (5.0, AnomalySeverity.HIGH),
            (100.0, AnomalySeverity.HIGH),
        ],
    )
    def test_thresholds(self, deviation: float, expected: AnomalySeverity) -> None:
        assert _classify_severity(deviation) == expected


# ── Input Validation Tests ───────────────────────────────────────


@pytest.mark.unit
class TestInputValidation:
    async def test_detect_anomalies_start_after_end(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.detect_anomalies(start=_END, end=_START)

    async def test_analyze_efficiency_start_after_end(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.analyze_efficiency(start=_END, end=_START)

    async def test_recommend_downgrades_start_after_end(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.recommend_downgrades(start=_END, end=_START)


# ── Edge Case Tests ──────────────────────────────────────────────


@pytest.mark.unit
class TestEdgeCases:
    async def test_find_cheaper_model_picks_cheapest(self) -> None:
        """_find_cheaper_model selects the overall cheapest below current."""
        resolver = _make_resolver()
        result = await _make_optimizer(model_resolver=resolver)[0].recommend_downgrades(
            start=_START, end=_END
        )
        # No records → no recommendations, but validates the path
        assert result.recommendations == ()

    async def test_budget_pressure_percent_reflects_spending(self) -> None:
        """budget_pressure_percent reflects actual spend vs budget."""
        from ai_company.budget.billing import billing_period_start

        resolver = _make_resolver()
        bc = BudgetConfig(total_monthly=100.0)
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )
        # Record in the current billing period so pressure reflects it
        now = datetime.now(UTC)
        period_start = billing_period_start(bc.reset_day, now=now)
        await tracker.record(
            make_cost_record(
                cost_usd=60.0,
                timestamp=period_start + timedelta(hours=1),
            ),
        )
        # Use a period that covers the data for the efficiency analysis
        analysis_start = period_start
        analysis_end = now + timedelta(days=1)
        result = await optimizer.recommend_downgrades(
            start=analysis_start, end=analysis_end
        )
        assert result.budget_pressure_percent == 60.0

    async def test_downgrade_target_not_resolved(self) -> None:
        """No recommendation when downgrade target doesn't resolve."""
        from ai_company.budget.config import AutoDowngradeConfig

        resolver = _make_resolver(
            [
                ResolvedModel(
                    provider_name="test-provider",
                    model_id="test-large-001",
                    alias="large",
                    cost_per_1k_input=0.03,
                    cost_per_1k_output=0.06,
                ),
            ]
        )
        bc = BudgetConfig(
            total_monthly=100.0,
            auto_downgrade=AutoDowngradeConfig(
                enabled=True,
                threshold=80,
                downgrade_map=(("large", "nonexistent"),),
            ),
        )
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )

        # Make alice inefficient (only agent, but needs another to set avg)
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                model="test-large-001",
                cost_usd=0.1,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        # Target "nonexistent" can't be resolved → no recommendation
        assert result.recommendations == ()

    async def test_negative_estimated_cost_rejected(self) -> None:
        """Negative estimated_cost_usd raises ValueError."""
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match="estimated_cost_usd must be >= 0"):
            await optimizer.evaluate_operation(
                agent_id="alice",
                estimated_cost_usd=-1.0,
            )

    async def test_window_count_upper_bound(self) -> None:
        """window_count > 1000 raises ValueError."""
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match="window_count must be <= 1000"):
            await optimizer.detect_anomalies(
                start=_START,
                end=_END,
                window_count=1001,
            )

    async def test_projected_alert_level_used_for_auto_deny(self) -> None:
        """Auto-deny uses projected alert level, not current."""
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        config = CostOptimizerConfig(
            approval_auto_deny_alert_level=BudgetAlertLevel.HARD_STOP,
        )
        optimizer, tracker = _make_optimizer(budget_config=bc, config=config)

        # Spend 95% — current alert is CRITICAL, but requesting 10
        # would push to 105% → projected HARD_STOP → denied
        await tracker.record(
            make_cost_record(cost_usd=95.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=10.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        assert "projected" in decision.reason.lower()


# ── Routing Optimization Tests ──────────────────────────────────


@pytest.mark.unit
class TestSuggestRoutingOptimizations:
    async def test_no_resolver_empty_result(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert result.suggestions == ()
        assert result.agents_analyzed == 0

    async def test_no_records_empty_suggestions(self) -> None:
        resolver = _make_resolver()
        optimizer, _ = _make_optimizer(model_resolver=resolver)
        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert result.suggestions == ()
        assert result.agents_analyzed == 0

    async def test_suggests_cheaper_model(self) -> None:
        resolver = _make_resolver()
        optimizer, tracker = _make_optimizer(model_resolver=resolver)

        # Alice uses the expensive large model
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=5.0,
                input_tokens=1000,
                output_tokens=500,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert len(result.suggestions) == 1
        suggestion = result.suggestions[0]
        assert suggestion.agent_id == "alice"
        assert suggestion.current_model == "test-large-001"
        assert suggestion.estimated_savings_per_1k > 0
        assert result.total_estimated_savings_per_1k > 0

    async def test_no_suggestion_for_cheapest_model(self) -> None:
        resolver = _make_resolver()
        optimizer, tracker = _make_optimizer(model_resolver=resolver)

        # Alice already uses the cheapest model
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-small-001",
                cost_usd=0.1,
                input_tokens=1000,
                output_tokens=500,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert result.suggestions == ()
        assert result.agents_analyzed == 1

    async def test_start_after_end_rejected(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.suggest_routing_optimizations(start=_END, end=_START)

    async def test_context_window_respected(self) -> None:
        """Suggestions only include models with sufficient context window."""
        models = [
            ResolvedModel(
                provider_name="test-provider",
                model_id="test-large-001",
                alias="large",
                cost_per_1k_input=0.03,
                cost_per_1k_output=0.06,
                max_context=200000,
            ),
            ResolvedModel(
                provider_name="test-provider",
                model_id="test-small-001",
                alias="small",
                cost_per_1k_input=0.001,
                cost_per_1k_output=0.002,
                max_context=50000,  # Smaller context than large
            ),
        ]
        resolver = _make_resolver(models)
        optimizer, tracker = _make_optimizer(model_resolver=resolver)

        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=5.0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        # small has insufficient context window → no suggestion
        assert result.suggestions == ()

🛠️ Refactor suggestion | 🟠 Major

Split this test module.

This new file is already around 900 lines, which is past the repo's size limit and will only get harder to navigate as optimizer coverage grows. Breaking it into anomaly/efficiency/downgrade/approval/routing modules would keep failures much easier to localize.

As per coding guidelines: "Keep functions under 50 lines and files under 800 lines."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/budget/test_optimizer.py` around lines 1 - 900, The test module is
too large; split it into smaller focused test files by moving the related test
classes into separate modules (e.g., tests/unit/budget/test_anomalies.py,
test_efficiency.py, test_downgrades.py, test_approval.py, test_routing.py).
Extract shared helpers/constants (_START, _END, _make_optimizer, _make_resolver,
make_cost_record import) into a common test helper or conftest (e.g.,
tests/unit/budget/test_helpers.py or reuse tests/unit/budget/conftest.py) and
update imports in each new file; preserve pytest.mark.unit decorators and keep
each test class (TestDetectAnomalies, TestAnalyzeEfficiency,
TestRecommendDowngrades, TestEvaluateOperation, TestSuggestRoutingOptimizations,
TestClassifySeverity, TestInputValidation, TestEdgeCases) intact when moving so
tests and references (CostOptimizer, CostTracker, CostOptimizerConfig,
BudgetConfig, ModelResolver, ResolvedModel, _classify_severity) still resolve.
Ensure no duplicate fixtures/names and run pytest to verify imports and test
discovery.

Comment on lines +645 to +652
    async def test_find_cheaper_model_picks_cheapest(self) -> None:
        """_find_cheaper_model selects the overall cheapest below current."""
        resolver = _make_resolver()
        result = await _make_optimizer(model_resolver=resolver)[0].recommend_downgrades(
            start=_START, end=_END
        )
        # No records → no recommendations, but validates the path
        assert result.recommendations == ()

⚠️ Potential issue | 🟡 Minor

This test never reaches cheaper-model selection.

No records are seeded here, so recommend_downgrades() returns on the empty-data path before any _find_cheaper_model logic runs. The test passes even if that branch is broken. Either seed an inefficient record and assert the chosen target, or rename the test to reflect the empty-state behavior it actually covers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/budget/test_optimizer.py` around lines 645 - 652, The test
test_find_cheaper_model_picks_cheapest never exercises _find_cheaper_model
because recommend_downgrades returns early on empty data; either seed an
inefficient usage record before calling recommend_downgrades so the
_find_cheaper_model path runs and assert the chosen cheaper model, or rename the
test to reflect empty-state behavior. Concretely, in the test that calls
_make_resolver() and _make_optimizer(), add a fixture/seeded record (matching
whatever helper you use to insert records in tests) representing an
inefficient/high-cost model so recommend_downgrades evaluates downgrades, then
assert the returned recommendation target; otherwise change the test name and
expected assertion to indicate it verifies the empty-data result from
recommend_downgrades.

- (A) _find_most_used_model accepts pre-filtered agent records
- (B) _find_cheaper_model respects min_context for context window
- (C) recommend_downgrades returns real budget_pressure when no resolver
- (D) evaluate_operation uses projected_alert for conditions
- (E) reports.py logs WARNING before validation ValueErrors
- (F) suggest_routing_optimizations docstring no longer claims latency
- (G) generate_report derives total_cost from records for consistency
- (H) evaluate_operation split into _check_denial/_build_approval_conditions;
      recommend_downgrades/suggest_routing_optimizations loops extracted
- (I) recommend_downgrades parallelizes get_records + budget_pressure
- (J) test_optimizer.py split into 3 files (analysis, decisions)
- (K) DESIGN_SPEC §10.3 mentions routing optimization
- (L) _find_cheaper_model tests exercise actual code path + min_context
Copilot AI review requested due to automatic review settings March 9, 2026 15:20
@Aureliolo Aureliolo merged commit a7fa00b into main Mar 9, 2026
8 checks passed
@Aureliolo Aureliolo deleted the feat/cfo-agent branch March 9, 2026 15:21
Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.


Comment on lines +245 to +250
"""Generate a spending report for the given period.

Fetches records and summary concurrently; derives ``total_cost``
from the records snapshot for consistent distribution
percentages.

Copilot AI Mar 9, 2026


The generate_report() docstring says records and summary are fetched concurrently, but the implementation awaits get_records() and then build_summary() sequentially. Either update the docstring to match the actual behavior or use a TaskGroup/gather to fetch both concurrently (noting CostTracker snapshots under a lock).
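If the docstring's wording is kept, the concurrent shape it describes is a plain gather over two independent coroutines. A minimal sketch (the `get_records`/`build_summary` bodies below are stand-ins, not the project's implementations; the real `CostTracker` is assumed to snapshot under its own lock, as the comment notes):

```python
import asyncio


async def get_records() -> list[dict]:
    # stand-in for CostTracker.get_records(); snapshots under a lock internally
    await asyncio.sleep(0)
    return [{"cost": 1.0}, {"cost": 2.0}]


async def build_summary() -> dict:
    # stand-in for the summary builder
    await asyncio.sleep(0)
    return {"total": 3.0}


async def generate_report() -> tuple[float, dict]:
    # fetch both concurrently instead of awaiting them sequentially
    records, summary = await asyncio.gather(get_records(), build_summary())
    # derive total_cost from the records snapshot for consistent percentages
    total_cost = sum(r["cost"] for r in records)
    return total_cost, summary
```

The cheaper fix, of course, is just rewording the docstring to say the fetches are sequential.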

Comment on lines +61 to +65
current_value: Spending in the most recent window.
baseline_value: Mean spending across historical windows.
deviation_factor: How many standard deviations above baseline.
Set to 0.0 when the baseline is zero (no historical spending).
detected_at: Timestamp when the anomaly was detected.

Copilot AI Mar 9, 2026


SpendingAnomaly.deviation_factor is documented as “standard deviations above baseline”, but when historical stddev is 0 the implementation sets deviation_factor to the spike ratio (a multiplier), not a sigma value. Please update the field/docstring to reflect the actual semantics (e.g., “sigma or spike ratio depending on variance”) so consumers don’t misinterpret it.
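The dual semantics being flagged can be sketched as a two-branch computation (a hypothetical reconstruction from the review's description, not the repo's code; `deviation_factor` here is a stand-in):

```python
import statistics


def deviation_factor(current: float, history: list[float]) -> float:
    """Sigma above baseline when variance exists; spike ratio when stddev == 0."""
    baseline = statistics.mean(history)
    if baseline == 0:
        return 0.0  # no historical spending: documented as 0.0
    stddev = statistics.pstdev(history)
    if stddev == 0:
        # flat history: the value is a multiplier of baseline, not a sigma count
        return current / baseline
    return (current - baseline) / stddev
```

A docstring along the lines of "sigma above baseline when historical variance exists; spike ratio (current/baseline) when stddev is zero; 0.0 when baseline is zero" would make the three regimes explicit for consumers.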

Comment on lines +66 to +72
# Same ordering as BudgetEnforcer._ALERT_LEVEL_ORDER
_ALERT_LEVEL_ORDER: dict[BudgetAlertLevel, int] = {
BudgetAlertLevel.NORMAL: 0,
BudgetAlertLevel.WARNING: 1,
BudgetAlertLevel.CRITICAL: 2,
BudgetAlertLevel.HARD_STOP: 3,
}

Copilot AI Mar 9, 2026


optimizer.py duplicates BudgetEnforcer’s _ALERT_LEVEL_ORDER mapping but omits the runtime sanity checks that enforcer.py has (ensuring keys match BudgetAlertLevel and values are unique). Adding the same validation (or importing a shared constant) would prevent silent drift if BudgetAlertLevel changes.
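The suggested drift guard amounts to two module-import-time checks on the mapping. A self-contained sketch (the `BudgetAlertLevel` enum below is a local stand-in mirroring the project's; the assertion style is an assumption about what enforcer.py does, based on the comment):

```python
from enum import Enum


class BudgetAlertLevel(Enum):  # stand-in mirroring the project's enum
    NORMAL = "normal"
    WARNING = "warning"
    CRITICAL = "critical"
    HARD_STOP = "hard_stop"


ALERT_LEVEL_ORDER: dict[BudgetAlertLevel, int] = {
    BudgetAlertLevel.NORMAL: 0,
    BudgetAlertLevel.WARNING: 1,
    BudgetAlertLevel.CRITICAL: 2,
    BudgetAlertLevel.HARD_STOP: 3,
}

# Fail fast at import time if BudgetAlertLevel gains/loses members
assert set(ALERT_LEVEL_ORDER) == set(BudgetAlertLevel), "level set drifted"
# Fail fast if two levels share a rank (ordering comparisons would break)
assert len(set(ALERT_LEVEL_ORDER.values())) == len(ALERT_LEVEL_ORDER), "duplicate ranks"
```

Importing one shared constant from a common module would make even these checks unnecessary.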

Comment on lines +688 to +692
if projected_cost >= hard_stop_limit:
logger.warning(
CFO_OPERATION_DENIED,
agent_id=agent_id,
estimated_cost=estimated_cost_usd,

Copilot AI Mar 9, 2026


In _check_denial(), the if projected_cost >= hard_stop_limit branch is unreachable with the current logic: whenever that condition is true, projected_pct will be >= hard_stop_at and _compute_alert_level() will return HARD_STOP, which is always >= any configured approval_auto_deny_alert_level, so the earlier auto-deny check already returns. Consider removing this dead branch, or changing the first check if you intend hard-stop to be handled differently.
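The dead-branch claim reduces to arithmetic: with `projected_pct = projected_cost / total_monthly * 100` and `hard_stop_limit = total_monthly * hard_stop_at / 100`, the two comparisons are the same inequality scaled by `total_monthly`. A quick numeric check (names are stand-ins for the fields the comment references, and the percent-based formulas are an assumption about the implementation):

```python
def conditions_agree(projected_cost: float, total_monthly: float,
                     hard_stop_at_pct: float) -> bool:
    """True when the cost-based and percent-based hard-stop checks coincide."""
    hard_stop_limit = total_monthly * hard_stop_at_pct / 100
    projected_pct = projected_cost / total_monthly * 100
    # For any positive total_monthly the two conditions are equivalent,
    # so the later cost-based branch can never fire first.
    return (projected_cost >= hard_stop_limit) == (projected_pct >= hard_stop_at_pct)
```

So whenever the cost-based branch would trigger, the projected alert level is already HARD_STOP and the earlier auto-deny check has returned.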

Comment on lines +471 to +480
if cfg.total_monthly <= 0:
return ApprovalDecision(
approved=True,
reason="Budget enforcement disabled (no monthly budget)",
budget_remaining_usd=0.0,
budget_used_percent=0.0,
alert_level=BudgetAlertLevel.NORMAL,
conditions=(),
)

Missing INFO log on budget-enforcement-disabled approval path

The total_monthly <= 0 early-return at line 471 emits no log entry before returning the ApprovalDecision. All other code paths in this method (CFO_OPERATION_DENIED for negative cost, CFO_APPROVAL_EVALUATED for the normal approval, and _check_denial's CFO_OPERATION_DENIED) are instrumented at INFO/WARNING. CLAUDE.md mandates "All state transitions must log at INFO," and this early-exit is a production-relevant state transition that will be completely invisible in logs.

```python
if cfg.total_monthly <= 0:
    decision = ApprovalDecision(
        approved=True,
        reason="Budget enforcement disabled (no monthly budget)",
        budget_remaining_usd=0.0,
        budget_used_percent=0.0,
        alert_level=BudgetAlertLevel.NORMAL,
        conditions=(),
    )
    logger.info(
        CFO_APPROVAL_EVALUATED,
        agent_id=agent_id,
        approved=True,
        estimated_cost=estimated_cost_usd,
        alert_level=BudgetAlertLevel.NORMAL.value,
        conditions_count=0,
        reason="enforcement_disabled",
    )
    return decision
```

Comment on lines +383 to +386
approval_auto_deny_alert_level: BudgetAlertLevel = Field(
default=BudgetAlertLevel.HARD_STOP,
description="Alert level triggering auto-deny",
)

approval_auto_deny_alert_level = NORMAL silently denies every operation

approval_auto_deny_alert_level accepts any BudgetAlertLevel, including BudgetAlertLevel.NORMAL. In _check_denial, the guard is:

```python
if _ALERT_LEVEL_ORDER[projected_alert] >= _ALERT_LEVEL_ORDER[auto_deny_level]:
```

_ALERT_LEVEL_ORDER[NORMAL] is 0, so this condition is always True for any projected_alert (since all levels map to >= 0). Setting the field to NORMAL therefore auto-denies every operation regardless of actual budget usage — a much harder footgun than the approval_warn_threshold_usd = 0 case already flagged, because it makes the service silently refuse all work.

Consider adding a validator that rejects NORMAL as the deny threshold (or documents this behaviour explicitly):

```python
@field_validator("approval_auto_deny_alert_level")
@classmethod
def _deny_level_not_normal(cls, v: BudgetAlertLevel) -> BudgetAlertLevel:
    if v == BudgetAlertLevel.NORMAL:
        msg = (
            "approval_auto_deny_alert_level=NORMAL would deny every operation; "
            "use WARNING, CRITICAL, or HARD_STOP"
        )
        raise ValueError(msg)
    return v
```

Comment on lines +749 to +751
# Re-export _classify_severity for backwards compatibility with tests
# that import it directly from optimizer.
__all__ = ["CostOptimizer", "_classify_severity"]

Stale re-export of private _classify_severity in __all__

The comment claims _classify_severity is re-exported here for "backwards compatibility with tests that import it directly from optimizer," but test_optimizer.py already imports it from ai_company.budget._optimizer_helpers (line 4 of that file), not from optimizer. The re-export is therefore unused, and exporting a module-private function (single-underscore-prefixed convention) via __all__ is unconventional and misleading — consumers of ai_company.budget.optimizer would see it as part of the public API.

```python
# Re-export _classify_severity for backwards compatibility with tests
# that import it directly from optimizer.
__all__ = ["CostOptimizer", "_classify_severity"]
```

Consider removing _classify_severity from __all__:

Suggested change:

```diff
-# Re-export _classify_severity for backwards compatibility with tests
-# that import it directly from optimizer.
-__all__ = ["CostOptimizer", "_classify_severity"]
+__all__ = ["CostOptimizer"]
```

Aureliolo added a commit that referenced this pull request Mar 10, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.1.1](ai-company-v0.1.0...ai-company-v0.1.1)
(2026-03-10)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Aureliolo added a commit that referenced this pull request Mar 11, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.1.0](v0.0.0...v0.1.0)
(2026-03-11)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add mandatory JWT + API key authentication
([#256](#256))
([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable output scan response policies
([#263](#263))
([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence
([#279](#279))
([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format
([#286](#286))
([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2
([#271](#271))
([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0
([#273](#273))
([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0
([#272](#272))
([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0
([#270](#270))
([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0
([#274](#274))
([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0
([#275](#275))
([29dd16c](29dd16c))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1
([#282](#282))
([2f4703d](2f4703d))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>