feat: add CFO cost optimization service with anomaly detection, reports, and approval decisions #186
…ts, and approval decisions (#46)

Implement CostOptimizer and ReportGenerator domain services backing the CFO role (DESIGN_SPEC §10.3). CostOptimizer provides spending anomaly detection (Z-score + spike factor), cost efficiency analysis per agent, model downgrade recommendations via ModelResolver, and operation approval/denial based on budget utilization. ReportGenerator produces multi-dimensional spending reports with task/provider/model breakdowns and period-over-period comparison. Adds get_records() to CostTracker for raw record access. 80 new tests, 96% budget module coverage.

…ements Pre-reviewed by 9 agents, 35 findings addressed.
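The Z-score + spike-factor anomaly detection the description mentions can be sketched roughly as follows. This is an illustrative reading of the PR summary, not the PR's actual code: the function name, thresholds, and the windowed-cost input are all assumptions.

```python
import statistics

def is_anomalous(
    window_costs: list[float],
    sigma_threshold: float = 3.0,
    spike_factor: float = 2.0,
) -> bool:
    """Flag the latest window as anomalous (hypothetical sketch).

    Fires when the latest window's cost deviates from the historical mean
    by more than `sigma_threshold` standard deviations (Z-score test), OR
    when it exceeds the historical mean by `spike_factor` (spike test).
    """
    *historical, current = window_costs
    if len(historical) < 2:
        # Not enough history for a meaningful stdev
        return False
    mean = statistics.mean(historical)
    stdev = statistics.stdev(historical)
    z_score = (current - mean) / stdev if stdev > 0 else 0.0
    is_spike = mean > 0 and current / mean >= spike_factor
    return z_score > sigma_threshold or is_spike
```

A flat history with a sudden jump trips both tests; a flat history with a small wiggle trips neither.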
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
⚠️ Review failed: the pull request is closed.
📝 Walkthrough

Summary by CodeRabbit: Adds a CFO cost-optimization subsystem: CostOptimizer service, ReportGenerator, domain models, internal helpers, an enriched CostTracker query API, new CFO/budget observability events, expanded public exports, and extensive unit tests for detection, analysis, recommendations, reporting, and approval evaluation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(240,248,255,0.5)
        participant Client as Client Agent
        participant Optimizer as CostOptimizer
        participant Tracker as CostTracker
        participant Resolver as ModelResolver
        participant Logger as EventLogger
    end
    Client->>Optimizer: detect_anomalies(start,end)
    Optimizer->>Tracker: get_records(start,end)
    Tracker-->>Optimizer: CostRecord[]
    Optimizer->>Optimizer: windowing & per-agent analysis
    Optimizer->>Logger: CFO_ANOMALY_DETECTED
    Optimizer-->>Client: AnomalyDetectionResult
    Client->>Optimizer: recommend_downgrades(start,end)
    Optimizer->>Optimizer: analyze_efficiency(start,end)
    Optimizer->>Resolver: resolve candidate models (async)
    Resolver-->>Optimizer: ResolvedModel(s)
    Optimizer->>Logger: CFO_DOWNGRADE_RECOMMENDED
    Optimizer-->>Client: DowngradeAnalysis
    Client->>Optimizer: evaluate_operation(agent_id, cost)
    Optimizer->>Tracker: get_records(month_window)
    Optimizer->>Optimizer: compute budget pressure & projected level
    Optimizer->>Logger: CFO_APPROVAL_EVALUATED
    Optimizer-->>Client: ApprovalDecision
```

```mermaid
sequenceDiagram
    rect rgba(255,250,240,0.5)
        participant Client as Client Agent
        participant ReportGen as ReportGenerator
        participant Tracker as CostTracker
        participant Aggregator as AggregationLogic
        participant Logger as EventLogger
    end
    Client->>ReportGen: generate_report(start,end,top_n,cmp?)
    ReportGen->>Tracker: get_records(start,end)
    Tracker-->>ReportGen: CostRecord[]
    ReportGen->>Aggregator: build by_task/by_provider/by_model
    opt include_period_comparison
        ReportGen->>Tracker: get_records(prev_start,prev_end)
        Tracker-->>ReportGen: CostRecord[]
        ReportGen->>Aggregator: compute PeriodComparison
    end
    ReportGen->>Logger: CFO_REPORT_GENERATED
    ReportGen-->>Client: SpendingReport
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a comprehensive CFO cost optimization system, enabling the AI company to intelligently manage and reduce operational spending. It provides tools for detecting unusual spending patterns, analyzing agent efficiency, recommending cost-saving model downgrades, and making automated approval decisions for operations based on budget health. This system significantly enhances financial oversight and proactive cost management within the AI agent ecosystem.
Code Review
This pull request introduces the CFO cost optimization service, featuring the CostOptimizer for anomaly detection, efficiency analysis, and downgrade recommendations, and a ReportGenerator for detailed spending reports. The implementation is generally solid, with well-structured code, carefully constrained Pydantic models, and comprehensive test coverage. However, two potential Denial of Service (DoS) vectors were identified in the CostOptimizer service: one due to inefficient algorithmic complexity in anomaly detection and downgrade recommendations, and another from missing upper-bound validation on the window_count parameter. Addressing these by grouping records by agent once and adding a maximum limit to the number of windows will improve the service's resilience against resource exhaustion attacks. Additionally, a minor suggestion was noted to improve code clarity by removing a redundant check.
```python
for agent_id in agent_ids:
    window_costs = _compute_window_costs(
        records,
        agent_id,
        window_starts,
        window_duration,
    )
```
The detect_anomalies and recommend_downgrades methods exhibit O(N*M) algorithmic complexity, where N is the number of agents and M is the number of cost records. Specifically, detect_anomalies iterates over all unique agents (line 151) and, for each agent, calls _compute_window_costs which iterates over the entire set of records (lines 547-551). Similarly, recommend_downgrades iterates over agents (line 299) and calls _find_most_used_model which also iterates over all records (lines 685-686). An attacker who can populate the CostTracker with a large number of records for many distinct agent IDs could trigger these methods to cause excessive CPU consumption, leading to a Denial of Service (DoS).
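The suggested fix (grouping records by agent once) can be sketched as a single O(M) bucketing pass followed by per-agent analysis of each bucket, instead of rescanning all records per agent. `CostRecord` here is a minimal stand-in for the repository's real model, not its actual definition.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class CostRecord:
    """Minimal stand-in for the real cost record model."""
    agent_id: str
    cost_usd: float

def group_records_by_agent(
    records: list[CostRecord],
) -> dict[str, list[CostRecord]]:
    """Bucket records by agent in one O(M) pass, avoiding O(N*M) rescans."""
    by_agent: dict[str, list[CostRecord]] = defaultdict(list)
    for record in records:
        by_agent[record.agent_id].append(record)
    return dict(by_agent)
```

Per-agent helpers like `_compute_window_costs` would then receive only that agent's bucket, making the overall scan linear in the record count.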
```python
if window_count < 2:  # noqa: PLR2004
    msg = f"window_count must be >= 2, got {window_count}"
    raise ValueError(msg)

now = datetime.now(UTC)
records = await self._cost_tracker.get_records(
    start=start,
    end=end,
)

total_duration = end - start
window_duration = total_duration / window_count
window_starts = tuple(start + window_duration * i for i in range(window_count))
```
The detect_anomalies method accepts a window_count parameter (line 111) that is used to create a tuple of time window starts (line 146). While there is a check to ensure window_count >= 2 (line 134), there is no upper bound validation. A very large value for window_count could lead to excessive memory allocation when creating the window_starts tuple, potentially causing an Out-of-Memory (OOM) condition and crashing the service.
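A minimal sketch of the suggested upper-bound validation; the cap value `MAX_WINDOW_COUNT` is an assumed constant for illustration, not taken from the repository.

```python
# Hypothetical cap on window_count to prevent unbounded allocation.
MAX_WINDOW_COUNT = 1000

def validate_window_count(window_count: int) -> None:
    """Reject window counts outside [2, MAX_WINDOW_COUNT]."""
    if window_count < 2:
        msg = f"window_count must be >= 2, got {window_count}"
        raise ValueError(msg)
    if window_count > MAX_WINDOW_COUNT:
        msg = f"window_count must be <= {MAX_WINDOW_COUNT}, got {window_count}"
        raise ValueError(msg)
```

With this in place, a pathological request such as `window_count=10**9` fails fast instead of materializing a billion-element tuple of window starts.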
src/ai_company/budget/optimizer.py (outdated)
```python
# Check sigma threshold
stddev = statistics.stdev(historical) if len(historical) > 1 else 0.0
deviation = (current - mean) / stddev if stddev > 0 else 0.0
is_sigma_anomaly = stddev > 0 and deviation > config.anomaly_sigma_threshold
```
The stddev > 0 check in this line is redundant. The preceding line ensures that deviation is 0.0 when stddev is 0. Since config.anomaly_sigma_threshold is constrained to be greater than 0, the comparison deviation > config.anomaly_sigma_threshold will correctly evaluate to False in that case. Removing the redundant check simplifies the logic.
```diff
- is_sigma_anomaly = stddev > 0 and deviation > config.anomaly_sigma_threshold
+ is_sigma_anomaly = deviation > config.anomaly_sigma_threshold
```
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/budget/optimizer.py`:
- Around line 106-190: The public methods detect_anomalies,
recommend_downgrades, and evaluate_operation are too large and mix validation,
data loading, decision logic, and logging; refactor each into smaller helpers
(e.g., extract validation into _validate_detect_args, data fetch into
_load_records_for_agent or _fetch_scan_records, core decision logic into
_compute_window_costs and _detect_spike_anomaly already exist but move
surrounding orchestration into helpers like _detect_anomalies_for_agent, and
logging into _log_anomaly and _log_scan_summary) so each public method is <50
lines: keep detect_anomalies responsible only for argument checks, calling the
helpers for records loading and per-agent analysis, aggregating results, and
invoking a single summary log; apply the same pattern to recommend_downgrades
and evaluate_operation by splitting validation, data access, business rules, and
logging into clearly named private functions.
- Around line 380-417: The auto-deny check currently compares
approval_auto_deny_alert_level against the current alert_level computed from
used_pct; change it to compute projected_used_pct = round(projected_cost /
cfg.total_monthly * 100, BUDGET_ROUNDING_PRECISION), then call
projected_alert_level = _compute_alert_level(projected_used_pct, cfg) and
compare _ALERT_LEVEL_ORDER[projected_alert_level] >=
_ALERT_LEVEL_ORDER[auto_deny_level]; if true, log the denial (use same logger
fields but include projected_* values) and return an ApprovalDecision denying
the request (similar to the existing block) so the configurable auto-deny
threshold is enforced based on projected usage rather than current usage.
- Around line 333-379: In evaluate_operation, validate the public input
estimated_cost_usd at the top of the function (before any budget logic) and fail
fast on impossible values: if estimated_cost_usd is negative, raise a clear
exception (e.g., ValueError) indicating the invalid estimated_cost_usd and
include the provided value and agent_id for diagnostics; this prevents callers
from increasing budget_remaining_usd by passing negative estimates and keeps the
public boundary robust.
In `@src/ai_company/budget/reports.py`:
- Around line 177-184: Update the tuple element types for top_agents_by_cost and
top_tasks_by_cost to use NotBlankStr for the identifier positions instead of
plain str; locate the Field declarations for top_agents_by_cost and
top_tasks_by_cost in the Reports model and change their type annotations from
tuple[tuple[str, float], ...] to tuple[tuple[NotBlankStr, float], ...], ensuring
any imports include NotBlankStr where these fields are defined.
- Around line 220-228: Add a DEBUG-level log in the __init__ of the class that
accepts CostTracker and BudgetConfig to record object creation and key init
values; update the __init__ method (the constructor with parameters
cost_tracker: CostTracker and budget_config: BudgetConfig) to call the
module/class logger.debug with a concise message that the report object was
created and include non-sensitive identifying info (e.g., id(cost_tracker) or
budget_config.name) to aid tracing.
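The projected-usage auto-deny change described in the second optimizer comment might look roughly like this. The alert-level enum, thresholds, ordering map, and rounding precision are illustrative assumptions standing in for the repository's `_compute_alert_level` and `_ALERT_LEVEL_ORDER`.

```python
from enum import Enum

class AlertLevel(Enum):
    """Hypothetical alert levels, ordered from least to most severe."""
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"
    EXCEEDED = "exceeded"

# Severity ordering for >= comparisons between levels
_ALERT_LEVEL_ORDER = {lvl: i for i, lvl in enumerate(AlertLevel)}

def compute_alert_level(used_pct: float) -> AlertLevel:
    """Map a budget-used percentage to an alert level (assumed thresholds)."""
    if used_pct >= 100.0:
        return AlertLevel.EXCEEDED
    if used_pct >= 90.0:
        return AlertLevel.CRITICAL
    if used_pct >= 75.0:
        return AlertLevel.WARNING
    return AlertLevel.OK

def should_auto_deny(
    projected_cost: float,
    total_monthly: float,
    auto_deny_level: AlertLevel,
) -> bool:
    """Deny based on PROJECTED usage (current spend + estimated cost)."""
    projected_used_pct = round(projected_cost / total_monthly * 100, 4)
    projected_level = compute_alert_level(projected_used_pct)
    return _ALERT_LEVEL_ORDER[projected_level] >= _ALERT_LEVEL_ORDER[auto_deny_level]
```

The key point of the finding: the comparison uses the alert level computed from projected usage, so an operation that would push spend over the threshold is denied even when current usage is still below it.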
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cb438d54-3b98-4383-a58e-139b7127add4
📒 Files selected for processing (16)
- CLAUDE.md
- DESIGN_SPEC.md
- README.md
- src/ai_company/budget/__init__.py
- src/ai_company/budget/optimizer.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/reports.py
- src/ai_company/budget/tracker.py
- src/ai_company/observability/events/budget.py
- src/ai_company/observability/events/cfo.py
- tests/unit/budget/conftest.py
- tests/unit/budget/test_optimizer.py
- tests/unit/budget/test_optimizer_models.py
- tests/unit/budget/test_reports.py
- tests/unit/budget/test_tracker_get_records.py
- tests/unit/observability/test_events.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Do NOT use `from __future__ import annotations` — Python 3.14 has PEP 649 native lazy annotations
Use `except A, B:` syntax (without parentheses) per PEP 758 — ruff enforces this on Python 3.14
All public functions must have type hints; use mypy strict mode for type-checking
Use Google-style docstrings on all public classes and functions; enforced by ruff D rules
Create new objects instead of mutating existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves; never mix static config fields with mutable runtime fields in one model
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use `@computed_field` for derived values instead of storing redundant fields; use NotBlankStr for all identifier/name fields (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Use line length of 88 characters (ruff)
Files:
- src/ai_company/observability/events/budget.py
- tests/unit/budget/test_optimizer_models.py
- src/ai_company/observability/events/cfo.py
- src/ai_company/budget/__init__.py
- src/ai_company/budget/optimizer_models.py
- tests/unit/budget/test_tracker_get_records.py
- src/ai_company/budget/tracker.py
- tests/unit/budget/test_optimizer.py
- tests/unit/budget/conftest.py
- tests/unit/observability/test_events.py
- src/ai_company/budget/reports.py
- tests/unit/budget/test_reports.py
- src/ai_company/budget/optimizer.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/ai_company/**/*.py: Every module with business logic must import and use `get_logger(__name__)` from ai_company.observability; never use import logging or logging.getLogger() or print() in application code
Always use 'logger' as the variable name (not '_logger', not 'log')
Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals
Use structured logging with logger.info(EVENT, key=value) — never use logger.info('msg %s', val) string formatting
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
Files:
- src/ai_company/observability/events/budget.py
- src/ai_company/observability/events/cfo.py
- src/ai_company/budget/__init__.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/tracker.py
- src/ai_company/budget/reports.py
- src/ai_company/budget/optimizer.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples; use generic names (example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small aliases)
Files:
- src/ai_company/observability/events/budget.py
- src/ai_company/observability/events/cfo.py
- src/ai_company/budget/__init__.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/tracker.py
- src/ai_company/budget/reports.py
- src/ai_company/budget/optimizer.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Mark tests with `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow`
Prefer `@pytest.mark.parametrize` for testing similar cases
In tests, use test-provider, test-small-001, etc. instead of real vendor names
Files:
- tests/unit/budget/test_optimizer_models.py
- tests/unit/budget/test_tracker_get_records.py
- tests/unit/budget/test_optimizer.py
- tests/unit/budget/conftest.py
- tests/unit/observability/test_events.py
- tests/unit/budget/test_reports.py
🧠 Learnings (7)
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Every module with business logic must import and use get_logger(__name__) from ai_company.observability; never use import logging or logging.getLogger() or print() in application code
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Always use 'logger' as the variable name (not '_logger', not 'log')
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals
Applied to files:
CLAUDE.md, src/ai_company/observability/events/cfo.py, DESIGN_SPEC.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Use structured logging with logger.info(EVENT, key=value) — never use logger.info('msg %s', val) string formatting
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All error paths must log at WARNING or ERROR with context before raising
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All state transitions must log at INFO level
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
Applied to files:
CLAUDE.md
🧬 Code graph analysis (8)
tests/unit/budget/test_optimizer_models.py (1)
- src/ai_company/budget/optimizer_models.py (12): AgentEfficiency (142-176), AnomalyDetectionResult (104-136), AnomalySeverity (34-39), AnomalyType (22-31), CostOptimizerConfig (332-381), DowngradeAnalysis (267-290), DowngradeRecommendation (231-264), EfficiencyAnalysis (179-225), EfficiencyRating (42-47), SpendingAnomaly (53-101), cost_per_1k_tokens (169-176), inefficient_agent_count (206-212)

src/ai_company/budget/__init__.py (3)
- src/ai_company/budget/optimizer.py (1): CostOptimizer (72-483)
- src/ai_company/budget/optimizer_models.py (11): AgentEfficiency (142-176), AnomalyDetectionResult (104-136), AnomalySeverity (34-39), AnomalyType (22-31), ApprovalDecision (296-326), CostOptimizerConfig (332-381), DowngradeAnalysis (267-290), DowngradeRecommendation (231-264), EfficiencyAnalysis (179-225), EfficiencyRating (42-47), SpendingAnomaly (53-101)
- src/ai_company/budget/reports.py (6): ModelDistribution (77-98), PeriodComparison (101-141), ProviderDistribution (55-74), ReportGenerator (209-331), SpendingReport (144-203), TaskSpending (37-52)

src/ai_company/budget/optimizer_models.py (1)
- src/ai_company/budget/enums.py (1): BudgetAlertLevel (6-16)

tests/unit/budget/test_tracker_get_records.py (2)
- src/ai_company/budget/tracker.py (2): get_records (185-225), record (99-112)
- tests/unit/budget/conftest.py (1): make_cost_record (286-307)

src/ai_company/budget/tracker.py (1)
- src/ai_company/budget/cost_record.py (1): CostRecord (15-56)

tests/unit/budget/conftest.py (4)
- src/ai_company/budget/optimizer.py (1): CostOptimizer (72-483)
- src/ai_company/budget/optimizer_models.py (1): CostOptimizerConfig (332-381)
- src/ai_company/budget/reports.py (1): ReportGenerator (209-331)
- src/ai_company/budget/enforcer.py (1): cost_tracker (90-92)

src/ai_company/budget/reports.py (4)
- src/ai_company/budget/spending_summary.py (1): SpendingSummary (102-161)
- src/ai_company/budget/config.py (1): BudgetConfig (151-227)
- src/ai_company/budget/cost_record.py (1): CostRecord (15-56)
- src/ai_company/budget/tracker.py (3): CostTracker (68-455), build_summary (227-281), get_records (185-225)

tests/unit/budget/test_reports.py (1)
- src/ai_company/budget/reports.py (9): ModelDistribution (77-98), PeriodComparison (101-141), ProviderDistribution (55-74), ReportGenerator (209-331), SpendingReport (144-203), TaskSpending (37-52), cost_change_usd (125-130), cost_change_percent (134-141), generate_report (229-306)
🪛 LanguageTool
README.md
[typographical] ~26-~26: To join two clauses or introduce examples, consider using an em dash.
Context: ...n failures - Budget Enforcement (M5) - BudgetEnforcer service with pre-flight...
(DASH_RULE)
CLAUDE.md
[style] ~86-~86: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...
(EG_NO_COMMA)
🔇 Additional comments (40)
src/ai_company/observability/events/budget.py (1)

32-33: LGTM! The new `BUDGET_RECORDS_QUERIED` event constant follows the established pattern: `Final[str]` typing and `domain.subject.qualifier` naming convention consistent with other budget events.

src/ai_company/budget/tracker.py (1)

185-225: LGTM! The new `get_records()` method follows established patterns in this class:
- Validates time range via `_validate_time_range`
- Uses structured logging with event constant at DEBUG level
- Returns immutable `tuple[CostRecord, ...]` snapshot
- Consistent with `get_category_breakdown()`, which also filters by `agent_id` and `task_id`

src/ai_company/observability/events/cfo.py (1)

1-15: LGTM! Well-organized CFO event constants module following established patterns:
- All constants use `Final[str]` typing
- All values follow `cfo.subject.qualifier` naming convention
- Comprehensive coverage for optimizer lifecycle, anomaly detection, efficiency analysis, downgrades, approvals, and reports

Based on learnings: these event name constants from `ai_company.observability.events.cfo` should be used instead of string literals in business logic.

CLAUDE.md (2)

47-47: LGTM! The budget module description is accurately updated to reflect the new CFO cost optimization capabilities including anomaly detection, efficiency analysis, downgrade recommendations, approval decisions, and spending reports.

86-86: LGTM! Good addition of `CFO_ANOMALY_DETECTED` from `events.cfo` to the event names documentation example, ensuring developers know about the new CFO domain module for observability events.

tests/unit/observability/test_events.py (1)

179-179: LGTM! Correctly adds `"cfo"` to the expected domain modules set, ensuring the test validates that the new CFO events module is properly discoverable by `pkgutil`.

README.md (1)

26-26: LGTM! The Budget Enforcement description is accurately updated to reflect the new CFO capabilities:
- `CostOptimizer` CFO service with anomaly detection, efficiency analysis, downgrade recommendations, and approval decisions
- `ReportGenerator` for multi-dimensional spending reports

The formatting is consistent with the rest of the document.
tests/unit/budget/test_optimizer_models.py (9)
1-20: LGTM! Well-structured test module with proper imports and organization. Test coverage spans all CFO optimizer domain models including enums, data classes, validators, and computed fields.

25-51: LGTM! Enum tests verify both string values and member counts, ensuring the enum definitions remain stable.

56-109: LGTM! `SpendingAnomaly` tests comprehensively cover:
- Construction with all required fields
- Frozen model immutability
- Period ordering validation (period_start must be before period_end)

114-136: LGTM! `AnomalyDetectionResult` tests cover empty results and period ordering validation, consistent with the model's constraints.

141-178: LGTM! `AgentEfficiency` tests validate:
- Basic construction
- Zero-token edge case (cost_per_1k_tokens returns 0.0)
- Computed field derivation for cost_per_1k_tokens

183-229: LGTM! `EfficiencyAnalysis` tests cover empty analysis, computed `inefficient_agent_count`, and period ordering validation.

234-273: LGTM! `DowngradeRecommendation` and `DowngradeAnalysis` tests verify construction, immutability, and empty analysis handling. Uses `test-large-001`/`test-small-001` per coding guidelines (no real vendor names).

278-315: LGTM! `ApprovalDecision` tests cover approved/denied states, alert levels, and optional conditions tuple.

320-395: LGTM! `CostOptimizerConfig` tests comprehensively validate:
- Default values
- Custom value acceptance
- Constraint enforcement (sigma > 0, spike_factor > 1, inefficiency_factor > 1, min_anomaly_windows >= 2)
- Frozen model immutability
- Validator tests for `DowngradeRecommendation` (same model rejection, zero savings rejection)

tests/unit/budget/test_reports.py (5)
1-35: LGTM! Well-organized test module with clean helper functions. The `_make_report_generator` factory creates fresh `CostTracker` and `ReportGenerator` instances for isolated test execution.

40-87: LGTM! Report model tests verify construction and immutability for `TaskSpending`, `ProviderDistribution`, and `ModelDistribution`. Uses generic provider/model names per coding guidelines.

89-122: LGTM! `PeriodComparison` tests comprehensively cover:
- Cost increase (positive change)
- Cost decrease (negative change)
- No previous data (percent is None)
- Equal periods (zero change)

127-348: LGTM! `ReportGenerator` tests provide excellent coverage:
- Initialization verification
- Empty/no records scenario
- Multiple agents/tasks aggregation
- Provider/model distribution percentages
- Period comparison (increase, decrease, no prior data, skip)
- Top-N agents/tasks with proper sorting
- Input validation (top_n < 1, start after end)

353-398: LGTM! `SpendingReport` validator tests verify that `top_agents_by_cost` and `top_tasks_by_cost` must be sorted in descending order by cost, with both acceptance and rejection cases.

src/ai_company/budget/optimizer_models.py (10)
1-18: LGTM! The module docstring follows Google style, imports are clean, and the file correctly avoids `from __future__ import annotations` per coding guidelines. The `noqa` comments for TC001/TC003 are appropriate for runtime Pydantic requirements.

22-48: LGTM! Enum definitions are clean with appropriate docstrings. Good practice to document that `SUSTAINED_HIGH` and `RATE_INCREASE` are reserved for future detection algorithms.

53-102: LGTM! The `SpendingAnomaly` model is well-designed with proper constraints (`ge=0.0` for non-negative values), `NotBlankStr` for identifiers, and a cross-field validator ensuring temporal ordering. The edge case of `deviation_factor=0.0` when the baseline is zero is properly documented.

104-137: LGTM! The `AnomalyDetectionResult` model correctly uses an immutable tuple for anomalies with a sensible empty default. The period ordering validator follows the same pattern as `SpendingAnomaly`, maintaining consistency.

142-177: LGTM! The `AgentEfficiency` model correctly uses `@computed_field` for the derived `cost_per_1k_tokens` value, handles division by zero gracefully, and applies consistent rounding via `BUDGET_ROUNDING_PRECISION`.

179-226: LGTM! The `EfficiencyAnalysis` model properly uses `@computed_field` for `inefficient_agent_count`, maintains consistent period ordering validation, and follows the established patterns from other models in this file.

231-265: LGTM! The `DowngradeRecommendation` model enforces meaningful recommendations with `gt=0.0` for savings and a validator ensuring the current and recommended models differ. Good defensive design.

267-291: LGTM! The `DowngradeAnalysis` model is a clean aggregation container with appropriate non-negative constraints.

296-327: LGTM! The `ApprovalDecision` model correctly allows negative `budget_remaining_usd` for over-budget scenarios (well-documented). Good use of `tuple[NotBlankStr, ...]` for conditions to ensure non-blank approval conditions.

332-381: LGTM! The `CostOptimizerConfig` model has well-reasoned constraints: `gt=1.0` for factors that must exceed baseline, `ge=2` for minimum windows ensuring meaningful statistical comparison, and sensible defaults aligned with typical anomaly detection practices.

src/ai_company/budget/reports.py (9)
1-32: LGTM! The module follows coding guidelines: uses `get_logger(__name__)` with the `logger` variable name, imports the event constant `CFO_REPORT_GENERATED` from the events module, and properly uses `TYPE_CHECKING` for type-only imports.

37-53: LGTM! The `TaskSpending` model is clean with appropriate constraints and follows established patterns from the codebase.

55-75: LGTM! The `ProviderDistribution` model properly constrains `percentage_of_total` to the valid range [0.0, 100.0].

77-99: LGTM! The `ModelDistribution` model maintains consistency with `ProviderDistribution` while adding the model-provider relationship.

101-142: LGTM! The `PeriodComparison` model correctly uses `@computed_field` for derived values. The `<= 0` check on line 136 is appropriately defensive (even though the `ge=0.0` constraint ensures non-negative values, it guards against division by zero).

187-204: LGTM! The ranking validators correctly ensure descending order for both top agents and top tasks, maintaining data integrity.

229-306: LGTM! The `generate_report` method validates inputs at the system boundary, uses structured logging with the event constant `CFO_REPORT_GENERATED`, and follows a clear workflow. Good separation between data fetching, aggregation, and report assembly.

308-331: LGTM! The period comparison calculation correctly computes the previous period without overlap. The early return when both periods have zero cost avoids generating meaningless comparisons.

337-452: LGTM! The helper functions are clean and follow best practices:
- `math.fsum` for precise float aggregation
- Consistent use of `BUDGET_ROUNDING_PRECISION`
- Deterministic output ordering via `sorted()`
- Proper type hints with `Sequence` for input flexibility
Pull request overview
Adds the CFO "CostOptimizer" analytics layer and reporting capabilities on top of the existing budget tracking/enforcement stack, aligning with DESIGN_SPEC §10.3 and extending observability coverage for CFO/budget analytics events.

Changes:
- Introduces `CostOptimizer` service + domain models for anomaly detection, efficiency analysis, downgrade recommendations, and operation approval decisions.
- Adds `ReportGenerator` service and report models for multi-dimensional spending breakdowns and period-over-period comparisons.
- Extends `CostTracker` with a `get_records()` query API, adds new observability event constants, and adds extensive unit test coverage.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/unit/observability/test_events.py | Updates expected domain modules to include cfo events domain. |
| tests/unit/budget/test_tracker_get_records.py | Adds unit tests for new CostTracker.get_records() filtering semantics. |
| tests/unit/budget/test_reports.py | Adds unit tests for ReportGenerator and report model validators/computed fields. |
| tests/unit/budget/test_optimizer_models.py | Adds unit tests for optimizer Pydantic models/enums/validators/computed fields. |
| tests/unit/budget/test_optimizer.py | Adds unit tests for CostOptimizer anomaly detection, efficiency, downgrades, and approvals. |
| tests/unit/budget/conftest.py | Adds fixtures/factories for optimizer + report generator. |
| src/ai_company/observability/events/cfo.py | Introduces CFO event constants for structured logging. |
| src/ai_company/observability/events/budget.py | Adds BUDGET_RECORDS_QUERIED event constant. |
| src/ai_company/budget/tracker.py | Adds get_records() API and logs record queries via new event constant. |
| src/ai_company/budget/reports.py | Implements report models + ReportGenerator service. |
| src/ai_company/budget/optimizer_models.py | Implements frozen optimizer domain models + config. |
| src/ai_company/budget/optimizer.py | Implements CostOptimizer service and pure helper functions. |
| src/ai_company/budget/init.py | Re-exports optimizer/report services and models from the budget package. |
| README.md | Updates “Budget Enforcement (M5)” description to include CFO optimizer/reporting. |
| DESIGN_SPEC.md | Documents the new M5 implementation note and updates project tree entries. |
| CLAUDE.md | Updates package structure/logging guidance to include CFO optimizer/events. |
```python
async def evaluate_operation(
    self,
    *,
    agent_id: str,
    estimated_cost_usd: float,
    now: datetime | None = None,
) -> ApprovalDecision:
```
`evaluate_operation()` accepts `estimated_cost_usd` without validating that it is non-negative. A negative estimate can reduce `projected_cost` and incorrectly approve operations (or skip high-cost conditions). Add explicit input validation (e.g., raise `ValueError` when `estimated_cost_usd < 0`).
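A minimal sketch of the suggested boundary guard (the standalone function and its message text are illustrative, not the project's actual implementation; the real service would also log at WARNING with context before raising, per the repo conventions):

```python
def validate_estimated_cost(estimated_cost_usd: float) -> float:
    """Reject negative cost estimates at the system boundary."""
    if estimated_cost_usd < 0:
        # A real implementation would emit a structured WARNING log here first
        msg = f"estimated_cost_usd must be >= 0, got {estimated_cost_usd}"
        raise ValueError(msg)
    return estimated_cost_usd
```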
src/ai_company/budget/optimizer.py (outdated)
```python
severity=severity,
description=(
    f"Agent {agent_id!r} spent ${current:.2f} vs "
    f"${mean:.2f} baseline ({deviation:.1f} sigma)"
),
current_value=current,
baseline_value=round(mean, BUDGET_ROUNDING_PRECISION),
deviation_factor=round(deviation, BUDGET_ROUNDING_PRECISION),
detected_at=now,
```
When `stddev == 0` but a spike is detected, the anomaly description still reports "(0.0 sigma)" and `deviation_factor` is forced to 0.0, which is misleading (sigma deviation is undefined in this case). Consider adjusting the message/fields for the `stddev == 0` path (e.g., report the spike ratio instead of sigma, and/or make the stored deviation metric consistent with what severity is based on).
Greptile Summary

This PR introduces the CFO cost-optimization subsystem for the budget module.

Key findings from this review:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Caller
participant CostOptimizer
participant CostTracker
participant ReportGenerator
Note over CostOptimizer: detect_anomalies / analyze_efficiency / recommend_downgrades
Caller->>CostOptimizer: detect_anomalies(start, end, window_count)
CostOptimizer->>CostTracker: get_records(start, end)
CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
CostOptimizer->>CostOptimizer: _group_records_by_agent()
loop per agent
CostOptimizer->>CostOptimizer: _compute_window_costs()
CostOptimizer->>CostOptimizer: _detect_spike_anomaly()
end
CostOptimizer-->>Caller: AnomalyDetectionResult
Note over CostOptimizer: recommend_downgrades (parallel fetch)
Caller->>CostOptimizer: recommend_downgrades(start, end)
par asyncio.TaskGroup
CostOptimizer->>CostTracker: get_records(start, end)
CostOptimizer->>CostTracker: get_total_cost(billing_period_start)
end
CostTracker-->>CostOptimizer: records + budget_pressure
CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
CostOptimizer->>CostOptimizer: _build_recommendations()
CostOptimizer-->>Caller: DowngradeAnalysis
Note over CostOptimizer: evaluate_operation
Caller->>CostOptimizer: evaluate_operation(agent_id, estimated_cost_usd)
alt total_monthly <= 0
CostOptimizer-->>Caller: ApprovalDecision(approved=True, enforcement_disabled)
else budget active
CostOptimizer->>CostTracker: get_total_cost(period_start)
CostTracker-->>CostOptimizer: monthly_cost
CostOptimizer->>CostOptimizer: _check_denial(projected_alert)
alt denied
CostOptimizer-->>Caller: ApprovalDecision(approved=False)
else approved
CostOptimizer->>CostOptimizer: _build_approval_conditions()
CostOptimizer-->>Caller: ApprovalDecision(approved=True, conditions)
end
end
Note over ReportGenerator: generate_report (sequential — asyncio.TaskGroup pending)
Caller->>ReportGenerator: generate_report(start, end, top_n)
ReportGenerator->>CostTracker: get_records(start, end)
CostTracker-->>ReportGenerator: records snapshot 1
ReportGenerator->>CostTracker: build_summary(start, end)
CostTracker-->>ReportGenerator: summary snapshot 2
ReportGenerator->>ReportGenerator: _build_task/provider/model distributions (from records)
ReportGenerator->>ReportGenerator: _build_top_agents (from summary ⚠️ different snapshot)
ReportGenerator-->>Caller: SpendingReport
Last reviewed commit: f909c79
Greptile Summary

This PR delivers the CFO cost optimization layer for the budget module.

Key findings:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Caller
participant CostOptimizer
participant ReportGenerator
participant CostTracker
participant BudgetConfig
Note over CostOptimizer: detect_anomalies()
Caller->>CostOptimizer: detect_anomalies(start, end, window_count)
CostOptimizer->>CostTracker: get_records(start, end)
CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
CostOptimizer->>CostOptimizer: _compute_window_costs() per agent
CostOptimizer->>CostOptimizer: _detect_spike_anomaly() per agent
CostOptimizer-->>Caller: AnomalyDetectionResult
Note over CostOptimizer: recommend_downgrades()
Caller->>CostOptimizer: recommend_downgrades(start, end)
CostOptimizer->>CostTracker: get_records(start, end)
CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
CostOptimizer->>CostTracker: get_total_cost(start=period_start)
CostTracker-->>CostOptimizer: monthly_cost
CostOptimizer->>BudgetConfig: auto_downgrade.downgrade_map
CostOptimizer->>CostOptimizer: _build_downgrade_recommendation() per agent
CostOptimizer-->>Caller: DowngradeAnalysis
Note over CostOptimizer: evaluate_operation()
Caller->>CostOptimizer: evaluate_operation(agent_id, estimated_cost_usd)
CostOptimizer->>CostTracker: get_total_cost(start=period_start)
CostTracker-->>CostOptimizer: monthly_cost
CostOptimizer->>CostOptimizer: _compute_alert_level()
CostOptimizer-->>Caller: ApprovalDecision
Note over ReportGenerator: generate_report()
Caller->>ReportGenerator: generate_report(start, end, top_n)
ReportGenerator->>CostTracker: build_summary(start, end)
CostTracker-->>ReportGenerator: SpendingSummary (snapshot 1)
ReportGenerator->>CostTracker: get_records(start, end)
CostTracker-->>ReportGenerator: tuple[CostRecord, ...] (snapshot 2)
ReportGenerator->>ReportGenerator: _build_task_spendings()
ReportGenerator->>ReportGenerator: _build_provider_distribution()
ReportGenerator->>ReportGenerator: _build_model_distribution()
ReportGenerator->>CostTracker: build_summary(prev_start, prev_end)
CostTracker-->>ReportGenerator: prev SpendingSummary
ReportGenerator-->>Caller: SpendingReport
Last reviewed commit: 9048bf8
Greptile Summary

This PR implements the CFO cost optimization layer for the budget module.

Key changes:
Issues found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Caller
participant CostOptimizer
participant ReportGenerator
participant CostTracker
participant BudgetConfig
participant ModelResolver
Note over Caller,ModelResolver: detect_anomalies()
Caller->>CostOptimizer: detect_anomalies(start, end, window_count)
CostOptimizer->>CostTracker: get_records(start, end)
CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
CostOptimizer->>CostOptimizer: _compute_window_costs() per agent
CostOptimizer->>CostOptimizer: _detect_spike_anomaly() per agent
CostOptimizer-->>Caller: AnomalyDetectionResult
Note over Caller,ModelResolver: analyze_efficiency()
Caller->>CostOptimizer: analyze_efficiency(start, end)
CostOptimizer->>CostTracker: get_records(start, end)
CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
CostOptimizer-->>Caller: EfficiencyAnalysis
Note over Caller,ModelResolver: recommend_downgrades()
Caller->>CostOptimizer: recommend_downgrades(start, end)
CostOptimizer->>CostTracker: get_records(start, end)
CostTracker-->>CostOptimizer: tuple[CostRecord, ...]
CostOptimizer->>CostOptimizer: _build_efficiency_from_records()
CostOptimizer->>CostTracker: get_total_cost(period_start) [budget pressure]
CostOptimizer->>ModelResolver: resolve_safe(model) + all_models_sorted_by_cost()
CostOptimizer-->>Caller: DowngradeAnalysis
Note over Caller,ModelResolver: evaluate_operation()
Caller->>CostOptimizer: evaluate_operation(agent_id, estimated_cost, now)
CostOptimizer->>BudgetConfig: read total_monthly, alerts, reset_day
CostOptimizer->>CostTracker: get_total_cost(period_start)
CostTracker-->>CostOptimizer: monthly_cost
CostOptimizer->>CostOptimizer: _compute_alert_level()
CostOptimizer-->>Caller: ApprovalDecision
Note over Caller,ModelResolver: generate_report()
Caller->>ReportGenerator: generate_report(start, end, top_n)
ReportGenerator->>CostTracker: build_summary(start, end) [snapshot #1]
CostTracker-->>ReportGenerator: SpendingSummary
ReportGenerator->>CostTracker: get_records(start, end) [snapshot #2]
CostTracker-->>ReportGenerator: tuple[CostRecord, ...]
ReportGenerator->>CostTracker: build_summary(prev_start, prev_end) [period comparison]
CostTracker-->>ReportGenerator: SpendingSummary
ReportGenerator-->>Caller: SpendingReport
- Add routing optimization feature (#1): new `suggest_routing_optimizations()` method, `RoutingSuggestion` and `RoutingOptimizationAnalysis` models
- Add negative `estimated_cost_usd` validation (#2)
- Fix double snapshot in `generate_report` (#3)
- Fix `deviation_factor` to use `spike_ratio` when stddev=0 (#4)
- Convert `DowngradeAnalysis.total_estimated_savings_per_1k` to `@computed_field` (#5)
- Change `str` to `NotBlankStr` in `SpendingReport` tuple fields (#6)
- Add `window_count` upper bound validation (#7)
- Pre-group records by agent for O(N+M) complexity (#8)
- Update DESIGN_SPEC.md implementation snapshot (#9)
- Use projected alert level for auto-deny check (#11)
- Move approval log after `ApprovalDecision` construction (#12)
- Add `ReportGenerator.__init__` debug log + event constant (#13)
- Fix `_ALERT_LEVEL_ORDER` comment (#14)
- Fix `_classify_severity` docstring for dual-use (#15)
- Add WARNING logs before `ValueError` raises (#16)
- Update `evaluate_operation` docstring (#17)
- Add sort-order validator to `EfficiencyAnalysis.agents` (#18)
- Add debug log when `_find_most_used_model` returns None (#19)
- Remove redundant `stddev > 0` check in `is_sigma_anomaly` (#20)
- Document `approval_warn_threshold_usd=0.0` behavior (#21)
- Extract helpers to `_optimizer_helpers.py` to stay under 800-line limit
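Item #8 above (pre-grouping records by agent) is a standard single-pass bucketing; a minimal sketch, using plain dicts as stand-ins for the project's `CostRecord` objects:

```python
from collections import defaultdict

def group_records_by_agent(records: list[dict]) -> dict[str, list[dict]]:
    """Bucket records in one pass: O(N + M) instead of one scan per agent."""
    by_agent: defaultdict[str, list[dict]] = defaultdict(list)
    for rec in records:
        by_agent[rec["agent_id"]].append(rec)
    return dict(by_agent)
```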
Actionable comments posted: 11
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 1848-1853: The M5 implementation note omits routing optimization;
update the description for CostOptimizer in budget/optimizer.py (the
"CostOptimizer" service) to include routing optimization suggestions alongside
anomaly detection, per-agent efficiency analysis, model downgrade
recommendations (via ModelResolver), and operation approval evaluation, and also
mention that ReportGenerator (budget/reports.py) includes routing-aware
breakdowns in its multi-dimensional spending reports and period-over-period
comparisons so the CFO feature summary remains current.
In `@src/ai_company/budget/_optimizer_helpers.py`:
- Around line 290-300: The code incorrectly accepts a cheaper model solely by
price even if it reduces context size; update the block that calls
_find_cheaper_model (the branch setting target_ref when target_ref is None) to
validate that the returned cheaper model's max_context
(cheaper.model_info.max_context or resolved info via resolver) is >=
current_resolved.model_info.max_context (or the routing-required context), and
if not, treat it as unavailable: log CFO_DOWNGRADE_SKIPPED with reason
"no_cheaper_model_preserving_context" and return None. Apply the same check to
the analogous branch around lines 340-349 so downgrades never pick models that
shrink capability.
- Around line 110-190: The _detect_spike_anomaly function is too large and mixes
validation, zero-baseline handling, threshold evaluation, severity
classification, and SpendingAnomaly construction; refactor by splitting it into
small helpers (e.g., _validate_windows(agent_id, window_costs, config),
_handle_zero_baseline(agent_id, current, now, window_starts, window_duration),
_evaluate_spike_and_sigma(historical, current, config) which returns (is_spike,
is_sigma_anomaly, spike_ratio, deviation, stddev), and
_build_spending_anomaly(agent_id, current, mean, effective_deviation, severity,
now, window_starts, window_duration)). Keep existing behavior and return values
(use _classify_severity for severity, round baseline_value and deviation_factor
per BUDGET_ROUNDING_PRECISION, and preserve SpendingAnomaly fields), then
simplify _detect_spike_anomaly to call these helpers in sequence so the
top-level function is under 50 lines.
In `@src/ai_company/budget/optimizer.py`:
- Around line 338-339: The code repeatedly calls _find_most_used_model(records,
agent.agent_id) and rescans the whole window per agent; instead use the existing
by_agent grouping within suggest_routing_optimizations to avoid O(agent_count ×
record_count). Change the call sites (including the similar block around lines
423-429) to pass only that agent's records (e.g., by_agent[agent.agent_id]) or
refactor _find_most_used_model to accept an agent-specific records list so the
function scans only that subset; update references to most_used_model
accordingly.
- Around line 550-565: The approval path currently uses current values
(used_pct, alert_level) to build conditions, budget_used_percent, and the INFO
log, which misses when the proposed spend crosses thresholds; update the
approval branch that constructs conditions and the
budget_used_percent/alert_level logging to use projected_pct and projected_alert
when projected_alert > alert_level (i.e., crossing into a higher alert),
otherwise keep the current values; reference the computed names projected_pct,
projected_alert, used_pct, alert_level and the helper _compute_alert_level so
you locate the logic that assembles conditions and logs, and apply the same
change in the analogous block around projected_pct/projected_alert at the other
location (lines noted in review).
- Around line 376-378: The recommendation logic currently only checks cost and
max_context and ignores latency; update the candidate filter in the
recommendation function (the code that compares cost and max_context using
estimated_latency_ms from the model resolver) to enforce a latency guard: when
both the source model and candidate expose estimated_latency_ms, skip any
candidate whose estimated_latency_ms exceeds the source estimated_latency_ms
multiplied by a configurable max_latency_ratio (e.g., 1.1) or a hard threshold,
and surface that decision in the returned suggestion metadata; add a small
unit-test or example to cover the case where a cheaper model is rejected due to
higher latency and document the new max_latency_ratio configuration.
- Around line 301-309: The early-return branch that fires when
self._model_resolver is None currently returns DowngradeAnalysis with
budget_pressure_percent=0.0 which is wrong; change it to compute the actual
budget pressure using the same logic used elsewhere (reuse the existing helper
that calculates budget pressure—e.g., compute_budget_pressure /
_calculate_budget_pressure / similar budget pressure function used by this
class) and pass that real value into DowngradeAnalysis while still returning
empty recommendations; keep the CFO_RESOLVER_MISSING warning but replace the
hard-coded 0.0 with the computed budget_pressure_percent.
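The projected-alert finding above (use projected values when the proposed spend crosses into a higher level) can be sketched like this. The level names and thresholds are illustrative, not the project's `BudgetAlertLevel` values:

```python
_LEVELS = [("critical", 95.0), ("warning", 80.0), ("notice", 50.0)]  # most severe first

def alert_level(pct: float) -> str:
    """Highest threshold the utilization percentage meets, else 'ok'."""
    for name, threshold in _LEVELS:
        if pct >= threshold:
            return name
    return "ok"

def approval_context(used_pct: float, projected_pct: float) -> tuple[str, float]:
    """Report the projected level/percent when the spend crosses upward,
    otherwise keep the current values."""
    rank = {name: i for i, (name, _) in enumerate(_LEVELS)}
    rank["ok"] = len(_LEVELS)
    current, projected = alert_level(used_pct), alert_level(projected_pct)
    if rank[projected] < rank[current]:  # lower rank index = more severe
        return projected, projected_pct
    return current, used_pct
```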
In `@src/ai_company/budget/reports.py`:
- Around line 272-280: The current code awaits two separate tracker calls
(_cost_tracker.get_records and _cost_tracker.build_summary) which allows
intervening writes to cause summary to drift; instead generate the summary from
the same records snapshot (use the already-fetched variable records to compute
summary) or add/use a tracker helper that accepts a records snapshot (e.g., a
new method like build_summary_from_snapshot(records) on _cost_tracker) and
replace the build_summary call so that summary is derived from records, ensuring
by_task/by_provider/by_model/top_agents_by_cost remain consistent with records.
- Around line 263-268: In the two validation branches where you currently raise
ValueError for "start >= end" and "top_n < 1", add a WARNING-level CFO event log
(using the project's CFO event constant API) that emits the same context message
and includes the values of start, end, and top_n before raising; specifically,
in the branches surrounding the checks for start >= end and top_n < 1 (the
blocks that construct msg and raise ValueError), call the CFO warning/emitter
with the msg and any additional context fields (start.isoformat(),
end.isoformat(), top_n) so the warning is recorded via the CFO event constant
prior to raising the ValueError.
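The single-snapshot consistency fix suggested for `reports.py` above boils down to deriving every breakdown from one records snapshot. A minimal sketch, with `(agent_id, provider, cost_usd)` tuples as simplified stand-ins for `CostRecord`:

```python
from collections import defaultdict

def summarize(records: list[tuple[str, str, float]]) -> dict[str, object]:
    """Derive all breakdowns from ONE records snapshot so they always agree."""
    by_agent: defaultdict[str, float] = defaultdict(float)
    by_provider: defaultdict[str, float] = defaultdict(float)
    for agent_id, provider, cost_usd in records:
        by_agent[agent_id] += cost_usd
        by_provider[provider] += cost_usd
    total = sum(by_agent.values())
    return {
        "total": total,
        "by_agent": dict(by_agent),
        "by_provider": dict(by_provider),
    }
```

Because nothing re-reads the tracker between aggregations, `by_agent` and `by_provider` cannot drift relative to the total even if writes land concurrently.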
In `@tests/unit/budget/test_optimizer.py`:
- Around line 645-652: The test test_find_cheaper_model_picks_cheapest never
exercises _find_cheaper_model because recommend_downgrades returns early on
empty data; either seed an inefficient usage record before calling
recommend_downgrades so the _find_cheaper_model path runs and assert the chosen
cheaper model, or rename the test to reflect empty-state behavior. Concretely,
in the test that calls _make_resolver() and _make_optimizer(), add a
fixture/seeded record (matching whatever helper you use to insert records in
tests) representing an inefficient/high-cost model so recommend_downgrades
evaluates downgrades, then assert the returned recommendation target; otherwise
change the test name and expected assertion to indicate it verifies the
empty-data result from recommend_downgrades.
- Around line 1-900: The test module is too large; split it into smaller focused
test files by moving the related test classes into separate modules (e.g.,
tests/unit/budget/test_anomalies.py, test_efficiency.py, test_downgrades.py,
test_approval.py, test_routing.py). Extract shared helpers/constants (_START,
_END, _make_optimizer, _make_resolver, make_cost_record import) into a common
test helper or conftest (e.g., tests/unit/budget/test_helpers.py or reuse
tests/unit/budget/conftest.py) and update imports in each new file; preserve
pytest.mark.unit decorators and keep each test class (TestDetectAnomalies,
TestAnalyzeEfficiency, TestRecommendDowngrades, TestEvaluateOperation,
TestSuggestRoutingOptimizations, TestClassifySeverity, TestInputValidation,
TestEdgeCases) intact when moving so tests and references (CostOptimizer,
CostTracker, CostOptimizerConfig, BudgetConfig, ModelResolver, ResolvedModel,
_classify_severity) still resolve. Ensure no duplicate fixtures/names and run
pytest to verify imports and test discovery.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: f608e87d-6969-44a1-b81d-dca1fd84730f
📒 Files selected for processing (9)

- DESIGN_SPEC.md
- src/ai_company/budget/__init__.py
- src/ai_company/budget/_optimizer_helpers.py
- src/ai_company/budget/optimizer.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/reports.py
- src/ai_company/observability/events/cfo.py
- tests/unit/budget/test_optimizer.py
- tests/unit/budget/test_optimizer_models.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Do NOT use `from __future__ import annotations` — Python 3.14 has PEP 649 native lazy annotations
Use `except A, B:` syntax (without parentheses) per PEP 758 — ruff enforces this on Python 3.14
All public functions must have type hints; use mypy strict mode for type-checking
Use Google-style docstrings on all public classes and functions; enforced by ruff D rules
Create new objects instead of mutating existing ones; for non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType wrapping for read-only enforcement
For dict/list fields in frozen Pydantic models, rely on frozen=True for field reassignment prevention and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves; never mix static config fields with mutable runtime fields in one model
Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use `@computed_field` for derived values instead of storing redundant fields; use NotBlankStr for all identifier/name fields (including optional and tuple variants) instead of manual whitespace validators
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Keep functions under 50 lines and files under 800 lines
Handle errors explicitly, never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Use line length of 88 characters (ruff)
Files:
- tests/unit/budget/test_optimizer_models.py
- tests/unit/budget/test_optimizer.py
- src/ai_company/budget/optimizer.py
- src/ai_company/budget/_optimizer_helpers.py
- src/ai_company/observability/events/cfo.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/__init__.py
- src/ai_company/budget/reports.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Mark tests with `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow`
Prefer `@pytest.mark.parametrize` for testing similar cases
In tests, use test-provider, test-small-001, etc. instead of real vendor names
Files:
- tests/unit/budget/test_optimizer_models.py
- tests/unit/budget/test_optimizer.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/ai_company/**/*.py: Every module with business logic must import and use get_logger(name) from ai_company.observability; never use import logging or logging.getLogger() or print() in application code
Always use 'logger' as the variable name (not '_logger', not 'log')
Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals
Use structured logging with logger.info(EVENT, key=value) — never use logger.info('msg %s', val) string formatting
All error paths must log at WARNING or ERROR with context before raising
All state transitions must log at INFO level
Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
Files:
- src/ai_company/budget/optimizer.py
- src/ai_company/budget/_optimizer_helpers.py
- src/ai_company/observability/events/cfo.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/__init__.py
- src/ai_company/budget/reports.py
src/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Never use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples; use generic names (example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small aliases)
Files:
- src/ai_company/budget/optimizer.py
- src/ai_company/budget/_optimizer_helpers.py
- src/ai_company/observability/events/cfo.py
- src/ai_company/budget/optimizer_models.py
- src/ai_company/budget/__init__.py
- src/ai_company/budget/reports.py
🧠 Learnings (8)
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Always use event name constants from ai_company.observability.events domain modules (e.g., PROVIDER_CALL_START from events.provider) instead of string literals
Applied to files:
- DESIGN_SPEC.md
- src/ai_company/observability/events/cfo.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Keep functions under 50 lines and files under 800 lines
Applied to files:
src/ai_company/budget/optimizer.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls); prefer structured concurrency over bare create_task
Applied to files:
- src/ai_company/budget/optimizer.py
- src/ai_company/budget/reports.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All state transitions must log at INFO level
Applied to files:
src/ai_company/budget/optimizer.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : All error paths must log at WARNING or ERROR with context before raising
Applied to files:
src/ai_company/budget/optimizer.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves; never mix static config fields with mutable runtime fields in one model
Applied to files:
src/ai_company/budget/optimizer_models.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to **/*.py : Use Pydantic v2 (BaseModel, model_validator, computed_field, ConfigDict); use computed_field for derived values instead of storing redundant fields; use NotBlankStr for all identifier/name fields (including optional and tuple variants) instead of manual whitespace validators
Applied to files:
src/ai_company/budget/reports.py
📚 Learning: 2026-03-09T12:14:21.716Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T12:14:21.716Z
Learning: Applies to src/ai_company/**/*.py : Use DEBUG level logging for object creation, internal flow, and entry/exit of key functions
Applied to files:
src/ai_company/budget/reports.py
🧬 Code graph analysis (6)

tests/unit/budget/test_optimizer.py (10)
- src/ai_company/budget/_optimizer_helpers.py (1): _classify_severity (193-205)
- src/ai_company/budget/config.py (3): BudgetAlertConfig (15-62), BudgetConfig (151-227), AutoDowngradeConfig (65-148)
- src/ai_company/budget/enums.py (1): BudgetAlertLevel (6-16)
- src/ai_company/budget/optimizer_models.py (4): AnomalySeverity (34-39), AnomalyType (22-31), CostOptimizerConfig (346-397), EfficiencyRating (42-47)
- src/ai_company/budget/tracker.py (2): CostTracker (68-455), record (99-112)
- src/ai_company/providers/routing/models.py (1): ResolvedModel (9-52)
- src/ai_company/providers/routing/resolver.py (1): ModelResolver (25-205)
- tests/unit/budget/conftest.py (2): make_cost_record (286-307), cost_tracker (262-270)
- src/ai_company/budget/billing.py (1): billing_period_start (11-45)
- tests/unit/budget/test_reports.py (1): test_start_after_end_rejected (344-347)

src/ai_company/budget/optimizer.py (6)
- src/ai_company/budget/_optimizer_helpers.py (5): _build_efficiency_from_records (46-91), _classify_severity (193-205), _compute_window_costs (94-107), _find_most_used_model (239-255), _group_records_by_agent (367-374)
- src/ai_company/budget/tracker.py (2): get_records (185-225), get_total_cost (114-137)
- src/ai_company/budget/billing.py (1): billing_period_start (11-45)
- src/ai_company/budget/enums.py (1): BudgetAlertLevel (6-16)
- src/ai_company/budget/optimizer_models.py (8): DowngradeAnalysis (276-304), DowngradeRecommendation (240-273), EfficiencyAnalysis (179-234), EfficiencyRating (42-47), inefficient_agent_count (206-212), estimated_savings_per_1k (436-441), total_estimated_savings_per_1k (299-304), total_estimated_savings_per_1k (491-496)
- src/ai_company/providers/routing/resolver.py (4): ModelResolver (25-205), all_models (174-177), all_models_sorted_by_cost (179-189), resolve_safe (154-172)

src/ai_company/budget/_optimizer_helpers.py (6)
- src/ai_company/budget/enums.py (1): BudgetAlertLevel (6-16)
- src/ai_company/budget/optimizer_models.py (9): AgentEfficiency (142-176), AnomalySeverity (34-39), AnomalyType (22-31), DowngradeRecommendation (240-273), EfficiencyAnalysis (179-234), EfficiencyRating (42-47), SpendingAnomaly (53-101), cost_per_1k_tokens (169-176), estimated_savings_per_1k (436-441)
- src/ai_company/budget/config.py (1): BudgetConfig (151-227)
- src/ai_company/budget/cost_record.py (1): CostRecord (15-56)
- src/ai_company/providers/routing/models.py (2): ResolvedModel (9-52), total_cost_per_1k (50-52)
- src/ai_company/providers/routing/resolver.py (4): ModelResolver (25-205), resolve_safe (154-172), all_models (174-177), all_models_sorted_by_cost (179-189)

src/ai_company/budget/optimizer_models.py (1)
- src/ai_company/budget/enums.py (1): BudgetAlertLevel (6-16)

src/ai_company/budget/__init__.py (3)
- src/ai_company/budget/optimizer.py (1): CostOptimizer (76-665)
- src/ai_company/budget/optimizer_models.py (11): AgentEfficiency (142-176), AnomalyDetectionResult (104-136), AnomalySeverity (34-39), AnomalyType (22-31), ApprovalDecision (310-340), CostOptimizerConfig (346-397), DowngradeAnalysis (276-304), EfficiencyAnalysis (179-234), EfficiencyRating (42-47), RoutingOptimizationAnalysis (467-509), SpendingAnomaly (53-101)
- src/ai_company/budget/reports.py (6): ModelDistribution (80-101), PeriodComparison (104-144), ProviderDistribution (58-77), ReportGenerator (212-343), SpendingReport (147-206), TaskSpending (40-55)

src/ai_company/budget/reports.py (3)
- src/ai_company/budget/spending_summary.py (1): SpendingSummary (102-161)
- src/ai_company/budget/cost_record.py (1): CostRecord (15-56)
- src/ai_company/budget/tracker.py (3): CostTracker (68-455), get_records (185-225), build_summary (227-281)
def _detect_spike_anomaly(  # noqa: PLR0913
    agent_id: str,
    window_costs: tuple[float, ...],
    now: datetime,
    window_starts: tuple[datetime, ...],
    window_duration: timedelta,
    config: CostOptimizerConfig,
) -> SpendingAnomaly | None:
    """Detect a spike anomaly for a single agent.

    Returns ``None`` if no anomaly is detected or insufficient data.
    """
    if len(window_costs) < config.min_anomaly_windows:
        logger.debug(
            CFO_INSUFFICIENT_WINDOWS,
            agent_id=agent_id,
            window_count=len(window_costs),
            min_required=config.min_anomaly_windows,
        )
        return None

    historical = window_costs[:-1]
    current = window_costs[-1]

    if current == 0.0:
        return None

    mean = statistics.mean(historical)

    if mean == 0.0:
        # No historical spending -- spike from zero (current > 0 per guard)
        return SpendingAnomaly(
            agent_id=agent_id,
            anomaly_type=AnomalyType.SPIKE,
            severity=AnomalySeverity.HIGH,
            description=(
                f"Agent {agent_id!r} went from $0.00 baseline "
                f"to ${current:.2f} in the latest window"
            ),
            current_value=current,
            baseline_value=0.0,
            deviation_factor=0.0,
            detected_at=now,
            period_start=window_starts[-1],
            period_end=window_starts[-1] + window_duration,
        )

    # Check spike factor (independent of stddev)
    spike_ratio = current / mean
    is_spike = spike_ratio > config.anomaly_spike_factor

    # Check sigma threshold
    stddev = statistics.stdev(historical) if len(historical) > 1 else 0.0
    deviation = (current - mean) / stddev if stddev > 0 else 0.0
    is_sigma_anomaly = deviation > config.anomaly_sigma_threshold

    if not is_spike and not is_sigma_anomaly:
        return None

    # When stddev is zero, use the spike ratio for severity classification
    classification_value = spike_ratio if is_spike and stddev == 0.0 else deviation
    severity = _classify_severity(classification_value)

    # Use spike_ratio as deviation_factor when stddev is zero
    effective_deviation = spike_ratio if stddev == 0.0 else deviation

    return SpendingAnomaly(
        agent_id=agent_id,
        anomaly_type=AnomalyType.SPIKE,
        severity=severity,
        description=(
            f"Agent {agent_id!r} spent ${current:.2f} vs "
            f"${mean:.2f} baseline ({effective_deviation:.1f}x)"
        ),
        current_value=current,
        baseline_value=round(mean, BUDGET_ROUNDING_PRECISION),
        deviation_factor=round(effective_deviation, BUDGET_ROUNDING_PRECISION),
        detected_at=now,
        period_start=window_starts[-1],
        period_end=window_starts[-1] + window_duration,
    )
🛠️ Refactor suggestion | 🟠 Major
Split _detect_spike_anomaly again.
This helper still bundles validation, zero-baseline handling, threshold evaluation, severity mapping, and model construction into one 80+ line block. Breaking those branches into smaller helpers will keep the anomaly logic easier to audit and bring the function back under the repo's function-size limit.
As per coding guidelines, "Keep functions under 50 lines and files under 800 lines".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/budget/_optimizer_helpers.py` around lines 110 - 190, The
_detect_spike_anomaly function is too large and mixes validation, zero-baseline
handling, threshold evaluation, severity classification, and SpendingAnomaly
construction; refactor by splitting it into small helpers (e.g.,
_validate_windows(agent_id, window_costs, config),
_handle_zero_baseline(agent_id, current, now, window_starts, window_duration),
_evaluate_spike_and_sigma(historical, current, config) which returns (is_spike,
is_sigma_anomaly, spike_ratio, deviation, stddev), and
_build_spending_anomaly(agent_id, current, mean, effective_deviation, severity,
now, window_starts, window_duration)). Keep existing behavior and return values
(use _classify_severity for severity, round baseline_value and deviation_factor
per BUDGET_ROUNDING_PRECISION, and preserve SpendingAnomaly fields), then
simplify _detect_spike_anomaly to call these helpers in sequence so the
top-level function is under 50 lines.
tests/unit/budget/test_optimizer.py
Outdated
| """Tests for CostOptimizer service.""" | ||
|
|
||
| from datetime import UTC, datetime, timedelta | ||
|
|
||
| import pytest | ||
|
|
||
| from ai_company.budget._optimizer_helpers import _classify_severity | ||
| from ai_company.budget.config import BudgetAlertConfig, BudgetConfig | ||
| from ai_company.budget.enums import BudgetAlertLevel | ||
| from ai_company.budget.optimizer import CostOptimizer | ||
| from ai_company.budget.optimizer_models import ( | ||
| AnomalySeverity, | ||
| AnomalyType, | ||
| CostOptimizerConfig, | ||
| EfficiencyRating, | ||
| ) | ||
| from ai_company.budget.tracker import CostTracker | ||
| from ai_company.providers.routing.models import ResolvedModel | ||
| from ai_company.providers.routing.resolver import ModelResolver | ||
| from tests.unit.budget.conftest import make_cost_record | ||
|
|
||
| # ── Helpers ─────────────────────────────────────────────────────── | ||
|
|
||
| _START = datetime(2026, 2, 1, tzinfo=UTC) | ||
| _END = datetime(2026, 3, 1, tzinfo=UTC) | ||
|
|
||
|
|
||
| def _make_optimizer( | ||
| *, | ||
| budget_config: BudgetConfig | None = None, | ||
| config: CostOptimizerConfig | None = None, | ||
| model_resolver: ModelResolver | None = None, | ||
| ) -> tuple[CostOptimizer, CostTracker]: | ||
| """Build a CostOptimizer with a fresh CostTracker.""" | ||
| bc = budget_config or BudgetConfig(total_monthly=100.0) | ||
| tracker = CostTracker(budget_config=bc) | ||
| optimizer = CostOptimizer( | ||
| cost_tracker=tracker, | ||
| budget_config=bc, | ||
| config=config, | ||
| model_resolver=model_resolver, | ||
| ) | ||
| return optimizer, tracker | ||
|
|
||
|
|
||
| def _make_resolver( | ||
| models: list[ResolvedModel] | None = None, | ||
| ) -> ModelResolver: | ||
| """Build a ModelResolver from a list of ResolvedModel.""" | ||
| if models is None: | ||
| models = [ | ||
| ResolvedModel( | ||
| provider_name="test-provider", | ||
| model_id="test-large-001", | ||
| alias="large", | ||
| cost_per_1k_input=0.03, | ||
| cost_per_1k_output=0.06, | ||
| ), | ||
| ResolvedModel( | ||
| provider_name="test-provider", | ||
| model_id="test-medium-001", | ||
| alias="medium", | ||
| cost_per_1k_input=0.01, | ||
| cost_per_1k_output=0.02, | ||
| ), | ||
| ResolvedModel( | ||
| provider_name="test-provider", | ||
| model_id="test-small-001", | ||
| alias="small", | ||
| cost_per_1k_input=0.001, | ||
| cost_per_1k_output=0.002, | ||
| ), | ||
| ] | ||
| index: dict[str, ResolvedModel] = {} | ||
| for m in models: | ||
| index[m.model_id] = m | ||
| if m.alias is not None: | ||
| index[m.alias] = m | ||
| return ModelResolver(index) | ||
|
|
||
|
|
||
# ── Init Tests ────────────────────────────────────────────────────


@pytest.mark.unit
class TestInit:
    async def test_defaults(self) -> None:
        optimizer, _ = _make_optimizer()
        assert optimizer._config == CostOptimizerConfig()

    async def test_custom_config(self) -> None:
        cfg = CostOptimizerConfig(anomaly_sigma_threshold=3.0)
        optimizer, _ = _make_optimizer(config=cfg)
        assert optimizer._config.anomaly_sigma_threshold == 3.0


# ── Anomaly Detection Tests ──────────────────────────────────────


@pytest.mark.unit
class TestDetectAnomalies:
    async def test_no_records_empty_result(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert result.anomalies == ()
        assert result.agents_scanned == 0

    async def test_normal_spending_no_anomalies(self) -> None:
        optimizer, tracker = _make_optimizer()
        # Create uniform spending across 5 windows
        window_duration = (_END - _START) / 5
        for i in range(5):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert result.anomalies == ()
        assert result.agents_scanned == 1

    async def test_spike_detected(self) -> None:
        optimizer, tracker = _make_optimizer()
        window_duration = (_END - _START) / 5

        # Normal spending in first 4 windows
        for i in range(4):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        # Spike in last window
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="alice", cost_usd=20.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        anomaly = result.anomalies[0]
        assert anomaly.agent_id == "alice"
        assert anomaly.anomaly_type == AnomalyType.SPIKE
        assert anomaly.current_value == 20.0

    async def test_insufficient_windows_no_false_positive(self) -> None:
        config = CostOptimizerConfig(min_anomaly_windows=5)
        optimizer, tracker = _make_optimizer(config=config)

        # Only 3 windows of data in a 3-window analysis
        window_duration = (_END - _START) / 3
        for i in range(3):
            ts = _START + window_duration * i + timedelta(hours=1)
            cost = 1.0 if i < 2 else 50.0
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=cost, timestamp=ts),
            )

        result = await optimizer.detect_anomalies(
            start=_START,
            end=_END,
            window_count=3,
        )
        assert result.anomalies == ()

    async def test_multiple_agents_only_anomalous_flagged(self) -> None:
        optimizer, tracker = _make_optimizer()
        window_duration = (_END - _START) / 5

        # Alice: uniform spending
        for i in range(5):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        # Bob: spike in last window
        for i in range(4):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="bob", cost_usd=1.0, timestamp=ts),
            )
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="bob", cost_usd=20.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        assert result.anomalies[0].agent_id == "bob"
        assert result.agents_scanned == 2

    async def test_window_count_validation(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match="window_count must be >= 2"):
            await optimizer.detect_anomalies(
                start=_START,
                end=_END,
                window_count=1,
            )

    async def test_spike_from_zero_baseline(self) -> None:
        """Agent with no historical spending that suddenly appears."""
        optimizer, tracker = _make_optimizer(
            config=CostOptimizerConfig(min_anomaly_windows=3),
        )
        window_duration = (_END - _START) / 5

        # No spending in first 4 windows, spending in window 5
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="alice", cost_usd=5.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        anomaly = result.anomalies[0]
        assert anomaly.severity == AnomalySeverity.HIGH
        assert anomaly.baseline_value == 0.0

    async def test_spike_severity_with_zero_stddev(self) -> None:
        """Spike severity uses spike_ratio when stddev is 0."""
        optimizer, tracker = _make_optimizer(
            config=CostOptimizerConfig(
                anomaly_sigma_threshold=2.0,
                anomaly_spike_factor=2.0,
                min_anomaly_windows=3,
            ),
        )
        window_duration = (_END - _START) / 5

        # Identical baseline → stddev=0
        for i in range(4):
            ts = _START + window_duration * i + timedelta(hours=1)
            await tracker.record(
                make_cost_record(agent_id="alice", cost_usd=1.0, timestamp=ts),
            )

        # Spike: 4x baseline → spike_ratio=4.0 → HIGH (>=3.0)
        ts = _START + window_duration * 4 + timedelta(hours=1)
        await tracker.record(
            make_cost_record(agent_id="alice", cost_usd=4.0, timestamp=ts),
        )

        result = await optimizer.detect_anomalies(start=_START, end=_END)
        assert len(result.anomalies) == 1
        assert result.anomalies[0].severity == AnomalySeverity.HIGH
# ── Efficiency Analysis Tests ─────────────────────────────────────


@pytest.mark.unit
class TestAnalyzeEfficiency:
    async def test_uniform_all_normal(self) -> None:
        optimizer, tracker = _make_optimizer()

        # Same cost/token ratio for all agents
        for agent in ("alice", "bob", "carol"):
            await tracker.record(
                make_cost_record(
                    agent_id=agent,
                    cost_usd=1.0,
                    input_tokens=1000,
                    output_tokens=0,
                    timestamp=_START + timedelta(hours=1),
                ),
            )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert all(
            a.efficiency_rating == EfficiencyRating.NORMAL for a in result.agents
        )
        assert result.inefficient_agent_count == 0

    async def test_one_inefficient(self) -> None:
        optimizer, tracker = _make_optimizer()

        # Alice: cheap (1.0/1000 = 1.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                cost_usd=1.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Bob: expensive (10.0/1000 = 10.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert result.inefficient_agent_count == 1
        # Sorted by cost_per_1k desc
        assert result.agents[0].agent_id == "bob"
        assert result.agents[0].efficiency_rating == EfficiencyRating.INEFFICIENT

    async def test_zero_tokens_handled(self) -> None:
        optimizer, tracker = _make_optimizer()

        await tracker.record(
            make_cost_record(
                agent_id="alice",
                cost_usd=0.0,
                input_tokens=0,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert len(result.agents) == 1
        assert result.agents[0].cost_per_1k_tokens == 0.0
        assert result.agents[0].efficiency_rating == EfficiencyRating.NORMAL

    async def test_efficient_agent_flagged(self) -> None:
        optimizer, tracker = _make_optimizer()

        # Alice: very cheap (0.1/10000 = 0.01 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                cost_usd=0.1,
                input_tokens=10000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Bob: normal (1.0/1000 = 1.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                cost_usd=1.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Carol: normal (1.0/1000 = 1.0 per 1k)
        await tracker.record(
            make_cost_record(
                agent_id="carol",
                cost_usd=1.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        alice = next(a for a in result.agents if a.agent_id == "alice")
        assert alice.efficiency_rating == EfficiencyRating.EFFICIENT

    async def test_empty_records(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.analyze_efficiency(start=_START, end=_END)
        assert result.agents == ()
        assert result.global_avg_cost_per_1k == 0.0
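These tests compare each agent's cost per 1k tokens against the fleet average. A hedged sketch of that rating rule; the 2.0x and 0.5x cut-offs (and the string labels) are assumptions for illustration, not the shipped `CostOptimizerConfig` defaults or `EfficiencyRating` members:

```python
def cost_per_1k(cost_usd: float, tokens: int) -> float:
    """Spend normalized to 1k tokens; zero tokens maps to 0.0, matching the tests."""
    return (cost_usd / tokens) * 1000 if tokens > 0 else 0.0


def rate_agent(
    agent_cost_per_1k: float,
    global_avg: float,
    inefficient_factor: float = 2.0,
    efficient_factor: float = 0.5,
) -> str:
    """Classify an agent relative to the fleet average (factors are illustrative)."""
    if global_avg <= 0:
        return "normal"
    if agent_cost_per_1k > global_avg * inefficient_factor:
        return "inefficient"
    if agent_cost_per_1k < global_avg * efficient_factor:
        return "efficient"
    return "normal"
```

With the `test_one_inefficient` numbers (1.0 vs 10.0 per 1k, average 5.5), a factor around 1.5 would flag bob as inefficient, while alice's 0.01 per 1k in `test_efficient_agent_flagged` falls well below any plausible efficient cut-off.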
# ── Downgrade Recommendation Tests ────────────────────────────────


@pytest.mark.unit
class TestRecommendDowngrades:
    async def test_no_resolver_empty_result(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        assert result.recommendations == ()

    async def test_with_downgrade_path(self) -> None:
        from ai_company.budget.config import AutoDowngradeConfig

        resolver = _make_resolver()
        bc = BudgetConfig(
            total_monthly=100.0,
            auto_downgrade=AutoDowngradeConfig(
                enabled=True,
                threshold=80,
                downgrade_map=(("large", "small"),),
            ),
        )
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )

        # Make alice inefficient using large model
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        # Make bob efficient using small model
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                model="test-small-001",
                cost_usd=0.1,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        assert len(result.recommendations) == 1
        rec = result.recommendations[0]
        assert rec.agent_id == "alice"
        assert rec.current_model == "test-large-001"
        assert rec.recommended_model == "test-small-001"
        assert rec.estimated_savings_per_1k > 0

    async def test_no_cheaper_model_empty(self) -> None:
        """No recommendation when agent already uses cheapest model."""
        resolver = _make_resolver(
            [
                ResolvedModel(
                    provider_name="test-provider",
                    model_id="test-only-001",
                    alias="only",
                    cost_per_1k_input=0.01,
                    cost_per_1k_output=0.02,
                ),
            ]
        )
        bc = BudgetConfig(total_monthly=100.0)
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )

        # Only agent, only model — inefficient by default since it's the only one
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-only-001",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        assert result.recommendations == ()
# ── Evaluate Operation Tests ──────────────────────────────────────


@pytest.mark.unit
class TestEvaluateOperation:
    async def test_healthy_budget_approved(self) -> None:
        optimizer, tracker = _make_optimizer()
        # Spend only 10% of budget
        await tracker.record(
            make_cost_record(cost_usd=10.0, timestamp=_START + timedelta(hours=1)),
        )
        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=0.5,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is True
        assert decision.alert_level == BudgetAlertLevel.NORMAL

    async def test_hard_stop_denied(self) -> None:
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        optimizer, tracker = _make_optimizer(budget_config=bc)

        # Spend 100% of budget
        await tracker.record(
            make_cost_record(cost_usd=100.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=1.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        assert decision.alert_level == BudgetAlertLevel.HARD_STOP

    async def test_would_exceed_budget_denied(self) -> None:
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        optimizer, tracker = _make_optimizer(budget_config=bc)

        # Spend 95% and request 10 more → projected 105% → HARD_STOP
        await tracker.record(
            make_cost_record(cost_usd=95.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=10.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        # With projected alert level, this now triggers auto-deny
        assert "denied" in decision.reason.lower()

    async def test_warning_level_approved_with_conditions(self) -> None:
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        optimizer, tracker = _make_optimizer(budget_config=bc)

        # Spend 80% (warning level)
        await tracker.record(
            make_cost_record(cost_usd=80.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=2.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is True
        assert decision.alert_level == BudgetAlertLevel.WARNING
        assert len(decision.conditions) > 0

    async def test_budget_enforcement_disabled(self) -> None:
        bc = BudgetConfig(total_monthly=0.0)
        optimizer, _ = _make_optimizer(budget_config=bc)

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=100.0,
        )
        assert decision.approved is True
        assert "disabled" in decision.reason.lower()

    async def test_critical_level_auto_deny_with_custom_config(self) -> None:
        """Auto-deny at CRITICAL when configured."""
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        config = CostOptimizerConfig(
            approval_auto_deny_alert_level=BudgetAlertLevel.CRITICAL,
        )
        optimizer, tracker = _make_optimizer(budget_config=bc, config=config)

        # Spend 92% (critical level)
        await tracker.record(
            make_cost_record(cost_usd=92.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=0.01,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        assert decision.alert_level == BudgetAlertLevel.CRITICAL

    async def test_high_cost_condition(self) -> None:
        """High-cost warning condition when estimated cost >= threshold."""
        config = CostOptimizerConfig(approval_warn_threshold_usd=0.5)
        optimizer, _ = _make_optimizer(config=config)

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=1.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is True
        assert any("High-cost" in c for c in decision.conditions)
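The approval tests above all hinge on projected utilization: spend so far plus the estimated cost, measured against the alert thresholds. A simplified stand-alone sketch of that decision rule; names and the string levels are illustrative, since the real code returns an `ApprovalDecision` carrying `BudgetAlertLevel` values and conditions:

```python
def evaluate(
    spent: float,
    estimated_cost: float,
    budget: float,
    warn_at: float = 75.0,
    critical_at: float = 90.0,
    hard_stop_at: float = 100.0,
    auto_deny_level: str = "hard_stop",
) -> tuple[bool, str]:
    """Approve unless the *projected* utilization reaches the auto-deny level."""
    if budget <= 0:
        return True, "budget enforcement disabled"
    projected = (spent + estimated_cost) / budget * 100
    if projected >= hard_stop_at:
        level = "hard_stop"
    elif projected >= critical_at:
        level = "critical"
    elif projected >= warn_at:
        level = "warning"
    else:
        level = "normal"
    order = ["normal", "warning", "critical", "hard_stop"]
    if order.index(level) >= order.index(auto_deny_level):
        return False, f"denied: projected utilization {projected:.0f}% reaches {level}"
    return True, f"approved at {level} level"
```

Projecting before classifying is what makes `test_projected_alert_level_used_for_auto_deny` pass: 95% spent is only CRITICAL, but a $10 request projects to 105% and trips the HARD_STOP deny.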
# ── _classify_severity Tests ─────────────────────────────────────


@pytest.mark.unit
class TestClassifySeverity:
    @pytest.mark.parametrize(
        ("deviation", "expected"),
        [
            (0.0, AnomalySeverity.LOW),
            (1.5, AnomalySeverity.LOW),
            (1.99, AnomalySeverity.LOW),
            (2.0, AnomalySeverity.MEDIUM),
            (2.5, AnomalySeverity.MEDIUM),
            (2.99, AnomalySeverity.MEDIUM),
            (3.0, AnomalySeverity.HIGH),
            (5.0, AnomalySeverity.HIGH),
            (100.0, AnomalySeverity.HIGH),
        ],
    )
    def test_thresholds(self, deviation: float, expected: AnomalySeverity) -> None:
        assert _classify_severity(deviation) == expected
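The parametrized table above pins down `_classify_severity`'s thresholds exactly: below 2.0 is LOW, 2.0 up to 3.0 is MEDIUM, 3.0 and above is HIGH. A matching sketch, with string labels standing in for the `AnomalySeverity` members:

```python
def classify_severity(deviation: float) -> str:
    """Map a deviation factor to a severity bucket (thresholds match the table)."""
    if deviation >= 3.0:
        return "high"
    if deviation >= 2.0:
        return "medium"
    return "low"
```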
# ── Input Validation Tests ───────────────────────────────────────


@pytest.mark.unit
class TestInputValidation:
    async def test_detect_anomalies_start_after_end(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.detect_anomalies(start=_END, end=_START)

    async def test_analyze_efficiency_start_after_end(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.analyze_efficiency(start=_END, end=_START)

    async def test_recommend_downgrades_start_after_end(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.recommend_downgrades(start=_END, end=_START)
# ── Edge Case Tests ──────────────────────────────────────────────


@pytest.mark.unit
class TestEdgeCases:
    async def test_find_cheaper_model_picks_cheapest(self) -> None:
        """_find_cheaper_model selects the overall cheapest below current."""
        resolver = _make_resolver()
        result = await _make_optimizer(model_resolver=resolver)[0].recommend_downgrades(
            start=_START, end=_END
        )
        # No records → no recommendations, but validates the path
        assert result.recommendations == ()

    async def test_budget_pressure_percent_reflects_spending(self) -> None:
        """budget_pressure_percent reflects actual spend vs budget."""
        from ai_company.budget.billing import billing_period_start

        resolver = _make_resolver()
        bc = BudgetConfig(total_monthly=100.0)
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )
        # Record in the current billing period so pressure reflects it
        now = datetime.now(UTC)
        period_start = billing_period_start(bc.reset_day, now=now)
        await tracker.record(
            make_cost_record(
                cost_usd=60.0,
                timestamp=period_start + timedelta(hours=1),
            ),
        )
        # Use a period that covers the data for the efficiency analysis
        analysis_start = period_start
        analysis_end = now + timedelta(days=1)
        result = await optimizer.recommend_downgrades(
            start=analysis_start, end=analysis_end
        )
        assert result.budget_pressure_percent == 60.0

    async def test_downgrade_target_not_resolved(self) -> None:
        """No recommendation when downgrade target doesn't resolve."""
        from ai_company.budget.config import AutoDowngradeConfig

        resolver = _make_resolver(
            [
                ResolvedModel(
                    provider_name="test-provider",
                    model_id="test-large-001",
                    alias="large",
                    cost_per_1k_input=0.03,
                    cost_per_1k_output=0.06,
                ),
            ]
        )
        bc = BudgetConfig(
            total_monthly=100.0,
            auto_downgrade=AutoDowngradeConfig(
                enabled=True,
                threshold=80,
                downgrade_map=(("large", "nonexistent"),),
            ),
        )
        tracker = CostTracker(budget_config=bc)
        optimizer = CostOptimizer(
            cost_tracker=tracker,
            budget_config=bc,
            model_resolver=resolver,
        )

        # Make alice inefficient (only agent, but needs another to set avg)
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=10.0,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )
        await tracker.record(
            make_cost_record(
                agent_id="bob",
                model="test-large-001",
                cost_usd=0.1,
                input_tokens=1000,
                output_tokens=0,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.recommend_downgrades(start=_START, end=_END)
        # Target "nonexistent" can't be resolved → no recommendation
        assert result.recommendations == ()

    async def test_negative_estimated_cost_rejected(self) -> None:
        """Negative estimated_cost_usd raises ValueError."""
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match="estimated_cost_usd must be >= 0"):
            await optimizer.evaluate_operation(
                agent_id="alice",
                estimated_cost_usd=-1.0,
            )

    async def test_window_count_upper_bound(self) -> None:
        """window_count > 1000 raises ValueError."""
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match="window_count must be <= 1000"):
            await optimizer.detect_anomalies(
                start=_START,
                end=_END,
                window_count=1001,
            )

    async def test_projected_alert_level_used_for_auto_deny(self) -> None:
        """Auto-deny uses projected alert level, not current."""
        bc = BudgetConfig(
            total_monthly=100.0,
            alerts=BudgetAlertConfig(warn_at=75, critical_at=90, hard_stop_at=100),
        )
        config = CostOptimizerConfig(
            approval_auto_deny_alert_level=BudgetAlertLevel.HARD_STOP,
        )
        optimizer, tracker = _make_optimizer(budget_config=bc, config=config)

        # Spend 95% — current alert is CRITICAL, but requesting 10
        # would push to 105% → projected HARD_STOP → denied
        await tracker.record(
            make_cost_record(cost_usd=95.0, timestamp=_START + timedelta(hours=1)),
        )

        decision = await optimizer.evaluate_operation(
            agent_id="alice",
            estimated_cost_usd=10.0,
            now=_START + timedelta(days=15),
        )
        assert decision.approved is False
        assert "projected" in decision.reason.lower()
# ── Routing Optimization Tests ──────────────────────────────────


@pytest.mark.unit
class TestSuggestRoutingOptimizations:
    async def test_no_resolver_empty_result(self) -> None:
        optimizer, _ = _make_optimizer()
        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert result.suggestions == ()
        assert result.agents_analyzed == 0

    async def test_no_records_empty_suggestions(self) -> None:
        resolver = _make_resolver()
        optimizer, _ = _make_optimizer(model_resolver=resolver)
        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert result.suggestions == ()
        assert result.agents_analyzed == 0

    async def test_suggests_cheaper_model(self) -> None:
        resolver = _make_resolver()
        optimizer, tracker = _make_optimizer(model_resolver=resolver)

        # Alice uses the expensive large model
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-large-001",
                cost_usd=5.0,
                input_tokens=1000,
                output_tokens=500,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert len(result.suggestions) == 1
        suggestion = result.suggestions[0]
        assert suggestion.agent_id == "alice"
        assert suggestion.current_model == "test-large-001"
        assert suggestion.estimated_savings_per_1k > 0
        assert result.total_estimated_savings_per_1k > 0

    async def test_no_suggestion_for_cheapest_model(self) -> None:
        resolver = _make_resolver()
        optimizer, tracker = _make_optimizer(model_resolver=resolver)

        # Alice already uses the cheapest model
        await tracker.record(
            make_cost_record(
                agent_id="alice",
                model="test-small-001",
                cost_usd=0.1,
                input_tokens=1000,
                output_tokens=500,
                timestamp=_START + timedelta(hours=1),
            ),
        )

        result = await optimizer.suggest_routing_optimizations(
            start=_START,
            end=_END,
        )
        assert result.suggestions == ()
        assert result.agents_analyzed == 1

    async def test_start_after_end_rejected(self) -> None:
        optimizer, _ = _make_optimizer()
        with pytest.raises(ValueError, match=r"start .* must be before end"):
            await optimizer.suggest_routing_optimizations(start=_END, end=_START)

    async def test_context_window_respected(self) -> None:
| """Suggestions only include models with sufficient context window.""" | ||
| models = [ | ||
| ResolvedModel( | ||
| provider_name="test-provider", | ||
| model_id="test-large-001", | ||
| alias="large", | ||
| cost_per_1k_input=0.03, | ||
| cost_per_1k_output=0.06, | ||
| max_context=200000, | ||
| ), | ||
| ResolvedModel( | ||
| provider_name="test-provider", | ||
| model_id="test-small-001", | ||
| alias="small", | ||
| cost_per_1k_input=0.001, | ||
| cost_per_1k_output=0.002, | ||
| max_context=50000, # Smaller context than large | ||
| ), | ||
| ] | ||
| resolver = _make_resolver(models) | ||
| optimizer, tracker = _make_optimizer(model_resolver=resolver) | ||
|
|
||
| await tracker.record( | ||
| make_cost_record( | ||
| agent_id="alice", | ||
| model="test-large-001", | ||
| cost_usd=5.0, | ||
| timestamp=_START + timedelta(hours=1), | ||
| ), | ||
| ) | ||
|
|
||
| result = await optimizer.suggest_routing_optimizations( | ||
| start=_START, | ||
| end=_END, | ||
| ) | ||
| # small has insufficient context window → no suggestion | ||
| assert result.suggestions == () |
🛠️ Refactor suggestion | 🟠 Major
Split this test module.
This new file is already around 900 lines, which is past the repo's size limit and will only get harder to navigate as optimizer coverage grows. Breaking it into anomaly/efficiency/downgrade/approval/routing modules would keep failures much easier to localize.
As per coding guidelines: "Keep functions under 50 lines and files under 800 lines."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/budget/test_optimizer.py` around lines 1 - 900, The test module is
too large; split it into smaller focused test files by moving the related test
classes into separate modules (e.g., tests/unit/budget/test_anomalies.py,
test_efficiency.py, test_downgrades.py, test_approval.py, test_routing.py).
Extract shared helpers/constants (_START, _END, _make_optimizer, _make_resolver,
make_cost_record import) into a common test helper or conftest (e.g.,
tests/unit/budget/test_helpers.py or reuse tests/unit/budget/conftest.py) and
update imports in each new file; preserve pytest.mark.unit decorators and keep
each test class (TestDetectAnomalies, TestAnalyzeEfficiency,
TestRecommendDowngrades, TestEvaluateOperation, TestSuggestRoutingOptimizations,
TestClassifySeverity, TestInputValidation, TestEdgeCases) intact when moving so
tests and references (CostOptimizer, CostTracker, CostOptimizerConfig,
BudgetConfig, ModelResolver, ResolvedModel, _classify_severity) still resolve.
Ensure no duplicate fixtures/names and run pytest to verify imports and test
discovery.
tests/unit/budget/test_optimizer.py
Outdated
| async def test_find_cheaper_model_picks_cheapest(self) -> None: | ||
| """_find_cheaper_model selects the overall cheapest below current.""" | ||
| resolver = _make_resolver() | ||
| result = await _make_optimizer(model_resolver=resolver)[0].recommend_downgrades( | ||
| start=_START, end=_END | ||
| ) | ||
| # No records → no recommendations, but validates the path | ||
| assert result.recommendations == () |
This test never reaches cheaper-model selection.
No records are seeded here, so recommend_downgrades() returns on the empty-data path before any _find_cheaper_model logic runs. The test passes even if that branch is broken. Either seed an inefficient record and assert the chosen target, or rename the test to the empty-state behavior it actually covers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/budget/test_optimizer.py` around lines 645 - 652, The test
test_find_cheaper_model_picks_cheapest never exercises _find_cheaper_model
because recommend_downgrades returns early on empty data; either seed an
inefficient usage record before calling recommend_downgrades so the
_find_cheaper_model path runs and assert the chosen cheaper model, or rename the
test to reflect empty-state behavior. Concretely, in the test that calls
_make_resolver() and _make_optimizer(), add a fixture/seeded record (matching
whatever helper you use to insert records in tests) representing an
inefficient/high-cost model so recommend_downgrades evaluates downgrades, then
assert the returned recommendation target; otherwise change the test name and
expected assertion to indicate it verifies the empty-data result from
recommend_downgrades.
- (A) _find_most_used_model accepts pre-filtered agent records
- (B) _find_cheaper_model respects min_context for context window
- (C) recommend_downgrades returns real budget_pressure when no resolver
- (D) evaluate_operation uses projected_alert for conditions
- (E) reports.py logs WARNING before validation ValueErrors
- (F) suggest_routing_optimizations docstring no longer claims latency
- (G) generate_report derives total_cost from records for consistency
- (H) evaluate_operation split into _check_denial/_build_approval_conditions;
recommend_downgrades/suggest_routing_optimizations loops extracted
- (I) recommend_downgrades parallelizes get_records + budget_pressure
- (J) test_optimizer.py split into 3 files (analysis, decisions)
- (K) DESIGN_SPEC §10.3 mentions routing optimization
- (L) _find_cheaper_model tests exercise actual code path + min_context
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.
| """Generate a spending report for the given period. | ||
|
|
||
| Fetches records and summary concurrently; derives ``total_cost`` | ||
| from the records snapshot for consistent distribution | ||
| percentages. | ||
|
|
The generate_report() docstring says records and summary are fetched concurrently, but the implementation awaits get_records() and then build_summary() sequentially. Either update the docstring to match the actual behavior or use a TaskGroup/gather to fetch both concurrently (noting CostTracker snapshots under a lock).
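One way the concurrent fetch could look, assuming `get_records()` and `build_summary()` are independent coroutines that each snapshot under the tracker's lock. This is a hedged sketch with toy stand-ins, not the module's actual code:

```python
import asyncio


async def get_records() -> list[dict]:
    await asyncio.sleep(0.01)  # stands in for a lock-protected snapshot
    return [{"cost_usd": 1.5}, {"cost_usd": 2.5}]


async def build_summary() -> dict:
    await asyncio.sleep(0.01)
    return {"period": "2026-03"}


async def generate_report() -> dict:
    # Fetch both concurrently so the docstring's claim matches reality.
    records, summary = await asyncio.gather(get_records(), build_summary())
    # Derive total_cost from the records snapshot for consistent percentages.
    return {**summary, "total_cost": sum(r["cost_usd"] for r in records)}


report = asyncio.run(generate_report())
print(report["total_cost"])  # 4.0
```

Because `asyncio.Lock` serializes the two snapshots anyway, the win here is small; updating the docstring to "sequentially" is the cheaper fix if contention matters.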
| current_value: Spending in the most recent window. | ||
| baseline_value: Mean spending across historical windows. | ||
| deviation_factor: How many standard deviations above baseline. | ||
| Set to 0.0 when the baseline is zero (no historical spending). | ||
| detected_at: Timestamp when the anomaly was detected. |
SpendingAnomaly.deviation_factor is documented as “standard deviations above baseline”, but when historical stddev is 0 the implementation sets deviation_factor to the spike ratio (a multiplier), not a sigma value. Please update the field/docstring to reflect the actual semantics (e.g., “sigma or spike ratio depending on variance”) so consumers don’t misinterpret it.
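The dual semantics can be pinned down in a few lines. This is a reconstruction of the behavior described in the finding, not the service's exact code:

```python
import statistics


def deviation_factor(current: float, history: list[float]) -> float:
    """Sigma when history has variance; spike ratio when stddev is 0."""
    baseline = statistics.mean(history)
    if baseline == 0:
        return 0.0  # no historical spending at all
    stddev = statistics.pstdev(history)
    if stddev == 0:
        return current / baseline  # spike ratio, NOT a sigma value
    return (current - baseline) / stddev  # true z-score


assert deviation_factor(10.0, [0.0, 0.0]) == 0.0  # zero baseline
assert deviation_factor(10.0, [2.0, 2.0]) == 5.0  # flat history → ratio
assert deviation_factor(8.0, [2.0, 4.0]) == 5.0   # mean 3, pstdev 1 → sigma
```

Note the middle and last cases both return 5.0 while meaning very different things, which is exactly why the docstring should name both regimes.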
| # Same ordering as BudgetEnforcer._ALERT_LEVEL_ORDER | ||
| _ALERT_LEVEL_ORDER: dict[BudgetAlertLevel, int] = { | ||
| BudgetAlertLevel.NORMAL: 0, | ||
| BudgetAlertLevel.WARNING: 1, | ||
| BudgetAlertLevel.CRITICAL: 2, | ||
| BudgetAlertLevel.HARD_STOP: 3, | ||
| } |
optimizer.py duplicates BudgetEnforcer’s _ALERT_LEVEL_ORDER mapping but omits the runtime sanity checks that enforcer.py has (ensuring keys match BudgetAlertLevel and values are unique). Adding the same validation (or importing a shared constant) would prevent silent drift if BudgetAlertLevel changes.
| if projected_cost >= hard_stop_limit: | ||
| logger.warning( | ||
| CFO_OPERATION_DENIED, | ||
| agent_id=agent_id, | ||
| estimated_cost=estimated_cost_usd, |
In _check_denial(), the if projected_cost >= hard_stop_limit branch is unreachable with the current logic: whenever that condition is true, projected_pct will be >= hard_stop_at and _compute_alert_level() will return HARD_STOP, which is always >= any configured approval_auto_deny_alert_level, so the earlier auto-deny check already returns. Consider removing this dead branch, or changing the first check if you intend hard-stop to be handled differently.
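The unreachability argument is just arithmetic: assuming `projected_pct = projected_cost / total * 100` and `hard_stop_limit = total * hard_stop_at / 100` (definitions inferred from the review, not quoted from the source), the two conditions are equivalent. A brute-force check of that equivalence:

```python
def hits_hard_stop(projected_cost: float, total: float, hard_stop_at: float) -> bool:
    projected_pct = projected_cost / total * 100
    return projected_pct >= hard_stop_at  # what _compute_alert_level would see


def over_limit(projected_cost: float, total: float, hard_stop_at: float) -> bool:
    hard_stop_limit = total * hard_stop_at / 100
    return projected_cost >= hard_stop_limit  # the "dead" second check


# Whenever the second check would fire, the alert-level check already fired,
# so the auto-deny return above it wins for any threshold <= HARD_STOP.
for cost in range(0, 301, 5):
    for stop_at in (80.0, 90.0, 100.0):
        if over_limit(cost, 100.0, stop_at):
            assert hits_hard_stop(cost, 100.0, stop_at)
```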
| if cfg.total_monthly <= 0: | ||
| return ApprovalDecision( | ||
| approved=True, | ||
| reason="Budget enforcement disabled (no monthly budget)", | ||
| budget_remaining_usd=0.0, | ||
| budget_used_percent=0.0, | ||
| alert_level=BudgetAlertLevel.NORMAL, | ||
| conditions=(), | ||
| ) | ||
|
|
Missing INFO log on budget-enforcement-disabled approval path
The total_monthly <= 0 early-return at line 471 emits no log entry before returning the ApprovalDecision. All other code paths in this method (CFO_OPERATION_DENIED for negative cost, CFO_APPROVAL_EVALUATED for the normal approval, and _check_denial's CFO_OPERATION_DENIED) are instrumented at INFO/WARNING. CLAUDE.md mandates "All state transitions must log at INFO," and this early-exit is a production-relevant state transition that will be completely invisible in logs.
if cfg.total_monthly <= 0:
    decision = ApprovalDecision(
        approved=True,
        reason="Budget enforcement disabled (no monthly budget)",
        budget_remaining_usd=0.0,
        budget_used_percent=0.0,
        alert_level=BudgetAlertLevel.NORMAL,
        conditions=(),
    )
    logger.info(
        CFO_APPROVAL_EVALUATED,
        agent_id=agent_id,
        approved=True,
        estimated_cost=estimated_cost_usd,
        alert_level=BudgetAlertLevel.NORMAL.value,
        conditions_count=0,
        reason="enforcement_disabled",
    )
    return decision
| approval_auto_deny_alert_level: BudgetAlertLevel = Field( | ||
| default=BudgetAlertLevel.HARD_STOP, | ||
| description="Alert level triggering auto-deny", | ||
| ) |
approval_auto_deny_alert_level = NORMAL silently denies every operation
approval_auto_deny_alert_level accepts any BudgetAlertLevel, including BudgetAlertLevel.NORMAL. In _check_denial, the guard is:
if _ALERT_LEVEL_ORDER[projected_alert] >= _ALERT_LEVEL_ORDER[auto_deny_level]:
_ALERT_LEVEL_ORDER[NORMAL] is 0, so this condition is always True for any projected_alert (since all levels map to >= 0). Setting the field to NORMAL therefore auto-denies every operation regardless of actual budget usage — a much harder footgun than the approval_warn_threshold_usd = 0 case already flagged, because it makes the service silently refuse all work.
Consider adding a validator that rejects NORMAL as the deny threshold (or documents this behaviour explicitly):
@field_validator("approval_auto_deny_alert_level")
@classmethod
def _deny_level_not_normal(cls, v: BudgetAlertLevel) -> BudgetAlertLevel:
    if v == BudgetAlertLevel.NORMAL:
        msg = (
            "approval_auto_deny_alert_level=NORMAL would deny every operation; "
            "use WARNING, CRITICAL, or HARD_STOP"
        )
        raise ValueError(msg)
    return v
| # Re-export _classify_severity for backwards compatibility with tests | ||
| # that import it directly from optimizer. | ||
| __all__ = ["CostOptimizer", "_classify_severity"] |
Stale re-export of private _classify_severity in __all__
The comment claims _classify_severity is re-exported here for "backwards compatibility with tests that import it directly from optimizer," but test_optimizer.py already imports it from ai_company.budget._optimizer_helpers (line 4 of that file), not from optimizer. The re-export is therefore unused, and exporting a module-private function (the leading underscore marks it private by convention) via __all__ is unconventional and misleading — consumers of ai_company.budget.optimizer would see it as part of the public API.
# Re-export _classify_severity for backwards compatibility with tests
# that import it directly from optimizer.
__all__ = ["CostOptimizer", "_classify_severity"]
Consider removing _classify_severity from __all__:
| # Re-export _classify_severity for backwards compatibility with tests | |
| # that import it directly from optimizer. | |
| __all__ = ["CostOptimizer", "_classify_severity"] | |
| __all__ = ["CostOptimizer"] |
🤖 I have created a release *beep* *boop* --- ## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49))
([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement 
meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs 
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929)) * address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169) * enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c)) * harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e)) * harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325)) * harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb)) * incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a)) * pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108)) * strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861)) ### Performance * harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188) ### Refactoring * adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90)) * extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b)) * harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9)) * harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299)) * pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0)) * split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89)) ### Documentation * add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39) * add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b)) * add CLAUDE.md, 
contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54) * add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595)) * address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a)) * expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6)) * finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742)) * update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee)) ### Tests * add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4)) * add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4)) ### CI/CD * add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758)) * bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247)) * bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517)) * harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c)) * split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af)) ### Maintenance * add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337)) * add design spec context loading to research-link skill ([8ef9685](8ef9685)) * add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705)) * add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023)) * add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577)) * bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86)) * bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c)) * bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46)) * fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5)) * pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002)) * post-audit cleanup — PEP 758, 
loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
🤖 I have created a release *beep* *boop* --- ## [0.1.0](v0.0.0...v0.1.0) (2026-03-11) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add mandatory JWT + API key authentication ([#256](#256)) ([c279cfe](c279cfe)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable output scan response policies ([#263](#263)) ([b9907e8](b9907e8)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive 
trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b))
* design unified provider interface ([#86](#86)) ([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134))
* implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence ([#279](#279)) ([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b))
* implement built-in file system tools ([#18](#18)) ([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496))
* implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2))
* implement core entity and role system models ([#69](#69)) ([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f))
* implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515))
* implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe))
* implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca))
* implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2))
* implement model routing engine ([#99](#99)) ([d3c250b](d3c250b))
* implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3))
* implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85))
* implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30)
* implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1))
* implement tool permission checking ([#16](#16)) ([833c190](833c190))
* implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout ([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749))
* wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617))

### Bug Fixes

* address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929))
* address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169)
* enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a))
* pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format ([#286](#286)) ([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861))

### Performance

* harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188)

### Refactoring

* adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9))
* harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89))

### Documentation

* add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6))
* finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee))

### Tests

* add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4))
* add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4))

### CI/CD

* add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758))
* bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247))
* bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2 ([#271](#271)) ([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0 ([#273](#273)) ([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0 ([#272](#272)) ([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0 ([#270](#270)) ([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0 ([#274](#274)) ([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0 ([#275](#275)) ([29dd16c](29dd16c))
* harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af))

### Maintenance

* add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337))
* add design spec context loading to research-link skill ([8ef9685](8ef9685))
* add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46))
* fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1 ([#282](#282)) ([2f4703d](2f4703d))
* pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>
Summary
- **CostOptimizer** (`budget/optimizer.py`): Spending anomaly detection (spike/zero-baseline), cost efficiency analysis with per-agent ratings, model downgrade recommendations via resolver + downgrade map, and operation approval decisions with configurable auto-deny thresholds
- **Domain models** (`budget/optimizer_models.py`): Frozen Pydantic models for anomalies, efficiency, downgrades, approvals, and config — using `@computed_field` for derived values, `NotBlankStr` for identifiers, and cross-field validators
- **ReportGenerator** (`budget/reports.py`): Multi-dimensional spending reports with task/provider/model breakdowns, period-over-period comparison (computed fields), and top-N agent/task rankings with sort-order validators
- **Events** (`events/cfo.py`, `events/budget.py`): 12 CFO event constants + `BUDGET_RECORDS_QUERIED`
- **CostTracker** (`budget/tracker.py`): `get_records()` query method for analytical consumers
- **Tests**: `_classify_severity` thresholds, computed field verification, validator edge cases, input validation, and downgrade path coverage

Closes #46
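The anomaly detection described above combines a Z-score test with a spike factor and a zero-baseline special case. A minimal sketch of that idea is below; the function name, window handling, and default thresholds are illustrative assumptions, not the actual `CostOptimizer` API:

```python
from statistics import mean, stdev

# Hypothetical sketch of Z-score + spike-factor anomaly detection.
# `detect_spike` and its default thresholds are assumptions for illustration.
def detect_spike(
    history: list[float],       # prior per-window spend for one agent
    current: float,             # spend in the window under test
    z_threshold: float = 3.0,   # assumed Z-score cutoff
    spike_factor: float = 2.0,  # assumed multiple-of-baseline cutoff
) -> bool:
    """Flag `current` as anomalous if it is both a statistical outlier and a spike."""
    if len(history) < 2:
        # Zero/short baseline: any spend from a previously silent agent is flagged.
        return current > 0 and sum(history) == 0
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # Flat baseline: fall back to the spike factor alone.
        return current > mu * spike_factor
    z = (current - mu) / sigma
    return z > z_threshold and current > mu * spike_factor
```

Requiring both conditions avoids flagging tiny absolute increases that happen to be many standard deviations above a near-zero baseline.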
Pre-PR Review Coverage
Test plan
- `uv run ruff check src/ tests/` — passes
- `uv run mypy src/ tests/` — passes
- `uv run pytest tests/ -n auto --cov=ai_company --cov-fail-under=80` — 4826 passed, 96.27% coverage

🤖 Generated with Claude Code
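The operation approval decisions mentioned in the summary are driven by budget utilization with a configurable auto-deny threshold. A rough sketch of that shape, under assumed names and thresholds (not the service's real interface):

```python
from dataclasses import dataclass

# Illustrative approval decision based on projected budget utilization.
# `ApprovalDecision`, `evaluate_operation`, and the 95% default are assumptions.
@dataclass(frozen=True)
class ApprovalDecision:
    approved: bool
    reason: str

def evaluate_operation(
    spent: float,                          # budget consumed so far
    budget_limit: float,                   # total budget (assumed > 0)
    op_cost: float,                        # estimated cost of the operation
    auto_deny_utilization: float = 0.95,   # assumed configurable threshold
) -> ApprovalDecision:
    """Deny when approving the operation would push utilization past the threshold."""
    projected = (spent + op_cost) / budget_limit
    if projected >= auto_deny_utilization:
        return ApprovalDecision(
            False, f"projected utilization {projected:.0%} exceeds auto-deny threshold"
        )
    return ApprovalDecision(True, f"projected utilization {projected:.0%} within budget")
```

Evaluating the projected (post-operation) utilization rather than the current one lets the check deny a single large operation before it overruns the budget.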