perf: harden non-inferable principle implementation (#195)
Conversation
…trics (#188)

Remove the tools section from the default system prompt template per D22 — tool definitions are already passed via the LLM provider API, so duplicating them in the prompt doubles cost with no benefit. Add a pluggable MemoryFilterStrategy (D23) with a tag-based initial implementation that retains only memories tagged "non-inferable" before injection. Add an advisory store guard and policy quality validation heuristics. Add prompt_tokens and prompt_cost_ratio to TaskCompletionMetrics for cost-aware context budgeting, with warnings when the ratio exceeds 30%.

Closes #188
Pre-reviewed by 10 agents, 24 findings addressed:
- Rename prompt_cost_ratio → prompt_token_ratio (it measures tokens, not cost)
- Convert prompt_token_ratio to @computed_field (project convention)
- Wire the non_inferable_only config to auto-create TagBasedMemoryFilter
- Add graceful degradation for the memory filter and policy validation
- Use word-boundary regex for action-verb detection
- Add DEBUG logging to filter/guard/validation entry points
- Fix import ordering (runtime before TYPE_CHECKING)
- Update DESIGN_SPEC.md and CLAUDE.md for the new modules
- Add comprehensive test coverage for all new behavior
Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
Caution: Review failed — the pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration — Configuration used: Organization UI · Review profile: ASSERTIVE · Plan: Pro · Run ID:
📒 Files selected for processing (10)
📝 Walkthrough — Summary by CodeRabbit
Walkthrough

Adds non-inferable policy validation, removes Tools from the default system prompt, introduces a pluggable memory-filter stage (tag-based / passthrough) with a store guard, and records prompt_tokens and prompt_token_ratio with new observability events and warnings across prompt, memory, metrics, and engine surfaces.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant AgentEngine as AgentEngine
    participant PolicyValidator as PolicyValidator
    participant PromptBuilder as PromptBuilder
    participant Retriever as Retriever
    participant MemoryBackend as MemoryBackend
    participant MemoryFilter as MemoryFilter
    participant LLM as LLM
    AgentEngine->>PolicyValidator: validate_policy_quality(org_policies)
    Note right of PolicyValidator: advisory issues emitted (PROMPT_POLICY_QUALITY_ISSUE)
    AgentEngine->>PromptBuilder: build_system_prompt(...)
    PromptBuilder->>Retriever: request_context(task, retrieval_config)
    Retriever->>MemoryBackend: fetch_ranked_memories(query)
    MemoryBackend-->>Retriever: ranked_memories
    Retriever->>MemoryFilter: filter_for_injection(ranked_memories)
    alt filter raises MemoryError/RecursionError
        MemoryFilter-->>Retriever: propagate error
    else filter fails (domain error)
        MemoryFilter-->>Retriever: log degraded, return unfiltered
    end
    Retriever-->>PromptBuilder: selected_memories
    PromptBuilder->>LLM: send_prompt(system_prompt + memories)
    LLM-->>AgentEngine: completion (tokens)
    AgentEngine->>AgentEngine: compute prompt_token_ratio & emit PROMPT_TOKEN_RATIO_HIGH if threshold exceeded
```
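The degradation branch in the diagram above can be sketched as a small wrapper. The function name and fallback behavior below are illustrative, not the retriever's actual code:

```python
def filter_with_degradation(filter_fn, ranked: list) -> list:
    """Run a memory filter, degrading gracefully on domain errors.

    System-level errors always propagate. (The repo targets Python 3.14
    and writes `except A, B:` per PEP 758; parentheses are used here for
    portability on older interpreters.)
    """
    try:
        return filter_fn(ranked)
    except (MemoryError, RecursionError):
        raise  # never swallow system-level failures
    except Exception:
        # Domain failure: a real implementation would log
        # MEMORY_RETRIEVAL_DEGRADED here, then fall back to the
        # unfiltered ranked memories.
        return list(ranked)


def broken_filter(items: list) -> list:
    raise ValueError("filter bug")


result = filter_with_degradation(broken_filter, ["m1", "m2"])  # → ["m1", "m2"]
```

The key design point: only `Exception`-class domain errors trigger the fallback; `MemoryError`/`RecursionError` re-raise so a genuinely sick process fails loudly.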
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks: ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Greptile Summary

This PR hardens the non-inferable principle (D22/D23) across the memory and engine layers: the default system prompt template no longer includes tool definitions, a …

Key findings:
Confidence Score: 2/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ContextInjectionStrategy.prepare_messages] --> B[backend.retrieve + shared_store.query]
    B --> C[rank_memories]
    C --> D{ranked empty?}
    D -- yes --> E[return empty — below min_relevance]
    D -- no --> F{_memory_filter set?}
    F -- no --> I[format_memory_context]
    F -- yes --> G[filter.filter_for_injection]
    G -- success --> H{filtered empty?}
    G -- MemoryError / RecursionError --> RE[re-raise]
    G -- Exception --> GD[⚠ MEMORY_RETRIEVAL_DEGRADED\nuse unfiltered ranked]
    GD --> I
    H -- yes --> E2[return empty — all filtered]
    H -- no --> I
    I --> J[ChatMessage tuple returned]
    subgraph ContextInjectionStrategy.__init__
        K{memory_filter param}
        K -- None + non_inferable_only=True --> L[auto-create TagBasedMemoryFilter]
        K -- provided + non_inferable_only=True --> M[log MEMORY_FILTER_INIT override]
        K -- None + non_inferable_only=False --> N[_memory_filter = None\npassthrough]
    end
    subgraph build_system_prompt
        P[_validate_org_policies\nraises PromptBuildError on blank]
        P --> Q[validate_policy_quality\nadvisory — never blocks]
        Q --> R[render template\ntools omitted by default D22]
        R --> S{over max_tokens?}
        S -- yes --> T[_trim_sections\ncompany → task → org_policies]
        S -- no --> U[_build_prompt_result\nSystemPrompt]
        T --> U
    end
```
Pull request overview
This PR hardens the “non-inferable principle” implementation across prompt construction and memory injection, adding advisory validation and observability to reduce prompt overhead and avoid injecting inferable context.
Changes:
- Removes tool definitions from the default system prompt template (tools remain available via API/tooling and custom templates).
- Introduces pluggable memory filtering (tag-based non-inferable) with config-driven wiring and graceful degradation on filter errors.
- Adds org policy quality validation heuristics plus prompt token overhead metrics/events (including a high-ratio warning).
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ai_company/engine/prompt_template.py | Bumps template version and removes default “Available Tools” section. |
| src/ai_company/engine/prompt.py | Documents non-inferable principle; makes policy validation advisory-only. |
| src/ai_company/engine/policy_validation.py | Adds heuristic validator for org policy quality + logging. |
| src/ai_company/engine/metrics.py | Adds prompt_tokens and computed prompt_token_ratio to task metrics. |
| src/ai_company/engine/agent_engine.py | Logs prompt token metrics and emits warning when ratio is high. |
| src/ai_company/memory/filter.py | Adds MemoryFilterStrategy + tag-based and passthrough filters. |
| src/ai_company/memory/retriever.py | Wires optional post-ranking memory filter; config auto-enables tag filter. |
| src/ai_company/memory/retrieval_config.py | Adds non_inferable_only flag to drive filter behavior. |
| src/ai_company/memory/store_guard.py | Adds advisory guard for missing non-inferable tag on store requests. |
| src/ai_company/observability/events/prompt.py | Adds policy-quality and token-ratio event constants. |
| src/ai_company/observability/events/memory.py | Adds memory-filter related event constants. |
| tests/unit/** | Adds/updates unit tests for filter integration, store guard, prompt template changes, policy validation, and metrics. |
| DESIGN_SPEC.md / CLAUDE.md | Updates documentation to reflect new modules/pipeline stages and logging examples. |
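The pluggable filter stage described in the table above (src/ai_company/memory/filter.py) can be sketched roughly as follows. The class and method names come from the PR summary; the `Memory` stand-in (replacing the repo's `ScoredMemory`) and the tag literal are simplified assumptions:

```python
from typing import Protocol, runtime_checkable

NON_INFERABLE_TAG = "non-inferable"  # assumed literal for the required tag


class Memory:
    """Simplified stand-in for the repo's ScoredMemory model."""

    def __init__(self, content: str, tags: tuple[str, ...]) -> None:
        self.content = content
        self.tags = tags


@runtime_checkable
class MemoryFilterStrategy(Protocol):
    """Pluggable post-ranking filter applied before prompt injection."""

    def filter_for_injection(self, memories: list[Memory]) -> list[Memory]: ...


class TagBasedMemoryFilter:
    """Retain only memories that carry the required tag."""

    def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
        self._required_tag = required_tag

    def filter_for_injection(self, memories: list[Memory]) -> list[Memory]:
        # Memories without the tag are presumed inferable and dropped.
        return [m for m in memories if self._required_tag in m.tags]


memories = [
    Memory("team prefers tabs over spaces", (NON_INFERABLE_TAG,)),
    Memory("def foo(): ...", ()),  # inferable from the codebase — dropped
]
kept = TagBasedMemoryFilter().filter_for_injection(memories)
```

Because the protocol is `@runtime_checkable`, the retriever can `isinstance()`-check any injected filter, which matches the review comment praising that property.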
```python
if NON_INFERABLE_TAG not in request.metadata.tags:
    logger.warning(
        MEMORY_FILTER_STORE_MISSING_TAG,
        category=request.category.value,
        content_preview=request.content[:80],
        tags=request.metadata.tags,
    )
```
This warning log includes content_preview=request.content[:80]. Memory contents can plausibly contain sensitive/PII data, and the current observability sanitization only redacts based on key names, so this preview will be emitted in cleartext. Safer options: omit content entirely, log only content length / hash, or gate the preview behind a debug-only flag with explicit redaction.
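One way to act on this suggestion: log only derived, non-reversible descriptors of the content. This is a sketch of the reviewer's "length / hash" option, not repo code:

```python
import hashlib


def safe_content_fields(content: str) -> dict[str, object]:
    """Log-safe descriptors: length and a truncated digest, never cleartext."""
    return {
        "content_length": len(content),
        "content_sha256": hashlib.sha256(content.encode()).hexdigest()[:12],
    }


fields = safe_content_fields("user email is jane@example.com")
# The PII never appears in the log fields, but the digest still lets
# operators correlate repeated warnings about the same memory.
```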
```python
def validate_memory_tags(request: MemoryStoreRequest) -> None:
    """Log a warning when the non-inferable tag is missing.

    This is advisory only — the store operation is never blocked.
    Wire into ``MemoryBackend.store()`` callers to activate enforcement.
    """
```
validate_memory_tags() is introduced as the store-boundary guard, but a repo-wide search shows no call sites, so the guard currently never runs in production code. To actually enforce/tag-warn at the boundary, wire this into the relevant MemoryBackend.store() implementation(s) or into a wrapper created by the memory factory (keeping it advisory-only).
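A minimal way to wire the guard without touching each backend is a decorating wrapper. The backend API, the dict-shaped request, and the guard callable below are assumptions based on this comment, not the repo's actual signatures:

```python
calls: list[str] = []


def advisory_guard(request: dict) -> None:
    """Stand-in for validate_memory_tags(): warns, never raises."""
    if "non-inferable" not in request.get("tags", ()):
        calls.append("warned")  # stand-in for logger.warning(...)


class FakeBackend:
    def store(self, request: dict) -> dict:
        calls.append("stored")
        return request


class GuardedBackend:
    """Runs the advisory tag guard, then always delegates to the backend."""

    def __init__(self, backend, guard) -> None:
        self._backend = backend
        self._guard = guard

    def store(self, request: dict) -> dict:
        self._guard(request)                 # advisory only
        return self._backend.store(request)  # store proceeds regardless


guarded = GuardedBackend(FakeBackend(), advisory_guard)
guarded.store({"tags": ()})  # calls == ["warned", "stored"]
```

The memory factory could return `GuardedBackend(real_backend, validate_memory_tags)` so every store path passes through the guard while staying advisory-only.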
```python
def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
    self._required_tag = required_tag
    logger.debug(
        MEMORY_FILTER_APPLIED,
        strategy=self.strategy_name,
        phase="init",
        required_tag=required_tag,
    )
```
TagBasedMemoryFilter.__init__() logs MEMORY_FILTER_APPLIED with phase="init", but the event name implies an application action and the PR description mentions a separate MEMORY_FILTER_INIT constant. Consider adding a dedicated init event constant (or renaming the current log/event) so init vs application can be distinguished cleanly in telemetry.
```python
# Advisory only — issues are logged but never block prompt construction.
if org_policies:
    try:
        validate_policy_quality(org_policies)
    except MemoryError, RecursionError:  # PEP 758 syntax (Python 3.14)
        raise
    except Exception:
        logger.warning(
            PROMPT_BUILD_ERROR,
            agent_id=str(agent.id),
            error="Policy quality validation failed (advisory, continuing)",
            exc_info=True,
        )
```
The failure path for advisory policy validation logs under PROMPT_BUILD_ERROR (prompt.build.error). Since prompt construction continues successfully, this event can look like a real build failure and may trigger error-rate alerts. Consider adding a dedicated event (e.g. PROMPT_POLICY_VALIDATION_FAILED) or logging under PROMPT_POLICY_QUALITY_ISSUE with an explicit phase to keep observability semantics accurate.
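A dedicated constant keeps build-failure alerting clean. The event name below follows the reviewer's suggestion and is hypothetical, not the repo's actual constant:

```python
from typing import Final

# Hypothetical constants — the real ones would live in
# ai_company.observability.events.prompt.
PROMPT_BUILD_ERROR: Final[str] = "prompt.build.error"
PROMPT_POLICY_VALIDATION_FAILED: Final[str] = "prompt.policy.validation_failed"


def log_advisory_failure(logger_warning) -> None:
    """Emit the advisory failure under its own event so error-rate alerts
    keyed on prompt.build.error stay accurate."""
    logger_warning(
        PROMPT_POLICY_VALIDATION_FAILED,
        error="Policy quality validation failed (advisory, continuing)",
    )


emitted: list[str] = []
log_advisory_failure(lambda event, **kw: emitted.append(event))
```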
```python
logger.debug(
    PROMPT_POLICY_QUALITY_ISSUE,
    phase="start",
    policy_count=len(policies),
)
```
validate_policy_quality() logs a DEBUG event using PROMPT_POLICY_QUALITY_ISSUE with phase="start". This means the same event name will be emitted even when there are zero issues, which can skew event-based analytics (counts of “quality issues”). Consider using a distinct start event constant or a differently named event for the start log.
```python
_CODE_PATTERNS: Final[tuple[re.Pattern[str], ...]] = (
    re.compile(r"(?:src|tests|lib|app)/[\w/]+\.py"),  # file paths
    re.compile(r"\bfrom\s+\w+\s+import\b"),  # Python imports
    re.compile(r"\bimport\s+\w+"),  # bare imports
    re.compile(r"\bdef\s+\w+\s*\("),  # function definitions
    re.compile(r"\bclass\s+\w+[\s:(]"),  # class definitions
)
```
The _CODE_PATTERNS regexes are case-sensitive (e.g. \bimport\s+\w+), so policies containing capitalized forms like Import json / From x import y won't be detected. Consider compiling these patterns with re.IGNORECASE (or running them against policy.lower()) to make the heuristic robust to capitalization.
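A quick demonstration of the gap and the suggested fix — the policy string is illustrative:

```python
import re

# The PR's case-sensitive pattern misses capitalized forms; compiling with
# re.IGNORECASE (as the comment suggests) catches them.
case_sensitive = re.compile(r"\bimport\s+\w+")
case_insensitive = re.compile(r"\bimport\s+\w+", re.IGNORECASE)

policy = "Import json before parsing configs"
# case_sensitive misses "Import json"; case_insensitive matches it.
```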
```python
accumulated = result.execution_result.context.accumulated_cost
return cls(
    task_id=result.task_id,
    agent_id=result.agent_id,
    turns_per_task=result.total_turns,
    tokens_per_task=accumulated.total_tokens,
    cost_per_task=result.total_cost_usd,
    duration_seconds=result.duration_seconds,
    prompt_tokens=result.system_prompt.estimated_tokens,
)
```
prompt_tokens is populated from result.system_prompt.estimated_tokens, but the system prompt message is included in ctx.conversation and is resent on every provider call. Since tokens_per_task aggregates tokens across all turns, prompt_token_ratio will be underestimated for multi-turn runs. Consider either (a) making prompt_tokens represent total prompt tokens across the run (e.g., estimate × result.total_turns), or (b) renaming the field to clarify it's per-call and adjusting the ratio/warning accordingly.
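Option (a) from this comment, sketched with illustrative numbers — the field and function names are stand-ins, not the repo's actual model:

```python
def total_prompt_ratio(
    prompt_tokens_per_call: int, total_turns: int, tokens_per_task: int
) -> float:
    """The system prompt is resent every turn, so total prompt overhead
    is the per-call estimate times the turn count."""
    if tokens_per_task == 0:
        return 0.0
    return (prompt_tokens_per_call * total_turns) / tokens_per_task


per_call_ratio = 500 / 10_000  # current computation reports 5%
corrected = total_prompt_ratio(500, 4, 10_000)  # 20% across 4 turns
```

For a 4-turn run the per-call ratio understates the true overhead by 4x, which is exactly the skew the comment describes.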
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/ai_company/engine/prompt.py (1)

424-457: ⚠️ Potential issue | 🟠 Major

`SystemPrompt.sections` is now wrong for custom templates that render tools.

The module docstring and `_build_template_context()` still support custom templates that render `tools`, but `_compute_sections()` can no longer ever report a `"tools"` section. That makes the public `sections` metadata and the `PROMPT_BUILD_SUCCESS` log inaccurate for a supported rendering path. Either restore tool tracking for custom templates, or narrow the documented contract to "default-template sections only."

Also applies to: 647-651
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/ai_company/engine/prompt.py` around lines 424-457, _compute_sections currently never includes the tools section, which breaks SystemPrompt.sections and PROMPT_BUILD_SUCCESS for custom templates that render tools; update _compute_sections to append _SECTION_TOOLS when the template rendering path will include tools (e.g., when available_tools is non-empty or when the template context indicates it renders tools), and mirror the same check where sections are computed in the other related block referenced (around the other function at lines 647-651); ensure the change uses the existing symbols _compute_sections, _SECTION_TOOLS, available_tools, _build_template_context, SystemPrompt.sections, and PROMPT_BUILD_SUCCESS so the public sections metadata accurately reflects templates that include tools.

DESIGN_SPEC.md (1)

1606-1608: ⚠️ Potential issue | 🟡 Minor

Reconcile the "enforced" vs "advisory" store-boundary wording.

Section 7.7 says the non-inferable tag convention is enforced at `MemoryBackend.store()`, while the project-structure entry describes `store_guard.py` as advisory. Please align those terms so the spec does not promise hard enforcement if the current implementation only warns.

Also applies to: 2931-2931
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/engine/policy_validation.py`:
- Around line 124-185: Split the long _check_single_policy function into focused
helpers: implement helpers like _check_policy_length(policy) (using
_MIN_POLICY_LENGTH and _MAX_POLICY_LENGTH and returning
list[PolicyQualityIssue]), _check_policy_code_patterns(policy) (using
_CODE_PATTERNS and preserving the single-match break behavior), and
_check_policy_action_verbs(policy) (using _ACTION_VERB_RE), then have
_check_single_policy simply call and aggregate results from these helpers; keep
all existing messages/severity and the PolicyQualityIssue construction identical
so behavior and tests remain unchanged, add small docstrings for each helper and
update or add unit tests as needed.
- Around line 99-114: The preflight debug call uses PROMPT_POLICY_QUALITY_ISSUE
so every run emits an "issue" event; change the preflight marker to a distinct
constant or remove it so only real findings use PROMPT_POLICY_QUALITY_ISSUE.
Locate the debug call that invokes logger.debug with PROMPT_POLICY_QUALITY_ISSUE
(in the loop surrounding _check_single_policy and the issues emission) and
either replace PROMPT_POLICY_QUALITY_ISSUE with a new
PROMPT_POLICY_QUALITY_START (or similar) constant and add that constant where
events are defined, or simply delete the preflight logger.debug line so only the
subsequent logger.warning calls emit PROMPT_POLICY_QUALITY_ISSUE for real
issues.
In `@src/ai_company/engine/prompt.py`:
- Around line 198-210: The org_policies validation currently swallows
non-string/blank entries into a broad except and allows corrupted policies to
reach template rendering; update prompt construction to perform strict per-item
validation (implement a helper like _validate_org_policies(agent: AgentIdentity,
org_policies: tuple[str, ...])) that iterates org_policies, logs an error via
logger.error(PROMPT_BUILD_ERROR, agent_id=str(agent.id), error=msg) and raises
PromptBuildError for any item that is not a non-empty string, and replace the
broad except Exception around validate_policy_quality with either letting
PromptBuildError propagate or only catching MemoryError/RecursionError so
malformed inputs fail fast before template rendering.
In `@tests/unit/engine/test_agent_engine.py`:
- Around line 861-915: Parametrize the two prompt-ratio cases rather than
duplicating tests: replace the two separate async tests in
TestAgentEnginePromptTokenRatioWarning with a single `@pytest.mark.parametrize`
that yields (input_tokens, output_tokens, estimated_prompt_tokens,
expect_warning). In the test body create the mocked completion response and
provider as before, then make the prompt-size deterministic by injecting a
SystemPrompt (or setting identity.system_prompt) with the specific
estimated_tokens value (or monkeypatch the prompt_template.
SystemPrompt.estimated_tokens) so AgentEngine._log_completion() sees that
explicit estimate when engine.run(...) executes; finally assert based on
expect_warning that PROMPT_TOKEN_RATIO_HIGH appears in
structlog.testing.capture_logs(). Use symbols AgentEngine, engine.run,
_log_completion, SystemPrompt, and PROMPT_TOKEN_RATIO_HIGH to locate the code to
change.
In `@tests/unit/memory/test_retriever.py`:
- Around line 469-485: The test test_filter_skipped_when_none currently relies
on implicit defaults and should explicitly pin non_inferable_only to False: when
constructing ContextInjectionStrategy with memory_filter=None, pass
MemoryRetrievalConfig(min_relevance=0.0, non_inferable_only=False) so the test
exercises the “no-filter” branch regardless of config defaults; update the
instantiation that uses MemoryRetrievalConfig in this test to include
non_inferable_only=False.
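The parametrization suggested for the prompt-ratio tests above could look roughly like this. The threshold, the event name, and the `warning_events` helper are simplified stand-ins for the engine's real `_log_completion()` logic:

```python
import pytest

# Assumed names: the real constant and threshold live in the engine's
# observability and metrics modules.
PROMPT_TOKEN_RATIO_HIGH = "prompt.token_ratio.high"
_RATIO_THRESHOLD = 0.30


def warning_events(prompt_tokens: int, total_tokens: int) -> list[str]:
    """Stand-in for the engine's high-ratio warning decision."""
    ratio = prompt_tokens / total_tokens if total_tokens else 0.0
    return [PROMPT_TOKEN_RATIO_HIGH] if ratio > _RATIO_THRESHOLD else []


@pytest.mark.unit
@pytest.mark.parametrize(
    ("prompt_tokens", "total_tokens", "expect_warning"),
    [
        (400, 1_000, True),   # 40% ratio: warning fires
        (100, 1_000, False),  # 10% ratio: quiet
        (0, 0, False),        # degenerate case: no tokens, no warning
    ],
)
def test_prompt_token_ratio_warning(prompt_tokens, total_tokens, expect_warning):
    events = warning_events(prompt_tokens, total_tokens)
    assert (PROMPT_TOKEN_RATIO_HIGH in events) == expect_warning
```

In the real test the deterministic prompt size would come from injecting a `SystemPrompt` with a fixed `estimated_tokens`, and the events would be captured via `structlog.testing.capture_logs()` as the prompt suggests.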
---
Outside diff comments:
In `@src/ai_company/engine/prompt.py`:
- Around line 424-457: _compute_sections currently never includes the tools
section, which breaks SystemPrompt.sections and PROMPT_BUILD_SUCCESS for custom
templates that render tools; update _compute_sections to append _SECTION_TOOLS
when the template rendering path will include tools (e.g., when available_tools
is non-empty or when the template context indicates it renders tools), and
mirror the same check where sections are computed in the other related block
referenced (around the other function at lines 647-651); ensure the change uses
the existing symbols _compute_sections, _SECTION_TOOLS, available_tools,
_build_template_context, SystemPrompt.sections, and PROMPT_BUILD_SUCCESS so the
public sections metadata accurately reflects templates that include tools.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7b576f14-6023-4d51-bed1-961c80e683bc
📒 Files selected for processing (21)
- CLAUDE.md
- DESIGN_SPEC.md
- src/ai_company/engine/agent_engine.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/policy_validation.py
- src/ai_company/engine/prompt.py
- src/ai_company/engine/prompt_template.py
- src/ai_company/memory/filter.py
- src/ai_company/memory/retrieval_config.py
- src/ai_company/memory/retriever.py
- src/ai_company/memory/store_guard.py
- src/ai_company/observability/events/memory.py
- src/ai_company/observability/events/prompt.py
- tests/unit/engine/test_agent_engine.py
- tests/unit/engine/test_metrics.py
- tests/unit/engine/test_policy_validation.py
- tests/unit/engine/test_prompt.py
- tests/unit/memory/org/test_prompt_integration.py
- tests/unit/memory/test_filter.py
- tests/unit/memory/test_retriever.py
- tests/unit/memory/test_store_guard.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `from ai_company.observability import get_logger` and instantiate the logger as `logger = get_logger(__name__)` in every module with business logic.

Files:
- src/ai_company/observability/events/memory.py
- tests/unit/memory/test_store_guard.py
- src/ai_company/observability/events/prompt.py
- src/ai_company/engine/agent_engine.py
- tests/unit/engine/test_metrics.py
- src/ai_company/engine/policy_validation.py
- src/ai_company/engine/prompt_template.py
- tests/unit/engine/test_policy_validation.py
- tests/unit/memory/test_filter.py
- tests/unit/engine/test_prompt.py
- src/ai_company/memory/retriever.py
- src/ai_company/memory/filter.py
- src/ai_company/memory/retrieval_config.py
- src/ai_company/memory/store_guard.py
- tests/unit/memory/org/test_prompt_integration.py
- src/ai_company/engine/metrics.py
- tests/unit/memory/test_retriever.py
- tests/unit/engine/test_agent_engine.py
- src/ai_company/engine/prompt.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`src/ai_company/**/*.py`:
- Never use `import logging`, `logging.getLogger()`, or `print()` in application code — only use the centralized logger
- Use event name constants from `ai_company.observability.events.<domain>` modules (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`) instead of string literals
- Always use structured logging with `logger.info(EVENT, key=value)` format — never use format strings like `logger.info('msg %s', val)`
- No `from __future__ import annotations` — Python 3.14 has PEP 649 native lazy annotations
- Use `except A, B:` syntax (no parentheses) per PEP 758 — ruff enforces this on Python 3.14
- Add type hints to all public functions and classes; mypy strict mode is enforced
- Add Google-style docstrings to all public classes and functions — ruff D rules enforce this
- Use immutability principles: create new objects instead of mutating existing ones; for non-Pydantic collections use `copy.deepcopy()` at construction and `MappingProxyType` for read-only enforcement
- Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves
- Use `NotBlankStr` from `core.types` for all identifier and name fields instead of manual whitespace validators, including optional (`NotBlankStr | None`) and tuple variants
- Use `@computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)
- Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls) over bare `create_task`
- Keep functions under 50 lines and files under 800 lines
- Keep line length at 88 characters (enforced by ruff)
- Handle all errors explicitly; never silently swallow exceptions
- Validate at system boundaries (user input, external APIs, config files)
- Never use real vendor names (Anthropic, OpenAI, Claude, GPT) in project-owned code, docstrings...

Files:
- src/ai_company/observability/events/memory.py
- src/ai_company/observability/events/prompt.py
- src/ai_company/engine/agent_engine.py
- src/ai_company/engine/policy_validation.py
- src/ai_company/engine/prompt_template.py
- src/ai_company/memory/retriever.py
- src/ai_company/memory/filter.py
- src/ai_company/memory/retrieval_config.py
- src/ai_company/memory/store_guard.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/prompt.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/**/*.py`:
- Add test markers `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` to all test files
- Maintain 80% minimum code coverage — enforced in CI with `pytest --cov=ai_company --cov-fail-under=80`
- Use `asyncio_mode = 'auto'` in pytest configuration — no manual `@pytest.mark.asyncio` needed on async tests
- Set test timeout to 30 seconds per test — use `@pytest.mark.timeout(30)` or configure in pytest.ini
- Use `@pytest.mark.parametrize` for testing similar cases instead of duplicating test functions

Files:
- tests/unit/memory/test_store_guard.py
- tests/unit/engine/test_metrics.py
- tests/unit/engine/test_policy_validation.py
- tests/unit/memory/test_filter.py
- tests/unit/engine/test_prompt.py
- tests/unit/memory/org/test_prompt_integration.py
- tests/unit/memory/test_retriever.py
- tests/unit/engine/test_agent_engine.py
🧠 Learnings (7)
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Use event name constants from `ai_company.observability.events.<domain>` modules (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`) instead of string literals
Applied to files:
src/ai_company/observability/events/memory.py, src/ai_company/observability/events/prompt.py, src/ai_company/engine/agent_engine.py, CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Validate at system boundaries (user input, external APIs, config files)
Applied to files:
src/ai_company/engine/policy_validation.py, src/ai_company/engine/prompt.py, CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Use `computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)
Applied to files:
src/ai_company/engine/metrics.py
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to **/*.py : Use `from ai_company.observability import get_logger` and instantiate logger as `logger = get_logger(__name__)` in every module with business logic
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Always use structured logging with `logger.info(EVENT, key=value)` format — never use format strings like `logger.info('msg %s', val)`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Never use `import logging`, `logging.getLogger()`, or `print()` in application code — only use the centralized logger
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Handle all errors explicitly; never silently swallow exceptions
Applied to files:
CLAUDE.md
🧬 Code graph analysis (13)
tests/unit/memory/test_store_guard.py (3)
- src/ai_company/core/enums.py (1): `MemoryCategory` (101-108)
- src/ai_company/memory/models.py (2): `MemoryMetadata` (20-52), `MemoryStoreRequest` (55-79)
- src/ai_company/memory/store_guard.py (1): `validate_memory_tags` (21-36)

src/ai_company/engine/agent_engine.py (3)
- src/ai_company/engine/metrics.py (1): `prompt_token_ratio` (66-70)
- src/ai_company/engine/parallel_models.py (2): `agent_id` (79-81), `task_id` (87-89)
- src/ai_company/engine/loop_protocol.py (1): `total_tokens` (73-75)

tests/unit/engine/test_metrics.py (1)
- src/ai_company/engine/metrics.py (2): `prompt_token_ratio` (66-70), `TaskCompletionMetrics` (17-92)

src/ai_company/engine/policy_validation.py (1)
- src/ai_company/observability/_logger.py (1): `get_logger` (8-28)

src/ai_company/engine/prompt_template.py (1)
- src/ai_company/core/enums.py (1): `SeniorityLevel` (6-21)

tests/unit/engine/test_policy_validation.py (1)
- src/ai_company/engine/policy_validation.py (2): `PolicyQualityIssue` (67-82), `validate_policy_quality` (85-116)

tests/unit/engine/test_prompt.py (2)
- src/ai_company/engine/prompt.py (1): `build_system_prompt` (159-258)
- src/ai_company/core/company.py (1): `Company` (400-483)

src/ai_company/memory/retriever.py (1)
- src/ai_company/memory/filter.py (5): `TagBasedMemoryFilter` (48-101), `MemoryFilterStrategy` (25-45), `filter_for_injection` (28-40), `filter_for_injection` (68-92), `filter_for_injection` (111-129)

src/ai_company/memory/filter.py (2)
- src/ai_company/observability/_logger.py (1): `get_logger` (8-28)
- src/ai_company/memory/ranking.py (1): `ScoredMemory` (26-60)

src/ai_company/memory/store_guard.py (2)
- src/ai_company/observability/_logger.py (1): `get_logger` (8-28)
- src/ai_company/memory/models.py (1): `MemoryStoreRequest` (55-79)

tests/unit/memory/test_retriever.py (2)
- src/ai_company/memory/filter.py (2): `PassthroughMemoryFilter` (104-138), `TagBasedMemoryFilter` (48-101)
- src/ai_company/memory/models.py (1): `MemoryMetadata` (20-52)

tests/unit/engine/test_agent_engine.py (1)
- src/ai_company/engine/agent_engine.py (1): `run` (169-267)

src/ai_company/engine/prompt.py (2)
- src/ai_company/engine/policy_validation.py (1): `validate_policy_quality` (85-116)
- src/ai_company/memory/errors.py (1): `MemoryError` (13-14)
🪛 LanguageTool
CLAUDE.md
[style] ~87-~87: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...
(EG_NO_COMMA)
🔇 Additional comments (11)
tests/unit/memory/org/test_prompt_integration.py (1)

51-52: LGTM! The version expectation update to `"1.3.0"` aligns with the template version bump in this PR.

src/ai_company/memory/retrieval_config.py (1)

93-96: LGTM! The new `non_inferable_only` configuration field is well-documented with clear descriptions in both the docstring and Field metadata. Default `False` preserves backward compatibility.

src/ai_company/observability/events/memory.py (1)

66-70: LGTM! New event constants follow the established `memory.<entity>.<action>` naming convention and are properly typed with `Final[str]`. The section header improves organization.

src/ai_company/memory/store_guard.py (1)

1-36: LGTM! Well-structured advisory guard with proper observability. The function correctly uses event constants, structured logging with `key=value` format, and TYPE_CHECKING for import optimization. The docstring clearly documents the advisory-only nature.

src/ai_company/memory/filter.py (1)

1-138: LGTM! Well-designed pluggable filter architecture:
- The Protocol is `@runtime_checkable`, enabling isinstance() checks
- Both implementations properly log filter statistics (candidates/retained)
- DEBUG logging at init, INFO at filter execution provides good observability
- Consistent structured logging with event constants
src/ai_company/memory/retriever.py (2)

273-295: LGTM! Excellent filter integration with proper error handling:
- Graceful degradation on filter errors (uses unfiltered results)
- Re-raises system-level errors (`MemoryError`, `RecursionError`)
- Logs with filter strategy name for debugging
- Properly handles empty filter results with informative skip reason

The filter is correctly positioned in the pipeline after ranking but before formatting.

126-128: LGTM! The auto-instantiation logic correctly defaults to `TagBasedMemoryFilter` when `config.non_inferable_only` is enabled and no explicit filter is provided, maintaining backward compatibility when the flag is `False`.

src/ai_company/observability/events/prompt.py (1)

12-13: LGTM! New prompt event constants follow the established naming convention. The `# noqa: S105` suppression on line 13 is appropriate — this is an event name, not a credential.

CLAUDE.md (2)
52-54: LGTM!Documentation accurately reflects the new modules:
- Engine module now includes "prompt policy validation" (policy_validation.py)
- Memory module now includes "non-inferable filtering" (filter.py)
87-88: LGTM!Event name examples updated to include
PROMPT_BUILD_STARTfromevents.promptandMEMORY_RETRIEVAL_STARTfromevents.memory, providing helpful guidance for developers working with the new observability surface.src/ai_company/engine/metrics.py (1)
58-70: Nice use of `@computed_field` for `prompt_token_ratio`. Keeping the ratio derived from `prompt_tokens` and `tokens_per_task` avoids redundant state and makes `from_run_result()` simpler.

Based on learnings: Use `computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)
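The derived-field idea, reduced to a stdlib sketch so it runs without dependencies. The real class is a frozen Pydantic model using `@computed_field`; the field names match the PR, but everything else here is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCompletionMetrics:
    """Stdlib stand-in; the real class is a frozen Pydantic model."""

    prompt_tokens: int
    tokens_per_task: int

    @property  # the real model uses Pydantic's @computed_field here
    def prompt_token_ratio(self) -> float:
        # Derived on access; never stored as a redundant field.
        if self.tokens_per_task > 0:
            return self.prompt_tokens / self.tokens_per_task
        return 0.0


m = TaskCompletionMetrics(prompt_tokens=300, tokens_per_task=1000)
print(m.prompt_token_ratio)  # 0.3
```

With Pydantic's `@computed_field`, the derived value additionally shows up in `model_dump()` output, which a plain property does not give you.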
```python
def _check_single_policy(policy: str) -> list[PolicyQualityIssue]:
    """Run all heuristic checks on a single policy string.

    Args:
        policy: The policy text to validate.

    Returns:
        List of quality issues found (empty if the policy passes all checks).
    """
    found: list[PolicyQualityIssue] = []

    if len(policy) < _MIN_POLICY_LENGTH:
        found.append(
            PolicyQualityIssue(
                policy=policy,
                issue=(
                    f"Too short ({len(policy)} chars) — likely not an actionable policy"
                ),
                severity="warning",
            ),
        )

    if len(policy) > _MAX_POLICY_LENGTH:
        found.append(
            PolicyQualityIssue(
                policy=policy,
                issue=(
                    f"Too long ({len(policy)} chars) — "
                    f"may contain inferable context rather than a policy"
                ),
                severity="warning",
            ),
        )

    for pattern in _CODE_PATTERNS:
        if pattern.search(policy):
            found.append(
                PolicyQualityIssue(
                    policy=policy,
                    issue=(
                        "Contains code patterns (file paths, imports, or "
                        "definitions) — likely inferable from the codebase"
                    ),
                    severity="warning",
                ),
            )
            break  # One code-pattern match is sufficient.

    policy_lower = policy.lower()
    if not _ACTION_VERB_RE.search(policy_lower):
        found.append(
            PolicyQualityIssue(
                policy=policy,
                issue=(
                    "Missing action verbs (must, should, always, never, "
                    "etc.) — may not be an actionable policy"
                ),
                severity="warning",
            ),
        )

    return found
```
🛠️ Refactor suggestion | 🟠 Major
Split _check_single_policy() into focused helpers.
This helper is already over the 50-line limit and now mixes length, code-pattern, and action-verb heuristics. Extract those checks into small helpers so future rule changes stay isolated and easier to test.
Refactor sketch

```diff
 def _check_single_policy(policy: str) -> list[PolicyQualityIssue]:
-    found: list[PolicyQualityIssue] = []
-
-    if len(policy) < _MIN_POLICY_LENGTH:
-        found.append(...)
-
-    if len(policy) > _MAX_POLICY_LENGTH:
-        found.append(...)
-
-    for pattern in _CODE_PATTERNS:
-        if pattern.search(policy):
-            found.append(...)
-            break
-
-    policy_lower = policy.lower()
-    if not _ACTION_VERB_RE.search(policy_lower):
-        found.append(...)
-
-    return found
+    return [
+        *_check_policy_length(policy),
+        *_check_code_patterns(policy),
+        *_check_action_verbs(policy),
+    ]
```

As per coding guidelines "Keep functions under 50 lines and files under 800 lines".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/policy_validation.py` around lines 124 - 185, Split the
long _check_single_policy function into focused helpers: implement helpers like
_check_policy_length(policy) (using _MIN_POLICY_LENGTH and _MAX_POLICY_LENGTH
and returning list[PolicyQualityIssue]), _check_policy_code_patterns(policy)
(using _CODE_PATTERNS and preserving the single-match break behavior), and
_check_policy_action_verbs(policy) (using _ACTION_VERB_RE), then have
_check_single_policy simply call and aggregate results from these helpers; keep
all existing messages/severity and the PolicyQualityIssue construction identical
so behavior and tests remain unchanged, add small docstrings for each helper and
update or add unit tests as needed.
…eptile

- Split policy_validation into focused helper functions (<50 lines each)
- Add case-insensitive regex matching for code pattern detection
- Add fail-fast org_policies validation in prompt builder
- Add model_validator for prompt_tokens <= tokens_per_task in metrics
- Add allow_inf_nan=False to PolicyQualityIssue and TaskCompletionMetrics
- Use distinct event constants for validation start vs quality issues
- Upgrade classification failure log from DEBUG to WARNING with exc_info
- Replace content_preview with content_length in store_guard to avoid PII
- Add non-empty validation for TagBasedMemoryFilter required_tag
- Add error_type to filter error handler for consistency
- Update DESIGN_SPEC.md, README.md with non-inferable filtering references
- Add boundary tests for policy length, frozen enforcement, severity
- Add prompt validation integration tests (error isolation, blank rejection)
- Add retriever tests for config-driven filter, graceful degradation, MemoryError
- Parametrize prompt token ratio warning tests
The model_validator ensuring prompt_tokens <= tokens_per_task was rejecting the metrics in test_metrics_logged_on_completion because the default 150 tokens was less than the system prompt estimate (~306).
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's adherence to the 'non-inferable principle' by refining how information is presented to agents, particularly within system prompts and memory retrieval. It introduces mechanisms to validate policy quality, filters memory to include only non-inferable content, and improves prompt token ratio metrics for better overhead visibility. These changes aim to make agent interactions more efficient and aligned with design principles, ensuring agents receive only necessary, non-discoverable information.
Actionable comments posted: 7
♻️ Duplicate comments (1)
tests/unit/engine/test_agent_engine.py (1)
865-905: 🧹 Nitpick | 🔵 Trivial

Make the ratio-warning test independent of prompt size.

The parametrization helps, but `expect_warning=True` still depends on whatever `build_system_prompt()` estimates today. An unrelated prompt-template edit can flip this test without changing `AgentEngine`. Inject a fixed `SystemPrompt.estimated_tokens` or unit-test `_log_completion()` directly so only the threshold logic is under test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/engine/test_agent_engine.py` around lines 865 - 905, The test test_prompt_token_ratio_warning is brittle because it relies on the current build_system_prompt() estimate; make it deterministic by injecting a fixed prompt token estimate or testing the threshold function directly: either (A) set SystemPrompt.estimated_tokens to a known constant (or monkeypatch the SystemPrompt instance used by AgentEngine) before creating the engine so the prompt size is controlled when calling engine.run with the mock provider, or (B) call AgentEngine._log_completion(...) directly with a constructed CompletionResponse from _make_completion_response and an explicit prompt_token_count to exercise only the ratio/threshold logic; reference AgentEngine, _log_completion, SystemPrompt.estimated_tokens and _make_completion_response to locate the relevant code to change.
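Option (B) amounts to exercising only the threshold logic. A sketch under the assumption that the warning fires above the 30% ratio mentioned in the PR description; `should_warn` is a hypothetical extraction, not an existing function in the codebase.

```python
PROMPT_TOKEN_RATIO_WARN_THRESHOLD = 0.30  # assumed from the PR's 30% warning


def should_warn(prompt_tokens: int, tokens_per_task: int) -> bool:
    """Pure threshold check, isolated from prompt construction."""
    if tokens_per_task <= 0:
        return False
    return prompt_tokens / tokens_per_task > PROMPT_TOKEN_RATIO_WARN_THRESHOLD


# Deterministic: no dependency on build_system_prompt() estimates.
print(should_warn(400, 1000))  # True
print(should_warn(100, 1000))  # False
```

Testing this function directly means a prompt-template edit can no longer flip the test result.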
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/ai_company/engine/metrics.py`:
- Around line 67-80: The validator _validate_prompt_tokens should not raise a
ValueError when SystemPrompt.estimated_tokens (prompt_tokens) exceeds
tokens_per_task; instead, compute and store a bounded ratio so the model isn't
rejected: when self.tokens_per_task > 0 calculate ratio = self.prompt_tokens /
self.tokens_per_task, set a new or existing attribute like
self.prompt_token_ratio = min(ratio, 1.0) (and optionally clamp prompt_tokens to
self.tokens_per_task if you must keep it as an upper bound), remove the raise,
and mirror the same change for the similar validator handling completion tokens
(the validator around the 82-92 region) so both use a capped ratio rather than
throwing.
In `@src/ai_company/engine/policy_validation.py`:
- Around line 115-121: The warning currently emits policy text via
logger.warning(PROMPT_POLICY_QUALITY_ISSUE, policy=issue.policy[:80], ...),
which can leak sensitive operator-authored content; update the call in the loop
over issues so it no longer includes any substring of issue.policy and instead
logs only metadata (use the same pattern as store_guard.py) by replacing the
policy argument with a content_length or policy_length field set to
len(issue.policy), keeping issue.issue and issue.severity intact and referencing
PROMPT_POLICY_QUALITY_ISSUE, issues, and logger.warning to locate the change.
In `@src/ai_company/engine/prompt.py`:
- Around line 147-154: The trimming logic removed the tools section from global
tracking (_TRIMMABLE_SECTIONS), which breaks custom templates that render tools
and prevents max_tokens from trimming that block; restore or reintroduce the
tools section identifier (e.g., _SECTION_TOOLS) into _TRIMMABLE_SECTIONS and
ensure SystemPrompt.sections still includes the tools section when a template
opt-in indicates it renders tools (update the code paths around
_TRIMMABLE_SECTIONS, SystemPrompt.sections construction, and the max_tokens
trimming routine so tools remain trimmable for templates that declare/return
tools, while keeping tools excluded only from the default template rendering).
- Around line 197-203: Normalize org_policies at the start of
build_system_prompt by converting the incoming org_policies into a stable
iterable (e.g., org_policies = tuple(org_policies or ())) and use that
normalized variable for all subsequent calls and iterations; call
_validate_org_policies(normalized_org_policies) and
validate_policy_quality(normalized_org_policies) (and any rendering/iteration
later) instead of the original parameter so one-shot iterables or None don't get
consumed or raise TypeError—apply the same normalization and reuse in the later
block around the validate_policy_quality / rendering logic (the section around
lines 279-301).
In `@src/ai_company/memory/filter.py`:
- Around line 62-71: The constructor for required_tag accepts non-string values
and stores unnormalized input, causing AttributeError on non-strings and
mismatches for values with surrounding whitespace; update the __init__ of the
class that defines required_tag to first enforce type (raise TypeError or
ValueError if not instance of str), normalize the value by calling stripped =
required_tag.strip() and then validate stripped is non-empty (raise ValueError
if empty), assign self._required_tag = stripped, and ensure the logger call
(MEMORY_FILTER_INIT, strategy=self.strategy_name) uses the normalized stripped
value for required_tag so stored/logged tag matches how tags are compared
elsewhere.
In `@src/ai_company/memory/store_guard.py`:
- Around line 21-36: The helper validate_memory_tags(request:
MemoryStoreRequest) is never invoked so the NON_INFERABLE_TAG warning
(MEMORY_FILTER_STORE_MISSING_TAG) is inert; fix this by calling
validate_memory_tags(request) at the start of every memory persistence path —
either add the call into each concrete MemoryBackend.store(...) implementation
or place it in the shared façade/wrapper that all stores use (ensure any class
implementing MemoryBackend calls validate_memory_tags before persisting), so the
logger warning will fire when tags are missing.
In `@tests/unit/engine/test_prompt.py`:
- Around line 509-531: Combine the two tests into one parametrized test that
iterates over the invalid org_policies values; replace the separate
test_empty_org_policy_raises and test_whitespace_only_org_policy_raises with a
single `@pytest.mark.parametrize-based` test that calls
build_system_prompt(agent=sample_agent_with_personality, org_policies=<param>)
and asserts pytest.raises(PromptBuildError, match="org_policies"); keep the same
docstring and use the same sample_agent_with_personality fixture and
PromptBuildError type to ensure behavior is unchanged.
---
Duplicate comments:
In `@tests/unit/engine/test_agent_engine.py`:
- Around line 865-905: The test test_prompt_token_ratio_warning is brittle
because it relies on the current build_system_prompt() estimate; make it
deterministic by injecting a fixed prompt token estimate or testing the
threshold function directly: either (A) set SystemPrompt.estimated_tokens to a
known constant (or monkeypatch the SystemPrompt instance used by AgentEngine)
before creating the engine so the prompt size is controlled when calling
engine.run with the mock provider, or (B) call AgentEngine._log_completion(...)
directly with a constructed CompletionResponse from _make_completion_response
and an explicit prompt_token_count to exercise only the ratio/threshold logic;
reference AgentEngine, _log_completion, SystemPrompt.estimated_tokens and
_make_completion_response to locate the relevant code to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 56e3b91e-e748-4640-ab0f-a33ab5debfc8
📒 Files selected for processing (15)
- DESIGN_SPEC.md
- README.md
- src/ai_company/engine/agent_engine.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/policy_validation.py
- src/ai_company/engine/prompt.py
- src/ai_company/memory/filter.py
- src/ai_company/memory/retriever.py
- src/ai_company/memory/store_guard.py
- src/ai_company/observability/events/memory.py
- src/ai_company/observability/events/prompt.py
- tests/unit/engine/test_agent_engine.py
- tests/unit/engine/test_policy_validation.py
- tests/unit/engine/test_prompt.py
- tests/unit/memory/test_retriever.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `from ai_company.observability import get_logger` and instantiate the logger as `logger = get_logger(__name__)` in every module with business logic
Files:
- src/ai_company/observability/events/prompt.py
- src/ai_company/engine/agent_engine.py
- src/ai_company/memory/retriever.py
- src/ai_company/memory/filter.py
- src/ai_company/memory/store_guard.py
- tests/unit/engine/test_prompt.py
- src/ai_company/observability/events/memory.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/policy_validation.py
- tests/unit/memory/test_retriever.py
- src/ai_company/engine/prompt.py
- tests/unit/engine/test_policy_validation.py
- tests/unit/engine/test_agent_engine.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`src/ai_company/**/*.py`:
- Never use `import logging`, `logging.getLogger()`, or `print()` in application code — only use the centralized logger
- Use event name constants from `ai_company.observability.events.<domain>` modules (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`) instead of string literals
- Always use structured logging with `logger.info(EVENT, key=value)` format — never use format strings like `logger.info('msg %s', val)`
- No `from __future__ import annotations` — Python 3.14 has PEP 649 native lazy annotations
- Use `except A, B:` syntax (no parentheses) per PEP 758 — ruff enforces this on Python 3.14
- Add type hints to all public functions and classes; mypy strict mode is enforced
- Add Google-style docstrings to all public classes and functions — ruff D rules enforce this
- Use immutability principles: create new objects instead of mutating existing ones; for non-Pydantic collections use `copy.deepcopy()` at construction and `MappingProxyType` for read-only enforcement
- Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves
- Use `NotBlankStr` from `core.types` for all identifier and name fields instead of manual whitespace validators, including optional (`NotBlankStr | None`) and tuple variants
- Use `@computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)
- Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls) over bare `create_task`
- Keep functions under 50 lines and files under 800 lines
- Keep line length at 88 characters (enforced by ruff)
- Handle all errors explicitly; never silently swallow exceptions
- Validate at system boundaries (user input, external APIs, config files)
- Never use real vendor names (Anthropic, OpenAI, Claude, GPT) in project-owned code, docstrings...
Files:
- src/ai_company/observability/events/prompt.py
- src/ai_company/engine/agent_engine.py
- src/ai_company/memory/retriever.py
- src/ai_company/memory/filter.py
- src/ai_company/memory/store_guard.py
- src/ai_company/observability/events/memory.py
- src/ai_company/engine/metrics.py
- src/ai_company/engine/policy_validation.py
- src/ai_company/engine/prompt.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
`tests/**/*.py`:
- Add test markers `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` to all test files
- Maintain 80% minimum code coverage — enforced in CI with `pytest --cov=ai_company --cov-fail-under=80`
- Use `asyncio_mode = 'auto'` in pytest configuration — no manual `@pytest.mark.asyncio` needed on async tests
- Set test timeout to 30 seconds per test — use `@pytest.mark.timeout(30)` or configure in pytest.ini
- Use `@pytest.mark.parametrize` for testing similar cases instead of duplicating test functions

Files:
- tests/unit/engine/test_prompt.py
- tests/unit/memory/test_retriever.py
- tests/unit/engine/test_policy_validation.py
- tests/unit/engine/test_agent_engine.py
🧠 Learnings (11)
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Use event name constants from `ai_company.observability.events.<domain>` modules (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`) instead of string literals
Applied to files:
src/ai_company/observability/events/prompt.py, src/ai_company/engine/agent_engine.py, src/ai_company/observability/events/memory.py
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Use `except A, B:` syntax (no parentheses) per PEP 758 — ruff enforces this on Python 3.14
Applied to files:
src/ai_company/memory/retriever.py, src/ai_company/engine/prompt.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Handle all errors explicitly; never silently swallow exceptions
Applied to files:
src/ai_company/memory/retriever.py, src/ai_company/engine/prompt.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Use `computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)
Applied to files:
src/ai_company/engine/metrics.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Use `NotBlankStr` from `core.types` for all identifier and name fields instead of manual whitespace validators, including optional (`NotBlankStr | None`) and tuple variants
Applied to files:
src/ai_company/engine/metrics.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Always read `DESIGN_SPEC.md` before implementing any feature or planning any issue — treat it as the starting point for architecture, data models, and behavior
Applied to files:
DESIGN_SPEC.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Validate at system boundaries (user input, external APIs, config files)
Applied to files:
src/ai_company/engine/policy_validation.py, src/ai_company/engine/prompt.py, tests/unit/engine/test_policy_validation.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Keep functions under 50 lines and files under 800 lines
Applied to files:
src/ai_company/engine/policy_validation.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Keep line length at 88 characters (enforced by ruff)
Applied to files:
src/ai_company/engine/policy_validation.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to tests/**/*.py : Use `pytest.mark.parametrize` for testing similar cases instead of duplicating test functions
Applied to files:
tests/unit/engine/test_agent_engine.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to tests/**/*.py : Maintain 80% minimum code coverage — enforced in CI with `pytest --cov=ai_company --cov-fail-under=80`
Applied to files:
tests/unit/engine/test_agent_engine.py
🧬 Code graph analysis (9)
src/ai_company/engine/agent_engine.py (3)
- src/ai_company/engine/parallel_models.py (2): agent_id (79-81), task_id (87-89)
- src/ai_company/engine/metrics.py (1): prompt_token_ratio (84-92)
- src/ai_company/engine/loop_protocol.py (1): total_tokens (73-75)

src/ai_company/memory/retriever.py (1)
- src/ai_company/memory/filter.py (5): TagBasedMemoryFilter (51-106), MemoryFilterStrategy (28-48), filter_for_injection (31-43), filter_for_injection (73-97), filter_for_injection (116-134)

src/ai_company/memory/store_guard.py (2)
- src/ai_company/observability/_logger.py (1): get_logger (8-28)
- src/ai_company/memory/models.py (1): MemoryStoreRequest (55-79)

tests/unit/engine/test_prompt.py (4)
- tests/unit/engine/conftest.py (4): sample_agent_with_personality (60-87), sample_tool_definitions (128-143), sample_task_with_criteria (103-124), sample_company (175-184)
- src/ai_company/engine/prompt.py (1): build_system_prompt (160-260)
- src/ai_company/core/task.py (1): Task (45-261)
- src/ai_company/core/company.py (1): Company (400-483)

src/ai_company/engine/policy_validation.py (1)
- src/ai_company/observability/_logger.py (1): get_logger (8-28)

tests/unit/memory/test_retriever.py (4)
- src/ai_company/memory/filter.py (8): PassthroughMemoryFilter (109-143), TagBasedMemoryFilter (51-106), filter_for_injection (31-43), filter_for_injection (73-97), filter_for_injection (116-134), strategy_name (46-48), strategy_name (100-106), strategy_name (137-143)
- src/ai_company/memory/models.py (1): MemoryMetadata (20-52)
- src/ai_company/memory/retriever.py (2): ContextInjectionStrategy (96-399), strategy_name (393-399)
- src/ai_company/memory/retrieval_config.py (1): MemoryRetrievalConfig (20-132)

src/ai_company/engine/prompt.py (5)
- src/ai_company/engine/policy_validation.py (1): validate_policy_quality (93-123)
- src/ai_company/memory/errors.py (1): MemoryError (13-14)
- src/ai_company/engine/parallel_models.py (1): agent_id (79-81)
- src/ai_company/core/agent.py (1): AgentIdentity (265-323)
- src/ai_company/engine/errors.py (1): PromptBuildError (8-9)

tests/unit/engine/test_policy_validation.py (1)
- src/ai_company/engine/policy_validation.py (2): PolicyQualityIssue (71-90), validate_policy_quality (93-123)

tests/unit/engine/test_agent_engine.py (1)
- src/ai_company/engine/agent_engine.py (1): run (169-267)
🪛 GitHub Actions: CI
src/ai_company/engine/metrics.py
[error] 106-106: TaskCompletionMetrics.from_run_result validation failed. Prompt tokens (306) exceed tokens_per_task (150).
🪛 LanguageTool
README.md
[typographical] ~24-~24: To join two clauses or introduce examples, consider using an em dash.
Context: ...a migrations - Memory Interface (M5) - Pluggable MemoryBackend protocol with ...
(DASH_RULE)
[style] ~24-~24: Using four (or more) nouns in a row may decrease readability.
Context: .../archival with pluggable strategies and retention enforcement - Coordination Error Taxonomy (M5) - Post-execution classi...
(FOUR_NN)
```python
@model_validator(mode="after")
def _validate_prompt_tokens(self) -> TaskCompletionMetrics:
    """Ensure prompt_tokens does not exceed tokens_per_task.

    Skipped when ``tokens_per_task`` is 0 (zero-turn runs where the
    system prompt was built but no provider calls were made).
    """
    if self.tokens_per_task > 0 and self.prompt_tokens > self.tokens_per_task:
        msg = (
            f"prompt_tokens ({self.prompt_tokens}) cannot exceed "
            f"tokens_per_task ({self.tokens_per_task})"
        )
        raise ValueError(msg)
    return self
```
Do not validate estimated prompt tokens as an exact upper bound.
SystemPrompt.estimated_tokens is heuristic data. CI is already failing because the estimate can exceed tokens_per_task, so this validator turns otherwise valid runs into runtime errors. Bound the derived ratio instead of rejecting the model.
💡 Minimal fix

```diff
-    @model_validator(mode="after")
-    def _validate_prompt_tokens(self) -> TaskCompletionMetrics:
-        """Ensure prompt_tokens does not exceed tokens_per_task.
-
-        Skipped when ``tokens_per_task`` is 0 (zero-turn runs where the
-        system prompt was built but no provider calls were made).
-        """
-        if self.tokens_per_task > 0 and self.prompt_tokens > self.tokens_per_task:
-            msg = (
-                f"prompt_tokens ({self.prompt_tokens}) cannot exceed "
-                f"tokens_per_task ({self.tokens_per_task})"
-            )
-            raise ValueError(msg)
-        return self
-
     @computed_field  # type: ignore[prop-decorator]
     @property
     def prompt_token_ratio(self) -> float:
@@
-        if self.tokens_per_task > 0:
-            return self.prompt_tokens / self.tokens_per_task
+        if self.tokens_per_task > 0:
+            capped_prompt_tokens = min(self.prompt_tokens, self.tokens_per_task)
+            return capped_prompt_tokens / self.tokens_per_task
         return 0.0
```

Also applies to: 82-92
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/metrics.py` around lines 67 - 80, The validator
_validate_prompt_tokens should not raise a ValueError when
SystemPrompt.estimated_tokens (prompt_tokens) exceeds tokens_per_task; instead,
compute and store a bounded ratio so the model isn't rejected: when
self.tokens_per_task > 0 calculate ratio = self.prompt_tokens /
self.tokens_per_task, set a new or existing attribute like
self.prompt_token_ratio = min(ratio, 1.0) (and optionally clamp prompt_tokens to
self.tokens_per_task if you must keep it as an upper bound), remove the raise,
and mirror the same change for the similar validator handling completion tokens
(the validator around the 82-92 region) so both use a capped ratio rather than
throwing.
```python
    _validate_max_tokens(agent, max_tokens)
    _validate_org_policies(agent, org_policies)

    # Advisory only — issues are logged but never block prompt construction.
    if org_policies:
        try:
            validate_policy_quality(org_policies)
```
Normalize org_policies before walking it.
_validate_org_policies() consumes the incoming object in place and returns nothing. A one-shot iterable gets exhausted before advisory validation/rendering, and None or another non-iterable still leaks a raw TypeError before the main PromptBuildError handling. Convert once to a tuple, validate that normalized value, and reuse it for the rest of build_system_prompt().
As per coding guidelines "Validate at system boundaries (user input, external APIs, config files)".
Also applies to: 279-301
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/engine/prompt.py` around lines 197 - 203, Normalize
org_policies at the start of build_system_prompt by converting the incoming
org_policies into a stable iterable (e.g., org_policies = tuple(org_policies or
())) and use that normalized variable for all subsequent calls and iterations;
call _validate_org_policies(normalized_org_policies) and
validate_policy_quality(normalized_org_policies) (and any rendering/iteration
later) instead of the original parameter so one-shot iterables or None don't get
consumed or raise TypeError—apply the same normalization and reuse in the later
block around the validate_policy_quality / rendering logic (the section around
lines 279-301).
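The suggested boundary normalization can be sketched as follows. The helper name `normalize_org_policies` is hypothetical; only the normalize-once-to-a-tuple idea comes from the review comment.

```python
def normalize_org_policies(org_policies) -> tuple:
    """Normalize once at the boundary so later stages can iterate repeatedly.

    None becomes an empty tuple; a one-shot iterable (e.g. a generator) is
    materialized; a non-iterable raises a clear error instead of leaking a
    raw TypeError from deep inside prompt construction.
    """
    if org_policies is None:
        return ()
    if isinstance(org_policies, str):
        msg = "org_policies must be an iterable of strings, not a single string"
        raise ValueError(msg)
    try:
        return tuple(org_policies)
    except TypeError as exc:
        msg = (
            f"org_policies must be an iterable of strings, "
            f"got {type(org_policies).__name__}"
        )
        raise ValueError(msg) from exc


print(normalize_org_policies(None))                   # ()
print(normalize_org_policies(p for p in ["a", "b"]))  # ('a', 'b')
```

Validating and rendering then both consume the returned tuple, so a generator passed by a caller is only walked once.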
```python
    def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
        if not required_tag.strip():
            msg = "required_tag must be a non-empty string"
            raise ValueError(msg)
        self._required_tag = required_tag
        logger.debug(
            MEMORY_FILTER_INIT,
            strategy=self.strategy_name,
            required_tag=required_tag,
        )
```
Normalize and validate required_tag before storing it.
required_tag.strip() will throw AttributeError for non-string config values, and " non-inferable " currently passes validation but never matches any stored tag. A trivial config typo can therefore silently filter out every memory.
🛠️ Suggested fix

```diff
     def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
-        if not required_tag.strip():
+        if not isinstance(required_tag, str):
+            msg = "required_tag must be a non-empty string"
+            raise ValueError(msg)
+        normalized_tag = required_tag.strip()
+        if not normalized_tag:
             msg = "required_tag must be a non-empty string"
             raise ValueError(msg)
-        self._required_tag = required_tag
+        self._required_tag = normalized_tag
         logger.debug(
             MEMORY_FILTER_INIT,
             strategy=self.strategy_name,
-            required_tag=required_tag,
+            required_tag=self._required_tag,
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/memory/filter.py` around lines 62 - 71, The constructor for
required_tag accepts non-string values and stores unnormalized input, causing
AttributeError on non-strings and mismatches for values with surrounding
whitespace; update the __init__ of the class that defines required_tag to first
enforce type (raise TypeError or ValueError if not instance of str), normalize
the value by calling stripped = required_tag.strip() and then validate stripped
is non-empty (raise ValueError if empty), assign self._required_tag = stripped,
and ensure the logger call (MEMORY_FILTER_INIT, strategy=self.strategy_name)
uses the normalized stripped value for required_tag so stored/logged tag matches
how tags are compared elsewhere.
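A standalone sketch of the normalization this prompt describes. The class and constant names mirror the PR, but the body is simplified: the real filter also emits a structlog debug event, which is omitted here.

```python
NON_INFERABLE_TAG = "non-inferable"  # default tag, as in the PR


class TagBasedMemoryFilter:
    """Simplified stand-in for the PR's filter; logging omitted."""

    def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
        # Reject non-strings first so .strip() cannot raise AttributeError.
        if not isinstance(required_tag, str):
            raise ValueError("required_tag must be a non-empty string")
        normalized = required_tag.strip()
        if not normalized:
            raise ValueError("required_tag must be a non-empty string")
        # Store the normalized tag so " non-inferable " still matches stored tags.
        self._required_tag = normalized
```

With this in place, a config value of `" non-inferable "` matches memories tagged `non-inferable` instead of silently filtering everything out.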
def validate_memory_tags(request: MemoryStoreRequest) -> None:
    """Log a warning when the non-inferable tag is missing.

    This is advisory only — the store operation is never blocked.
    Wire into ``MemoryBackend.store()`` callers to activate enforcement.

    Args:
        request: The memory store request to validate.
    """
    if NON_INFERABLE_TAG not in request.metadata.tags:
        logger.warning(
            MEMORY_FILTER_STORE_MISSING_TAG,
            category=request.category.value,
            content_length=len(request.content),
            tags=request.metadata.tags,
        )
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n -C2 '\bvalidate_memory_tags\s*\(' src/ai_company
printf '\n-- store implementations --\n'
rg -n -C3 'async def store\s*\(' src/ai_company

Repository: Aureliolo/ai-company
Length of output: 860
🏁 Script executed:
# Search for all references to validate_memory_tags (including imports)
rg -n 'validate_memory_tags' src/ai_company
# Find concrete implementations of store()
fd -e py -x grep -l 'class.*MemoryBackend\|def store' {} + | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -n "def store\|class.*Memory" {} | head -20'

Repository: Aureliolo/ai-company
Length of output: 28055
🏁 Script executed:
# Show the actual store() implementation in the abstract protocol
cat -n src/ai_company/memory/protocol.py | sed -n '77,90p'
# Check a concrete implementation
cat -n tests/unit/memory/test_protocol.py | sed -n '45,60p'

Repository: Aureliolo/ai-company
Length of output: 1177
This validation helper is not wired into any store() implementation.
The validate_memory_tags() function is defined but has no call sites in the codebase. Without concrete store implementations invoking it before persistence, the missing-tag warning event never fires and the validation is completely inert. Either add the call to each store() path (or a shared façade) or remove the unused helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/memory/store_guard.py` around lines 21 - 36, The helper
validate_memory_tags(request: MemoryStoreRequest) is never invoked so the
NON_INFERABLE_TAG warning (MEMORY_FILTER_STORE_MISSING_TAG) is inert; fix this
by calling validate_memory_tags(request) at the start of every memory
persistence path — either add the call into each concrete
MemoryBackend.store(...) implementation or place it in the shared façade/wrapper
that all stores use (ensure any class implementing MemoryBackend calls
validate_memory_tags before persisting), so the logger warning will fire when
tags are missing.
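One way to make the guard fire is a shared façade that every persistence path goes through, calling the validator before delegating. The sketch below uses simplified stand-ins (the request and backend types here are hypothetical, not the real MemoryStoreRequest/MemoryBackend, and the warning list stands in for logger.warning):

```python
from dataclasses import dataclass

NON_INFERABLE_TAG = "non-inferable"
warnings_emitted: list[tuple[str, ...]] = []  # stand-in for logger.warning


@dataclass(frozen=True)
class StoreRequest:
    """Simplified stand-in for MemoryStoreRequest."""
    content: str
    tags: tuple[str, ...] = ()


def validate_memory_tags(request: StoreRequest) -> None:
    """Advisory check: warn, never block."""
    if NON_INFERABLE_TAG not in request.tags:
        warnings_emitted.append(request.tags)


class GuardedBackend:
    """Facade wrapping a concrete backend so the guard always runs."""

    def __init__(self) -> None:
        self._stored: list[StoreRequest] = []

    def store(self, request: StoreRequest) -> None:
        validate_memory_tags(request)  # fires before every persist
        self._stored.append(request)   # advisory: persist regardless
```

Routing all callers through such a façade keeps the enforcement point in one place instead of repeating the call in every concrete `store()` implementation.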
@pytest.mark.unit
def test_empty_org_policy_raises(
    self,
    sample_agent_with_personality: AgentIdentity,
) -> None:
    """Empty string policy is rejected with PromptBuildError."""
    with pytest.raises(PromptBuildError, match="org_policies"):
        build_system_prompt(
            agent=sample_agent_with_personality,
            org_policies=("valid policy must exist", ""),
        )

@pytest.mark.unit
def test_whitespace_only_org_policy_raises(
    self,
    sample_agent_with_personality: AgentIdentity,
) -> None:
    """Whitespace-only policy is rejected with PromptBuildError."""
    with pytest.raises(PromptBuildError, match="org_policies"):
        build_system_prompt(
            agent=sample_agent_with_personality,
            org_policies=(" ",),
        )
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Parametrize the invalid org_policies cases.
These two tests hit the same branch and only vary by the bad value, so collapsing them into one parametrized case keeps the suite tighter.
As per coding guidelines "Use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/engine/test_prompt.py` around lines 509 - 531, Combine the two
tests into one parametrized test that iterates over the invalid org_policies
values; replace the separate test_empty_org_policy_raises and
test_whitespace_only_org_policy_raises with a single
`@pytest.mark.parametrize-based` test that calls
build_system_prompt(agent=sample_agent_with_personality, org_policies=<param>)
and asserts pytest.raises(PromptBuildError, match="org_policies"); keep the same
docstring and use the same sample_agent_with_personality fixture and
PromptBuildError type to ensure behavior is unchanged.
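The collapsed form could look like the sketch below. `build_system_prompt` here is a minimal stub (the real function lives in src/ai_company/engine/prompt.py and takes an agent argument), so only the parametrization pattern is the point:

```python
import pytest


class PromptBuildError(ValueError):
    """Stub for the engine's real exception type."""


def build_system_prompt(*, org_policies: tuple[str, ...] = ()) -> str:
    # Stub: reject blank policy entries like the real builder does.
    if any(not policy.strip() for policy in org_policies):
        raise PromptBuildError("org_policies entries must be non-blank")
    return "rendered prompt"


@pytest.mark.unit
@pytest.mark.parametrize(
    "bad_policies",
    [("valid policy must exist", ""), (" ",)],
    ids=["empty-string", "whitespace-only"],
)
def test_blank_org_policy_raises(bad_policies: tuple[str, ...]) -> None:
    """Empty or whitespace-only policy is rejected with PromptBuildError."""
    with pytest.raises(PromptBuildError, match="org_policies"):
        build_system_prompt(org_policies=bad_policies)
```

The `ids` argument keeps the two original test names visible in the pytest report even though the functions are merged.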
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.
def _compute_sections(
    *,
    task: Task | None,
    available_tools: tuple[ToolDefinition, ...],
    company: Company | None,
    org_policies: tuple[str, ...] = (),
) -> tuple[str, ...]:
    """Determine which sections are present in the rendered prompt.

    The default template omits the tools section per D22 (non-inferable
    principle), so ``available_tools`` is not considered here.

    Args:
        task: Optional task context.
        available_tools: Tool definitions.
        company: Optional company context.
        org_policies: Company-wide policy texts.

    Returns:
        Tuple of section names that are included.
    """
    sections: list[str] = [
        _SECTION_IDENTITY,
        _SECTION_PERSONALITY,
        _SECTION_SKILLS,
        _SECTION_AUTHORITY,
    ]
    if org_policies:
        sections.append(_SECTION_ORG_POLICIES)
    # Autonomy follows org_policies in the template.
    sections.append(_SECTION_AUTONOMY)
    if task is not None:
        sections.append(_SECTION_TASK)
    if available_tools:
        sections.append(_SECTION_TOOLS)
    if company is not None:
        sections.append(_SECTION_COMPANY)
    return tuple(sections)
There was a problem hiding this comment.
_compute_sections() no longer considers tools at all, so SystemPrompt.sections won’t reflect tool inclusion even when a custom template renders tools (since available_tools is still provided to the template context). If sections is used for downstream diagnostics/trimming analytics, this becomes misleading. Consider restoring tools section tracking when available_tools is non-empty (or making section computation aware of whether a custom template actually renders tools).
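A sketch of the suggested direction — track the tools section only when a custom template is in play. The function shape is simplified (booleans instead of the real model types, literal section names instead of the module constants):

```python
def compute_sections(
    *,
    has_task: bool,
    has_tools: bool,
    has_company: bool,
    custom_template: bool,
) -> tuple[str, ...]:
    """Simplified stand-in for _compute_sections with tools tracking restored."""
    sections = ["identity", "personality", "skills", "authority", "autonomy"]
    if has_task:
        sections.append("task")
    # Default template omits tools (D22); only custom templates can render them.
    if has_tools and custom_template:
        sections.append("tools")
    if has_company:
        sections.append("company")
    return tuple(sections)
```

This keeps `sections` honest for diagnostics without reintroducing tools into the default prompt.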
if memory_filter is None and config.non_inferable_only:
    memory_filter = TagBasedMemoryFilter()
elif memory_filter is not None and config.non_inferable_only:
    logger.debug(
        MEMORY_RETRIEVAL_START,
        note="explicit memory_filter overrides non_inferable_only config",
        filter_strategy=getattr(memory_filter, "strategy_name", "unknown"),
    )
self._memory_filter = memory_filter
There was a problem hiding this comment.
This debug log uses the MEMORY_RETRIEVAL_START event even though it’s reporting an initialization/config override (“explicit memory_filter overrides…”). Reusing the start event here can make retrieval telemetry noisy or ambiguous. Consider logging this with a filter-specific event (e.g., MEMORY_FILTER_INIT) or introducing a dedicated override event constant.
_ACTION_VERB_RE: re.Pattern[str] = re.compile(
    r"\b(?:" + "|".join(_ACTION_VERBS) + r")\b",
)
There was a problem hiding this comment.
_ACTION_VERB_RE is built from a frozenset, so the alternation order in the generated regex depends on hash iteration order and can vary across processes. Behavior is equivalent, but it makes the compiled pattern non-deterministic for debugging and can cause avoidable diffs if the pattern string is ever surfaced. Consider building it from sorted(_ACTION_VERBS) for deterministic output.
@model_validator(mode="after")
def _validate_prompt_tokens(self) -> TaskCompletionMetrics:
    """Ensure prompt_tokens does not exceed tokens_per_task.

    Skipped when ``tokens_per_task`` is 0 (zero-turn runs where the
    system prompt was built but no provider calls were made).
    """
    if self.tokens_per_task > 0 and self.prompt_tokens > self.tokens_per_task:
        msg = (
            f"prompt_tokens ({self.prompt_tokens}) cannot exceed "
            f"tokens_per_task ({self.tokens_per_task})"
        )
        raise ValueError(msg)
    return self
There was a problem hiding this comment.
prompt_tokens is an estimate (char/4) while tokens_per_task is provider-reported actual usage; the estimate can legitimately exceed the actual total (e.g., short responses, missing/underreported token usage, or estimator overestimation). Raising here can crash AgentEngine._log_completion() because TaskCompletionMetrics.from_run_result() isn’t guarded. Consider removing this validator, downgrading it to a warning log, or clamping the ratio calculation to avoid throwing during metrics logging.
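One softening that avoids crashing `_log_completion` is to clamp in the computed ratio instead of validating. A plain-Python sketch (the real model exposes this as a Pydantic `@computed_field`; names mirror TaskCompletionMetrics but the function here is a stand-in):

```python
def prompt_token_ratio(prompt_tokens: int, tokens_per_task: int) -> float:
    """Share of the task budget consumed by the prompt, clamped to [0, 1]."""
    if tokens_per_task <= 0:
        return 0.0  # zero-turn run: no provider calls were made
    # The char/4 prompt estimate can overshoot provider-reported totals,
    # so cap it rather than raise during metrics logging.
    return min(prompt_tokens, tokens_per_task) / tokens_per_task
```

Clamping preserves the 30% warning behavior for normal runs while making an overestimate report 1.0 instead of raising.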
# Sections trimmed when over token budget, least critical first.
# Tools section was removed from the default template per D22
# (non-inferable principle).
_TRIMMABLE_SECTIONS = (
    _SECTION_COMPANY,
    _SECTION_TOOLS,
    _SECTION_TASK,
    _SECTION_ORG_POLICIES,
)
There was a problem hiding this comment.
Tools are still passed in the template context for custom templates (context["tools"]), but tools were removed from _TRIMMABLE_SECTIONS. This means max_tokens trimming can no longer drop tools even if a custom template renders them, and the prompt may remain over budget when tools dominate. Consider keeping a tools-trimming step (e.g., clear available_tools when over budget) even if the default template doesn’t render tools, or explicitly document/guard that max_tokens is best-effort and won’t trim tools for custom templates.
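If custom-template tools should stay trimmable, the loop can simply treat them as another droppable section in priority order. A stripped-down sketch (the real `_trim_sections` re-renders the template after each drop, which is omitted here, and the token counts per section are hypothetical inputs):

```python
_TRIMMABLE_SECTIONS = ("company", "tools", "task", "org_policies")


def trim_sections(section_tokens: dict[str, int], max_tokens: int) -> dict[str, int]:
    """Drop trimmable sections, least critical first, until under budget."""
    trimmed = dict(section_tokens)
    for name in _TRIMMABLE_SECTIONS:
        if sum(trimmed.values()) <= max_tokens:
            break
        trimmed.pop(name, None)  # tools droppable again, restoring pre-PR order
    return trimmed
```

The alternative — documenting that `max_tokens` is best-effort for custom templates — avoids this code entirely but leaves the over-budget case unhandled.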
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/unit/engine/test_agent_engine_lifecycle.py (2)
350-380: 🧹 Nitpick | 🔵 Trivial

Use `@pytest.mark.parametrize` to combine similar test cases.

These two tests are nearly identical, differing only in the `timeout_seconds` value. Per coding guidelines, use `@pytest.mark.parametrize` for testing similar cases instead of duplicating test functions.

♻️ Proposed refactor

-    async def test_zero_timeout_raises(
-        self,
-        sample_agent_with_personality: AgentIdentity,
-        sample_task_with_criteria: Task,
-        mock_provider_factory: type[MockCompletionProvider],
-    ) -> None:
-        provider = mock_provider_factory([])
-        engine = AgentEngine(provider=provider)
-
-        with pytest.raises(ValueError, match="timeout_seconds must be > 0"):
-            await engine.run(
-                identity=sample_agent_with_personality,
-                task=sample_task_with_criteria,
-                timeout_seconds=0,
-            )
-
-    async def test_negative_timeout_raises(
+    @pytest.mark.parametrize("timeout_seconds", [0, -1.0])
+    async def test_invalid_timeout_raises(
         self,
         sample_agent_with_personality: AgentIdentity,
         sample_task_with_criteria: Task,
         mock_provider_factory: type[MockCompletionProvider],
+        timeout_seconds: float,
     ) -> None:
+        """Zero or negative timeout_seconds raises ValueError."""
         provider = mock_provider_factory([])
         engine = AgentEngine(provider=provider)

         with pytest.raises(ValueError, match="timeout_seconds must be > 0"):
             await engine.run(
                 identity=sample_agent_with_personality,
                 task=sample_task_with_criteria,
-                timeout_seconds=-1.0,
+                timeout_seconds=timeout_seconds,
             )

As per coding guidelines: "Use `@pytest.mark.parametrize` for testing similar cases instead of duplicating test functions".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/engine/test_agent_engine_lifecycle.py` around lines 350 - 380, Combine the two duplicate tests test_zero_timeout_raises and test_negative_timeout_raises into a single parametrized test using pytest.mark.parametrize that iterates over the invalid timeout_seconds values (0 and -1.0); keep the same setup (create provider via mock_provider_factory and instantiate AgentEngine) and the same assertion with pytest.raises(ValueError, match="timeout_seconds must be > 0") around await engine.run(identity=sample_agent_with_personality, task=sample_task_with_criteria, timeout_seconds=timeout), replacing the two separate functions with one function (e.g., test_invalid_timeouts_raise) that accepts the parametrized timeout parameter.
396-416: 🧹 Nitpick | 🔵 Trivial

Consider adding assertion for `prompt_token_ratio`.

The test now provides explicit token counts (input_tokens=400, output_tokens=200), which is good for validating the new metrics. However, since this PR introduces `prompt_token_ratio` as a `@computed_field`, consider adding an assertion to verify it's computed correctly.

💡 Suggested assertion

     assert metrics.duration_seconds > 0
     assert metrics.agent_id == str(sample_agent_with_personality.id)
     assert metrics.task_id == sample_task_with_criteria.id
+    # Validate prompt_token_ratio: input_tokens / total_tokens = 400 / 600
+    assert metrics.prompt_tokens == 400
+    assert 0.66 < metrics.prompt_token_ratio < 0.67

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/engine/test_agent_engine_lifecycle.py` around lines 396 - 416, The test creates a completion with input_tokens=400 and output_tokens=200 but doesn't assert the new computed field prompt_token_ratio; update the test after computing metrics (where TaskCompletionMetrics.from_run_result(result) is called) to assert that metrics.prompt_token_ratio equals 400 / (400 + 200) (i.e., ~0.6667) using a tolerant comparison (pytest.approx or similar). Locate the block that builds response, provider, engine, calls engine.run and computes metrics, and add the assertion referencing metrics.prompt_token_ratio to validate the new `@computed_field`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@tests/unit/engine/test_agent_engine_lifecycle.py`:
- Around line 350-380: Combine the two duplicate tests test_zero_timeout_raises
and test_negative_timeout_raises into a single parametrized test using
pytest.mark.parametrize that iterates over the invalid timeout_seconds values (0
and -1.0); keep the same setup (create provider via mock_provider_factory and
instantiate AgentEngine) and the same assertion with pytest.raises(ValueError,
match="timeout_seconds must be > 0") around await
engine.run(identity=sample_agent_with_personality,
task=sample_task_with_criteria, timeout_seconds=timeout), replacing the two
separate functions with one function (e.g., test_invalid_timeouts_raise) that
accepts the parametrized timeout parameter.
- Around line 396-416: The test creates a completion with input_tokens=400 and
output_tokens=200 but doesn't assert the new computed field prompt_token_ratio;
update the test after computing metrics (where
TaskCompletionMetrics.from_run_result(result) is called) to assert that
metrics.prompt_token_ratio equals 400 / (400 + 200) (i.e., ~0.6667) using a
tolerant comparison (pytest.approx or similar). Locate the block that builds
response, provider, engine, calls engine.run and computes metrics, and add the
assertion referencing metrics.prompt_token_ratio to validate the new
`@computed_field`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 37f4f8bc-2043-418a-8b8c-c7c2b7cd625f
📒 Files selected for processing (1)
tests/unit/engine/test_agent_engine_lifecycle.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use `from ai_company.observability import get_logger` and instantiate logger as `logger = get_logger(__name__)` in every module with business logic
Files:
tests/unit/engine/test_agent_engine_lifecycle.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Add test markers `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, or `@pytest.mark.slow` to all test files
Maintain 80% minimum code coverage — enforced in CI with `pytest --cov=ai_company --cov-fail-under=80`
Use `asyncio_mode = 'auto'` in pytest configuration — no manual `@pytest.mark.asyncio` needed on async tests
Set test timeout to 30 seconds per test — use `@pytest.mark.timeout(30)` or configure in pytest.ini
Use `@pytest.mark.parametrize` for testing similar cases instead of duplicating test functions
Files:
tests/unit/engine/test_agent_engine_lifecycle.py
🔇 Additional comments (1)
tests/unit/engine/test_agent_engine_lifecycle.py (1)
1-24: LGTM! The test file is well-structured with comprehensive coverage of AgentEngine lifecycle scenarios. All test classes have the required `@pytest.mark.unit` marker, and the global timeout is properly configured via `pytestmark`.
There was a problem hiding this comment.
Code Review
This pull request aims to harden the 'non-inferable principle' implementation across performance metrics, memory filtering, and prompt generation, with well-structured changes and comprehensive tests. However, it introduces several critical security and stability issues. The memory retrieval pipeline's fail-open mechanism on filter errors could lead to sensitive data exposure in agent prompts. Furthermore, the use of invalid Python 3 exception handling syntax will cause runtime SyntaxErrors and potential Denial of Service. Additionally, sensitive memory content is being logged in warning messages, violating secure logging practices. These issues stem partly from adherence to an incorrect project style guide (CLAUDE.md) that mandates Python 2 except syntax, which needs to be updated.
I am having trouble creating individual review comments. Click here to see my feedback.
src/ai_company/memory/retriever.py (273-288)
This section of the memory retrieval pipeline implements a fail-open mechanism. If the memory_filter raises an exception, the system logs a warning and proceeds with unfiltered memories, which could lead to sensitive data injection into agent prompts, bypassing the 'non-inferable' principle. Furthermore, the except builtins_MemoryError, RecursionError: syntax is invalid in Python 3, causing a SyntaxError and potential Denial of Service. This incorrect syntax adheres to an outdated rule in the project's style guide (CLAUDE.md, line 70), which should be updated to recommend the correct Python 3 syntax except (A, B):.
if self._memory_filter is not None:
try:
ranked = self._memory_filter.filter_for_injection(ranked)
except (builtins_MemoryError, RecursionError):
raise
except Exception:
logger.warning(
MEMORY_RETRIEVAL_DEGRADED,
source="memory_filter",
agent_id=agent_id,
filter_strategy=getattr(
self._memory_filter, "strategy_name", "unknown"
),
exc_info=True,
)
# Fail securely: return empty if filter fails
            return ()

src/ai_company/engine/prompt.py (202)

The except MemoryError, RecursionError: syntax is invalid in Python 3 for catching multiple exceptions. This will cause a SyntaxError at runtime, leading to a denial of service. This issue stems from adherence to an incorrect rule in the project's style guide (CLAUDE.md, line 70), which mandates Python 2 except syntax. The style guide should be updated to recommend the correct Python 3 syntax, which is to group exceptions in a tuple, e.g., except (MemoryError, RecursionError):.

except (MemoryError, RecursionError):

src/ai_company/memory/store_guard.py (30-36)
Logging a preview of memory content (content[:80]) when a tag is missing poses a risk of sensitive data exposure. Memories may contain PII, secrets, or internal data that should not be written to application logs.
if NON_INFERABLE_TAG not in request.metadata.tags:
logger.warning(
MEMORY_FILTER_STORE_MISSING_TAG,
category=request.category.value,
tags=request.metadata.tags,
        )

…reptile

- Cap prompt_tokens instead of rejecting when heuristic exceeds actual (#2)
- Log policy_length instead of policy content to avoid leaks (#6)
- Sort _ACTION_VERBS for deterministic regex alternation (#8)
- Use PROMPT_POLICY_VALIDATION_FAILED event for advisory failures (#12)
- Add isinstance check and strip whitespace in TagBasedMemoryFilter (#13)
- Use MEMORY_FILTER_INIT event for filter-init log paths (#14, #16)
- Remove content_preview from store_guard warning log (#18)
- Track tools section conditionally for custom templates (#10)
- Reconcile enforced vs advisory wording in DESIGN_SPEC (#1)
- Inject fixed estimated_tokens in prompt-ratio test (#19)
- Parametrize timeout and org_policies tests (#20, #22)
- Add prompt_token_ratio assertion in lifecycle test (#21)
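For reference on the except-syntax finding: the tuple form `except (A, B):` is the valid Python 3 syntax, and only the tuple-less `except A, B:` form is Python 2. The re-raise-then-fail-closed shape the retriever uses can be checked in isolation; the helper below is a stand-in, not the real retriever code:

```python
def filter_guarded(apply_filter):
    """Run a memory filter, re-raising interpreter failures, failing closed otherwise."""
    try:
        return apply_filter()
    except (MemoryError, RecursionError):  # valid Python 3 multi-exception syntax
        raise
    except Exception:
        return ()  # fail closed: inject no memories when the filter breaks
```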
# Sections trimmed when over token budget, least critical first.
# Tools section was removed from the default template per D22
# (non-inferable principle), but custom templates may still render tools.
_TRIMMABLE_SECTIONS = (
    _SECTION_COMPANY,
    _SECTION_TOOLS,
    _SECTION_TASK,
    _SECTION_ORG_POLICIES,
)
There was a problem hiding this comment.
Custom-template tools section not trimmable under token budget
_SECTION_TOOLS was intentionally removed from _TRIMMABLE_SECTIONS because the default template no longer renders tools (D22). However, _SECTION_TOOLS is still tracked in sections when available_tools is non-empty and a custom template is active (see _compute_sections), meaning a custom template that references {{ tools }} can land in sections.
The consequence is that if a custom template renders the tools section and the rendered prompt exceeds max_tokens, the trimming loop will cycle through _SECTION_COMPANY → _SECTION_TASK → _SECTION_ORG_POLICIES and then fall through without ever nullifying available_tools. The prompt can end up permanently over budget for custom-template callers who pass a large tool list.
Before this PR the order was company → tools → task → org_policies. If the decision is that custom-template tools should also be trimmable, _SECTION_TOOLS needs to be re-added to _TRIMMABLE_SECTIONS (last or early in priority) along with a corresponding elif section == _SECTION_TOOLS and available_tools: branch in _trim_sections.
If the intended policy is that custom templates own their own trimming responsibility, a doc comment to that effect in build_system_prompt's docstring would prevent future confusion.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ai_company/engine/prompt.py
Line: 148-155
Comment:
**Custom-template tools section not trimmable under token budget**
`_SECTION_TOOLS` was intentionally removed from `_TRIMMABLE_SECTIONS` because the default template no longer renders tools (D22). However, `_SECTION_TOOLS` is still tracked in `sections` when `available_tools` is non-empty **and** a custom template is active (see `_compute_sections`), meaning a custom template that references `{{ tools }}` can land in `sections`.
The consequence is that if a custom template renders the tools section and the rendered prompt exceeds `max_tokens`, the trimming loop will cycle through `_SECTION_COMPANY → _SECTION_TASK → _SECTION_ORG_POLICIES` and then fall through without ever nullifying `available_tools`. The prompt can end up permanently over budget for custom-template callers who pass a large tool list.
Before this PR the order was `company → tools → task → org_policies`. If the decision is that custom-template tools should also be trimmable, `_SECTION_TOOLS` needs to be re-added to `_TRIMMABLE_SECTIONS` (last or early in priority) along with a corresponding `elif section == _SECTION_TOOLS and available_tools:` branch in `_trim_sections`.
If the intended policy is that custom templates own their own trimming responsibility, a doc comment to that effect in `build_system_prompt`'s docstring would prevent future confusion.
How can I resolve this? If you propose a fix, please make it concise.

🤖 I have created a release *beep* *boop*

---

## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10)

### Features

* add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267)
* add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173)
* add design specification, license, and project setup ([8669a09](8669a09))
* add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139)
* add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779))
* add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b))
* design unified provider interface ([#86](#86)) ([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134))
* implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b))
* implement built-in file system tools ([#18](#18)) ([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496))
* implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2))
* implement core entity and role system models ([#69](#69)) ([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f))
* implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515))
* implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe))
* implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca))
* implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2))
* implement model routing engine ([#99](#99)) ([d3c250b](d3c250b))
* implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3))
* implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85))
* implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30)
* implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1))
* implement tool permission checking ([#16](#16)) ([833c190](833c190))
* implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout ([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749))
* wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617))

### Bug Fixes

* address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929))
* address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169)
* enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a))
* pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861))

### Performance

* harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188)

### Refactoring

* adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9))
* harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89))

### Documentation

* add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6))
* finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee))

### Tests

* add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4))
* add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4))

### CI/CD

* add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758))
* bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247))
* bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517))
* harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af))

### Maintenance

* add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337))
* add design spec context loading to research-link skill ([8ef9685](8ef9685))
* add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46))
* fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5))
* pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
Summary

- Rename `prompt_cost_ratio` → `prompt_token_ratio` (it measures token ratio, not cost) and convert to `@computed_field` (project convention for derived values)
- Wire the `non_inferable_only` config flag to auto-create `TagBasedMemoryFilter` in `ContextInjectionStrategy`, with graceful degradation on filter errors
- Wrap `validate_policy_quality` in try/except so it's truly advisory-only
- Add `MEMORY_FILTER_APPLIED`/`MEMORY_FILTER_INIT` event constants
- Fix import ordering (runtime before `TYPE_CHECKING` blocks) in `filter.py`, `store_guard.py`
- Update `DESIGN_SPEC.md` §15.3 project structure (new files: `policy_validation.py`, `filter.py`, `store_guard.py`, `events/security.py`) and memory pipeline description; update `CLAUDE.md` engine/memory descriptions and logging examples

Test plan

- `test_metrics.py` — `prompt_token_ratio` as `@computed_field`, boundary cases (zero tokens)
- `test_policy_validation.py` — word-boundary action verb matching, multiple code patterns produce single issue
- `test_agent_engine.py` — high/low prompt token ratio warning emission
- `test_filter.py` — `TagBasedMemoryFilter` and `PassthroughMemoryFilter` behavior
- `test_retriever.py` — config-driven filter wiring, graceful degradation on filter errors
- `test_store_guard.py` — advisory tag guard behavior
- `test_prompt.py` — policy validation error isolation
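The `prompt_token_ratio` rename can be sketched as a Pydantic v2 `@computed_field`, including the zero-token boundary case the test plan calls out. The two token fields are an assumed subset of the real `TaskCompletionMetrics` model, not its full definition:

```python
from pydantic import BaseModel, computed_field


class TaskCompletionMetrics(BaseModel):
    """Illustrative subset of the metrics model; only these two fields are assumed."""

    prompt_tokens: int = 0
    total_tokens: int = 0

    @computed_field  # derived value, included in model_dump() like a stored field
    @property
    def prompt_token_ratio(self) -> float:
        # Boundary case from the test plan: zero total tokens must not divide by zero.
        if self.total_tokens == 0:
            return 0.0
        return self.prompt_tokens / self.total_tokens


metrics = TaskCompletionMetrics(prompt_tokens=450, total_tokens=1000)
print(metrics.prompt_token_ratio)  # 0.45
print(TaskCompletionMetrics().prompt_token_ratio)  # 0.0
```

Because it is a computed field rather than a plain property, the ratio also appears in serialized output, which is what lets downstream observability consume it without recomputing.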
Pre-reviewed by 10 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, security-reviewer, docs-consistency). 24 findings addressed across 14 files.
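The pluggable filter stage with graceful degradation can be sketched as a small strategy protocol. The `Memory` record, method names, and `apply_filter` helper below are hypothetical stand-ins for the real memory-module types, chosen only to show the shape of the design:

```python
import logging
from dataclasses import dataclass, field
from typing import Protocol

logger = logging.getLogger(__name__)


@dataclass
class Memory:
    """Hypothetical minimal memory record for illustration."""

    text: str
    tags: frozenset[str] = field(default_factory=frozenset)


class MemoryFilterStrategy(Protocol):
    def filter(self, memories: list[Memory]) -> list[Memory]: ...


class PassthroughMemoryFilter:
    """Default strategy: inject everything unchanged."""

    def filter(self, memories: list[Memory]) -> list[Memory]:
        return memories


class TagBasedMemoryFilter:
    """Retain only memories carrying the configured tag before injection."""

    def __init__(self, required_tag: str = "non-inferable") -> None:
        self.required_tag = required_tag

    def filter(self, memories: list[Memory]) -> list[Memory]:
        return [m for m in memories if self.required_tag in m.tags]


def apply_filter(strategy: MemoryFilterStrategy, memories: list[Memory]) -> list[Memory]:
    # Graceful degradation: a failing filter logs and falls back to the
    # unfiltered list rather than blocking context injection.
    try:
        return strategy.filter(memories)
    except Exception:
        logger.exception("memory filter failed; injecting unfiltered memories")
        return memories
```

The protocol keeps the injection pipeline agnostic to the concrete strategy, so a config flag like `non_inferable_only` only has to decide which implementation to construct.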
Closes #188
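The word-boundary action verb matching from the review fixes can be sketched as follows; the verb list is a made-up placeholder, not the project's actual set from `policy_validation.py`:

```python
import re

# Hypothetical verb list; the real set lives in policy_validation.py.
ACTION_VERBS = ("use", "avoid", "prefer", "always", "never")

_ACTION_VERB_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ACTION_VERBS)) + r")\b",
    re.IGNORECASE,
)


def contains_action_verb(policy_text: str) -> bool:
    # The \b anchors keep substrings from matching: "reuse" and
    # "refused" do not count as "use".
    return _ACTION_VERB_RE.search(policy_text) is not None


print(contains_action_verb("Always use uv run for scripts"))  # True
print(contains_action_verb("The request was refused"))        # False
```

Compared with a plain substring check, the `\b` boundaries are what eliminate the false positives the reviewers flagged.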