
perf: harden non-inferable principle implementation#195

Merged
Aureliolo merged 5 commits into main from
perf/non-inferable-principle
Mar 10, 2026

Conversation

@Aureliolo
Owner

Summary

  • Prompt token ratio: Rename prompt_cost_ratio → prompt_token_ratio (it measures token ratio, not cost) and convert to @computed_field (project convention for derived values)
  • Memory filtering: Wire non_inferable_only config flag to auto-create TagBasedMemoryFilter in ContextInjectionStrategy, with graceful degradation on filter errors
  • Policy validation: Add word-boundary regex for action verb detection (prevents false negatives from substring matching), wrap validate_policy_quality in try/except so it's truly advisory-only
  • Prompt templates: Remove tool definitions from default system prompt template (non-inferable — agents discover tools at runtime)
  • Observability: Add DEBUG logging to filter/guard/validation entry points, add MEMORY_FILTER_APPLIED/MEMORY_FILTER_INIT event constants
  • Import hygiene: Fix import ordering (runtime imports before TYPE_CHECKING blocks) in filter.py, store_guard.py
  • Docs: Update DESIGN_SPEC.md §15.3 project structure (new files: policy_validation.py, filter.py, store_guard.py, events/security.py) and memory pipeline description; update CLAUDE.md engine/memory descriptions and logging examples
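The rename and the @computed_field convention can be sketched roughly as follows (a minimal illustration only; field names beyond those in the summary are assumptions, not the project's actual model):

```python
from pydantic import BaseModel, computed_field


class TaskCompletionMetrics(BaseModel):
    """Illustrative subset of the metrics model."""

    prompt_tokens: int = 0
    tokens_per_task: int = 0

    @computed_field  # derived value, serialized alongside regular fields
    @property
    def prompt_token_ratio(self) -> float:
        """Ratio of prompt tokens to task tokens (tokens, not cost)."""
        if self.tokens_per_task > 0:
            return self.prompt_tokens / self.tokens_per_task
        return 0.0
```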

Test plan

  • test_metrics.py — prompt_token_ratio as @computed_field, boundary cases (zero tokens)
  • test_policy_validation.py — word-boundary action verb matching, multiple code patterns produce single issue
  • test_agent_engine.py — high/low prompt token ratio warning emission
  • test_filter.py — TagBasedMemoryFilter and PassthroughMemoryFilter behavior
  • test_retriever.py — config-driven filter wiring, graceful degradation on filter errors
  • test_store_guard.py — advisory tag guard behavior
  • test_prompt.py — policy validation error isolation
  • Full suite: 6245 passed, 9 skipped, 95.07% coverage
  • mypy: 771 files, no issues
  • ruff: all checks passed

Review coverage

Pre-reviewed by 10 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, security-reviewer, docs-consistency). 24 findings addressed across 14 files.

Closes #188

…trics (#188)

Remove tools section from default system prompt template per D22 — tool
definitions are already passed via the LLM provider API, so duplicating
them in the prompt doubles cost with no benefit.

Add pluggable MemoryFilterStrategy (D23) with tag-based initial impl
that retains only memories tagged "non-inferable" before injection.
Add advisory store guard and policy quality validation heuristics.
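The tag-based filter could look roughly like this (a sketch under assumptions: the Memory shape and the method name are inferred from the description, not copied from the codebase):

```python
from dataclasses import dataclass, field

NON_INFERABLE_TAG = "non-inferable"


@dataclass(frozen=True)
class Memory:
    """Minimal stand-in for a stored memory record."""

    content: str
    tags: frozenset[str] = field(default_factory=frozenset)


class TagBasedMemoryFilter:
    """Retain only memories carrying the required tag before injection."""

    def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
        self._required_tag = required_tag

    def filter_for_injection(self, memories: list[Memory]) -> list[Memory]:
        return [m for m in memories if self._required_tag in m.tags]
```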

Add prompt_tokens and prompt_cost_ratio to TaskCompletionMetrics for
cost-aware context budgeting, with warnings when ratio exceeds 30%.

Closes #188
Pre-reviewed by 10 agents, 24 findings addressed:
- Rename prompt_cost_ratio → prompt_token_ratio (measures tokens, not cost)
- Convert prompt_token_ratio to @computed_field (project convention)
- Wire non_inferable_only config to auto-create TagBasedMemoryFilter
- Add graceful degradation for memory filter + policy validation
- Use word-boundary regex for action verb detection
- Add DEBUG logging to filter/guard/validation entry points
- Fix import ordering (runtime before TYPE_CHECKING)
- Update DESIGN_SPEC.md and CLAUDE.md for new modules
- Add comprehensive test coverage for all new behavior
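The word-boundary fix in the list above can be illustrated like this (the verb list is illustrative, not the one used in the PR; the point is \b anchoring versus plain substring checks):

```python
import re

# Word-boundary matching: "view" only counts as a whole word,
# not as a fragment of "reviews" or "previewed".
ACTION_VERB_RE = re.compile(r"\b(?:use|avoid|prefer|view)\b", re.IGNORECASE)


def has_action_verb(policy: str) -> bool:
    return ACTION_VERB_RE.search(policy) is not None
```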
Copilot AI review requested due to automatic review settings March 10, 2026 10:35
@github-actions
Contributor

github-actions bot commented Mar 10, 2026

Dependency Review

✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.

Scanned Files

None

@coderabbitai

coderabbitai bot commented Mar 10, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ee4f774-ce63-490d-bc75-38c71c4cc000

📥 Commits

Reviewing files that changed from the base of the PR and between c6801d0 and fd6db3d.

📒 Files selected for processing (10)
  • DESIGN_SPEC.md
  • src/ai_company/engine/metrics.py
  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/store_guard.py
  • tests/unit/engine/test_agent_engine.py
  • tests/unit/engine/test_agent_engine_lifecycle.py
  • tests/unit/engine/test_prompt.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Configurable memory filtering (tag-based and passthrough), advisory store guard, and a config flag to restrict injected memories
  • Updates

    • Advisory org-policy quality validation with non-blocking reporting
    • Completion metrics now include prompt token count and prompt-token ratio (with high-ratio warnings)
    • System prompt template bumped to v1.3.0 — tools removed from default template
    • New observability events for memory filtering, policy validation, and high prompt-token ratio
  • Tests

    • Expanded unit tests for filtering, validation, metrics, prompt behavior, and warnings

Walkthrough

Adds non-inferable policy validation, removes Tools from the default system prompt, introduces a pluggable memory-filter stage (tag-based / passthrough) with a store guard, and records prompt_tokens and prompt_token_ratio with new observability events and warnings across prompt, memory, metrics, and engine surfaces.

Changes

  • Design & Docs (CLAUDE.md, DESIGN_SPEC.md, README.md) — Document non-inferable principle, memory-filter design, and prompt-policy guidance; reflect retrieval → rank → filter → format pipeline.
  • Policy Validation (src/ai_company/engine/policy_validation.py) — New module with PolicyQualityIssue and validate_policy_quality() heuristics (length, code-patterns, action-verb checks) and logging of issues.
  • Prompt Template & Builder (src/ai_company/engine/prompt_template.py, src/ai_company/engine/prompt.py) — Bumped template version 1.2.0→1.3.0; removed Tools section from default template and trimming flow; added _validate_org_policies integration and PROMPT_POLICY_* events; Tools still available for custom templates.
  • Prompt Token Metrics & Engine Logging (src/ai_company/engine/metrics.py, src/ai_company/engine/agent_engine.py) — Added prompt_tokens and computed prompt_token_ratio to TaskCompletionMetrics; populate from SystemPrompt.estimated_tokens; engine logs metrics and emits PROMPT_TOKEN_RATIO_HIGH when threshold exceeded.
  • Memory Filter API & Implementations (src/ai_company/memory/filter.py, src/ai_company/memory/store_guard.py) — New MemoryFilterStrategy protocol, TagBasedMemoryFilter and PassthroughMemoryFilter, NON_INFERABLE_TAG, and advisory validate_memory_tags() store guard that warns when tag missing.
  • Retriever Integration & Config (src/ai_company/memory/retriever.py, src/ai_company/memory/retrieval_config.py) — ContextInjectionStrategy gains optional memory_filter; auto-creates TagBasedMemoryFilter when non_inferable_only=True; applies filter after ranking with graceful degradation and logging; new non_inferable_only flag on config.
  • Observability Events (src/ai_company/observability/events/prompt.py, src/ai_company/observability/events/memory.py, src/ai_company/observability/events/security.py) — Added PROMPT_POLICY_* and PROMPT_TOKEN_RATIO_HIGH; added MEMORY_FILTER_INIT / MEMORY_FILTER_APPLIED / MEMORY_FILTER_STORE_MISSING_TAG; updated events surface for policy and filtering hooks.
  • Tests (tests/unit/engine/*.py, tests/unit/memory/*.py) — Extensive test additions/updates for policy validation, prompt-template (v1.3.0 and tools removal), prompt-token metrics, memory filter behavior/retriever integration, and store-guard warnings.
  • Small surface updates (CLAUDE.md) — Minor description updates referencing prompt policy validation and non-inferable retrieval step.
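The retriever-integration behavior in the list above (auto-creating the filter from the config flag) can be sketched like this; class and attribute names follow the summary, but the details are assumptions:

```python
class RetrievalConfig:
    """Minimal stand-in for the retrieval config with the new flag."""

    def __init__(self, non_inferable_only: bool = False) -> None:
        self.non_inferable_only = non_inferable_only


class TagBasedMemoryFilter:
    """Placeholder filter; real filtering logic lives in filter.py."""

    def filter_for_injection(self, memories):
        return [m for m in memories if "non-inferable" in m.get("tags", ())]


class ContextInjectionStrategy:
    """Auto-creates the tag filter when non_inferable_only is set."""

    def __init__(self, config: RetrievalConfig, memory_filter=None) -> None:
        if memory_filter is None and config.non_inferable_only:
            memory_filter = TagBasedMemoryFilter()
        self._memory_filter = memory_filter  # None means passthrough
```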

Sequence Diagram(s)

sequenceDiagram
  participant AgentEngine as AgentEngine
  participant PolicyValidator as PolicyValidator
  participant PromptBuilder as PromptBuilder
  participant Retriever as Retriever
  participant MemoryBackend as MemoryBackend
  participant MemoryFilter as MemoryFilter
  participant LLM as LLM

  AgentEngine->>PolicyValidator: validate_policy_quality(org_policies)
  Note right of PolicyValidator: advisory issues emitted (PROMPT_POLICY_QUALITY_ISSUE)
  AgentEngine->>PromptBuilder: build_system_prompt(...)
  PromptBuilder->>Retriever: request_context(task, retrieval_config)
  Retriever->>MemoryBackend: fetch_ranked_memories(query)
  MemoryBackend-->>Retriever: ranked_memories
  Retriever->>MemoryFilter: filter_for_injection(ranked_memories)
  alt filter raises MemoryError/RecursionError
    MemoryFilter-->>Retriever: propagate error
  else filter fails (domain error)
    MemoryFilter-->>Retriever: log degraded, return unfiltered
  end
  Retriever-->>PromptBuilder: selected_memories
  PromptBuilder->>LLM: send_prompt(system_prompt + memories)
  LLM-->>AgentEngine: completion (tokens)
  AgentEngine->>AgentEngine: compute prompt_token_ratio & emit PROMPT_TOKEN_RATIO_HIGH if threshold exceeded
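The alt branch in the diagram corresponds to a pattern like the following (a sketch; the function and log names are assumptions): fatal interpreter-level errors propagate, while domain errors degrade gracefully to the unfiltered ranking.

```python
import logging

logger = logging.getLogger("memory.retriever")


def apply_memory_filter(filter_fn, ranked: list) -> list:
    """Apply a memory filter with graceful degradation on domain errors."""
    try:
        return filter_fn(ranked)
    except (MemoryError, RecursionError):
        raise  # fatal errors always propagate
    except Exception:
        # Domain errors degrade gracefully: log and return unfiltered input.
        logger.warning("memory.retrieval.degraded", exc_info=True)
        return ranked
```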

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 64.36%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check — ✅ Passed: The title directly refers to the main objective—hardening the non-inferable principle—which is the primary focus across prompt templates, memory filtering, policy validation, and observability improvements.
  • Description check — ✅ Passed: The description comprehensively covers the changeset, explaining metrics renaming, memory filtering wiring, policy validation improvements, template changes, observability additions, and documentation updates—all directly aligned with the code modifications.
  • Linked Issues check — ✅ Passed: The PR fully addresses issue #188's objectives: non-inferable principle documentation [prompt/policy_validation], Tools section removal [prompt_template], MemoryFilterStrategy implementation [filter/retriever], prompt_token_ratio metric [metrics], and policy validation helper [policy_validation].
  • Out of Scope Changes check — ✅ Passed: All code changes are in scope: prompt/memory/metrics modules address non-inferable principle; observability additions support filtering/validation; test expansions validate new features; documentation updates align with implementation—no extraneous modifications detected.


@greptile-apps

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR hardens the non-inferable principle (D22/D23) across the memory and engine layers: the default system prompt template no longer includes tool definitions, a TagBasedMemoryFilter is now wired into the memory retrieval pipeline, an advisory validate_policy_quality heuristic checks org policies at prompt-build time, and a new prompt_token_ratio @computed_field surfaces prompt-overhead metrics with a warning at >0.3. Supporting pieces include store_guard.py (advisory tag enforcement at the store boundary), new event constants in events/memory.py and events/prompt.py, and updated docs in CLAUDE.md and DESIGN_SPEC.md.

Key findings:

  • prompt_token_ratio is unreliable for multi-turn runs — prompt_tokens is a single-call estimate (SystemPrompt.estimated_tokens) but tokens_per_task accumulates tokens across all turns. For a 10-turn run the ratio is 10× too small, so the >0.3 warning in agent_engine.py will almost never fire for real multi-turn workloads even when per-turn prompt overhead is high. This affects the operational value of the new metric.
  • Custom-template tools section not trimmable — _SECTION_TOOLS was removed from _TRIMMABLE_SECTIONS. For callers using a custom Jinja2 template that renders {{ tools }}, the trimming loop can exhaust all trimmable sections while still leaving the tools section in place, silently keeping the prompt over budget.
  • The store_guard.validate_memory_tags utility is intentionally not yet wired into any MemoryBackend.store() implementation (per docstring), so the store-boundary enforcement described in the DESIGN_SPEC is not yet active.
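The first finding can be checked directly with the numbers the review itself uses:

```python
# Numbers from the finding: 10 turns, ~1_000 tokens per turn, 500-token prompt.
turns = 10
per_turn_tokens = 1_000
prompt_tokens = 500

tokens_per_task = turns * per_turn_tokens          # accumulated denominator
reported_ratio = prompt_tokens / tokens_per_task   # what the metric computes
actual_per_turn = prompt_tokens / per_turn_tokens  # real per-call overhead

# reported_ratio is 0.05 (below the 0.3 threshold),
# while actual_per_turn is 0.5 (well above it).
```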

Confidence Score: 2/5

  • Not safe to merge as-is — pre-existing Python 2 except syntax SyntaxErrors in the new exception guards would crash the affected modules at import time, and the prompt_token_ratio metric silently underestimates multi-turn overhead.
  • Two previous-thread issues (Python 2 except A, B: syntax in prompt.py and retriever.py) represent import-time SyntaxErrors that would break the modules entirely. Two newly identified issues — the misleading multi-turn prompt_token_ratio and the silent failure to trim the tools section in custom templates — reduce confidence further. The new filter.py, store_guard.py, policy_validation.py, and event constant files are clean and well-tested.
  • src/ai_company/engine/prompt.py (Python 2 except syntax + tools trimming gap), src/ai_company/memory/retriever.py (Python 2 except syntax), src/ai_company/engine/metrics.py (multi-turn ratio mismatch), src/ai_company/engine/agent_engine.py (warning threshold relies on underestimated ratio)

Important Files Changed

  • src/ai_company/engine/metrics.py — Adds prompt_tokens field and prompt_token_ratio @computed_field; _cap_prompt_tokens validator correctly clamps via object.__setattr__, but the ratio mixes a per-call numerator with a cumulative denominator, making it unreliable for multi-turn runs.
  • src/ai_company/engine/prompt.py — Removes tools from the default template (D22), adds _validate_org_policies and advisory validate_policy_quality wrapper; _SECTION_TOOLS removed from _TRIMMABLE_SECTIONS silently breaks token-budget trimming for custom templates that render tools.
  • src/ai_company/memory/retriever.py — Wires MemoryFilterStrategy into the ranking pipeline with correct graceful degradation; the elif memory_filter is not None and config.non_inferable_only branch now correctly logs MEMORY_FILTER_INIT (previous-thread issue resolved).
  • src/ai_company/memory/filter.py — New file: clean MemoryFilterStrategy protocol, TagBasedMemoryFilter, and PassthroughMemoryFilter with correct structured logging; import ordering (runtime before TYPE_CHECKING) follows project convention.
  • src/ai_company/memory/store_guard.py — New advisory-only utility that warns on missing non-inferable tag; correctly never blocks; not yet wired into any backend (explicitly deferred per docstring).
  • src/ai_company/engine/policy_validation.py — New file: heuristic policy quality checks with correct word-boundary regex for action verbs; _check_code_patterns stops at first match (single issue per policy); all advisory, never blocking.
  • src/ai_company/engine/agent_engine.py — Adds PROMPT_TOKEN_RATIO_HIGH warning and upgrades classification failure from debug to warning with exc_info; the warning threshold logic is correct for single-turn runs but silently understates multi-turn overhead.
  • src/ai_company/memory/retrieval_config.py — Adds non_inferable_only: bool = False field with correct frozen model config and clear docstring; backward-compatible default.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ContextInjectionStrategy.prepare_messages] --> B[backend.retrieve + shared_store.query]
    B --> C[rank_memories]
    C --> D{ranked empty?}
    D -- yes --> E[return empty — below min_relevance]
    D -- no --> F{_memory_filter set?}
    F -- no --> I[format_memory_context]
    F -- yes --> G[filter.filter_for_injection]
    G -- success --> H{filtered empty?}
    G -- MemoryError / RecursionError --> RE[re-raise]
    G -- Exception --> GD[⚠ MEMORY_RETRIEVAL_DEGRADED\nuse unfiltered ranked]
    GD --> I
    H -- yes --> E2[return empty — all filtered]
    H -- no --> I
    I --> J[ChatMessage tuple returned]

    subgraph ContextInjectionStrategy.__init__
        K{memory_filter param}
        K -- None + non_inferable_only=True --> L[auto-create TagBasedMemoryFilter]
        K -- provided + non_inferable_only=True --> M[log MEMORY_FILTER_INIT override]
        K -- None + non_inferable_only=False --> N[_memory_filter = None\npassthrough]
    end

    subgraph build_system_prompt
        P[_validate_org_policies\nraises PromptBuildError on blank]
        P --> Q[validate_policy_quality\nadvisory — never blocks]
        Q --> R[render template\ntools omitted by default D22]
        R --> S{over max_tokens?}
        S -- yes --> T[_trim_sections\ncompany → task → org_policies]
        S -- no --> U[_build_prompt_result\nSystemPrompt]
        T --> U
    end

Comments Outside Diff (1)

  1. src/ai_company/engine/metrics.py, line 233-241 (link)

    prompt_token_ratio understates overhead for multi-turn runs

    prompt_tokens is sampled from SystemPrompt.estimated_tokens — a per-call (per-turn) estimate — while tokens_per_task in the denominator is the accumulated total across all turns (from AgentContext.accumulated_cost.total_tokens).

    For an N-turn run, the real per-call prompt overhead is approximately:

    actual_per_call_ratio ≈ (turns_per_task × prompt_tokens) / tokens_per_task
    

    For a 10-turn agent run where each turn consumes ~1 000 tokens and the system prompt is 500 tokens:

    • Reported ratio: 500 / 10_000 = 0.05 (well below the 0.3 warning threshold)
    • Actual per-turn overhead: 500 / 1_000 = 0.50 (well above the threshold)

    The docstring acknowledges this ("actual overhead is higher for multi-turn runs"), but the warning in agent_engine.py still fires at the underestimated ratio, giving false assurance that prompt overhead is acceptable for any multi-turn run with moderate total token usage.

    A simple correction is to compare against the per-turn average instead:

    @computed_field  # type: ignore[prop-decorator]
    @property
    def prompt_token_ratio(self) -> float:
        """Per-call ratio of prompt tokens to per-turn average tokens."""
        if self.tokens_per_task > 0 and self.turns_per_task > 0:
            per_turn_tokens = self.tokens_per_task / self.turns_per_task
            return self.prompt_tokens / per_turn_tokens
        if self.tokens_per_task > 0:
            return self.prompt_tokens / self.tokens_per_task
        return 0.0

    Alternatively, rename the field to prompt_token_ratio_per_task to make the semantics explicit and update the threshold accordingly.


Last reviewed commit: fd6db3d


Copilot AI left a comment


Pull request overview

This PR hardens the “non-inferable principle” implementation across prompt construction and memory injection, adding advisory validation and observability to reduce prompt overhead and avoid injecting inferable context.

Changes:

  • Removes tool definitions from the default system prompt template (tools remain available via API/tooling and custom templates).
  • Introduces pluggable memory filtering (tag-based non-inferable) with config-driven wiring and graceful degradation on filter errors.
  • Adds org policy quality validation heuristics plus prompt token overhead metrics/events (including a high-ratio warning).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 7 comments.

Show a summary per file
  • src/ai_company/engine/prompt_template.py — Bumps template version and removes default “Available Tools” section.
  • src/ai_company/engine/prompt.py — Documents non-inferable principle; makes policy validation advisory-only.
  • src/ai_company/engine/policy_validation.py — Adds heuristic validator for org policy quality + logging.
  • src/ai_company/engine/metrics.py — Adds prompt_tokens and computed prompt_token_ratio to task metrics.
  • src/ai_company/engine/agent_engine.py — Logs prompt token metrics and emits warning when ratio is high.
  • src/ai_company/memory/filter.py — Adds MemoryFilterStrategy + tag-based and passthrough filters.
  • src/ai_company/memory/retriever.py — Wires optional post-ranking memory filter; config auto-enables tag filter.
  • src/ai_company/memory/retrieval_config.py — Adds non_inferable_only flag to drive filter behavior.
  • src/ai_company/memory/store_guard.py — Adds advisory guard for missing non-inferable tag on store requests.
  • src/ai_company/observability/events/prompt.py — Adds policy-quality and token-ratio event constants.
  • src/ai_company/observability/events/memory.py — Adds memory-filter related event constants.
  • tests/unit/** — Adds/updates unit tests for filter integration, store guard, prompt template changes, policy validation, and metrics.
  • DESIGN_SPEC.md / CLAUDE.md — Updates documentation to reflect new modules/pipeline stages and logging examples.


Comment on lines +30 to +36
if NON_INFERABLE_TAG not in request.metadata.tags:
    logger.warning(
        MEMORY_FILTER_STORE_MISSING_TAG,
        category=request.category.value,
        content_preview=request.content[:80],
        tags=request.metadata.tags,
    )

Copilot AI Mar 10, 2026


This warning log includes content_preview=request.content[:80]. Memory contents can plausibly contain sensitive/PII data, and the current observability sanitization only redacts based on key names, so this preview will be emitted in cleartext. Safer options: omit content entirely, log only content length / hash, or gate the preview behind a debug-only flag with explicit redaction.

Comment on lines +21 to +26
def validate_memory_tags(request: MemoryStoreRequest) -> None:
    """Log a warning when the non-inferable tag is missing.

    This is advisory only — the store operation is never blocked.
    Wire into ``MemoryBackend.store()`` callers to activate enforcement.


Copilot AI Mar 10, 2026


validate_memory_tags() is introduced as the store-boundary guard, but a repo-wide search shows no call sites, so the guard currently never runs in production code. To actually enforce/tag-warn at the boundary, wire this into the relevant MemoryBackend.store() implementation(s) or into a wrapper created by the memory factory (keeping it advisory-only).
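One way to wire the guard, as the comment suggests, is a thin wrapper around the backend. This is a hypothetical sketch under assumptions: the dict-shaped request and the class names are illustrative, not the project's actual types.

```python
import warnings

NON_INFERABLE_TAG = "non-inferable"


def validate_memory_tags(request: dict) -> None:
    """Advisory check: warns when the tag is missing, never blocks."""
    if NON_INFERABLE_TAG not in request.get("tags", ()):
        warnings.warn("memory stored without non-inferable tag")


class GuardedBackend:
    """Wraps a backend so every store() passes through the advisory guard."""

    def __init__(self, inner) -> None:
        self._inner = inner

    def store(self, request: dict):
        validate_memory_tags(request)  # advisory only, store proceeds
        return self._inner.store(request)
```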

Comment on lines +59 to +66
def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
    self._required_tag = required_tag
    logger.debug(
        MEMORY_FILTER_APPLIED,
        strategy=self.strategy_name,
        phase="init",
        required_tag=required_tag,
    )

Copilot AI Mar 10, 2026


TagBasedMemoryFilter.__init__() logs MEMORY_FILTER_APPLIED with phase="init", but the event name implies an application action and the PR description mentions a separate MEMORY_FILTER_INIT constant. Consider adding a dedicated init event constant (or renaming the current log/event) so init vs application can be distinguished cleanly in telemetry.

Comment on lines +198 to +210
# Advisory only — issues are logged but never block prompt construction.
if org_policies:
    try:
        validate_policy_quality(org_policies)
    except MemoryError, RecursionError:
        raise
    except Exception:
        logger.warning(
            PROMPT_BUILD_ERROR,
            agent_id=str(agent.id),
            error="Policy quality validation failed (advisory, continuing)",
            exc_info=True,
        )

Copilot AI Mar 10, 2026


The failure path for advisory policy validation logs under PROMPT_BUILD_ERROR (prompt.build.error). Since prompt construction continues successfully, this event can look like a real build failure and may trigger error-rate alerts. Consider adding a dedicated event (e.g. PROMPT_POLICY_VALIDATION_FAILED) or logging under PROMPT_POLICY_QUALITY_ISSUE with an explicit phase to keep observability semantics accurate.

Comment on lines +99 to +103
logger.debug(
    PROMPT_POLICY_QUALITY_ISSUE,
    phase="start",
    policy_count=len(policies),
)

Copilot AI Mar 10, 2026


validate_policy_quality() logs a DEBUG event using PROMPT_POLICY_QUALITY_ISSUE with phase="start". This means the same event name will be emitted even when there are zero issues, which can skew event-based analytics (counts of “quality issues”). Consider using a distinct start event constant or a differently named event for the start log.

Comment on lines +34 to +40
_CODE_PATTERNS: Final[tuple[re.Pattern[str], ...]] = (
    re.compile(r"(?:src|tests|lib|app)/[\w/]+\.py"),  # file paths
    re.compile(r"\bfrom\s+\w+\s+import\b"),  # Python imports
    re.compile(r"\bimport\s+\w+"),  # bare imports
    re.compile(r"\bdef\s+\w+\s*\("),  # function definitions
    re.compile(r"\bclass\s+\w+[\s:(]"),  # class definitions
)

Copilot AI Mar 10, 2026


The _CODE_PATTERNS regexes are case-sensitive (e.g. \bimport\s+\w+), so policies containing capitalized forms like Import json / From x import y won't be detected. Consider compiling these patterns with re.IGNORECASE (or running them against policy.lower()) to make the heuristic robust to capitalization.
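The case-sensitivity gap is easy to reproduce with a minimal check (illustrative patterns only, not the project's actual constants):

```python
import re

case_sensitive = re.compile(r"\bimport\s+\w+")
case_insensitive = re.compile(r"\bimport\s+\w+", re.IGNORECASE)

policy = "Import json at module top level"
# The case-sensitive pattern misses the capitalized form;
# adding re.IGNORECASE catches it.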

Comment on lines 83 to 92
accumulated = result.execution_result.context.accumulated_cost
return cls(
    task_id=result.task_id,
    agent_id=result.agent_id,
    turns_per_task=result.total_turns,
    tokens_per_task=accumulated.total_tokens,
    cost_per_task=result.total_cost_usd,
    duration_seconds=result.duration_seconds,
    prompt_tokens=result.system_prompt.estimated_tokens,
)

Copilot AI Mar 10, 2026


prompt_tokens is populated from result.system_prompt.estimated_tokens, but the system prompt message is included in ctx.conversation and is resent on every provider call. Since tokens_per_task aggregates tokens across all turns, prompt_token_ratio will be underestimated for multi-turn runs. Consider either (a) making prompt_tokens represent total prompt tokens across the run (e.g., estimate × result.total_turns), or (b) renaming the field to clarify it's per-call and adjusting the ratio/warning accordingly.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/ai_company/engine/prompt.py (1)

424-457: ⚠️ Potential issue | 🟠 Major

SystemPrompt.sections is now wrong for custom templates that render tools.

The module docstring and _build_template_context() still support custom templates that render tools, but _compute_sections() can no longer ever report a "tools" section. That makes the public sections metadata and the PROMPT_BUILD_SUCCESS log inaccurate for a supported rendering path. Either restore tool tracking for custom templates, or narrow the documented contract to "default-template sections only."

Also applies to: 647-651

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/prompt.py` around lines 424 - 457, _compute_sections
currently never includes the tools section, which breaks SystemPrompt.sections
and PROMPT_BUILD_SUCCESS for custom templates that render tools; update
_compute_sections to append _SECTION_TOOLS when the template rendering path will
include tools (e.g., when available_tools is non-empty or when the template
context indicates it renders tools), and mirror the same check where sections
are computed in the other related block referenced (around the other function at
lines 647-651); ensure the change uses the existing symbols _compute_sections,
_SECTION_TOOLS, available_tools, _build_template_context, SystemPrompt.sections,
and PROMPT_BUILD_SUCCESS so the public sections metadata accurately reflects
templates that include tools.
DESIGN_SPEC.md (1)

1606-1608: ⚠️ Potential issue | 🟡 Minor

Reconcile the "enforced" vs "advisory" store-boundary wording.

Section 7.7 says the non-inferable tag convention is enforced at MemoryBackend.store(), while the project-structure entry describes store_guard.py as advisory. Please align those terms so the spec does not promise hard enforcement if the current implementation only warns.

Also applies to: 2931-2931

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/engine/policy_validation.py`:
- Around line 124-185: Split the long _check_single_policy function into focused
helpers: implement helpers like _check_policy_length(policy) (using
_MIN_POLICY_LENGTH and _MAX_POLICY_LENGTH and returning
list[PolicyQualityIssue]), _check_policy_code_patterns(policy) (using
_CODE_PATTERNS and preserving the single-match break behavior), and
_check_policy_action_verbs(policy) (using _ACTION_VERB_RE), then have
_check_single_policy simply call and aggregate results from these helpers; keep
all existing messages/severity and the PolicyQualityIssue construction identical
so behavior and tests remain unchanged, add small docstrings for each helper and
update or add unit tests as needed.
- Around line 99-114: The preflight debug call uses PROMPT_POLICY_QUALITY_ISSUE
so every run emits an "issue" event; change the preflight marker to a distinct
constant or remove it so only real findings use PROMPT_POLICY_QUALITY_ISSUE.
Locate the debug call that invokes logger.debug with PROMPT_POLICY_QUALITY_ISSUE
(in the loop surrounding _check_single_policy and the issues emission) and
either replace PROMPT_POLICY_QUALITY_ISSUE with a new
PROMPT_POLICY_QUALITY_START (or similar) constant and add that constant where
events are defined, or simply delete the preflight logger.debug line so only the
subsequent logger.warning calls emit PROMPT_POLICY_QUALITY_ISSUE for real
issues.

In `@src/ai_company/engine/prompt.py`:
- Around line 198-210: The org_policies validation currently swallows
non-string/blank entries into a broad except and allows corrupted policies to
reach template rendering; update prompt construction to perform strict per-item
validation (implement a helper like _validate_org_policies(agent: AgentIdentity,
org_policies: tuple[str, ...])) that iterates org_policies, logs an error via
logger.error(PROMPT_BUILD_ERROR, agent_id=str(agent.id), error=msg) and raises
PromptBuildError for any item that is not a non-empty string, and replace the
broad except Exception around validate_policy_quality with either letting
PromptBuildError propagate or only catching MemoryError/RecursionError so
malformed inputs fail fast before template rendering.
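A hedged sketch of the fail-fast validator described above: `PromptBuildError` and the index-based error message are modeled on the review comment, not the project's actual API, and the real helper also logs via the centralized structured logger before raising.

```python
# Illustrative per-item policy validator; PromptBuildError and the message
# format are assumptions modeled on the review comment.
class PromptBuildError(ValueError):
    """Raised when prompt inputs are malformed."""


def validate_org_policies(org_policies: tuple[object, ...]) -> None:
    """Fail fast on any policy that is not a non-empty string."""
    for index, policy in enumerate(org_policies):
        if not isinstance(policy, str) or not policy.strip():
            # In the real module this would also emit PROMPT_BUILD_ERROR
            # through the structured logger before raising.
            raise PromptBuildError(
                f"org_policies[{index}] must be a non-empty string"
            )
```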

In `@tests/unit/engine/test_agent_engine.py`:
- Around line 861-915: Parametrize the two prompt-ratio cases rather than
duplicating tests: replace the two separate async tests in
TestAgentEnginePromptTokenRatioWarning with a single `@pytest.mark.parametrize`
that yields (input_tokens, output_tokens, estimated_prompt_tokens,
expect_warning). In the test body create the mocked completion response and
provider as before, then make the prompt-size deterministic by injecting a
SystemPrompt (or setting identity.system_prompt) with the specific
estimated_tokens value (or monkeypatch the prompt_template.
SystemPrompt.estimated_tokens) so AgentEngine._log_completion() sees that
explicit estimate when engine.run(...) executes; finally assert based on
expect_warning that PROMPT_TOKEN_RATIO_HIGH appears in
structlog.testing.capture_logs(). Use symbols AgentEngine, engine.run,
_log_completion, SystemPrompt, and PROMPT_TOKEN_RATIO_HIGH to locate the code to
change.
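The parametrization shape suggested above can be sketched like this; the 0.8 threshold, the test name, and the bare-ratio body are illustrative stand-ins — the real test drives `AgentEngine.run(...)` with mocked completions and asserts on captured logs.

```python
# Illustrative parametrize shape only; the real test wires up AgentEngine
# mocks and checks PROMPT_TOKEN_RATIO_HIGH in captured structlog events.
import pytest


@pytest.mark.parametrize(
    ("prompt_tokens", "total_tokens", "expect_warning"),
    [
        (900, 1000, True),   # heavy prompt overhead -> warning expected
        (100, 1000, False),  # modest overhead -> no warning
    ],
)
def test_prompt_token_ratio_warning(prompt_tokens, total_tokens, expect_warning):
    # The 0.8 threshold here is an assumption; the real value lives in
    # _PROMPT_TOKEN_RATIO_THRESHOLD inside agent_engine.py.
    ratio = prompt_tokens / total_tokens
    assert (ratio > 0.8) == expect_warning
```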

In `@tests/unit/memory/test_retriever.py`:
- Around line 469-485: The test test_filter_skipped_when_none currently relies
on implicit defaults and should explicitly pin non_inferable_only to False: when
constructing ContextInjectionStrategy with memory_filter=None, pass
MemoryRetrievalConfig(min_relevance=0.0, non_inferable_only=False) so the test
exercises the “no-filter” branch regardless of config defaults; update the
instantiation that uses MemoryRetrievalConfig in this test to include
non_inferable_only=False.

---

Outside diff comments:
In `@src/ai_company/engine/prompt.py`:
- Around line 424-457: _compute_sections currently never includes the tools
section, which breaks SystemPrompt.sections and PROMPT_BUILD_SUCCESS for custom
templates that render tools; update _compute_sections to append _SECTION_TOOLS
when the template rendering path will include tools (e.g., when available_tools
is non-empty or when the template context indicates it renders tools), and
mirror the same check where sections are computed in the other related block
referenced (around the other function at lines 647-651); ensure the change uses
the existing symbols _compute_sections, _SECTION_TOOLS, available_tools,
_build_template_context, SystemPrompt.sections, and PROMPT_BUILD_SUCCESS so the
public sections metadata accurately reflects templates that include tools.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b576f14-6023-4d51-bed1-961c80e683bc

📥 Commits

Reviewing files that changed from the base of the PR and between 2d091ea and 8943bef.

📒 Files selected for processing (21)
  • CLAUDE.md
  • DESIGN_SPEC.md
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/metrics.py
  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt.py
  • src/ai_company/engine/prompt_template.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/retrieval_config.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/store_guard.py
  • src/ai_company/observability/events/memory.py
  • src/ai_company/observability/events/prompt.py
  • tests/unit/engine/test_agent_engine.py
  • tests/unit/engine/test_metrics.py
  • tests/unit/engine/test_policy_validation.py
  • tests/unit/engine/test_prompt.py
  • tests/unit/memory/org/test_prompt_integration.py
  • tests/unit/memory/test_filter.py
  • tests/unit/memory/test_retriever.py
  • tests/unit/memory/test_store_guard.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use from ai_company.observability import get_logger and instantiate logger as logger = get_logger(__name__) in every module with business logic

Files:

  • src/ai_company/observability/events/memory.py
  • tests/unit/memory/test_store_guard.py
  • src/ai_company/observability/events/prompt.py
  • src/ai_company/engine/agent_engine.py
  • tests/unit/engine/test_metrics.py
  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt_template.py
  • tests/unit/engine/test_policy_validation.py
  • tests/unit/memory/test_filter.py
  • tests/unit/engine/test_prompt.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/retrieval_config.py
  • src/ai_company/memory/store_guard.py
  • tests/unit/memory/org/test_prompt_integration.py
  • src/ai_company/engine/metrics.py
  • tests/unit/memory/test_retriever.py
  • tests/unit/engine/test_agent_engine.py
  • src/ai_company/engine/prompt.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Never use import logging, logging.getLogger(), or print() in application code — only use the centralized logger
Use event name constants from ai_company.observability.events.<domain> modules (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget) instead of string literals
Always use structured logging with logger.info(EVENT, key=value) format — never use format strings like logger.info('msg %s', val)
No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax (no parentheses) per PEP 758 — ruff enforces this on Python 3.14
Add type hints to all public functions and classes; mypy strict mode is enforced
Add Google-style docstrings to all public classes and functions — ruff D rules enforce this
Use immutability principles: create new objects instead of mutating existing ones; for non-Pydantic collections use copy.deepcopy() at construction and MappingProxyType for read-only enforcement
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves
Use NotBlankStr from core.types for all identifier and name fields instead of manual whitespace validators, including optional (NotBlankStr | None) and tuple variants
Use @computed_field for derived values in Pydantic models instead of storing and validating redundant fields (e.g., TokenUsage.total_tokens)
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls) over bare create_task
Keep functions under 50 lines and files under 800 lines
Keep line length at 88 characters (enforced by ruff)
Handle all errors explicitly; never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Never use real vendor names (Anthropic, OpenAI, Claude, GPT) in project-owned code, docstrings...

Files:

  • src/ai_company/observability/events/memory.py
  • src/ai_company/observability/events/prompt.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt_template.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/retrieval_config.py
  • src/ai_company/memory/store_guard.py
  • src/ai_company/engine/metrics.py
  • src/ai_company/engine/prompt.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Add test markers @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow to all test files
Maintain 80% minimum code coverage — enforced in CI with pytest --cov=ai_company --cov-fail-under=80
Use asyncio_mode = 'auto' in pytest configuration — no manual @pytest.mark.asyncio needed on async tests
Set test timeout to 30 seconds per test — use @pytest.mark.timeout(30) or configure in pytest.ini
Use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions

Files:

  • tests/unit/memory/test_store_guard.py
  • tests/unit/engine/test_metrics.py
  • tests/unit/engine/test_policy_validation.py
  • tests/unit/memory/test_filter.py
  • tests/unit/engine/test_prompt.py
  • tests/unit/memory/org/test_prompt_integration.py
  • tests/unit/memory/test_retriever.py
  • tests/unit/engine/test_agent_engine.py
🧠 Learnings (7)
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Use event name constants from `ai_company.observability.events.<domain>` modules (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`) instead of string literals

Applied to files:

  • src/ai_company/observability/events/memory.py
  • src/ai_company/observability/events/prompt.py
  • src/ai_company/engine/agent_engine.py
  • CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Validate at system boundaries (user input, external APIs, config files)

Applied to files:

  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt.py
  • CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Use `computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)

Applied to files:

  • src/ai_company/engine/metrics.py
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to **/*.py : Use `from ai_company.observability import get_logger` and instantiate logger as `logger = get_logger(__name__)` in every module with business logic

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Always use structured logging with `logger.info(EVENT, key=value)` format — never use format strings like `logger.info('msg %s', val)`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Never use `import logging`, `logging.getLogger()`, or `print()` in application code — only use the centralized logger

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Handle all errors explicitly; never silently swallow exceptions

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (13)
tests/unit/memory/test_store_guard.py (3)
src/ai_company/core/enums.py (1)
  • MemoryCategory (101-108)
src/ai_company/memory/models.py (2)
  • MemoryMetadata (20-52)
  • MemoryStoreRequest (55-79)
src/ai_company/memory/store_guard.py (1)
  • validate_memory_tags (21-36)
src/ai_company/engine/agent_engine.py (3)
src/ai_company/engine/metrics.py (1)
  • prompt_token_ratio (66-70)
src/ai_company/engine/parallel_models.py (2)
  • agent_id (79-81)
  • task_id (87-89)
src/ai_company/engine/loop_protocol.py (1)
  • total_tokens (73-75)
tests/unit/engine/test_metrics.py (1)
src/ai_company/engine/metrics.py (2)
  • prompt_token_ratio (66-70)
  • TaskCompletionMetrics (17-92)
src/ai_company/engine/policy_validation.py (1)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/engine/prompt_template.py (1)
src/ai_company/core/enums.py (1)
  • SeniorityLevel (6-21)
tests/unit/engine/test_policy_validation.py (1)
src/ai_company/engine/policy_validation.py (2)
  • PolicyQualityIssue (67-82)
  • validate_policy_quality (85-116)
tests/unit/engine/test_prompt.py (2)
src/ai_company/engine/prompt.py (1)
  • build_system_prompt (159-258)
src/ai_company/core/company.py (1)
  • Company (400-483)
src/ai_company/memory/retriever.py (1)
src/ai_company/memory/filter.py (5)
  • TagBasedMemoryFilter (48-101)
  • MemoryFilterStrategy (25-45)
  • filter_for_injection (28-40)
  • filter_for_injection (68-92)
  • filter_for_injection (111-129)
src/ai_company/memory/filter.py (2)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/memory/ranking.py (1)
  • ScoredMemory (26-60)
src/ai_company/memory/store_guard.py (2)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/memory/models.py (1)
  • MemoryStoreRequest (55-79)
tests/unit/memory/test_retriever.py (2)
src/ai_company/memory/filter.py (2)
  • PassthroughMemoryFilter (104-138)
  • TagBasedMemoryFilter (48-101)
src/ai_company/memory/models.py (1)
  • MemoryMetadata (20-52)
tests/unit/engine/test_agent_engine.py (1)
src/ai_company/engine/agent_engine.py (1)
  • run (169-267)
src/ai_company/engine/prompt.py (2)
src/ai_company/engine/policy_validation.py (1)
  • validate_policy_quality (85-116)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
🪛 LanguageTool
CLAUDE.md

[style] ~87-~87: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...

(EG_NO_COMMA)

🔇 Additional comments (11)
tests/unit/memory/org/test_prompt_integration.py (1)

51-52: LGTM!

The version expectation update to "1.3.0" aligns with the template version bump in this PR.

src/ai_company/memory/retrieval_config.py (1)

93-96: LGTM!

The new non_inferable_only configuration field is well-documented with clear descriptions in both the docstring and Field metadata. Default False preserves backward compatibility.
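For readers outside the diff, the flag likely looks roughly like this; the field name and default match the PR, while the surrounding model fields and frozen config are illustrative assumptions.

```python
# Illustrative subset of MemoryRetrievalConfig; only non_inferable_only
# matches the PR exactly, the rest is a sketch.
from pydantic import BaseModel, Field


class MemoryRetrievalConfig(BaseModel):
    """Sketch of the retrieval config with the new flag."""

    model_config = {"frozen": True}

    min_relevance: float = Field(default=0.0, ge=0.0, le=1.0)
    non_inferable_only: bool = Field(
        default=False,
        description=(
            "When True and no explicit filter is supplied, the retriever "
            "auto-creates a TagBasedMemoryFilter."
        ),
    )
```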

src/ai_company/observability/events/memory.py (1)

66-70: LGTM!

New event constants follow the established memory.<entity>.<action> naming convention and are properly typed with Final[str]. The section header improves organization.

src/ai_company/memory/store_guard.py (1)

1-36: LGTM!

Well-structured advisory guard with proper observability. The function correctly uses event constants, structured logging with key=value format, and TYPE_CHECKING for import optimization. The docstring clearly documents the advisory-only nature.
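The advisory contract can be sketched as follows: the guard records a warning and returns, never raising, so stores always proceed. The event-recording list here is a stand-in for the project's structured logger, and the event string mirrors the MEMORY_FILTER_STORE_MISSING_TAG constant named elsewhere in this review.

```python
# Advisory guard sketch: warn (never raise) when the non-inferable tag is
# missing. The emitted_events list stands in for the structured logger.
NON_INFERABLE_TAG = "non-inferable"


def validate_memory_tags(tags: tuple[str, ...], emitted_events: list[str]) -> bool:
    """Return True if tags pass; record an advisory warning otherwise."""
    if NON_INFERABLE_TAG in tags:
        return True
    # Advisory only: record the warning event and let the store proceed.
    emitted_events.append("memory.filter.store_missing_tag")
    return False
```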

src/ai_company/memory/filter.py (1)

1-138: LGTM!

Well-designed pluggable filter architecture:

  • Protocol is @runtime_checkable, enabling isinstance() checks
  • Both implementations properly log filter statistics (candidates/retained)
  • DEBUG logging at init and INFO at filter execution provide good observability
  • Consistent structured logging with event constants
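The protocol-plus-implementations shape can be sketched like this; the method name and class names come from the review, but the memory representation (plain dicts with a `tags` key) and constructor defaults are illustrative, not the real `ScoredMemory`-based signatures.

```python
# Sketch of the pluggable filter protocol; the real filter.py operates on
# ScoredMemory objects, dicts are used here for a self-contained example.
from typing import Protocol, runtime_checkable


@runtime_checkable
class MemoryFilterStrategy(Protocol):
    """Strategy for filtering memories before context injection."""

    def filter_for_injection(self, memories: list[dict]) -> list[dict]: ...


class TagBasedMemoryFilter:
    """Retain only memories carrying the required tag."""

    def __init__(self, required_tag: str = "non-inferable") -> None:
        self.required_tag = required_tag

    def filter_for_injection(self, memories: list[dict]) -> list[dict]:
        return [m for m in memories if self.required_tag in m.get("tags", ())]


class PassthroughMemoryFilter:
    """No-op filter that returns all memories unchanged."""

    def filter_for_injection(self, memories: list[dict]) -> list[dict]:
        return list(memories)
```

Because the protocol is @runtime_checkable, both implementations satisfy `isinstance(f, MemoryFilterStrategy)` without inheriting from it.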
src/ai_company/memory/retriever.py (2)

273-295: LGTM!

Excellent filter integration with proper error handling:

  • Graceful degradation on filter errors (uses unfiltered results)
  • Re-raises system-level errors (MemoryError, RecursionError)
  • Logs with filter strategy name for debugging
  • Properly handles empty filter results with informative skip reason

The filter is correctly positioned in the pipeline after ranking but before formatting.
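The degradation pattern described here can be sketched as below. The function and parameter names are illustrative, and the parenthesized `except (A, B):` is used for portability in this sketch even though the repo itself targets Python 3.14's PEP 758 `except A, B:` style.

```python
# Graceful-degradation sketch: fall back to unfiltered memories on filter
# errors, but let system-level errors propagate. Names are illustrative.
def apply_filter(memories: list, memory_filter) -> list:
    try:
        filtered = memory_filter(memories)
    except (MemoryError, RecursionError):
        raise  # system-level errors always propagate
    except Exception:
        # In the real retriever this path logs a warning with the filter
        # strategy name and error_type before falling back.
        return list(memories)
    return filtered
```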


126-128: LGTM!

The auto-instantiation logic correctly defaults to TagBasedMemoryFilter when config.non_inferable_only is enabled and no explicit filter is provided, maintaining backward compatibility when the flag is False.

src/ai_company/observability/events/prompt.py (1)

12-13: LGTM!

New prompt event constants follow the established naming convention. The # noqa: S105 suppression on line 13 is appropriate—this is an event name, not a credential.

CLAUDE.md (2)

52-54: LGTM!

Documentation accurately reflects the new modules:

  • Engine module now includes "prompt policy validation" (policy_validation.py)
  • Memory module now includes "non-inferable filtering" (filter.py)

87-88: LGTM!

Event name examples updated to include PROMPT_BUILD_START from events.prompt and MEMORY_RETRIEVAL_START from events.memory, providing helpful guidance for developers working with the new observability surface.

src/ai_company/engine/metrics.py (1)

58-70: Nice use of @computed_field for prompt_token_ratio.

Keeping the ratio derived from prompt_tokens and tokens_per_task avoids redundant state and makes from_run_result() simpler.

Based on learnings: Use computed_field for derived values in Pydantic models instead of storing and validating redundant fields (e.g., TokenUsage.total_tokens)
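A minimal sketch of the convention being praised here: the field names match the PR summary, but the model is stripped down to just the two inputs and the derived ratio, so validators and other fields from the real `TaskCompletionMetrics` are omitted.

```python
# Hedged sketch of prompt_token_ratio as a computed_field; field names
# follow the PR, the model shape is illustrative.
from pydantic import BaseModel, computed_field


class TaskCompletionMetrics(BaseModel):
    prompt_tokens: int = 0
    tokens_per_task: int = 0

    @computed_field  # derived on access, never stored as redundant state
    @property
    def prompt_token_ratio(self) -> float:
        if self.tokens_per_task == 0:
            return 0.0
        return self.prompt_tokens / self.tokens_per_task
```

Since it is a computed_field rather than a stored field, the ratio also appears in `model_dump()` output without `from_run_result()` having to calculate it.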

Comment on lines +124 to +185
def _check_single_policy(policy: str) -> list[PolicyQualityIssue]:
    """Run all heuristic checks on a single policy string.

    Args:
        policy: The policy text to validate.

    Returns:
        List of quality issues found (empty if the policy passes all checks).
    """
    found: list[PolicyQualityIssue] = []

    if len(policy) < _MIN_POLICY_LENGTH:
        found.append(
            PolicyQualityIssue(
                policy=policy,
                issue=(
                    f"Too short ({len(policy)} chars) — likely not an actionable policy"
                ),
                severity="warning",
            ),
        )

    if len(policy) > _MAX_POLICY_LENGTH:
        found.append(
            PolicyQualityIssue(
                policy=policy,
                issue=(
                    f"Too long ({len(policy)} chars) — "
                    f"may contain inferable context rather than a policy"
                ),
                severity="warning",
            ),
        )

    for pattern in _CODE_PATTERNS:
        if pattern.search(policy):
            found.append(
                PolicyQualityIssue(
                    policy=policy,
                    issue=(
                        "Contains code patterns (file paths, imports, or "
                        "definitions) — likely inferable from the codebase"
                    ),
                    severity="warning",
                ),
            )
            break  # One code-pattern match is sufficient.

    policy_lower = policy.lower()
    if not _ACTION_VERB_RE.search(policy_lower):
        found.append(
            PolicyQualityIssue(
                policy=policy,
                issue=(
                    "Missing action verbs (must, should, always, never, "
                    "etc.) — may not be an actionable policy"
                ),
                severity="warning",
            ),
        )

    return found

🛠️ Refactor suggestion | 🟠 Major

Split _check_single_policy() into focused helpers.

This helper is already over the 50-line limit and now mixes length, code-pattern, and action-verb heuristics. Extract those checks into small helpers so future rule changes stay isolated and easier to test.

Refactor sketch
 def _check_single_policy(policy: str) -> list[PolicyQualityIssue]:
-    found: list[PolicyQualityIssue] = []
-
-    if len(policy) < _MIN_POLICY_LENGTH:
-        found.append(...)
-
-    if len(policy) > _MAX_POLICY_LENGTH:
-        found.append(...)
-
-    for pattern in _CODE_PATTERNS:
-        if pattern.search(policy):
-            found.append(...)
-            break
-
-    policy_lower = policy.lower()
-    if not _ACTION_VERB_RE.search(policy_lower):
-        found.append(...)
-
-    return found
+    return [
+        *_check_policy_length(policy),
+        *_check_code_patterns(policy),
+        *_check_action_verbs(policy),
+    ]

As per coding guidelines "Keep functions under 50 lines and files under 800 lines".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/policy_validation.py` around lines 124 - 185, Split the
long _check_single_policy function into focused helpers: implement helpers like
_check_policy_length(policy) (using _MIN_POLICY_LENGTH and _MAX_POLICY_LENGTH
and returning list[PolicyQualityIssue]), _check_policy_code_patterns(policy)
(using _CODE_PATTERNS and preserving the single-match break behavior), and
_check_policy_action_verbs(policy) (using _ACTION_VERB_RE), then have
_check_single_policy simply call and aggregate results from these helpers; keep
all existing messages/severity and the PolicyQualityIssue construction identical
so behavior and tests remain unchanged, add small docstrings for each helper and
update or add unit tests as needed.

…eptile

- Split policy_validation into focused helper functions (<50 lines each)
- Add case-insensitive regex matching for code pattern detection
- Add fail-fast org_policies validation in prompt builder
- Add model_validator for prompt_tokens <= tokens_per_task in metrics
- Add allow_inf_nan=False to PolicyQualityIssue and TaskCompletionMetrics
- Use distinct event constants for validation start vs quality issues
- Upgrade classification failure log from DEBUG to WARNING with exc_info
- Replace content_preview with content_length in store_guard to avoid PII
- Add non-empty validation for TagBasedMemoryFilter required_tag
- Add error_type to filter error handler for consistency
- Update DESIGN_SPEC.md, README.md with non-inferable filtering references
- Add boundary tests for policy length, frozen enforcement, severity
- Add prompt validation integration tests (error isolation, blank rejection)
- Add retriever tests for config-driven filter, graceful degradation, MemoryError
- Parametrize prompt token ratio warning tests
The model_validator ensuring prompt_tokens <= tokens_per_task was
rejecting the metrics in test_metrics_logged_on_completion because
the default 150 tokens was less than the system prompt estimate (~306).
Copilot AI review requested due to automatic review settings March 10, 2026 11:18
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's adherence to the 'non-inferable principle' by refining how information is presented to agents, particularly within system prompts and memory retrieval. It introduces mechanisms to validate policy quality, filters memory to include only non-inferable content, and improves prompt token ratio metrics for better overhead visibility. These changes aim to make agent interactions more efficient and aligned with design principles, ensuring agents receive only necessary, non-discoverable information.

Highlights

  • Prompt Token Ratio: Renamed prompt_cost_ratio to prompt_token_ratio and converted it to a @computed_field for derived values, aligning with project conventions.
  • Memory Filtering: Implemented a non_inferable_only configuration flag to automatically create a TagBasedMemoryFilter within ContextInjectionStrategy, including graceful degradation for filter errors.
  • Policy Validation: Added word-boundary regex for action verb detection in policy validation to prevent false negatives and wrapped validate_policy_quality in a try/except block to ensure it remains advisory-only.
  • Prompt Templates: Removed tool definitions from the default system prompt template, adhering to the non-inferable principle where agents discover tools at runtime.
  • Observability: Introduced DEBUG logging for filter, guard, and validation entry points, and added new event constants (MEMORY_FILTER_APPLIED, MEMORY_FILTER_INIT) for enhanced monitoring.
  • Import Hygiene: Corrected import ordering in filter.py and store_guard.py to place runtime imports before TYPE_CHECKING blocks.
  • Documentation: Updated DESIGN_SPEC.md and CLAUDE.md to reflect new project structure, memory pipeline descriptions, and logging examples.
Changelog
  • CLAUDE.md
    • Updated descriptions for the engine and memory modules to reflect new functionalities like prompt policy validation and non-inferable filtering.
    • Expanded the list of required event constants to include PROMPT_BUILD_START and MEMORY_RETRIEVAL_START.
  • DESIGN_SPEC.md
    • Updated the implementation snapshot to include 'non-inferable filtering' as a completed part of the memory retrieval pipeline.
    • Modified the memory retrieval pipeline description to explicitly include the application of MemoryFilterStrategy.
    • Added new metrics, prompt_tokens and prompt_token_ratio, to the task completion metrics section.
    • Updated the project structure to include new files: policy_validation.py, filter.py, store_guard.py, and events/security.py.
  • README.md
    • Updated the description of the 'Memory Interface' to include 'non-inferable filtering' in the retrieval pipeline.
  • src/ai_company/engine/agent_engine.py
    • Imported PROMPT_TOKEN_RATIO_HIGH event constant.
    • Added _PROMPT_TOKEN_RATIO_THRESHOLD constant for warning emission.
    • Changed logging level from debug to warning for classification errors and added exc_info=True.
    • Added logging for prompt_tokens and prompt_token_ratio in completion logs.
    • Implemented a warning mechanism for high prompt-to-total token ratios.
  • src/ai_company/engine/metrics.py
    • Imported computed_field and model_validator from Pydantic.
    • Updated model_config to include allow_inf_nan=False.
    • Added prompt_tokens field to TaskCompletionMetrics.
    • Implemented _validate_prompt_tokens using @model_validator to ensure prompt tokens do not exceed total tokens.
    • Added prompt_token_ratio as a @computed_field to calculate the ratio of prompt tokens to total tokens.
    • Updated from_run_result to include prompt_tokens from SystemPrompt.estimated_tokens.
  • src/ai_company/engine/policy_validation.py
    • Added new file policy_validation.py to define heuristics for validating the quality of organizational policies.
    • Defined PolicyQualityIssue model for reporting policy problems.
    • Implemented validate_policy_quality function to check policies for length, code patterns, and action verbs.
    • Defined constants for minimum/maximum policy length, common code patterns, and actionable verbs.
  • src/ai_company/engine/prompt.py
    • Added documentation explaining the 'non-inferable principle' and its impact on tool definitions in system prompts.
    • Imported validate_policy_quality from the new policy_validation module.
    • Imported PROMPT_POLICY_VALIDATION_FAILED event constant.
    • Removed _SECTION_TOOLS from _TRIMMABLE_SECTIONS and related logic, aligning with the non-inferable principle.
    • Added _validate_org_policies function to enforce non-empty string policies.
    • Integrated validate_policy_quality into build_system_prompt with graceful degradation for validation errors.
    • Updated _compute_sections and _trim_sections to no longer consider available_tools for default template sections.
  • src/ai_company/engine/prompt_template.py
    • Updated PROMPT_TEMPLATE_VERSION to 1.3.0.
    • Added a detailed explanation of the 'non-inferable principle' regarding tool definitions in the default template.
    • Removed the Available Tools section from the DEFAULT_TEMPLATE.
  • src/ai_company/memory/filter.py
    • Added new file filter.py to define memory filter strategies.
    • Defined MemoryFilterStrategy protocol for filtering scored memories.
    • Implemented TagBasedMemoryFilter to retain only memories with a specific tag (defaulting to 'non-inferable').
    • Implemented PassthroughMemoryFilter as a no-op filter.
    • Defined NON_INFERABLE_TAG constant.
  • src/ai_company/memory/retrieval_config.py
    • Added non_inferable_only boolean field to MemoryRetrievalConfig to control automatic filter creation.
  • src/ai_company/memory/retriever.py
    • Imported TagBasedMemoryFilter and MemoryFilterStrategy.
    • Added memory_filter parameter to ContextInjectionStrategy constructor.
    • Implemented logic to auto-create TagBasedMemoryFilter if non_inferable_only is true and no explicit filter is provided.
    • Integrated memory filtering into the _execute_pipeline after ranking.
    • Added graceful degradation for MemoryFilterStrategy errors, falling back to unfiltered memories while logging warnings.
    • Ensured MemoryError and RecursionError propagate from the filter.
  • src/ai_company/memory/store_guard.py
    • Added new file store_guard.py for advisory non-inferable tag enforcement.
    • Implemented validate_memory_tags function to log a warning if the 'non-inferable' tag is missing from a MemoryStoreRequest.
  • src/ai_company/observability/events/memory.py
    • Added new event constants: MEMORY_FILTER_INIT, MEMORY_FILTER_APPLIED, and MEMORY_FILTER_STORE_MISSING_TAG.
  • src/ai_company/observability/events/prompt.py
    • Added new event constants: PROMPT_POLICY_VALIDATION_START, PROMPT_POLICY_QUALITY_ISSUE, PROMPT_POLICY_VALIDATION_FAILED, and PROMPT_TOKEN_RATIO_HIGH.
  • tests/unit/engine/test_agent_engine.py
    • Imported structlog.testing and PROMPT_TOKEN_RATIO_HIGH.
    • Updated assertion for system prompt to confirm 'tools' section is not present in the default template.
    • Added TestAgentEnginePromptTokenRatioWarning class with parameterized tests to verify warning emission based on prompt token ratio.
  • tests/unit/engine/test_agent_engine_lifecycle.py
    • Updated _make_completion_response in test_metrics_logged_on_completion to include input_tokens and output_tokens for accurate metric calculation.
  • tests/unit/engine/test_metrics.py
    • Added tests for prompt_tokens and prompt_token_ratio fields, including default values, negative input rejection, and ratio computation.
    • Added tests for prompt_token_ratio at boundary conditions (e.g., when prompt tokens equal total tokens).
    • Updated test_from_run_result_extracts_values to assert correct prompt_tokens and prompt_token_ratio values.
    • Updated test_from_run_result_zero_turns to assert correct prompt_token_ratio and prompt_tokens for zero-turn scenarios.
  • tests/unit/engine/test_policy_validation.py
    • Added new file test_policy_validation.py with comprehensive unit tests for policy quality validation heuristics.
    • Tested PolicyQualityIssue model construction and immutability.
    • Included tests for good policies, policies that are too short or too long, policies with code patterns, and policies missing action verbs.
    • Covered edge cases like empty input, multiple issues, logging events, and word-boundary matching for action verbs.
    • Validated length boundaries for policies.
  • tests/unit/engine/test_prompt.py
    • Updated test_tools_not_in_default_template to assert that tools are no longer included in the default system prompt.
    • Added test_tools_render_in_custom_template to confirm tools can still be rendered using custom templates.
    • Updated test_no_tools_section_in_default_template to reflect that the default template never includes tools.
    • Modified test_max_tokens_triggers_trimming to remove available_tools from test parameters and assertions, reflecting the change in trimming priority.
    • Added TestPolicyValidationIntegration class to test integration of policy validation, including error handling and validation of empty/whitespace-only policies.
    • Updated test_template_version_is_1_3_0 and test_sections_tracked to reflect the new template version and removal of the 'tools' section.
  • tests/unit/memory/org/test_prompt_integration.py
    • Updated test_template_version_updated to assert the new PROMPT_TEMPLATE_VERSION of 1.3.0.
  • tests/unit/memory/test_filter.py
    • Added new file test_filter.py with unit tests for MemoryFilterStrategy, TagBasedMemoryFilter, and PassthroughMemoryFilter.
    • Verified protocol compliance for both filter implementations.
    • Tested TagBasedMemoryFilter's ability to retain tagged memories, exclude untagged, handle custom tags, and process empty input.
    • Tested PassthroughMemoryFilter's behavior of returning all memories unchanged and handling empty input.
    • Confirmed the expected value of the NON_INFERABLE_TAG constant.
  • tests/unit/memory/test_retriever.py
    • Imported new filter-related modules and types.
    • Added TestMemoryFilterIntegration class to test the integration of memory filters.
    • Verified that TagBasedMemoryFilter correctly excludes untagged memories after ranking.
    • Confirmed that filters are skipped when memory_filter is None.
    • Tested scenarios where filters reduce output to empty results.
    • Verified PassthroughMemoryFilter behavior.
    • Confirmed that non_inferable_only config flag correctly auto-creates TagBasedMemoryFilter.
    • Added tests for graceful degradation of filters on unexpected errors and proper propagation of MemoryError.
  • tests/unit/memory/test_store_guard.py
    • Added new file test_store_guard.py with unit tests for validate_memory_tags.
    • Verified that requests with the 'non-inferable' tag produce no warnings.
    • Confirmed that untagged requests or requests with other tags but missing 'non-inferable' log warnings.
    • Asserted that validate_memory_tags is advisory-only and never blocks the store operation.
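
The capped-ratio behavior exercised by the metrics tests above can be sketched as follows. This is a minimal stand-in, not the project code: the real model is a Pydantic model using `@computed_field` per project convention, and the field names here are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCompletionMetrics:
    """Sketch of the derived-ratio logic under test (hypothetical fields)."""

    prompt_tokens: int
    tokens_per_task: int

    @property
    def prompt_token_ratio(self) -> float:
        # prompt_tokens is a heuristic estimate, so the ratio is capped at
        # 1.0 rather than rejecting the model when the estimate overshoots.
        if self.tokens_per_task > 0:
            capped = min(self.prompt_tokens, self.tokens_per_task)
            return capped / self.tokens_per_task
        # Zero-turn runs: prompt built but no provider calls were made.
        return 0.0
```

This mirrors the boundary cases in test_metrics.py: an overshooting estimate yields 1.0, and zero total tokens yields 0.0 instead of a division error.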
Activity
  • The pull request was pre-reviewed by 10 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, security-reviewer, docs-consistency).
  • 24 findings identified by the agents were addressed across 14 files.
  • All test_metrics.py tests passed, including new cases for prompt_token_ratio as @computed_field and boundary cases.
  • All test_policy_validation.py tests passed, covering word-boundary action verb matching and multiple code patterns.
  • All test_agent_engine.py tests passed, including high/low prompt token ratio warning emission.
  • All test_filter.py tests passed, covering TagBasedMemoryFilter and PassthroughMemoryFilter behavior.
  • All test_retriever.py tests passed, covering config-driven filter wiring and graceful degradation on filter errors.
  • All test_store_guard.py tests passed, covering advisory tag guard behavior.
  • All test_prompt.py tests passed, including policy validation error isolation.
  • The full test suite passed with 6245 tests, 9 skipped, and 95.07% coverage.
  • Mypy reported no issues across 771 files.
  • Ruff reported all checks passed.
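
The word-boundary action-verb matching covered by test_policy_validation.py can be sketched like this. The verb list and function name are assumptions for illustration; the point is that `\b` boundaries prevent false positives from substring matches.

```python
import re

# Hypothetical verb list — the project's actual set may differ.
ACTION_VERBS = ("must", "should", "never", "always", "avoid")

_VERB_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, ACTION_VERBS)) + r")\b",
    flags=re.IGNORECASE,
)


def has_action_verb(policy: str) -> bool:
    """Return True if the policy contains a whole-word action verb.

    Word boundaries ensure 'mustard' does not match 'must'.
    """
    return _VERB_PATTERN.search(policy) is not None
```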
@coderabbitai bot left a comment


Actionable comments posted: 7

♻️ Duplicate comments (1)
tests/unit/engine/test_agent_engine.py (1)

865-905: 🧹 Nitpick | 🔵 Trivial

Make the ratio-warning test independent of prompt size.

The parametrization helps, but expect_warning=True still depends on whatever build_system_prompt() estimates today. An unrelated prompt-template edit can flip this test without changing AgentEngine. Inject a fixed SystemPrompt.estimated_tokens or unit-test _log_completion() directly so only the threshold logic is under test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/engine/test_agent_engine.py` around lines 865 - 905, The test
test_prompt_token_ratio_warning is brittle because it relies on the current
build_system_prompt() estimate; make it deterministic by injecting a fixed
prompt token estimate or testing the threshold function directly: either (A) set
SystemPrompt.estimated_tokens to a known constant (or monkeypatch the
SystemPrompt instance used by AgentEngine) before creating the engine so the
prompt size is controlled when calling engine.run with the mock provider, or (B)
call AgentEngine._log_completion(...) directly with a constructed
CompletionResponse from _make_completion_response and an explicit
prompt_token_count to exercise only the ratio/threshold logic; reference
AgentEngine, _log_completion, SystemPrompt.estimated_tokens and
_make_completion_response to locate the relevant code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/ai_company/engine/metrics.py`:
- Around line 67-80: The validator _validate_prompt_tokens should not raise a
ValueError when SystemPrompt.estimated_tokens (prompt_tokens) exceeds
tokens_per_task; instead, compute and store a bounded ratio so the model isn't
rejected: when self.tokens_per_task > 0 calculate ratio = self.prompt_tokens /
self.tokens_per_task, set a new or existing attribute like
self.prompt_token_ratio = min(ratio, 1.0) (and optionally clamp prompt_tokens to
self.tokens_per_task if you must keep it as an upper bound), remove the raise,
and mirror the same change for the similar validator handling completion tokens
(the validator around the 82-92 region) so both use a capped ratio rather than
throwing.

In `@src/ai_company/engine/policy_validation.py`:
- Around line 115-121: The warning currently emits policy text via
logger.warning(PROMPT_POLICY_QUALITY_ISSUE, policy=issue.policy[:80], ...),
which can leak sensitive operator-authored content; update the call in the loop
over issues so it no longer includes any substring of issue.policy and instead
logs only metadata (use the same pattern as store_guard.py) by replacing the
policy argument with a content_length or policy_length field set to
len(issue.policy), keeping issue.issue and issue.severity intact and referencing
PROMPT_POLICY_QUALITY_ISSUE, issues, and logger.warning to locate the change.

In `@src/ai_company/engine/prompt.py`:
- Around line 147-154: The trimming logic removed the tools section from global
tracking (_TRIMMABLE_SECTIONS), which breaks custom templates that render tools
and prevents max_tokens from trimming that block; restore or reintroduce the
tools section identifier (e.g., _SECTION_TOOLS) into _TRIMMABLE_SECTIONS and
ensure SystemPrompt.sections still includes the tools section when a template
opt-in indicates it renders tools (update the code paths around
_TRIMMABLE_SECTIONS, SystemPrompt.sections construction, and the max_tokens
trimming routine so tools remain trimmable for templates that declare/return
tools, while keeping tools excluded only from the default template rendering).
- Around line 197-203: Normalize org_policies at the start of
build_system_prompt by converting the incoming org_policies into a stable
iterable (e.g., org_policies = tuple(org_policies or ())) and use that
normalized variable for all subsequent calls and iterations; call
_validate_org_policies(normalized_org_policies) and
validate_policy_quality(normalized_org_policies) (and any rendering/iteration
later) instead of the original parameter so one-shot iterables or None don't get
consumed or raise TypeError—apply the same normalization and reuse in the later
block around the validate_policy_quality / rendering logic (the section around
lines 279-301).

In `@src/ai_company/memory/filter.py`:
- Around line 62-71: The constructor for required_tag accepts non-string values
and stores unnormalized input, causing AttributeError on non-strings and
mismatches for values with surrounding whitespace; update the __init__ of the
class that defines required_tag to first enforce type (raise TypeError or
ValueError if not instance of str), normalize the value by calling stripped =
required_tag.strip() and then validate stripped is non-empty (raise ValueError
if empty), assign self._required_tag = stripped, and ensure the logger call
(MEMORY_FILTER_INIT, strategy=self.strategy_name) uses the normalized stripped
value for required_tag so stored/logged tag matches how tags are compared
elsewhere.

In `@src/ai_company/memory/store_guard.py`:
- Around line 21-36: The helper validate_memory_tags(request:
MemoryStoreRequest) is never invoked so the NON_INFERABLE_TAG warning
(MEMORY_FILTER_STORE_MISSING_TAG) is inert; fix this by calling
validate_memory_tags(request) at the start of every memory persistence path —
either add the call into each concrete MemoryBackend.store(...) implementation
or place it in the shared façade/wrapper that all stores use (ensure any class
implementing MemoryBackend calls validate_memory_tags before persisting), so the
logger warning will fire when tags are missing.

In `@tests/unit/engine/test_prompt.py`:
- Around line 509-531: Combine the two tests into one parametrized test that
iterates over the invalid org_policies values; replace the separate
test_empty_org_policy_raises and test_whitespace_only_org_policy_raises with a
single `@pytest.mark.parametrize-based` test that calls
build_system_prompt(agent=sample_agent_with_personality, org_policies=<param>)
and asserts pytest.raises(PromptBuildError, match="org_policies"); keep the same
docstring and use the same sample_agent_with_personality fixture and
PromptBuildError type to ensure behavior is unchanged.

---

Duplicate comments:
In `@tests/unit/engine/test_agent_engine.py`:
- Around line 865-905: The test test_prompt_token_ratio_warning is brittle
because it relies on the current build_system_prompt() estimate; make it
deterministic by injecting a fixed prompt token estimate or testing the
threshold function directly: either (A) set SystemPrompt.estimated_tokens to a
known constant (or monkeypatch the SystemPrompt instance used by AgentEngine)
before creating the engine so the prompt size is controlled when calling
engine.run with the mock provider, or (B) call AgentEngine._log_completion(...)
directly with a constructed CompletionResponse from _make_completion_response
and an explicit prompt_token_count to exercise only the ratio/threshold logic;
reference AgentEngine, _log_completion, SystemPrompt.estimated_tokens and
_make_completion_response to locate the relevant code to change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 56e3b91e-e748-4640-ab0f-a33ab5debfc8

📥 Commits

Reviewing files that changed from the base of the PR and between 8943bef and 230b8a4.

📒 Files selected for processing (15)
  • DESIGN_SPEC.md
  • README.md
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/engine/metrics.py
  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/store_guard.py
  • src/ai_company/observability/events/memory.py
  • src/ai_company/observability/events/prompt.py
  • tests/unit/engine/test_agent_engine.py
  • tests/unit/engine/test_policy_validation.py
  • tests/unit/engine/test_prompt.py
  • tests/unit/memory/test_retriever.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use from ai_company.observability import get_logger and instantiate logger as logger = get_logger(__name__) in every module with business logic

Files:

  • src/ai_company/observability/events/prompt.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/store_guard.py
  • tests/unit/engine/test_prompt.py
  • src/ai_company/observability/events/memory.py
  • src/ai_company/engine/metrics.py
  • src/ai_company/engine/policy_validation.py
  • tests/unit/memory/test_retriever.py
  • src/ai_company/engine/prompt.py
  • tests/unit/engine/test_policy_validation.py
  • tests/unit/engine/test_agent_engine.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Never use import logging, logging.getLogger(), or print() in application code — only use the centralized logger
Use event name constants from ai_company.observability.events.<domain> modules (e.g., PROVIDER_CALL_START from events.provider, BUDGET_RECORD_ADDED from events.budget) instead of string literals
Always use structured logging with logger.info(EVENT, key=value) format — never use format strings like logger.info('msg %s', val)
No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use except A, B: syntax (no parentheses) per PEP 758 — ruff enforces this on Python 3.14
Add type hints to all public functions and classes; mypy strict mode is enforced
Add Google-style docstrings to all public classes and functions — ruff D rules enforce this
Use immutability principles: create new objects instead of mutating existing ones; for non-Pydantic collections use copy.deepcopy() at construction and MappingProxyType for read-only enforcement
Use frozen Pydantic models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves
Use NotBlankStr from core.types for all identifier and name fields instead of manual whitespace validators, including optional (NotBlankStr | None) and tuple variants
Use @computed_field for derived values in Pydantic models instead of storing and validating redundant fields (e.g., TokenUsage.total_tokens)
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (e.g., multiple tool invocations, parallel agent calls) over bare create_task
Keep functions under 50 lines and files under 800 lines
Keep line length at 88 characters (enforced by ruff)
Handle all errors explicitly; never silently swallow exceptions
Validate at system boundaries (user input, external APIs, config files)
Never use real vendor names (Anthropic, OpenAI, Claude, GPT) in project-owned code, docstrings...

Files:

  • src/ai_company/observability/events/prompt.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/memory/retriever.py
  • src/ai_company/memory/filter.py
  • src/ai_company/memory/store_guard.py
  • src/ai_company/observability/events/memory.py
  • src/ai_company/engine/metrics.py
  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Add test markers @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow to all test files
Maintain 80% minimum code coverage — enforced in CI with pytest --cov=ai_company --cov-fail-under=80
Use asyncio_mode = 'auto' in pytest configuration — no manual @pytest.mark.asyncio needed on async tests
Set test timeout to 30 seconds per test — use @pytest.mark.timeout(30) or configure in pytest.ini
Use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions

Files:

  • tests/unit/engine/test_prompt.py
  • tests/unit/memory/test_retriever.py
  • tests/unit/engine/test_policy_validation.py
  • tests/unit/engine/test_agent_engine.py
🧠 Learnings (11)
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Use event name constants from `ai_company.observability.events.<domain>` modules (e.g., `PROVIDER_CALL_START` from `events.provider`, `BUDGET_RECORD_ADDED` from `events.budget`) instead of string literals

Applied to files:

  • src/ai_company/observability/events/prompt.py
  • src/ai_company/engine/agent_engine.py
  • src/ai_company/observability/events/memory.py
📚 Learning: 2026-03-10T09:29:47.580Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.580Z
Learning: Applies to src/ai_company/**/*.py : Use `except A, B:` syntax (no parentheses) per PEP 758 — ruff enforces this on Python 3.14

Applied to files:

  • src/ai_company/memory/retriever.py
  • src/ai_company/engine/prompt.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Handle all errors explicitly; never silently swallow exceptions

Applied to files:

  • src/ai_company/memory/retriever.py
  • src/ai_company/engine/prompt.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Use `computed_field` for derived values in Pydantic models instead of storing and validating redundant fields (e.g., `TokenUsage.total_tokens`)

Applied to files:

  • src/ai_company/engine/metrics.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Use `NotBlankStr` from `core.types` for all identifier and name fields instead of manual whitespace validators, including optional (`NotBlankStr | None`) and tuple variants

Applied to files:

  • src/ai_company/engine/metrics.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Always read `DESIGN_SPEC.md` before implementing any feature or planning any issue — treat it as the starting point for architecture, data models, and behavior

Applied to files:

  • DESIGN_SPEC.md
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Validate at system boundaries (user input, external APIs, config files)

Applied to files:

  • src/ai_company/engine/policy_validation.py
  • src/ai_company/engine/prompt.py
  • tests/unit/engine/test_policy_validation.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Keep functions under 50 lines and files under 800 lines

Applied to files:

  • src/ai_company/engine/policy_validation.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to src/ai_company/**/*.py : Keep line length at 88 characters (enforced by ruff)

Applied to files:

  • src/ai_company/engine/policy_validation.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to tests/**/*.py : Use `pytest.mark.parametrize` for testing similar cases instead of duplicating test functions

Applied to files:

  • tests/unit/engine/test_agent_engine.py
📚 Learning: 2026-03-10T09:29:47.581Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-10T09:29:47.581Z
Learning: Applies to tests/**/*.py : Maintain 80% minimum code coverage — enforced in CI with `pytest --cov=ai_company --cov-fail-under=80`

Applied to files:

  • tests/unit/engine/test_agent_engine.py
🧬 Code graph analysis (9)
src/ai_company/engine/agent_engine.py (3)
src/ai_company/engine/parallel_models.py (2)
  • agent_id (79-81)
  • task_id (87-89)
src/ai_company/engine/metrics.py (1)
  • prompt_token_ratio (84-92)
src/ai_company/engine/loop_protocol.py (1)
  • total_tokens (73-75)
src/ai_company/memory/retriever.py (1)
src/ai_company/memory/filter.py (5)
  • TagBasedMemoryFilter (51-106)
  • MemoryFilterStrategy (28-48)
  • filter_for_injection (31-43)
  • filter_for_injection (73-97)
  • filter_for_injection (116-134)
src/ai_company/memory/store_guard.py (2)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/memory/models.py (1)
  • MemoryStoreRequest (55-79)
tests/unit/engine/test_prompt.py (4)
tests/unit/engine/conftest.py (4)
  • sample_agent_with_personality (60-87)
  • sample_tool_definitions (128-143)
  • sample_task_with_criteria (103-124)
  • sample_company (175-184)
src/ai_company/engine/prompt.py (1)
  • build_system_prompt (160-260)
src/ai_company/core/task.py (1)
  • Task (45-261)
src/ai_company/core/company.py (1)
  • Company (400-483)
src/ai_company/engine/policy_validation.py (1)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
tests/unit/memory/test_retriever.py (4)
src/ai_company/memory/filter.py (8)
  • PassthroughMemoryFilter (109-143)
  • TagBasedMemoryFilter (51-106)
  • filter_for_injection (31-43)
  • filter_for_injection (73-97)
  • filter_for_injection (116-134)
  • strategy_name (46-48)
  • strategy_name (100-106)
  • strategy_name (137-143)
src/ai_company/memory/models.py (1)
  • MemoryMetadata (20-52)
src/ai_company/memory/retriever.py (2)
  • ContextInjectionStrategy (96-399)
  • strategy_name (393-399)
src/ai_company/memory/retrieval_config.py (1)
  • MemoryRetrievalConfig (20-132)
src/ai_company/engine/prompt.py (5)
src/ai_company/engine/policy_validation.py (1)
  • validate_policy_quality (93-123)
src/ai_company/memory/errors.py (1)
  • MemoryError (13-14)
src/ai_company/engine/parallel_models.py (1)
  • agent_id (79-81)
src/ai_company/core/agent.py (1)
  • AgentIdentity (265-323)
src/ai_company/engine/errors.py (1)
  • PromptBuildError (8-9)
tests/unit/engine/test_policy_validation.py (1)
src/ai_company/engine/policy_validation.py (2)
  • PolicyQualityIssue (71-90)
  • validate_policy_quality (93-123)
tests/unit/engine/test_agent_engine.py (1)
src/ai_company/engine/agent_engine.py (1)
  • run (169-267)
🪛 GitHub Actions: CI
src/ai_company/engine/metrics.py

[error] 106-106: TaskCompletionMetrics.from_run_result validation failed. Prompt tokens (306) exceed tokens_per_task (150).

🪛 LanguageTool
README.md

[typographical] ~24-~24: To join two clauses or introduce examples, consider using an em dash.
Context: ...a migrations - Memory Interface (M5) - Pluggable MemoryBackend protocol with ...

(DASH_RULE)


[style] ~24-~24: Using four (or more) nouns in a row may decrease readability.
Context: .../archival with pluggable strategies and retention enforcement - Coordination Error Taxonomy (M5) - Post-execution classi...

(FOUR_NN)

Comment on lines +67 to +80
@model_validator(mode="after")
def _validate_prompt_tokens(self) -> TaskCompletionMetrics:
    """Ensure prompt_tokens does not exceed tokens_per_task.

    Skipped when ``tokens_per_task`` is 0 (zero-turn runs where the
    system prompt was built but no provider calls were made).
    """
    if self.tokens_per_task > 0 and self.prompt_tokens > self.tokens_per_task:
        msg = (
            f"prompt_tokens ({self.prompt_tokens}) cannot exceed "
            f"tokens_per_task ({self.tokens_per_task})"
        )
        raise ValueError(msg)
    return self

⚠️ Potential issue | 🔴 Critical

Do not validate estimated prompt tokens as an exact upper bound.

SystemPrompt.estimated_tokens is heuristic data. CI is already failing because the estimate can exceed tokens_per_task, so this validator turns otherwise valid runs into runtime errors. Bound the derived ratio instead of rejecting the model.

💡 Minimal fix
-    @model_validator(mode="after")
-    def _validate_prompt_tokens(self) -> TaskCompletionMetrics:
-        """Ensure prompt_tokens does not exceed tokens_per_task.
-
-        Skipped when ``tokens_per_task`` is 0 (zero-turn runs where the
-        system prompt was built but no provider calls were made).
-        """
-        if self.tokens_per_task > 0 and self.prompt_tokens > self.tokens_per_task:
-            msg = (
-                f"prompt_tokens ({self.prompt_tokens}) cannot exceed "
-                f"tokens_per_task ({self.tokens_per_task})"
-            )
-            raise ValueError(msg)
-        return self
-
     @computed_field  # type: ignore[prop-decorator]
     @property
     def prompt_token_ratio(self) -> float:
@@
-        if self.tokens_per_task > 0:
-            return self.prompt_tokens / self.tokens_per_task
+        if self.tokens_per_task > 0:
+            capped_prompt_tokens = min(self.prompt_tokens, self.tokens_per_task)
+            return capped_prompt_tokens / self.tokens_per_task
         return 0.0

Also applies to: 82-92

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/metrics.py` around lines 67 - 80, The validator
_validate_prompt_tokens should not raise a ValueError when
SystemPrompt.estimated_tokens (prompt_tokens) exceeds tokens_per_task; instead,
compute and store a bounded ratio so the model isn't rejected: when
self.tokens_per_task > 0 calculate ratio = self.prompt_tokens /
self.tokens_per_task, set a new or existing attribute like
self.prompt_token_ratio = min(ratio, 1.0) (and optionally clamp prompt_tokens to
self.tokens_per_task if you must keep it as an upper bound), remove the raise,
and mirror the same change for the similar validator handling completion tokens
(the validator around the 82-92 region) so both use a capped ratio rather than
throwing.

Comment on lines 197 to +203
_validate_max_tokens(agent, max_tokens)
_validate_org_policies(agent, org_policies)

# Advisory only — issues are logged but never block prompt construction.
if org_policies:
    try:
        validate_policy_quality(org_policies)

⚠️ Potential issue | 🟠 Major

Normalize org_policies before walking it.

_validate_org_policies() consumes the incoming object in place and returns nothing. A one-shot iterable gets exhausted before advisory validation/rendering, and None or another non-iterable still leaks a raw TypeError before the main PromptBuildError handling. Convert once to a tuple, validate that normalized value, and reuse it for the rest of build_system_prompt().

As per coding guidelines "Validate at system boundaries (user input, external APIs, config files)".

Also applies to: 279-301

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/engine/prompt.py` around lines 197 - 203, Normalize
org_policies at the start of build_system_prompt by converting the incoming
org_policies into a stable iterable (e.g., org_policies = tuple(org_policies or
())) and use that normalized variable for all subsequent calls and iterations;
call _validate_org_policies(normalized_org_policies) and
validate_policy_quality(normalized_org_policies) (and any rendering/iteration
later) instead of the original parameter so one-shot iterables or None don't get
consumed or raise TypeError—apply the same normalization and reuse in the later
block around the validate_policy_quality / rendering logic (the section around
lines 279-301).

Comment on lines +62 to +71
def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
    if not required_tag.strip():
        msg = "required_tag must be a non-empty string"
        raise ValueError(msg)
    self._required_tag = required_tag
    logger.debug(
        MEMORY_FILTER_INIT,
        strategy=self.strategy_name,
        required_tag=required_tag,
    )

⚠️ Potential issue | 🟠 Major

Normalize and validate required_tag before storing it.

required_tag.strip() will throw AttributeError for non-string config values, and " non-inferable " currently passes validation but never matches any stored tag. A trivial config typo can therefore silently filter out every memory.

🛠️ Suggested fix
     def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
-        if not required_tag.strip():
+        if not isinstance(required_tag, str):
+            msg = "required_tag must be a non-empty string"
+            raise ValueError(msg)
+        normalized_tag = required_tag.strip()
+        if not normalized_tag:
             msg = "required_tag must be a non-empty string"
             raise ValueError(msg)
-        self._required_tag = required_tag
+        self._required_tag = normalized_tag
         logger.debug(
             MEMORY_FILTER_INIT,
             strategy=self.strategy_name,
-            required_tag=required_tag,
+            required_tag=self._required_tag,
         )
As per coding guidelines "Validate at system boundaries (user input, external APIs, config files)".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/memory/filter.py` around lines 62 - 71, The constructor for
required_tag accepts non-string values and stores unnormalized input, causing
AttributeError on non-strings and mismatches for values with surrounding
whitespace; update the __init__ of the class that defines required_tag to first
enforce type (raise TypeError or ValueError if not instance of str), normalize
the value by calling stripped = required_tag.strip() and then validate stripped
is non-empty (raise ValueError if empty), assign self._required_tag = stripped,
and ensure the logger call (MEMORY_FILTER_INIT, strategy=self.strategy_name)
uses the normalized stripped value for required_tag so stored/logged tag matches
how tags are compared elsewhere.
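Taken together, the hardened constructor could look like the following self-contained sketch. The class body, the NON_INFERABLE_TAG value, and the logging call are illustrative assumptions, not the project's actual code:

```python
import logging

logger = logging.getLogger(__name__)

# Assumed value; the real NON_INFERABLE_TAG constant lives in the project.
NON_INFERABLE_TAG = "non-inferable"


class TagBasedMemoryFilter:
    """Minimal sketch of the hardened constructor from the suggested fix."""

    strategy_name = "tag_based"

    def __init__(self, required_tag: str = NON_INFERABLE_TAG) -> None:
        # Validate at the boundary: reject non-strings before calling .strip().
        if not isinstance(required_tag, str):
            raise ValueError("required_tag must be a non-empty string")
        normalized_tag = required_tag.strip()
        if not normalized_tag:
            raise ValueError("required_tag must be a non-empty string")
        # Store and log the normalized value so it matches stored tags.
        self._required_tag = normalized_tag
        logger.debug(
            "memory_filter_init strategy=%s required_tag=%s",
            self.strategy_name,
            self._required_tag,
        )
```

With this shape, " non-inferable " normalizes to a matching tag instead of silently filtering out every memory.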

Comment on lines +21 to +36
def validate_memory_tags(request: MemoryStoreRequest) -> None:
"""Log a warning when the non-inferable tag is missing.

This is advisory only — the store operation is never blocked.
Wire into ``MemoryBackend.store()`` callers to activate enforcement.

Args:
request: The memory store request to validate.
"""
if NON_INFERABLE_TAG not in request.metadata.tags:
logger.warning(
MEMORY_FILTER_STORE_MISSING_TAG,
category=request.category.value,
content_length=len(request.content),
tags=request.metadata.tags,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 '\bvalidate_memory_tags\s*\(' src/ai_company
printf '\n-- store implementations --\n'
rg -n -C3 'async def store\s*\(' src/ai_company

Repository: Aureliolo/ai-company

Length of output: 860


🏁 Script executed:

# Search for all references to validate_memory_tags (including imports)
rg -n 'validate_memory_tags' src/ai_company

# Find concrete implementations of store()
fd -e py -x grep -l 'class.*MemoryBackend\|def store' {} + | head -20 | xargs -I {} sh -c 'echo "=== {} ===" && grep -n "def store\|class.*Memory" {} | head -20'

Repository: Aureliolo/ai-company

Length of output: 28055


🏁 Script executed:

# Show the actual store() implementation in the abstract protocol
cat -n src/ai_company/memory/protocol.py | sed -n '77,90p'

# Check a concrete implementation
cat -n tests/unit/memory/test_protocol.py | sed -n '45,60p'

Repository: Aureliolo/ai-company

Length of output: 1177


This validation helper is not wired into any store() implementation.

The validate_memory_tags() function is defined but has no call sites in the codebase. Without concrete store implementations invoking it before persistence, the missing-tag warning event never fires and the validation is completely inert. Either add the call to each store() path (or a shared façade) or remove the unused helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/memory/store_guard.py` around lines 21 - 36, The helper
validate_memory_tags(request: MemoryStoreRequest) is never invoked so the
NON_INFERABLE_TAG warning (MEMORY_FILTER_STORE_MISSING_TAG) is inert; fix this
by calling validate_memory_tags(request) at the start of every memory
persistence path — either add the call into each concrete
MemoryBackend.store(...) implementation or place it in the shared façade/wrapper
that all stores use (ensure any class implementing MemoryBackend calls
validate_memory_tags before persisting), so the logger warning will fire when
tags are missing.
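One way to make the guard fire on every persistence path is a thin façade that all callers go through. The sketch below is a simplified, synchronous stand-in (the real request model and async MemoryBackend protocol differ); names are illustrative:

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

NON_INFERABLE_TAG = "non-inferable"  # assumed value


@dataclass
class MemoryStoreRequest:
    """Simplified stand-in for the project's request model."""
    content: str
    category: str
    tags: tuple[str, ...] = ()


def validate_memory_tags(request: MemoryStoreRequest) -> None:
    """Advisory-only check: warn when the tag is missing, never block."""
    if NON_INFERABLE_TAG not in request.tags:
        logger.warning(
            "memory_filter_store_missing_tag category=%s content_length=%d",
            request.category,
            len(request.content),
        )


class GuardedStore:
    """Façade that runs the advisory guard before delegating to a backend."""

    def __init__(self, backend) -> None:
        self._backend = backend

    def store(self, request: MemoryStoreRequest) -> None:
        validate_memory_tags(request)  # warning fires; store still proceeds
        self._backend.store(request)
```

Routing all stores through one façade keeps the advisory semantics in a single place instead of repeating the call in every backend.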

Comment on lines +509 to +531
@pytest.mark.unit
def test_empty_org_policy_raises(
self,
sample_agent_with_personality: AgentIdentity,
) -> None:
"""Empty string policy is rejected with PromptBuildError."""
with pytest.raises(PromptBuildError, match="org_policies"):
build_system_prompt(
agent=sample_agent_with_personality,
org_policies=("valid policy must exist", ""),
)

@pytest.mark.unit
def test_whitespace_only_org_policy_raises(
self,
sample_agent_with_personality: AgentIdentity,
) -> None:
"""Whitespace-only policy is rejected with PromptBuildError."""
with pytest.raises(PromptBuildError, match="org_policies"):
build_system_prompt(
agent=sample_agent_with_personality,
org_policies=(" ",),
)

🧹 Nitpick | 🔵 Trivial

Parametrize the invalid org_policies cases.

These two tests hit the same branch and only vary by the bad value, so collapsing them into one parametrized case keeps the suite tighter.

As per coding guidelines "Use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/engine/test_prompt.py` around lines 509 - 531, Combine the two
tests into one parametrized test that iterates over the invalid org_policies
values; replace the separate test_empty_org_policy_raises and
test_whitespace_only_org_policy_raises with a single
`@pytest.mark.parametrize`-based test that calls
build_system_prompt(agent=sample_agent_with_personality, org_policies=<param>)
and asserts pytest.raises(PromptBuildError, match="org_policies"); keep the same
docstring and use the same sample_agent_with_personality fixture and
PromptBuildError type to ensure behavior is unchanged.

Copilot AI left a comment

Pull request overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.



Comment on lines 450 to 483
def _compute_sections(
*,
task: Task | None,
available_tools: tuple[ToolDefinition, ...],
company: Company | None,
org_policies: tuple[str, ...] = (),
) -> tuple[str, ...]:
"""Determine which sections are present in the rendered prompt.

The default template omits the tools section per D22 (non-inferable
principle), so ``available_tools`` is not considered here.

Args:
task: Optional task context.
available_tools: Tool definitions.
company: Optional company context.
org_policies: Company-wide policy texts.

Returns:
Tuple of section names that are included.
"""
sections: list[str] = [
_SECTION_IDENTITY,
_SECTION_PERSONALITY,
_SECTION_SKILLS,
_SECTION_AUTHORITY,
]
if org_policies:
sections.append(_SECTION_ORG_POLICIES)
# Autonomy follows org_policies in the template.
sections.append(_SECTION_AUTONOMY)
if task is not None:
sections.append(_SECTION_TASK)
if available_tools:
sections.append(_SECTION_TOOLS)
if company is not None:
sections.append(_SECTION_COMPANY)
return tuple(sections)
Copilot AI Mar 10, 2026

_compute_sections() no longer considers tools at all, so SystemPrompt.sections won’t reflect tool inclusion even when a custom template renders tools (since available_tools is still provided to the template context). If sections is used for downstream diagnostics/trimming analytics, this becomes misleading. Consider restoring tools section tracking when available_tools is non-empty (or making section computation aware of whether a custom template actually renders tools).

Comment on lines +129 to +137
if memory_filter is None and config.non_inferable_only:
memory_filter = TagBasedMemoryFilter()
elif memory_filter is not None and config.non_inferable_only:
logger.debug(
MEMORY_RETRIEVAL_START,
note="explicit memory_filter overrides non_inferable_only config",
filter_strategy=getattr(memory_filter, "strategy_name", "unknown"),
)
self._memory_filter = memory_filter
Copilot AI Mar 10, 2026

This debug log uses the MEMORY_RETRIEVAL_START event even though it’s reporting an initialization/config override (“explicit memory_filter overrides…”). Reusing the start event here can make retrieval telemetry noisy or ambiguous. Consider logging this with a filter-specific event (e.g., MEMORY_FILTER_INIT) or introducing a dedicated override event constant.

Comment on lines +126 to +128
_ACTION_VERB_RE: re.Pattern[str] = re.compile(
r"\b(?:" + "|".join(_ACTION_VERBS) + r")\b",
)
Copilot AI Mar 10, 2026

_ACTION_VERB_RE is built from a frozenset, so the alternation order in the generated regex depends on hash iteration order and can vary across processes. Behavior is equivalent, but it makes the compiled pattern non-deterministic for debugging and can cause avoidable diffs if the pattern string is ever surfaced. Consider building it from sorted(_ACTION_VERBS) for deterministic output.

Comment on lines +67 to +80
@model_validator(mode="after")
def _validate_prompt_tokens(self) -> TaskCompletionMetrics:
"""Ensure prompt_tokens does not exceed tokens_per_task.

Skipped when ``tokens_per_task`` is 0 (zero-turn runs where the
system prompt was built but no provider calls were made).
"""
if self.tokens_per_task > 0 and self.prompt_tokens > self.tokens_per_task:
msg = (
f"prompt_tokens ({self.prompt_tokens}) cannot exceed "
f"tokens_per_task ({self.tokens_per_task})"
)
raise ValueError(msg)
return self
Copilot AI Mar 10, 2026

prompt_tokens is an estimate (char/4) while tokens_per_task is provider-reported actual usage; the estimate can legitimately exceed the actual total (e.g., short responses, missing/underreported token usage, or estimator overestimation). Raising here can crash AgentEngine._log_completion() because TaskCompletionMetrics.from_run_result() isn’t guarded. Consider removing this validator, downgrading it to a warning log, or clamping the ratio calculation to avoid throwing during metrics logging.
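A non-throwing alternative is to clamp when deriving the ratio instead of validating the raw counts. This sketch uses a plain property on a dataclass stand-in rather than the project's pydantic model and @computed_field:

```python
from dataclasses import dataclass


@dataclass
class TaskCompletionMetrics:
    """Simplified stand-in; the real model is a pydantic BaseModel."""

    prompt_tokens: int    # chars/4 estimate of the system prompt
    tokens_per_task: int  # provider-reported actual usage

    @property
    def prompt_token_ratio(self) -> float:
        """Prompt share of total tokens, clamped to [0.0, 1.0].

        The estimate can legitimately exceed the actual total, so clamp
        instead of raising during metrics logging.
        """
        if self.tokens_per_task <= 0:
            return 0.0
        return min(self.prompt_tokens / self.tokens_per_task, 1.0)
```

Clamping keeps _log_completion() crash-free even when the heuristic overestimates or the provider underreports usage.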

Comment on lines 147 to 154
# Sections trimmed when over token budget, least critical first.
# Tools section was removed from the default template per D22
# (non-inferable principle).
_TRIMMABLE_SECTIONS = (
_SECTION_COMPANY,
_SECTION_TOOLS,
_SECTION_TASK,
_SECTION_ORG_POLICIES,
)
Copilot AI Mar 10, 2026

Tools are still passed in the template context for custom templates (context["tools"]), but tools were removed from _TRIMMABLE_SECTIONS. This means max_tokens trimming can no longer drop tools even if a custom template renders them, and the prompt may remain over budget when tools dominate. Consider keeping a tools-trimming step (e.g., clear available_tools when over budget) even if the default template doesn’t render tools, or explicitly document/guard that max_tokens is best-effort and won’t trim tools for custom templates.

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/unit/engine/test_agent_engine_lifecycle.py (2)

350-380: 🧹 Nitpick | 🔵 Trivial

Use @pytest.mark.parametrize to combine similar test cases.

These two tests are nearly identical, differing only in the timeout_seconds value. Per coding guidelines, use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions.

♻️ Proposed refactor
-    async def test_zero_timeout_raises(
-        self,
-        sample_agent_with_personality: AgentIdentity,
-        sample_task_with_criteria: Task,
-        mock_provider_factory: type[MockCompletionProvider],
-    ) -> None:
-        provider = mock_provider_factory([])
-        engine = AgentEngine(provider=provider)
-
-        with pytest.raises(ValueError, match="timeout_seconds must be > 0"):
-            await engine.run(
-                identity=sample_agent_with_personality,
-                task=sample_task_with_criteria,
-                timeout_seconds=0,
-            )
-
-    async def test_negative_timeout_raises(
+    @pytest.mark.parametrize("timeout_seconds", [0, -1.0])
+    async def test_invalid_timeout_raises(
         self,
         sample_agent_with_personality: AgentIdentity,
         sample_task_with_criteria: Task,
         mock_provider_factory: type[MockCompletionProvider],
+        timeout_seconds: float,
     ) -> None:
+        """Zero or negative timeout_seconds raises ValueError."""
         provider = mock_provider_factory([])
         engine = AgentEngine(provider=provider)
 
         with pytest.raises(ValueError, match="timeout_seconds must be > 0"):
             await engine.run(
                 identity=sample_agent_with_personality,
                 task=sample_task_with_criteria,
-                timeout_seconds=-1.0,
+                timeout_seconds=timeout_seconds,
             )

As per coding guidelines: "Use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/engine/test_agent_engine_lifecycle.py` around lines 350 - 380,
Combine the two duplicate tests test_zero_timeout_raises and
test_negative_timeout_raises into a single parametrized test using
pytest.mark.parametrize that iterates over the invalid timeout_seconds values (0
and -1.0); keep the same setup (create provider via mock_provider_factory and
instantiate AgentEngine) and the same assertion with pytest.raises(ValueError,
match="timeout_seconds must be > 0") around await
engine.run(identity=sample_agent_with_personality,
task=sample_task_with_criteria, timeout_seconds=timeout), replacing the two
separate functions with one function (e.g., test_invalid_timeouts_raise) that
accepts the parametrized timeout parameter.

396-416: 🧹 Nitpick | 🔵 Trivial

Consider adding assertion for prompt_token_ratio.

The test now provides explicit token counts (input_tokens=400, output_tokens=200), which is good for validating the new metrics. However, since this PR introduces prompt_token_ratio as a @computed_field, consider adding an assertion to verify it's computed correctly.

💡 Suggested assertion
         assert metrics.duration_seconds > 0
         assert metrics.agent_id == str(sample_agent_with_personality.id)
         assert metrics.task_id == sample_task_with_criteria.id
+        # Validate prompt_token_ratio: input_tokens / total_tokens = 400 / 600
+        assert metrics.prompt_tokens == 400
+        assert 0.66 < metrics.prompt_token_ratio < 0.67
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/engine/test_agent_engine_lifecycle.py` around lines 396 - 416, The
test creates a completion with input_tokens=400 and output_tokens=200 but
doesn't assert the new computed field prompt_token_ratio; update the test after
computing metrics (where TaskCompletionMetrics.from_run_result(result) is
called) to assert that metrics.prompt_token_ratio equals 400 / (400 + 200)
(i.e., ~0.6667) using a tolerant comparison (pytest.approx or similar). Locate
the block that builds response, provider, engine, calls engine.run and computes
metrics, and add the assertion referencing metrics.prompt_token_ratio to
validate the new `@computed_field`.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@tests/unit/engine/test_agent_engine_lifecycle.py`:
- Around line 350-380: Combine the two duplicate tests test_zero_timeout_raises
and test_negative_timeout_raises into a single parametrized test using
pytest.mark.parametrize that iterates over the invalid timeout_seconds values (0
and -1.0); keep the same setup (create provider via mock_provider_factory and
instantiate AgentEngine) and the same assertion with pytest.raises(ValueError,
match="timeout_seconds must be > 0") around await
engine.run(identity=sample_agent_with_personality,
task=sample_task_with_criteria, timeout_seconds=timeout), replacing the two
separate functions with one function (e.g., test_invalid_timeouts_raise) that
accepts the parametrized timeout parameter.
- Around line 396-416: The test creates a completion with input_tokens=400 and
output_tokens=200 but doesn't assert the new computed field prompt_token_ratio;
update the test after computing metrics (where
TaskCompletionMetrics.from_run_result(result) is called) to assert that
metrics.prompt_token_ratio equals 400 / (400 + 200) (i.e., ~0.6667) using a
tolerant comparison (pytest.approx or similar). Locate the block that builds
response, provider, engine, calls engine.run and computes metrics, and add the
assertion referencing metrics.prompt_token_ratio to validate the new
`@computed_field`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 37f4f8bc-2043-418a-8b8c-c7c2b7cd625f

📥 Commits

Reviewing files that changed from the base of the PR and between 230b8a4 and c6801d0.

📒 Files selected for processing (1)
  • tests/unit/engine/test_agent_engine_lifecycle.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

Use from ai_company.observability import get_logger and instantiate logger as logger = get_logger(__name__) in every module with business logic

Files:

  • tests/unit/engine/test_agent_engine_lifecycle.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Add test markers @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, or @pytest.mark.slow to all test files
Maintain 80% minimum code coverage — enforced in CI with pytest --cov=ai_company --cov-fail-under=80
Use asyncio_mode = 'auto' in pytest configuration — no manual @pytest.mark.asyncio needed on async tests
Set test timeout to 30 seconds per test — use @pytest.mark.timeout(30) or configure in pytest.ini
Use @pytest.mark.parametrize for testing similar cases instead of duplicating test functions

Files:

  • tests/unit/engine/test_agent_engine_lifecycle.py
🔇 Additional comments (1)
tests/unit/engine/test_agent_engine_lifecycle.py (1)

1-24: LGTM!

The test file is well-structured with comprehensive coverage of AgentEngine lifecycle scenarios. All test classes have the required @pytest.mark.unit marker, and the global timeout is properly configured via pytestmark.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to harden the 'non-inferable principle' implementation across performance metrics, memory filtering, and prompt generation, with well-structured changes and comprehensive tests. However, it introduces several critical security and stability issues. The memory retrieval pipeline's fail-open mechanism on filter errors could lead to sensitive data exposure in agent prompts. Furthermore, the use of invalid Python 3 exception handling syntax will cause runtime SyntaxErrors and potential Denial of Service. Additionally, sensitive memory content is being logged in warning messages, violating secure logging practices. These issues stem partly from adherence to an incorrect project style guide (CLAUDE.md) that mandates Python 2 except syntax, which needs to be updated.

I am having trouble creating individual review comments, so my feedback is listed inline below.

src/ai_company/memory/retriever.py (273-288)

security-high high

This section of the memory retrieval pipeline implements a fail-open mechanism. If the memory_filter raises an exception, the system logs a warning and proceeds with unfiltered memories, which could lead to sensitive data injection into agent prompts, bypassing the 'non-inferable' principle. Furthermore, the except builtins_MemoryError, RecursionError: syntax is invalid in Python 3, causing a SyntaxError and potential Denial of Service. This incorrect syntax adheres to an outdated rule in the project's style guide (CLAUDE.md, line 70), which should be updated to recommend the correct Python 3 syntax except (A, B):.

        if self._memory_filter is not None:
          try:
              ranked = self._memory_filter.filter_for_injection(ranked)
          except (builtins_MemoryError, RecursionError):
              raise
          except Exception:
              logger.warning(
                  MEMORY_RETRIEVAL_DEGRADED,
                  source="memory_filter",
                  agent_id=agent_id,
                  filter_strategy=getattr(
                      self._memory_filter, "strategy_name", "unknown"
                  ),
                  exc_info=True,
              )
              # Fail securely: return empty if filter fails
              return ()

src/ai_company/engine/prompt.py (202)

security-high high

The except MemoryError, RecursionError: syntax is invalid in Python 3 for catching multiple exceptions. This will cause a SyntaxError at runtime, leading to a denial of service. This issue stems from adherence to an incorrect rule in the project's style guide (CLAUDE.md, line 70), which mandates Python 2 except syntax. The style guide should be updated to recommend the correct Python 3 syntax, which is to group exceptions in a tuple, e.g., except (MemoryError, RecursionError):.

        except (MemoryError, RecursionError):

src/ai_company/memory/store_guard.py (30-36)

security-medium medium

Logging a preview of memory content (content[:80]) when a tag is missing poses a risk of sensitive data exposure. Memories may contain PII, secrets, or internal data that should not be written to application logs.

    if NON_INFERABLE_TAG not in request.metadata.tags:
        logger.warning(
            MEMORY_FILTER_STORE_MISSING_TAG,
            category=request.category.value,
            tags=request.metadata.tags,
        )

…reptile

- Cap prompt_tokens instead of rejecting when heuristic exceeds actual (#2)
- Log policy_length instead of policy content to avoid leaks (#6)
- Sort _ACTION_VERBS for deterministic regex alternation (#8)
- Use PROMPT_POLICY_VALIDATION_FAILED event for advisory failures (#12)
- Add isinstance check and strip whitespace in TagBasedMemoryFilter (#13)
- Use MEMORY_FILTER_INIT event for filter-init log paths (#14, #16)
- Remove content_preview from store_guard warning log (#18)
- Track tools section conditionally for custom templates (#10)
- Reconcile enforced vs advisory wording in DESIGN_SPEC (#1)
- Inject fixed estimated_tokens in prompt-ratio test (#19)
- Parametrize timeout and org_policies tests (#20, #22)
- Add prompt_token_ratio assertion in lifecycle test (#21)
@Aureliolo Aureliolo merged commit 02b5f4e into main Mar 10, 2026
7 of 8 checks passed
@Aureliolo Aureliolo deleted the perf/non-inferable-principle branch March 10, 2026 11:50
Comment on lines 148 to 155
# Sections trimmed when over token budget, least critical first.
# Tools section was removed from the default template per D22
# (non-inferable principle), but custom templates may still render tools.
_TRIMMABLE_SECTIONS = (
_SECTION_COMPANY,
_SECTION_TOOLS,
_SECTION_TASK,
_SECTION_ORG_POLICIES,
)

Custom-template tools section not trimmable under token budget

_SECTION_TOOLS was intentionally removed from _TRIMMABLE_SECTIONS because the default template no longer renders tools (D22). However, _SECTION_TOOLS is still tracked in sections when available_tools is non-empty and a custom template is active (see _compute_sections), meaning a custom template that references {{ tools }} can land in sections.

The consequence is that if a custom template renders the tools section and the rendered prompt exceeds max_tokens, the trimming loop will cycle through _SECTION_COMPANY → _SECTION_TASK → _SECTION_ORG_POLICIES and then fall through without ever nullifying available_tools. The prompt can end up permanently over budget for custom-template callers who pass a large tool list.

Before this PR the order was company → tools → task → org_policies. If the decision is that custom-template tools should also be trimmable, _SECTION_TOOLS needs to be re-added to _TRIMMABLE_SECTIONS (last or early in priority) along with a corresponding elif section == _SECTION_TOOLS and available_tools: branch in _trim_sections.

If the intended policy is that custom templates own their own trimming responsibility, a doc comment to that effect in build_system_prompt's docstring would prevent future confusion.

Path: src/ai_company/engine/prompt.py, lines 148-155

Aureliolo added a commit that referenced this pull request Mar 10, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)
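The `@computed_field` convention behind this entry (and the `TokenUsage.total_tokens` change in Features) can be sketched as follows. This is a minimal illustration assuming Pydantic v2; the `prompt_token_ratio` zero-token guard shown here is an assumption about how the boundary case is handled, not the project's exact code:

```python
from pydantic import BaseModel, computed_field


class TokenUsage(BaseModel):
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @computed_field  # derived value, serialized but never stored
    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    @computed_field  # ratio of prompt tokens to all tokens; 0.0 when empty
    @property
    def prompt_token_ratio(self) -> float:
        total = self.prompt_tokens + self.completion_tokens
        return self.prompt_tokens / total if total else 0.0


usage = TokenUsage(prompt_tokens=75, completion_tokens=25)
print(usage.total_tokens)        # 100
print(usage.prompt_token_ratio)  # 0.75
```

Because both values are computed fields rather than stored fields, they appear in `model_dump()` output but cannot drift out of sync with the underlying counts.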


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
Aureliolo added a commit that referenced this pull request Mar 11, 2026
🤖 I have created a release *beep* *boop*
---


## [0.1.0](v0.0.0...v0.1.0) (2026-03-11)


### Features

* add autonomy levels and approval timeout policies
([#42](#42),
[#126](#126))
([#197](#197))
([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and
approval decisions
([#186](#186))
([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot)
([#63](#63))
([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking
([#67](#67))
([#185](#185))
([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline
([#269](#269))
([435bdfe](435bdfe)),
closes [#267](#267)
* add coordination error taxonomy classification pipeline
([#146](#146))
([#181](#181))
([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies
([#175](#175))
([ce924fa](ce924fa)),
closes [#173](#173)
* add design specification, license, and project setup
([8669a09](8669a09))
* add env var substitution and config file auto-discovery
([#77](#77))
([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup
([#140](#140))
([09619cb](09619cb)),
closes [#139](#139)
* add HR engine and performance tracking
([#45](#45),
[#47](#47))
([#193](#193))
([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill
([#119](#119))
([deecc39](deecc39))
* add mandatory JWT + API key authentication
([#256](#256))
([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline
([#41](#41))
([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events
([#180](#180))
([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events
([#32](#32))
([46cfdd4](46cfdd4))
* add pluggable output scan response policies
([#263](#263))
([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation
([#36](#36))
([f753779](f753779))
* add progressive trust and promotion/demotion subsystems
([#43](#43),
[#49](#49))
([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience
([#100](#100))
([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker
integration ([#40](#40))
([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival
([#125](#125),
[#48](#48))
([4a0832b](4a0832b))
* design unified provider interface
([#86](#86))
([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance
([#80](#80),
[#81](#81),
[#84](#84))
([15a9134](15a9134))
* implement agent runtime state vs immutable config split
([#115](#115))
([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator
([#11](#11))
([#143](#143))
([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence
([#279](#279))
([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results)
([#15](#15))
([c51068b](c51068b))
* implement built-in file system tools
([#18](#18))
([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and
messenger ([#157](#157))
([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets
([#85](#85))
([cbf1496](cbf1496))
* implement conflict resolution protocol
([#122](#122))
([#166](#166))
([e03f9f2](e03f9f2))
* implement core entity and role system models
([#69](#69))
([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy
([#149](#149))
([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call
categorization
([#134](#134),
[#135](#135))
([#159](#159))
([9b2699f](9b2699f))
* implement enterprise logging system with structlog
([#73](#73))
([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy
([#130](#130))
([6592515](6592515))
* implement hierarchical delegation and loop prevention
([#12](#12),
[#17](#17))
([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry
([#88](#88))
([ae3f18b](ae3f18b)),
closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation
([#174](#174))
([aa0eefe](aa0eefe))
* implement meeting protocol system
([#123](#123))
([ee7caca](ee7caca))
* implement message and communication domain models
([#74](#74))
([560a5d2](560a5d2))
* implement model routing engine
([#99](#99))
([d3c250b](d3c250b))
* implement parallel agent execution
([#22](#22))
([#161](#161))
([65940b3](65940b3))
* implement per-call cost tracking service
([#7](#7))
([#102](#102))
([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction
([#105](#105))
([934dd85](934dd85))
* implement single-task execution lifecycle
([#21](#21))
([#144](#144))
([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation
([#131](#131))
([#153](#153))
([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies
([#172](#172))
([c7f1b26](c7f1b26)),
closes [#26](#26)
[#30](#30)
* implement task decomposition and routing engine
([#14](#14))
([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models
([#71](#71))
([81eabf1](81eabf1))
* implement tool permission checking
([#16](#16))
([833c190](833c190))
* implement YAML config loader with Pydantic validation
([#59](#59))
([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation
([#75](#75))
([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout
([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout
([#62](#62))
([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6)
([#189](#189))
([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field
([#118](#118))
([c0bab18](c0bab18)),
closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all
([#137](#137))
([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes
([#64](#64))
([f581749](f581749))
* wire all modules into observability system
([#97](#97))
([f7a0617](f7a0617))


### Bug Fixes

* address Greptile post-merge review findings from PRs
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175)
([#176](#176))
([c5ca929](c5ca929))
* address post-merge review feedback from PRs
[#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167)
([#170](#170))
([3bf897a](3bf897a)),
closes [#169](#169)
* enforce strict mypy on test files
([#89](#89))
([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner
([#50](#50),
[#53](#53))
([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements
([#150](#150))
([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience
([#155](#155))
([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes
([#164](#164))
([c02832a](c02832a))
* pre-PR review fixes for post-merge findings
([#183](#183))
([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format
([#286](#286))
([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries
([#117](#117))
([7e5e861](7e5e861))


### Performance

* harden non-inferable principle implementation
([#195](#195))
([02b5f4e](02b5f4e)),
closes [#188](#188)
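Part of this hardening is word-boundary matching for action verbs in policy validation, so that a verb embedded in a longer word no longer counts as a match. A minimal sketch under stated assumptions: `ACTION_VERBS` and `detect_action_verbs` are illustrative names, not the project's actual API.

```python
import re

# Hypothetical verb list for illustration; the project's real list differs.
ACTION_VERBS = ["create", "delete", "modify"]

# \b anchors ensure verbs match only as whole words, so substrings
# inside longer words (e.g. "create" inside "Recreated") are ignored.
_VERB_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, ACTION_VERBS)) + r")\b",
    re.IGNORECASE,
)


def detect_action_verbs(policy_text: str) -> set[str]:
    """Return the action verbs present as whole words, lowercased."""
    return {match.lower() for match in _VERB_RE.findall(policy_text)}


print(detect_action_verbs("Agents may create files."))      # {'create'}
print(detect_action_verbs("Recreated artifacts persist."))  # set()
```

`re.escape` keeps the pattern safe if a configured verb ever contains regex metacharacters, and compiling once at module load avoids rebuilding the pattern on every validation call.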


### Refactoring

* adopt NotBlankStr across all models
([#108](#108))
([#120](#120))
([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models
([#111](#111))
([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and
review fixes
([#182](#182))
([c107bf9](c107bf9))
* harden personality profiles, department validation, and template
rendering ([#158](#158))
([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop
([#124](#124))
([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules
([#136](#136))
([e9cba89](e9cba89))


### Documentation

* add ADR-001 memory layer evaluation and selection
([#178](#178))
([db3026f](db3026f)),
closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC
([#145](#145))
([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation
([#65](#65))
([55c1025](55c1025)),
closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions
([#127](#127))
([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols
([#128](#128))
([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols
([#121](#121))
([6832db6](6832db6))
* finalize 23 design decisions (ADR-002)
([#190](#190))
([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency
review agent
([#114](#114))
([99766ee](99766ee))


### Tests

* add e2e single agent integration tests
([#24](#24))
([#156](#156))
([f566fb4](f566fb4))
* add provider adapter integration tests
([#90](#90))
([40a61f4](40a61f4))


### CI/CD

* add Release Please for automated versioning and GitHub Releases
([#278](#278))
([a488758](a488758))
* bump actions/checkout from 4 to 6
([#95](#95))
([1897247](1897247))
* bump actions/upload-artifact from 4 to 7
([#94](#94))
([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2
([#271](#271))
([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0
([#273](#273))
([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0
([#272](#272))
([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0
([#270](#270))
([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0
([#274](#274))
([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0
([#275](#275))
([29dd16c](29dd16c))
* harden CI/CD pipeline
([#92](#92))
([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers
([#277](#277))
([aba48af](aba48af))


### Maintenance

* add /worktree skill for parallel worktree management
([#171](#171))
([951e337](951e337))
* add design spec context loading to research-link skill
([8ef9685](8ef9685))
* add post-merge-cleanup skill
([#70](#70))
([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md
([#103](#103))
([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md
([#101](#101))
([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1
([#191](#191))
([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group
([#96](#96))
([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5
([a49ee46](a49ee46))
* fix M0 audit items
([#66](#66))
([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1
([#282](#282))
([2f4703d](2f4703d))
* pin setup-uv action to full SHA
([#281](#281))
([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests,
hookify rules
([#148](#148))
([c57a6a9](c57a6a9))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>


Successfully merging this pull request may close these issues:

perf: apply non-inferable-only principle to system prompts and memory injection