docs: add crash recovery, sandboxing, analytics, and testing decisions by Aureliolo · Pull Request #127 · Aureliolo/synthorg

Aureliolo · 2026-03-06T10:57:02Z

Summary

Addresses open questions raised by three external reviews of the design spec. Adds four new sections and updates existing tables with resolved decisions.

§6.6 Agent Crash Recovery — pluggable RecoveryStrategy protocol: fail-and-reassign (M3 MVP), checkpoint recovery (M4/M5), with environment reconciliation on resume
§10.5 LLM Call Analytics — incremental build: proxy overhead metrics (M3) → call categorization + orchestration ratio (M4) → full analytics with retry/latency/cache tracking (M5+)
§11.1.2 Tool Sandboxing — layered SandboxBackend protocol: SubprocessSandbox for low-risk tools (file, git), DockerSandbox for code execution/terminal, K8sSandbox for future container deployments
§15.3 updated project tree (tools/sandbox/ directory with protocol, subprocess, docker)
§15.4 new design decision row for sandboxing
§15.5 five new convention rows: sandboxing, crash recovery, agent behavior testing, LLM call analytics
§17.1 resolved question #9 (sandboxing), added #15-17 as resolved (crash recovery, testing strategy, overhead tracking)
§17.3 updated crash risk mitigation, added orchestration overhead risk

Test plan

Verify all new sections have correct markdown formatting and render properly
Verify cross-references (§ links) point to correct sections
Verify §15.3 project tree matches the intended sandbox directory structure
Verify §17.1 open questions table has correct resolved/open status

🤖 Generated with Claude Code

…s to design spec

Address open questions raised by three external reviews of the design spec:

- §6.6 Agent Crash Recovery: pluggable RecoveryStrategy protocol with fail-and-reassign (M3 MVP) and checkpoint recovery (M4/M5)
- §10.5 LLM Call Analytics: incremental tracking with proxy overhead metrics (M3), call categorization with orchestration ratio (M4), full analytics layer with retry/latency/cache tracking (M5+)
- §11.1.2 Tool Sandboxing: layered SandboxBackend protocol with SubprocessSandbox (low-risk tools), DockerSandbox (code execution), and K8sSandbox (future container deployments)
- §15.3: Updated project tree (tools/sandbox/ directory)
- §15.4: Added sandboxing to key design decisions table
- §15.5: Five new convention rows (sandboxing, crash recovery, agent behavior testing, LLM call analytics, tool sandboxing)
- §17.1: Resolved questions #9 (sandboxing), added #15-17 as resolved
- §17.3: Updated crash risk mitigation, added orchestration overhead risk

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist · 2026-03-06T10:57:30Z

Summary of Changes


This pull request significantly enhances the design specification by formalizing critical architectural decisions related to agent resilience, security, and observability. It introduces detailed strategies for handling agent crashes, implementing layered tool sandboxing, and establishing an incremental LLM call analytics framework. These updates address key concerns raised during external reviews, providing clear paths for development across upcoming milestones and improving the overall robustness and manageability of the agent system.

Highlights

Agent Crash Recovery: Introduced a pluggable RecoveryStrategy protocol, detailing an M3 MVP 'fail-and-reassign' strategy and planned M4/M5 'checkpoint recovery' with environment reconciliation.
LLM Call Analytics: Defined an incremental analytics build, starting with M3 proxy overhead metrics, moving to M4 call categorization and orchestration ratio, and M5+ full analytics including retry/latency/cache tracking.
Tool Sandboxing: Established a layered SandboxBackend protocol, detailing SubprocessSandbox for low-risk tools, DockerSandbox for high-risk tools (code execution/terminal), and K8sSandbox for future deployments.
Project Structure Update: Modified the project tree to include a tools/sandbox/ directory with protocol.py, subprocess.py, and docker.py.
Design Decisions & Conventions: Added a new design decision row for sandboxing and five new convention rows covering sandboxing, crash recovery, agent behavior testing, and LLM call analytics.
Resolved Questions & Risks: Updated the open questions table, resolving question #9 (sandboxing) and adding #15-17 (crash recovery, testing strategy, overhead tracking) as resolved. Also updated crash risk mitigation and added orchestration overhead risk.

Changelog

DESIGN_SPEC.md
- Added a new section 6.6 Agent Crash Recovery detailing pluggable recovery strategies.
- Added a new section 10.5 LLM Call Analytics outlining an incremental analytics system.
- Added a new section 11.1.2 Tool Sandboxing describing a layered sandboxing approach.
- Updated section 15.3 Project Tree to reflect the new tools/sandbox/ directory structure.
- Updated section 15.4 Design Decisions with a new entry for sandboxing.
- Updated section 15.5 Pydantic Model Conventions with new entries for tool sandboxing, crash recovery, agent behavior testing, and LLM call analytics.
- Updated section 17.1 Open Questions to mark questions #9, #15, #16, and #17 as resolved.
- Updated section 17.2 Technical Risks to include specific mitigation strategies for data loss on crash and orchestration overhead.

Activity

No human activity has occurred on this pull request yet.


github-actions · 2026-03-06T10:57:37Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

gemini-code-assist

Code Review

This pull request significantly enhances the design specification by adding detailed sections on agent crash recovery, LLM call analytics, tool sandboxing, and agent behavior testing. The new content is well-structured, clear, and resolves several previously open questions. The updates to the project structure, design decisions, and risk tables are consistent with these additions. I have one suggestion to improve the security posture of the proposed tool sandboxing design.

Note: Security Review has been skipped due to the limited scope of the PR.

gemini-code-assist · 2026-03-06T10:58:53Z

DESIGN_SPEC.md

+  overrides:                           # per-category backend overrides
+    file_system: "subprocess"          # low risk — fast, no deps
+    git: "subprocess"                  # low risk — workspace-scoped
+    web: "subprocess"                  # medium risk — timeout + allowlist


In the sandboxing.overrides configuration, assigning the web tool category to the subprocess backend might introduce security risks. A subprocess has access to the host's network stack, which could allow it to connect to internal services on localhost or the local network, even if its filesystem access is restricted.

The comment mentions a 'timeout + allowlist', but the subprocess configuration doesn't show how this network allowlist would be implemented or enforced. For better security and isolation, consider defaulting the web category to the docker backend. The docker sandbox is configured with network: "none" by default, providing strong network isolation. If network access is needed for specific web tools, a dedicated Docker network with an egress-only policy could be used.

Suggested change

web: "subprocess" # medium risk — timeout + allowlist

web: "docker" # medium risk — requires network isolation
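
The per-category override discussed in this comment amounts to a small lookup with a fallback. The following is a minimal sketch under stated assumptions, not the project's implementation: `resolve_backend`, `OVERRIDES`, and `DEFAULT_BACKEND` are hypothetical names, and the choice of `docker` as the fallback for unlisted categories is an assumption made here for illustration.

```python
# Hypothetical sketch: choose a sandbox backend per tool category.
# Assumption: categories without an override get the stricter backend.
DEFAULT_BACKEND = "docker"

OVERRIDES = {
    "file_system": "subprocess",  # low risk, fast, no extra dependencies
    "git": "subprocess",          # low risk, workspace-scoped
    "web": "docker",              # per the suggested change: needs network isolation
}


def resolve_backend(category: str, overrides: dict[str, str] = OVERRIDES) -> str:
    """Return the backend name for a tool category, falling back to the default."""
    return overrides.get(category, DEFAULT_BACKEND)
```

With this shape, moving `web` between backends is a one-line config change, which is what the suggestion relies on.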

Copilot

Pull request overview

Updates the design specification to resolve external review questions by documenting crash recovery, tool sandboxing, LLM call analytics, and testing conventions, plus reflecting those decisions in the architecture tables.

Changes:

Add new spec sections: §6.6 Agent Crash Recovery, §10.5 LLM Call Analytics, §11.1.2 Tool Sandboxing.
Update §15 architecture tables/project tree to include sandboxing decisions and planned directory structure.
Mark previously open questions as resolved in §17.1 and add related risk mitigations in §17.3.


Copilot · 2026-03-06T11:01:53Z

DESIGN_SPEC.md

+The engine catches the failure at its outermost boundary, logs the error with the full `AgentContext` snapshot for debugging, transitions the task to `FAILED`, and makes it available for reassignment (manual or automatic via the task router).
+
+```yaml
+crash_recovery:
+  strategy: "fail_reassign"            # fail_reassign, checkpoint
+```
+
+- Simple, no persistence dependency, M3-ready
+- All progress is lost on crash — acceptable for short single-agent tasks in the MVP
+
+On crash:
+1. Catch exception at the engine boundary (outermost `try/except` in the execution loop)
+2. Log at ERROR with full `AgentContext` snapshot (conversation, turn count, accumulated cost)
+3. Transition `TaskExecution` → `FAILED` with the exception as the failure reason


The spec says the engine should log the full AgentContext snapshot including the conversation on crash. In the current codebase, AgentContext.to_snapshot() produces an AgentContextSnapshot that intentionally excludes message contents (only message_count, turn_count, cost, etc.), which is safer and avoids leaking sensitive prompts/tool outputs into logs. Suggest updating this section to align with the existing snapshot model and explicitly call out redaction/truncation if any message content is ever logged.
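
To make the redaction point concrete, here is a minimal sketch of a crash handler that logs only counts and cost. The `AgentContextSnapshot` class below is a hypothetical stand-in modeled on the fields named in this comment (`message_count`, `turn_count`, cost); the real model in the codebase may differ, and `fail_task` is an illustrative name.

```python
import logging
from dataclasses import asdict, dataclass

logger = logging.getLogger("engine")


@dataclass(frozen=True)
class AgentContextSnapshot:
    """Hypothetical stand-in: counts and cost only, no message contents."""
    message_count: int
    turn_count: int
    cost_usd: float


def fail_task(snapshot: AgentContextSnapshot, exc: Exception) -> dict:
    """Log a redacted snapshot at ERROR and return a FAILED task record."""
    logger.error("agent crashed: %r snapshot=%s", exc, asdict(snapshot))
    return {"status": "FAILED", "failure_reason": str(exc), **asdict(snapshot)}
```

Because the snapshot type has no field for message bodies, prompts and tool outputs cannot reach the log through this path.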

Copilot · 2026-03-06T11:01:54Z

DESIGN_SPEC.md

+Every call to `BaseCompletionProvider.complete()` already records a `CostRecord` with token counts, cost, provider, model, agent, and task. In M3, the engine additionally logs **proxy overhead metrics** at task completion:
+
+- `turns_per_task` — number of LLM turns to complete the task (from `AgentContext.turn_count`)
+- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)
+- `cost_per_task` — total USD cost (from `TaskExecution.accumulated_cost.cost_usd`)


BaseCompletionProvider.complete() does not currently record a CostRecord (and it also lacks agent/task context needed to populate one). It only logs provider call start/success/error; cost aggregation happens via TokenUsage on responses and (when wired) the budget layer. Please reword this to avoid stating it "already records a CostRecord" and instead describe where/when CostRecord entries are created (e.g., in the engine when agent_id/task_id are known, recorded into CostTracker).
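
One way to read this suggestion is that the record is built where `agent_id` and `task_id` are known, rather than inside the provider. The sketch below assumes that reading; `CostRecord` and `CostTracker` here are simplified hypothetical stand-ins, and `record_call` is an illustrative helper, not an existing function.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CostRecord:
    """Hypothetical, simplified record; the real model may carry more fields."""
    agent_id: str
    task_id: str
    provider: str
    model: str
    total_tokens: int
    cost_usd: float


@dataclass
class CostTracker:
    """Hypothetical in-memory tracker."""
    records: list[CostRecord] = field(default_factory=list)

    def record(self, rec: CostRecord) -> None:
        self.records.append(rec)


def record_call(tracker: CostTracker, *, agent_id: str, task_id: str,
                provider: str, model: str, total_tokens: int,
                cost_usd: float) -> CostRecord:
    """Engine-side step: attach agent/task context to usage from a provider response."""
    rec = CostRecord(agent_id, task_id, provider, model, total_tokens, cost_usd)
    tracker.record(rec)
    return rec
```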

Copilot · 2026-03-06T11:01:54Z

DESIGN_SPEC.md

+Every call to `BaseCompletionProvider.complete()` already records a `CostRecord` with token counts, cost, provider, model, agent, and task. In M3, the engine additionally logs **proxy overhead metrics** at task completion:
+
+- `turns_per_task` — number of LLM turns to complete the task (from `AgentContext.turn_count`)
+- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)


tokens_per_task is described as coming from AgentContext.accumulated_cost, but accumulated_cost is a TokenUsage object. To avoid ambiguity, consider calling out the exact field used for the token total (e.g., accumulated_cost.total_tokens or input_tokens + output_tokens) vs cost (accumulated_cost.cost_usd).

Suggested change

- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)

- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost.total_tokens`)
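
The distinction between the token total and the cost can be shown with a small sketch. `TokenUsage` below is a hypothetical stand-in that derives `total_tokens` from `input_tokens + output_tokens`, as one of the options mentioned in this comment; `overhead_metrics` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """Hypothetical stand-in for the accumulated usage object."""
    input_tokens: int
    output_tokens: int
    cost_usd: float

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens


def overhead_metrics(turn_count: int, accumulated_cost: TokenUsage) -> dict:
    """Build the three proxy overhead metrics from explicit fields."""
    return {
        "turns_per_task": turn_count,
        "tokens_per_task": accumulated_cost.total_tokens,  # tokens, not the whole object
        "cost_per_task": accumulated_cost.cost_usd,        # USD
    }
```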

greptile-apps · 2026-03-06T11:04:34Z

Greptile Summary

This PR addresses four open questions from external design-spec reviews by adding §6.6 (Agent Crash Recovery), §10.5 (LLM Call Analytics), §11.1.2 (Tool Sandboxing), and corresponding entries in §15–§17. The overall structure is well-thought-out — the pluggable protocol pattern (RecoveryStrategy, SandboxBackend) is consistent with the rest of the spec, and the incremental milestone approach for analytics is practical.

Key issues found:

egress-only is not a valid Docker network mode (DESIGN_SPEC.md §11.1.2, line 1337) — Docker has no built-in "egress-only" driver; this would fail at runtime and requires a concrete implementation plan (sidecar proxy, iptables rules, etc.)
Empty allowed_hosts: [] with database: "bridge" provides no host-level isolation (DESIGN_SPEC.md §11.1.2, lines 1335–1338) — without a populated allowlist and an enforcement mechanism, bridge-networked database containers have unrestricted network access
Checkpoint storage silently persists full AgentContext message contents (DESIGN_SPEC.md §6.6, lines 873–891) — Strategy 1 explicitly redacts message contents from crash logs, but Strategy 2 checkpoints write the full conversation history to SQLite/filesystem at rest with no mention of encryption, access controls, or whether redaction applies

Confidence Score: 3/5

Safe to merge for documentation purposes, but two sections contain implementation-blocking spec errors that will cause confusion or failures when developers implement them in M3.
The structural and organizational changes are clean and address prior review feedback well. However, §11.1.2 specifies a non-existent Docker network mode (egress-only) and an unenforced allowlist mechanism — both would mislead M3 implementers. §6.6 has a security gap (plaintext checkpoint storage of sensitive message contents) that is inconsistent with the redaction discipline established in Strategy 1. These are spec-level errors in sections that will be directly implemented in M3, so they warrant a lower confidence score despite being a docs-only PR.
Pay close attention to DESIGN_SPEC.md §11.1.2 (Docker network configuration) and §6.6 (checkpoint storage security).

Important Files Changed

Filename	Overview
DESIGN_SPEC.md	Adds four new spec sections (§6.6, §10.5, §11.1.2, updates to §15–§17). Contains two concrete implementation blockers in §11.1.2 (`egress-only` is not a valid Docker network mode; empty `allowed_hosts` with `bridge` networking provides no isolation), and a security gap in §6.6 (checkpoint storage persists full message contents without any acknowledgement or controls, inconsistent with Strategy 1's explicit redaction).
CLAUDE.md	Minor package-structure comment update: moves sandboxing annotation from `security/` to `tools/`. Accurately reflects the §15.3 project tree change. No issues.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Agent Execution Start] --> B[Engine Execution Loop]
    B --> C{Exception?}
    C -- No --> D[Turn Completed]
    D --> E{Strategy?}
    E -- fail_reassign --> F[Strategy 1: Skip checkpoint]
    E -- checkpoint --> G[Strategy 2: Persist AgentContext snapshot\nto SQLite / filesystem]
    G --> B

    C -- Yes --> H{Strategy?}
    H -- fail_reassign --> I[Catch at engine boundary\nLog redacted snapshot\nmessage contents excluded]
    H -- checkpoint --> J[Detect via exception\nor heartbeat timeout]

    J --> K[Load last checkpoint\nfull AgentContext incl. messages]
    K --> L{Resume attempts\n< max_resume_attempts?}
    L -- Yes --> M[Environment reconciliation\nsummary of changes since checkpoint]
    M --> B
    L -- No --> N[Fall back to fail_reassign]
    N --> I

    I --> O[TaskExecution → FAILED\nwith failure reason]
    O --> P[Task available for reassignment\nvia task router]

Last reviewed commit: 383125d
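
The crash branch of the flowchart reduces to a small decision function. This is a sketch of that branch only, assuming the strategy names from the spec's YAML (`fail_reassign`, `checkpoint`); `next_action` and its return strings are hypothetical.

```python
def next_action(strategy: str, resume_attempts: int,
                max_resume_attempts: int = 2) -> str:
    """Decide what to do after a crash, following the flowchart's crash branch."""
    if strategy == "checkpoint" and resume_attempts < max_resume_attempts:
        # Load the last checkpoint, reconcile the environment, re-enter the loop.
        return "resume_from_checkpoint"
    if strategy in ("checkpoint", "fail_reassign"):
        # Either the default strategy, or checkpoint retries are exhausted.
        return "fail_and_reassign"
    raise ValueError(f"unknown crash recovery strategy: {strategy!r}")
```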

DESIGN_SPEC.md

…nd Greptile

- Add FAILED terminal state note to §6.6 (needs TaskStatus enum update in M3)
- Fix AgentContext snapshot to use redacted form (exclude message contents)
- Fix CostRecord attribution (engine layer, not BaseCompletionProvider)
- Fix tokens_per_task to reference accumulated_cost.total_tokens
- Move web tools from subprocess to Docker (no network controls in subprocess)
- Add database network override (needs TCP access to DB host)
- Add Docker network_overrides and allowed_hosts config
- Change Adopted→Planned for unimplemented M3 conventions (sandboxing, crash recovery, testing)
- Rename §15.5 to "Engineering Conventions" (scope expanded beyond Pydantic)
- Update CLAUDE.md: move sandboxing from security/ to tools/ description

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-03-06T11:48:34Z

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough

Summary by CodeRabbit

Documentation

Updated design specifications with agent crash-recovery system featuring configurable recovery strategies
Added tool sandboxing capabilities supporting multiple backends (subprocess, Docker, Kubernetes planned)
Extended engineering conventions documentation with new development milestones

Walkthrough

Documentation updates to CLAUDE.md with directory taxonomy and capability terminology adjustments. DESIGN_SPEC.md significantly expanded with new crash-recovery design (RecoveryStrategy pattern), tool sandboxing specification (SandboxBackend protocol), updated task failure workflows, and extended engineering conventions section.

Changes

Cohort / File(s)	Summary
Documentation Metadata `CLAUDE.md`	Updated directory structure, adjusted security description from "sandboxing" to "audit", and added "sandboxing" to tools/capabilities list.
Design Specification Expansion `DESIGN_SPEC.md`	Added crash-recovery design section with pluggable RecoveryStrategy implementations (Fail-and-Reassign default, Checkpoint planned), new FAILED task status variant, YAML config examples. Introduced modular Tool Sandboxing (§11.1.2) with SandboxBackend protocol supporting SubprocessSandbox, DockerSandbox, and future K8sSandbox backends. Extended Engineering Conventions (§15.5) with entries for sandboxing, crash recovery, and agent analytics. Updated risk/plan narrative and cross-references to align crash-recovery and analytics as M3–M5 milestones.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

docs: expand design spec with pluggable strategy protocols #121: Introduces pluggable protocol/strategy abstractions in DESIGN_SPEC.md (RecoveryStrategy, SandboxBackend patterns) aligned with this PR's architectural abstractions.
docs: add CLAUDE.md, contributing guide, and dev documentation #65: Modifies CLAUDE.md terminology and structure, directly related to this PR's documentation taxonomy updates.
Add design specification, license, and project setup #2: Added initial DESIGN_SPEC.md; this PR significantly extends it with crash-recovery and sandboxing design sections.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main changes: adding crash recovery, sandboxing, analytics, and testing decisions to the design specification documentation.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, providing detailed summaries of each new section and update made to the design specification.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


greptile-apps · 2026-03-06T11:51:12Z

DESIGN_SPEC.md

+    network: "none"                    # no network by default; per-category overrides below
+    network_overrides:                 # category-specific network policies
+      database: "bridge"               # database tools need TCP access to DB host
+      web: "egress-only"               # web tools need outbound HTTP; no inbound


egress-only is not a valid Docker network mode

network: "egress-only" is not a recognized Docker network driver. Docker's built-in network modes are bridge, host, overlay, macvlan, ipvlan, and none. Passing egress-only as the --network argument to docker run would fail with an error like network egress-only not found.

Restricting a container to outbound-only traffic in Docker actually requires either:

A custom network + iptables rules (or nftables) to block inbound connections while permitting outbound ones

Routing all traffic through a local HTTP proxy (e.g., Squid) whose allowlist controls what the container can reach

A userspace sandboxing layer like gVisor (--runtime=runsc) with a custom network policy

The spec should document which of these mechanisms is intended and not use a non-existent Docker primitive. Otherwise implementers will hit a runtime error and have no guidance on what to build.

Suggested change

web: "egress-only" # web tools need outbound HTTP; no inbound

web: "bridge" # web tools need outbound HTTP; enforce allowlist via proxy or iptables — see note

Note: Docker has no native "egress-only" mode. Outbound-only restriction for web tools must be implemented via a sidecar proxy (e.g., Squid + http_access rules) or iptables FORWARD rules on a custom bridge network. Document the chosen mechanism before implementing DockerSandbox for the web category.
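
A config loader could catch this class of error before any container starts. The sketch below is hypothetical: the set of names is copied from this review comment rather than from Docker's documentation, and `validate_network_overrides` with its `user_networks` parameter (names of networks the operator has created) is an illustrative helper.

```python
# Names as listed in the review comment above; treat the set as an assumption.
BUILTIN_NETWORK_MODES = frozenset(
    {"bridge", "host", "overlay", "macvlan", "ipvlan", "none"}
)


def validate_network_overrides(
    overrides: dict[str, str],
    user_networks: frozenset[str] = frozenset(),
) -> list[str]:
    """Return one error message per override that names no known network."""
    errors = []
    for category, mode in overrides.items():
        if mode not in BUILTIN_NETWORK_MODES and mode not in user_networks:
            errors.append(f"{category}: unknown network {mode!r}")
    return errors
```

Running it on the config under review flags `web` and accepts `database`, so the problem surfaces at load time instead of at `docker run`.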


greptile-apps · 2026-03-06T11:51:13Z

DESIGN_SPEC.md

+    network_overrides:                 # category-specific network policies
+      database: "bridge"               # database tools need TCP access to DB host
+      web: "egress-only"               # web tools need outbound HTTP; no inbound
+    allowed_hosts: []                  # allowlist of host:port pairs (e.g. ["db:5432"])


allowed_hosts is empty by default but database: "bridge" provides no isolation without it

The network_overrides section sets database: "bridge", which correctly allows the container to reach a database over TCP. However, allowed_hosts: [] is empty by default, meaning there is no host-level allowlist enforced. A Docker container on a bridge network can reach any host reachable on that bridge — not just the database. The combination of bridge network + empty allowed_hosts offers essentially no isolation beyond the container boundary itself.

For the allowed_hosts field to be meaningful, DockerSandbox must:

Enforce it via iptables rules, a sidecar proxy, or Docker's --add-host + --network combination

Document that allowed_hosts: [] means "deny all outbound connections on bridge" (deny-by-default), not "allow all"

The spec currently implies the allowlist is opt-in but provides no mechanism for its enforcement, leaving a gap that implementers will silently skip. At minimum, add a clarifying note:

allowed_hosts: [] # REQUIRED for database category: e.g. ["db:5432"]. Empty = deny all outbound on bridge.
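
The deny-by-default reading of `allowed_hosts` can be expressed as a policy check. This sketch covers the decision only; actual enforcement would still need one of the mechanisms named above (iptables rules or a proxy). `parse_host_port` and `is_allowed` are hypothetical helpers.

```python
def parse_host_port(entry: str) -> tuple[str, int]:
    """Split a 'host:port' allowlist entry, rejecting malformed values."""
    host, _, port = entry.rpartition(":")
    if not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {entry!r}")
    return host, int(port)


def is_allowed(host: str, port: int, allowed_hosts: list[str]) -> bool:
    """Deny by default: an empty allowlist permits no outbound connection."""
    return (host, port) in {parse_host_port(e) for e in allowed_hosts}
```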


greptile-apps · 2026-03-06T11:51:14Z

DESIGN_SPEC.md

+#### Strategy 2: Checkpoint Recovery (Planned — M4/M5)
+
+The engine persists an `AgentContext` snapshot after each completed turn. On crash, the framework detects the failure (via heartbeat timeout or exception), loads the last checkpoint, and resumes execution from the exact turn where it left off. The immutable `model_copy(update=...)` pattern makes checkpointing trivial — each `AgentContext` is a complete, self-contained frozen state that serializes cleanly via `model_dump_json()`.
+
+```yaml
+crash_recovery:
+  strategy: "checkpoint"
+  checkpoint:
+    persist_every_n_turns: 1           # checkpoint frequency
+    storage: "sqlite"                  # sqlite, filesystem
+    heartbeat_interval_seconds: 30     # detect unresponsive agents
+    max_resume_attempts: 2             # retry limit before falling back to fail_reassign
+```
+
+- Preserves progress — critical for long tasks (multi-step plans, epic-level work)
+- Requires persistence layer and environment state reconciliation on resume
+- Natural fit with the existing immutable state model
+
+> **Environment reconciliation:** When resuming from a checkpoint, the agent's tools and workspace may have changed (other agents modified files, external state drifted). The checkpoint strategy includes a reconciliation step: the resumed agent receives a summary of changes since the checkpoint timestamp and can adapt its plan accordingly. This is analogous to a developer returning to a branch after colleagues have pushed changes.


Checkpoint storage silently persists full message contents

Strategy 1 (fail-and-reassign) explicitly redacts message contents from its log entry: "excluding message contents to avoid leaking sensitive prompts/tool outputs". But Strategy 2 (checkpoint recovery) persists the full AgentContext snapshot — which includes the entire message history — to SQLite or the filesystem after every turn.

This creates an inconsistency: the same sensitive content that is deliberately excluded from crash logs in Strategy 1 is written in plaintext to a persistent checkpoint storage in Strategy 2. If the SQLite file or filesystem checkpoint directory is accessible to other agents, processes, or backup systems, sensitive prompts and tool outputs (API keys returned by tools, user PII in prompts, etc.) are silently at rest.

The spec should acknowledge this security implication and at least document the intended controls:

Should checkpoint storage be encrypted at rest? (e.g., SQLCipher for SQLite, or filesystem-level encryption)

Should AgentContext checkpoints exclude message contents (storing only tool call history + turn count) and rely on the task description for context on resume?

What is the access model for the checkpoint database — is it per-agent, shared, or controlled by the engine process only?

Without explicit guidance here, implementers will default to unencrypted plaintext storage, which is a meaningful downgrade from the redaction discipline already applied in Strategy 1.
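
As an illustration of the second option raised above (checkpoints that exclude message contents), here is a minimal sketch using the standard-library `sqlite3` module. It shows one possible control, not a decision the spec has made; the table layout, the `messages` key, and both function names are assumptions.

```python
import json
import sqlite3
from typing import Optional


def save_checkpoint(conn: sqlite3.Connection, task_id: str, turn: int,
                    context: dict, *, include_messages: bool = False) -> None:
    """Persist a checkpoint; by default keep only a message count, not contents."""
    payload = dict(context)
    if not include_messages:
        messages = payload.pop("messages", [])
        payload["message_count"] = len(messages)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(task_id TEXT, turn INTEGER, payload TEXT, PRIMARY KEY (task_id, turn))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
        (task_id, turn, json.dumps(payload)),
    )
    conn.commit()


def load_latest(conn: sqlite3.Connection, task_id: str) -> Optional[dict]:
    """Return the most recent checkpoint payload for a task, if any."""
    row = conn.execute(
        "SELECT payload FROM checkpoints WHERE task_id = ? "
        "ORDER BY turn DESC LIMIT 1",
        (task_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

The trade-off is the one the comment names: a redacted checkpoint is safer at rest, but resume then has to rebuild context from the task description.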





Copilot AI review requested due to automatic review settings March 6, 2026 10:57

Copilot started reviewing on behalf of Aureliolo March 6, 2026 10:57

gemini-code-assist bot reviewed Mar 6, 2026

Copilot AI reviewed Mar 6, 2026

greptile-apps bot reviewed Mar 6, 2026

greptile-apps bot reviewed Mar 6, 2026

Aureliolo merged commit 5c11595 into main Mar 6, 2026
10 of 11 checks passed

Aureliolo deleted the docs/design-spec-review-decisions branch March 6, 2026 11:54

Aureliolo mentioned this pull request Mar 10, 2026

chore(main): release ai-company 0.1.1 #282

Merged

Aureliolo mentioned this pull request Mar 10, 2026

chore(main): release 0.1.0 #283

Merged

This was referenced Mar 15, 2026

chore(main): release 0.2.4 #431

Merged

chore(main): release 0.2.0 #442

Closed

chore(main): release 0.2.5 #447

Merged

chore(main): release 0.2.0 #460

Closed

chore(main): release 0.2.0 #471

Closed

Suggested changes from the review (in each pair, the original line comes first and the proposed replacement second):

	web: "subprocess" # medium risk — timeout + allowlist
	web: "docker" # medium risk — requires network isolation

	- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)
	- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost.total_tokens`)

	web: "egress-only" # web tools need outbound HTTP; no inbound
	web: "bridge" # web tools need outbound HTTP; enforce allowlist via proxy or iptables — see note
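The per-category sandbox and network settings in these suggestions could be validated at load time, roughly as sketched below. This is a hedged illustration only: `SandboxPolicy`, `POLICIES`, `DEFAULT_POLICY`, `policy_for`, and the category names are hypothetical, and the choice of a network-less Docker sandbox as the fallback is an assumption rather than anything the spec states. The one external fact relied on is that `none`, `bridge`, and `host` are built-in Docker network modes, whereas `egress-only` is not.

```python
from __future__ import annotations

from dataclasses import dataclass

# Built-in Docker network modes; user-defined networks are out of scope for this sketch.
DOCKER_NETWORK_MODES = {"none", "bridge", "host"}
BACKENDS = {"subprocess", "docker"}


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-category policy: which backend, and which network mode."""

    backend: str
    network: str | None = None  # only meaningful for the docker backend

    def __post_init__(self) -> None:
        if self.backend not in BACKENDS:
            raise ValueError(f"unknown sandbox backend: {self.backend!r}")
        if self.backend == "docker" and self.network not in DOCKER_NETWORK_MODES:
            raise ValueError(f"unsupported Docker network mode: {self.network!r}")


# Assumed mapping, loosely following the PR summary and the suggestions above.
POLICIES: dict[str, SandboxPolicy] = {
    "file": SandboxPolicy("subprocess"),
    "git": SandboxPolicy("subprocess"),
    "code_execution": SandboxPolicy("docker", "none"),
    # Outbound HTTP is needed; an allowlist would be enforced elsewhere (proxy or iptables).
    "web": SandboxPolicy("docker", "bridge"),
}

# Assumption: unknown categories get the most restrictive container policy.
DEFAULT_POLICY = SandboxPolicy("docker", "none")


def policy_for(category: str) -> SandboxPolicy:
    """Look up the policy for a tool category, falling back to the default."""
    return POLICIES.get(category, DEFAULT_POLICY)
```

With a check like this, a value such as `"egress-only"` would be rejected when the configuration is constructed instead of failing later when a container is started.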

Conversation

Aureliolo commented Mar 6, 2026

Summary

Test plan

gemini-code-assist bot commented Mar 6, 2026

Summary of Changes

Highlights

Footnotes

github-actions bot commented Mar 6, 2026

Dependency Review

Scanned Files

gemini-code-assist bot left a comment

Code Review

gemini-code-assist bot Mar 6, 2026

Copilot AI left a comment

Pull request overview

Copilot AI Mar 6, 2026

greptile-apps bot commented Mar 6, 2026

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Flowchart

coderabbitai bot commented Mar 6, 2026

Review failed

Summary by CodeRabbit

Documentation

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

greptile-apps bot Mar 6, 2026

2 participants