docs: add crash recovery, sandboxing, analytics, and testing decisions#127
…s to design spec

Address open questions raised by three external reviews of the design spec:

- §6.6: Agent Crash Recovery — pluggable RecoveryStrategy protocol with fail-and-reassign (M3 MVP) and checkpoint recovery (M4/M5)
- §10.5: LLM Call Analytics — incremental tracking: proxy overhead metrics (M3), call categorization with orchestration ratio (M4), full analytics layer with retry/latency/cache tracking (M5+)
- §11.1.2: Tool Sandboxing — layered SandboxBackend protocol with SubprocessSandbox (low-risk tools), DockerSandbox (code execution), and K8sSandbox (future container deployments)
- §15.3: Updated project tree (tools/sandbox/ directory)
- §15.4: Added sandboxing to key design decisions table
- §15.5: Five new convention rows (sandboxing, crash recovery, agent behavior testing, LLM call analytics, tool sandboxing)
- §17.1: Resolved questions #9 (sandboxing), added #15-17 as resolved
- §17.3: Updated crash risk mitigation, added orchestration overhead risk

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the design specification by formalizing critical architectural decisions related to agent resilience, security, and observability. It introduces detailed strategies for handling agent crashes, implementing layered tool sandboxing, and establishing an incremental LLM call analytics framework. These updates address key concerns raised during external reviews, providing clear paths for development across upcoming milestones and improving the overall robustness and manageability of the agent system.
Dependency Review
✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found.
Scanned Files: None
Code Review
This pull request significantly enhances the design specification by adding detailed sections on agent crash recovery, LLM call analytics, tool sandboxing, and agent behavior testing. The new content is well-structured, clear, and resolves several previously open questions. The updates to the project structure, design decisions, and risk tables are consistent with these additions. I have one suggestion to improve the security posture of the proposed tool sandboxing design.
Note: Security Review has been skipped due to the limited scope of the PR.
DESIGN_SPEC.md
Outdated
overrides: # per-category backend overrides
  file_system: "subprocess" # low risk — fast, no deps
  git: "subprocess" # low risk — workspace-scoped
  web: "subprocess" # medium risk — timeout + allowlist
In the sandboxing.overrides configuration, assigning the web tool category to the subprocess backend might introduce security risks. A subprocess has access to the host's network stack, which could allow it to connect to internal services on localhost or the local network, even if its filesystem access is restricted.
The comment mentions a 'timeout + allowlist', but the subprocess configuration doesn't show how this network allowlist would be implemented or enforced. For better security and isolation, consider defaulting the web category to the docker backend. The docker sandbox is configured with network: "none" by default, providing strong network isolation. If network access is needed for specific web tools, a dedicated Docker network with an egress-only policy could be used.
Suggested change:
- web: "subprocess" # medium risk — timeout + allowlist
+ web: "docker" # medium risk — requires network isolation
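The per-category override lookup discussed in this comment can be sketched as below. This is only an illustration: `SandboxingConfig`, `default_backend`, `NETWORK_CATEGORIES`, and `resolve_backend` are hypothetical names that do not appear in the spec, and the rule that network-facing categories may not use the subprocess backend is an assumption drawn from the reviewer's concern, not a documented behavior.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SandboxingConfig:
    """Hypothetical mirror of the `sandboxing` config section."""

    default_backend: str = "subprocess"  # assumed default, not from the spec
    overrides: dict[str, str] = field(default_factory=dict)


# Assumption: categories that need network access and therefore need
# network isolation that a plain subprocess cannot provide.
NETWORK_CATEGORIES = frozenset({"web"})


def resolve_backend(config: SandboxingConfig, category: str) -> str:
    """Return the backend name for a tool category, applying overrides."""
    backend = config.overrides.get(category, config.default_backend)
    if category in NETWORK_CATEGORIES and backend == "subprocess":
        # A subprocess shares the host network stack, so reject this combination.
        raise ValueError(
            f"category {category!r} needs network isolation; "
            "use a container backend instead of 'subprocess'"
        )
    return backend
```

Rejecting the combination at config-resolution time would surface the problem at startup instead of at the first web tool call.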
Pull request overview
Updates the design specification to resolve external review questions by documenting crash recovery, tool sandboxing, LLM call analytics, and testing conventions, plus reflecting those decisions in the architecture tables.
Changes:
- Add new spec sections: §6.6 Agent Crash Recovery, §10.5 LLM Call Analytics, §11.1.2 Tool Sandboxing.
- Update §15 architecture tables/project tree to include sandboxing decisions and planned directory structure.
- Mark previously open questions as resolved in §17.1 and add related risk mitigations in §17.3.
DESIGN_SPEC.md
Outdated
The engine catches the failure at its outermost boundary, logs the error with the full `AgentContext` snapshot for debugging, transitions the task to `FAILED`, and makes it available for reassignment (manual or automatic via the task router).

```yaml
crash_recovery:
  strategy: "fail_reassign" # fail_reassign, checkpoint
```

- Simple, no persistence dependency, M3-ready
- All progress is lost on crash — acceptable for short single-agent tasks in the MVP

On crash:
1. Catch exception at the engine boundary (outermost `try/except` in the execution loop)
2. Log at ERROR with full `AgentContext` snapshot (conversation, turn count, accumulated cost)
3. Transition `TaskExecution` → `FAILED` with the exception as the failure reason
The spec says the engine should log the full AgentContext snapshot including the conversation on crash. In the current codebase, AgentContext.to_snapshot() produces an AgentContextSnapshot that intentionally excludes message contents (only message_count, turn_count, cost, etc.), which is safer and avoids leaking sensitive prompts/tool outputs into logs. Suggest updating this section to align with the existing snapshot model and explicitly call out redaction/truncation if any message content is ever logged.
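A redacted snapshot of the kind this comment describes might look like the sketch below. It uses plain dataclasses in place of the project's real models, and the field names (`messages`, `cost_usd`) and the `Message` type are assumptions for illustration. Only `to_snapshot()`, `message_count`, and `turn_count` are named in the comment.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Message:
    role: str
    content: str  # may hold sensitive prompts or tool outputs


@dataclass(frozen=True)
class AgentContextSnapshot:
    """Log-safe view: counts and cost only, no message contents."""

    message_count: int
    turn_count: int
    cost_usd: float


@dataclass(frozen=True)
class AgentContext:
    messages: tuple[Message, ...]
    turn_count: int
    cost_usd: float

    def to_snapshot(self) -> AgentContextSnapshot:
        # Deliberately drop message contents so they never reach the logs.
        return AgentContextSnapshot(
            message_count=len(self.messages),
            turn_count=self.turn_count,
            cost_usd=self.cost_usd,
        )


ctx = AgentContext(
    messages=(Message("user", "api key: SECRET"), Message("assistant", "ok")),
    turn_count=1,
    cost_usd=0.5,
)
log_payload = asdict(ctx.to_snapshot())
```

Here `log_payload` carries only the two counts and the cost, so the crash log entry stays useful for debugging without exposing the conversation.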
DESIGN_SPEC.md
Outdated
Every call to `BaseCompletionProvider.complete()` already records a `CostRecord` with token counts, cost, provider, model, agent, and task. In M3, the engine additionally logs **proxy overhead metrics** at task completion:

- `turns_per_task` — number of LLM turns to complete the task (from `AgentContext.turn_count`)
- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)
- `cost_per_task` — total USD cost (from `TaskExecution.accumulated_cost.cost_usd`)
BaseCompletionProvider.complete() does not currently record a CostRecord (and it also lacks agent/task context needed to populate one). It only logs provider call start/success/error; cost aggregation happens via TokenUsage on responses and (when wired) the budget layer. Please reword this to avoid stating it "already records a CostRecord" and instead describe where/when CostRecord entries are created (e.g., in the engine when agent_id/task_id are known, recorded into CostTracker).
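The attribution this comment asks for, where the engine creates the `CostRecord` because it is the layer that knows `agent_id` and `task_id`, can be sketched as follows. The class shapes and the `record_call` helper are hypothetical stand-ins. The real `CostRecord`, `TokenUsage`, and `CostTracker` may have different fields and methods.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass(frozen=True)
class CostRecord:
    agent_id: str
    task_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass
class CostTracker:
    records: list[CostRecord] = field(default_factory=list)

    def record(self, rec: CostRecord) -> None:
        self.records.append(rec)


def record_call(
    tracker: CostTracker,
    *,
    agent_id: str,
    task_id: str,
    provider: str,
    model: str,
    usage: TokenUsage,
) -> CostRecord:
    """Engine-side step: combine the provider's usage with agent/task context."""
    rec = CostRecord(
        agent_id=agent_id,
        task_id=task_id,
        provider=provider,
        model=model,
        input_tokens=usage.input_tokens,
        output_tokens=usage.output_tokens,
        cost_usd=usage.cost_usd,
    )
    tracker.record(rec)
    return rec
```

The provider only needs to return the usage on its response, which keeps it free of agent and task concepts.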
DESIGN_SPEC.md
Outdated
Every call to `BaseCompletionProvider.complete()` already records a `CostRecord` with token counts, cost, provider, model, agent, and task. In M3, the engine additionally logs **proxy overhead metrics** at task completion:

- `turns_per_task` — number of LLM turns to complete the task (from `AgentContext.turn_count`)
- `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)
tokens_per_task is described as coming from AgentContext.accumulated_cost, but accumulated_cost is a TokenUsage object. To avoid ambiguity, consider calling out the exact field used for the token total (e.g., accumulated_cost.total_tokens or input_tokens + output_tokens) vs cost (accumulated_cost.cost_usd).
Suggested change:
- - `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost`)
+ - `tokens_per_task` — total tokens consumed (from `AgentContext.accumulated_cost.total_tokens`)
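To make the field distinction concrete, here is a minimal sketch of deriving the three proxy metrics from a `TokenUsage`-like object. The dataclass below is a simplified stand-in for the project's model, and `proxy_overhead_metrics` is a hypothetical helper. The `total_tokens` definition as `input_tokens + output_tokens` follows the reviewer's wording.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    input_tokens: int
    output_tokens: int
    cost_usd: float

    @property
    def total_tokens(self) -> int:
        # Derived value, so it cannot drift from its two inputs.
        return self.input_tokens + self.output_tokens


def proxy_overhead_metrics(turn_count: int, accumulated_cost: TokenUsage) -> dict:
    """Build the per-task metrics logged at task completion."""
    return {
        "turns_per_task": turn_count,
        "tokens_per_task": accumulated_cost.total_tokens,
        "cost_per_task": accumulated_cost.cost_usd,
    }


metrics = proxy_overhead_metrics(
    turn_count=4,
    accumulated_cost=TokenUsage(input_tokens=1200, output_tokens=300, cost_usd=0.5),
)
```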
Greptile Summary
This PR addresses four open questions from external design-spec reviews by adding §6.6 (Agent Crash Recovery), §10.5 (LLM Call Analytics), §11.1.2 (Tool Sandboxing), and corresponding entries in §15–§17. The overall structure is well-thought-out — the pluggable protocol pattern (
Key issues found:
Confidence Score: 3/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Agent Execution Start] --> B[Engine Execution Loop]
B --> C{Exception?}
C -- No --> D[Turn Completed]
D --> E{Strategy?}
E -- fail_reassign --> F[Strategy 1: Skip checkpoint]
E -- checkpoint --> G[Strategy 2: Persist AgentContext snapshot\nto SQLite / filesystem]
G --> B
C -- Yes --> H{Strategy?}
H -- fail_reassign --> I[Catch at engine boundary\nLog redacted snapshot\nmessage contents excluded]
H -- checkpoint --> J[Detect via exception\nor heartbeat timeout]
J --> K[Load last checkpoint\nfull AgentContext incl. messages]
K --> L{Resume attempts\n< max_resume_attempts?}
L -- Yes --> M[Environment reconciliation\nsummary of changes since checkpoint]
M --> B
L -- No --> N[Fall back to fail_reassign]
N --> I
I --> O[TaskExecution → FAILED\nwith failure reason]
O --> P[Task available for reassignment\nvia task router]
Last reviewed commit: 383125d |
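The crash branch of the flowchart can be read as a small decision function, sketched below. `RecoveryDecision` and `decide_recovery` are hypothetical names. The fallback to `fail_reassign` when no checkpoint exists yet is an assumption, since the flowchart does not cover that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryDecision:
    action: str  # "resume" or "fail_reassign"
    resume_from_turn: Optional[int] = None


def decide_recovery(
    strategy: str,
    last_checkpoint_turn: Optional[int],
    resume_attempts: int,
    max_resume_attempts: int = 2,
) -> RecoveryDecision:
    """Pick the recovery action after a crash or heartbeat timeout."""
    can_resume = (
        strategy == "checkpoint"
        and last_checkpoint_turn is not None  # assumption: nothing to resume from otherwise
        and resume_attempts < max_resume_attempts
    )
    if can_resume:
        # The caller would run environment reconciliation before re-entering the loop.
        return RecoveryDecision("resume", last_checkpoint_turn)
    # Either the strategy is fail_reassign or the retry budget is exhausted:
    # log a redacted snapshot, mark the task FAILED, and hand it to the task router.
    return RecoveryDecision("fail_reassign")
```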
…nd Greptile

- Add FAILED terminal state note to §6.6 (needs TaskStatus enum update in M3)
- Fix AgentContext snapshot to use redacted form (exclude message contents)
- Fix CostRecord attribution (engine layer, not BaseCompletionProvider)
- Fix tokens_per_task to reference accumulated_cost.total_tokens
- Move web tools from subprocess to Docker (no network controls in subprocess)
- Add database network override (needs TCP access to DB host)
- Add Docker network_overrides and allowed_hosts config
- Change Adopted→Planned for unimplemented M3 conventions (sandboxing, crash recovery, testing)
- Rename §15.5 to "Engineering Conventions" (scope expanded beyond Pydantic)
- Update CLAUDE.md: move sandboxing from security/ to tools/ description

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Caution: Review failed. Pull request was closed or merged during review.
Walkthrough
Documentation updates to CLAUDE.md with directory taxonomy and capability terminology adjustments. DESIGN_SPEC.md is significantly expanded with a new crash-recovery design (RecoveryStrategy pattern), a tool sandboxing specification (SandboxBackend protocol), updated task failure workflows, and an extended engineering conventions section.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 3 passed
network: "none" # no network by default; per-category overrides below
network_overrides: # category-specific network policies
  database: "bridge" # database tools need TCP access to DB host
  web: "egress-only" # web tools need outbound HTTP; no inbound
egress-only is not a valid Docker network mode
network: "egress-only" is not a recognized Docker network driver. Docker's built-in network modes are bridge, host, overlay, macvlan, ipvlan, and none. Passing egress-only as the --network argument to docker run would fail with an error like network egress-only not found.
Restricting a container to outbound-only traffic in Docker actually requires either:
- A custom network + `iptables` rules (or `nftables`) to block inbound connections while permitting outbound ones
- Routing all traffic through a local HTTP proxy (e.g., Squid) whose allowlist controls what the container can reach
- A userspace sandboxing layer like `gVisor` (`--runtime=runsc`) with a custom network policy
The spec should document which of these mechanisms is intended and not use a non-existent Docker primitive. Otherwise implementers will hit a runtime error and have no guidance on what to build.
Suggested change:
- web: "egress-only" # web tools need outbound HTTP; no inbound
+ web: "bridge" # web tools need outbound HTTP; enforce allowlist via proxy or iptables — see note
Note: Docker has no native "egress-only" mode. Outbound-only restriction for web tools must be implemented via a sidecar proxy (e.g., Squid + `http_access` rules) or `iptables FORWARD` rules on a custom bridge network. Document the chosen mechanism before implementing `DockerSandbox` for the `web` category.
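One way to catch this class of mistake early is to validate the configured network value when the config is loaded. The sketch below is an assumption about how that could look. `validate_network_mode` and `user_defined_networks` are hypothetical names, and the set of built-in names is copied from the list in the comment above. Docker also accepts the name of an existing user-defined network, which the second parameter models.

```python
# Names as listed in the review comment; not an exhaustive statement of Docker's API.
BUILTIN_NETWORK_MODES = frozenset(
    {"bridge", "host", "overlay", "macvlan", "ipvlan", "none"}
)


def validate_network_mode(
    mode: str,
    user_defined_networks: frozenset[str] = frozenset(),
) -> str:
    """Return `mode` if it names a known network, else raise ValueError."""
    if mode in BUILTIN_NETWORK_MODES or mode in user_defined_networks:
        return mode
    raise ValueError(
        f"unknown Docker network {mode!r}; expected one of "
        f"{sorted(BUILTIN_NETWORK_MODES)} or a pre-created custom network"
    )
```

With a check like this, a value such as `"egress-only"` fails at config load unless a custom network of that name has been declared, instead of failing later when the container is started.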
network_overrides: # category-specific network policies
  database: "bridge" # database tools need TCP access to DB host
  web: "egress-only" # web tools need outbound HTTP; no inbound
allowed_hosts: [] # allowlist of host:port pairs (e.g. ["db:5432"])
allowed_hosts is empty by default but database: "bridge" provides no isolation without it
The network_overrides section sets database: "bridge", which correctly allows the container to reach a database over TCP. However, allowed_hosts: [] is empty by default, meaning there is no host-level allowlist enforced. A Docker container on a bridge network can reach any host reachable on that bridge — not just the database. The combination of bridge network + empty allowed_hosts offers essentially no isolation beyond the container boundary itself.
For the `allowed_hosts` field to be meaningful, `DockerSandbox` must:
- Enforce it via `iptables` rules, a sidecar proxy, or Docker's `--add-host` + `--network` combination
- Document that `allowed_hosts: []` means "deny all outbound connections on bridge" (deny-by-default), not "allow all"

The spec currently implies the allowlist is opt-in but provides no mechanism for its enforcement, leaving a gap that implementers will silently skip. At minimum, add a clarifying note:
allowed_hosts: [] # REQUIRED for database category: e.g. ["db:5432"]. Empty = deny all outbound on bridge.
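The deny-by-default reading of `allowed_hosts` can be expressed as a small policy check. This is a sketch of the policy decision only. `is_destination_allowed` is a hypothetical helper, and actual enforcement would still have to happen at the network layer (for example via the `iptables` rules or proxy mentioned above).

```python
def is_destination_allowed(host: str, port: int, allowed_hosts: list[str]) -> bool:
    """Deny by default: only exact `host:port` entries are permitted.

    An empty allowlist therefore permits nothing.
    """
    return f"{host}:{port}" in set(allowed_hosts)
```

For example, `is_destination_allowed("db", 5432, [])` is `False`, while the same call with `["db:5432"]` is `True`.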
#### Strategy 2: Checkpoint Recovery (Planned — M4/M5)

The engine persists an `AgentContext` snapshot after each completed turn. On crash, the framework detects the failure (via heartbeat timeout or exception), loads the last checkpoint, and resumes execution from the exact turn where it left off. The immutable `model_copy(update=...)` pattern makes checkpointing trivial — each `AgentContext` is a complete, self-contained frozen state that serializes cleanly via `model_dump_json()`.

```yaml
crash_recovery:
  strategy: "checkpoint"
  checkpoint:
    persist_every_n_turns: 1 # checkpoint frequency
    storage: "sqlite" # sqlite, filesystem
    heartbeat_interval_seconds: 30 # detect unresponsive agents
    max_resume_attempts: 2 # retry limit before falling back to fail_reassign
```

- Preserves progress — critical for long tasks (multi-step plans, epic-level work)
- Requires persistence layer and environment state reconciliation on resume
- Natural fit with the existing immutable state model

> **Environment reconciliation:** When resuming from a checkpoint, the agent's tools and workspace may have changed (other agents modified files, external state drifted). The checkpoint strategy includes a reconciliation step: the resumed agent receives a summary of changes since the checkpoint timestamp and can adapt its plan accordingly. This is analogous to a developer returning to a branch after colleagues have pushed changes.
Checkpoint storage silently persists full message contents
Strategy 1 (fail-and-reassign) explicitly redacts message contents from its log entry: "excluding message contents to avoid leaking sensitive prompts/tool outputs". But Strategy 2 (checkpoint recovery) persists the full AgentContext snapshot — which includes the entire message history — to SQLite or the filesystem after every turn.
This creates an inconsistency: the same sensitive content that is deliberately excluded from crash logs in Strategy 1 is written in plaintext to a persistent checkpoint storage in Strategy 2. If the SQLite file or filesystem checkpoint directory is accessible to other agents, processes, or backup systems, sensitive prompts and tool outputs (API keys returned by tools, user PII in prompts, etc.) are silently at rest.
The spec should acknowledge this security implication and at least document the intended controls:
- Should checkpoint storage be encrypted at rest? (e.g., SQLCipher for SQLite, or filesystem-level encryption)
- Should `AgentContext` checkpoints exclude message contents (storing only tool call history + turn count) and rely on the task description for context on resume?
- What is the access model for the checkpoint database — is it per-agent, shared, or controlled by the engine process only?
Without explicit guidance here, implementers will default to unencrypted plaintext storage, which is a meaningful downgrade from the redaction discipline already applied in Strategy 1.
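Two of the controls raised above, restricting access to the engine process and optionally dropping message contents, can be sketched for the filesystem backend as follows. This is an assumption-laden illustration, not the spec's design. `write_checkpoint`, the `include_messages` flag, and the `messages` key are hypothetical, and encryption at rest is not covered here.

```python
import json
import os


def write_checkpoint(path: str, state: dict, include_messages: bool = False) -> None:
    """Write a checkpoint as JSON, readable and writable by the owner only."""
    payload = dict(state)
    if not include_messages:
        # Assumed option: keep counters and tool history, drop raw messages.
        payload.pop("messages", None)
    # 0o600 applies when the file is created; an existing file keeps its mode.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)
```

Dropping messages trades resume fidelity for safety, so whether it is acceptable depends on how much context the task description alone can restore.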
How can I resolve this? If you propose a fix, please make it concise.🤖 I have created a release *beep* *boop* --- ## [0.1.1](ai-company-v0.1.0...ai-company-v0.1.1) (2026-03-10) ### Features * add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a)) * add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b)) * add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8)) * add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa)) * add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267) * add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480)) * add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173) * add design specification, license, and project setup ([8669a09](8669a09)) * add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832)) * add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139) * add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea)) * add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39)) * add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa)) * add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4)) * add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4)) * add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779)) * add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) 
([3a87c08](3a87c08)) * add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545)) * add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c)) * add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b)) * design unified provider interface ([#86](#86)) ([3e23d64](3e23d64)) * expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134)) * implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5)) * implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a)) * implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b)) * implement built-in file system tools ([#18](#18)) ([325ef98](325ef98)) * implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd)) * implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496)) * implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2)) * implement core entity and role system models ([#69](#69)) ([acf9801](acf9801)) * implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed)) * implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f)) * implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5)) * implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515)) * implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6)) * implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4) * implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe)) * implement 
meeting protocol system ([#123](#123)) ([ee7caca](ee7caca)) * implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2)) * implement model routing engine ([#99](#99)) ([d3c250b](d3c250b)) * implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3)) * implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c)) * implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85)) * implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4)) * implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e)) * implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30) * implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52)) * implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1)) * implement tool permission checking ([#16](#16)) ([833c190](833c190)) * implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba)) * implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba)) * initialize project with uv, hatchling, and src layout ([39005f9](39005f9)) * initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9)) * Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08)) * make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109) * parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee)) * testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749)) * wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617)) ### Bug Fixes * address Greptile post-merge review findings from PRs 
[#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929))
* address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169)
* enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a))
* pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861))

### Performance

* harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188)

### Refactoring

* adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9))
* harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89))

### Documentation

* add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6))
* finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee))

### Tests

* add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4))
* add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4))

### CI/CD

* add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758))
* bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247))
* bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517))
* harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af))

### Maintenance

* add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337))
* add design spec context loading to research-link skill ([8ef9685](8ef9685))
* add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46))
* fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5))
* pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
🤖 I have created a release *beep* *boop*

---

## [0.1.0](v0.0.0...v0.1.0) (2026-03-11)

### Features

* add autonomy levels and approval timeout policies ([#42](#42), [#126](#126)) ([#197](#197)) ([eecc25a](eecc25a))
* add CFO cost optimization service with anomaly detection, reports, and approval decisions ([#186](#186)) ([a7fa00b](a7fa00b))
* add code quality toolchain (ruff, mypy, pre-commit, dependabot) ([#63](#63)) ([36681a8](36681a8))
* add configurable cost tiers and subscription/quota-aware tracking ([#67](#67)) ([#185](#185)) ([9baedfa](9baedfa))
* add container packaging, Docker Compose, and CI pipeline ([#269](#269)) ([435bdfe](435bdfe)), closes [#267](#267)
* add coordination error taxonomy classification pipeline ([#146](#146)) ([#181](#181)) ([70c7480](70c7480))
* add cost-optimized, hierarchical, and auction assignment strategies ([#175](#175)) ([ce924fa](ce924fa)), closes [#173](#173)
* add design specification, license, and project setup ([8669a09](8669a09))
* add env var substitution and config file auto-discovery ([#77](#77)) ([7f53832](7f53832))
* add FastestStrategy routing + vendor-agnostic cleanup ([#140](#140)) ([09619cb](09619cb)), closes [#139](#139)
* add HR engine and performance tracking ([#45](#45), [#47](#47)) ([#193](#193)) ([2d091ea](2d091ea))
* add issue auto-search and resolution verification to PR review skill ([#119](#119)) ([deecc39](deecc39))
* add mandatory JWT + API key authentication ([#256](#256)) ([c279cfe](c279cfe))
* add memory retrieval, ranking, and context injection pipeline ([#41](#41)) ([873b0aa](873b0aa))
* add pluggable MemoryBackend protocol with models, config, and events ([#180](#180)) ([46cfdd4](46cfdd4))
* add pluggable MemoryBackend protocol with models, config, and events ([#32](#32)) ([46cfdd4](46cfdd4))
* add pluggable output scan response policies ([#263](#263)) ([b9907e8](b9907e8))
* add pluggable PersistenceBackend protocol with SQLite implementation ([#36](#36)) ([f753779](f753779))
* add progressive trust and promotion/demotion subsystems ([#43](#43), [#49](#49)) ([3a87c08](3a87c08))
* add retry handler, rate limiter, and provider resilience ([#100](#100)) ([b890545](b890545))
* add SecOps security agent with rule engine, audit log, and ToolInvoker integration ([#40](#40)) ([83b7b6c](83b7b6c))
* add shared org memory and memory consolidation/archival ([#125](#125), [#48](#48)) ([4a0832b](4a0832b))
* design unified provider interface ([#86](#86)) ([3e23d64](3e23d64))
* expand template presets, rosters, and add inheritance ([#80](#80), [#81](#81), [#84](#84)) ([15a9134](15a9134))
* implement agent runtime state vs immutable config split ([#115](#115)) ([4cb1ca5](4cb1ca5))
* implement AgentEngine core orchestrator ([#11](#11)) ([#143](#143)) ([f2eb73a](f2eb73a))
* implement AuditRepository for security audit log persistence ([#279](#279)) ([94bc29f](94bc29f))
* implement basic tool system (registry, invocation, results) ([#15](#15)) ([c51068b](c51068b))
* implement built-in file system tools ([#18](#18)) ([325ef98](325ef98))
* implement communication foundation — message bus, dispatcher, and messenger ([#157](#157)) ([8e71bfd](8e71bfd))
* implement company template system with 7 built-in presets ([#85](#85)) ([cbf1496](cbf1496))
* implement conflict resolution protocol ([#122](#122)) ([#166](#166)) ([e03f9f2](e03f9f2))
* implement core entity and role system models ([#69](#69)) ([acf9801](acf9801))
* implement crash recovery with fail-and-reassign strategy ([#149](#149)) ([e6e91ed](e6e91ed))
* implement engine extensions — Plan-and-Execute loop and call categorization ([#134](#134), [#135](#135)) ([#159](#159)) ([9b2699f](9b2699f))
* implement enterprise logging system with structlog ([#73](#73)) ([2f787e5](2f787e5))
* implement graceful shutdown with cooperative timeout strategy ([#130](#130)) ([6592515](6592515))
* implement hierarchical delegation and loop prevention ([#12](#12), [#17](#17)) ([6be60b6](6be60b6))
* implement LiteLLM driver and provider registry ([#88](#88)) ([ae3f18b](ae3f18b)), closes [#4](#4)
* implement LLM decomposition strategy and workspace isolation ([#174](#174)) ([aa0eefe](aa0eefe))
* implement meeting protocol system ([#123](#123)) ([ee7caca](ee7caca))
* implement message and communication domain models ([#74](#74)) ([560a5d2](560a5d2))
* implement model routing engine ([#99](#99)) ([d3c250b](d3c250b))
* implement parallel agent execution ([#22](#22)) ([#161](#161)) ([65940b3](65940b3))
* implement per-call cost tracking service ([#7](#7)) ([#102](#102)) ([c4f1f1c](c4f1f1c))
* implement personality injection and system prompt construction ([#105](#105)) ([934dd85](934dd85))
* implement single-task execution lifecycle ([#21](#21)) ([#144](#144)) ([c7e64e4](c7e64e4))
* implement subprocess sandbox for tool execution isolation ([#131](#131)) ([#153](#153)) ([3c8394e](3c8394e))
* implement task assignment subsystem with pluggable strategies ([#172](#172)) ([c7f1b26](c7f1b26)), closes [#26](#26) [#30](#30)
* implement task decomposition and routing engine ([#14](#14)) ([9c7fb52](9c7fb52))
* implement Task, Project, Artifact, Budget, and Cost domain models ([#71](#71)) ([81eabf1](81eabf1))
* implement tool permission checking ([#16](#16)) ([833c190](833c190))
* implement YAML config loader with Pydantic validation ([#59](#59)) ([ff3a2ba](ff3a2ba))
* implement YAML config loader with Pydantic validation ([#75](#75)) ([ff3a2ba](ff3a2ba))
* initialize project with uv, hatchling, and src layout ([39005f9](39005f9))
* initialize project with uv, hatchling, and src layout ([#62](#62)) ([39005f9](39005f9))
* Litestar REST API, WebSocket feed, and approval queue (M6) ([#189](#189)) ([29fcd08](29fcd08))
* make TokenUsage.total_tokens a computed field ([#118](#118)) ([c0bab18](c0bab18)), closes [#109](#109)
* parallel tool execution in ToolInvoker.invoke_all ([#137](#137)) ([58517ee](58517ee))
* testing framework, CI pipeline, and M0 gap fixes ([#64](#64)) ([f581749](f581749))
* wire all modules into observability system ([#97](#97)) ([f7a0617](f7a0617))

### Bug Fixes

* address Greptile post-merge review findings from PRs [#170](https://github.com/Aureliolo/ai-company/issues/170)-[#175](https://github.com/Aureliolo/ai-company/issues/175) ([#176](#176)) ([c5ca929](c5ca929))
* address post-merge review feedback from PRs [#164](https://github.com/Aureliolo/ai-company/issues/164)-[#167](https://github.com/Aureliolo/ai-company/issues/167) ([#170](#170)) ([3bf897a](3bf897a)), closes [#169](#169)
* enforce strict mypy on test files ([#89](#89)) ([aeeff8c](aeeff8c))
* harden Docker sandbox, MCP bridge, and code runner ([#50](#50), [#53](#53)) ([d5e1b6e](d5e1b6e))
* harden git tools security + code quality improvements ([#150](#150)) ([000a325](000a325))
* harden subprocess cleanup, env filtering, and shutdown resilience ([#155](#155)) ([d1fe1fb](d1fe1fb))
* incorporate post-merge feedback + pre-PR review fixes ([#164](#164)) ([c02832a](c02832a))
* pre-PR review fixes for post-merge findings ([#183](#183)) ([26b3108](26b3108))
* resolve circular imports, bump litellm, fix release tag format ([#286](#286)) ([a6659b5](a6659b5))
* strengthen immutability for BaseTool schema and ToolInvoker boundaries ([#117](#117)) ([7e5e861](7e5e861))

### Performance

* harden non-inferable principle implementation ([#195](#195)) ([02b5f4e](02b5f4e)), closes [#188](#188)

### Refactoring

* adopt NotBlankStr across all models ([#108](#108)) ([#120](#120)) ([ef89b90](ef89b90))
* extract _SpendingTotals base class from spending summary models ([#111](#111)) ([2f39c1b](2f39c1b))
* harden BudgetEnforcer with error handling, validation extraction, and review fixes ([#182](#182)) ([c107bf9](c107bf9))
* harden personality profiles, department validation, and template rendering ([#158](#158)) ([10b2299](10b2299))
* pre-PR review improvements for ExecutionLoop + ReAct loop ([#124](#124)) ([8dfb3c0](8dfb3c0))
* split events.py into per-domain event modules ([#136](#136)) ([e9cba89](e9cba89))

### Documentation

* add ADR-001 memory layer evaluation and selection ([#178](#178)) ([db3026f](db3026f)), closes [#39](#39)
* add agent scaling research findings to DESIGN_SPEC ([#145](#145)) ([57e487b](57e487b))
* add CLAUDE.md, contributing guide, and dev documentation ([#65](#65)) ([55c1025](55c1025)), closes [#54](#54)
* add crash recovery, sandboxing, analytics, and testing decisions ([#127](#127)) ([5c11595](5c11595))
* address external review feedback with MVP scope and new protocols ([#128](#128)) ([3b30b9a](3b30b9a))
* expand design spec with pluggable strategy protocols ([#121](#121)) ([6832db6](6832db6))
* finalize 23 design decisions (ADR-002) ([#190](#190)) ([8c39742](8c39742))
* update project docs for M2.5 conventions and add docs-consistency review agent ([#114](#114)) ([99766ee](99766ee))

### Tests

* add e2e single agent integration tests ([#24](#24)) ([#156](#156)) ([f566fb4](f566fb4))
* add provider adapter integration tests ([#90](#90)) ([40a61f4](40a61f4))

### CI/CD

* add Release Please for automated versioning and GitHub Releases ([#278](#278)) ([a488758](a488758))
* bump actions/checkout from 4 to 6 ([#95](#95)) ([1897247](1897247))
* bump actions/upload-artifact from 4 to 7 ([#94](#94)) ([27b1517](27b1517))
* bump anchore/scan-action from 6.5.1 to 7.3.2 ([#271](#271)) ([80a1c15](80a1c15))
* bump docker/build-push-action from 6.19.2 to 7.0.0 ([#273](#273)) ([dd0219e](dd0219e))
* bump docker/login-action from 3.7.0 to 4.0.0 ([#272](#272)) ([33d6238](33d6238))
* bump docker/metadata-action from 5.10.0 to 6.0.0 ([#270](#270)) ([baee04e](baee04e))
* bump docker/setup-buildx-action from 3.12.0 to 4.0.0 ([#274](#274)) ([5fc06f7](5fc06f7))
* bump sigstore/cosign-installer from 3.9.1 to 4.1.0 ([#275](#275)) ([29dd16c](29dd16c))
* harden CI/CD pipeline ([#92](#92)) ([ce4693c](ce4693c))
* split vulnerability scans into critical-fail and high-warn tiers ([#277](#277)) ([aba48af](aba48af))

### Maintenance

* add /worktree skill for parallel worktree management ([#171](#171)) ([951e337](951e337))
* add design spec context loading to research-link skill ([8ef9685](8ef9685))
* add post-merge-cleanup skill ([#70](#70)) ([f913705](f913705))
* add pre-pr-review skill and update CLAUDE.md ([#103](#103)) ([92e9023](92e9023))
* add research-link skill and rename skill files to SKILL.md ([#101](#101)) ([651c577](651c577))
* bump aiosqlite from 0.21.0 to 0.22.1 ([#191](#191)) ([3274a86](3274a86))
* bump pyyaml from 6.0.2 to 6.0.3 in the minor-and-patch group ([#96](#96)) ([0338d0c](0338d0c))
* bump ruff from 0.15.4 to 0.15.5 ([a49ee46](a49ee46))
* fix M0 audit items ([#66](#66)) ([c7724b5](c7724b5))
* **main:** release ai-company 0.1.1 ([#282](#282)) ([2f4703d](2f4703d))
* pin setup-uv action to full SHA ([#281](#281)) ([4448002](4448002))
* post-audit cleanup — PEP 758, loggers, bug fixes, refactoring, tests, hookify rules ([#148](#148)) ([c57a6a9](c57a6a9))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Signed-off-by: Aurelio <19254254+Aureliolo@users.noreply.github.com>
…ing (#299)

## Summary

- **Upgrade `actions/upload-pages-artifact` v3 → v4** — v4.0.0 ([PR #127](actions/upload-pages-artifact#127)) SHA-pins its internal `actions/upload-artifact` dependency, fixing the `sha_pinning_required` conflict where the composite action's tag reference (`@v4`) was rejected by the repo's Actions permissions policy
- **Add `zizmor` workflow security analysis** — runs on workflow file changes (push to main + PRs), catches unpinned actions, script injection, excessive permissions, and uploads SARIF to the Security tab
- **Add explicit failure on release retry exhaustion** — retry loop now sets a `FOUND` flag so exhaustion surfaces a clear `::error::` instead of falling through to a confusing `gh release edit` failure (Greptile PR #298 finding)

## Context

After merging #298, the Pages workflow failed on main because `upload-pages-artifact` v3 internally called `actions/upload-artifact@v4` (tag, not SHA), violating the repo's `sha_pinning_required: true` setting. This is a [known limitation](actions/runner#2195) with composite actions — GitHub enforces SHA pinning transitively but composite action authors don't always pin their internal deps. v4.0.0 fixed this upstream.

The zizmor workflow provides CI-level enforcement of SHA pinning and other workflow security checks, complementing the repo-level `sha_pinning_required` setting.

## Test plan

- [ ] Pages workflow succeeds on main after merge (v4 upload-pages-artifact)
- [ ] zizmor workflow runs and uploads SARIF on this PR's workflow changes
- [ ] Verify no breaking change from v4 dotfile exclusion (MkDocs/Astro output has no dotfiles)
- [ ] Release retry loop fails clearly after exhaustion (manual verification of logic)
Summary
Addresses open questions raised by three external reviews of the design spec. Adds four new sections and updates existing tables with resolved decisions.
- `RecoveryStrategy` protocol: fail-and-reassign (M3 MVP), checkpoint recovery (M4/M5), with environment reconciliation on resume
- `SandboxBackend` protocol: `SubprocessSandbox` for low-risk tools (file, git), `DockerSandbox` for code execution/terminal, `K8sSandbox` for future container deployments
- Updated project tree (`tools/sandbox/` directory with protocol, subprocess, docker)

Test plan
🤖 Generated with Claude Code
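
For reviewers who want a concrete picture of the fail-and-reassign option named in the summary, here is a minimal sketch of how a pluggable `RecoveryStrategy` protocol might look. Only the names `RecoveryStrategy` and "fail-and-reassign" come from this PR; `Task`, `TaskStatus`, `FailAndReassign`, the `recover` signature, and the `max_attempts` cap are hypothetical illustrations, not the design spec's actual API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Protocol


class TaskStatus(Enum):
    """Hypothetical task states used only for this illustration."""

    PENDING = "pending"    # waiting in the queue for an agent
    ASSIGNED = "assigned"  # currently owned by an agent
    FAILED = "failed"      # given up after too many crashes


@dataclass
class Task:
    """Hypothetical task record; the real domain model may differ."""

    task_id: str
    status: TaskStatus = TaskStatus.ASSIGNED
    assigned_to: str | None = None
    attempts: int = 0
    crash_log: list[str] = field(default_factory=list)


class RecoveryStrategy(Protocol):
    """Assumed shape of the pluggable protocol: one recovery hook."""

    def recover(self, task: Task, error: str) -> Task: ...


class FailAndReassign:
    """Drop the crashed agent's claim and return the task to the queue.

    No partial progress is kept; a checkpoint-based strategy would
    restore saved state here instead of starting over.
    """

    def __init__(self, max_attempts: int = 3) -> None:
        # max_attempts is an assumed safeguard against endless reassignment.
        self.max_attempts = max_attempts

    def recover(self, task: Task, error: str) -> Task:
        task.attempts += 1
        task.crash_log.append(f"{task.assigned_to}: {error}")
        task.assigned_to = None
        if task.attempts >= self.max_attempts:
            task.status = TaskStatus.FAILED
        else:
            task.status = TaskStatus.PENDING
        return task


if __name__ == "__main__":
    strategy: RecoveryStrategy = FailAndReassign(max_attempts=2)
    task = Task(task_id="t-1", assigned_to="agent-a")
    strategy.recover(task, "process exited unexpectedly")
    print(task.status, task.assigned_to, task.attempts)
```

Because the strategy is a structural protocol, a later checkpoint-recovery implementation could be swapped in without changing the caller, which is the property the summary's milestone split (M3 MVP first, M4/M5 later) appears to rely on.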