fix: harden Docker sandbox, MCP bridge, and code runner#194

Merged
Aureliolo merged 6 commits into main from feat/sandboxing-and-mcp-bridge
Mar 10, 2026
Conversation

@Aureliolo
Owner

Summary

  • Docker sandbox: Multi-stage Dockerfile (no curl-pipe-bash), container hardening (PidsLimit=64, ReadonlyRootfs, CapDrop=ALL), resource leak fixes in cleanup/connect paths, memory limit validation, mount_mode default changed to ro
  • MCP client: Connect timeout enforcement via asyncio.wait_for, AsyncExitStack cleanup on all failure paths, proper disconnect error handling with dedicated event constants
  • MCP bridge tool: MCPError caught and returned as error result instead of propagating, cache TypeError resilience, unknown content block handling in result mapper
  • MCP factory: TaskGroup partial failure cleanup (disconnect already-connected clients), extracted _connect_all to reduce complexity, removed unused variable
  • MCP config: connect_timeout_seconds le=120, result_cache_ttl_seconds ge=0, removed dead cache_ttl_seconds field, validation uses MCP_CONFIG_VALIDATION_FAILED event
  • Docker/sandbox config: All fields use Field(description=...), network override validation, NotBlankStr for identifier fields, _BackendName type alias with validator
  • Code runner: Strict arguments["code"]/arguments["language"] access (required params), timeout schema maximum: 600
  • Docs: Updated DESIGN_SPEC.md §15.2 (aiodocker, mcp SDK), §15.3 (mcp/ subpackage, new files), CLAUDE.md (package structure, logging events), README.md (tool system, M7 status)
  • Tests: 15+ new tests covering container hardening, error paths, HTTP transport, protocol conformance, cache disabled path, config bounds

Closes #50, #53
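The container hardening flags above map onto the Docker Engine API's `HostConfig`. As an illustrative sketch (field names per the Engine API; the exact payload the project builds, and helper name `build_container_config`, are assumptions, not the project's code):

```python
def build_container_config(
    image: str, command: list[str], memory_limit: int
) -> dict:
    """Build a hardened container-creation payload using Docker Engine
    API field names. Illustrative sketch, not the project's builder."""
    return {
        "Image": image,
        "Cmd": command,
        "NetworkDisabled": True,  # no network unless explicitly overridden
        "HostConfig": {
            "PidsLimit": 64,          # cap process count (fork-bomb guard)
            "ReadonlyRootfs": True,   # immutable root filesystem
            "CapDrop": ["ALL"],       # drop every Linux capability
            "Memory": memory_limit,   # hard memory ceiling in bytes
            # Writable scratch space despite the read-only rootfs:
            "Tmpfs": {"/tmp": "size=64m,noexec,nosuid"},
        },
    }

cfg = build_container_config(
    "sandbox:latest", ["python", "-c", "print(1)"], 256 * 1024 * 1024
)
```

A payload like this would be passed to the container-create endpoint (e.g. via aiodocker's `containers.create`).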

Test plan

  • ruff check — all passed
  • ruff format — all formatted
  • mypy src/ tests/ — no issues in 678 files
  • pytest tests/ -n auto --cov=ai_company --cov-fail-under=80 — 5391 passed, 95% coverage

Review coverage

Pre-reviewed by 10 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, security-reviewer, docs-consistency). 60 findings consolidated, all implemented.

🤖 Generated with Claude Code

Docker sandbox (#50):
- DockerSandboxConfig, SandboxingConfig for backend selection
- DockerSandbox implementing SandboxBackend via aiodocker
- CodeRunnerTool for Python/JS/Bash execution in containers
- Sandbox Dockerfile (Python 3.14 + Node.js 22)

MCP bridge (#53):
- MCPServerConfig/MCPConfig for stdio and streamable_http transports
- MCPClient thin wrapper over mcp SDK with session management
- MCPBridgeTool exposing MCP tools as native BaseTool instances
- MCPResultCache with TTL + LRU eviction
- MCPToolFactory for parallel server connection and tool discovery
- D18 result mapping (text, image, audio, embedded resources)
- MCP error hierarchy extending ToolError

Both integrated into RootConfig with sandboxing and mcp fields.
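The "TTL + LRU eviction" combination mentioned for MCPResultCache can be sketched with an `OrderedDict`: entries carry a timestamp for TTL expiry, and `move_to_end`/`popitem(last=False)` give LRU ordering. This is a minimal illustration of the pattern, not the actual MCPResultCache implementation:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Minimal TTL + LRU cache sketch (illustrative, not MCPResultCache)."""

    def __init__(self, max_size: int = 128, ttl: float = 60.0):
        self._data: OrderedDict[object, tuple[float, object]] = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self._ttl:
            del self._data[key]       # expired: drop the stale entry
            return None
        self._data.move_to_end(key)   # refresh LRU position on hit
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (time.monotonic(), value)
        if len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict least recently used

cache = TTLLRUCache(max_size=2, ttl=60.0)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # touch "a" so "b" becomes least recently used
cache.put("c", 3)   # capacity exceeded: "b" is evicted
```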
The test iterated through 30 error cycles, each sleeping 1s via
_POLL_TIMEOUT, causing it to take ~30s and crash pytest-xdist workers
at the timeout boundary. Patching _POLL_TIMEOUT to 0.0 makes the test
complete in <1s.
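The fix described above boils down to patching a module-level constant for the duration of a test. A self-contained sketch of that technique (the module and `_POLL_TIMEOUT` usage here are stand-ins built inline, not the project's real module; in pytest the same effect is usually achieved with `monkeypatch.setattr`):

```python
import types
import unittest.mock

# Stand-in for the real module that defines _POLL_TIMEOUT.
mymod = types.ModuleType("mymod")
mymod._POLL_TIMEOUT = 1.0

def poll_total_seconds(cycles: int) -> float:
    """Total sleep time the retry loop would spend, simulated by
    summing instead of actually sleeping."""
    return cycles * mymod._POLL_TIMEOUT

with unittest.mock.patch.object(mymod, "_POLL_TIMEOUT", 0.0):
    fast = poll_total_seconds(30)   # patched: loop completes instantly

slow = poll_total_seconds(30)       # unpatched: 1s per error cycle
```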
Pre-reviewed by 10 agents, 60 findings addressed:
- Docker: multi-stage Dockerfile, container hardening (PidsLimit, ReadonlyRootfs, CapDrop), resource leak fixes, memory limit validation
- MCP client: connect timeout enforcement, resource cleanup on failure, proper disconnect error handling
- MCP bridge: error result wrapping, cache TypeError resilience, unknown content block handling
- MCP factory: TaskGroup partial failure cleanup, complexity reduction
- MCP config: field bounds enforcement, dead field removal, validation event constants
- Docker/sandbox config: Field descriptors, network override validation, mount_mode default ro
- Code runner: strict required param access, timeout schema bounds
- Docs: DESIGN_SPEC §15.2/§15.3, CLAUDE.md, README.md updated
- Tests: 15+ new tests for hardening, error paths, protocol conformance, HTTP transport
Copilot AI review requested due to automatic review settings March 10, 2026 08:11
@github-actions
Contributor

github-actions bot commented Mar 10, 2026

Dependency Review

The following issues were found:
  • ✅ 0 vulnerable package(s)
  • ✅ 0 package(s) with incompatible licenses
  • ✅ 0 package(s) with invalid SPDX license definitions
  • ⚠️ 3 package(s) with unknown licenses.
See the Details below.

License Issues

uv.lock

Package        Version  License  Issue Type
aiodocker      0.26.0   Null     Unknown License
pycparser      3.0      Null     Unknown License
sse-starlette  3.3.2    Null     Unknown License
Allowed Licenses: MIT, MIT-0, Apache-2.0, BSD-2-Clause, BSD-3-Clause, ISC, MPL-2.0, PSF-2.0, Unlicense, 0BSD, CC0-1.0, Python-2.0, Python-2.0.1, LicenseRef-scancode-free-unknown

OpenSSF Scorecard

Scorecard details
Package                  Version  Score    Details
pip/aiodocker            0.26.0   Unknown  Unknown
pip/cffi                 2.0.0    Unknown  Unknown
pip/cryptography         46.0.5   Unknown  Unknown
pip/httpx-sse            0.4.3    Unknown  Unknown
pip/mcp                  1.26.0   Unknown  Unknown
pip/pycparser            3.0      Unknown  Unknown
pip/pydantic-settings    2.13.1   Unknown  Unknown
pip/pyjwt                2.11.0   Unknown  Unknown
pip/python-multipart     0.0.22   Unknown  Unknown
pip/pywin32              311      Unknown  Unknown
pip/sse-starlette        3.3.2    Unknown  Unknown
pip/starlette            0.52.1   Unknown  Unknown

Scanned Files

  • uv.lock

@coderabbitai

coderabbitai bot commented Mar 10, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b5317583-82aa-4559-9267-829fff877ecf

📥 Commits

Reviewing files that changed from the base of the PR and between e777e9b and 106ec1c.

📒 Files selected for processing (10)
  • DESIGN_SPEC.md
  • src/ai_company/observability/events/mcp.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/tools/mcp/result_mapper.py
  • src/ai_company/tools/sandbox/docker_config.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
  • src/ai_company/tools/sandbox/sandboxing_config.py
  • tests/unit/tools/sandbox/test_docker_config.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Docker-based sandbox backend, a Code Runner tool, and MCP bridge/bridge-backed tools added.
  • Observability

    • New event keys for Docker, MCP, and Code Runner to improve tracing.
  • Configuration

    • New sandboxing and MCP configuration surfaces; sandbox mounts default updated to read-only.
  • Documentation

    • README and design docs updated to reflect Docker, code-runner, and MCP progress/status.
  • Tests

    • Extensive unit and integration tests added for sandboxing, MCP pipeline, caching, client, factory, and code runner.

Walkthrough

Adds a Docker-backed sandbox and CodeRunner tool, restructures MCP integration into a new ai_company.tools.mcp package (client, factory, bridge, cache, models, errors, result mapper), introduces observability events for code_runner/docker/mcp, extends config schema (sandboxing, mcp), pins runtime deps, and adds extensive tests and exports.

Changes

Cohort / File(s) Summary
Docs & Manifests
CLAUDE.md, DESIGN_SPEC.md, README.md, pyproject.toml, .github/workflows/dependency-review.yml
Documented Docker/MCP updates; pinned mcp==1.26.0, aiodocker==0.26.0; added mypy ignores and Docker to system requirements; updated allowed licenses.
Observability events
src/ai_company/observability/events/code_runner.py, src/ai_company/observability/events/docker.py, src/ai_company/observability/events/mcp.py
Added typed Final[str] constants for code_runner, docker, and MCP event names used by logging/telemetry.
Sandboxing: Docker backend
docker/sandbox/Dockerfile, src/ai_company/tools/sandbox/docker_config.py, src/ai_company/tools/sandbox/docker_sandbox.py, src/ai_company/tools/sandbox/errors.py, src/ai_company/tools/sandbox/protocol.py, src/ai_company/tools/sandbox/__init__.py, src/ai_company/tools/sandbox/sandboxing_config.py
Introduced DockerSandbox implementation, DockerSandboxConfig, SandboxingConfig (per-category routing), Dockerfile, and updated exports/docstrings.
Code runner tool
src/ai_company/tools/code_runner.py, src/ai_company/tools/__init__.py
Added CodeRunnerTool that validates language, logs events, delegates execution to a SandboxBackend, and returns ToolExecutionResult; re-exported in tools package.
MCP integration (new package)
src/ai_company/tools/mcp/__init__.py, .../client.py, .../config.py, .../factory.py, .../bridge_tool.py, .../cache.py, .../models.py, .../errors.py, .../result_mapper.py
New ai_company.tools.mcp package: immutable server/config models, MCPClient (async transport/discovery/invoke), MCPToolFactory, MCPBridgeTool adapter, MCPResultCache (TTL/LRU), domain models, error types, result mapping, lazy exports.
Config & root defaults
src/ai_company/config/defaults.py, src/ai_company/config/schema.py
Added top-level sandboxing and mcp configuration fields and schema entries.
Exports & public API surface
src/ai_company/tools/__init__.py, src/ai_company/tools/mcp/__init__.py, src/ai_company/tools/sandbox/__init__.py
Extended package exports: CodeRunnerTool, DockerSandbox and configs, SandboxingConfig, and MCP package symbols (with lazy loading).
Tests — integration & unit
tests/integration/tools/*, tests/unit/tools/mcp/*, tests/unit/tools/sandbox/*, tests/unit/tools/test_code_runner.py, tests/unit/api/test_bus_bridge.py, tests/unit/config/conftest.py, ...
Comprehensive unit and integration tests for DockerSandbox, MCP client/factory/bridge/cache/mapper/config/errors, CodeRunnerTool; updated fixtures and new test helpers.

Sequence Diagram(s)

sequenceDiagram
    rect rgba(220,240,255,0.5)
    participant Agent
    participant CodeRunnerTool
    participant DockerSandbox
    participant DockerDaemon
    Agent->>CodeRunnerTool: execute(code, language, timeout)
    CodeRunnerTool->>CodeRunnerTool: validate language\nemit CODE_RUNNER_EXECUTE_START
    CodeRunnerTool->>DockerSandbox: execute(command, args, timeout)
    DockerSandbox->>DockerDaemon: create_container(config)
    DockerDaemon-->>DockerSandbox: container_id
    DockerSandbox->>DockerDaemon: start_container(container_id)
    DockerSandbox->>DockerDaemon: wait_for_exit(container_id, timeout)
    DockerDaemon-->>DockerSandbox: stdout/stderr, exit_code
    DockerSandbox->>DockerDaemon: remove_container(container_id)
    DockerSandbox-->>CodeRunnerTool: SandboxResult
    CodeRunnerTool->>CodeRunnerTool: emit CODE_RUNNER_EXECUTE_SUCCESS / FAILED
    CodeRunnerTool-->>Agent: ToolExecutionResult
    end
sequenceDiagram
    rect rgba(240,255,220,0.5)
    participant Factory as MCPToolFactory
    participant Client as MCPClient
    participant Server as MCP Server
    participant Bridge as MCPBridgeTool
    participant User as Agent
    Factory->>Factory: filter enabled servers
    par connect & discover per server
        Factory->>Client: connect()
        Client->>Server: establish connection
        Server-->>Client: connected
        Client->>Server: list_tools()
        Server-->>Client: [MCPToolInfo...]
        Client-->>Factory: tools
        Factory->>Bridge: create MCPBridgeTool(tool_info, client, cache?)
    and
    end
    Factory-->>User: [MCPBridgeTool...]
    User->>Bridge: execute(arguments)
    Bridge->>Bridge: check cache (MCP_CACHE_HIT / MISS)
    alt cache hit
        Bridge-->>User: cached ToolExecutionResult
    else cache miss
        Bridge->>Client: call_tool(name, arguments)
        Client->>Server: invoke
        Server-->>Client: MCPRawResult
        Client-->>Bridge: raw result
        Bridge->>Bridge: map_call_tool_result -> ToolExecutionResult
        Bridge->>Bridge: store in cache (if eligible)
        Bridge-->>User: ToolExecutionResult
    end
    end

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.16% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'fix: harden Docker sandbox, MCP bridge, and code runner' directly summarizes the primary changes across three major components and is clear and specific.
Description check ✅ Passed The description comprehensively covers Docker sandbox hardening, MCP improvements, configuration changes, code runner enhancements, documentation updates, and test coverage—all of which are reflected in the changeset.
Linked Issues check ✅ Passed The PR successfully implements the core requirements from #50: Docker sandbox with aiodocker, timeout enforcement, resource limits, stdout/stderr capture, cleanup, isolation, and comprehensive test coverage. MCP bridge and code runner enhancements satisfy security hardening objectives.
Out of Scope Changes check ✅ Passed All code changes are directly aligned with the PR objectives: Docker sandbox hardening (Dockerfile, config, implementation), MCP robustness improvements (client, bridge, factory), configuration validation, code runner error handling, documentation updates, and comprehensive test coverage. No unrelated changes detected.



Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR introduces the full MCP bridge subsystem (mcp/ package), a Docker sandbox backend, and the code runner tool, with hardening applied throughout: container security flags, connect-timeout enforcement, resource-leak fixes, and improved observability event discipline. The vast majority of the 60 pre-review findings have been addressed cleanly — prior concerns about ReadonlyRootfs without a /tmp tmpfs, the nobody UID conflict, wrong disconnect log events, and MCP_RESULT_MAPPED reuse are all resolved.

Two issues remain worth addressing before merge:

  • timeout: 0 always times out (code_runner.py): The JSON schema allows "minimum": 0 for the optional timeout parameter, but asyncio.wait_for treats any timeout ≤ 0 as an immediate cancellation. Any LLM or caller that passes timeout: 0 expecting "no timeout" or "minimum timeout" will always receive a timed_out=True result. The fix is to change to "exclusiveMinimum": 0, consistent with how DockerSandboxConfig.timeout_seconds and MCPServerConfig.timeout_seconds are already constrained.
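The behaviour Greptile describes is easy to confirm: `asyncio.wait_for` with `timeout=0` cancels a pending awaitable immediately and raises `TimeoutError`, so a schema that permits `timeout: 0` guarantees a timed-out result. A minimal demonstration (standalone, not the project's code):

```python
import asyncio

async def run_with_timeout(timeout: float) -> bool:
    """Return True if the wrapped task timed out."""
    try:
        await asyncio.wait_for(asyncio.sleep(0.01), timeout=timeout)
    except asyncio.TimeoutError:
        return True
    return False

# timeout=0 cancels the pending sleep immediately -> always times out.
timed_out_zero = asyncio.run(run_with_timeout(0))
# A positive timeout larger than the work's duration completes normally.
timed_out_ok = asyncio.run(run_with_timeout(5))
```

Hence the suggestion to use `"exclusiveMinimum": 0` in the JSON schema, which rejects 0 while still allowing any positive timeout.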

  • Double mcp.invoke.failed events (bridge_tool.py): MCPClient.call_tool() logs MCP_INVOKE_FAILED at exception level before raising MCPInvocationError. MCPBridgeTool._invoke() then catches the MCPError and emits MCP_INVOKE_FAILED a second time. This doubles alert noise and skews count-based dashboards. For MCPTimeoutError, the pair is mcp.invoke.timeout + mcp.invoke.failed, which causes timeouts to appear in general failure aggregations. The bridge-level log should either be removed (client log is sufficient) or demoted to debug.

Confidence Score: 4/5

  • Safe to merge with minor fixes — no critical security or data-loss risks, but the timeout=0 schema edge case and duplicate logging should be addressed.
  • The PR is well-structured with 95% test coverage and all prior review findings resolved. Two issues lower confidence slightly: a schema validation gap that makes timeout=0 always produce a timed-out result (user-facing correctness bug), and duplicate MCP_INVOKE_FAILED log events that inflate operational metrics. Neither is a security or data-integrity issue, but the schema bug is a real correctness problem for callers.
  • src/ai_company/tools/code_runner.py (timeout schema) and src/ai_company/tools/mcp/bridge_tool.py (duplicate log event in _invoke)

Important Files Changed

Filename Overview
src/ai_company/tools/mcp/client.py New MCP client with connect timeout via asyncio.wait_for, lock-protected session access in all paths, and proper AsyncExitStack cleanup on every failure branch. TOCTOU concern from prior review is resolved — _require_session() is called inside the lock in both call_tool and list_tools.
src/ai_company/tools/mcp/factory.py Parallel server connection via TaskGroup with correct partial-failure cleanup (contextlib.suppress wraps disconnect, original ExceptionGroup is always re-raised). Shutdown correctly uses MCP_CLIENT_DISCONNECT_FAILED. Extracted _connect_all reduces complexity.
src/ai_company/tools/mcp/bridge_tool.py MCPError is caught and returned as an error ToolExecutionResult (no propagation). Cache TypeError resilience implemented with dedicated MCP_CACHE_STORE_FAILED for the store path. However, catching MCPError in _invoke emits MCP_INVOKE_FAILED which duplicates the same event already logged by MCPClient.call_tool().
src/ai_company/tools/sandbox/docker_sandbox.py Container hardening applied: PidsLimit=64, ReadonlyRootfs=True, CapDrop=ALL, and a /tmp tmpfs (size=64m,noexec,nosuid) so there is writable scratch space inside the read-only rootfs. Resource leak fixes in cleanup and connect paths look solid.
src/ai_company/tools/code_runner.py Strict required-param access, correct language enum guard, and timeout schema maximum:600. However, "minimum":0 in the timeout schema allows timeout=0.0, which causes asyncio.wait_for to raise TimeoutError immediately — every invocation with timeout=0 returns timed_out=True. Should use "exclusiveMinimum":0.
src/ai_company/tools/mcp/result_mapper.py Unknown content block types now use dedicated MCP_RESULT_UNKNOWN_BLOCK constant (addressing prior review comment). Mapping rules for Text, Image, Audio, and EmbeddedResource content are clear and correct.
src/ai_company/tools/mcp/config.py connect_timeout_seconds capped at le=120, result_cache_ttl_seconds ge=0, dead cache_ttl_seconds field removed. Transport-field and tool-filter validators use MCP_CONFIG_VALIDATION_FAILED. Unique server name validation in MCPConfig is sound.
src/ai_company/tools/sandbox/docker_config.py All fields use Field(description=...), NotBlankStr for identifier fields, network override validation, and mount_mode default changed to "ro". Memory limit validated at model creation time.
docker/sandbox/Dockerfile Multi-stage build (node:22-slim base for Node.js binary, python:3.14-slim final). UID changed to 10001, resolving the prior nobody/65534 collision concern. No curl-pipe-bash pattern.
src/ai_company/tools/mcp/cache.py TTL+LRU cache using OrderedDict. _make_hashable recursively converts dicts/lists/sets to frozensets/tuples. Safe for single-event-loop use as documented. put() correctly avoids caching error results via the is_error guard in bridge_tool.
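The `_make_hashable` behaviour described for cache.py — recursively converting dicts/lists/sets into frozensets/tuples so argument dicts can serve as cache keys — follows a standard pattern. A sketch of that pattern (illustrative; the function name and exact handling in the project's `_make_hashable` may differ):

```python
def make_hashable(value):
    """Recursively convert JSON-ish values into hashable equivalents
    so a dict of tool arguments can be used as a cache key."""
    if isinstance(value, dict):
        # frozenset of (key, hashable-value) pairs: order-independent.
        return frozenset((k, make_hashable(v)) for k, v in value.items())
    if isinstance(value, (list, tuple)):
        return tuple(make_hashable(v) for v in value)
    if isinstance(value, set):
        return frozenset(make_hashable(v) for v in value)
    return value  # str, int, float, bool, None are already hashable

# Equal argument dicts hash equally regardless of key order.
key1 = make_hashable({"query": "x", "opts": {"limit": [1, 2]}})
key2 = make_hashable({"opts": {"limit": [1, 2]}, "query": "x"})
```

Values that are not convertible (arbitrary objects) would raise `TypeError` on use as a key, which is exactly the failure mode the bridge's "cache TypeError resilience" guards against.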

Sequence Diagram

sequenceDiagram
    participant Caller
    participant MCPBridgeTool
    participant MCPResultCache
    participant MCPClient
    participant MCPServer

    Caller->>MCPBridgeTool: execute(arguments)
    MCPBridgeTool->>MCPResultCache: get(tool_name, arguments)
    alt Cache HIT
        MCPResultCache-->>MCPBridgeTool: ToolExecutionResult (cached)
        MCPBridgeTool-->>Caller: ToolExecutionResult
    else Cache MISS / TypeError
        MCPResultCache-->>MCPBridgeTool: None
        MCPBridgeTool->>MCPClient: call_tool(tool_name, arguments)
        MCPClient->>MCPClient: acquire _lock
        MCPClient->>MCPClient: _require_session()
        MCPClient->>MCPServer: asyncio.wait_for(session.call_tool, timeout)
        alt Success
            MCPServer-->>MCPClient: MCPRawResult
            MCPClient-->>MCPBridgeTool: MCPRawResult
            MCPBridgeTool->>MCPBridgeTool: map_call_tool_result(raw)
            MCPBridgeTool->>MCPResultCache: put(tool_name, arguments, result)
            MCPBridgeTool-->>Caller: ToolExecutionResult
        else TimeoutError
            MCPClient->>MCPClient: log MCP_INVOKE_TIMEOUT
            MCPClient-->>MCPBridgeTool: raise MCPTimeoutError
            MCPBridgeTool->>MCPBridgeTool: log MCP_INVOKE_FAILED (duplicate)
            MCPBridgeTool-->>Caller: ToolExecutionResult(is_error=True)
        else MCPInvocationError
            MCPClient->>MCPClient: log MCP_INVOKE_FAILED
            MCPClient-->>MCPBridgeTool: raise MCPInvocationError
            MCPBridgeTool->>MCPBridgeTool: log MCP_INVOKE_FAILED (duplicate)
            MCPBridgeTool-->>Caller: ToolExecutionResult(is_error=True)
        end
    end

Last reviewed commit: 106ec1c


Copilot AI left a comment


Pull request overview

This PR hardens the Docker sandbox, implements the MCP bridge for external tool integration, and adds a code runner tool — completing issues #50 and #53. It introduces security hardening for Docker containers, full MCP client lifecycle management with timeouts and cleanup, and comprehensive test coverage across both new subsystems.

Changes:

  • Docker sandbox: New DockerSandbox backend with container hardening (PidsLimit=64, ReadonlyRootfs, CapDrop=ALL), resource limit validation, and lazy Docker client connection
  • MCP bridge: Full MCP client (MCPClient) with stdio/HTTP transport, MCPBridgeTool wrapping MCP tools as internal tools, TTL+LRU result cache, and MCPToolFactory for parallel server connection with cleanup on partial failure
  • Code runner + config: New CodeRunnerTool, SandboxingConfig/DockerSandboxConfig config models added to RootConfig, 15+ new tests across all new components

Reviewed changes

Copilot reviewed 46 out of 47 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/ai_company/tools/sandbox/docker_sandbox.py New Docker sandbox backend with container hardening and lifecycle management
src/ai_company/tools/sandbox/docker_config.py Pydantic config model for Docker sandbox
src/ai_company/tools/sandbox/sandboxing_config.py Top-level sandboxing config with backend routing
src/ai_company/tools/mcp/client.py MCP client with connect timeout, AsyncExitStack cleanup, and serialized invocation
src/ai_company/tools/mcp/bridge_tool.py MCPBridgeTool wrapping MCP tools with cache support and MCPError handling
src/ai_company/tools/mcp/factory.py MCPToolFactory with TaskGroup parallelism and partial failure cleanup
src/ai_company/tools/mcp/cache.py TTL+LRU result cache
src/ai_company/tools/mcp/config.py MCPServerConfig/MCPConfig with field bounds validation
src/ai_company/tools/mcp/result_mapper.py Maps MCP raw results to internal ToolExecutionResult
src/ai_company/tools/mcp/models.py MCPToolInfo and MCPRawResult value objects
src/ai_company/tools/mcp/errors.py MCP error hierarchy extending ToolError
src/ai_company/tools/code_runner.py CodeRunnerTool delegating to SandboxBackend
src/ai_company/observability/events/mcp.py MCP event constants
src/ai_company/observability/events/docker.py Docker event constants
src/ai_company/observability/events/code_runner.py Code runner event constants
src/ai_company/config/schema.py RootConfig gains sandboxing and mcp fields
docker/sandbox/Dockerfile Multi-stage sandbox image
DESIGN_SPEC.md Updated §15.2/§15.3 with new files and status
CLAUDE.md Updated package structure and logging events
README.md Updated M7 status and tool system description
pyproject.toml / uv.lock Added aiodocker==0.26.0 and mcp==1.26.0
Test files 15+ new tests covering all new components

Issues found:

  1. Resource leak in _ensure_docker (src/ai_company/tools/sandbox/docker_sandbox.py, lines 132–141): When client.version() raises an exception, the newly created aiodocker.Docker() client is not closed before SandboxStartError is raised. The underlying aiohttp session/connector is leaked. The except branch should call await client.close() before re-raising.

  2. Wrong event constant in shutdown (src/ai_company/tools/mcp/factory.py, lines 144–148): MCP_FACTORY_SERVER_SKIPPED is used to log disconnect failures during shutdown(). This event implies a server was skipped during factory startup, not that a disconnect failed. The already-defined MCP_CLIENT_DISCONNECT_FAILED constant should be used instead.

  3. Documentation typo in DESIGN_SPEC.md (lines 3029, 3034): The directory tree uses McpBridgeTool but the actual class name is MCPBridgeTool.

  4. Documentation error in DESIGN_SPEC.md (line 3030): cache.py is described as "Tool schema caching" but it provides MCP invocation result caching (TTL+LRU), not schema caching.



Comment on lines +144 to +148
logger.warning(
MCP_FACTORY_SERVER_SKIPPED,
server=client.server_name,
reason=f"disconnect failed: {exc}",
)

Copilot AI Mar 10, 2026


In shutdown, the MCP_FACTORY_SERVER_SKIPPED event constant is used to log disconnect failures. This event name implies a server was skipped during factory setup, not that disconnection failed. A dedicated event constant (e.g., MCP_CLIENT_DISCONNECT_FAILED already exists in mcp.py) should be used here instead, or at minimum the existing MCP_CLIENT_DISCONNECT_FAILED constant should be imported and used for this disconnect failure log.

DESIGN_SPEC.md Outdated
Comment on lines +3029 to +3034
│ │ ├── bridge_tool.py # McpBridgeTool (BaseTool integration)
│ │ ├── cache.py # Tool schema caching
│ │ ├── client.py # MCP client wrapper
│ │ ├── config.py # MCP server/bridge config models
│ │ ├── errors.py # MCP error hierarchy
│ │ ├── factory.py # McpBridgeTool factory

Copilot AI Mar 10, 2026


The DESIGN_SPEC.md uses McpBridgeTool (Pascal case mixing) in the directory tree comments (lines 3029 and 3034), but the actual class is named MCPBridgeTool. These documentation comments in the spec are incorrect and should be updated to MCPBridgeTool.

DESIGN_SPEC.md Outdated
│ │ └── mcp/ # MCP bridge subpackage
│ │ ├── __init__.py # Package exports
│ │ ├── bridge_tool.py # McpBridgeTool (BaseTool integration)
│ │ ├── cache.py # Tool schema caching

Copilot AI Mar 10, 2026


The DESIGN_SPEC.md describes cache.py as "Tool schema caching" but the actual module provides MCP tool result caching (TTL+LRU), not schema caching. The comment should read something like "MCP result cache (TTL + LRU)".

Comment on lines +132 to +141
try:
client = aiodocker.Docker()
await client.version()
except Exception as exc:
logger.exception(
DOCKER_DAEMON_UNAVAILABLE,
error=str(exc),
)
msg = f"Docker daemon unavailable: {exc}"
raise SandboxStartError(msg) from exc

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In _ensure_docker, if client.version() throws an exception, the aiodocker.Docker() client instance (client) is created but never closed before raising SandboxStartError. This leaks the underlying aiohttp session/connector. The try/except/else pattern does not close client in the exception path. The fix is to close the client in the except block before re-raising.
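The close-before-reraise pattern Copilot suggests can be shown in isolation. The stand-in client below exists only to make the sketch runnable here; in the real code the client is `aiodocker.Docker()` and `close()` releases its aiohttp session:

```python
import asyncio

class FakeDockerClient:
    """Stand-in for aiodocker.Docker; tracks whether close() ran."""

    def __init__(self, daemon_up: bool):
        self._daemon_up = daemon_up
        self.closed = False

    async def version(self):
        if not self._daemon_up:
            raise RuntimeError("daemon unavailable")
        return {"Version": "fake"}

    async def close(self):
        self.closed = True

async def ensure_docker(client: FakeDockerClient) -> FakeDockerClient:
    try:
        await client.version()
    except Exception as exc:
        # Close before re-raising so the underlying session isn't leaked.
        await client.close()
        msg = f"Docker daemon unavailable: {exc}"
        raise ConnectionError(msg) from exc
    return client

bad = FakeDockerClient(daemon_up=False)
try:
    asyncio.run(ensure_docker(bad))
except ConnectionError:
    pass  # raised as expected; the client was closed first
```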


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 22

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
DESIGN_SPEC.md (1)

2123-2134: ⚠️ Potential issue | 🟡 Minor

Update the sandbox example to the hardened default.

This YAML still shows mount_mode: "rw", which contradicts the PR's read-only-by-default hardening and leaves the spec advertising the old behavior. As per coding guidelines, "When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@DESIGN_SPEC.md` around lines 2123 - 2134, The docker example in
DESIGN_SPEC.md still shows mount_mode: "rw" which contradicts the hardened
default; update the docker example block so mount_mode is the read-only hardened
default (replace "rw" with "ro") and, if the spec wants to document the common
exception for the workspace, add a brief note next to mount_mode or in the
docker section describing that only the workspace is mounted read-write via a
separate workspace mount setting (referencing the docker keys mount_mode and
allowed_hosts to keep context).
tests/unit/observability/test_events.py (1)

175-215: ⚠️ Potential issue | 🟡 Minor

Add exact-value checks for the new event domains.

Adding code_runner, docker, and mcp to discovery is good, but there are still no domain-specific assertions for those public constants. A renamed event would still satisfy the generic discovery, pattern, and duplicate checks. As per coding guidelines, tests/**/*.py: Prefer @pytest.mark.parametrize for testing similar cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/observability/test_events.py` around lines 175 - 215, Test
test_all_domain_modules_discovered only checks module discovery; add a new
parametrized test (use `@pytest.mark.parametrize`) that iterates over the new
domain modules code_runner, docker, and mcp and asserts each module exposes its
expected public event constant(s). Specifically, import events.code_runner,
events.docker, events.mcp inside the test and for each module assert
hasattr(module, "<EXPECTED_CONSTANT_NAME>") (replace <EXPECTED_CONSTANT_NAME>
with the actual public constant names defined in those modules) so renaming the
module won’t mask a renamed/removed event; keep
test_all_domain_modules_discovered as-is and add this new test next to it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 3022-3036: The DESIGN_SPEC.md tree is out of sync with the
implementation: update the spec to reflect the actual module path
ai_company.tools.sandbox.sandboxing_config (instead of
tools/sandboxing_config.py at the root) and reconcile the tooling name mismatch
by using the exact class name used in code (change MCPBridgeTool → McpBridgeTool
in the spec or update the code to MCPBridgeTool consistently); ensure you
mention this divergence to the reviewer/user per the guideline (alert user and
explain the change) so they can decide whether to update implementation or spec.

In `@README.md`:
- Line 18: Update the README to document Docker as an explicit runtime
prerequisite for the "Tool System" sandbox and code execution features: add
Docker (and Docker Compose if applicable) to the prerequisites list, provide
concise install links/commands for common platforms, and include brief
instructions to build or pull the sandbox image used by the sandbox abstraction
(mention the sandbox image name or build command and the command to run it) so
users won’t encounter Docker-unavailable errors when using the sandbox, git, and
code runner functionality.

In `@src/ai_company/tools/code_runner.py`:
- Around line 77-151: The execute method currently lets sandbox startup
exceptions from self._sandbox.execute bubble up and exceeds the desired length;
wrap the call to self._sandbox.execute in a try/except that catches generic
backend/startup errors (e.g., Exception) and on exception log
CODE_RUNNER_EXECUTE_FAILED with context (language) and return a
ToolExecutionResult with is_error=True, an informative content string including
the caught error, and metadata (language and a returncode or marker). Also
refactor the success/failure mapping into small helpers (e.g.,
_map_success_result(result) and _map_failure_result(result) or a single
_map_execution_result(result, language)) and call them from execute so execute
stays under 50 lines while preserving existing logging and metadata keys.
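A sketch of the wrapped execute plus one extracted mapping helper, under the assumption of a simplified `ToolExecutionResult` (the dataclass below mirrors the field names in the comment, not the project's real class):

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the project's ToolExecutionResult.
@dataclass
class ToolExecutionResult:
    content: str
    is_error: bool = False
    metadata: dict = field(default_factory=dict)


async def execute(sandbox, code: str, language: str) -> ToolExecutionResult:
    """Run code via the sandbox, converting startup failures into error results."""
    try:
        result = await sandbox.execute(code, language)
    except Exception as exc:
        # Real code would log CODE_RUNNER_EXECUTE_FAILED with language context.
        return ToolExecutionResult(
            content=f"sandbox execution failed: {exc}",
            is_error=True,
            metadata={"language": language, "returncode": None},
        )
    return _map_execution_result(result, language)


def _map_execution_result(result, language: str) -> ToolExecutionResult:
    """Extracted helper keeping execute() under the 50-line limit."""
    ok = result.returncode == 0
    return ToolExecutionResult(
        content=result.stdout if ok else result.stderr,
        is_error=not ok,
        metadata={"language": language, "returncode": result.returncode},
    )
```

The try/except only covers the sandbox call itself, so mapping bugs still surface normally instead of being masked as sandbox failures.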

In `@src/ai_company/tools/mcp/bridge_tool.py`:
- Around line 65-133: The execute method is doing cache lookup, remote
invocation, error handling, result mapping, and cache writeback all inline;
split it into small private helpers to meet the <50-line rule. Create helpers
like _get_cached_result(self, arguments) to perform the cache.get and return
cached or None (and emit MCP_CACHE_HIT), _invoke_remote_tool(self, arguments) to
call self._client.call_tool and translate MCPError into a ToolExecutionResult
(or rethrow/return an error sentinel), and _write_cache(self, arguments, result)
to perform self._cache.put with the TypeError handling/logging (MCP_CACHE_MISS).
Refactor execute to orchestrate: call _get_cached_result, return if present, log
MCP_INVOKE_START, call _invoke_remote_tool, map_call_tool_result on the raw
response if needed, call _write_cache, and return the final ToolExecutionResult;
preserve all existing log messages, MCP constants, and behavior of
map_call_tool_result and error handling.
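The decomposition could look like the sketch below. The cache, client, and error type are simplified stand-ins (a plain dict, an injected coroutine, and `RuntimeError` in place of the real `MCPResultCache`, `MCPClient`, and `MCPError`); only the orchestration shape is the point.

```python
class MCPBridgeSketch:
    """execute() orchestrates only; each concern lives in a small helper."""

    def __init__(self, call_tool):
        self._call_tool = call_tool
        self._cache: dict = {}

    async def execute(self, arguments: dict) -> dict:
        key = repr(sorted(arguments.items()))
        cached = self._get_cached_result(key)
        if cached is not None:
            return cached                      # real code logs MCP_CACHE_HIT
        result = await self._invoke_remote_tool(arguments)
        self._write_cache(key, result)
        return result

    def _get_cached_result(self, key):
        return self._cache.get(key)

    async def _invoke_remote_tool(self, arguments: dict) -> dict:
        try:
            return await self._call_tool(arguments)
        except RuntimeError as exc:            # stands in for MCPError
            return {"is_error": True, "content": str(exc)}

    def _write_cache(self, key, result: dict) -> None:
        if not result.get("is_error"):
            self._cache[key] = result
```

Each helper stays trivially testable in isolation, and `execute` itself is well under the 50-line rule.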
- Around line 81-125: Wrap the cache read in a try/except to treat
unhashable-arguments TypeError from self._cache.get(...) as a cache miss (i.e.,
mirror the existing fallback used on writes), and when storing results with
self._cache.put(...) skip caching if the mapped result (from
map_call_tool_result(raw)) indicates an error (result.is_error is True) to avoid
negative caching of transient MCP failures; locate these changes around the call
flow involving self._cache.get, self._client.call_tool, map_call_tool_result,
and self._cache.put inside the BridgeTool invoke logic so reads gracefully fall
back on TypeError and only successful results are cached.
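A sketch of both behaviors together, assuming a key function that hashes an argument tuple (so unhashable values raise `TypeError`, as in the real cache):

```python
class HashedCache:
    """Minimal cache whose keys must be hashable, like the real key-making logic."""

    def __init__(self):
        self._store = {}

    def _key(self, arguments: dict):
        return tuple(sorted(arguments.items()))  # unhashable values -> TypeError

    def get(self, arguments):
        return self._store.get(self._key(arguments))

    def put(self, arguments, result):
        self._store[self._key(arguments)] = result


def read_through(cache, arguments, fetch):
    """Cache read treating unhashable arguments as a miss, mirroring the write path."""
    try:
        cached = cache.get(arguments)
    except TypeError:
        cached = None          # real code logs the fallback and proceeds uncached
    if cached is not None:
        return cached
    result = fetch(arguments)
    if not result.get("is_error"):   # never cache transient MCP failures
        try:
            cache.put(arguments, result)
        except TypeError:
            pass                     # mirror the read-side fallback
    return result
```

With both guards in place, unhashable arguments degrade to an always-miss path instead of raising, and error results are never replayed from cache.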

In `@src/ai_company/tools/mcp/cache.py`:
- Around line 46-81: The cache currently stores and returns the original
ToolExecutionResult instances, which can contain mutable metadata; change the
behavior in the Cache class so that when inserting results (where results are
stored via the same cache logic as in the get/set flow) you store a
copy.deepcopy(result) and in the get(self, ...) method return
copy.deepcopy(result) instead of the original object; update references in the
get method (and the corresponding set/put method that uses _make_key and _cache)
to perform deep copies of ToolExecutionResult (import copy.deepcopy) so callers
always receive independent snapshots and the internal cache holds immutable
copies.
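The snapshot behavior in isolation, with TTL and key-making omitted (the real class is `MCPResultCache`; this is only the deep-copy discipline):

```python
import copy


class SnapshotCache:
    """Stores and returns deep copies so callers never share mutable metadata."""

    def __init__(self):
        self._store: dict = {}

    def put(self, key, result) -> None:
        # insulate the cached entry from later mutation by the caller
        self._store[key] = copy.deepcopy(result)

    def get(self, key):
        entry = self._store.get(key)
        # every caller receives an independent snapshot
        return copy.deepcopy(entry) if entry is not None else None
```

Copying on both put and get means neither the producer nor any consumer can corrupt the cached entry for subsequent readers.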

In `@src/ai_company/tools/mcp/client.py`:
- Around line 78-99: Multiple methods access and mutate the shared
_session/_exit_stack without synchronization, risking concurrent
connect/disconnect/calls; wrap all accesses and mutations of _session and
_exit_stack with the existing _lock (await self._lock) — specifically modify
connect(), disconnect(), list_tools(), and call_tool() so that session creation
(assigning self._session/self._exit_stack), session teardown (closing and
clearing them), and any use of the session are performed while holding the lock;
ensure connect() acquires the lock before checking/setting _session,
disconnect() acquires it before closing/clearing _session/_exit_stack,
list_tools() acquires it while using the session, and call_tool() continues to
hold the lock for the full lifetime of the session call to prevent interleaving.
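The locking discipline can be sketched as below. The session factory is injected to keep the sketch self-contained; `connect`/`disconnect`/`call_tool` mirror the method names in the comment, not the real `MCPClient` signatures.

```python
import asyncio


class LockedClientSketch:
    """All _session access and mutation serialized under one asyncio.Lock."""

    def __init__(self, make_session):
        self._make_session = make_session
        self._lock = asyncio.Lock()
        self._session = None

    async def connect(self) -> None:
        async with self._lock:                 # check-and-set under the lock
            if self._session is None:
                self._session = await self._make_session()

    async def disconnect(self) -> None:
        async with self._lock:                 # teardown under the lock
            if self._session is not None:
                await self._session.close()
                self._session = None

    async def call_tool(self, name: str, args: dict):
        async with self._lock:                 # held for the whole session call
            if self._session is None:
                raise RuntimeError("not connected")
            return await self._session.call(name, args)
```

Holding the lock for the full `call_tool` body is what prevents a concurrent `disconnect` from tearing down the session mid-call.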
- Around line 210-215: The shallow copy using dict(t.inputSchema) leaves nested
structures shared with SDK objects; replace these with deep copies at the system
boundary—use copy.deepcopy(...) when constructing MCPToolInfo (the input_schema
argument) and when assigning structuredContent into MCPRawResult so that nested
dicts/lists are fully copied and immutable once wrapped by the frozen Pydantic
models (MCPToolInfo, MCPRawResult); update the creation sites (the MCPToolInfo
construction and the result.structuredContent assignment locations referenced in
the diff) to call copy.deepcopy on SDK-owned dict/list fields.

In `@src/ai_company/tools/mcp/factory.py`:
- Around line 139-148: In the disconnect loop inside the MCP factory (in
factory.py where the method iterates self._clients and calls await
client.disconnect()), replace the MCP_FACTORY_SERVER_SKIPPED event used in the
exception handler with the appropriate failure event constant (e.g.,
MCP_FACTORY_SERVER_FAILURE or a dedicated MCP_FACTORY_SERVER_DISCONNECT_FAILED)
so the disconnect exception is logged/reported as an operational failure; update
the logger.warning call to use that failure event and keep the same context keys
(server and reason).
- Around line 154-169: In _connect_and_discover, ensure the MCPClient created by
MCPClient(config) is disconnected if discovery fails: after await
client.connect(), wrap the await client.list_tools() call in try/except (or
try/finally) and call await client.disconnect() before re-raising the exception
so the connected client isn’t leaked from _connect_all()'s cleanup; keep
returned value as (client, tools) on success and re-raise the original error
after disconnect on failure.
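The shape of the fix, with the client construction elided (the real function also builds the `MCPClient` from its config before connecting):

```python
async def connect_and_discover(client):
    """Connect, then discover; disconnect before re-raising on discovery failure."""
    await client.connect()
    try:
        tools = await client.list_tools()
    except Exception:
        await client.disconnect()  # don't leak a connected client to _connect_all
        raise
    return client, tools
```

On success the caller owns a connected client plus its tools; on failure the original exception propagates after cleanup, so `_connect_all`'s error handling is unchanged.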
- Around line 37-39: The create_tools method currently allows reuse of the
factory and appends a new client set to self._clients without tearing down
previous connections; modify create_tools to guard against reuse by either (a)
disconnecting/closing all existing MCPClient instances in self._clients (calling
their disconnect/close method) and clearing self._clients before creating new
clients, or (b) raising an error if create_tools is called when self._clients is
non-empty; additionally refactor create_tools by extracting server filtering
into a helper (e.g., _filter_servers) and bridge construction into a helper
(e.g., _build_bridge_tools) so the method stays under 50 lines while ensuring
previous connections are cleaned up.
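Option (b) from the comment, the reuse guard, is the smaller change and can be sketched as:

```python
class FactorySketch:
    """Refuse reuse: create_tools() on a factory with live clients is an error.

    Option (a) from the comment would disconnect and clear self._clients
    instead of raising; this sketch shows option (b).
    """

    def __init__(self):
        self._clients: list = []

    async def create_tools(self, configs, connect):
        if self._clients:
            raise RuntimeError(
                "factory already has live clients; call shutdown() first"
            )
        for cfg in configs:
            self._clients.append(await connect(cfg))
        return list(self._clients)
```

Raising keeps the invariant simple (one factory, one client set) and makes an accidental double call loud instead of silently leaking the first connection set.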

In `@src/ai_company/tools/sandbox/docker_sandbox.py`:
- Around line 213-225: The DockerSandbox backend currently ignores
DockerSandboxConfig.allowed_hosts and forces container removal regardless of
DockerSandboxConfig.auto_remove; update the code that builds host_config (the
block that sets "Binds", "Memory", "NanoCpus", "NetworkMode", "AutoRemove",
etc.) to use self._config.auto_remove instead of the hard-coded False and, if
self._config.allowed_hosts is set, populate the container host configuration
appropriately (e.g., map allowed_hosts into Docker's ExtraHosts or the network
settings) so the config is honored; also change the finally cleanup where
container.remove is always called to only remove the container when
self._config.auto_remove is true (and ensure related occurrences around the
other similar blocks noted are updated similarly), referencing _config,
host_config, allowed_hosts, auto_remove, and the container.remove call to locate
the changes.
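A sketch of the host-config construction driven by config rather than hard-coded values. The `SandboxCfg` dataclass is a hypothetical stand-in for `DockerSandboxConfig`, and mapping `allowed_hosts` onto `ExtraHosts` (`"host:ip"` entries) is one option the comment suggests; the dict keys are Docker Engine API `HostConfig` field names.

```python
from dataclasses import dataclass


@dataclass
class SandboxCfg:
    auto_remove: bool
    allowed_hosts: tuple = ()
    network_mode: str = "none"
    memory_bytes: int = 256 * 1024 * 1024


def build_host_config(cfg: SandboxCfg) -> dict:
    """Build the container HostConfig from config rather than hard-coded values."""
    host_config = {
        "Memory": cfg.memory_bytes,
        "PidsLimit": 64,
        "ReadonlyRootfs": True,
        "CapDrop": ["ALL"],
        "NetworkMode": cfg.network_mode,
        "AutoRemove": cfg.auto_remove,   # was hard-coded False
    }
    if cfg.allowed_hosts:
        # one way to honor allowed_hosts: "host:ip" entries in ExtraHosts
        host_config["ExtraHosts"] = list(cfg.allowed_hosts)
    return host_config
```

The same `cfg.auto_remove` flag would then gate the `container.remove` call in the cleanup path, so the two never disagree.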
- Around line 491-500: The code is trimming captured process output with
.strip(), which can corrupt whitespace-sensitive results; change the assignment
of stdout and stderr to preserve exact emitted output by using stdout =
"".join(stdout_logs) and stderr = "".join(stderr_logs) (leave the
container_obj.log calls and variables stdout_logs/stderr_logs as-is), and ensure
any downstream uses/returns of stdout and stderr rely on these unmodified
strings.
- Around line 555-573: The cleanup() path stops tracked containers but never
removes them, leaving stopped containers orphaned; update cleanup() (which
references self._tracked_containers, self._docker and calls
self._stop_container) to also call self._remove_container(self._docker, cid) for
each tracked container after a successful stop (or attempt removal regardless,
handling/logging failures per-container), ensure removals run before clearing
_tracked_containers and before tearing down/closing self._docker, and preserve
the existing try/except/finally semantics so self._docker is set to None and
_tracked_containers is cleared only after removal attempts and proper error
logging.

In `@src/ai_company/tools/sandbox/sandboxing_config.py`:
- Line 27: The overrides dict currently uses dict[str, str], allowing
empty/whitespace keys; change its annotation to dict[NotBlankStr, str] so
pydantic will reject blank category names: update the type on the overrides
Field (overrides: dict[NotBlankStr, str] = Field(...)) and add the NotBlankStr
import where NotBlankStr is defined/available, then run/update any affected
imports/usages of overrides to match the stronger key type.

In `@tests/integration/tools/test_docker_sandbox_integration.py`:
- Around line 51-70: The test uses a non-existent Docker image
"python:3.14-slim" causing CI failures; update the DockerSandboxConfig image to
a real published tag (e.g. "python:3.11-slim" or "python:3.12-slim") in the test
function test_run_python_code and in the other test block referenced (the second
DockerSandboxConfig instance around lines 71-90) so DockerSandbox receives a
valid image name; keep the image string change only, leaving
DockerSandboxConfig, DockerSandbox, test_run_python_code, and the execute(...)
call unchanged.
- Around line 21-38: The _docker_available() helper leaks the aiodocker.Docker
client when exceptions occur after creation; change the inner async _check() to
ensure the client is always closed by creating the client first, then using a
try/finally (or try/except/finally) around await client.version() and any
subsequent operations so client.close() is awaited in the finally block; update
references to client = aiodocker.Docker() and the existing await client.close()
placement accordingly to guarantee cleanup on all paths.
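The guaranteed-cleanup shape, with the client factory injected so the sketch stands alone (in the real helper, `make_client` is `aiodocker.Docker`, and `version()`/`close()` are its coroutine methods):

```python
async def docker_available(make_client) -> bool:
    """Probe the daemon; the client is closed on every path after creation."""
    try:
        client = make_client()
    except Exception:
        return False  # nothing to close: creation itself failed
    try:
        await client.version()
    except Exception:
        return False
    finally:
        await client.close()  # runs whether version() succeeded or raised
    return True
```

Creating the client before entering the try/finally is the key ordering: only a successfully created client is ever owed a `close()`.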

In `@tests/integration/tools/test_mcp_integration.py`:
- Around line 17-31: The helper _make_connected_client mutates private state by
setting client._session directly, bypassing MCPClient.connect() and disconnect()
and thus skipping the new timeout and AsyncExitStack cleanup logic; change the test
to drive the public lifecycle: provide a fake transport/session factory or patch
the factory function used by MCPClient to return a mock session (with AsyncMock
list_tools and call_tool behaviors) and then call client.connect() and
client.disconnect() (or use async context manager) so the real
connection/cleanup paths in MCPClient.connect and MCPClient.disconnect are
exercised instead of assigning client._session directly.

In `@tests/unit/tools/mcp/conftest.py`:
- Around line 148-164: The mock_client pytest fixture declares an unused
parameter sample_tool_info; remove sample_tool_info from the mock_client fixture
signature (def mock_client(stdio_server_config: MCPServerConfig) -> MCPClient:)
and update any callers if necessary so the fixture no longer requires
sample_tool_info, keeping the rest of the setup that assigns client._session
using AsyncMock list_tools/call_tool; alternatively, if the fixture must depend
on sample_tool_info for implicit setup, reference it in the body (e.g., a no-op
access) or rename it to _sample_tool_info to silence the unused-argument
warning.

In `@tests/unit/tools/mcp/test_bridge_tool.py`:
- Around line 83-101: Update the test_empty_input_schema_yields_none_parameters
to use the existing mock_client fixture instead of constructing MCPClient and
setting _session manually: change the test signature to accept mock_client, use
that mock_client (optionally ensure mock_client._session is an AsyncMock if
needed), instantiate MCPBridgeTool with tool_info and client=mock_client, and
remove the manual MCPClient creation and _session assignment so the test matches
other tests using the shared mock_client fixture.

In `@tests/unit/tools/mcp/test_result_mapper.py`:
- Around line 18-201: Add a new unit test in TestMixedContent that constructs an
MCPRawResult whose content includes a fake/unexpected block object with a type
value like "mystery", call map_call_tool_result on it, and assert the mapped
content contains the placeholder "[unknown: mystery]" (exact placeholder
produced by the fallback branch) and that result.metadata does not contain any
leaked "attachments" entry (or that "attachments" is absent or empty). Reference
map_call_tool_result and MCPRawResult to locate where to build the input and
check result.metadata["attachments"] to ensure no attachments are produced.
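The fallback branch under test can be sketched as below. This is a simplified stand-in for `map_call_tool_result` over `MCPRawResult` content (here plain dicts with a `type` key); the `[unknown: ...]` placeholder format mirrors the comment's expectation.

```python
def map_blocks(blocks: list[dict]) -> tuple[str, dict]:
    """Render content blocks; unknown types become a placeholder, no attachments."""
    parts = []
    metadata: dict = {}
    for block in blocks:
        kind = block.get("type")
        if kind == "text":
            parts.append(block["text"])
        else:
            # fallback branch: placeholder text, nothing leaked into metadata
            parts.append(f"[unknown: {kind}]")
    return "\n".join(parts), metadata
```

The test then only has to feed a `"mystery"` block and assert on the placeholder plus the absence of an `attachments` entry.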

---

Outside diff comments:
In `@DESIGN_SPEC.md`:
- Around line 2123-2134: The docker example in DESIGN_SPEC.md still shows
mount_mode: "rw" which contradicts the hardened default; update the docker
example block so mount_mode is the read-only hardened default (replace "rw" with
"ro") and, if the spec wants to document the common exception for the workspace,
add a brief note next to mount_mode or in the docker section describing that
only the workspace is mounted read-write via a separate workspace mount setting
(referencing the docker keys mount_mode and allowed_hosts to keep context).

In `@tests/unit/observability/test_events.py`:
- Around line 175-215: Test test_all_domain_modules_discovered only checks
module discovery; add a new parametrized test (use `@pytest.mark.parametrize`)
that iterates over the new domain modules code_runner, docker, and mcp and
asserts each module exposes its expected public event constant(s). Specifically,
import events.code_runner, events.docker, events.mcp inside the test and for
each module assert hasattr(module, "<EXPECTED_CONSTANT_NAME>") (replace
<EXPECTED_CONSTANT_NAME> with the actual public constant names defined in those
modules) so a renamed or removed event constant can no longer slip past the
generic discovery check; keep
test_all_domain_modules_discovered as-is and add this new test next to it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d0fea601-118a-435b-9f14-c8a7c70d21cb

📥 Commits

Reviewing files that changed from the base of the PR and between 8c39742 and a093d4d.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (46)
  • CLAUDE.md
  • DESIGN_SPEC.md
  • README.md
  • docker/sandbox/Dockerfile
  • pyproject.toml
  • src/ai_company/config/defaults.py
  • src/ai_company/config/schema.py
  • src/ai_company/observability/events/code_runner.py
  • src/ai_company/observability/events/docker.py
  • src/ai_company/observability/events/mcp.py
  • src/ai_company/tools/__init__.py
  • src/ai_company/tools/code_runner.py
  • src/ai_company/tools/mcp/__init__.py
  • src/ai_company/tools/mcp/bridge_tool.py
  • src/ai_company/tools/mcp/cache.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/tools/mcp/config.py
  • src/ai_company/tools/mcp/errors.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/tools/mcp/models.py
  • src/ai_company/tools/mcp/result_mapper.py
  • src/ai_company/tools/sandbox/__init__.py
  • src/ai_company/tools/sandbox/docker_config.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
  • src/ai_company/tools/sandbox/errors.py
  • src/ai_company/tools/sandbox/protocol.py
  • src/ai_company/tools/sandbox/sandboxing_config.py
  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/integration/tools/test_mcp_integration.py
  • tests/unit/api/test_bus_bridge.py
  • tests/unit/config/conftest.py
  • tests/unit/observability/test_events.py
  • tests/unit/tools/mcp/__init__.py
  • tests/unit/tools/mcp/conftest.py
  • tests/unit/tools/mcp/test_bridge_tool.py
  • tests/unit/tools/mcp/test_cache.py
  • tests/unit/tools/mcp/test_client.py
  • tests/unit/tools/mcp/test_config.py
  • tests/unit/tools/mcp/test_errors.py
  • tests/unit/tools/mcp/test_factory.py
  • tests/unit/tools/mcp/test_result_mapper.py
  • tests/unit/tools/sandbox/test_docker_config.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
  • tests/unit/tools/test_code_runner.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Agent
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality

Files:

  • README.md
  • DESIGN_SPEC.md
  • CLAUDE.md
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) — ruff enforces this on Python 3.14
All public functions must have type hints; mypy strict mode is enforced
Public classes and functions must have Google-style docstrings (enforced by ruff D rules)
Every module with business logic must include logging setup: from ai_company.observability import get_logger then logger = get_logger(__name__). Never use import logging or print() in application code. Variable name must always be logger
Use structured logging with event name constants: logger.info(EVENT_CONSTANT, key=value) — never use string interpolation like logger.info('msg %s', val). Import event constants directly from ai_company.observability.events.<domain>
All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Line length maximum is 88 characters (enforced by ruff)
Functions must be < 50 lines; files must be < 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries

Files:

  • src/ai_company/tools/mcp/errors.py
  • tests/unit/tools/mcp/conftest.py
  • tests/unit/tools/mcp/test_factory.py
  • src/ai_company/tools/sandbox/__init__.py
  • src/ai_company/tools/mcp/models.py
  • tests/integration/tools/test_mcp_integration.py
  • tests/unit/tools/sandbox/test_docker_config.py
  • src/ai_company/tools/__init__.py
  • tests/unit/api/test_bus_bridge.py
  • src/ai_company/tools/code_runner.py
  • tests/unit/config/conftest.py
  • src/ai_company/observability/events/code_runner.py
  • src/ai_company/tools/sandbox/sandboxing_config.py
  • tests/unit/tools/test_code_runner.py
  • src/ai_company/tools/mcp/bridge_tool.py
  • src/ai_company/config/defaults.py
  • tests/unit/observability/test_events.py
  • src/ai_company/tools/mcp/cache.py
  • tests/unit/tools/mcp/test_cache.py
  • tests/unit/tools/mcp/test_result_mapper.py
  • src/ai_company/config/schema.py
  • tests/unit/tools/mcp/test_config.py
  • tests/unit/tools/mcp/test_bridge_tool.py
  • tests/unit/tools/mcp/test_errors.py
  • tests/integration/tools/test_docker_sandbox_integration.py
  • src/ai_company/tools/mcp/result_mapper.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/observability/events/docker.py
  • src/ai_company/tools/mcp/config.py
  • tests/unit/tools/mcp/test_client.py
  • src/ai_company/tools/sandbox/protocol.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
  • src/ai_company/tools/sandbox/docker_config.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • src/ai_company/tools/mcp/__init__.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/observability/events/mcp.py
  • src/ai_company/tools/sandbox/errors.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/mcp/__init__.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model
Use @computed_field for derived values instead of storing + validating redundant fields (e.g. TokenUsage.total_tokens); use NotBlankStr from core.types for all identifier/name fields—including optional (NotBlankStr | None) and tuple variants—instead of manual whitespace validators
Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType for read-only enforcement. For dict/list fields in frozen Pydantic models, rely on frozen=True and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task

Files:

  • src/ai_company/tools/mcp/errors.py
  • src/ai_company/tools/sandbox/__init__.py
  • src/ai_company/tools/mcp/models.py
  • src/ai_company/tools/__init__.py
  • src/ai_company/tools/code_runner.py
  • src/ai_company/observability/events/code_runner.py
  • src/ai_company/tools/sandbox/sandboxing_config.py
  • src/ai_company/tools/mcp/bridge_tool.py
  • src/ai_company/config/defaults.py
  • src/ai_company/tools/mcp/cache.py
  • src/ai_company/config/schema.py
  • src/ai_company/tools/mcp/result_mapper.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/observability/events/docker.py
  • src/ai_company/tools/mcp/config.py
  • src/ai_company/tools/sandbox/protocol.py
  • src/ai_company/tools/sandbox/docker_config.py
  • src/ai_company/tools/mcp/__init__.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/observability/events/mcp.py
  • src/ai_company/tools/sandbox/errors.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow. Minimum coverage 80% (enforced in CI). Async mode: asyncio_mode = 'auto' — no manual @pytest.mark.asyncio needed. Per-test timeout: 30 seconds. Use pytest-xdist via -n auto for parallelism. Prefer @pytest.mark.parametrize for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Tests must use test-provider, test-small-001, etc. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) .claude/ files, (3) third-party import paths

Files:

  • tests/unit/tools/mcp/conftest.py
  • tests/unit/tools/mcp/test_factory.py
  • tests/integration/tools/test_mcp_integration.py
  • tests/unit/tools/sandbox/test_docker_config.py
  • tests/unit/api/test_bus_bridge.py
  • tests/unit/config/conftest.py
  • tests/unit/tools/test_code_runner.py
  • tests/unit/observability/test_events.py
  • tests/unit/tools/mcp/test_cache.py
  • tests/unit/tools/mcp/test_result_mapper.py
  • tests/unit/tools/mcp/test_config.py
  • tests/unit/tools/mcp/test_bridge_tool.py
  • tests/unit/tools/mcp/test_errors.py
  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/mcp/test_client.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_protocol.py
  • tests/unit/tools/mcp/__init__.py
pyproject.toml

📄 CodeRabbit inference engine (CLAUDE.md)

pyproject.toml: All dependency versions must be pinned with == in pyproject.toml
Organize dependencies into groups: test (pytest + plugins), dev (includes test + ruff, mypy, pre-commit, commitizen). Install with uv sync (installs all groups including dev by default)

Files:

  • pyproject.toml
🧠 Learnings (7)
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.md : DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality

Applied to files:

  • README.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model

Applied to files:

  • src/ai_company/tools/mcp/models.py
  • src/ai_company/config/schema.py
  • src/ai_company/tools/mcp/config.py
  • src/ai_company/tools/sandbox/docker_config.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction + `MappingProxyType` for read-only enforcement. For `dict`/`list` fields in frozen Pydantic models, rely on `frozen=True` and `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)

Applied to files:

  • src/ai_company/tools/mcp/models.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to tests/**/*.py : Use pytest markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, `pytest.mark.slow`. Minimum coverage 80% (enforced in CI). Async mode: `asyncio_mode = 'auto'` — no manual `pytest.mark.asyncio` needed. Per-test timeout: 30 seconds. Use `pytest-xdist` via `-n auto` for parallelism. Prefer `pytest.mark.parametrize` for testing similar cases

Applied to files:

  • tests/integration/tools/test_mcp_integration.py
  • tests/unit/tools/test_code_runner.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Every module with business logic must include logging setup: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` or `print()` in application code. Variable name must always be `logger`

Applied to files:

  • src/ai_company/config/schema.py
  • CLAUDE.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Use structured logging with event name constants: `logger.info(EVENT_CONSTANT, key=value)` — never use string interpolation like `logger.info('msg %s', val)`. Import event constants directly from `ai_company.observability.events.<domain>`

Applied to files:

  • CLAUDE.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions

Applied to files:

  • CLAUDE.md
🧬 Code graph analysis (28)
src/ai_company/tools/mcp/errors.py (1)
src/ai_company/tools/errors.py (1)
  • ToolError (12-44)
tests/unit/tools/mcp/conftest.py (4)
src/ai_company/tools/mcp/cache.py (1)
  • MCPResultCache (22-148)
src/ai_company/tools/mcp/client.py (5)
  • MCPClient (46-454)
  • config (64-66)
  • server_name (74-76)
  • list_tools (179-227)
  • call_tool (229-303)
src/ai_company/tools/mcp/config.py (2)
  • MCPConfig (150-176)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/models.py (2)
  • MCPRawResult (40-62)
  • MCPToolInfo (14-37)
tests/unit/tools/mcp/test_factory.py (4)
src/ai_company/tools/mcp/bridge_tool.py (1)
  • MCPBridgeTool (30-133)
src/ai_company/tools/mcp/config.py (2)
  • MCPConfig (150-176)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/factory.py (4)
  • MCPToolFactory (27-189)
  • create_tools (41-98)
  • shutdown (137-150)
  • _make_cache (172-189)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/sandbox/__init__.py (3)
src/ai_company/tools/sandbox/docker_sandbox.py (2)
  • config (112-114)
  • DockerSandbox (70-600)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
src/ai_company/tools/sandbox/sandboxing_config.py (1)
  • SandboxingConfig (14-57)
src/ai_company/tools/mcp/models.py (2)
src/ai_company/tools/base.py (1)
  • description (115-117)
src/ai_company/tools/mcp/client.py (1)
  • server_name (74-76)
tests/integration/tools/test_mcp_integration.py (5)
tests/unit/tools/mcp/test_bridge_tool.py (1)
  • bridge_tool (22-30)
src/ai_company/tools/mcp/bridge_tool.py (3)
  • MCPBridgeTool (30-133)
  • tool_info (61-63)
  • execute (65-133)
src/ai_company/tools/mcp/cache.py (1)
  • MCPResultCache (22-148)
src/ai_company/tools/mcp/client.py (4)
  • MCPClient (46-454)
  • config (64-66)
  • list_tools (179-227)
  • call_tool (229-303)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
tests/unit/tools/sandbox/test_docker_config.py (2)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
src/ai_company/tools/sandbox/docker_sandbox.py (1)
  • config (112-114)
tests/unit/api/test_bus_bridge.py (1)
src/ai_company/api/bus_bridge.py (1)
  • _poll_channel (112-159)
src/ai_company/tools/code_runner.py (5)
src/ai_company/core/enums.py (1)
  • ToolCategory (293-307)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/tools/base.py (2)
  • BaseTool (56-161)
  • ToolExecutionResult (24-53)
src/ai_company/tools/sandbox/protocol.py (1)
  • execute (25-51)
tests/unit/tools/sandbox/test_protocol.py (1)
  • execute (20-33)
src/ai_company/tools/sandbox/sandboxing_config.py (2)
src/ai_company/tools/sandbox/config.py (1)
  • SubprocessSandboxConfig (8-83)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
tests/unit/tools/test_code_runner.py (6)
src/ai_company/core/enums.py (1)
  • ToolCategory (293-307)
src/ai_company/tools/code_runner.py (2)
  • CodeRunnerTool (53-151)
  • execute (77-151)
src/ai_company/tools/sandbox/result.py (1)
  • SandboxResult (6-28)
src/ai_company/tools/sandbox/docker_sandbox.py (1)
  • execute (265-316)
src/ai_company/tools/sandbox/protocol.py (1)
  • execute (25-51)
src/ai_company/tools/base.py (2)
  • category (110-112)
  • parameters_schema (120-128)
src/ai_company/tools/mcp/bridge_tool.py (5)
src/ai_company/core/enums.py (1)
  • ToolCategory (293-307)
src/ai_company/tools/base.py (3)
  • BaseTool (56-161)
  • ToolExecutionResult (24-53)
  • description (115-117)
src/ai_company/tools/mcp/errors.py (1)
  • MCPError (10-11)
src/ai_company/tools/mcp/result_mapper.py (1)
  • map_call_tool_result (29-108)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/cache.py (2)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/tools/base.py (1)
  • ToolExecutionResult (24-53)
tests/unit/tools/mcp/test_result_mapper.py (2)
src/ai_company/tools/mcp/models.py (1)
  • MCPRawResult (40-62)
src/ai_company/tools/mcp/result_mapper.py (1)
  • map_call_tool_result (29-108)
src/ai_company/config/schema.py (2)
src/ai_company/tools/mcp/config.py (1)
  • MCPConfig (150-176)
src/ai_company/tools/sandbox/sandboxing_config.py (1)
  • SandboxingConfig (14-57)
tests/unit/tools/mcp/test_config.py (1)
src/ai_company/tools/mcp/config.py (2)
  • MCPConfig (150-176)
  • MCPServerConfig (22-147)
tests/unit/tools/mcp/test_bridge_tool.py (5)
src/ai_company/core/enums.py (1)
  • ToolCategory (293-307)
src/ai_company/tools/mcp/client.py (4)
  • MCPClient (46-454)
  • config (64-66)
  • server_name (74-76)
  • call_tool (229-303)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/cache.py (1)
  • MCPResultCache (22-148)
tests/integration/tools/test_docker_sandbox_integration.py (2)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
src/ai_company/tools/sandbox/docker_sandbox.py (6)
  • DockerSandbox (70-600)
  • config (112-114)
  • workspace (117-119)
  • execute (265-316)
  • cleanup (555-573)
  • health_check (575-596)
src/ai_company/tools/mcp/result_mapper.py (3)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/tools/base.py (1)
  • ToolExecutionResult (24-53)
src/ai_company/tools/mcp/models.py (1)
  • MCPRawResult (40-62)
src/ai_company/tools/mcp/factory.py (5)
src/ai_company/tools/mcp/bridge_tool.py (2)
  • MCPBridgeTool (30-133)
  • tool_info (61-63)
src/ai_company/tools/mcp/cache.py (1)
  • MCPResultCache (22-148)
src/ai_company/tools/mcp/client.py (6)
  • MCPClient (46-454)
  • config (64-66)
  • disconnect (159-177)
  • server_name (74-76)
  • connect (78-138)
  • list_tools (179-227)
src/ai_company/tools/mcp/config.py (2)
  • MCPConfig (150-176)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/config.py (2)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/tools/base.py (1)
  • description (115-117)
tests/unit/tools/mcp/test_client.py (4)
src/ai_company/tools/mcp/client.py (9)
  • MCPClient (46-454)
  • config (64-66)
  • is_connected (69-71)
  • server_name (74-76)
  • connect (78-138)
  • disconnect (159-177)
  • list_tools (179-227)
  • call_tool (229-303)
  • reconnect (305-316)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/errors.py (4)
  • MCPConnectionError (14-15)
  • MCPDiscoveryError (22-23)
  • MCPInvocationError (26-27)
  • MCPTimeoutError (18-19)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
tests/unit/tools/sandbox/test_sandboxing_config.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
src/ai_company/tools/sandbox/sandboxing_config.py (2)
  • SandboxingConfig (14-57)
  • backend_for_category (47-57)
src/ai_company/tools/sandbox/docker_sandbox.py (1)
  • config (112-114)
tests/unit/tools/sandbox/test_docker_sandbox.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
src/ai_company/tools/sandbox/docker_sandbox.py (13)
  • DockerSandbox (70-600)
  • _to_posix_bind_path (48-67)
  • workspace (117-119)
  • config (112-114)
  • _validate_cwd (146-165)
  • execute (265-316)
  • _build_container_config (181-234)
  • cleanup (555-573)
  • health_check (575-596)
  • get_backend_type (598-600)
  • _parse_memory_limit (237-263)
  • _stop_container (504-528)
  • _remove_container (531-553)
src/ai_company/tools/sandbox/errors.py (2)
  • SandboxError (10-11)
  • SandboxStartError (23-24)
src/ai_company/tools/mcp/__init__.py (3)
src/ai_company/tools/mcp/client.py (1)
  • config (64-66)
src/ai_company/tools/mcp/config.py (2)
  • MCPConfig (150-176)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/errors.py (4)
  • MCPConnectionError (14-15)
  • MCPDiscoveryError (22-23)
  • MCPError (10-11)
  • MCPInvocationError (26-27)
src/ai_company/tools/mcp/client.py (3)
src/ai_company/tools/mcp/errors.py (4)
  • MCPConnectionError (14-15)
  • MCPDiscoveryError (22-23)
  • MCPInvocationError (26-27)
  • MCPTimeoutError (18-19)
src/ai_company/tools/mcp/models.py (2)
  • MCPRawResult (40-62)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
src/ai_company/tools/sandbox/docker_sandbox.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-75)
src/ai_company/tools/sandbox/errors.py (2)
  • SandboxError (10-11)
  • SandboxStartError (23-24)
src/ai_company/tools/sandbox/result.py (1)
  • SandboxResult (6-28)
tests/unit/tools/sandbox/test_protocol.py (3)
src/ai_company/tools/sandbox/docker_sandbox.py (2)
  • DockerSandbox (70-600)
  • workspace (117-119)
src/ai_company/tools/sandbox/subprocess_sandbox.py (1)
  • workspace (114-116)
src/ai_company/tools/sandbox/protocol.py (1)
  • SandboxBackend (17-75)
🪛 Checkov (3.2.334)
docker/sandbox/Dockerfile

[low] 1-19: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)

🪛 GitHub Actions: CI
tests/integration/tools/test_docker_sandbox_integration.py

[error] 1-1: Sandbox failed to start container: No such image: python:3.14-slim (test_run_python_code)


[error] 1-1: Sandbox failed to start container: No such image: python:3.14-slim (test_run_with_timeout)

src/ai_company/tools/sandbox/docker_sandbox.py

[error] 309-360: Sandbox failed to start: Docker container creation failed with 404 No such image: python:3.14-slim. This caused tests 'test_run_python_code' to fail during container startup.


[error] 309-360: Sandbox failed to start: Docker container creation failed with 404 No such image: python:3.14-slim. This caused tests 'test_run_with_timeout' to fail during container startup.

🪛 Hadolint (2.14.0)
docker/sandbox/Dockerfile

[warning] 9-9: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)

🪛 LanguageTool
README.md

[typographical] ~18-~18: To join two clauses or introduce examples, consider using an em dash.
Context: ...ation analytics models - Tool System - File system tools, git tools, sandbox ab...

(DASH_RULE)

CLAUDE.md

[style] ~86-~86: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...

(EG_NO_COMMA)

🪛 Trivy (0.69.3)
docker/sandbox/Dockerfile

[info] 1-1: No HEALTHCHECK defined

Add HEALTHCHECK instruction in your Dockerfile

Rule: DS-0026


(IaC/Dockerfile)

Comment on lines +17 to +31
def _make_connected_client(
config: MCPServerConfig,
tools: list[MagicMock],
call_result: MagicMock,
) -> MCPClient:
"""Create a mock-connected MCPClient."""
client = MCPClient(config)
session = AsyncMock()

list_result = MagicMock()
list_result.tools = tools
session.list_tools = AsyncMock(return_value=list_result)
session.call_tool = AsyncMock(return_value=call_result)
client._session = session
return client

⚠️ Potential issue | 🟡 Minor

This helper skips the connection/cleanup code the PR hardens.

By assigning client._session directly, these tests never execute MCPClient.connect() / disconnect(), so regressions in the new timeout and AsyncExitStack cleanup paths would go undetected here. Prefer driving the public lifecycle with a fake transport/session factory instead of mutating private state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/tools/test_mcp_integration.py` around lines 17 - 31, The
helper _make_connected_client mutates private state by setting client._session
directly, bypassing MCPClient.connect() and disconnect() and thus skipping new
timeout and AsyncExitStack cleanup logic; change the test to drive the public
lifecycle: provide a fake transport/session factory or patch the factory
function used by MCPClient to return a mock session (with AsyncMock list_tools
and call_tool behaviors) and then call client.connect() and client.disconnect()
(or use async context manager) so the real connection/cleanup paths in
MCPClient.connect and MCPClient.disconnect are exercised instead of assigning
client._session directly.
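The pattern the prompt describes can be sketched with a stand-in client. Everything below is illustrative (the `Client` class is a hypothetical miniature, not the project's `MCPClient`): the session factory is injectable, so the test drives the real `connect()`/`disconnect()` lifecycle with a fake session instead of assigning private state.

```python
import asyncio
from contextlib import AsyncExitStack
from unittest.mock import AsyncMock


class Client:
    """Minimal stand-in for an MCP-style client with an injectable session factory."""

    def __init__(self, session_factory):
        self._session_factory = session_factory
        self._session = None
        self._stack = AsyncExitStack()

    async def connect(self) -> None:
        # The real client would also wrap this in a connect timeout.
        self._session = await self._stack.enter_async_context(self._session_factory())

    async def disconnect(self) -> None:
        await self._stack.aclose()
        self._session = None

    async def list_tools(self):
        return await self._session.list_tools()


async def demo():
    fake_session = AsyncMock()
    fake_session.list_tools = AsyncMock(return_value=["echo"])
    fake_session.__aenter__.return_value = fake_session

    client = Client(lambda: fake_session)
    await client.connect()             # exercises the real connect path
    tools = await client.list_tools()
    await client.disconnect()          # exercises the real cleanup path
    return tools, client._session is None


tools, session_cleared = asyncio.run(demo())
```

A regression in either lifecycle method now fails this test, which is the point of the review finding.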

Comment on lines +83 to +101
def test_empty_input_schema_yields_none_parameters(self) -> None:
tool_info = MCPToolInfo(
name="no-schema",
description="No schema",
input_schema={},
server_name="srv",
)
config = MCPServerConfig(
name="srv",
transport="stdio",
command="echo",
)
client = MCPClient(config)
client._session = AsyncMock()
bridge = MCPBridgeTool(
tool_info=tool_info,
client=client,
)
assert bridge.parameters_schema is None

🧹 Nitpick | 🔵 Trivial

Consider using the existing fixtures for consistency.

This test creates its own MCPClient instance and manually sets _session. While functional, consider extracting this pattern into a fixture (or parameterizing existing fixtures) for consistency with other tests that use mock_client from conftest.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/tools/mcp/test_bridge_tool.py` around lines 83 - 101, Update the
test_empty_input_schema_yields_none_parameters to use the existing mock_client
fixture instead of constructing MCPClient and setting _session manually: change
the test signature to accept mock_client, use that mock_client (optionally
ensure mock_client._session is an AsyncMock if needed), instantiate
MCPBridgeTool with tool_info and client=mock_client, and remove the manual
MCPClient creation and _session assignment so the test matches other tests using
the shared mock_client fixture.

Implements all 42 findings from local review agents and external
reviewers (Greptile, Copilot, CodeRabbit) on PR #194.

Key fixes:
- Docker sandbox: fix UID 65534 conflict, close client on failure,
  remove containers in cleanup, add memory_limit validator
- MCP client: serialize session access with asyncio.Lock, deepcopy
  SDK-owned dicts at boundaries
- MCP bridge: split execute into helpers, skip caching error results,
  catch TypeError on unhashable cache keys
- MCP factory: fix partial-failure cleanup with contextlib.suppress,
  add reuse guard, disconnect on discovery failure
- MCP cache: deepcopy on both store and get paths
- Code runner: catch SandboxError around sandbox.execute()
- SandboxingConfig: type-safe overrides with Literal backend names
- CI: add MIT-0 and scancode-free-unknown to license allow-list
- Docs: fix DESIGN_SPEC.md tree, mount_mode ro, MCP SDK pin
- Tests: add coverage for reuse guard, partial failure, shutdown
  error swallowing, unhashable cache, error-not-cached, connection
  timeout, unknown content block
- Add type: ignore comments for dict[str, str] → Literal overrides
  in test_sandboxing_config.py (mypy arg-type/dict-item errors)
- Make Docker integration test skip when test image isn't pulled
  (CI has Docker but not python:3.12-slim pre-pulled)
- Extract _TEST_IMAGE constant to avoid hardcoded image strings
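The partial-failure cleanup with contextlib.suppress mentioned for the MCP factory follows a common shape. A rough stdlib-only sketch (the `FakeClient` and `connect_all` names are illustrative, not the project's code): already-connected clients are rolled back on failure, and errors during that best-effort cleanup never mask the original exception.

```python
import asyncio
import contextlib


class FakeClient:
    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.connected = False

    async def connect(self) -> None:
        if self.fail:
            raise ConnectionError(f"{self.name} unreachable")
        self.connected = True

    async def disconnect(self) -> None:
        self.connected = False


async def connect_all(clients: list[FakeClient]) -> None:
    connected: list[FakeClient] = []
    try:
        for client in clients:
            await client.connect()
            connected.append(client)
    except Exception:
        # Best-effort rollback: suppress cleanup errors so the
        # original connection failure propagates unchanged.
        for client in connected:
            with contextlib.suppress(Exception):
                await client.disconnect()
        raise


clients = [FakeClient("a"), FakeClient("b"), FakeClient("c", fail=True)]
try:
    asyncio.run(connect_all(clients))
except ConnectionError as exc:
    failure = str(exc)
```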
Copilot AI review requested due to automatic review settings March 10, 2026 08:43
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the system's capabilities by introducing robust sandboxing mechanisms and enabling seamless integration with external tools via the Model Context Protocol. These additions provide a more secure and flexible environment for executing code and interacting with diverse services, laying critical groundwork for future feature development and enhancing overall system resilience and extensibility.

Highlights

  • Docker Sandbox Implementation: Introduced a new Docker-based sandbox backend for isolated code execution, enhancing security and resource control. This includes a multi-stage Dockerfile, container hardening (PidsLimit, ReadonlyRootfs, CapDrop=ALL), resource leak fixes, and memory limit validation.
  • MCP Bridge Integration: Implemented a comprehensive Model Context Protocol (MCP) bridge, allowing the system to connect to external MCP servers and integrate their tools. This includes client connection management, tool discovery, invocation, result caching with TTL/LRU eviction, and a dedicated error hierarchy.
  • Code Runner Tool: Added a new CodeRunnerTool that executes code snippets in Python, JavaScript, or Bash within a configurable sandboxed environment, leveraging either the new Docker sandbox or the existing subprocess sandbox.
  • Enhanced Configuration & Observability: Extended the system's configuration with new Pydantic models for Docker sandboxing and MCP bridge settings, including detailed validation. New structured logging events were added across Docker, MCP, and Code Runner components for improved observability.
  • Comprehensive Testing: Added over 15 new tests, including integration tests for Docker sandbox and MCP bridge, and extensive unit tests for all new components, covering container hardening, error paths, protocol conformance, and cache behavior.
Changelog
  • CLAUDE.md
    • Updated tool system documentation to reflect the addition of the code runner and MCP bridge.
    • Expanded logging event guidelines to include new events for code runner, Docker, and MCP operations.
  • DESIGN_SPEC.md
    • Updated the status of DockerSandbox to 'Implemented' in the design specification.
    • Added aiodocker as a key technology for Docker API integration.
    • Clarified tool integration to specifically mention the mcp SDK.
    • Updated the project's file structure to include new Docker sandbox and MCP bridge modules.
  • README.md
    • Updated the 'Tool System' description to highlight the new Docker sandbox, code runner, and MCP bridge.
    • Changed the status of M7 to 'in progress'.
    • Updated the status of MCP tool integration from 'planned' to 'implemented'.
  • docker/sandbox/Dockerfile
    • Added a new multi-stage Dockerfile for the sandbox environment, including Node.js, Python, Git, and a dedicated sandbox user.
  • pyproject.toml
    • Added aiodocker and mcp as new project dependencies.
    • Configured mypy overrides for aiodocker and mcp modules.
  • src/ai_company/config/defaults.py
    • Added default configuration sections for sandboxing and mcp.
  • src/ai_company/config/schema.py
    • Integrated SandboxingConfig and MCPConfig into the main RootConfig schema.
  • src/ai_company/observability/events/code_runner.py
    • Added new event constants for code runner execution lifecycle and invalid language handling.
  • src/ai_company/observability/events/docker.py
    • Added new event constants for Docker sandbox execution, container lifecycle, cleanup, and health checks.
  • src/ai_company/observability/events/mcp.py
    • Added new event constants for MCP client connection, tool discovery, invocation, caching, and factory operations.
  • src/ai_company/tools/__init__.py
    • Exported new CodeRunnerTool, DockerSandbox, DockerSandboxConfig, and SandboxingConfig classes.
    • Added a comment explaining lazy imports for MCP types to prevent circular dependencies.
  • src/ai_company/tools/code_runner.py
    • Added the CodeRunnerTool class, enabling execution of Python, JavaScript, and Bash code within a sandboxed environment.
  • src/ai_company/tools/mcp/__init__.py
    • Created the mcp subpackage with lazy imports for its components to manage dependencies.
  • src/ai_company/tools/mcp/bridge_tool.py
    • Added the MCPBridgeTool class, which wraps discovered MCP server tools as internal BaseTool instances.
  • src/ai_company/tools/mcp/cache.py
    • Implemented MCPResultCache for in-memory caching of MCP tool invocation results with TTL and LRU eviction.
  • src/ai_company/tools/mcp/client.py
    • Developed MCPClient for managing asynchronous connections to MCP servers, including tool discovery and invocation with timeouts.
  • src/ai_company/tools/mcp/config.py
    • Defined Pydantic models MCPServerConfig and MCPConfig for configuring MCP server connections and the overall bridge.
  • src/ai_company/tools/mcp/errors.py
    • Established a custom error hierarchy for MCP bridge operations, inheriting from ToolError.
  • src/ai_company/tools/mcp/factory.py
    • Created MCPToolFactory to connect to multiple MCP servers in parallel, discover tools, and instantiate MCPBridgeTool objects.
  • src/ai_company/tools/mcp/models.py
    • Defined MCPToolInfo and MCPRawResult Pydantic models for representing MCP tool metadata and raw invocation results.
  • src/ai_company/tools/mcp/result_mapper.py
    • Implemented map_call_tool_result to convert raw MCP results into the internal ToolExecutionResult format, handling various content types.
  • src/ai_company/tools/sandbox/__init__.py
    • Updated the sandbox package to export Docker-related sandbox components.
  • src/ai_company/tools/sandbox/docker_config.py
    • Defined DockerSandboxConfig for configuring Docker sandbox parameters like image, network, resource limits, and mount mode.
  • src/ai_company/tools/sandbox/docker_sandbox.py
    • Implemented DockerSandbox for executing commands in isolated Docker containers, including resource management and Windows path compatibility.
  • src/ai_company/tools/sandbox/errors.py
    • Updated descriptions for SandboxTimeoutError and SandboxStartError to align with new sandbox backends.
  • src/ai_company/tools/sandbox/protocol.py
    • Updated the SandboxBackend protocol documentation to include Docker as an implemented backend.
  • src/ai_company/tools/sandbox/sandboxing_config.py
    • Added SandboxingConfig to allow dynamic selection of sandbox backends (subprocess or Docker) based on tool categories.
  • tests/integration/tools/test_docker_sandbox_integration.py
    • Added integration tests for DockerSandbox to verify Python execution, timeout handling, and Docker daemon availability.
  • tests/integration/tools/test_mcp_integration.py
    • Added integration tests for the MCP bridge, covering tool discovery, execution, caching, and error handling with a mock server.
  • tests/unit/api/test_bus_bridge.py
    • Modified a test to patch _POLL_TIMEOUT for faster circuit breaker testing.
  • tests/unit/config/conftest.py
    • Added sandboxing and mcp configurations to the RootConfigFactory for unit testing.
  • tests/unit/observability/test_events.py
    • Updated event discovery tests to include new code_runner, docker, and mcp event domains.
  • tests/unit/tools/mcp/__init__.py
    • Added __init__.py for MCP unit tests.
  • tests/unit/tools/mcp/conftest.py
    • Added fixtures for MCP unit tests, including mock configurations, tool info, raw results, and client/session mocks.
  • tests/unit/tools/mcp/test_bridge_tool.py
    • Added unit tests for MCPBridgeTool covering its construction, properties, execution logic, and cache interaction.
  • tests/unit/tools/mcp/test_cache.py
    • Added unit tests for MCPResultCache to validate its hit/miss behavior, TTL expiry, LRU eviction, and invalidation mechanisms.
  • tests/unit/tools/mcp/test_client.py
    • Added unit tests for MCPClient covering connection management, tool discovery, invocation, reconnection, and context manager functionality.
  • tests/unit/tools/mcp/test_config.py
    • Added unit tests for MCP configuration models, verifying transport-specific fields, tool filter logic, and server name uniqueness.
  • tests/unit/tools/mcp/test_errors.py
    • Added unit tests for the MCP error hierarchy, confirming proper inheritance and context handling.
  • tests/unit/tools/mcp/test_factory.py
    • Added unit tests for MCPToolFactory to validate parallel server connection, tool discovery, and client shutdown processes.
  • tests/unit/tools/mcp/test_models.py
    • Added unit tests for MCPToolInfo and MCPRawResult models, ensuring correct data representation.
  • tests/unit/tools/mcp/test_result_mapper.py
    • Added unit tests for map_call_tool_result, verifying the correct mapping of various MCP content blocks and structured data.
  • tests/unit/tools/sandbox/test_docker_config.py
    • Added unit tests for DockerSandboxConfig, covering default values, custom settings, and validation rules for network and resource limits.
  • tests/unit/tools/sandbox/test_docker_sandbox.py
    • Added unit tests for DockerSandbox with mocked Docker daemon, covering execution, error handling, container configuration, and cleanup.
  • tests/unit/tools/sandbox/test_protocol.py
    • Added a test to confirm that DockerSandbox correctly implements the SandboxBackend protocol.
  • tests/unit/tools/sandbox/test_sandboxing_config.py
    • Added unit tests for SandboxingConfig, validating default backends, category overrides, and routing logic.
  • tests/unit/tools/test_code_runner.py
    • Added unit tests for CodeRunnerTool, covering initialization, language mapping, execution outcomes, and parameter handling.
  • uv.lock
    • Updated the uv.lock file to include new dependencies aiodocker and mcp and their transitive dependencies.
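The caching behavior the changelog describes for mcp/cache.py (TTL expiry plus LRU eviction) can be sketched roughly as follows. This is an illustrative miniature, not the project's MCPResultCache: an OrderedDict tracks recency, hits refresh position, expired entries count as misses, and inserts evict from the least-recently-used end.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Toy TTL + LRU cache; names and behavior are illustrative only."""

    def __init__(self, max_entries: int, ttl_seconds: float) -> None:
        self._max = max_entries
        self._ttl = ttl_seconds
        self._data: OrderedDict[str, tuple[float, object]] = OrderedDict()

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self._ttl:
            del self._data[key]          # expired: drop and report a miss
            return None
        self._data.move_to_end(key)      # refresh LRU position on hit
        return value

    def put(self, key: str, value: object) -> None:
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)   # evict least recently used


cache = TTLLRUCache(max_entries=2, ttl_seconds=60.0)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # touch "a" so "b" becomes least recently used
cache.put("c", 3)   # evicts "b"
```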
Activity
  • The pull request introduces significant new features, including Docker sandboxing and MCP bridge integration.
  • Extensive documentation updates were made in CLAUDE.md, DESIGN_SPEC.md, and README.md to reflect these changes.
  • New configuration models and observability events were added to support the new functionalities.
  • A comprehensive suite of unit and integration tests was developed, covering various aspects of the new components, including error handling and performance considerations.
  • The author, Aureliolo, has addressed 60 findings consolidated from pre-reviews by 10 different agents, indicating a thorough review process prior to this PR.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

♻️ Duplicate comments (5)
src/ai_company/tools/code_runner.py (1)

78-164: 🧹 Nitpick | 🔵 Trivial

Extract result-mapping logic to reduce execute() length.

The execute() method is ~87 lines, exceeding the 50-line guideline. The result-handling block (Lines 134-164) can be extracted to a private helper to improve readability and maintainability.

♻️ Proposed refactor
+    def _map_result(
+        self,
+        result: SandboxResult,
+        language: str,
+    ) -> ToolExecutionResult:
+        """Map sandbox result to tool execution result."""
+        if result.success:
+            logger.debug(CODE_RUNNER_EXECUTE_SUCCESS, language=language)
+            return ToolExecutionResult(
+                content=result.stdout or "(no output)",
+                metadata={"returncode": result.returncode, "language": language},
+            )
+
+        logger.warning(
+            CODE_RUNNER_EXECUTE_FAILED,
+            language=language,
+            returncode=result.returncode,
+            timed_out=result.timed_out,
+        )
+        error_msg = result.stderr or result.stdout or "Execution failed"
+        if result.timed_out:
+            error_msg = f"Execution timed out. {error_msg}"
+        return ToolExecutionResult(
+            content=error_msg,
+            is_error=True,
+            metadata={
+                "returncode": result.returncode,
+                "timed_out": result.timed_out,
+                "language": language,
+            },
+        )
+
     async def execute(
         self,
         *,
         arguments: dict[str, Any],
     ) -> ToolExecutionResult:
         # ... validation and sandbox call unchanged ...
-
-        if result.success:
-            logger.debug(
-                CODE_RUNNER_EXECUTE_SUCCESS,
-                language=language,
-            )
-            return ToolExecutionResult(
-                content=result.stdout or "(no output)",
-                metadata={
-                    "returncode": result.returncode,
-                    "language": language,
-                },
-            )
-
-        logger.warning(
-            CODE_RUNNER_EXECUTE_FAILED,
-            language=language,
-            returncode=result.returncode,
-            timed_out=result.timed_out,
-        )
-        error_msg = result.stderr or result.stdout or "Execution failed"
-        if result.timed_out:
-            error_msg = f"Execution timed out. {error_msg}"
-        return ToolExecutionResult(
-            content=error_msg,
-            is_error=True,
-            metadata={
-                "returncode": result.returncode,
-                "timed_out": result.timed_out,
-                "language": language,
-            },
-        )
+        return self._map_result(result, language)

Based on learnings: "Functions must be < 50 lines; files must be < 800 lines"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/code_runner.py` around lines 78 - 164, The execute()
method is too long; extract the result-handling block (the logic after obtaining
result from self._sandbox.execute and the SandboxError except) into a private
helper on the same class (e.g., a method named
_map_sandbox_result_to_tool_result or _format_execution_result) that accepts the
sandbox result and language and returns a ToolExecutionResult; move the
success/failure logging and construction of ToolExecutionResult (including
handling result.success, result.stderr/stdout, timed_out, returncode, and
metadata) into that helper, then replace the original block in execute() with a
single call to the new helper, keeping existing logger calls or delegating them
into the helper as needed.
src/ai_company/tools/sandbox/sandboxing_config.py (1)

27-27: ⚠️ Potential issue | 🟠 Major

Use NotBlankStr for sandbox category identifiers.

Empty or whitespace-only keys are currently valid in overrides, and backend_for_category("") accepts the same invalid identifier at runtime. Both should use NotBlankStr.

♻️ Suggested fix
 from pydantic import BaseModel, ConfigDict, Field, model_validator
 
+from ai_company.core.types import NotBlankStr
 from ai_company.tools.sandbox.config import SubprocessSandboxConfig
 from ai_company.tools.sandbox.docker_config import DockerSandboxConfig
@@
-    overrides: dict[str, _BackendName] = Field(default_factory=dict)
+    overrides: dict[NotBlankStr, _BackendName] = Field(default_factory=dict)
@@
-    def backend_for_category(self, category: str) -> _BackendName:
+    def backend_for_category(self, category: NotBlankStr) -> _BackendName:

As per coding guidelines, "use NotBlankStr from core.types for all identifier/name fields—including optional (NotBlankStr | None) and tuple variants—instead of manual whitespace validators".

Also applies to: 47-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/sandbox/sandboxing_config.py` at line 27, The overrides
dict accepts blank keys; change its type from dict[str, _BackendName] to
dict[NotBlankStr, _BackendName] and import NotBlankStr from core.types, and
update any runtime accessor/signature that accepts a category identifier (e.g.,
backend_for_category) to require NotBlankStr (or NotBlankStr | None where
optional) so blank/whitespace strings are rejected; apply the same replacement
for the other identifier/name fields referenced in the block around
backend_for_category (lines ~47-56) so all category/identifier types use
NotBlankStr instead of plain str.
src/ai_company/tools/sandbox/docker_config.py (1)

1-8: ⚠️ Potential issue | 🟠 Major

Log invalid Docker config values before rejecting them.

Both validators reject config at the boundary, but the module still has no logger and no warning/error log on the failure path. Add structured logging with the offending field/value before raising ValueError.

As per coding guidelines, "Every module with business logic must include logging setup: from ai_company.observability import get_logger then logger = get_logger(__name__).", "All error paths must log at WARNING or ERROR with context before raising", and "Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries".

Also applies to: 65-94

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/sandbox/docker_config.py` around lines 1 - 8, Add
module-level logging (import get_logger from ai_company.observability and create
logger = get_logger(__name__)) and update the pydantic model validators (the
functions decorated with model_validator on the Docker sandbox configuration
model) to log the offending field name and value at WARNING or ERROR before
raising ValueError; specifically, inside each validator that currently raises
ValueError for invalid Docker config values, call logger.warning or logger.error
with a short message and structured context (e.g., include the field identifier
and the invalid value) and then raise the ValueError as before.
src/ai_company/tools/sandbox/docker_sandbox.py (1)

223-226: ⚠️ Potential issue | 🟠 Major

These config fields are still not wired through.

allowed_hosts is still a TODO, and _run_container() forces auto_remove=False even though the finally block always deletes the container anyway. That leaves two public config knobs with no effect.

Also applies to: 344-352, 384-388

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/sandbox/docker_sandbox.py` around lines 223 - 226, The
config fields allowed_hosts and auto_remove are not being applied: update the
container startup to respect self._config.allowed_hosts by wiring it into the
Docker create/run options (e.g., add to host_config["ExtraHosts"] or appropriate
network settings using the allowed_hosts list) and stop forcing
auto_remove=False in _run_container(); instead set the container create/run
auto_remove option from self._config.auto_remove and change the finally block so
the explicit container removal only runs when auto_remove is False (or when
removal is still required), referencing the host_config dict, the
_run_container() method, the allowed_hosts config field, and the auto_remove
behavior so these knobs actually take effect.
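The auto_remove half of the fix reduces to making the `finally` cleanup conditional. A sketch with a fake container standing in for the aiodocker object (all names here are illustrative):

```python
import asyncio


class FakeContainer:
    """Stand-in for an aiodocker container (illustrative only)."""

    def __init__(self) -> None:
        self.deleted = False

    async def wait(self) -> None:
        await asyncio.sleep(0)

    async def delete(self, *, force: bool = True) -> None:
        self.deleted = True


async def run_container(container: FakeContainer, auto_remove: bool) -> None:
    # With auto_remove=True the daemon removes the container itself, so the
    # explicit delete in `finally` must only run when auto_remove is False.
    try:
        await container.wait()
    finally:
        if not auto_remove:
            await container.delete(force=True)


kept, auto = FakeContainer(), FakeContainer()
asyncio.run(run_container(kept, auto_remove=False))
asyncio.run(run_container(auto, auto_remove=True))
```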
src/ai_company/tools/mcp/client.py (1)

254-265: ⚠️ Potential issue | 🟠 Major

Resolve the session after acquiring _lock.

Line 254 reads _session before async with self._lock. A concurrent disconnect() can close that session in between, leaving call_tool() to run against a stale ClientSession. Move _require_session() inside the locked block.

Proposed fix
-        session = self._require_session()
         logger.debug(
             MCP_INVOKE_START,
             server=self._config.name,
             tool=tool_name,
         )
         async with self._lock:
+            session = self._require_session()
             try:
                 result = await asyncio.wait_for(
                     session.call_tool(tool_name, arguments),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/mcp/client.py` around lines 254 - 265, The code calls
_require_session() and assigns session before entering async with self._lock,
which allows a concurrent disconnect() to close the session and leave
call_tool() using a stale ClientSession; move the call to _require_session()
inside the async with self._lock block so the session is obtained while holding
self._lock, then call session.call_tool(...) under the same lock (ensure you
still use asyncio.wait_for with the existing timeout and preserve the
MCP_INVOKE_START logging and exception handling around the await).
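The race and its fix can be sketched with a toy client (the real MCPClient's session and timeout handling are simplified away; names here are illustrative):

```python
import asyncio


class Client:
    """Toy client: the session is resolved only while holding the lock."""

    def __init__(self) -> None:
        self._session: str | None = "session"
        self._lock = asyncio.Lock()

    async def disconnect(self) -> None:
        async with self._lock:
            self._session = None

    async def call_tool(self) -> str:
        async with self._lock:
            # Resolving the session here, under the lock, means a concurrent
            # disconnect() cannot close it between the lookup and the call.
            if self._session is None:
                raise RuntimeError("not connected")
            return self._session


async def main() -> tuple[str, bool]:
    client = Client()
    first = await client.call_tool()
    await client.disconnect()
    try:
        await client.call_tool()
        raised = False
    except RuntimeError:
        raised = True
    return first, raised


first, raised = asyncio.run(main())
```

With the lookup inside the lock, a post-disconnect call fails loudly instead of running against a stale session.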
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 3024-3033: The DESIGN_SPEC contains a duplicated listing of the
sandbox package tree; remove the redundant block that repeats the "sandbox/"
subtree and consolidate into a single entry so the spec shows only one set of
children for the sandbox package (keep a single listing that includes
__init__.py, config.py, docker_config.py, docker_sandbox.py, errors.py,
protocol.py, result.py, sandboxing_config.py, subprocess_sandbox.py). Update the
tree where the duplicate appears to match the original subtree entry (ensure
only one "sandbox/" subtree is documented) and verify package names (sandbox /
tools/sandbox) are consistent with the rest of the file.

In `@src/ai_company/tools/mcp/client.py`:
- Around line 91-140: The AsyncExitStack opened with stack = AsyncExitStack();
await stack.__aenter__() can leak if connect() is cancelled (CancelledError)
before self._exit_stack is assigned; ensure the stack is always closed on any
BaseException by adding a finally (or an except BaseException) that awaits
stack.aclose() unless the stack has been stored to self._exit_stack after
successful connection. Update the connect() flow around _connect_with_stack,
session assignment, and self._exit_stack so that on cancellation or any
BaseException the partially-opened stack is closed before re-raising the error.
- Around line 85-105: Inside connect(), guard against opening a second session
by checking self._session (and/or self._exit_stack) while holding self._lock
before creating a new AsyncExitStack; if a session already exists either raise
an error (e.g., RuntimeError) or call the existing disconnect/close routine to
cleanly close the current transport, then proceed to create a new stack via
self._connect_with_stack. Ensure you do not overwrite self._session or
self._exit_stack until the old session/stack has been closed and fully released,
and keep the check/close logic inside the same self._lock context to avoid
races.
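Both connect() findings above can be sketched together: close the partially opened stack on any BaseException (including cancellation) before re-raising, and guard against a second connect while holding the lock. A toy transport stands in for the MCP SDK; all names are illustrative:

```python
import asyncio
from contextlib import AsyncExitStack, asynccontextmanager

closed: list[str] = []


@asynccontextmanager
async def transport(name: str):
    """Stand-in for the MCP transport context manager."""
    try:
        yield name
    finally:
        closed.append(name)


class Client:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._session: str | None = None
        self._exit_stack: AsyncExitStack | None = None

    async def connect(self, *, fail: bool = False) -> None:
        async with self._lock:
            if self._session is not None:
                # Double-connect guard: never overwrite a live session.
                raise RuntimeError("already connected")
            stack = AsyncExitStack()
            await stack.__aenter__()
            try:
                self._session = await stack.enter_async_context(
                    transport("stdio")
                )
                if fail:
                    raise asyncio.CancelledError  # simulated mid-connect cancel
            except BaseException:
                # Close the partially opened stack before re-raising so the
                # transport is never leaked on cancellation or error.
                self._session = None
                await stack.aclose()
                raise
            self._exit_stack = stack

    async def disconnect(self) -> None:
        async with self._lock:
            if self._exit_stack is not None:
                await self._exit_stack.aclose()
            self._session = None
            self._exit_stack = None


async def main() -> None:
    client = Client()
    try:
        await client.connect(fail=True)
    except asyncio.CancelledError:
        pass
    await client.connect()  # succeeds because the failed attempt cleaned up
    await client.disconnect()


asyncio.run(main())
```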

In `@src/ai_company/tools/mcp/factory.py`:
- Around line 59-61: The create_tools() reuse check currently raises
RuntimeError when self._created is true without logging; before raising, add a
log call (e.g., process/logger.warning or logger.error) that records the failure
and relevant context (such as self identity, factory configuration or key
attributes) to ensure MCP factory telemetry captures the event, then raise the
RuntimeError as before; locate the check around create_tools() and self._created
in factory.py and insert the logging call immediately prior to the raise.

In `@src/ai_company/tools/sandbox/docker_sandbox.py`:
- Around line 267-319: The execute/_run_container/_start_and_wait functions are
too long and mix responsibilities; refactor by extracting small helpers so each
function is under 50 lines: keep execute minimal (validate cwd via
_validate_cwd, compute effective_timeout, resolve container_cwd, call a single
orchestration helper), move timeout calculation to a helper like
_compute_effective_timeout, move Docker acquisition to _ensure_docker (or wrap
into _get_docker_client), extract container lifecycle steps from
_run_container/_start_and_wait into clear helpers such as
_create_and_start_container, _stream_and_collect_logs,
_wait_with_timeout_and_cleanup, and _build_sandbox_result (for constructing
SandboxResult); ensure validation, logging (DOCKER_EXECUTE_START), container
start/stop, timeout handling, and output aggregation are each in their own small
functions referenced from _run_container/_start_and_wait so the original
functions become simple orchestrators under 50 lines.
- Around line 460-467: The current logger call in docker_sandbox.py logs raw
stderr when command execution fails; instead keep the full stderr text in
SandboxResult but change the DOCKER_EXECUTE_FAILED logger.warning invocation
(the block referencing command, args, returncode, stderr) to avoid emitting raw
stderr: log bounded metadata such as stderr_length, a short stderr_preview
(e.g., first N chars) or a boolean truncated flag, and an opaque fingerprint
(e.g., sha256) if you need correlation; update the logger parameters accordingly
and ensure SandboxResult still stores the original stderr for later access.
- Around line 106-109: Add an asyncio.Lock (e.g., self._client_lock) to
serialize access to the Docker client and container-tracking state; in
particular, protect _ensure_docker(), cleanup(), and the sections of
_run_container() that read/modify self._tracked_containers and assign
self._docker by surrounding them with "async with self._client_lock" so the
TOCTOU creation/assignment of aiodocker.Docker is atomic, cleanup() cannot close
the client mid-initialization, and tracked-container append/remove operations
are atomic and not lost under concurrency.
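For the stderr-logging finding, the bounded-metadata helper can be sketched as below (the function name and field keys are illustrative; the real logger call would spread these as structured kwargs):

```python
import hashlib


def stderr_log_fields(stderr: str, preview_chars: int = 80) -> dict[str, object]:
    """Bounded metadata for a failure log instead of the raw stderr text."""
    return {
        "stderr_length": len(stderr),
        "stderr_preview": stderr[:preview_chars],
        "stderr_truncated": len(stderr) > preview_chars,
        # Opaque fingerprint for correlating repeated failures across runs.
        "stderr_sha256": hashlib.sha256(stderr.encode("utf-8")).hexdigest(),
    }


fields = stderr_log_fields("boom " * 40)
```

SandboxResult keeps the full stderr; only these derived fields reach the log stream.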

In `@src/ai_company/tools/sandbox/sandboxing_config.py`:
- Around line 1-8: This module is missing the required logger setup used by
SandboxingConfig's validation/business logic; add the import "from
ai_company.observability import get_logger" and define "logger =
get_logger(__name__)" near the top of the file, then update SandboxingConfig's
model_validator and any validation/error paths (e.g., inside the validator
function(s) referenced by model_validator and any logic around
SubprocessSandboxConfig/DockerSandboxConfig usage around lines ~35-45) to call
logger.warning or logger.error with contextual details before raising exceptions
or returning errors; ensure all newly added log calls include enough context
(config values or identifiers) to aid debugging.

In `@tests/integration/tools/test_docker_sandbox_integration.py`:
- Around line 56-58: The test currently uses DockerSandboxConfig with image
"python:3.12-slim" but `@skip_no_docker` only checks daemon reachability, so add
an image-availability check before running the test and skip if the image isn't
present (or pull it); specifically, update the test in
tests/integration/tools/test_docker_sandbox_integration.py to call the Docker
SDK (e.g., docker.from_env()) and use client.images.get("python:3.12-slim") to
detect presence (or client.images.pull("python:3.12-slim") if you prefer
auto-pull), and if not available raise unittest.SkipTest (or keep the skip
decorator) so the test using DockerSandboxConfig(image="python:3.12-slim") won't
fail with "No such image".
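The availability check can be factored into a small helper. In the real test `get_image` would be `docker.from_env().images.get` and a False result would trigger `pytest.skip`; here the callable is injected so the sketch runs without a daemon (all names are illustrative):

```python
from collections.abc import Callable


def image_available(get_image: Callable[[str], object], image: str) -> bool:
    """Return True when `get_image(image)` succeeds without raising."""
    try:
        get_image(image)
    except Exception:
        return False
    return True


def _missing(name: str) -> None:
    # Simulates the Docker SDK raising when the image is not present.
    raise RuntimeError(f"No such image: {name}")


present = image_available(lambda name: object(), "python:3.12-slim")
absent = image_available(_missing, "python:3.12-slim")
```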

In `@tests/unit/tools/mcp/test_result_mapper.py`:
- Around line 164-172: The test test_unknown_block_produces_placeholder
currently mutates the global MagicMock type by setting type(unknown).__name__ =
"MysteryBlock"; replace MagicMock with a tiny local stub class (e.g., class
MysteryBlock: pass) and instantiate it instead of using MagicMock so you don’t
alter shared state; keep using MCPRawResult(content=(instance,)) and assert the
same expectations against map_call_tool_result to preserve test behavior and
isolation.
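The difference is easy to demonstrate: a tiny local stub carries the right `__name__` by construction, while `type(unknown).__name__ = "MysteryBlock"` on a MagicMock would rename the shared MagicMock class for every other test in the process:

```python
from unittest.mock import MagicMock


# Local stub: type name is "MysteryBlock" with no shared-state mutation.
class MysteryBlock:
    pass


unknown = MysteryBlock()
shared = MagicMock()  # unaffected; MagicMock's class name stays intact
```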

In `@tests/unit/tools/sandbox/test_docker_sandbox.py`:
- Around line 424-436: The test currently only asserts
mock_docker.containers.container.call_count >= 2 which can pass without
verifying removals; update test_cleanup_stops_tracked_containers to assert that
for each tracked container the mocked container's stop and delete were awaited:
locate the DockerSandbox.cleanup invocation and the mock container objects
returned by mock_docker.containers.container, then replace or add assertions to
check call_count (or awaited call) for container.stop and container.delete
(e.g., verify stop was awaited twice and delete/removed awaited twice) and keep
the existing assertion that sandbox._tracked_containers == [] to ensure state is
cleared.

In `@tests/unit/tools/sandbox/test_sandboxing_config.py`:
- Around line 40-42: The test assigns plain str values to the fixtures/variables
used with SandboxingConfig (overrides, default, override_backend), causing mypy
failures; update their type annotations to use Literal["subprocess","docker"]
(or define a shared alias like Backend = Literal["subprocess","docker"]) and
apply that alias to the fixtures/variables (e.g., overrides: dict[str, Backend],
default: Backend, override_backend: Backend), then remove the temporary # type:
ignore on the earlier line so the test compiles cleanly under mypy while still
asserting SandboxingConfig(overrides=overrides) behaves as expected.
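The shared alias is a one-liner; variable names below mirror the fixtures named above, but the sketch is stand-alone:

```python
from typing import Literal, get_args

# Shared alias so test fixtures type-check against the config's Literal type.
Backend = Literal["subprocess", "docker"]

default: Backend = "subprocess"
override_backend: Backend = "docker"
overrides: dict[str, Backend] = {"code_exec": override_backend}
```

With these annotations mypy infers `Literal["subprocess", "docker"]` rather than plain `str`, so the earlier `# type: ignore` can be dropped.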

---

Duplicate comments:
In `@src/ai_company/tools/code_runner.py`:
- Around line 78-164: The execute() method is too long; extract the
result-handling block (the logic after obtaining result from
self._sandbox.execute and the SandboxError except) into a private helper on the
same class (e.g., a method named _map_sandbox_result_to_tool_result or
_format_execution_result) that accepts the sandbox result and language and
returns a ToolExecutionResult; move the success/failure logging and construction
of ToolExecutionResult (including handling result.success, result.stderr/stdout,
timed_out, returncode, and metadata) into that helper, then replace the original
block in execute() with a single call to the new helper, keeping existing logger
calls or delegating them into the helper as needed.
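A skeleton of the extraction, with a fake result dataclass standing in for the project's SandboxResult and a plain dict standing in for ToolExecutionResult (field names are assumptions based on the description above):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeSandboxResult:
    """Stand-in for the project's SandboxResult (fields assumed)."""

    success: bool
    stdout: str
    stderr: str
    returncode: int
    timed_out: bool


def format_execution_result(
    result: FakeSandboxResult, language: str
) -> dict[str, object]:
    """Illustrative extraction of execute()'s result-handling tail."""
    return {
        "success": result.success and not result.timed_out,
        "output": result.stdout if result.success else result.stderr,
        "metadata": {
            "language": language,
            "returncode": result.returncode,
            "timed_out": result.timed_out,
        },
    }


ok = format_execution_result(
    FakeSandboxResult(True, "42\n", "", 0, False), "python"
)
```

execute() then shrinks to sandbox invocation plus a single call to the helper, bringing it back under the 50-line guideline.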


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 806d9535-22e7-4320-b5bb-441b8f564d10

📥 Commits

Reviewing files that changed from the base of the PR and between a093d4d and bc1cd55.

📒 Files selected for processing (22)
  • .github/workflows/dependency-review.yml
  • DESIGN_SPEC.md
  • README.md
  • docker/sandbox/Dockerfile
  • src/ai_company/observability/events/mcp.py
  • src/ai_company/tools/code_runner.py
  • src/ai_company/tools/mcp/bridge_tool.py
  • src/ai_company/tools/mcp/cache.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/tools/sandbox/docker_config.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
  • src/ai_company/tools/sandbox/protocol.py
  • src/ai_company/tools/sandbox/sandboxing_config.py
  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/mcp/conftest.py
  • tests/unit/tools/mcp/test_bridge_tool.py
  • tests/unit/tools/mcp/test_client.py
  • tests/unit/tools/mcp/test_factory.py
  • tests/unit/tools/mcp/test_result_mapper.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.md

📄 CodeRabbit inference engine (CLAUDE.md)

DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality

Files:

  • README.md
  • DESIGN_SPEC.md
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) — ruff enforces this on Python 3.14
All public functions must have type hints; mypy strict mode is enforced
Public classes and functions must have Google-style docstrings (enforced by ruff D rules)
Every module with business logic must include logging setup: from ai_company.observability import get_logger then logger = get_logger(__name__). Never use import logging or print() in application code. Variable name must always be logger
Use structured logging with event name constants: logger.info(EVENT_CONSTANT, key=value) — never use string interpolation like logger.info('msg %s', val). Import event constants directly from ai_company.observability.events.<domain>
All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Line length maximum is 88 characters (enforced by ruff)
Functions must be < 50 lines; files must be < 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries

Files:

  • src/ai_company/tools/sandbox/sandboxing_config.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/mcp/conftest.py
  • src/ai_company/tools/code_runner.py
  • tests/unit/tools/mcp/test_factory.py
  • src/ai_company/tools/mcp/cache.py
  • tests/unit/tools/mcp/test_bridge_tool.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
  • tests/unit/tools/mcp/test_client.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/tools/sandbox/docker_config.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/tools/sandbox/protocol.py
  • src/ai_company/tools/mcp/bridge_tool.py
  • tests/unit/tools/mcp/test_result_mapper.py
  • src/ai_company/observability/events/mcp.py
src/ai_company/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

src/ai_company/**/*.py: Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with model_copy(update=...)) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model
Use @computed_field for derived values instead of storing + validating redundant fields (e.g. TokenUsage.total_tokens); use NotBlankStr from core.types for all identifier/name fields—including optional (NotBlankStr | None) and tuple variants—instead of manual whitespace validators
Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, BaseTool), use copy.deepcopy() at construction + MappingProxyType for read-only enforcement. For dict/list fields in frozen Pydantic models, rely on frozen=True and copy.deepcopy() at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Prefer asyncio.TaskGroup for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare create_task

Files:

  • src/ai_company/tools/sandbox/sandboxing_config.py
  • src/ai_company/tools/sandbox/docker_sandbox.py
  • src/ai_company/tools/code_runner.py
  • src/ai_company/tools/mcp/cache.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/tools/sandbox/docker_config.py
  • src/ai_company/tools/mcp/factory.py
  • src/ai_company/tools/sandbox/protocol.py
  • src/ai_company/tools/mcp/bridge_tool.py
  • src/ai_company/observability/events/mcp.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow. Minimum coverage 80% (enforced in CI). Async mode: asyncio_mode = 'auto' — no manual @pytest.mark.asyncio needed. Per-test timeout: 30 seconds. Use pytest-xdist via -n auto for parallelism. Prefer @pytest.mark.parametrize for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Tests must use test-provider, test-small-001, etc. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) .claude/ files, (3) third-party import paths

Files:

  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/mcp/conftest.py
  • tests/unit/tools/mcp/test_factory.py
  • tests/unit/tools/mcp/test_bridge_tool.py
  • tests/unit/tools/sandbox/test_docker_sandbox.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
  • tests/unit/tools/mcp/test_client.py
  • tests/unit/tools/mcp/test_result_mapper.py
🧠 Learnings (7)
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.md : DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality

Applied to files:

  • README.md
  • DESIGN_SPEC.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to pyproject.toml : Organize dependencies into groups: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen). Install with `uv sync` (installs all groups including dev by default)

Applied to files:

  • README.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Use `computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`); use `NotBlankStr` from `core.types` for all identifier/name fields—including optional (`NotBlankStr | None`) and tuple variants—instead of manual whitespace validators

Applied to files:

  • src/ai_company/tools/sandbox/sandboxing_config.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Functions must be < 50 lines; files must be < 800 lines

Applied to files:

  • src/ai_company/tools/code_runner.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction + `MappingProxyType` for read-only enforcement. For `dict`/`list` fields in frozen Pydantic models, rely on `frozen=True` and `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)

Applied to files:

  • src/ai_company/tools/mcp/cache.py
  • src/ai_company/tools/mcp/client.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model

Applied to files:

  • src/ai_company/tools/mcp/cache.py
  • src/ai_company/tools/mcp/client.py
  • src/ai_company/tools/sandbox/docker_config.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries

Applied to files:

  • src/ai_company/tools/sandbox/docker_config.py
🧬 Code graph analysis (12)
src/ai_company/tools/sandbox/sandboxing_config.py (2)
src/ai_company/tools/sandbox/config.py (1)
  • SubprocessSandboxConfig (8-83)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-94)
src/ai_company/tools/sandbox/docker_sandbox.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-94)
src/ai_company/tools/sandbox/errors.py (2)
  • SandboxError (10-11)
  • SandboxStartError (23-24)
src/ai_company/tools/sandbox/result.py (1)
  • SandboxResult (6-28)
tests/integration/tools/test_docker_sandbox_integration.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-94)
src/ai_company/tools/sandbox/docker_sandbox.py (6)
  • DockerSandbox (70-610)
  • config (112-114)
  • workspace (117-119)
  • execute (267-319)
  • cleanup (558-583)
  • health_check (585-606)
src/ai_company/tools/sandbox/protocol.py (3)
  • execute (25-51)
  • cleanup (53-59)
  • health_check (61-67)
src/ai_company/tools/code_runner.py (6)
src/ai_company/core/enums.py (1)
  • ToolCategory (293-307)
src/ai_company/tools/base.py (5)
  • BaseTool (56-161)
  • ToolExecutionResult (24-53)
  • description (115-117)
  • category (110-112)
  • parameters_schema (120-128)
src/ai_company/tools/sandbox/errors.py (1)
  • SandboxError (10-11)
src/ai_company/tools/sandbox/protocol.py (2)
  • SandboxBackend (17-75)
  • execute (25-51)
src/ai_company/tools/sandbox/docker_sandbox.py (1)
  • execute (267-319)
src/ai_company/tools/mcp/cache.py (1)
  • get (47-82)
src/ai_company/tools/mcp/cache.py (2)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/tools/base.py (1)
  • ToolExecutionResult (24-53)
tests/unit/tools/mcp/test_bridge_tool.py (7)
src/ai_company/core/enums.py (1)
  • ToolCategory (293-307)
src/ai_company/tools/base.py (4)
  • ToolExecutionResult (24-53)
  • category (110-112)
  • description (115-117)
  • parameters_schema (120-128)
src/ai_company/tools/mcp/bridge_tool.py (3)
  • MCPBridgeTool (31-188)
  • tool_info (62-64)
  • execute (66-88)
src/ai_company/tools/mcp/client.py (4)
  • MCPClient (47-472)
  • config (65-67)
  • server_name (75-77)
  • call_tool (233-311)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/cache.py (2)
  • MCPResultCache (23-149)
  • get (47-82)
tests/unit/tools/sandbox/test_sandboxing_config.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-94)
src/ai_company/tools/sandbox/sandboxing_config.py (2)
  • SandboxingConfig (14-56)
  • backend_for_category (47-56)
tests/unit/tools/sandbox/test_docker_config.py (2)
  • test_defaults (14-25)
  • test_frozen (27-30)
tests/unit/tools/mcp/test_client.py (4)
src/ai_company/tools/mcp/client.py (9)
  • MCPClient (47-472)
  • config (65-67)
  • is_connected (70-72)
  • server_name (75-77)
  • connect (79-140)
  • disconnect (161-180)
  • list_tools (182-231)
  • call_tool (233-311)
  • reconnect (313-324)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/errors.py (4)
  • MCPConnectionError (14-15)
  • MCPDiscoveryError (22-23)
  • MCPInvocationError (26-27)
  • MCPTimeoutError (18-19)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/client.py (4)
src/ai_company/observability/_logger.py (1)
  • get_logger (8-28)
src/ai_company/tools/mcp/errors.py (4)
  • MCPConnectionError (14-15)
  • MCPDiscoveryError (22-23)
  • MCPInvocationError (26-27)
  • MCPTimeoutError (18-19)
src/ai_company/tools/mcp/models.py (2)
  • MCPRawResult (40-62)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/config.py (1)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/factory.py (5)
src/ai_company/tools/mcp/bridge_tool.py (2)
  • MCPBridgeTool (31-188)
  • tool_info (62-64)
src/ai_company/tools/mcp/cache.py (1)
  • MCPResultCache (23-149)
src/ai_company/tools/mcp/client.py (6)
  • MCPClient (47-472)
  • config (65-67)
  • disconnect (161-180)
  • server_name (75-77)
  • connect (79-140)
  • list_tools (182-231)
src/ai_company/tools/mcp/config.py (2)
  • MCPConfig (150-176)
  • MCPServerConfig (22-147)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
src/ai_company/tools/mcp/bridge_tool.py (5)
src/ai_company/tools/mcp/errors.py (1)
  • MCPError (10-11)
src/ai_company/tools/mcp/result_mapper.py (1)
  • map_call_tool_result (29-108)
src/ai_company/tools/mcp/cache.py (3)
  • MCPResultCache (23-149)
  • get (47-82)
  • put (84-115)
src/ai_company/tools/mcp/client.py (3)
  • MCPClient (47-472)
  • server_name (75-77)
  • call_tool (233-311)
src/ai_company/tools/mcp/models.py (1)
  • MCPToolInfo (14-37)
tests/unit/tools/mcp/test_result_mapper.py (2)
src/ai_company/tools/mcp/models.py (1)
  • MCPRawResult (40-62)
src/ai_company/tools/mcp/result_mapper.py (1)
  • map_call_tool_result (29-108)
🪛 Checkov (3.2.334)
docker/sandbox/Dockerfile

[low] 1-19: Ensure that HEALTHCHECK instructions have been added to container images

(CKV_DOCKER_2)

🪛 GitHub Actions: CI
src/ai_company/tools/sandbox/docker_sandbox.py

[error] 355-355: SandboxStartError: Failed to create container: [404] No such image: python:3.12-slim

tests/integration/tools/test_docker_sandbox_integration.py

[error] 65-65: SandboxStartError: Failed to create container: [404] No such image: python:3.12-slim


[error] 85-85: SandboxStartError: Failed to create container: [404] No such image: python:3.12-slim

tests/unit/tools/sandbox/test_sandboxing_config.py

[error] 41-41: mypy: Argument "overrides" to "SandboxingConfig" has incompatible type "dict[str, str]"; expected "dict[str, Literal['subprocess', 'docker']]" [arg-type]


[error] 93-93: mypy: Dict entry 0 has incompatible type "str": "str"; expected "str": "Literal['subprocess', 'docker']" [dict-item]

🪛 Hadolint (2.14.0)
docker/sandbox/Dockerfile

[warning] 9-9: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)

🪛 LanguageTool
README.md

[typographical] ~18-~18: To join two clauses or introduce examples, consider using an em dash.
Context: ...ation analytics models - Tool System - File system tools, git tools, sandbox ab...

(DASH_RULE)

🪛 Trivy (0.69.3)
docker/sandbox/Dockerfile

[info] 1-1: No HEALTHCHECK defined

Add HEALTHCHECK instruction in your Dockerfile

Rule: DS-0026


(IaC/Dockerfile)

🔇 Additional comments (8)
.github/workflows/dependency-review.yml (1)

27-31: Looks good.

Adding MIT-0 to the allow-list is consistent with the documented scanner exceptions, and the extra comments make future license-review decisions easier to audit.

docker/sandbox/Dockerfile (1)

1-19: LGTM! Well-structured sandbox Dockerfile.

The multi-stage build avoids curl-pipe-bash patterns, UID 10001 prevents conflicts with common system users, and the non-login shell with read-only workspace alignment supports the hardening objectives.

The Hadolint warning about pinning apt package versions (Line 9) is valid but low-priority for ephemeral sandbox images. The HEALTHCHECK warnings from Trivy/Checkov are not applicable here—the parent DockerSandbox orchestrator manages container health externally via its health_check() method.

src/ai_company/tools/sandbox/protocol.py (1)

22-22: LGTM!

Documentation correctly updated to reflect that Subprocess and Docker are now built-in backends, aligning with the implementation delivered in this PR.

Also applies to: 40-40

tests/unit/tools/mcp/conftest.py (1)

1-169: LGTM!

Well-structured test fixtures providing reusable mock MCP sessions, clients, and configurations. The helper functions for creating mock tool results enable clean, isolated unit tests. Generic test identifiers (test-stdio, test-http, test-server) comply with the naming guidelines.

src/ai_company/observability/events/mcp.py (1)

1-29: LGTM!

Well-organized event constants with consistent naming patterns (MCP_<CATEGORY>_<ACTION> → "mcp.<category>.<action>"). These enable structured logging across the MCP client, bridge, factory, and cache components.

README.md (1)

18-18: LGTM!

Documentation accurately reflects the implemented features: Docker sandbox backend, code runner, and MCP bridge. Docker is now documented as an optional requirement (Line 59) with installation guidance, addressing the previous review feedback.

Also applies to: 41-41, 50-50, 59-59

tests/unit/tools/mcp/test_bridge_tool.py (1)

1-216: LGTM!

Comprehensive test coverage for MCPBridgeTool: construction properties, execution delegation, cache hit/miss semantics, unhashable argument handling, and error result exclusion from cache. The inline MCPClient creation in test_empty_input_schema_yields_none_parameters (Lines 84-102) is acceptable since it tests a specific edge case requiring an empty input_schema that the shared fixtures don't provide.

tests/integration/tools/test_docker_sandbox_integration.py (1)

21-41: LGTM on resource leak fix.

The _check() helper now properly closes the Docker client in the finally block, addressing the previous review feedback about resource leaks on exception paths.

Comment on lines +106 to +109
self._config = config or _DEFAULT_CONFIG
self._workspace = resolved
self._docker: aiodocker.Docker | None = None
self._tracked_containers: list[str] = []

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n src/ai_company/tools/sandbox/docker_sandbox.py

Repository: Aureliolo/ai-company

Length of output: 23598


Add synchronization to protect concurrent Docker client lifecycle and container tracking.

The code has race conditions in _ensure_docker() (lines 121-144), cleanup() (lines 558-583), and tracked-container mutations (lines 366-369, 386-388):

  1. _ensure_docker() TOCTOU: Multiple concurrent tasks can both observe self._docker is None, create separate aiodocker.Docker clients, and race to assign to self._docker, leaking clients.
  2. cleanup() interference: While a task is initializing the client in _ensure_docker(), another task can call cleanup(), closing the client before it's assigned or used, causing use-after-close errors.
  3. Container tracking mutations: List reads and modifications in _run_container() (lines 366-369, 386-388) are not atomic; concurrent executions can lose track of container IDs.

Protect these sections with an asyncio.Lock to serialize access to _ensure_docker(), cleanup(), and tracked-container state.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/sandbox/docker_sandbox.py` around lines 106 - 109, Add
an asyncio.Lock (e.g., self._client_lock) to serialize access to the Docker
client and container-tracking state; in particular, protect _ensure_docker(),
cleanup(), and the sections of _run_container() that read/modify
self._tracked_containers and assign self._docker by surrounding them with "async
with self._client_lock" so the TOCTOU creation/assignment of aiodocker.Docker is
atomic, cleanup() cannot close the client mid-initialization, and
tracked-container append/remove operations are atomic and not lost under
concurrency.

Comment on lines +1 to +8
"""Top-level sandboxing configuration model."""

from typing import Literal, Self

from pydantic import BaseModel, ConfigDict, Field, model_validator

from ai_company.tools.sandbox.config import SubprocessSandboxConfig
from ai_company.tools.sandbox.docker_config import DockerSandboxConfig

🛠️ Refactor suggestion | 🟠 Major

Add the required logger to this config module.

SandboxingConfig contains business logic/validation, but the module never defines logger, so it cannot meet the repo rule to emit structured warnings before rejecting invalid config.

As per coding guidelines, "Every module with business logic must include logging setup: from ai_company.observability import get_logger then logger = get_logger(__name__)." and "All error paths must log at WARNING or ERROR with context before raising".

Also applies to: 35-45

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/ai_company/tools/sandbox/sandboxing_config.py` around lines 1 - 8, This
module is missing the required logger setup used by SandboxingConfig's
validation/business logic; add the import "from ai_company.observability import
get_logger" and define "logger = get_logger(__name__)" near the top of the file,
then update SandboxingConfig's model_validator and any validation/error paths
(e.g., inside the validator function(s) referenced by model_validator and any
logic around SubprocessSandboxConfig/DockerSandboxConfig usage around lines
~35-45) to call logger.warning or logger.error with contextual details before
raising exceptions or returning errors; ensure all newly added log calls include
enough context (config values or identifiers) to aid debugging.
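A minimal sketch of the log-before-raise pattern the guideline describes, using the stdlib `logging` module as a stand-in for `ai_company.observability.get_logger`; the validator function, event string, and allowed-backend set are illustrative assumptions, not the repo's actual code.

```python
import logging

# Stand-in for: from ai_company.observability import get_logger
logger = logging.getLogger(__name__)

VALID_BACKENDS = {"subprocess", "docker"}


def validate_backend(name: str) -> str:
    """Illustrative validator: emit a structured warning before rejecting."""
    if name not in VALID_BACKENDS:
        # Repo rule: error paths log at WARNING/ERROR with context first.
        logger.warning(
            "sandbox.config.validation_failed",
            extra={"backend": name, "allowed": sorted(VALID_BACKENDS)},
        )
        raise ValueError(f"Unknown sandbox backend: {name!r}")
    return name


print(validate_backend("docker"))
```

Inside a pydantic `model_validator`, the same shape applies: call `logger.warning(...)` with the offending config values, then raise `ValueError` so pydantic surfaces the failure.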

Comment on lines +164 to +172
def test_unknown_block_produces_placeholder(self) -> None:
from unittest.mock import MagicMock

unknown = MagicMock()
type(unknown).__name__ = "MysteryBlock"
raw = MCPRawResult(content=(unknown,))
result = map_call_tool_result(raw)
assert "[unknown: MysteryBlock]" in result.content
assert "attachments" not in result.metadata

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n tests/unit/tools/mcp/test_result_mapper.py | sed -n '160,180p'

Repository: Aureliolo/ai-company

Length of output: 866


🏁 Script executed:

# Check if there's a pytest fixture or setup/teardown that might clean up
grep -n "setup\|teardown\|fixture" tests/unit/tools/mcp/test_result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 46


🏁 Script executed:

# Check for other tests in the file that might be affected by MagicMock mutation
grep -n "MagicMock\|MysteryBlock" tests/unit/tools/mcp/test_result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 260


🏁 Script executed:

# Verify the test file exists and check its full context
wc -l tests/unit/tools/mcp/test_result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 110


🏁 Script executed:

# Check the imports and test class structure
head -n 30 tests/unit/tools/mcp/test_result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 859


🏁 Script executed:

# Verify if MagicMock is used elsewhere in the codebase in tests
rg "MagicMock" tests/ --type py -B2 -A2

Repository: Aureliolo/ai-company

Length of output: 50376


🏁 Script executed:

# Check the pytest configuration to understand test isolation
cat pytest.ini 2>/dev/null || cat pyproject.toml 2>/dev/null | grep -A 20 "\[tool.pytest"

Repository: Aureliolo/ai-company

Length of output: 777


🏁 Script executed:

# Check how map_call_tool_result uses the __name__ attribute
grep -n "__name__\|type(" src/ai_company/tools/mcp/result_mapper.py -A 2 -B 2

Repository: Aureliolo/ai-company

Length of output: 356


🏁 Script executed:

# Find the implementation of map_call_tool_result to see if it accesses __name__
fd "result_mapper.py" src/ --type f -exec cat {} \;

Repository: Aureliolo/ai-company

Length of output: 210


🏁 Script executed:

# Read the full result_mapper.py to see complete context
wc -l src/ai_company/tools/mcp/result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 109


🏁 Script executed:

# Get the relevant section where __name__ is used
sed -n '70,85p' src/ai_company/tools/mcp/result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 561


🏁 Script executed:

# Check if there are any other tests after test_unknown_block_produces_placeholder that might be affected
sed -n '164,215p' tests/unit/tools/mcp/test_result_mapper.py

Repository: Aureliolo/ai-company

Length of output: 1840


Replace MagicMock with a local stub class to avoid mutating the shared MagicMock class.

type(unknown).__name__ = "MysteryBlock" mutates the shared MagicMock class globally, which can cause test isolation issues if test order changes or parallelization is used. A simple local stub class achieves the same coverage without state leakage.

♻️ Suggested fix
    def test_unknown_block_produces_placeholder(self) -> None:
-        from unittest.mock import MagicMock
-
-        unknown = MagicMock()
-        type(unknown).__name__ = "MysteryBlock"
+        class MysteryBlock:
+            pass
+
+        unknown = MysteryBlock()
        raw = MCPRawResult(content=(unknown,))
        result = map_call_tool_result(raw)
        assert "[unknown: MysteryBlock]" in result.content
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/tools/mcp/test_result_mapper.py` around lines 164 - 172, The test
test_unknown_block_produces_placeholder currently mutates the global MagicMock
type by setting type(unknown).__name__ = "MysteryBlock"; replace MagicMock with
a tiny local stub class (e.g., class MysteryBlock: pass) and instantiate it
instead of using MagicMock so you don’t alter shared state; keep using
MCPRawResult(content=(instance,)) and assert the same expectations against
map_call_tool_result to preserve test behavior and isolation.

Comment on lines +40 to +42
overrides = {"code_execution": "docker", "terminal": "subprocess"}
config = SandboxingConfig(overrides=overrides)
assert config.overrides == overrides

⚠️ Potential issue | 🟠 Major

Fix the fixture types so mypy stops failing.

overrides, default, and override_backend are inferred as plain str, which is exactly why CI reports dict[str, str] / str here. Type them as Literal["subprocess", "docker"] (or a shared alias) and drop the ignore on Line 92.

♻️ Suggested fix
+from typing import Literal
+
 import pytest
 from pydantic import ValidationError
 
 from ai_company.tools.sandbox.docker_config import DockerSandboxConfig
 from ai_company.tools.sandbox.sandboxing_config import SandboxingConfig
+
+BackendName = Literal["subprocess", "docker"]
@@
     def test_overrides(self) -> None:
-        overrides = {"code_execution": "docker", "terminal": "subprocess"}
+        overrides: dict[str, BackendName] = {
+            "code_execution": "docker",
+            "terminal": "subprocess",
+        }
         config = SandboxingConfig(overrides=overrides)
         assert config.overrides == overrides
@@
     def test_parametrized_routing(
         self,
-        default: str,
-        override_backend: str,
-        expected: str,
+        default: BackendName,
+        override_backend: BackendName,
+        expected: BackendName,
     ) -> None:
         config = SandboxingConfig(
-            default_backend=default,  # type: ignore[arg-type]
+            default_backend=default,
             overrides={"code_execution": override_backend},
         )

Also applies to: 78-94

🧰 Tools
🪛 GitHub Actions: CI

[error] 41-41: mypy: Argument "overrides" to "SandboxingConfig" has incompatible type "dict[str, str]"; expected "dict[str, Literal['subprocess', 'docker']]" [arg-type]

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/tools/sandbox/test_sandboxing_config.py` around lines 40 - 42, The
test assigns plain str values to the fixtures/variables used with
SandboxingConfig (overrides, default, override_backend), causing mypy failures;
update their type annotations to use Literal["subprocess","docker"] (or define a
shared alias like Backend = Literal["subprocess","docker"]) and apply that alias
to the fixtures/variables (e.g., overrides: dict[str, Backend], default:
Backend, override_backend: Backend), then remove the temporary # type: ignore on
the earlier line so the test compiles cleanly under mypy while still asserting
SandboxingConfig(overrides=overrides) behaves as expected.

Copilot AI left a comment

Pull request overview

Copilot reviewed 47 out of 48 changed files in this pull request and generated 7 comments.



Comment on lines +9 to +11
RUN apt-get update && apt-get install -y --no-install-recommends git \
&& apt-get clean && rm -rf /var/lib/apt/lists/*

Copilot AI Mar 10, 2026

The Dockerfile does not set ReadOnly mode or drop capabilities at the image layer — those are applied at runtime by DockerSandbox._build_container_config. This is correct design. However, the Dockerfile installs git (line 9) and copies the full Node.js installation including npm. Because ReadonlyRootfs: True is applied at runtime, the sandbox container cannot write to the filesystem — but git and npm may try to write to /home, /tmp, or other paths that are blocked by the read-only rootfs. If agents need to run git or npm inside the sandbox, they will fail silently or with confusing errors. Consider documenting this constraint, or mounting a writable /tmp via tmpfs in the container config to allow tools that need temporary write access.
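If a writable `/tmp` is desired alongside `ReadonlyRootfs`, the container-create payload can mount a size-capped tmpfs. A sketch of the relevant `HostConfig` fragment — the field names (`ReadonlyRootfs`, `CapDrop`, `PidsLimit`, `Tmpfs`) follow the Docker Engine API, while the helper name and the tmpfs size/options are illustrative choices, not the PR's actual code:

```python
def build_host_config(read_only: bool = True) -> dict:
    """Container HostConfig fragment: read-only rootfs plus a bounded tmpfs."""
    return {
        "ReadonlyRootfs": read_only,
        "CapDrop": ["ALL"],
        "PidsLimit": 64,
        # Writable scratch space so tools like git/npm can create temp files
        # even though the rest of the filesystem stays read-only.
        "Tmpfs": {"/tmp": "rw,noexec,nosuid,size=64m"},
    }


print(build_host_config()["Tmpfs"])
```

The `noexec,nosuid` options keep the writable mount from weakening the sandbox: agents can create files there but not execute binaries from it.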

Comment on lines +42 to +47
"timeout": {
"type": "number",
"description": "Optional timeout in seconds",
"minimum": 0,
"maximum": 600,
},
Copilot AI Mar 10, 2026

The code_runner.py timeout schema sets "maximum": 600 (line 46), matching the PR description ("timeout schema maximum: 600"). However, DockerSandboxConfig.timeout_seconds has le=600 with a default of 120s, so a user passing timeout=600 in arguments is accepted by the schema but then clamped to min(600, config.timeout_seconds). With the default config, a schema-valid timeout of 600 is silently reduced to 120s. While this clamping is intentional and documented in the execute docstring, the discrepancy between the advertised maximum (600) and the effective default maximum (120) could surprise users. Consider either lowering the schema maximum to match the config default or adding a note to the schema description.

Comment on lines +121 to +144
async def _ensure_docker(self) -> aiodocker.Docker:
"""Lazily connect to the Docker daemon.

Returns:
An ``aiodocker.Docker`` client instance.

Raises:
SandboxStartError: If the Docker daemon is unavailable.
"""
if self._docker is not None:
return self._docker
client = aiodocker.Docker()
try:
await client.version()
except Exception as exc:
await client.close()
logger.exception(
DOCKER_DAEMON_UNAVAILABLE,
error=str(exc),
)
msg = f"Docker daemon unavailable: {exc}"
raise SandboxStartError(msg) from exc
self._docker = client
return client
Copilot AI Mar 10, 2026

The _start_and_wait method catches TimeoutError from asyncio.wait_for. In Python 3.11+, asyncio.wait_for raises the built-in TimeoutError rather than asyncio.TimeoutError, which is now an alias. The test at line 182 patches side_effect=asyncio.TimeoutError, but since the two names refer to the same class, the tests pass and both code and tests are correct as-is. The real concern is that _ensure_docker is called inside execute, which holds no lock. If two concurrent calls both see self._docker is None, each creates a new aiodocker.Docker() client; only one is stored (the second call overwrites self._docker) and the first client is leaked. Because _ensure_docker is async and nothing serializes the self._docker is None check with the subsequent assignment, this is a race condition. Consider adding an asyncio lock to guard _ensure_docker initialization.

Comment on lines +254 to +265
session = self._require_session()
logger.debug(
MCP_INVOKE_START,
server=self._config.name,
tool=tool_name,
)
async with self._lock:
try:
result = await asyncio.wait_for(
session.call_tool(tool_name, arguments),
timeout=self._config.timeout_seconds,
)
Copilot AI Mar 10, 2026

In call_tool, _require_session() is called outside the lock (line 254), and then the lock is acquired on line 260. This creates a TOCTOU (time-of-check-time-of-use) race: after _require_session() confirms the session is not None but before the lock is acquired, a concurrent disconnect() could set self._session = None. The subsequent session.call_tool(...) call at line 263 would then operate on a stale session reference. _require_session() should be called inside the async with self._lock block, consistent with how it is used in list_tools.

Comment on lines +564 to +573
if self._docker is not None:
try:
for cid in self._tracked_containers:
await self._stop_container(self._docker, cid)
await self._remove_container(self._docker, cid)
except Exception as exc:
logger.warning(
DOCKER_CLEANUP,
error=f"Container cleanup failed: {exc}",
)
Copilot AI Mar 10, 2026

The cleanup method has a bug in its exception handling. The for loop that iterates self._tracked_containers is wrapped in a single try/except. Because _stop_container and _remove_container already catch all exceptions internally and just log them, they will never raise. However, if future changes cause them to raise, the except block would break out of the loop early and skip remaining containers. Since the intent is clearly to iterate all containers regardless of errors, the try/except around the loop body should be removed (the inner methods already handle exceptions), or each call should individually be protected.
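The per-container protection the comment suggests can be sketched with stand-in stop/remove coroutines (names and the failure-collection behavior are illustrative, not the PR's code):

```python
import asyncio


async def _stop_container(cid: str) -> None:
    if cid == "bad":  # simulate one container failing to stop
        raise RuntimeError("stop failed")


async def _remove_container(cid: str) -> None:
    pass


async def cleanup(tracked: list[str]) -> list[str]:
    """Attempt stop+remove for every container; never abort the loop."""
    failed: list[str] = []
    for cid in tracked:
        try:
            await _stop_container(cid)
            await _remove_container(cid)
        except Exception:
            # Log-and-continue so one failure can't orphan the rest.
            failed.append(cid)
    return failed


print(asyncio.run(cleanup(["a", "bad", "c"])))
```

With the try/except inside the loop body, container "c" is still stopped and removed even though "bad" raised — the guarantee the outer try/except in the current code cannot provide.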

Comment on lines +76 to +78
except ValueError:
msg = f"Invalid memory_limit format: {self.memory_limit!r}"
raise ValueError(msg) from None
Copilot AI Mar 10, 2026

The DockerSandboxConfig validator _validate_memory_limit re-raises a ValueError without preserving the original exception when parsing fails: raise ValueError(msg) from None. This discards the original ValueError from int(), making debugging harder than necessary. The convention in other model validators in this codebase (e.g., sandboxing_config.py) does not suppress the original exception. Consider using raise ValueError(msg) from exc or just allowing the int() ValueError to propagate with an improved message.

Comment on lines +51 to +56
super().__init__(
name=f"mcp_{tool_info.server_name}_{tool_info.name}",
description=tool_info.description,
parameters_schema=tool_info.input_schema or None,
category=ToolCategory.MCP,
)
Copilot AI Mar 10, 2026

In MCPBridgeTool.__init__, the parameters_schema is set to tool_info.input_schema or None. Since MCPToolInfo.input_schema defaults to an empty dict {}, and an empty dict is falsy, a tool with no schema parameters will correctly pass None. However, a non-empty schema like {"type": "object", "properties": {}} (an object schema with no properties) is truthy and will be used — this is correct. The concern is that the or None idiom is subtly different from if not input_schema: None, but they behave identically for the empty-dict case, which matches the test at line 84-102. This is fine as-is, but an explicit tool_info.input_schema if tool_info.input_schema else None would be more readable.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request significantly enhances the project's capabilities by adding a hardened Docker sandbox, a comprehensive MCP bridge, and a new code runner tool. However, several security vulnerabilities were identified: insecure environment inheritance in the MCP stdio client which could leak host secrets to MCP servers, unimplemented network isolation controls in the Docker sandbox despite being present in the configuration, potential DoS via unbounded log collection, and sensitive information leakage in debug logs. Additionally, a minor bug in the Docker sandbox's memory limit parsing could lead to an unhandled exception. Addressing these will further strengthen the security posture and stability of the tool system.

self._config.memory_limit,
)
nano_cpus = int(self._config.cpu_limit * _NANO_CPUS_MULTIPLIER)


security-high high

The DockerSandboxConfig defines allowed_hosts and network_overrides fields intended for network filtering and per-category isolation. However, the DockerSandbox implementation completely ignores these fields in _build_container_config. This creates a false sense of security; users may believe they have restricted the sandbox's network access to specific hosts, while the implementation actually allows or blocks all network access based solely on the global network mode.

Comment on lines +374 to +378
Connected ``ClientSession`` (not yet initialized).
"""
if self._config.command is None:
msg = f"Server {self._config.name!r}: stdio transport requires 'command'"
logger.warning(

security-high high

In _connect_stdio, the environment for the MCP server subprocess is set to None if self._config.env is empty (which is the default). In Python's subprocess management, env=None causes the child process to inherit the full environment of the parent process. This leaks all host environment variables, including sensitive API keys and credentials, to the MCP server. Pass an empty dictionary instead to prevent inheritance.

Suggested change

params = StdioServerParameters(
    command=self._config.command,
    args=list(self._config.args),
    env=dict(self._config.env),
)

Comment on lines +491 to +501
Returns:
Tuple of (stdout, stderr) as strings.
"""
stdout_logs = await container_obj.log(
stdout=True,
stderr=False,
)
stderr_logs = await container_obj.log(
stdout=False,
stderr=True,
)

security-medium medium

The _collect_logs method retrieves the entire stdout and stderr log history from the Docker container and joins them into strings in memory. There is no limit on the amount of data collected. A malicious or runaway process in the container could generate a massive amount of output, leading to an Out-Of-Memory (OOM) condition on the host machine when the logs are joined. Consider implementing a maximum size limit or using the tail parameter.
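A simple way to bound memory is to cap the joined output; per the Docker Engine API, the logs endpoint also accepts a `tail` parameter that limits lines at the source (whether aiodocker's `container.log()` forwards it as a keyword is an assumption to verify). A pure-Python sketch of the truncation side, with illustrative names:

```python
def join_bounded(lines: list[str], max_bytes: int = 1_000_000) -> str:
    """Join log lines, truncating once the byte budget is exhausted."""
    out: list[str] = []
    size = 0
    for line in lines:
        encoded = line.encode()
        if size + len(encoded) > max_bytes:
            out.append("... [output truncated]")
            break
        out.append(line)
        size += len(encoded)
    return "".join(out)


print(join_bounded(["x" * 10] * 5, max_bytes=25))
```

Combining both defenses (tail at the daemon, byte cap on join) keeps a runaway container from exhausting host memory regardless of which layer misbehaves.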

Comment on lines +299 to +306
)
container_cwd = self._resolve_cwd_in_container(cwd)

logger.debug(
DOCKER_EXECUTE_START,
command=command,
args=args,
cwd=container_cwd,

security-medium medium

The execute method logs the full command and all arguments at the DEBUG level. When used by the CodeRunnerTool, the args include the entire source code snippet being executed. If this code contains hardcoded secrets or sensitive data injected by the LLM, they will be persisted in the host's logs in plain text. Consider truncating the logged arguments or avoiding logging the full code snippet.
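Truncating argument values before they reach the log call is a small helper; the function name, length cap, and elision marker below are illustrative assumptions, not the PR's code:

```python
def redact_for_log(args: list[str], max_len: int = 120) -> list[str]:
    """Shorten long argument values (e.g. inline source code) before logging."""
    return [
        a if len(a) <= max_len
        else a[:max_len] + f"... [{len(a) - max_len} chars elided]"
        for a in args
    ]


print(redact_for_log(["python", "-c", "secret = 'k'\n" * 50], max_len=20))
```

Applying this at the DOCKER_EXECUTE_START call site keeps debug logs useful (the command and argument shapes survive) while preventing full code snippets — and any secrets embedded in them — from being persisted.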

Comment on lines +257 to +260
multipliers = {"k": 1024, "m": 1024**2, "g": 1024**3}
if limit_lower[-1] in multipliers:
result = int(limit_lower[:-1]) * multipliers[limit_lower[-1]]
else:

medium

The _parse_memory_limit function can crash with a ValueError if a user provides a memory limit that consists only of a suffix, like "m" or "g". The NotBlankStr validator on the config model allows this input, but int(limit_lower[:-1]) will fail when limit_lower[:-1] is an empty string. This unhandled exception could crash the service when creating a Docker container.

I suggest wrapping the parsing logic in a try...except block to catch this and raise a more informative ValueError, making the function more resilient to malformed user input.

Suggested change

try:
    if limit_lower[-1] in multipliers:
        result = int(limit_lower[:-1]) * multipliers[limit_lower[-1]]
    else:
        result = int(limit_lower)
except (ValueError, IndexError) as exc:
    raise ValueError(f"Invalid memory limit format: {limit!r}") from exc

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
tests/unit/tools/sandbox/test_sandboxing_config.py (2)

39-45: 🛠️ Refactor suggestion | 🟠 Major

Fix the overrides type annotation to match the expected literal type.

The overrides variable is annotated as dict[str, str] but SandboxingConfig expects dict[str, Literal['subprocess', 'docker']]. This forces the # type: ignore[arg-type] on line 44. Use the proper literal type to eliminate the suppression.

♻️ Suggested fix
+from typing import Literal
+
+BackendName = Literal["subprocess", "docker"]
+
     def test_overrides(self) -> None:
-        overrides: dict[str, str] = {
+        overrides: dict[str, BackendName] = {
             "code_execution": "docker",
             "terminal": "subprocess",
         }
-        config = SandboxingConfig(overrides=overrides)  # type: ignore[arg-type]
+        config = SandboxingConfig(overrides=overrides)
         assert config.overrides == overrides
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/tools/sandbox/test_sandboxing_config.py` around lines 39 - 45, The
test's overrides variable is annotated too broadly as dict[str, str]; update its
type to dict[str, Literal['subprocess', 'docker']] to match SandboxingConfig's
expected type (and remove the "# type: ignore[arg-type]" suppression), and add
the necessary Literal import from typing at the top of the test file so the
annotation is valid; reference: the overrides variable in test_overrides and the
SandboxingConfig construction.

81-98: 🛠️ Refactor suggestion | 🟠 Major

Fix parameterized test parameter types to use the literal backend type.

The parameters default, override_backend, and expected are typed as plain str, requiring # type: ignore suppressions on lines 95-96. Use the proper literal type to eliminate these suppressions and ensure type safety.

♻️ Suggested fix
+from typing import Literal
+
+BackendName = Literal["subprocess", "docker"]
+
    @pytest.mark.parametrize(
         ("default", "override_backend", "expected"),
         [
             ("subprocess", "docker", "docker"),
             ("docker", "subprocess", "subprocess"),
         ],
     )
     def test_parametrized_routing(
         self,
-        default: str,
-        override_backend: str,
-        expected: str,
+        default: BackendName,
+        override_backend: BackendName,
+        expected: BackendName,
     ) -> None:
         config = SandboxingConfig(
-            default_backend=default,  # type: ignore[arg-type]
-            overrides={"code_execution": override_backend},  # type: ignore[dict-item]
+            default_backend=default,
+            overrides={"code_execution": override_backend},
         )
         assert config.backend_for_category("code_execution") == expected
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/tools/sandbox/test_sandboxing_config.py` around lines 81 - 98, The
test uses plain str types causing type ignores; change the parameter annotations
in test_parametrized_routing to the specific Literal backend type (e.g.,
typing.Literal["docker", "subprocess"]) for default, override_backend, and
expected so the SandboxingConfig construction and overrides={"code_execution":
...} no longer require # type: ignore; keep the same param values and assertions
against SandboxingConfig.backend_for_category("code_execution").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/tools/test_docker_sandbox_integration.py`:
- Line 18: The per-test timeout set on pytestmark in
tests/integration/tools/test_docker_sandbox_integration.py currently uses 60s;
change pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)] to use
30s (pytest.mark.timeout(30)) to conform to the coding guideline, or if 60s is
required for Docker startup, add a short comment above the pytestmark explaining
the exception and why this test needs a 60s timeout so reviewers know the
deviation is intentional.

---

Duplicate comments:
In `@tests/unit/tools/sandbox/test_sandboxing_config.py`:
- Around line 39-45: The test's overrides variable is annotated too broadly as
dict[str, str]; update its type to dict[str, Literal['subprocess', 'docker']] to
match SandboxingConfig's expected type (and remove the "# type:
ignore[arg-type]" suppression), and add the necessary Literal import from typing
at the top of the test file so the annotation is valid; reference: the overrides
variable in test_overrides and the SandboxingConfig construction.
- Around line 81-98: The test uses plain str types causing type ignores; change
the parameter annotations in test_parametrized_routing to the specific Literal
backend type (e.g., typing.Literal["docker", "subprocess"]) for default,
override_backend, and expected so the SandboxingConfig construction and
overrides={"code_execution": ...} no longer require # type: ignore; keep the
same param values and assertions against
SandboxingConfig.backend_for_category("code_execution").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82487bb7-aece-458e-a6c6-5d9d8e73bd89

📥 Commits

Reviewing files that changed from the base of the PR and between bc1cd55 and e777e9b.

📒 Files selected for processing (2)
  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: No from __future__ import annotations — Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use except A, B: (no parentheses) — ruff enforces this on Python 3.14
All public functions must have type hints; mypy strict mode is enforced
Public classes and functions must have Google-style docstrings (enforced by ruff D rules)
Every module with business logic must include logging setup: from ai_company.observability import get_logger then logger = get_logger(__name__). Never use import logging or print() in application code. Variable name must always be logger
Use structured logging with event name constants: logger.info(EVENT_CONSTANT, key=value) — never use string interpolation like logger.info('msg %s', val). Import event constants directly from ai_company.observability.events.<domain>
All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Line length maximum is 88 characters (enforced by ruff)
Functions must be < 50 lines; files must be < 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries

Files:

  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
tests/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

tests/**/*.py: Use pytest markers: @pytest.mark.unit, @pytest.mark.integration, @pytest.mark.e2e, @pytest.mark.slow. Minimum coverage 80% (enforced in CI). Async mode: asyncio_mode = 'auto' — no manual @pytest.mark.asyncio needed. Per-test timeout: 30 seconds. Use pytest-xdist via -n auto for parallelism. Prefer @pytest.mark.parametrize for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: example-provider, example-large-001, example-medium-001, example-small-001, large/medium/small as aliases. Tests must use test-provider, test-small-001, etc. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) .claude/ files, (3) third-party import paths

Files:

  • tests/integration/tools/test_docker_sandbox_integration.py
  • tests/unit/tools/sandbox/test_sandboxing_config.py
🧬 Code graph analysis (1)
tests/integration/tools/test_docker_sandbox_integration.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
  • DockerSandboxConfig (12-94)
src/ai_company/tools/sandbox/docker_sandbox.py (6)
  • DockerSandbox (70-610)
  • config (112-114)
  • workspace (117-119)
  • execute (267-319)
  • cleanup (558-583)
  • health_check (585-606)
src/ai_company/tools/sandbox/protocol.py (3)
  • execute (25-51)
  • cleanup (53-59)
  • health_check (61-67)
🔇 Additional comments (10)
tests/unit/tools/sandbox/test_sandboxing_config.py (4)

1-9: LGTM!

Module setup is correct: imports are clean, and the pytestmark applies both pytest.mark.unit and pytest.mark.timeout(30) as required by coding guidelines.


12-25: LGTM!

Default value assertions and frozen behavior test are well-structured. The # type: ignore[misc] on line 25 is appropriate since the test intentionally attempts an invalid mutation to verify immutability.


53-57: LGTM!

Clean test for custom Docker configuration integration.


63-79: LGTM!

Individual routing tests are clear and cover the key scenarios: default when no override, override when present, and default for unconfigured categories.

tests/integration/tools/test_docker_sandbox_integration.py (6)

1-17: LGTM!

Clean imports and appropriate use of TYPE_CHECKING for the Path import.


24-45: Image availability check now implemented — prior concern resolved.

The helper correctly verifies both Docker daemon connectivity (client.version()) and image presence (client.images.inspect(_TEST_IMAGE)) before running tests. The try/finally pattern ensures proper client cleanup on all code paths. Tests will now skip cleanly in CI environments where the image isn't pre-pulled.
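A CLI-based analogue of that availability check (a sketch only; the test file itself uses the aiodocker client, and `_TEST_IMAGE` is project-specific):

```python
import shutil
import subprocess


def docker_image_available(image: str) -> bool:
    """Return True only if the docker CLI exists, the daemon responds,
    and the given image is already pulled locally."""
    if shutil.which("docker") is None:
        return False
    try:
        # Daemon connectivity check (mirrors client.version()).
        subprocess.run(
            ["docker", "version"], check=True, capture_output=True, timeout=10
        )
        # Image presence check (mirrors client.images.inspect(...)).
        subprocess.run(
            ["docker", "image", "inspect", image],
            check=True,
            capture_output=True,
            timeout=10,
        )
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False
    return True
```

Probing both the daemon and the image, rather than just the daemon, is what lets the tests skip cleanly instead of failing mid-run on a missing pull.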


48-51: LGTM!

The skip marker properly gates tests on both Docker daemon availability and image presence, with a clear skip reason.


54-77: LGTM!

Test class is properly decorated with the skip marker. test_run_python_code correctly uses try/finally for cleanup and validates both success status and stdout content.


78-97: LGTM!

The timeout test correctly uses a short execution timeout (2.0s) against a long-running command (sleep 60) to verify timeout enforcement. Proper assertions for both timed_out and success flags.


99-105: LGTM!

Health check test appropriately uses default sandbox configuration since health_check() only verifies daemon connectivity without requiring a specific image. Cleanup is properly handled.

if TYPE_CHECKING:
from pathlib import Path

pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)]

🧹 Nitpick | 🔵 Trivial

Per-test timeout exceeds coding guideline.

The guideline specifies "Per-test timeout: 30 seconds" but this module uses 60 seconds. While the extended timeout may be justified for Docker integration tests (container startup overhead), consider documenting this exception or adjusting if 30s is feasible after the skip logic filters out environments without pre-pulled images.

As per coding guidelines: "Per-test timeout: 30 seconds."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/tools/test_docker_sandbox_integration.py` at line 18, The
per-test timeout set on pytestmark in
tests/integration/tools/test_docker_sandbox_integration.py currently uses 60s;
change pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)] to use
30s (pytest.mark.timeout(30)) to conform to the coding guideline, or if 60s is
required for Docker startup, add a short comment above the pytestmark explaining
the exception and why this test needs a 60s timeout so reviewers know the
deviation is intentional.
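If the 60s timeout stays, the documented-exception option could look like this (a sketch of the suggested comment, not the file's current content):

```python
import pytest

# Exception to the 30s per-test timeout guideline: Docker integration tests
# pay container pull/startup overhead that can exceed 30s in CI.
pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)]
```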

Copilot

Key fixes:
- Docker sandbox: add /tmp tmpfs mount for ReadonlyRootfs (#31/#46),
  add asyncio.Lock to _ensure_docker/cleanup (#38/#48), truncate
  stderr in structured logs (#40), refactor _start_and_wait into
  helpers (#39), remove per-loop try/except in cleanup (#50),
  remove dead auto_remove config field (#21)
- MCP client: move _require_session() inside lock in call_tool()
  to fix TOCTOU race (#32/#49), reject double connect() (#35),
  add BaseException handler in connect() for CancelledError (#36)
- MCP factory: log before raising in reuse guard (#37)
- MCP result_mapper: split MCP_RESULT_UNKNOWN_BLOCK from
  MCP_RESULT_MAPPED event (#33)
- sandboxing_config: add required logger (#41)
- docker_config: preserve original ValueError in chain (#51)
- DESIGN_SPEC.md: remove duplicate tools/sandbox/ tree (#34)
- Tests: assert stop+delete counts (#44), add tmpfs test
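The TOCTOU fix noted for `call_tool()` (moving the session check inside the lock) follows this pattern (an illustrative sketch; the class and method bodies are simplified stand-ins, not the project's actual `MCPClient`):

```python
import asyncio


class Client:
    """Toy client showing the check-inside-lock pattern."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._session: str | None = None

    async def call_tool(self, name: str) -> str:
        # Checking the session while holding the lock closes the TOCTOU
        # window: a concurrent disconnect() that also takes the lock can
        # no longer invalidate the session between the check and the use.
        async with self._lock:
            if self._session is None:
                raise RuntimeError("not connected")
            return f"{self._session}:{name}"

    async def disconnect(self) -> None:
        async with self._lock:
            self._session = None
```

Checking `self._session` before acquiring the lock would leave a gap in which `disconnect()` could null the session out from under the caller.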
@Aureliolo Aureliolo merged commit d5e1b6e into main Mar 10, 2026
8 of 9 checks passed
@Aureliolo Aureliolo deleted the feat/sandboxing-and-mcp-bridge branch March 10, 2026 09:03
Comment on lines +45 to +47
"minimum": 0,
"maximum": 600,
},

timeout: 0 always produces an immediate timeout

The schema specifies "minimum": 0, which allows a caller to pass timeout=0.0. This value flows through to DockerSandbox._wait_for_exit as asyncio.wait_for(container_obj.wait(), timeout=0.0). In Python's asyncio, a timeout <= 0 causes an immediate TimeoutError — the container is stopped before it even runs a single instruction, and every invocation unconditionally returns timed_out=True.

This is the same constraint already applied by DockerSandboxConfig.timeout_seconds (gt=0) and MCPServerConfig.timeout_seconds (gt=0, le=600). The JSON Schema equivalent is "exclusiveMinimum":

Suggested change
"minimum": 0,
"maximum": 600,
},
"timeout": {
"type": "number",
"description": "Optional timeout in seconds",
"exclusiveMinimum": 0,
"maximum": 600,
},
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ai_company/tools/code_runner.py
Line: 45-47

Comment:
**`timeout: 0` always produces an immediate timeout**

The schema specifies `"minimum": 0`, which allows a caller to pass `timeout=0.0`. This value flows through to `DockerSandbox._wait_for_exit` as `asyncio.wait_for(container_obj.wait(), timeout=0.0)`. In Python's asyncio, a timeout `<= 0` causes an _immediate_ `TimeoutError` — the container is stopped before it even runs a single instruction, and every invocation unconditionally returns `timed_out=True`.

This is the same constraint already applied by `DockerSandboxConfig.timeout_seconds` (`gt=0`) and `MCPServerConfig.timeout_seconds` (`gt=0, le=600`). The JSON Schema equivalent is `"exclusiveMinimum"`:

```suggestion
        "timeout": {
            "type": "number",
            "description": "Optional timeout in seconds",
            "exclusiveMinimum": 0,
            "maximum": 600,
        },
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +147 to +157
except MCPError as exc:
logger.warning(
MCP_INVOKE_FAILED,
tool=self._tool_info.name,
server=self._tool_info.server_name,
error=str(exc),
)
return ToolExecutionResult(
content=str(exc),
is_error=True,
)

Double emission of MCP_INVOKE_FAILED for invocation errors

MCPClient.call_tool() already logs MCP_INVOKE_FAILED (via logger.exception) before raising MCPInvocationError. The bridge then catches that MCPError here and emits MCP_INVOKE_FAILED a second time. Every tool invocation error therefore produces two mcp.invoke.failed log records, doubling alert noise and skewing aggregation counts.

The same double-firing occurs for timeouts: call_tool emits MCP_INVOKE_TIMEOUT, then the bridge emits MCP_INVOKE_FAILED for the resulting MCPTimeoutError, making it appear in timeout dashboards and general failure dashboards simultaneously.

Since the client-side logs already include server and tool context, the bridge-level log is redundant. Either drop the logger.warning here and let the client's log stand, or demote it to logger.debug to signal "boundary crossed" without duplicating the event at the same severity.

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ai_company/tools/mcp/bridge_tool.py
Line: 147-157

Comment:
**Double emission of `MCP_INVOKE_FAILED` for invocation errors**

`MCPClient.call_tool()` already logs `MCP_INVOKE_FAILED` (via `logger.exception`) _before_ raising `MCPInvocationError`. The bridge then catches that `MCPError` here and emits `MCP_INVOKE_FAILED` a second time. Every tool invocation error therefore produces two `mcp.invoke.failed` log records, doubling alert noise and skewing aggregation counts.

The same double-firing occurs for timeouts: `call_tool` emits `MCP_INVOKE_TIMEOUT`, then the bridge emits `MCP_INVOKE_FAILED` for the resulting `MCPTimeoutError`, making it appear in timeout dashboards _and_ general failure dashboards simultaneously.

Since the client-side logs already include `server` and `tool` context, the bridge-level log is redundant. Either drop the `logger.warning` here and let the client's log stand, or demote it to `logger.debug` to signal "boundary crossed" without duplicating the event at the same severity.

How can I resolve this? If you propose a fix, please make it concise.
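The demote-to-debug option could look like the following (a sketch using stdlib logging for self-containment; the project uses its own structured logger, and `MCPError` here is a stand-in for its error hierarchy):

```python
import logging

logger = logging.getLogger(__name__)
MCP_INVOKE_FAILED = "mcp.invoke.failed"


class MCPError(Exception):
    """Stand-in for the project's MCP error hierarchy."""


def bridge_invoke(call):
    """Run an MCP call, returning (content, is_error) instead of raising."""
    try:
        return call(), False
    except MCPError as exc:
        # The client already logged MCP_INVOKE_FAILED at WARNING with
        # server/tool context; DEBUG here marks the boundary crossing
        # without duplicating the event at alerting severity.
        logger.debug("%s error=%s", MCP_INVOKE_FAILED, exc)
        return str(exc), True
```

Either this demotion or dropping the bridge-level log entirely keeps each failure to a single `mcp.invoke.failed` record.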



Development

Successfully merging this pull request may close these issues.

Evaluate and implement code execution sandboxing (Docker/WASM/Firecracker)

2 participants