fix: harden Docker sandbox, MCP bridge, and code runner (#194)
Conversation
Docker sandbox (#50):
- DockerSandboxConfig, SandboxingConfig for backend selection
- DockerSandbox implementing SandboxBackend via aiodocker
- CodeRunnerTool for Python/JS/Bash execution in containers
- Sandbox Dockerfile (Python 3.14 + Node.js 22)

MCP bridge (#53):
- MCPServerConfig/MCPConfig for stdio and streamable_http transports
- MCPClient thin wrapper over mcp SDK with session management
- MCPBridgeTool exposing MCP tools as native BaseTool instances
- MCPResultCache with TTL + LRU eviction
- MCPToolFactory for parallel server connection and tool discovery
- D18 result mapping (text, image, audio, embedded resources)
- MCP error hierarchy extending ToolError

Both integrated into RootConfig with sandboxing and mcp fields.
The test iterated through 30 error cycles, each sleeping 1s via _POLL_TIMEOUT, causing it to take ~30s and crash pytest-xdist workers at the timeout boundary. Patching _POLL_TIMEOUT to 0.0 makes the test complete in <1s.
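The constant-patching fix described above can be sketched roughly as follows. This is an illustrative model, not the project's actual test: `sandbox` stands in for the real module that owns `_POLL_TIMEOUT`, and `poll_for_exit` simulates the 30-cycle error loop.

```python
import asyncio
import time
from types import SimpleNamespace
from unittest import mock

# Stand-in for the real module that owns _POLL_TIMEOUT (name is illustrative).
sandbox = SimpleNamespace(_POLL_TIMEOUT=1.0)

async def poll_for_exit(cycles: int) -> None:
    """Simulate the failing loop: each error cycle sleeps _POLL_TIMEOUT."""
    for _ in range(cycles):
        await asyncio.sleep(sandbox._POLL_TIMEOUT)

# Patching the constant to 0.0 lets all 30 cycles finish almost instantly,
# instead of ~30s of real sleeping.
with mock.patch.object(sandbox, "_POLL_TIMEOUT", 0.0):
    start = time.monotonic()
    asyncio.run(poll_for_exit(30))
    elapsed = time.monotonic() - start

assert elapsed < 1.0
```

In a real pytest test the same effect is typically achieved with the `monkeypatch` fixture (`monkeypatch.setattr(module, "_POLL_TIMEOUT", 0.0)`); `mock.patch.object` is used here only to keep the sketch self-contained.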
Pre-reviewed by 10 agents, 60 findings addressed:
- Docker: multi-stage Dockerfile, container hardening (PidsLimit, ReadonlyRootfs, CapDrop), resource leak fixes, memory limit validation
- MCP client: connect timeout enforcement, resource cleanup on failure, proper disconnect error handling
- MCP bridge: error result wrapping, cache TypeError resilience, unknown content block handling
- MCP factory: TaskGroup partial failure cleanup, complexity reduction
- MCP config: field bounds enforcement, dead field removal, validation event constants
- Docker/sandbox config: Field descriptors, network override validation, mount_mode default ro
- Code runner: strict required param access, timeout schema bounds
- Docs: DESIGN_SPEC §15.2/§15.3, CLAUDE.md, README.md updated
- Tests: 15+ new tests for hardening, error paths, protocol conformance, HTTP transport
Dependency Review: the following issues were found:
- License issues: uv.lock
- OpenSSF Scorecard: see Scorecard details
Caution: review failed. The pull request is closed.

ℹ️ Recent review info: configuration used: Organization UI; review profile: ASSERTIVE; plan: Pro.

📒 Files selected for processing (10)
📝 Walkthrough: Summary by CodeRabbit
Walkthrough: Adds a Docker-backed sandbox and CodeRunner tool, restructures MCP integration into a new ai_company.tools.mcp package (client, factory, bridge, cache, models, errors, result mapper), introduces observability events for code_runner/docker/mcp, extends the config schema (sandboxing, mcp), pins runtime deps, and adds extensive tests and exports.
Sequence Diagram(s)

sequenceDiagram
rect rgba(220,240,255,0.5)
participant Agent
participant CodeRunnerTool
participant DockerSandbox
participant DockerDaemon
Agent->>CodeRunnerTool: execute(code, language, timeout)
CodeRunnerTool->>CodeRunnerTool: validate language\nemit CODE_RUNNER_EXECUTE_START
CodeRunnerTool->>DockerSandbox: execute(command, args, timeout)
DockerSandbox->>DockerDaemon: create_container(config)
DockerDaemon-->>DockerSandbox: container_id
DockerSandbox->>DockerDaemon: start_container(container_id)
DockerSandbox->>DockerDaemon: wait_for_exit(container_id, timeout)
DockerDaemon-->>DockerSandbox: stdout/stderr, exit_code
DockerSandbox->>DockerDaemon: remove_container(container_id)
DockerSandbox-->>CodeRunnerTool: SandboxResult
CodeRunnerTool->>CodeRunnerTool: emit CODE_RUNNER_EXECUTE_SUCCESS / FAILED
CodeRunnerTool-->>Agent: ToolExecutionResult
end
sequenceDiagram
rect rgba(240,255,220,0.5)
participant Factory as MCPToolFactory
participant Client as MCPClient
participant Server as MCP Server
participant Bridge as MCPBridgeTool
participant User as Agent
Factory->>Factory: filter enabled servers
par connect & discover per server
Factory->>Client: connect()
Client->>Server: establish connection
Server-->>Client: connected
Client->>Server: list_tools()
Server-->>Client: [MCPToolInfo...]
Client-->>Factory: tools
Factory->>Bridge: create MCPBridgeTool(tool_info, client, cache?)
and
end
Factory-->>User: [MCPBridgeTool...]
User->>Bridge: execute(arguments)
Bridge->>Bridge: check cache (MCP_CACHE_HIT / MISS)
alt cache hit
Bridge-->>User: cached ToolExecutionResult
else cache miss
Bridge->>Client: call_tool(name, arguments)
Client->>Server: invoke
Server-->>Client: MCPRawResult
Client-->>Bridge: raw result
Bridge->>Bridge: map_call_tool_result -> ToolExecutionResult
Bridge->>Bridge: store in cache (if eligible)
Bridge-->>User: ToolExecutionResult
end
end
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Greptile Summary: This PR introduces the full MCP bridge subsystem. Two issues remain worth addressing before merge:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

sequenceDiagram
participant Caller
participant MCPBridgeTool
participant MCPResultCache
participant MCPClient
participant MCPServer
Caller->>MCPBridgeTool: execute(arguments)
MCPBridgeTool->>MCPResultCache: get(tool_name, arguments)
alt Cache HIT
MCPResultCache-->>MCPBridgeTool: ToolExecutionResult (cached)
MCPBridgeTool-->>Caller: ToolExecutionResult
else Cache MISS / TypeError
MCPResultCache-->>MCPBridgeTool: None
MCPBridgeTool->>MCPClient: call_tool(tool_name, arguments)
MCPClient->>MCPClient: acquire _lock
MCPClient->>MCPClient: _require_session()
MCPClient->>MCPServer: asyncio.wait_for(session.call_tool, timeout)
alt Success
MCPServer-->>MCPClient: MCPRawResult
MCPClient-->>MCPBridgeTool: MCPRawResult
MCPBridgeTool->>MCPBridgeTool: map_call_tool_result(raw)
MCPBridgeTool->>MCPResultCache: put(tool_name, arguments, result)
MCPBridgeTool-->>Caller: ToolExecutionResult
else TimeoutError
MCPClient->>MCPClient: log MCP_INVOKE_TIMEOUT
MCPClient-->>MCPBridgeTool: raise MCPTimeoutError
MCPBridgeTool->>MCPBridgeTool: log MCP_INVOKE_FAILED (duplicate)
MCPBridgeTool-->>Caller: ToolExecutionResult(is_error=True)
else MCPInvocationError
MCPClient->>MCPClient: log MCP_INVOKE_FAILED
MCPClient-->>MCPBridgeTool: raise MCPInvocationError
MCPBridgeTool->>MCPBridgeTool: log MCP_INVOKE_FAILED (duplicate)
MCPBridgeTool-->>Caller: ToolExecutionResult(is_error=True)
end
end
Last reviewed commit: 106ec1c
Pull request overview
This PR hardens the Docker sandbox, implements the MCP bridge for external tool integration, and adds a code runner tool — completing issues #50 and #53. It introduces security hardening for Docker containers, full MCP client lifecycle management with timeouts and cleanup, and comprehensive test coverage across both new subsystems.
Changes:
- Docker sandbox: New `DockerSandbox` backend with container hardening (`PidsLimit=64`, `ReadonlyRootfs`, `CapDrop=ALL`), resource limit validation, and lazy Docker client connection
- MCP bridge: Full MCP client (`MCPClient`) with stdio/HTTP transport, `MCPBridgeTool` wrapping MCP tools as internal tools, TTL+LRU result cache, and `MCPToolFactory` for parallel server connection with cleanup on partial failure
- Code runner + config: New `CodeRunnerTool`, `SandboxingConfig`/`DockerSandboxConfig` config models added to `RootConfig`, 15+ new tests across all new components
Reviewed changes
Copilot reviewed 46 out of 47 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `src/ai_company/tools/sandbox/docker_sandbox.py` | New Docker sandbox backend with container hardening and lifecycle management |
| `src/ai_company/tools/sandbox/docker_config.py` | Pydantic config model for Docker sandbox |
| `src/ai_company/tools/sandbox/sandboxing_config.py` | Top-level sandboxing config with backend routing |
| `src/ai_company/tools/mcp/client.py` | MCP client with connect timeout, AsyncExitStack cleanup, and serialized invocation |
| `src/ai_company/tools/mcp/bridge_tool.py` | MCPBridgeTool wrapping MCP tools with cache support and MCPError handling |
| `src/ai_company/tools/mcp/factory.py` | MCPToolFactory with TaskGroup parallelism and partial failure cleanup |
| `src/ai_company/tools/mcp/cache.py` | TTL+LRU result cache |
| `src/ai_company/tools/mcp/config.py` | MCPServerConfig/MCPConfig with field bounds validation |
| `src/ai_company/tools/mcp/result_mapper.py` | Maps MCP raw results to internal ToolExecutionResult |
| `src/ai_company/tools/mcp/models.py` | MCPToolInfo and MCPRawResult value objects |
| `src/ai_company/tools/mcp/errors.py` | MCP error hierarchy extending ToolError |
| `src/ai_company/tools/code_runner.py` | CodeRunnerTool delegating to SandboxBackend |
| `src/ai_company/observability/events/mcp.py` | MCP event constants |
| `src/ai_company/observability/events/docker.py` | Docker event constants |
| `src/ai_company/observability/events/code_runner.py` | Code runner event constants |
| `src/ai_company/config/schema.py` | RootConfig gains sandboxing and mcp fields |
| `docker/sandbox/Dockerfile` | Multi-stage sandbox image |
| `DESIGN_SPEC.md` | Updated §15.2/§15.3 with new files and status |
| `CLAUDE.md` | Updated package structure and logging events |
| `README.md` | Updated M7 status and tool system description |
| `pyproject.toml` / `uv.lock` | Added aiodocker==0.26.0 and mcp==1.26.0 |
| Test files | 15+ new tests covering all new components |
Issues found:
- Resource leak in `_ensure_docker` (`src/ai_company/tools/sandbox/docker_sandbox.py`, lines 132–141): When `client.version()` raises an exception, the newly created `aiodocker.Docker()` client is not closed before `SandboxStartError` is raised. The underlying aiohttp session/connector is leaked. The `except` branch should call `await client.close()` before re-raising.
- Wrong event constant in `shutdown` (`src/ai_company/tools/mcp/factory.py`, lines 144–148): `MCP_FACTORY_SERVER_SKIPPED` is used to log disconnect failures during `shutdown()`. This event implies a server was skipped during factory startup, not that a disconnect failed. The already-defined `MCP_CLIENT_DISCONNECT_FAILED` constant should be used instead.
- Documentation typo in DESIGN_SPEC.md (lines 3029, 3034): The directory tree uses `McpBridgeTool` but the actual class name is `MCPBridgeTool`.
- Documentation error in DESIGN_SPEC.md (line 3030): `cache.py` is described as "Tool schema caching" but it provides MCP invocation result caching (TTL+LRU), not schema caching.
```python
logger.warning(
    MCP_FACTORY_SERVER_SKIPPED,
    server=client.server_name,
    reason=f"disconnect failed: {exc}",
)
```
In shutdown, the MCP_FACTORY_SERVER_SKIPPED event constant is used to log disconnect failures. This event name implies a server was skipped during factory setup, not that disconnection failed. A dedicated event constant (e.g., MCP_CLIENT_DISCONNECT_FAILED already exists in mcp.py) should be used here instead, or at minimum the existing MCP_CLIENT_DISCONNECT_FAILED constant should be imported and used for this disconnect failure log.
DESIGN_SPEC.md (outdated):
```text
│ │ ├── bridge_tool.py   # McpBridgeTool (BaseTool integration)
│ │ ├── cache.py         # Tool schema caching
│ │ ├── client.py        # MCP client wrapper
│ │ ├── config.py        # MCP server/bridge config models
│ │ ├── errors.py        # MCP error hierarchy
│ │ ├── factory.py       # McpBridgeTool factory
```
The DESIGN_SPEC.md uses McpBridgeTool (Pascal case mixing) in the directory tree comments (lines 3029 and 3034), but the actual class is named MCPBridgeTool. These documentation comments in the spec are incorrect and should be updated to MCPBridgeTool.
DESIGN_SPEC.md (outdated):
```text
│ │ └── mcp/             # MCP bridge subpackage
│ │     ├── __init__.py  # Package exports
│ │     ├── bridge_tool.py # McpBridgeTool (BaseTool integration)
│ │     ├── cache.py     # Tool schema caching
```
The DESIGN_SPEC.md describes cache.py as "Tool schema caching" but the actual module provides MCP tool result caching (TTL+LRU), not schema caching. The comment should read something like "MCP result cache (TTL + LRU)".
```python
try:
    client = aiodocker.Docker()
    await client.version()
except Exception as exc:
    logger.exception(
        DOCKER_DAEMON_UNAVAILABLE,
        error=str(exc),
    )
    msg = f"Docker daemon unavailable: {exc}"
    raise SandboxStartError(msg) from exc
```
In _ensure_docker, if client.version() throws an exception, the aiodocker.Docker() client instance (client) is created but never closed before raising SandboxStartError. This leaks the underlying aiohttp session/connector. The try/except/else pattern does not close client in the exception path. The fix is to close the client in the except block before re-raising.
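The cleanup pattern this comment asks for can be sketched as follows. This is an illustrative model only: `FakeDockerClient` stands in for `aiodocker.Docker` (so the sketch runs without a Docker daemon), and `SandboxStartError`/`ensure_docker` are stand-ins for the project's types. The point is the shape of the `except` branch: close the half-initialized client before re-raising.

```python
import asyncio

class SandboxStartError(Exception):
    """Stand-in for the project's sandbox startup error type."""

created: list["FakeDockerClient"] = []

class FakeDockerClient:
    """Stub for aiodocker.Docker: version() fails, close() must still run."""
    def __init__(self) -> None:
        self.closed = False
        created.append(self)

    async def version(self) -> dict:
        raise ConnectionError("daemon unavailable")

    async def close(self) -> None:
        self.closed = True

async def ensure_docker() -> FakeDockerClient:
    client = FakeDockerClient()
    try:
        await client.version()
    except Exception as exc:
        # Close the half-initialized client before surfacing the failure,
        # so the underlying session/connector is not leaked.
        await client.close()
        msg = f"Docker daemon unavailable: {exc}"
        raise SandboxStartError(msg) from exc
    return client

try:
    asyncio.run(ensure_docker())
except SandboxStartError:
    pass

assert created[0].closed is True  # no leaked client on the error path
```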
Actionable comments posted: 22
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
DESIGN_SPEC.md (1)
2123-2134: ⚠️ Potential issue | 🟡 Minor: Update the sandbox example to the hardened default.
This YAML still shows `mount_mode: "rw"`, which contradicts the PR's read-only-by-default hardening and leaves the spec advertising the old behavior. As per coding guidelines, "When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@DESIGN_SPEC.md` around lines 2123-2134: the docker example in DESIGN_SPEC.md still shows mount_mode: "rw" which contradicts the hardened default; update the docker example block so mount_mode is the read-only hardened default (replace "rw" with "ro") and, if the spec wants to document the common exception for the workspace, add a brief note next to mount_mode or in the docker section describing that only the workspace is mounted read-write via a separate workspace mount setting (referencing the docker keys mount_mode and allowed_hosts to keep context).

tests/unit/observability/test_events.py (1)
175-215: ⚠️ Potential issue | 🟡 Minor: Add exact-value checks for the new event domains.
Adding `code_runner`, `docker`, and `mcp` to discovery is good, but there are still no domain-specific assertions for those public constants. A renamed event would still satisfy the generic discovery, pattern, and duplicate checks. As per coding guidelines for `tests/**/*.py`: prefer `@pytest.mark.parametrize` for testing similar cases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/observability/test_events.py` around lines 175 - 215, Test test_all_domain_modules_discovered only checks module discovery; add a new parametrized test (use `@pytest.mark.parametrize`) that iterates over the new domain modules code_runner, docker, and mcp and asserts each module exposes its expected public event constant(s). Specifically, import events.code_runner, events.docker, events.mcp inside the test and for each module assert hasattr(module, "<EXPECTED_CONSTANT_NAME>") (replace <EXPECTED_CONSTANT_NAME> with the actual public constant names defined in those modules) so renaming the module won’t mask a renamed/removed event; keep test_all_domain_modules_discovered as-is and add this new test next to it.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 3022-3036: The DESIGN_SPEC.md tree is out of sync with the
implementation: update the spec to reflect the actual module path
ai_company.tools.sandbox.sandboxing_config (instead of
tools/sandboxing_config.py at the root) and reconcile the tooling name mismatch
by using the exact class name used in code (change MCPBridgeTool → McpBridgeTool
in the spec or update the code to MCPBridgeTool consistently); ensure you
mention this divergence to the reviewer/user per the guideline (alert user and
explain the change) so they can decide whether to update implementation or spec.
In `@README.md`:
- Line 18: Update the README to document Docker as an explicit runtime
prerequisite for the "Tool System" sandbox and code execution features: add
Docker (and Docker Compose if applicable) to the prerequisites list, provide
concise install links/commands for common platforms, and include brief
instructions to build or pull the sandbox image used by the sandbox abstraction
(mention the sandbox image name or build command and the command to run it) so
users won’t encounter Docker-unavailable errors when using the sandbox, git, and
code runner functionality.
In `@src/ai_company/tools/code_runner.py`:
- Around line 77-151: The execute method currently lets sandbox startup
exceptions from self._sandbox.execute bubble up and exceeds the desired length;
wrap the call to self._sandbox.execute in a try/except that catches generic
backend/startup errors (e.g., Exception) and on exception log
CODE_RUNNER_EXECUTE_FAILED with context (language) and return a
ToolExecutionResult with is_error=True, an informative content string including
the caught error, and metadata (language and a returncode or marker). Also
refactor the success/failure mapping into small helpers (e.g.,
_map_success_result(result) and _map_failure_result(result) or a single
_map_execution_result(result, language)) and call them from execute so execute
stays under 50 lines while preserving existing logging and metadata keys.
In `@src/ai_company/tools/mcp/bridge_tool.py`:
- Around line 65-133: The execute method is doing cache lookup, remote
invocation, error handling, result mapping, and cache writeback all inline;
split it into small private helpers to meet the <50-line rule. Create helpers
like _get_cached_result(self, arguments) to perform the cache.get and return
cached or None (and emit MCP_CACHE_HIT), _invoke_remote_tool(self, arguments) to
call self._client.call_tool and translate MCPError into a ToolExecutionResult
(or rethrow/return an error sentinel), and _write_cache(self, arguments, result)
to perform self._cache.put with the TypeError handling/logging (MCP_CACHE_MISS).
Refactor execute to orchestrate: call _get_cached_result, return if present, log
MCP_INVOKE_START, call _invoke_remote_tool, map_call_tool_result on the raw
response if needed, call _write_cache, and return the final ToolExecutionResult;
preserve all existing log messages, MCP constants, and behavior of
map_call_tool_result and error handling.
- Around line 81-125: Wrap the cache read around a try/except to treat
unhashable-arguments TypeError from self._cache.get(...) as a cache miss (i.e.,
mirror the existing fallback used on writes), and when storing results with
self._cache.put(...) skip caching if the mapped result (from
map_call_tool_result(raw)) indicates an error (result.is_error is True) to avoid
negative caching of transient MCP failures; locate these changes around the call
flow involving self._cache.get, self._client.call_tool, map_call_tool_result,
and self._cache.put inside the BridgeTool invoke logic so reads gracefully fall
back on TypeError and only successful results are cached.
In `@src/ai_company/tools/mcp/cache.py`:
- Around line 46-81: The cache currently stores and returns the original
ToolExecutionResult instances which can contain mutable metadata; change the
behavior in the Cache class so that when inserting results (where results are
stored via the same cache logic as in the get/set flow) you store a
copy.deepcopy(result) and in the get(self, ...) method return
copy.deepcopy(result) instead of the original object; update references in the
get method (and the corresponding set/put method that uses _make_key and _cache)
to perform deep copies of ToolExecutionResult (import copy.deepcopy) so callers
always receive independent snapshots and the internal cache holds immutable
copies.
In `@src/ai_company/tools/mcp/client.py`:
- Around line 78-99: Multiple methods access and mutate the shared
_session/_exit_stack without synchronization, risking concurrent
connect/disconnect/calls; wrap all accesses and mutations of _session and
_exit_stack with the existing _lock (await self._lock) — specifically modify
connect(), disconnect(), list_tools(), and call_tool() so that session creation
(assigning self._session/self._exit_stack), session teardown (closing and
clearing them), and any use of the session are performed while holding the lock;
ensure connect() acquires the lock before checking/setting _session,
disconnect() acquires it before closing/clearing _session/_exit_stack,
list_tools() acquires it while using the session, and call_tool() continues to
hold the lock for the full lifetime of the session call to prevent interleaving.
- Around line 210-215: The shallow copy using dict(t.inputSchema) leaves nested
structures shared with SDK objects; replace these with deep copies at the system
boundary—use copy.deepcopy(...) when constructing MCPToolInfo (the input_schema
argument) and when assigning structuredContent into MCPRawResult so that nested
dicts/lists are fully copied and immutable once wrapped by the frozen Pydantic
models (MCPToolInfo, MCPRawResult); update the creation sites (the MCPToolInfo
construction and the result.structuredContent assignment locations referenced in
the diff) to call copy.deepcopy on SDK-owned dict/list fields.
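The locking discipline suggested for the client above (hold one `asyncio.Lock` across session creation, teardown, and every use of the session) can be sketched with a toy client. All names here are hypothetical; this is not the project's `MCPClient`.

```python
import asyncio

class LockedClient:
    """Toy client: one asyncio.Lock guards all access to the shared session."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._session: str | None = None

    async def connect(self) -> None:
        async with self._lock:  # no interleaving with disconnect/calls
            if self._session is None:
                self._session = "session"

    async def disconnect(self) -> None:
        async with self._lock:
            self._session = None

    async def call_tool(self, name: str) -> str:
        async with self._lock:  # held for the full lifetime of the call
            if self._session is None:
                raise RuntimeError("not connected")
            await asyncio.sleep(0)  # simulated I/O while holding the lock
            return f"{name}: ok"

async def main() -> list[str]:
    client = LockedClient()
    await client.connect()
    # Concurrent calls are serialized by the lock, so none can observe a
    # half-torn-down session.
    results = await asyncio.gather(*(client.call_tool(f"t{i}") for i in range(3)))
    await client.disconnect()
    return results

results = asyncio.run(main())
assert results == ["t0: ok", "t1: ok", "t2: ok"]
```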
In `@src/ai_company/tools/mcp/factory.py`:
- Around line 139-148: In the disconnect loop inside the MCP factory (in
factory.py where the method iterates self._clients and calls await
client.disconnect()), replace the MCP_FACTORY_SERVER_SKIPPED event used in the
exception handler with the appropriate failure event constant (e.g.,
MCP_FACTORY_SERVER_FAILURE or a dedicated MCP_FACTORY_SERVER_DISCONNECT_FAILED)
so the disconnect exception is logged/reported as an operational failure; update
the logger.warning call to use that failure event and keep the same context keys
(server and reason).
- Around line 154-169: In _connect_and_discover, ensure the MCPClient created by
MCPClient(config) is disconnected if discovery fails: after await
client.connect(), wrap the await client.list_tools() call in try/except (or
try/finally) and call await client.disconnect() before re-raising the exception
so the connected client isn’t leaked from _connect_all()'s cleanup; keep
returned value as (client, tools) on success and re-raise the original error
after disconnect on failure.
- Around line 37-39: The create_tools method currently allows reuse of the
factory and appends a new client set to self._clients without tearing down
previous connections; modify create_tools to guard against reuse by either (a)
disconnecting/closing all existing MCPClient instances in self._clients (calling
their disconnect/close method) and clearing self._clients before creating new
clients, or (b) raising an error if create_tools is called when self._clients is
non-empty; additionally refactor create_tools by extracting server filtering
into a helper (e.g., _filter_servers) and bridge construction into a helper
(e.g., _build_bridge_tools) so the method stays under 50 lines while ensuring
previous connections are cleaned up.
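The connect-then-cleanup-on-failure pattern asked for in the `_connect_and_discover` prompt above can be sketched like this. `FakeClient` and `connect_and_discover` are illustrative stand-ins, not the project's code; the point is the `try/except` that disconnects before re-raising.

```python
import asyncio

class FakeClient:
    """Stub MCP client: connect/disconnect track state, list_tools may fail."""

    def __init__(self, fail_discovery: bool) -> None:
        self.connected = False
        self.fail_discovery = fail_discovery

    async def connect(self) -> None:
        self.connected = True

    async def disconnect(self) -> None:
        self.connected = False

    async def list_tools(self) -> list[str]:
        if self.fail_discovery:
            raise RuntimeError("discovery failed")
        return ["tool_a"]

async def connect_and_discover(client: FakeClient) -> tuple[FakeClient, list[str]]:
    await client.connect()
    try:
        tools = await client.list_tools()
    except Exception:
        # Tear down the connection before re-raising so a failed discovery
        # never leaks a connected client past this function.
        await client.disconnect()
        raise
    return client, tools

bad = FakeClient(fail_discovery=True)
try:
    asyncio.run(connect_and_discover(bad))
except RuntimeError:
    pass
assert bad.connected is False  # failed discovery left no live connection

good = FakeClient(fail_discovery=False)
_, tools = asyncio.run(connect_and_discover(good))
assert tools == ["tool_a"] and good.connected is True
```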
In `@src/ai_company/tools/sandbox/docker_sandbox.py`:
- Around line 213-225: The DockerSandbox backend currently ignores
DockerSandboxConfig.allowed_hosts and forces container removal regardless of
DockerSandboxConfig.auto_remove; update the code that builds host_config (the
block that sets "Binds", "Memory", "NanoCpus", "NetworkMode", "AutoRemove",
etc.) to use self._config.auto_remove instead of the hard-coded False and, if
self._config.allowed_hosts is set, populate the container host configuration
appropriately (e.g., map allowed_hosts into Docker's ExtraHosts or the network
settings) so the config is honored; also change the finally cleanup where
container.remove is always called to only remove the container when
self._config.auto_remove is true (and ensure related occurrences around the
other similar blocks noted are updated similarly), referencing _config,
host_config, allowed_hosts, auto_remove, and the container.remove call to locate
the changes.
- Around line 491-500: The code is trimming captured process output with
.strip(), which can corrupt whitespace-sensitive results; change the assignment
of stdout and stderr to preserve exact emitted output by using stdout =
"".join(stdout_logs) and stderr = "".join(stderr_logs) (leave the
container_obj.log calls and variables stdout_logs/stderr_logs as-is), and ensure
any downstream uses/returns of stdout and stderr rely on these unmodified
strings.
- Around line 555-573: The cleanup() path stops tracked containers but never
removes them, leaving stopped containers orphaned; update cleanup() (which
references self._tracked_containers, self._docker and calls
self._stop_container) to also call self._remove_container(self._docker, cid) for
each tracked container after a successful stop (or attempt removal regardless,
handling/logging failures per-container), ensure removals run before clearing
_tracked_containers and before tearing down/closing self._docker, and preserve
the existing try/except/finally semantics so self._docker is set to None and
_tracked_containers is cleared only after removal attempts and proper error
logging.
In `@src/ai_company/tools/sandbox/sandboxing_config.py`:
- Line 27: The overrides dict currently uses dict[str, str] allowing
empty/whitespace keys; change its annotation to dict[NotBlankStr, str] so
pydantic will reject blank category names: update the type on the overrides
Field (overrides: dict[NotBlankStr, str] = Field(...)) and add the NotBlankStr
import where NotBlankStr is defined/available, then run/update any affected
imports/usages of overrides to match the stronger key type.
In `@tests/integration/tools/test_docker_sandbox_integration.py`:
- Around line 51-70: The test uses a non-existent Docker image
"python:3.14-slim" causing CI failures; update the DockerSandboxConfig image to
a real published tag (e.g. "python:3.11-slim" or "python:3.12-slim") in the test
function test_run_python_code and in the other test block referenced (the second
DockerSandboxConfig instance around lines 71-90) so DockerSandbox receives a
valid image name; keep the image string change only, leaving
DockerSandboxConfig, DockerSandbox, test_run_python_code, and the execute(...)
call unchanged.
- Around line 21-38: The _docker_available() helper leaks the aiodocker.Docker
client when exceptions occur after creation; change the inner async _check() to
ensure the client is always closed by creating the client first, then using a
try/finally (or try/except/finally) around await client.version() and any
subsequent operations so client.close() is awaited in the finally block; update
references to client = aiodocker.Docker() and the existing await client.close()
placement accordingly to guarantee cleanup on all paths.
In `@tests/integration/tools/test_mcp_integration.py`:
- Around line 17-31: The helper _make_connected_client mutates private state by
setting client._session directly, bypassing MCPClient.connect() and disconnect()
and thus skipping new timeout and AsyncExitStack cleanup logic; change the test
to drive the public lifecycle: provide a fake transport/session factory or patch
the factory function used by MCPClient to return a mock session (with AsyncMock
list_tools and call_tool behaviors) and then call client.connect() and
client.disconnect() (or use async context manager) so the real
connection/cleanup paths in MCPClient.connect and MCPClient.disconnect are
exercised instead of assigning client._session directly.
In `@tests/unit/tools/mcp/conftest.py`:
- Around line 148-164: The mock_client pytest fixture declares an unused
parameter sample_tool_info; remove sample_tool_info from the mock_client fixture
signature (def mock_client(stdio_server_config: MCPServerConfig) -> MCPClient:)
and update any callers if necessary so the fixture no longer requires
sample_tool_info, keeping the rest of the setup that assigns client._session
using AsyncMock list_tools/call_tool; alternatively, if the fixture must depend
on sample_tool_info for implicit setup, reference it in the body (e.g., a no-op
access) or rename it to _sample_tool_info to silence the unused-argument
warning.
In `@tests/unit/tools/mcp/test_bridge_tool.py`:
- Around line 83-101: Update the test_empty_input_schema_yields_none_parameters
to use the existing mock_client fixture instead of constructing MCPClient and
setting _session manually: change the test signature to accept mock_client, use
that mock_client (optionally ensure mock_client._session is an AsyncMock if
needed), instantiate MCPBridgeTool with tool_info and client=mock_client, and
remove the manual MCPClient creation and _session assignment so the test matches
other tests using the shared mock_client fixture.
In `@tests/unit/tools/mcp/test_result_mapper.py`:
- Around line 18-201: Add a new unit test in TestMixedContent that constructs an
MCPRawResult whose content includes a fake/unexpected block object with a type
value like "mystery", call map_call_tool_result on it, and assert the mapped
content contains the placeholder "[unknown: mystery]" (exact placeholder
produced by the fallback branch) and that result.metadata does not contain any
leaked "attachments" entry (or that "attachments" is absent or empty). Reference
map_call_tool_result and MCPRawResult to locate where to build the input and
check result.metadata["attachments"] to ensure no attachments are produced.
---
Outside diff comments:
In `@DESIGN_SPEC.md`:
- Around line 2123-2134: The docker example in DESIGN_SPEC.md still shows
mount_mode: "rw" which contradicts the hardened default; update the docker
example block so mount_mode is the read-only hardened default (replace "rw" with
"ro") and, if the spec wants to document the common exception for the workspace,
add a brief note next to mount_mode or in the docker section describing that
only the workspace is mounted read-write via a separate workspace mount setting
(referencing the docker keys mount_mode and allowed_hosts to keep context).
In `@tests/unit/observability/test_events.py`:
- Around line 175-215: Test test_all_domain_modules_discovered only checks
module discovery; add a new parametrized test (use `@pytest.mark.parametrize`)
that iterates over the new domain modules code_runner, docker, and mcp and
asserts each module exposes its expected public event constant(s). Specifically,
import events.code_runner, events.docker, events.mcp inside the test and for
each module assert hasattr(module, "<EXPECTED_CONSTANT_NAME>") (replace
<EXPECTED_CONSTANT_NAME> with the actual public constant names defined in those
modules) so renaming the module won’t mask a renamed/removed event; keep
test_all_domain_modules_discovered as-is and add this new test next to it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d0fea601-118a-435b-9f14-c8a7c70d21cb
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (46)
- CLAUDE.md
- DESIGN_SPEC.md
- README.md
- docker/sandbox/Dockerfile
- pyproject.toml
- src/ai_company/config/defaults.py
- src/ai_company/config/schema.py
- src/ai_company/observability/events/code_runner.py
- src/ai_company/observability/events/docker.py
- src/ai_company/observability/events/mcp.py
- src/ai_company/tools/__init__.py
- src/ai_company/tools/code_runner.py
- src/ai_company/tools/mcp/__init__.py
- src/ai_company/tools/mcp/bridge_tool.py
- src/ai_company/tools/mcp/cache.py
- src/ai_company/tools/mcp/client.py
- src/ai_company/tools/mcp/config.py
- src/ai_company/tools/mcp/errors.py
- src/ai_company/tools/mcp/factory.py
- src/ai_company/tools/mcp/models.py
- src/ai_company/tools/mcp/result_mapper.py
- src/ai_company/tools/sandbox/__init__.py
- src/ai_company/tools/sandbox/docker_config.py
- src/ai_company/tools/sandbox/docker_sandbox.py
- src/ai_company/tools/sandbox/errors.py
- src/ai_company/tools/sandbox/protocol.py
- src/ai_company/tools/sandbox/sandboxing_config.py
- tests/integration/tools/test_docker_sandbox_integration.py
- tests/integration/tools/test_mcp_integration.py
- tests/unit/api/test_bus_bridge.py
- tests/unit/config/conftest.py
- tests/unit/observability/test_events.py
- tests/unit/tools/mcp/__init__.py
- tests/unit/tools/mcp/conftest.py
- tests/unit/tools/mcp/test_bridge_tool.py
- tests/unit/tools/mcp/test_cache.py
- tests/unit/tools/mcp/test_client.py
- tests/unit/tools/mcp/test_config.py
- tests/unit/tools/mcp/test_errors.py
- tests/unit/tools/mcp/test_factory.py
- tests/unit/tools/mcp/test_result_mapper.py
- tests/unit/tools/sandbox/test_docker_config.py
- tests/unit/tools/sandbox/test_docker_sandbox.py
- tests/unit/tools/sandbox/test_protocol.py
- tests/unit/tools/sandbox/test_sandboxing_config.py
- tests/unit/tools/test_code_runner.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Agent
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (5)
**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality
Files:
README.md, DESIGN_SPEC.md, CLAUDE.md
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` — Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) — ruff enforces this on Python 3.14
All public functions must have type hints; mypy strict mode is enforced
Public classes and functions must have Google-style docstrings (enforced by ruff D rules)
Every module with business logic must include logging setup: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` or `print()` in application code. Variable name must always be `logger`
Use structured logging with event name constants: `logger.info(EVENT_CONSTANT, key=value)` — never use string interpolation like `logger.info('msg %s', val)`. Import event constants directly from `ai_company.observability.events.<domain>`
All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Line length maximum is 88 characters (enforced by ruff)
Functions must be < 50 lines; files must be < 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries
Files:
src/ai_company/tools/mcp/errors.py, tests/unit/tools/mcp/conftest.py, tests/unit/tools/mcp/test_factory.py, src/ai_company/tools/sandbox/__init__.py, src/ai_company/tools/mcp/models.py, tests/integration/tools/test_mcp_integration.py, tests/unit/tools/sandbox/test_docker_config.py, src/ai_company/tools/__init__.py, tests/unit/api/test_bus_bridge.py, src/ai_company/tools/code_runner.py, tests/unit/config/conftest.py, src/ai_company/observability/events/code_runner.py, src/ai_company/tools/sandbox/sandboxing_config.py, tests/unit/tools/test_code_runner.py, src/ai_company/tools/mcp/bridge_tool.py, src/ai_company/config/defaults.py, tests/unit/observability/test_events.py, src/ai_company/tools/mcp/cache.py, tests/unit/tools/mcp/test_cache.py, tests/unit/tools/mcp/test_result_mapper.py, src/ai_company/config/schema.py, tests/unit/tools/mcp/test_config.py, tests/unit/tools/mcp/test_bridge_tool.py, tests/unit/tools/mcp/test_errors.py, tests/integration/tools/test_docker_sandbox_integration.py, src/ai_company/tools/mcp/result_mapper.py, src/ai_company/tools/mcp/factory.py, src/ai_company/observability/events/docker.py, src/ai_company/tools/mcp/config.py, tests/unit/tools/mcp/test_client.py, src/ai_company/tools/sandbox/protocol.py, tests/unit/tools/sandbox/test_sandboxing_config.py, src/ai_company/tools/sandbox/docker_config.py, tests/unit/tools/sandbox/test_docker_sandbox.py, src/ai_company/tools/mcp/__init__.py, src/ai_company/tools/mcp/client.py, src/ai_company/observability/events/mcp.py, src/ai_company/tools/sandbox/errors.py, src/ai_company/tools/sandbox/docker_sandbox.py, tests/unit/tools/sandbox/test_protocol.py, tests/unit/tools/mcp/__init__.py
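As a minimal illustration of the structured-logging rule above (this sketch substitutes a toy logger for the project's `get_logger`; the event constant and field values are invented):

```python
# Event name constants live in one module per domain, e.g. events/mcp.py.
MCP_CALL_START = "mcp_call_start"


class _SketchLogger:
    """Toy stand-in for a structlog-style bound logger."""

    def __init__(self) -> None:
        self.records: list[tuple[str, dict[str, object]]] = []

    def info(self, event: str, **kw: object) -> None:
        # Event constant plus structured key-values, never interpolated strings.
        self.records.append((event, kw))


logger = _SketchLogger()
logger.info(MCP_CALL_START, server="test-server", tool="echo")
```

The point of the convention is that log consumers can filter on the stable event constant while the key-value pairs stay machine-parseable.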
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/ai_company/**/*.py: Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model
Use `@computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`); use `NotBlankStr` from `core.types` for all identifier/name fields—including optional (`NotBlankStr | None`) and tuple variants—instead of manual whitespace validators
Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction + `MappingProxyType` for read-only enforcement. For `dict`/`list` fields in frozen Pydantic models, rely on `frozen=True` and `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare `create_task`
Files:
src/ai_company/tools/mcp/errors.py, src/ai_company/tools/sandbox/__init__.py, src/ai_company/tools/mcp/models.py, src/ai_company/tools/__init__.py, src/ai_company/tools/code_runner.py, src/ai_company/observability/events/code_runner.py, src/ai_company/tools/sandbox/sandboxing_config.py, src/ai_company/tools/mcp/bridge_tool.py, src/ai_company/config/defaults.py, src/ai_company/tools/mcp/cache.py, src/ai_company/config/schema.py, src/ai_company/tools/mcp/result_mapper.py, src/ai_company/tools/mcp/factory.py, src/ai_company/observability/events/docker.py, src/ai_company/tools/mcp/config.py, src/ai_company/tools/sandbox/protocol.py, src/ai_company/tools/sandbox/docker_config.py, src/ai_company/tools/mcp/__init__.py, src/ai_company/tools/mcp/client.py, src/ai_company/observability/events/mcp.py, src/ai_company/tools/sandbox/errors.py, src/ai_company/tools/sandbox/docker_sandbox.py
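A minimal sketch of the frozen-config / copy-on-update split described above, using stdlib dataclasses as a stand-in for Pydantic v2 frozen models (class and field names are invented):

```python
from dataclasses import dataclass, replace


# Stand-in for a frozen config/identity model: never mutated after creation.
@dataclass(frozen=True)
class ServerConfig:
    name: str
    timeout_s: float


# Separate runtime-state model that evolves only via copy, never in place.
@dataclass(frozen=True)
class ServerState:
    connected: bool = False
    calls: int = 0

    def with_call(self) -> "ServerState":
        # Analogous to Pydantic's model_copy(update={"calls": ...}).
        return replace(self, calls=self.calls + 1)


config = ServerConfig(name="example-server", timeout_s=5.0)
state = ServerState().with_call().with_call()
```

Keeping the two models separate means config objects can be shared freely across tasks, while each state transition produces a fresh, independently inspectable snapshot.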
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, `@pytest.mark.slow`. Minimum coverage 80% (enforced in CI). Async mode: `asyncio_mode = 'auto'` — no manual `@pytest.mark.asyncio` needed. Per-test timeout: 30 seconds. Use `pytest-xdist` via `-n auto` for parallelism. Prefer `@pytest.mark.parametrize` for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Tests must use `test-provider`, `test-small-001`, etc. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) `.claude/` files, (3) third-party import paths
Files:
tests/unit/tools/mcp/conftest.py, tests/unit/tools/mcp/test_factory.py, tests/integration/tools/test_mcp_integration.py, tests/unit/tools/sandbox/test_docker_config.py, tests/unit/api/test_bus_bridge.py, tests/unit/config/conftest.py, tests/unit/tools/test_code_runner.py, tests/unit/observability/test_events.py, tests/unit/tools/mcp/test_cache.py, tests/unit/tools/mcp/test_result_mapper.py, tests/unit/tools/mcp/test_config.py, tests/unit/tools/mcp/test_bridge_tool.py, tests/unit/tools/mcp/test_errors.py, tests/integration/tools/test_docker_sandbox_integration.py, tests/unit/tools/mcp/test_client.py, tests/unit/tools/sandbox/test_sandboxing_config.py, tests/unit/tools/sandbox/test_docker_sandbox.py, tests/unit/tools/sandbox/test_protocol.py, tests/unit/tools/mcp/__init__.py
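A hedged sketch of the marker + `parametrize` convention above (the test name, mapping, and values are invented, not the project's real tests):

```python
import pytest


@pytest.mark.unit
@pytest.mark.parametrize(
    ("language", "expected_runner"),
    [
        ("python", "python3"),
        ("javascript", "node"),
        ("bash", "bash"),
    ],
)
def test_runner_binary_for_language(language: str, expected_runner: str) -> None:
    # Hypothetical mapping standing in for the code runner's language table.
    runners = {"python": "python3", "javascript": "node", "bash": "bash"}
    assert runners[language] == expected_runner
```

Each parametrize case becomes its own test node, so `-n auto` can distribute them across xdist workers.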
pyproject.toml
📄 CodeRabbit inference engine (CLAUDE.md)
pyproject.toml: All dependency versions must be pinned with `==` in pyproject.toml
Organize dependencies into groups: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen). Install with `uv sync` (installs all groups including dev by default)
Files:
pyproject.toml
🧠 Learnings (7)
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.md : DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality
Applied to files:
README.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model
Applied to files:
src/ai_company/tools/mcp/models.py, src/ai_company/config/schema.py, src/ai_company/tools/mcp/config.py, src/ai_company/tools/sandbox/docker_config.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction + `MappingProxyType` for read-only enforcement. For `dict`/`list` fields in frozen Pydantic models, rely on `frozen=True` and `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Applied to files:
src/ai_company/tools/mcp/models.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to tests/**/*.py : Use pytest markers: `pytest.mark.unit`, `pytest.mark.integration`, `pytest.mark.e2e`, `pytest.mark.slow`. Minimum coverage 80% (enforced in CI). Async mode: `asyncio_mode = 'auto'` — no manual `pytest.mark.asyncio` needed. Per-test timeout: 30 seconds. Use `pytest-xdist` via `-n auto` for parallelism. Prefer `pytest.mark.parametrize` for testing similar cases
Applied to files:
tests/integration/tools/test_mcp_integration.py, tests/unit/tools/test_code_runner.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Every module with business logic must include logging setup: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` or `print()` in application code. Variable name must always be `logger`
Applied to files:
src/ai_company/config/schema.py, CLAUDE.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Use structured logging with event name constants: `logger.info(EVENT_CONSTANT, key=value)` — never use string interpolation like `logger.info('msg %s', val)`. Import event constants directly from `ai_company.observability.events.<domain>`
Applied to files:
CLAUDE.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Applied to files:
CLAUDE.md
🧬 Code graph analysis (28)
src/ai_company/tools/mcp/errors.py (1)
  src/ai_company/tools/errors.py (1): ToolError (12-44)
tests/unit/tools/mcp/conftest.py (4)
  src/ai_company/tools/mcp/cache.py (1): MCPResultCache (22-148)
  src/ai_company/tools/mcp/client.py (5): MCPClient (46-454), config (64-66), server_name (74-76), list_tools (179-227), call_tool (229-303)
  src/ai_company/tools/mcp/config.py (2): MCPConfig (150-176), MCPServerConfig (22-147)
  src/ai_company/tools/mcp/models.py (2): MCPRawResult (40-62), MCPToolInfo (14-37)
tests/unit/tools/mcp/test_factory.py (4)
  src/ai_company/tools/mcp/bridge_tool.py (1): MCPBridgeTool (30-133)
  src/ai_company/tools/mcp/config.py (2): MCPConfig (150-176), MCPServerConfig (22-147)
  src/ai_company/tools/mcp/factory.py (4): MCPToolFactory (27-189), create_tools (41-98), shutdown (137-150), _make_cache (172-189)
  src/ai_company/tools/mcp/models.py (1): MCPToolInfo (14-37)
src/ai_company/tools/sandbox/__init__.py (3)
  src/ai_company/tools/sandbox/docker_sandbox.py (2): config (112-114), DockerSandbox (70-600)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
  src/ai_company/tools/sandbox/sandboxing_config.py (1): SandboxingConfig (14-57)
src/ai_company/tools/mcp/models.py (2)
  src/ai_company/tools/base.py (1): description (115-117)
  src/ai_company/tools/mcp/client.py (1): server_name (74-76)
tests/integration/tools/test_mcp_integration.py (5)
  tests/unit/tools/mcp/test_bridge_tool.py (1): bridge_tool (22-30)
  src/ai_company/tools/mcp/bridge_tool.py (3): MCPBridgeTool (30-133), tool_info (61-63), execute (65-133)
  src/ai_company/tools/mcp/cache.py (1): MCPResultCache (22-148)
  src/ai_company/tools/mcp/client.py (4): MCPClient (46-454), config (64-66), list_tools (179-227), call_tool (229-303)
  src/ai_company/tools/mcp/config.py (1): MCPServerConfig (22-147)
tests/unit/tools/sandbox/test_docker_config.py (2)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
  src/ai_company/tools/sandbox/docker_sandbox.py (1): config (112-114)
tests/unit/api/test_bus_bridge.py (1)
  src/ai_company/api/bus_bridge.py (1): _poll_channel (112-159)
src/ai_company/tools/code_runner.py (5)
  src/ai_company/core/enums.py (1): ToolCategory (293-307)
  src/ai_company/observability/_logger.py (1): get_logger (8-28)
  src/ai_company/tools/base.py (2): BaseTool (56-161), ToolExecutionResult (24-53)
  src/ai_company/tools/sandbox/protocol.py (1): execute (25-51)
  tests/unit/tools/sandbox/test_protocol.py (1): execute (20-33)
src/ai_company/tools/sandbox/sandboxing_config.py (2)
  src/ai_company/tools/sandbox/config.py (1): SubprocessSandboxConfig (8-83)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
tests/unit/tools/test_code_runner.py (6)
  src/ai_company/core/enums.py (1): ToolCategory (293-307)
  src/ai_company/tools/code_runner.py (2): CodeRunnerTool (53-151), execute (77-151)
  src/ai_company/tools/sandbox/result.py (1): SandboxResult (6-28)
  src/ai_company/tools/sandbox/docker_sandbox.py (1): execute (265-316)
  src/ai_company/tools/sandbox/protocol.py (1): execute (25-51)
  src/ai_company/tools/base.py (2): category (110-112), parameters_schema (120-128)
src/ai_company/tools/mcp/bridge_tool.py (5)
  src/ai_company/core/enums.py (1): ToolCategory (293-307)
  src/ai_company/tools/base.py (3): BaseTool (56-161), ToolExecutionResult (24-53), description (115-117)
  src/ai_company/tools/mcp/errors.py (1): MCPError (10-11)
  src/ai_company/tools/mcp/result_mapper.py (1): map_call_tool_result (29-108)
  src/ai_company/tools/mcp/models.py (1): MCPToolInfo (14-37)
src/ai_company/tools/mcp/cache.py (2)
  src/ai_company/observability/_logger.py (1): get_logger (8-28)
  src/ai_company/tools/base.py (1): ToolExecutionResult (24-53)
tests/unit/tools/mcp/test_result_mapper.py (2)
  src/ai_company/tools/mcp/models.py (1): MCPRawResult (40-62)
  src/ai_company/tools/mcp/result_mapper.py (1): map_call_tool_result (29-108)
src/ai_company/config/schema.py (2)
  src/ai_company/tools/mcp/config.py (1): MCPConfig (150-176)
  src/ai_company/tools/sandbox/sandboxing_config.py (1): SandboxingConfig (14-57)
tests/unit/tools/mcp/test_config.py (1)
  src/ai_company/tools/mcp/config.py (2): MCPConfig (150-176), MCPServerConfig (22-147)
tests/unit/tools/mcp/test_bridge_tool.py (5)
  src/ai_company/core/enums.py (1): ToolCategory (293-307)
  src/ai_company/tools/mcp/client.py (4): MCPClient (46-454), config (64-66), server_name (74-76), call_tool (229-303)
  src/ai_company/tools/mcp/config.py (1): MCPServerConfig (22-147)
  src/ai_company/tools/mcp/models.py (1): MCPToolInfo (14-37)
  src/ai_company/tools/mcp/cache.py (1): MCPResultCache (22-148)
tests/integration/tools/test_docker_sandbox_integration.py (2)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
  src/ai_company/tools/sandbox/docker_sandbox.py (6): DockerSandbox (70-600), config (112-114), workspace (117-119), execute (265-316), cleanup (555-573), health_check (575-596)
src/ai_company/tools/mcp/result_mapper.py (3)
  src/ai_company/observability/_logger.py (1): get_logger (8-28)
  src/ai_company/tools/base.py (1): ToolExecutionResult (24-53)
  src/ai_company/tools/mcp/models.py (1): MCPRawResult (40-62)
src/ai_company/tools/mcp/factory.py (5)
  src/ai_company/tools/mcp/bridge_tool.py (2): MCPBridgeTool (30-133), tool_info (61-63)
  src/ai_company/tools/mcp/cache.py (1): MCPResultCache (22-148)
  src/ai_company/tools/mcp/client.py (6): MCPClient (46-454), config (64-66), disconnect (159-177), server_name (74-76), connect (78-138), list_tools (179-227)
  src/ai_company/tools/mcp/config.py (2): MCPConfig (150-176), MCPServerConfig (22-147)
  src/ai_company/tools/mcp/models.py (1): MCPToolInfo (14-37)
src/ai_company/tools/mcp/config.py (2)
  src/ai_company/observability/_logger.py (1): get_logger (8-28)
  src/ai_company/tools/base.py (1): description (115-117)
tests/unit/tools/mcp/test_client.py (4)
  src/ai_company/tools/mcp/client.py (9): MCPClient (46-454), config (64-66), is_connected (69-71), server_name (74-76), connect (78-138), disconnect (159-177), list_tools (179-227), call_tool (229-303), reconnect (305-316)
  src/ai_company/tools/mcp/config.py (1): MCPServerConfig (22-147)
  src/ai_company/tools/mcp/errors.py (4): MCPConnectionError (14-15), MCPDiscoveryError (22-23), MCPInvocationError (26-27), MCPTimeoutError (18-19)
  src/ai_company/tools/mcp/models.py (1): MCPToolInfo (14-37)
tests/unit/tools/sandbox/test_sandboxing_config.py (3)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
  src/ai_company/tools/sandbox/sandboxing_config.py (2): SandboxingConfig (14-57), backend_for_category (47-57)
  src/ai_company/tools/sandbox/docker_sandbox.py (1): config (112-114)
tests/unit/tools/sandbox/test_docker_sandbox.py (3)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
  src/ai_company/tools/sandbox/docker_sandbox.py (13): DockerSandbox (70-600), _to_posix_bind_path (48-67), workspace (117-119), config (112-114), _validate_cwd (146-165), execute (265-316), _build_container_config (181-234), cleanup (555-573), health_check (575-596), get_backend_type (598-600), _parse_memory_limit (237-263), _stop_container (504-528), _remove_container (531-553)
  src/ai_company/tools/sandbox/errors.py (2): SandboxError (10-11), SandboxStartError (23-24)
src/ai_company/tools/mcp/__init__.py (3)
  src/ai_company/tools/mcp/client.py (1): config (64-66)
  src/ai_company/tools/mcp/config.py (2): MCPConfig (150-176), MCPServerConfig (22-147)
  src/ai_company/tools/mcp/errors.py (4): MCPConnectionError (14-15), MCPDiscoveryError (22-23), MCPError (10-11), MCPInvocationError (26-27)
src/ai_company/tools/mcp/client.py (3)
  src/ai_company/tools/mcp/errors.py (4): MCPConnectionError (14-15), MCPDiscoveryError (22-23), MCPInvocationError (26-27), MCPTimeoutError (18-19)
  src/ai_company/tools/mcp/models.py (2): MCPRawResult (40-62), MCPToolInfo (14-37)
  src/ai_company/tools/mcp/config.py (1): MCPServerConfig (22-147)
src/ai_company/tools/sandbox/docker_sandbox.py (3)
  src/ai_company/tools/sandbox/docker_config.py (1): DockerSandboxConfig (12-75)
  src/ai_company/tools/sandbox/errors.py (2): SandboxError (10-11), SandboxStartError (23-24)
  src/ai_company/tools/sandbox/result.py (1): SandboxResult (6-28)
tests/unit/tools/sandbox/test_protocol.py (3)
  src/ai_company/tools/sandbox/docker_sandbox.py (2): DockerSandbox (70-600), workspace (117-119)
  src/ai_company/tools/sandbox/subprocess_sandbox.py (1): workspace (114-116)
  src/ai_company/tools/sandbox/protocol.py (1): SandboxBackend (17-75)
🪛 Checkov (3.2.334)
docker/sandbox/Dockerfile
[low] 1-19: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
🪛 GitHub Actions: CI
tests/integration/tools/test_docker_sandbox_integration.py
[error] 1-1: Sandbox failed to start container: No such image: python:3.14-slim (test_run_python_code)
[error] 1-1: Sandbox failed to start container: No such image: python:3.14-slim (test_run_with_timeout)
src/ai_company/tools/sandbox/docker_sandbox.py
[error] 309-360: Sandbox failed to start: Docker container creation failed with 404 No such image: python:3.14-slim. This caused tests 'test_run_python_code' to fail during container startup.
[error] 309-360: Sandbox failed to start: Docker container creation failed with 404 No such image: python:3.14-slim. This caused tests 'test_run_with_timeout' to fail during container startup.
🪛 Hadolint (2.14.0)
docker/sandbox/Dockerfile
[warning] 9-9: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
🪛 LanguageTool
README.md
[typographical] ~18-~18: To join two clauses or introduce examples, consider using an em dash.
Context: ...ation analytics models - Tool System - File system tools, git tools, sandbox ab...
(DASH_RULE)
CLAUDE.md
[style] ~86-~86: A comma is missing here.
Context: ...nder ai_company.observability.events (e.g. PROVIDER_CALL_START from `events.prov...
(EG_NO_COMMA)
🪛 Trivy (0.69.3)
docker/sandbox/Dockerfile
[info] 1-1: No HEALTHCHECK defined
Add HEALTHCHECK instruction in your Dockerfile
Rule: DS-0026
(IaC/Dockerfile)
```python
def _make_connected_client(
    config: MCPServerConfig,
    tools: list[MagicMock],
    call_result: MagicMock,
) -> MCPClient:
    """Create a mock-connected MCPClient."""
    client = MCPClient(config)
    session = AsyncMock()

    list_result = MagicMock()
    list_result.tools = tools
    session.list_tools = AsyncMock(return_value=list_result)
    session.call_tool = AsyncMock(return_value=call_result)
    client._session = session
    return client
```
This helper skips the connection/cleanup code the PR hardens.
By assigning client._session directly, these tests never execute MCPClient.connect() / disconnect(), so regressions in the new timeout and AsyncExitStack cleanup paths would still pass here. Prefer driving the public lifecycle with a fake transport/session factory instead of mutating private state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/tools/test_mcp_integration.py` around lines 17 - 31, The
helper _make_connected_client mutates private state by setting client._session
directly, bypassing MCPClient.connect() and disconnect() and thus skipping new
timeout and AsyncExitStack cleanup logic; change the test to drive the public
lifecycle: provide a fake transport/session factory or patch the factory
function used by MCPClient to return a mock session (with AsyncMock list_tools
and call_tool behaviors) and then call client.connect() and client.disconnect()
(or use async context manager) so the real connection/cleanup paths in
MCPClient.connect and MCPClient.disconnect are exercised instead of assigning
client._session directly.
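A sketch of the suggested approach: drive `connect()`/`disconnect()` through an injected session factory instead of assigning private state. Everything here is illustrative; `DemoClient` is a toy stand-in, not the real `MCPClient`, whose constructor and factory hook may differ:

```python
import asyncio
from contextlib import asynccontextmanager
from unittest.mock import AsyncMock, MagicMock


@asynccontextmanager
async def fake_session_factory():
    # Hypothetical injection point: yields a mock session the way a real
    # transport would, so connect()/disconnect() still run their own logic.
    session = AsyncMock()
    list_result = MagicMock()
    list_result.tools = []
    session.list_tools = AsyncMock(return_value=list_result)
    yield session


class DemoClient:
    """Toy stand-in for a client with a connect/disconnect lifecycle."""

    def __init__(self, factory) -> None:
        self._factory = factory
        self._cm = None
        self.session = None

    async def connect(self) -> None:
        self._cm = self._factory()
        self.session = await self._cm.__aenter__()

    async def disconnect(self) -> None:
        if self._cm is not None:
            await self._cm.__aexit__(None, None, None)
            self._cm, self.session = None, None


async def main() -> bool:
    client = DemoClient(fake_session_factory)
    await client.connect()  # real lifecycle path, not _session mutation
    tools = (await client.session.list_tools()).tools
    await client.disconnect()
    return tools == [] and client.session is None


assert asyncio.run(main())
```

Because the factory is the only seam, a regression in the connect timeout or cleanup code would surface in such a test, unlike when `_session` is set directly.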
```python
def test_empty_input_schema_yields_none_parameters(self) -> None:
    tool_info = MCPToolInfo(
        name="no-schema",
        description="No schema",
        input_schema={},
        server_name="srv",
    )
    config = MCPServerConfig(
        name="srv",
        transport="stdio",
        command="echo",
    )
    client = MCPClient(config)
    client._session = AsyncMock()
    bridge = MCPBridgeTool(
        tool_info=tool_info,
        client=client,
    )
    assert bridge.parameters_schema is None
```
🧹 Nitpick | 🔵 Trivial
Consider using the existing fixtures for consistency.
This test creates its own MCPClient instance and manually sets _session. While functional, consider extracting this pattern into a fixture (or parameterizing existing fixtures) for consistency with other tests that use mock_client from conftest.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/tools/mcp/test_bridge_tool.py` around lines 83 - 101, Update the
test_empty_input_schema_yields_none_parameters to use the existing mock_client
fixture instead of constructing MCPClient and setting _session manually: change
the test signature to accept mock_client, use that mock_client (optionally
ensure mock_client._session is an AsyncMock if needed), instantiate
MCPBridgeTool with tool_info and client=mock_client, and remove the manual
MCPClient creation and _session assignment so the test matches other tests using
the shared mock_client fixture.
Implements all 42 findings from local review agents and external reviewers (Greptile, Copilot, CodeRabbit) on PR #194. Key fixes:
- Docker sandbox: fix UID 65534 conflict, close client on failure, remove containers in cleanup, add memory_limit validator
- MCP client: serialize session access with asyncio.Lock, deepcopy SDK-owned dicts at boundaries
- MCP bridge: split execute into helpers, skip caching error results, catch TypeError on unhashable cache keys
- MCP factory: fix partial-failure cleanup with contextlib.suppress, add reuse guard, disconnect on discovery failure
- MCP cache: deepcopy on both store and get paths
- Code runner: catch SandboxError around sandbox.execute()
- SandboxingConfig: type-safe overrides with Literal backend names
- CI: add MIT-0 and scancode-free-unknown to license allow-list
- Docs: fix DESIGN_SPEC.md tree, mount_mode ro, MCP SDK pin
- Tests: add coverage for reuse guard, partial failure, shutdown error swallowing, unhashable cache, error-not-cached, connection timeout, unknown content block
- Add `type: ignore` comments for dict[str, str] → Literal overrides in test_sandboxing_config.py (mypy arg-type/dict-item errors)
- Make the Docker integration test skip when the test image isn't pulled (CI has Docker but not python:3.12-slim pre-pulled)
- Extract _TEST_IMAGE constant to avoid hardcoded image strings
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the system's capabilities by introducing robust sandboxing mechanisms and enabling seamless integration with external tools via the Model Context Protocol. These additions provide a more secure and flexible environment for executing code and interacting with diverse services, laying critical groundwork for future feature development and enhancing overall system resilience and extensibility.
Actionable comments posted: 12
♻️ Duplicate comments (5)
src/ai_company/tools/code_runner.py (1)
78-164: 🧹 Nitpick | 🔵 Trivial
Extract result-mapping logic to reduce `execute()` length.
The `execute()` method is ~87 lines, exceeding the 50-line guideline. The result-handling block (lines 134-164) can be extracted to a private helper to improve readability and maintainability.
♻️ Proposed refactor
```diff
+    def _map_result(
+        self,
+        result: SandboxResult,
+        language: str,
+    ) -> ToolExecutionResult:
+        """Map sandbox result to tool execution result."""
+        if result.success:
+            logger.debug(CODE_RUNNER_EXECUTE_SUCCESS, language=language)
+            return ToolExecutionResult(
+                content=result.stdout or "(no output)",
+                metadata={"returncode": result.returncode, "language": language},
+            )
+
+        logger.warning(
+            CODE_RUNNER_EXECUTE_FAILED,
+            language=language,
+            returncode=result.returncode,
+            timed_out=result.timed_out,
+        )
+        error_msg = result.stderr or result.stdout or "Execution failed"
+        if result.timed_out:
+            error_msg = f"Execution timed out. {error_msg}"
+        return ToolExecutionResult(
+            content=error_msg,
+            is_error=True,
+            metadata={
+                "returncode": result.returncode,
+                "timed_out": result.timed_out,
+                "language": language,
+            },
+        )
+
     async def execute(
         self,
         *,
         arguments: dict[str, Any],
     ) -> ToolExecutionResult:
         # ... validation and sandbox call unchanged ...
-
-        if result.success:
-            logger.debug(
-                CODE_RUNNER_EXECUTE_SUCCESS,
-                language=language,
-            )
-            return ToolExecutionResult(
-                content=result.stdout or "(no output)",
-                metadata={
-                    "returncode": result.returncode,
-                    "language": language,
-                },
-            )
-
-        logger.warning(
-            CODE_RUNNER_EXECUTE_FAILED,
-            language=language,
-            returncode=result.returncode,
-            timed_out=result.timed_out,
-        )
-        error_msg = result.stderr or result.stdout or "Execution failed"
-        if result.timed_out:
-            error_msg = f"Execution timed out. {error_msg}"
-        return ToolExecutionResult(
-            content=error_msg,
-            is_error=True,
-            metadata={
-                "returncode": result.returncode,
-                "timed_out": result.timed_out,
-                "language": language,
-            },
-        )
+        return self._map_result(result, language)
```

Based on learnings: "Functions must be < 50 lines; files must be < 800 lines"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/tools/code_runner.py` around lines 78 - 164, the `execute()` method is too long; extract the result-handling block (the logic after obtaining `result` from `self._sandbox.execute` and the `SandboxError` except) into a private helper on the same class (e.g., a method named `_map_sandbox_result_to_tool_result` or `_format_execution_result`) that accepts the sandbox result and language and returns a `ToolExecutionResult`; move the success/failure logging and construction of `ToolExecutionResult` (including handling `result.success`, `result.stderr`/`stdout`, `timed_out`, `returncode`, and metadata) into that helper, then replace the original block in `execute()` with a single call to the new helper, keeping existing logger calls or delegating them into the helper as needed.

src/ai_company/tools/sandbox/sandboxing_config.py (1)
27-27: ⚠️ Potential issue | 🟠 Major
Use `NotBlankStr` for sandbox category identifiers.
Empty or whitespace-only keys are currently valid in `overrides`, and `backend_for_category("")` accepts the same invalid identifier at runtime. Both should use `NotBlankStr`.
♻️ Suggested fix
```diff
 from pydantic import BaseModel, ConfigDict, Field, model_validator

+from ai_company.core.types import NotBlankStr
 from ai_company.tools.sandbox.config import SubprocessSandboxConfig
 from ai_company.tools.sandbox.docker_config import DockerSandboxConfig
@@
-    overrides: dict[str, _BackendName] = Field(default_factory=dict)
+    overrides: dict[NotBlankStr, _BackendName] = Field(default_factory=dict)
@@
-    def backend_for_category(self, category: str) -> _BackendName:
+    def backend_for_category(self, category: NotBlankStr) -> _BackendName:
```

As per coding guidelines, "use `NotBlankStr` from `core.types` for all identifier/name fields—including optional (`NotBlankStr | None`) and tuple variants—instead of manual whitespace validators".

Also applies to: 47-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/tools/sandbox/sandboxing_config.py` at line 27, the `overrides` dict accepts blank keys; change its type from `dict[str, _BackendName]` to `dict[NotBlankStr, _BackendName]` and import `NotBlankStr` from `core.types`, and update any runtime accessor/signature that accepts a category identifier (e.g., `backend_for_category`) to require `NotBlankStr` (or `NotBlankStr | None` where optional) so blank/whitespace strings are rejected; apply the same replacement for the other identifier/name fields referenced in the block around `backend_for_category` (lines ~47-56) so all category/identifier types use `NotBlankStr` instead of plain `str`.

src/ai_company/tools/sandbox/docker_config.py (1)
1-8: ⚠️ Potential issue | 🟠 Major

Log invalid Docker config values before rejecting them.

Both validators reject config at the boundary, but the module still has no `logger` and no warning/error log on the failure path. Add structured logging with the offending field/value before raising `ValueError`.

As per coding guidelines: "Every module with business logic must include logging setup: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`.", "All error paths must log at WARNING or ERROR with context before raising", and "Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries".

Also applies to: 65-94
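A sketch of the requested pattern, with stdlib `logging` standing in for the project's `get_logger` and an invented `memory_limit_mb` field used purely for illustration:

```python
import logging

from pydantic import BaseModel, model_validator

# Stand-in for: from ai_company.observability import get_logger
logger = logging.getLogger(__name__)


class DockerSandboxConfig(BaseModel):
    """Illustrative fragment; the real model has many more fields."""

    memory_limit_mb: int = 512

    @model_validator(mode="after")
    def _validate_memory_limit(self) -> "DockerSandboxConfig":
        if self.memory_limit_mb <= 0:
            # Log the field name and offending value at WARNING before
            # rejecting, so boundary validation failures reach telemetry.
            logger.warning(
                "docker_config.invalid_value",
                extra={"field": "memory_limit_mb", "value": self.memory_limit_mb},
            )
            raise ValueError("memory_limit_mb must be positive")
        return self
```

The real code would use the structured-logging event constants instead of a bare string; the point is only that the log call precedes the raise and carries the field/value context.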
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/tools/sandbox/docker_config.py` around lines 1 - 8: add module-level logging (import get_logger from ai_company.observability and create logger = get_logger(__name__)) and update the pydantic model validators (the functions decorated with model_validator on the Docker sandbox configuration model) to log the offending field name and value at WARNING or ERROR before raising ValueError; specifically, inside each validator that currently raises ValueError for invalid Docker config values, call logger.warning or logger.error with a short message and structured context (e.g., include the field identifier and the invalid value) and then raise the ValueError as before.

src/ai_company/tools/sandbox/docker_sandbox.py (1)
223-226: ⚠️ Potential issue | 🟠 Major

These config fields are still not wired through.

`allowed_hosts` is still a TODO, and `_run_container()` forces `auto_remove=False` before the `finally` block always deletes the container anyway. That leaves two public config knobs with no effect.

Also applies to: 344-352, 384-388
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/tools/sandbox/docker_sandbox.py` around lines 223 - 226: the config fields allowed_hosts and auto_remove are not being applied: update the container startup to respect self._config.allowed_hosts by wiring it into the Docker create/run options (e.g., add to host_config["ExtraHosts"] or appropriate network settings using the allowed_hosts list) and stop forcing auto_remove=False in _run_container(); instead set the container create/run auto_remove option from self._config.auto_remove and change the finally block so the explicit container removal only runs when auto_remove is False (or when removal is still required), referencing the host_config dict, the _run_container() method, the allowed_hosts config field, and the auto_remove behavior so these knobs actually take effect.

src/ai_company/tools/mcp/client.py (1)
254-265: ⚠️ Potential issue | 🟠 Major

Resolve the session after acquiring `_lock`.

Line 254 reads `_session` before `async with self._lock`. A concurrent `disconnect()` can close that session in between, leaving `call_tool()` to run against a stale `ClientSession`. Move `_require_session()` inside the locked block.

Proposed fix

```diff
-        session = self._require_session()
         logger.debug(
             MCP_INVOKE_START,
             server=self._config.name,
             tool=tool_name,
         )
         async with self._lock:
+            session = self._require_session()
             try:
                 result = await asyncio.wait_for(
                     session.call_tool(tool_name, arguments),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/ai_company/tools/mcp/client.py` around lines 254 - 265, The code calls _require_session() and assigns session before entering async with self._lock, which allows a concurrent disconnect() to close the session and leave call_tool() using a stale ClientSession; move the call to _require_session() inside the async with self._lock block so the session is obtained while holding self._lock, then call session.call_tool(...) under the same lock (ensure you still use asyncio.wait_for with the existing timeout and preserve the MCP_INVOKE_START logging and exception handling around the await).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DESIGN_SPEC.md`:
- Around line 3024-3033: The DESIGN_SPEC contains a duplicated listing of the
sandbox package tree; remove the redundant block that repeats the "sandbox/"
subtree and consolidate into a single entry so the spec shows only one set of
children for the sandbox package (keep a single listing that includes
__init__.py, config.py, docker_config.py, docker_sandbox.py, errors.py,
protocol.py, result.py, sandboxing_config.py, subprocess_sandbox.py). Update the
tree where the duplicate appears to match the original subtree entry (ensure
only one "sandbox/" subtree is documented) and verify package names (sandbox /
tools/sandbox) are consistent with the rest of the file.
In `@src/ai_company/tools/mcp/client.py`:
- Around line 91-140: The AsyncExitStack opened with stack = AsyncExitStack();
await stack.__aenter__() can leak if connect() is cancelled (CancelledError)
before self._exit_stack is assigned; ensure the stack is always closed on any
BaseException by adding a finally (or an except BaseException) that awaits
stack.aclose() unless the stack has been stored to self._exit_stack after
successful connection. Update the connect() flow around _connect_with_stack,
session assignment, and self._exit_stack so that on cancellation or any
BaseException the partially-opened stack is closed before re-raising the error.
- Around line 85-105: Inside connect(), guard against opening a second session
by checking self._session (and/or self._exit_stack) while holding self._lock
before creating a new AsyncExitStack; if a session already exists either raise
an error (e.g., RuntimeError) or call the existing disconnect/close routine to
cleanly close the current transport, then proceed to create a new stack via
self._connect_with_stack. Ensure you do not overwrite self._session or
self._exit_stack until the old session/stack has been closed and fully released,
and keep the check/close logic inside the same self._lock context to avoid
races.
In `@src/ai_company/tools/mcp/factory.py`:
- Around line 59-61: The create_tools() reuse check currently raises
RuntimeError when self._created is true without logging; before raising, add a
log call (e.g., process/logger.warning or logger.error) that records the failure
and relevant context (such as self identity, factory configuration or key
attributes) to ensure MCP factory telemetry captures the event, then raise the
RuntimeError as before; locate the check around create_tools() and self._created
in factory.py and insert the logging call immediately prior to the raise.
In `@src/ai_company/tools/sandbox/docker_sandbox.py`:
- Around line 267-319: The execute/_run_container/_start_and_wait functions are
too long and mix responsibilities; refactor by extracting small helpers so each
function is under 50 lines: keep execute minimal (validate cwd via
_validate_cwd, compute effective_timeout, resolve container_cwd, call a single
orchestration helper), move timeout calculation to a helper like
_compute_effective_timeout, move Docker acquisition to _ensure_docker (or wrap
into _get_docker_client), extract container lifecycle steps from
_run_container/_start_and_wait into clear helpers such as
_create_and_start_container, _stream_and_collect_logs,
_wait_with_timeout_and_cleanup, and _build_sandbox_result (for constructing
SandboxResult); ensure validation, logging (DOCKER_EXECUTE_START), container
start/stop, timeout handling, and output aggregation are each in their own small
functions referenced from _run_container/_start_and_wait so the original
functions become simple orchestrators under 50 lines.
- Around line 460-467: The current logger call in docker_sandbox.py logs raw
stderr when command execution fails; instead keep the full stderr text in
SandboxResult but change the DOCKER_EXECUTE_FAILED logger.warning invocation
(the block referencing command, args, returncode, stderr) to avoid emitting raw
stderr: log bounded metadata such as stderr_length, a short stderr_preview
(e.g., first N chars) or a boolean truncated flag, and an opaque fingerprint
(e.g., sha256) if you need correlation; update the logger parameters accordingly
and ensure SandboxResult still stores the original stderr for later access.
- Around line 106-109: Add an asyncio.Lock (e.g., self._client_lock) to
serialize access to the Docker client and container-tracking state; in
particular, protect _ensure_docker(), cleanup(), and the sections of
_run_container() that read/modify self._tracked_containers and assign
self._docker by surrounding them with "async with self._client_lock" so the
TOCTOU creation/assignment of aiodocker.Docker is atomic, cleanup() cannot close
the client mid-initialization, and tracked-container append/remove operations
are atomic and not lost under concurrency.
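The bounded-metadata idea from the stderr-logging comment above can be sketched like this (the helper name and field names are illustrative, not the project's actual API):

```python
import hashlib

_PREVIEW_CHARS = 200


def stderr_log_fields(stderr: str) -> dict[str, object]:
    """Build bounded logging metadata without emitting raw stderr."""
    return {
        "stderr_length": len(stderr),
        "stderr_preview": stderr[:_PREVIEW_CHARS],
        "stderr_truncated": len(stderr) > _PREVIEW_CHARS,
        # Opaque fingerprint for correlating repeated failures in logs.
        "stderr_sha256": hashlib.sha256(stderr.encode()).hexdigest()[:12],
    }


fields = stderr_log_fields("x" * 5000)
print(fields["stderr_length"], fields["stderr_truncated"])  # 5000 True
```

`SandboxResult` keeps the full stderr for callers; only the `DOCKER_EXECUTE_FAILED` log line is bounded by passing these fields as structured-log kwargs.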
In `@src/ai_company/tools/sandbox/sandboxing_config.py`:
- Around line 1-8: This module is missing the required logger setup used by
SandboxingConfig's validation/business logic; add the import "from
ai_company.observability import get_logger" and define "logger =
get_logger(__name__)" near the top of the file, then update SandboxingConfig's
model_validator and any validation/error paths (e.g., inside the validator
function(s) referenced by model_validator and any logic around
SubprocessSandboxConfig/DockerSandboxConfig usage around lines ~35-45) to call
logger.warning or logger.error with contextual details before raising exceptions
or returning errors; ensure all newly added log calls include enough context
(config values or identifiers) to aid debugging.
In `@tests/integration/tools/test_docker_sandbox_integration.py`:
- Around line 56-58: The test currently uses DockerSandboxConfig with image
"python:3.12-slim" but `@skip_no_docker` only checks daemon reachability, so add
an image-availability check before running the test and skip if the image isn't
present (or pull it); specifically, update the test in
tests/integration/tools/test_docker_sandbox_integration.py to call the Docker
SDK (e.g., docker.from_env()) and use client.images.get("python:3.12-slim") to
detect presence (or client.images.pull("python:3.12-slim") if you prefer
auto-pull), and if not available raise unittest.SkipTest (or keep the skip
decorator) so the test using DockerSandboxConfig(image="python:3.12-slim") won't
fail with "No such image".
In `@tests/unit/tools/mcp/test_result_mapper.py`:
- Around line 164-172: The test test_unknown_block_produces_placeholder
currently mutates the global MagicMock type by setting type(unknown).__name__ =
"MysteryBlock"; replace MagicMock with a tiny local stub class (e.g., class
MysteryBlock: pass) and instantiate it instead of using MagicMock so you don’t
alter shared state; keep using MCPRawResult(content=(instance,)) and assert the
same expectations against map_call_tool_result to preserve test behavior and
isolation.
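The isolation point is simply that a throwaway local class carries its own `__name__`, so nothing shared gets mutated:

```python
from unittest.mock import MagicMock


class MysteryBlock:
    """Local stub for an unknown MCP content block type."""


unknown = MysteryBlock()
# The class name is intrinsic; no need to poke type(...).__name__.
print(type(unknown).__name__)  # MysteryBlock

# Contrast: assigning type(MagicMock()).__name__ = "MysteryBlock" would
# rename the shared MagicMock class for every other test in the process.
print(type(MagicMock()).__name__)  # MagicMock
```

The stub instance still slots into `MCPRawResult(content=(instance,))` unchanged, so the test's assertions against `map_call_tool_result` stay the same.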
In `@tests/unit/tools/sandbox/test_docker_sandbox.py`:
- Around line 424-436: The test currently only asserts
mock_docker.containers.container.call_count >= 2 which can pass without
verifying removals; update test_cleanup_stops_tracked_containers to assert that
for each tracked container the mocked container's stop and delete were awaited:
locate the DockerSandbox.cleanup invocation and the mock container objects
returned by mock_docker.containers.container, then replace or add assertions to
check call_count (or awaited call) for container.stop and container.delete
(e.g., verify stop was awaited twice and delete/removed awaited twice) and keep
the existing assertion that sandbox._tracked_containers == [] to ensure state is
cleared.
In `@tests/unit/tools/sandbox/test_sandboxing_config.py`:
- Around line 40-42: The test assigns plain str values to the fixtures/variables
used with SandboxingConfig (overrides, default, override_backend), causing mypy
failures; update their type annotations to use Literal["subprocess","docker"]
(or define a shared alias like Backend = Literal["subprocess","docker"]) and
apply that alias to the fixtures/variables (e.g., overrides: dict[str, Backend],
default: Backend, override_backend: Backend), then remove the temporary # type:
ignore on the earlier line so the test compiles cleanly under mypy while still
asserting SandboxingConfig(overrides=overrides) behaves as expected.
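A shared alias keeps the test annotations honest under mypy; a minimal sketch (the `code_execution` category name is just an example):

```python
from typing import Literal

# Shared alias so test fixtures type-check against the config's field type.
Backend = Literal["subprocess", "docker"]

default: Backend = "subprocess"
override_backend: Backend = "docker"
overrides: dict[str, Backend] = {"code_execution": override_backend}

# mypy now rejects assignments outside the literal set (e.g. "podman")
# at type-check time; at runtime the values are ordinary strings.
print(overrides["code_execution"])  # docker
```

With the alias in place, the temporary `# type: ignore` can be dropped and `SandboxingConfig(overrides=overrides)` type-checks cleanly.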
---
Duplicate comments:
In `@src/ai_company/tools/code_runner.py`:
- Around line 78-164: The execute() method is too long; extract the
result-handling block (the logic after obtaining result from
self._sandbox.execute and the SandboxError except) into a private helper on the
same class (e.g., a method named _map_sandbox_result_to_tool_result or
_format_execution_result) that accepts the sandbox result and language and
returns a ToolExecutionResult; move the success/failure logging and construction
of ToolExecutionResult (including handling result.success, result.stderr/stdout,
timed_out, returncode, and metadata) into that helper, then replace the original
block in execute() with a single call to the new helper, keeping existing logger
calls or delegating them into the helper as needed.
In `@src/ai_company/tools/mcp/client.py`:
- Around line 254-265: The code calls _require_session() and assigns session
before entering async with self._lock, which allows a concurrent disconnect() to
close the session and leave call_tool() using a stale ClientSession; move the
call to _require_session() inside the async with self._lock block so the session
is obtained while holding self._lock, then call session.call_tool(...) under the
same lock (ensure you still use asyncio.wait_for with the existing timeout and
preserve the MCP_INVOKE_START logging and exception handling around the await).
In `@src/ai_company/tools/sandbox/docker_config.py`:
- Around line 1-8: Add module-level logging (import get_logger from
ai_company.observability and create logger = get_logger(__name__)) and update
the pydantic model validators (the functions decorated with model_validator on
the Docker sandbox configuration model) to log the offending field name and
value at WARNING or ERROR before raising ValueError; specifically, inside each
validator that currently raises ValueError for invalid Docker config values,
call logger.warning or logger.error with a short message and structured context
(e.g., include the field identifier and the invalid value) and then raise the
ValueError as before.
In `@src/ai_company/tools/sandbox/docker_sandbox.py`:
- Around line 223-226: The config fields allowed_hosts and auto_remove are not
being applied: update the container startup to respect
self._config.allowed_hosts by wiring it into the Docker create/run options
(e.g., add to host_config["ExtraHosts"] or appropriate network settings using
the allowed_hosts list) and stop forcing auto_remove=False in _run_container();
instead set the container create/run auto_remove option from
self._config.auto_remove and change the finally block so the explicit container
removal only runs when auto_remove is False (or when removal is still required),
referencing the host_config dict, the _run_container() method, the allowed_hosts
config field, and the auto_remove behavior so these knobs actually take effect.
In `@src/ai_company/tools/sandbox/sandboxing_config.py`:
- Line 27: The overrides dict accepts blank keys; change its type from dict[str,
_BackendName] to dict[NotBlankStr, _BackendName] and import NotBlankStr from
core.types, and update any runtime accessor/signature that accepts a category
identifier (e.g., backend_for_category) to require NotBlankStr (or NotBlankStr |
None where optional) so blank/whitespace strings are rejected; apply the same
replacement for the other identifier/name fields referenced in the block around
backend_for_category (lines ~47-56) so all category/identifier types use
NotBlankStr instead of plain str.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 806d9535-22e7-4320-b5bb-441b8f564d10
📒 Files selected for processing (22)
- .github/workflows/dependency-review.yml
- DESIGN_SPEC.md
- README.md
- docker/sandbox/Dockerfile
- src/ai_company/observability/events/mcp.py
- src/ai_company/tools/code_runner.py
- src/ai_company/tools/mcp/bridge_tool.py
- src/ai_company/tools/mcp/cache.py
- src/ai_company/tools/mcp/client.py
- src/ai_company/tools/mcp/factory.py
- src/ai_company/tools/sandbox/docker_config.py
- src/ai_company/tools/sandbox/docker_sandbox.py
- src/ai_company/tools/sandbox/protocol.py
- src/ai_company/tools/sandbox/sandboxing_config.py
- tests/integration/tools/test_docker_sandbox_integration.py
- tests/unit/tools/mcp/conftest.py
- tests/unit/tools/mcp/test_bridge_tool.py
- tests/unit/tools/mcp/test_client.py
- tests/unit/tools/mcp/test_factory.py
- tests/unit/tools/mcp/test_result_mapper.py
- tests/unit/tools/sandbox/test_docker_sandbox.py
- tests/unit/tools/sandbox/test_sandboxing_config.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (4)
**/*.md
📄 CodeRabbit inference engine (CLAUDE.md)
DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality
Files:
README.md, DESIGN_SPEC.md
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: No `from __future__ import annotations` — Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: use `except A, B:` (no parentheses) — ruff enforces this on Python 3.14
All public functions must have type hints; mypy strict mode is enforced
Public classes and functions must have Google-style docstrings (enforced by ruff D rules)
Every module with business logic must include logging setup: `from ai_company.observability import get_logger` then `logger = get_logger(__name__)`. Never use `import logging` or `print()` in application code. Variable name must always be `logger`
Use structured logging with event name constants: `logger.info(EVENT_CONSTANT, key=value)` — never use string interpolation like `logger.info('msg %s', val)`. Import event constants directly from `ai_company.observability.events.<domain>`
All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Line length maximum is 88 characters (enforced by ruff)
Functions must be < 50 lines; files must be < 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries
Files:
src/ai_company/tools/sandbox/sandboxing_config.py, src/ai_company/tools/sandbox/docker_sandbox.py, tests/integration/tools/test_docker_sandbox_integration.py, tests/unit/tools/mcp/conftest.py, src/ai_company/tools/code_runner.py, tests/unit/tools/mcp/test_factory.py, src/ai_company/tools/mcp/cache.py, tests/unit/tools/mcp/test_bridge_tool.py, tests/unit/tools/sandbox/test_docker_sandbox.py, tests/unit/tools/sandbox/test_sandboxing_config.py, tests/unit/tools/mcp/test_client.py, src/ai_company/tools/mcp/client.py, src/ai_company/tools/sandbox/docker_config.py, src/ai_company/tools/mcp/factory.py, src/ai_company/tools/sandbox/protocol.py, src/ai_company/tools/mcp/bridge_tool.py, tests/unit/tools/mcp/test_result_mapper.py, src/ai_company/observability/events/mcp.py
src/ai_company/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
src/ai_company/**/*.py: Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model
Use `@computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`); use `NotBlankStr` from `core.types` for all identifier/name fields—including optional (`NotBlankStr | None`) and tuple variants—instead of manual whitespace validators
Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction + `MappingProxyType` for read-only enforcement. For `dict`/`list` fields in frozen Pydantic models, rely on `frozen=True` and `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Prefer `asyncio.TaskGroup` for fan-out/fan-in parallel operations in new code (multiple tool invocations, parallel agent calls). Prefer structured concurrency over bare `create_task`
Files:
src/ai_company/tools/sandbox/sandboxing_config.py, src/ai_company/tools/sandbox/docker_sandbox.py, src/ai_company/tools/code_runner.py, src/ai_company/tools/mcp/cache.py, src/ai_company/tools/mcp/client.py, src/ai_company/tools/sandbox/docker_config.py, src/ai_company/tools/mcp/factory.py, src/ai_company/tools/sandbox/protocol.py, src/ai_company/tools/mcp/bridge_tool.py, src/ai_company/observability/events/mcp.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, `@pytest.mark.slow`. Minimum coverage 80% (enforced in CI). Async mode: `asyncio_mode = 'auto'` — no manual `@pytest.mark.asyncio` needed. Per-test timeout: 30 seconds. Use `pytest-xdist` via `-n auto` for parallelism. Prefer `@pytest.mark.parametrize` for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Tests must use `test-provider`, `test-small-001`, etc. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) `.claude/` files, (3) third-party import paths
Files:
tests/integration/tools/test_docker_sandbox_integration.py, tests/unit/tools/mcp/conftest.py, tests/unit/tools/mcp/test_factory.py, tests/unit/tools/mcp/test_bridge_tool.py, tests/unit/tools/sandbox/test_docker_sandbox.py, tests/unit/tools/sandbox/test_sandboxing_config.py, tests/unit/tools/mcp/test_client.py, tests/unit/tools/mcp/test_result_mapper.py
🧠 Learnings (7)
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.md : DESIGN_SPEC.md is MANDATORY reading before implementing any feature. The spec is the starting point for architecture, data models, and behavior. If implementation deviates from the spec, alert the user and explain why — user decides whether to proceed or update the spec. Do not silently diverge. When spec sections are referenced (e.g. 'Section 10.2'), read that section verbatim before coding. When approved deviations occur, update DESIGN_SPEC.md to reflect the new reality
Applied to files:
README.md, DESIGN_SPEC.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to pyproject.toml : Organize dependencies into groups: `test` (pytest + plugins), `dev` (includes test + ruff, mypy, pre-commit, commitizen). Install with `uv sync` (installs all groups including dev by default)
Applied to files:
README.md
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Use `computed_field` for derived values instead of storing + validating redundant fields (e.g. `TokenUsage.total_tokens`); use `NotBlankStr` from `core.types` for all identifier/name fields—including optional (`NotBlankStr | None`) and tuple variants—instead of manual whitespace validators
Applied to files:
src/ai_company/tools/sandbox/sandboxing_config.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Functions must be < 50 lines; files must be < 800 lines
Applied to files:
src/ai_company/tools/code_runner.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Create new objects for immutability; never mutate existing ones. For non-Pydantic internal collections (registries, `BaseTool`), use `copy.deepcopy()` at construction + `MappingProxyType` for read-only enforcement. For `dict`/`list` fields in frozen Pydantic models, rely on `frozen=True` and `copy.deepcopy()` at system boundaries (tool execution, LLM provider serialization, inter-agent delegation, persistence serialization)
Applied to files:
src/ai_company/tools/mcp/cache.py, src/ai_company/tools/mcp/client.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to src/ai_company/**/*.py : Use Pydantic v2 with frozen models for config/identity; use separate mutable-via-copy models (with `model_copy(update=...)`) for runtime state that evolves. Never mix static config fields with mutable runtime fields in one model
Applied to files:
src/ai_company/tools/mcp/cache.py, src/ai_company/tools/mcp/client.py, src/ai_company/tools/sandbox/docker_config.py
📚 Learning: 2026-03-09T20:32:32.549Z
Learnt from: CR
Repo: Aureliolo/ai-company PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-03-09T20:32:32.549Z
Learning: Applies to **/*.py : Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries
Applied to files:
src/ai_company/tools/sandbox/docker_config.py
🧬 Code graph analysis (12)
src/ai_company/tools/sandbox/sandboxing_config.py (2)
- src/ai_company/tools/sandbox/config.py (1): `SubprocessSandboxConfig` (8-83)
- src/ai_company/tools/sandbox/docker_config.py (1): `DockerSandboxConfig` (12-94)

src/ai_company/tools/sandbox/docker_sandbox.py (3)
- src/ai_company/tools/sandbox/docker_config.py (1): `DockerSandboxConfig` (12-94)
- src/ai_company/tools/sandbox/errors.py (2): `SandboxError` (10-11), `SandboxStartError` (23-24)
- src/ai_company/tools/sandbox/result.py (1): `SandboxResult` (6-28)

tests/integration/tools/test_docker_sandbox_integration.py (3)
- src/ai_company/tools/sandbox/docker_config.py (1): `DockerSandboxConfig` (12-94)
- src/ai_company/tools/sandbox/docker_sandbox.py (6): `DockerSandbox` (70-610), `config` (112-114), `workspace` (117-119), `execute` (267-319), `cleanup` (558-583), `health_check` (585-606)
- src/ai_company/tools/sandbox/protocol.py (3): `execute` (25-51), `cleanup` (53-59), `health_check` (61-67)

src/ai_company/tools/code_runner.py (6)
- src/ai_company/core/enums.py (1): `ToolCategory` (293-307)
- src/ai_company/tools/base.py (5): `BaseTool` (56-161), `ToolExecutionResult` (24-53), `description` (115-117), `category` (110-112), `parameters_schema` (120-128)
- src/ai_company/tools/sandbox/errors.py (1): `SandboxError` (10-11)
- src/ai_company/tools/sandbox/protocol.py (2): `SandboxBackend` (17-75), `execute` (25-51)
- src/ai_company/tools/sandbox/docker_sandbox.py (1): `execute` (267-319)
- src/ai_company/tools/mcp/cache.py (1): `get` (47-82)

src/ai_company/tools/mcp/cache.py (2)
- src/ai_company/observability/_logger.py (1): `get_logger` (8-28)
- src/ai_company/tools/base.py (1): `ToolExecutionResult` (24-53)

tests/unit/tools/mcp/test_bridge_tool.py (7)
- src/ai_company/core/enums.py (1): `ToolCategory` (293-307)
- src/ai_company/tools/base.py (4): `ToolExecutionResult` (24-53), `category` (110-112), `description` (115-117), `parameters_schema` (120-128)
- src/ai_company/tools/mcp/bridge_tool.py (3): `MCPBridgeTool` (31-188), `tool_info` (62-64), `execute` (66-88)
- src/ai_company/tools/mcp/client.py (4): `MCPClient` (47-472), `config` (65-67), `server_name` (75-77), `call_tool` (233-311)
- src/ai_company/tools/mcp/config.py (1): `MCPServerConfig` (22-147)
- src/ai_company/tools/mcp/models.py (1): `MCPToolInfo` (14-37)
- src/ai_company/tools/mcp/cache.py (2): `MCPResultCache` (23-149), `get` (47-82)

tests/unit/tools/sandbox/test_sandboxing_config.py (3)
- src/ai_company/tools/sandbox/docker_config.py (1): `DockerSandboxConfig` (12-94)
- src/ai_company/tools/sandbox/sandboxing_config.py (2): `SandboxingConfig` (14-56), `backend_for_category` (47-56)
- tests/unit/tools/sandbox/test_docker_config.py (2): `test_defaults` (14-25), `test_frozen` (27-30)

tests/unit/tools/mcp/test_client.py (4)
- src/ai_company/tools/mcp/client.py (9): `MCPClient` (47-472), `config` (65-67), `is_connected` (70-72), `server_name` (75-77), `connect` (79-140), `disconnect` (161-180), `list_tools` (182-231), `call_tool` (233-311), `reconnect` (313-324)
- src/ai_company/tools/mcp/config.py (1): `MCPServerConfig` (22-147)
- src/ai_company/tools/mcp/errors.py (4): `MCPConnectionError` (14-15), `MCPDiscoveryError` (22-23), `MCPInvocationError` (26-27), `MCPTimeoutError` (18-19)
- src/ai_company/tools/mcp/models.py (1): `MCPToolInfo` (14-37)

src/ai_company/tools/mcp/client.py (4)
- src/ai_company/observability/_logger.py (1): `get_logger` (8-28)
- src/ai_company/tools/mcp/errors.py (4): `MCPConnectionError` (14-15), `MCPDiscoveryError` (22-23), `MCPInvocationError` (26-27), `MCPTimeoutError` (18-19)
- src/ai_company/tools/mcp/models.py (2): `MCPRawResult` (40-62), `MCPToolInfo` (14-37)
- src/ai_company/tools/mcp/config.py (1): `MCPServerConfig` (22-147)

src/ai_company/tools/mcp/factory.py (5)
- src/ai_company/tools/mcp/bridge_tool.py (2): `MCPBridgeTool` (31-188), `tool_info` (62-64)
- src/ai_company/tools/mcp/cache.py (1): `MCPResultCache` (23-149)
- src/ai_company/tools/mcp/client.py (6): `MCPClient` (47-472), `config` (65-67), `disconnect` (161-180), `server_name` (75-77), `connect` (79-140), `list_tools` (182-231)
- src/ai_company/tools/mcp/config.py (2): `MCPConfig` (150-176), `MCPServerConfig` (22-147)
- src/ai_company/tools/mcp/models.py (1): `MCPToolInfo` (14-37)

src/ai_company/tools/mcp/bridge_tool.py (5)
- src/ai_company/tools/mcp/errors.py (1): `MCPError` (10-11)
- src/ai_company/tools/mcp/result_mapper.py (1): `map_call_tool_result` (29-108)
- src/ai_company/tools/mcp/cache.py (3): `MCPResultCache` (23-149), `get` (47-82), `put` (84-115)
- src/ai_company/tools/mcp/client.py (3): `MCPClient` (47-472), `server_name` (75-77), `call_tool` (233-311)
- src/ai_company/tools/mcp/models.py (1): `MCPToolInfo` (14-37)

tests/unit/tools/mcp/test_result_mapper.py (2)
- src/ai_company/tools/mcp/models.py (1): `MCPRawResult` (40-62)
- src/ai_company/tools/mcp/result_mapper.py (1): `map_call_tool_result` (29-108)
🪛 Checkov (3.2.334)
docker/sandbox/Dockerfile
[low] 1-19: Ensure that HEALTHCHECK instructions have been added to container images
(CKV_DOCKER_2)
🪛 GitHub Actions: CI
src/ai_company/tools/sandbox/docker_sandbox.py
[error] 355-355: SandboxStartError: Failed to create container: [404] No such image: python:3.12-slim
tests/integration/tools/test_docker_sandbox_integration.py
[error] 65-65: SandboxStartError: Failed to create container: [404] No such image: python:3.12-slim
[error] 85-85: SandboxStartError: Failed to create container: [404] No such image: python:3.12-slim
tests/unit/tools/sandbox/test_sandboxing_config.py
[error] 41-41: mypy: Argument "overrides" to "SandboxingConfig" has incompatible type "dict[str, str]"; expected "dict[str, Literal['subprocess', 'docker']]" [arg-type]
[error] 93-93: mypy: Dict entry 0 has incompatible type "str": "str"; expected "str": "Literal['subprocess', 'docker']" [dict-item]
🪛 Hadolint (2.14.0)
docker/sandbox/Dockerfile
[warning] 9-9: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
🪛 LanguageTool
README.md
[typographical] ~18-~18: To join two clauses or introduce examples, consider using an em dash.
Context: ...ation analytics models - Tool System - File system tools, git tools, sandbox ab...
(DASH_RULE)
🪛 Trivy (0.69.3)
docker/sandbox/Dockerfile
[info] 1-1: No HEALTHCHECK defined
Add HEALTHCHECK instruction in your Dockerfile
Rule: DS-0026
(IaC/Dockerfile)
🔇 Additional comments (8)
.github/workflows/dependency-review.yml (1)

27-31: Looks good. Adding `MIT-0` to the allow-list is consistent with the documented scanner exceptions, and the extra comments make future license-review decisions easier to audit.

docker/sandbox/Dockerfile (1)

1-19: LGTM! Well-structured sandbox Dockerfile. The multi-stage build avoids curl-pipe-bash patterns, UID 10001 prevents conflicts with common system users, and the non-login shell with read-only workspace alignment supports the hardening objectives.
The Hadolint warning about pinning apt package versions (Line 9) is valid but low-priority for ephemeral sandbox images. The HEALTHCHECK warnings from Trivy/Checkov are not applicable here; the parent `DockerSandbox` orchestrator manages container health externally via its `health_check()` method.

src/ai_company/tools/sandbox/protocol.py (1)

22-22: LGTM! Documentation correctly updated to reflect that Subprocess and Docker are now built-in backends, aligning with the implementation delivered in this PR.
Also applies to: 40-40

tests/unit/tools/mcp/conftest.py (1)

1-169: LGTM! Well-structured test fixtures providing reusable mock MCP sessions, clients, and configurations. The helper functions for creating mock tool results enable clean, isolated unit tests. Generic test identifiers (`test-stdio`, `test-http`, `test-server`) comply with the naming guidelines.

src/ai_company/observability/events/mcp.py (1)

1-29: LGTM! Well-organized event constants with consistent naming patterns (`MCP_<CATEGORY>_<ACTION>` → `"mcp.<category>.<action>"`). These enable structured logging across the MCP client, bridge, factory, and cache components.

README.md (1)

18-18: LGTM! Documentation accurately reflects the implemented features: Docker sandbox backend, code runner, and MCP bridge. Docker is now documented as an optional requirement (Line 59) with installation guidance, addressing the previous review feedback.
Also applies to: 41-41, 50-50, 59-59

tests/unit/tools/mcp/test_bridge_tool.py (1)

1-216: LGTM! Comprehensive test coverage for `MCPBridgeTool`: construction properties, execution delegation, cache hit/miss semantics, unhashable argument handling, and error result exclusion from cache. The inline `MCPClient` creation in `test_empty_input_schema_yields_none_parameters` (Lines 84-102) is acceptable since it tests a specific edge case requiring an empty `input_schema` that the shared fixtures don't provide.

tests/integration/tools/test_docker_sandbox_integration.py (1)

21-41: LGTM on resource leak fix. The `_check()` helper now properly closes the Docker client in the `finally` block, addressing the previous review feedback about resource leaks on exception paths.
```python
        self._config = config or _DEFAULT_CONFIG
        self._workspace = resolved
        self._docker: aiodocker.Docker | None = None
        self._tracked_containers: list[str] = []
```
🧩 Analysis chain
🏁 Script executed: `cat -n src/ai_company/tools/sandbox/docker_sandbox.py` (Repository: Aureliolo/ai-company)
Add synchronization to protect concurrent Docker client lifecycle and container tracking.

The code has race conditions in `_ensure_docker()` (lines 121-144), `cleanup()` (lines 558-583), and tracked-container mutations (lines 366-369, 386-388):

- `_ensure_docker()` TOCTOU: multiple concurrent tasks can each observe `self._docker is None`, create separate `aiodocker.Docker` clients, and race to assign to `self._docker`, leaking clients.
- `cleanup()` interference: while one task is initializing the client in `_ensure_docker()`, another task can call `cleanup()`, closing the client before it is assigned or used, causing use-after-close errors.
- Container tracking mutations: list reads and modifications in `_run_container()` (lines 366-369, 386-388) are not atomic; concurrent executions can lose track of container IDs.

Protect these sections with an `asyncio.Lock` to serialize access to `_ensure_docker()`, `cleanup()`, and tracked-container state.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/tools/sandbox/docker_sandbox.py` around lines 106 - 109, Add
an asyncio.Lock (e.g., self._client_lock) to serialize access to the Docker
client and container-tracking state; in particular, protect _ensure_docker(),
cleanup(), and the sections of _run_container() that read/modify
self._tracked_containers and assign self._docker by surrounding them with "async
with self._client_lock" so the TOCTOU creation/assignment of aiodocker.Docker is
atomic, cleanup() cannot close the client mid-initialization, and
tracked-container append/remove operations are atomic and not lost under
concurrency.
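The locking pattern the comment asks for can be sketched with a small self-contained model. Everything here is hypothetical illustration, not the repo's code: `LazyClient`, its `create_calls` counter, and the `object()` placeholder stand in for `DockerSandbox` and `aiodocker.Docker()`.

```python
import asyncio


class LazyClient:
    """Minimal model of double-checked lazy init guarded by one asyncio.Lock."""

    def __init__(self) -> None:
        self._client: object | None = None
        self._lock = asyncio.Lock()
        self.create_calls = 0  # counts how many clients were actually created

    async def _ensure_client(self) -> object:
        # Fast path: no lock needed once the client exists.
        if self._client is not None:
            return self._client
        async with self._lock:
            # Re-check under the lock: another task may have won the race
            # while this one was waiting to acquire it.
            if self._client is None:
                self.create_calls += 1
                await asyncio.sleep(0)  # simulate the awaited version() handshake
                self._client = object()  # stands in for aiodocker.Docker()
            return self._client

    async def cleanup(self) -> None:
        # Same lock: cleanup cannot close a client mid-initialization.
        async with self._lock:
            self._client = None


async def main() -> int:
    lazy = LazyClient()
    await asyncio.gather(*(lazy._ensure_client() for _ in range(10)))
    return lazy.create_calls


print(asyncio.run(main()))  # → 1: one client despite 10 concurrent callers
```

Without the lock, each of the ten tasks could pass the `is None` check before any assignment happens, creating up to ten clients and leaking nine.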
| """Top-level sandboxing configuration model.""" | ||
|
|
||
| from typing import Literal, Self | ||
|
|
||
| from pydantic import BaseModel, ConfigDict, Field, model_validator | ||
|
|
||
| from ai_company.tools.sandbox.config import SubprocessSandboxConfig | ||
| from ai_company.tools.sandbox.docker_config import DockerSandboxConfig |
🛠️ Refactor suggestion | 🟠 Major
Add the required logger to this config module.
`SandboxingConfig` contains business logic/validation, but the module never defines `logger`, so it cannot meet the repo rule to emit structured warnings before rejecting invalid config.
As per coding guidelines, "Every module with business logic must include logging setup: from ai_company.observability import get_logger then logger = get_logger(__name__)." and "All error paths must log at WARNING or ERROR with context before raising".
Also applies to: 35-45
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/ai_company/tools/sandbox/sandboxing_config.py` around lines 1 - 8, This
module is missing the required logger setup used by SandboxingConfig's
validation/business logic; add the import "from ai_company.observability import
get_logger" and define "logger = get_logger(__name__)" near the top of the file,
then update SandboxingConfig's model_validator and any validation/error paths
(e.g., inside the validator function(s) referenced by model_validator and any
logic around SubprocessSandboxConfig/DockerSandboxConfig usage around lines
~35-45) to call logger.warning or logger.error with contextual details before
raising exceptions or returning errors; ensure all newly added log calls include
enough context (config values or identifiers) to aid debugging.
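A minimal sketch of that guideline in action. The `_StructuredLogger` shim, the event-constant name, and the `validate_default_backend` helper are illustrative stand-ins; real code would use `get_logger` from `ai_company.observability` and the module's own event constants.

```python
# Hypothetical event constant, mirroring the repo's naming convention.
SANDBOX_CONFIG_INVALID_BACKEND = "sandbox.config.invalid_backend"


class _StructuredLogger:
    """Stand-in for the repo's structured logger (get_logger(__name__))."""

    def warning(self, event: str, **context: object) -> None:
        print(event, dict(sorted(context.items())))


logger = _StructuredLogger()


def validate_default_backend(backend: str) -> str:
    """Validate at the config boundary, logging context before rejecting."""
    if backend not in ("subprocess", "docker"):
        # Error path logs at WARNING with context before raising,
        # per the guideline quoted above.
        logger.warning(SANDBOX_CONFIG_INVALID_BACKEND, backend=backend)
        raise ValueError(f"Unknown sandbox backend: {backend!r}")
    return backend


print(validate_default_backend("docker"))  # → docker
```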
```python
    def test_unknown_block_produces_placeholder(self) -> None:
        from unittest.mock import MagicMock

        unknown = MagicMock()
        type(unknown).__name__ = "MysteryBlock"
        raw = MCPRawResult(content=(unknown,))
        result = map_call_tool_result(raw)
        assert "[unknown: MysteryBlock]" in result.content
        assert "attachments" not in result.metadata
```
🧩 Analysis chain
🏁 Scripts executed (Repository: Aureliolo/ai-company):
- `cat -n tests/unit/tools/mcp/test_result_mapper.py | sed -n '160,180p'`
- `grep -n "setup\|teardown\|fixture" tests/unit/tools/mcp/test_result_mapper.py`
- `grep -n "MagicMock\|MysteryBlock" tests/unit/tools/mcp/test_result_mapper.py`
- `wc -l tests/unit/tools/mcp/test_result_mapper.py`
- `head -n 30 tests/unit/tools/mcp/test_result_mapper.py`
- `rg "MagicMock" tests/ --type py -B2 -A2`
- `cat pytest.ini 2>/dev/null || cat pyproject.toml 2>/dev/null | grep -A 20 "\[tool.pytest"`
- `grep -n "__name__\|type(" src/ai_company/tools/mcp/result_mapper.py -A 2 -B 2`
- `wc -l src/ai_company/tools/mcp/result_mapper.py`
- `sed -n '70,85p' src/ai_company/tools/mcp/result_mapper.py`
- `sed -n '164,215p' tests/unit/tools/mcp/test_result_mapper.py`
Replace MagicMock with a local stub class to avoid mutating the shared MagicMock class.
`type(unknown).__name__ = "MysteryBlock"` mutates the shared `MagicMock` class globally, which can cause test isolation issues if test order changes or parallelization is used. A simple local stub class achieves the same coverage without state leakage.
♻️ Suggested fix

```diff
 def test_unknown_block_produces_placeholder(self) -> None:
-    from unittest.mock import MagicMock
-
-    unknown = MagicMock()
-    type(unknown).__name__ = "MysteryBlock"
+    class MysteryBlock:
+        pass
+
+    unknown = MysteryBlock()
     raw = MCPRawResult(content=(unknown,))
     result = map_call_tool_result(raw)
     assert "[unknown: MysteryBlock]" in result.content
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/tools/mcp/test_result_mapper.py` around lines 164 - 172, The test
test_unknown_block_produces_placeholder currently mutates the global MagicMock
type by setting type(unknown).__name__ = "MysteryBlock"; replace MagicMock with
a tiny local stub class (e.g., class MysteryBlock: pass) and instantiate it
instead of using MagicMock so you don’t alter shared state; keep using
MCPRawResult(content=(instance,)) and assert the same expectations against
map_call_tool_result to preserve test behavior and isolation.
| overrides = {"code_execution": "docker", "terminal": "subprocess"} | ||
| config = SandboxingConfig(overrides=overrides) | ||
| assert config.overrides == overrides |
Fix the fixture types so mypy stops failing.
`overrides`, `default`, and `override_backend` are inferred as plain `str`, which is exactly why CI reports `dict[str, str]` / `str` here. Type them as `Literal["subprocess", "docker"]` (or a shared alias) and drop the ignore on Line 92.
♻️ Suggested fix

```diff
+from typing import Literal
+
 import pytest
 from pydantic import ValidationError
 from ai_company.tools.sandbox.docker_config import DockerSandboxConfig
 from ai_company.tools.sandbox.sandboxing_config import SandboxingConfig
+
+BackendName = Literal["subprocess", "docker"]
@@
     def test_overrides(self) -> None:
-        overrides = {"code_execution": "docker", "terminal": "subprocess"}
+        overrides: dict[str, BackendName] = {
+            "code_execution": "docker",
+            "terminal": "subprocess",
+        }
         config = SandboxingConfig(overrides=overrides)
         assert config.overrides == overrides
@@
     def test_parametrized_routing(
         self,
-        default: str,
-        override_backend: str,
-        expected: str,
+        default: BackendName,
+        override_backend: BackendName,
+        expected: BackendName,
     ) -> None:
         config = SandboxingConfig(
-            default_backend=default,  # type: ignore[arg-type]
+            default_backend=default,
             overrides={"code_execution": override_backend},
         )
```

Also applies to: 78-94
🧰 Tools
🪛 GitHub Actions: CI
[error] 41-41: mypy: Argument "overrides" to "SandboxingConfig" has incompatible type "dict[str, str]"; expected "dict[str, Literal['subprocess', 'docker']]" [arg-type]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/tools/sandbox/test_sandboxing_config.py` around lines 40 - 42, The
test assigns plain str values to the fixtures/variables used with
SandboxingConfig (overrides, default, override_backend), causing mypy failures;
update their type annotations to use Literal["subprocess","docker"] (or define a
shared alias like Backend = Literal["subprocess","docker"]) and apply that alias
to the fixtures/variables (e.g., overrides: dict[str, Backend], default:
Backend, override_backend: Backend), then remove the temporary # type: ignore on
the earlier line so the test compiles cleanly under mypy while still asserting
SandboxingConfig(overrides=overrides) behaves as expected.
Pull request overview
Copilot reviewed 47 out of 48 changed files in this pull request and generated 7 comments.
```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && apt-get clean && rm -rf /var/lib/apt/lists/*
```
The Dockerfile does not set ReadOnly mode or drop capabilities at the image layer — those are applied at runtime by DockerSandbox._build_container_config. This is correct design. However, the Dockerfile installs git (line 9) and copies the full Node.js installation including npm. Because ReadonlyRootfs: True is applied at runtime, the sandbox container cannot write to the filesystem — but git and npm may try to write to /home, /tmp, or other paths that are blocked by the read-only rootfs. If agents need to run git or npm inside the sandbox, they will fail silently or with confusing errors. Consider documenting this constraint, or mounting a writable /tmp via tmpfs in the container config to allow tools that need temporary write access.
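For reference, a writable `/tmp` under a read-only rootfs can be expressed with the Docker Engine API's `Tmpfs` host-config key. The fragment below is a hypothetical sketch; the field names follow the Engine API, but how such a fragment would be merged into `_build_container_config` is assumed.

```python
# Hypothetical HostConfig fragment (Docker Engine API field names).
host_config = {
    "ReadonlyRootfs": True,  # rootfs stays immutable
    "CapDrop": ["ALL"],
    "PidsLimit": 128,
    # RAM-backed, size-capped scratch space so tools like git/npm that
    # expect a writable /tmp keep working without weakening the rootfs.
    "Tmpfs": {"/tmp": "rw,noexec,nosuid,size=64m"},
}

print(sorted(host_config))  # → ['CapDrop', 'PidsLimit', 'ReadonlyRootfs', 'Tmpfs']
```

Because a tmpfs mount is memory-backed and size-capped, it gives tools temporary write access without reintroducing a persistent writable filesystem.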
| "timeout": { | ||
| "type": "number", | ||
| "description": "Optional timeout in seconds", | ||
| "minimum": 0, | ||
| "maximum": 600, | ||
| }, |
The code_runner.py timeout schema sets "maximum": 600 (line 46), but the existing PR description says "timeout schema maximum: 600". However, DockerSandboxConfig.timeout_seconds has le=600, so a user passing timeout=600 in arguments would be accepted by the schema but then clamped to min(600, config.timeout_seconds). If the config uses the default of 120s, a timeout of 600 passed from the schema would be silently reduced to 120s. While this is intentional clamping behavior documented in the execute docstring, the discrepancy between the advertised max (600) and the effective default max (120) could surprise users. Consider either lowering the schema maximum to match the config default or adding a note in the schema description.
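The clamping interaction can be made concrete. The values follow the description above (schema maximum 600, config default 120); the helper name is hypothetical.

```python
SCHEMA_MAX_TIMEOUT = 600.0  # advertised in the tool's JSON schema
CONFIG_DEFAULT_CEILING = 120.0  # DockerSandboxConfig.timeout_seconds default


def effective_timeout(
    requested: float, ceiling: float = CONFIG_DEFAULT_CEILING
) -> float:
    """The backend clamps the schema-validated request to the config ceiling."""
    return min(requested, ceiling)


print(effective_timeout(600.0))  # → 120.0: the schema accepts it, config clamps it
```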
```python
    async def _ensure_docker(self) -> aiodocker.Docker:
        """Lazily connect to the Docker daemon.

        Returns:
            An ``aiodocker.Docker`` client instance.

        Raises:
            SandboxStartError: If the Docker daemon is unavailable.
        """
        if self._docker is not None:
            return self._docker
        client = aiodocker.Docker()
        try:
            await client.version()
        except Exception as exc:
            await client.close()
            logger.exception(
                DOCKER_DAEMON_UNAVAILABLE,
                error=str(exc),
            )
            msg = f"Docker daemon unavailable: {exc}"
            raise SandboxStartError(msg) from exc
        self._docker = client
        return client
```
The _start_and_wait method catches TimeoutError from asyncio.wait_for. In Python 3.11+, asyncio.wait_for raises TimeoutError (the built-in) rather than asyncio.TimeoutError, which is now an alias. However, the test at line 182 patches side_effect=asyncio.TimeoutError. While these are currently the same class so the tests pass, the code and tests are correct as-is. The real concern is: _ensure_docker is called inside execute, but execute holds no lock. If two concurrent calls both see self._docker is None, they will both create a new aiodocker.Docker() client, only one will be stored (the second call will overwrite self._docker), and the first client is leaked. Since _ensure_docker is async and there is no lock around the self._docker is None check and the subsequent assignment, this is a race condition. Consider adding an asyncio lock to guard _ensure_docker initialization.
src/ai_company/tools/mcp/client.py (Outdated)
```python
        session = self._require_session()
        logger.debug(
            MCP_INVOKE_START,
            server=self._config.name,
            tool=tool_name,
        )
        async with self._lock:
            try:
                result = await asyncio.wait_for(
                    session.call_tool(tool_name, arguments),
                    timeout=self._config.timeout_seconds,
                )
```
In call_tool, _require_session() is called outside the lock (line 254), and then the lock is acquired on line 260. This creates a TOCTOU (time-of-check-time-of-use) race: after _require_session() confirms the session is not None but before the lock is acquired, a concurrent disconnect() could set self._session = None. The subsequent session.call_tool(...) call at line 263 would then operate on a stale session reference. _require_session() should be called inside the async with self._lock block, consistent with how it is used in list_tools.
```python
        if self._docker is not None:
            try:
                for cid in self._tracked_containers:
                    await self._stop_container(self._docker, cid)
                    await self._remove_container(self._docker, cid)
            except Exception as exc:
                logger.warning(
                    DOCKER_CLEANUP,
                    error=f"Container cleanup failed: {exc}",
                )
```
The cleanup method has a bug in its exception handling. The for loop that iterates self._tracked_containers is wrapped in a single try/except. Because _stop_container and _remove_container already catch all exceptions internally and just log them, they will never raise. However, if future changes cause them to raise, the except block would break out of the loop early and skip remaining containers. Since the intent is clearly to iterate all containers regardless of errors, the try/except around the loop body should be removed (the inner methods already handle exceptions), or each call should individually be protected.
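The per-container protection the comment suggests can be sketched as below. The `cleanup_all` helper and the stub `stop`/`remove` callables are hypothetical, not the repo's code.

```python
def cleanup_all(container_ids, stop, remove, log_warning):
    """Attempt every container; one failure must not skip the rest."""
    for cid in container_ids:
        try:
            stop(cid)
            remove(cid)
        except Exception as exc:  # cleanup must be best-effort
            log_warning(cid, exc)


failures: list[str] = []


def flaky_stop(cid: str) -> None:
    # Simulates a daemon error on one specific container.
    if cid == "bad":
        raise RuntimeError("daemon hiccup")


cleanup_all(
    ["a", "bad", "c"],
    stop=flaky_stop,
    remove=lambda cid: None,
    log_warning=lambda cid, exc: failures.append(cid),
)
print(failures)  # → ['bad']: 'c' was still cleaned up after 'bad' failed
```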
```python
        except ValueError:
            msg = f"Invalid memory_limit format: {self.memory_limit!r}"
            raise ValueError(msg) from None
```
The DockerSandboxConfig validator _validate_memory_limit re-raises a ValueError without preserving the original exception when parsing fails: raise ValueError(msg) from None. This discards the original ValueError from int(), making debugging harder than necessary. The convention in other model validators in this codebase (e.g., sandboxing_config.py) does not suppress the original exception. Consider using raise ValueError(msg) from exc or just allowing the int() ValueError to propagate with an improved message.
```python
        super().__init__(
            name=f"mcp_{tool_info.server_name}_{tool_info.name}",
            description=tool_info.description,
            parameters_schema=tool_info.input_schema or None,
            category=ToolCategory.MCP,
        )
```
In MCPBridgeTool.__init__, the parameters_schema is set to tool_info.input_schema or None. Since MCPToolInfo.input_schema defaults to an empty dict {}, and an empty dict is falsy, a tool with no schema parameters will correctly pass None. However, a non-empty schema like {"type": "object", "properties": {}} (an object schema with no properties) is truthy and will be used — this is correct. The concern is that the or None idiom is subtly different from if not input_schema: None, but they behave identically for the empty-dict case, which matches the test at line 84-102. This is fine as-is, but an explicit tool_info.input_schema if tool_info.input_schema else None would be more readable.
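The empty-dict versus empty-object-schema distinction walked through above, as a quick check:

```python
empty_schema: dict = {}
object_schema_no_props = {"type": "object", "properties": {}}

# `x or None` maps the falsy empty dict to None but passes non-empty
# schemas through unchanged, matching the behavior described above.
print(empty_schema or None, (object_schema_no_props or None) is object_schema_no_props)
# → None True
```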
Code Review
This pull request significantly enhances the project's capabilities by adding a hardened Docker sandbox, a comprehensive MCP bridge, and a new code runner tool. However, several security vulnerabilities were identified: insecure environment inheritance in the MCP stdio client which could leak host secrets to MCP servers, unimplemented network isolation controls in the Docker sandbox despite being present in the configuration, potential DoS via unbounded log collection, and sensitive information leakage in debug logs. Additionally, a minor bug in the Docker sandbox's memory limit parsing could lead to an unhandled exception. Addressing these will further strengthen the security posture and stability of the tool system.
```python
            self._config.memory_limit,
        )
        nano_cpus = int(self._config.cpu_limit * _NANO_CPUS_MULTIPLIER)
```
The DockerSandboxConfig defines allowed_hosts and network_overrides fields intended for network filtering and per-category isolation. However, the DockerSandbox implementation completely ignores these fields in _build_container_config. This creates a false sense of security; users may believe they have restricted the sandbox's network access to specific hosts, while the implementation actually allows or blocks all network access based solely on the global network mode.
```python
            Connected ``ClientSession`` (not yet initialized).
        """
        if self._config.command is None:
            msg = f"Server {self._config.name!r}: stdio transport requires 'command'"
            logger.warning(
```
In _connect_stdio, the environment for the MCP server subprocess is set to None if self._config.env is empty (which is the default). In Python's subprocess management, env=None causes the child process to inherit the full environment of the parent process. This leaks all host environment variables, including sensitive API keys and credentials, to the MCP server. Pass an empty dictionary instead to prevent inheritance.
♻️ Suggested change

```python
        params = StdioServerParameters(
            command=self._config.command,
            args=list(self._config.args),
            env=dict(self._config.env),
        )
```
```python
        Returns:
            Tuple of (stdout, stderr) as strings.
        """
        stdout_logs = await container_obj.log(
            stdout=True,
            stderr=False,
        )
        stderr_logs = await container_obj.log(
            stdout=False,
            stderr=True,
        )
```
The _collect_logs method retrieves the entire stdout and stderr log history from the Docker container and joins them into strings in memory. There is no limit on the amount of data collected. A malicious or runaway process in the container could generate a massive amount of output, leading to an Out-Of-Memory (OOM) condition on the host machine when the logs are joined. Consider implementing a maximum size limit or using the tail parameter.
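One way to bound memory use is a pure-Python byte budget applied while joining the collected lines. The 1 MiB limit and the helper name below are assumed for illustration, not taken from the PR.

```python
_MAX_LOG_BYTES = 1 << 20  # assumed 1 MiB cap per stream


def join_capped(lines: list[str], limit: int = _MAX_LOG_BYTES) -> str:
    """Join container log lines, truncating once the byte budget is spent."""
    out: list[str] = []
    used = 0
    for line in lines:
        size = len(line.encode())
        if used + size > limit:
            out.append("...[truncated]")
            break
        out.append(line)
        used += size
    return "".join(out)


print(join_capped(["a" * 10, "b" * 10], limit=15))  # → aaaaaaaaaa...[truncated]
```

Passing `tail=N` to the log call would additionally limit how much the daemon returns in the first place, but the in-process cap guards against very long individual lines either way.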
```python
        )
        container_cwd = self._resolve_cwd_in_container(cwd)

        logger.debug(
            DOCKER_EXECUTE_START,
            command=command,
            args=args,
            cwd=container_cwd,
```
The execute method logs the full command and all arguments at the DEBUG level. When used by the CodeRunnerTool, the args include the entire source code snippet being executed. If this code contains hardcoded secrets or sensitive data injected by the LLM, they will be persisted in the host's logs in plain text. Consider truncating the logged arguments or avoiding logging the full code snippet.
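A sketch of the truncation idea; the 200-character budget and the helper name are assumptions for illustration.

```python
_MAX_LOGGED_ARG_CHARS = 200  # assumed per-argument budget


def truncate_args_for_log(
    args: list[str], limit: int = _MAX_LOGGED_ARG_CHARS
) -> list[str]:
    """Shorten long arguments (e.g. whole code snippets) before logging."""
    return [
        arg if len(arg) <= limit else f"{arg[:limit]}...({len(arg)} chars)"
        for arg in args
    ]


snippet = "print('secret')" * 100  # 1500 chars, stands in for injected code
print(truncate_args_for_log(["python", "-c", snippet])[-1].endswith("(1500 chars)"))
# → True
```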
```python
        multipliers = {"k": 1024, "m": 1024**2, "g": 1024**3}
        if limit_lower[-1] in multipliers:
            result = int(limit_lower[:-1]) * multipliers[limit_lower[-1]]
        else:
```
The _parse_memory_limit function can crash with a ValueError if a user provides a memory limit that consists only of a suffix, like "m" or "g". The NotBlankStr validator on the config model allows this input, but int(limit_lower[:-1]) will fail when limit_lower[:-1] is an empty string. This unhandled exception could crash the service when creating a Docker container.
I suggest wrapping the parsing logic in a try...except block to catch this and raise a more informative ValueError, making the function more resilient to malformed user input.
♻️ Suggested change

```python
        try:
            if limit_lower[-1] in multipliers:
                result = int(limit_lower[:-1]) * multipliers[limit_lower[-1]]
            else:
                result = int(limit_lower)
        except (ValueError, IndexError) as exc:
            raise ValueError(f"Invalid memory limit format: {limit!r}") from exc
```
Actionable comments posted: 1
♻️ Duplicate comments (2)
tests/unit/tools/sandbox/test_sandboxing_config.py (2)

39-45: 🛠️ Refactor suggestion | 🟠 Major
Fix the `overrides` type annotation to match the expected literal type.
The `overrides` variable is annotated as `dict[str, str]` but `SandboxingConfig` expects `dict[str, Literal['subprocess', 'docker']]`. This forces the `# type: ignore[arg-type]` on line 44. Use the proper literal type to eliminate the suppression.

♻️ Suggested fix

```diff
+from typing import Literal
+
+BackendName = Literal["subprocess", "docker"]
+
 def test_overrides(self) -> None:
-    overrides: dict[str, str] = {
+    overrides: dict[str, BackendName] = {
         "code_execution": "docker",
         "terminal": "subprocess",
     }
-    config = SandboxingConfig(overrides=overrides)  # type: ignore[arg-type]
+    config = SandboxingConfig(overrides=overrides)
     assert config.overrides == overrides
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/tools/sandbox/test_sandboxing_config.py` around lines 39 - 45, The test's overrides variable is annotated too broadly as dict[str, str]; update its type to dict[str, Literal['subprocess', 'docker']] to match SandboxingConfig's expected type (and remove the "# type: ignore[arg-type]" suppression), and add the necessary Literal import from typing at the top of the test file so the annotation is valid; reference: the overrides variable in test_overrides and the SandboxingConfig construction.

81-98: 🛠️ Refactor suggestion | 🟠 Major
Fix parameterized test parameter types to use the literal backend type.
The parameters `default`, `override_backend`, and `expected` are typed as plain `str`, requiring `# type: ignore` suppressions on lines 95-96. Use the proper literal type to eliminate these suppressions and ensure type safety.

♻️ Suggested fix

```diff
+from typing import Literal
+
+BackendName = Literal["subprocess", "docker"]
+
 @pytest.mark.parametrize(
     ("default", "override_backend", "expected"),
     [
         ("subprocess", "docker", "docker"),
         ("docker", "subprocess", "subprocess"),
     ],
 )
 def test_parametrized_routing(
     self,
-    default: str,
-    override_backend: str,
-    expected: str,
+    default: BackendName,
+    override_backend: BackendName,
+    expected: BackendName,
 ) -> None:
     config = SandboxingConfig(
-        default_backend=default,  # type: ignore[arg-type]
-        overrides={"code_execution": override_backend},  # type: ignore[dict-item]
+        default_backend=default,
+        overrides={"code_execution": override_backend},
     )
     assert config.backend_for_category("code_execution") == expected
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/tools/sandbox/test_sandboxing_config.py` around lines 81 - 98, The test uses plain str types causing type ignores; change the parameter annotations in test_parametrized_routing to the specific Literal backend type (e.g., typing.Literal["docker", "subprocess"]) for default, override_backend, and expected so the SandboxingConfig construction and overrides={"code_execution": ...} no longer require # type: ignore; keep the same param values and assertions against SandboxingConfig.backend_for_category("code_execution").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/integration/tools/test_docker_sandbox_integration.py`:
- Line 18: The per-test timeout set on pytestmark in
tests/integration/tools/test_docker_sandbox_integration.py currently uses 60s;
change pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)] to use
30s (pytest.mark.timeout(30)) to conform to the coding guideline, or if 60s is
required for Docker startup, add a short comment above the pytestmark explaining
the exception and why this test needs a 60s timeout so reviewers know the
deviation is intentional.
---
Duplicate comments:
In `@tests/unit/tools/sandbox/test_sandboxing_config.py`:
- Around line 39-45: The test's overrides variable is annotated too broadly as
dict[str, str]; update its type to dict[str, Literal['subprocess', 'docker']] to
match SandboxingConfig's expected type (and remove the "# type:
ignore[arg-type]" suppression), and add the necessary Literal import from typing
at the top of the test file so the annotation is valid; reference: the overrides
variable in test_overrides and the SandboxingConfig construction.
- Around line 81-98: The test uses plain str types causing type ignores; change
the parameter annotations in test_parametrized_routing to the specific Literal
backend type (e.g., typing.Literal["docker", "subprocess"]) for default,
override_backend, and expected so the SandboxingConfig construction and
overrides={"code_execution": ...} no longer require # type: ignore; keep the
same param values and assertions against
SandboxingConfig.backend_for_category("code_execution").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 82487bb7-aece-458e-a6c6-5d9d8e73bd89
📒 Files selected for processing (2)
tests/integration/tools/test_docker_sandbox_integration.py
tests/unit/tools/sandbox/test_sandboxing_config.py
📜 Review details
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Nofrom __future__ import annotations— Python 3.14 has PEP 649 native lazy annotations
Use PEP 758 except syntax: useexcept A, B:(no parentheses) — ruff enforces this on Python 3.14
All public functions must have type hints; mypy strict mode is enforced
Public classes and functions must have Google-style docstrings (enforced by ruff D rules)
Every module with business logic must include logging setup:from ai_company.observability import get_loggerthenlogger = get_logger(__name__). Never useimport loggingorprint()in application code. Variable name must always belogger
Use structured logging with event name constants:logger.info(EVENT_CONSTANT, key=value)— never use string interpolation likelogger.info('msg %s', val). Import event constants directly fromai_company.observability.events.<domain>
All error paths must log at WARNING or ERROR with context before raising; all state transitions must log at INFO; DEBUG for object creation, internal flow, and entry/exit of key functions
Line length maximum is 88 characters (enforced by ruff)
Functions must be < 50 lines; files must be < 800 lines
Handle errors explicitly; never silently swallow exceptions
Validate at system boundaries: user input, external APIs, and config files. Do not validate unnecessarily at internal boundaries
Files:
tests/integration/tools/test_docker_sandbox_integration.py
tests/unit/tools/sandbox/test_sandboxing_config.py
tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/*.py: Use pytest markers: `@pytest.mark.unit`, `@pytest.mark.integration`, `@pytest.mark.e2e`, `@pytest.mark.slow`. Minimum coverage 80% (enforced in CI). Async mode: `asyncio_mode = 'auto'` — no manual `@pytest.mark.asyncio` needed. Per-test timeout: 30 seconds. Use `pytest-xdist` via `-n auto` for parallelism. Prefer `@pytest.mark.parametrize` for testing similar cases
NEVER use real vendor names (Anthropic, OpenAI, Claude, GPT, etc.) in project-owned code, docstrings, comments, tests, or config examples. Use generic names: `example-provider`, `example-large-001`, `example-medium-001`, `example-small-001`, `large`/`medium`/`small` as aliases. Tests must use `test-provider`, `test-small-001`, etc. Vendor names may only appear in: (1) DESIGN_SPEC.md provider list, (2) `.claude/` files, (3) third-party import paths
Files:
tests/integration/tools/test_docker_sandbox_integration.py
tests/unit/tools/sandbox/test_sandboxing_config.py
🧬 Code graph analysis (1)
tests/integration/tools/test_docker_sandbox_integration.py (3)
src/ai_company/tools/sandbox/docker_config.py (1)
DockerSandboxConfig (12-94)
src/ai_company/tools/sandbox/docker_sandbox.py (6)
DockerSandbox (70-610), config (112-114), workspace (117-119), execute (267-319), cleanup (558-583), health_check (585-606)
src/ai_company/tools/sandbox/protocol.py (3)
execute (25-51), cleanup (53-59), health_check (61-67)
🔇 Additional comments (10)
tests/unit/tools/sandbox/test_sandboxing_config.py (4)
1-9: LGTM! Module setup is correct: imports are clean, and the
`pytestmark` applies both `pytest.mark.unit` and `pytest.mark.timeout(30)` as required by coding guidelines.
12-25: LGTM! Default value assertions and frozen behavior test are well-structured. The
`# type: ignore[misc]` on line 25 is appropriate since the test intentionally attempts an invalid mutation to verify immutability.
53-57: LGTM! Clean test for custom Docker configuration integration.
63-79: LGTM! Individual routing tests are clear and cover the key scenarios: default when no override, override when present, and default for unconfigured categories.
tests/integration/tools/test_docker_sandbox_integration.py (6)
1-17: LGTM! Clean imports and appropriate use of
`TYPE_CHECKING` for the `Path` import.
24-45: Image availability check now implemented — prior concern resolved. The helper correctly verifies both Docker daemon connectivity
(`client.version()`) and image presence (`client.images.inspect(_TEST_IMAGE)`) before running tests. The try/finally pattern ensures proper client cleanup on all code paths. Tests will now skip cleanly in CI environments where the image isn't pre-pulled.
48-51: LGTM! The skip marker properly gates tests on both Docker daemon availability and image presence, with a clear skip reason.
54-77: LGTM! Test class is properly decorated with the skip marker.
`test_run_python_code` correctly uses try/finally for cleanup and validates both success status and stdout content.
78-97: LGTM! The timeout test correctly uses a short execution timeout (2.0s) against a long-running command (sleep 60) to verify timeout enforcement. Proper assertions for both
`timed_out` and `success` flags.
99-105: LGTM! Health check test appropriately uses default sandbox configuration since
`health_check()` only verifies daemon connectivity without requiring a specific image. Cleanup is properly handled.
```python
if TYPE_CHECKING:
    from pathlib import Path


pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)]
```
🧹 Nitpick | 🔵 Trivial
Per-test timeout exceeds coding guideline.
The guideline specifies "Per-test timeout: 30 seconds" but this module uses 60 seconds. While the extended timeout may be justified for Docker integration tests (container startup overhead), consider documenting this exception or adjusting if 30s is feasible after the skip logic filters out environments without pre-pulled images.
As per coding guidelines: "Per-test timeout: 30 seconds."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/integration/tools/test_docker_sandbox_integration.py` at line 18, The
per-test timeout set on pytestmark in
tests/integration/tools/test_docker_sandbox_integration.py currently uses 60s;
change pytestmark = [pytest.mark.integration, pytest.mark.timeout(60)] to use
30s (pytest.mark.timeout(30)) to conform to the coding guideline, or if 60s is
required for Docker startup, add a short comment above the pytestmark explaining
the exception and why this test needs a 60s timeout so reviewers know the
deviation is intentional.
Key fixes:
- Docker sandbox: add /tmp tmpfs mount for ReadonlyRootfs (#31/#46), add asyncio.Lock to _ensure_docker/cleanup (#38/#48), truncate stderr in structured logs (#40), refactor _start_and_wait into helpers (#39), remove per-loop try/except in cleanup (#50), remove dead auto_remove config field (#21)
- MCP client: move _require_session() inside lock in call_tool() to fix TOCTOU race (#32/#49), reject double connect() (#35), add BaseException handler in connect() for CancelledError (#36)
- MCP factory: log before raising in reuse guard (#37)
- MCP result_mapper: split MCP_RESULT_UNKNOWN_BLOCK from MCP_RESULT_MAPPED event (#33)
- sandboxing_config: add required logger (#41)
- docker_config: preserve original ValueError in chain (#51)
- DESIGN_SPEC.md: remove duplicate tools/sandbox/ tree (#34)
- Tests: assert stop+delete counts (#44), add tmpfs test
```python
        "minimum": 0,
        "maximum": 600,
    },
```
timeout: 0 always produces an immediate timeout
The schema specifies "minimum": 0, which allows a caller to pass timeout=0.0. This value flows through to DockerSandbox._wait_for_exit as asyncio.wait_for(container_obj.wait(), timeout=0.0). In Python's asyncio, a timeout <= 0 causes an immediate TimeoutError — the container is stopped before it even runs a single instruction, and every invocation unconditionally returns timed_out=True.
This is the same constraint already applied by DockerSandboxConfig.timeout_seconds (gt=0) and MCPServerConfig.timeout_seconds (gt=0, le=600). The JSON Schema equivalent is "exclusiveMinimum":
```suggestion
"timeout": {
    "type": "number",
    "description": "Optional timeout in seconds",
    "exclusiveMinimum": 0,
    "maximum": 600,
},
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ai_company/tools/code_runner.py
Line: 45-47
How can I resolve this? If you propose a fix, please make it concise.

```python
            except MCPError as exc:
                logger.warning(
                    MCP_INVOKE_FAILED,
                    tool=self._tool_info.name,
                    server=self._tool_info.server_name,
                    error=str(exc),
                )
                return ToolExecutionResult(
                    content=str(exc),
                    is_error=True,
                )
```
Double emission of MCP_INVOKE_FAILED for invocation errors
MCPClient.call_tool() already logs MCP_INVOKE_FAILED (via logger.exception) before raising MCPInvocationError. The bridge then catches that MCPError here and emits MCP_INVOKE_FAILED a second time. Every tool invocation error therefore produces two mcp.invoke.failed log records, doubling alert noise and skewing aggregation counts.
The same double-firing occurs for timeouts: call_tool emits MCP_INVOKE_TIMEOUT, then the bridge emits MCP_INVOKE_FAILED for the resulting MCPTimeoutError, making it appear in timeout dashboards and general failure dashboards simultaneously.
Since the client-side logs already include server and tool context, the bridge-level log is redundant. Either drop the logger.warning here and let the client's log stand, or demote it to logger.debug to signal "boundary crossed" without duplicating the event at the same severity.
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/ai_company/tools/mcp/bridge_tool.py
Line: 147-157
How can I resolve this? If you propose a fix, please make it concise.
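One concise resolution matching the comment above — catch, wrap, and return an error result without re-emitting the event at WARNING — can be sketched as follows. All names here are simplified stand-ins for the project's `MCPError`, `MCPClient.call_tool`, and `ToolExecutionResult`:

```python
# Simplified stand-ins: the real MCPError/ToolExecutionResult live in the
# project; this sketch only shows the "wrap without re-logging" shape.
class MCPError(Exception):
    pass


def bridge_invoke(call):
    """Translate an MCP failure into an (content, is_error) result.

    The client is assumed to have already logged MCP_INVOKE_FAILED with
    server/tool context before raising, so nothing is re-logged here
    (a logger.debug "boundary crossed" line could go in the except block
    if a trace of the translation is still wanted).
    """
    try:
        return call(), False
    except MCPError as exc:
        return str(exc), True


def failing_tool():
    raise MCPError("server unavailable")


content, is_error = bridge_invoke(failing_tool)
print(content, is_error)
# prints: server unavailable True
```

Each failure then produces exactly one `mcp.invoke.failed` record, keeping alert counts and dashboards accurate.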
Summary
- Docker/sandbox config: mount_mode default `ro`
- MCP client: connect timeout enforced via `asyncio.wait_for`, AsyncExitStack cleanup on all failure paths, proper disconnect error handling with dedicated event constants
- MCP factory: refactored `_connect_all` to reduce complexity, removed unused variable
- MCP config: `connect_timeout_seconds` `le=120`, `result_cache_ttl_seconds` `ge=0`, removed dead `cache_ttl_seconds` field, validation uses `MCP_CONFIG_VALIDATION_FAILED` event
- Docker/sandbox config: `Field(description=...)`, network override validation, `NotBlankStr` for identifier fields, `_BackendName` type alias with validator
- Code runner: strict `arguments["code"]`/`arguments["language"]` access (required params), timeout schema `maximum: 600`

Closes #50, #53
Test plan
- `ruff check` — all passed
- `ruff format` — all formatted
- `mypy src/ tests/` — no issues in 678 files
- `pytest tests/ -n auto --cov=ai_company --cov-fail-under=80` — 5391 passed, 95% coverage

Review coverage
Pre-reviewed by 10 agents (code-reviewer, python-reviewer, pr-test-analyzer, silent-failure-hunter, comment-analyzer, type-design-analyzer, logging-audit, resilience-audit, security-reviewer, docs-consistency). 60 findings consolidated, all implemented.
🤖 Generated with Claude Code