[FEATURE]: Tool invocation timeouts and circuit breaker #2078
Description
⏱️ Feature: Tool Invocation Timeouts and Circuit Breaker
Goal
Implement enforced timeouts for all tool invocations (MCP and REST) and add a per-tool circuit breaker to prevent cascading failures from repeatedly failing tools. This ensures predictable behavior and system resilience.
Why Now?
- Hanging Requests: Tool calls without timeouts can block indefinitely, exhausting connection pools and degrading UX
- Cascading Failures: A single failing tool can impact gateway performance when retried repeatedly
- Config Exists But Unused: TOOL_TIMEOUT, timeout_ms field, and retry settings exist but are not enforced in invocation code
- Session Pool Has It: MCP session pool already has circuit breaker (mcp_session_pool_circuit_breaker_threshold); tools need the same protection
📖 User Stories
US-1: Developer - Predictable Tool Timeout Behavior
As a Developer integrating with the gateway
I want tool calls to time out predictably
So that my application doesn't hang waiting for unresponsive tools
Acceptance Criteria:
Scenario: Tool times out after configured duration
Given a tool "slow_api" with timeout_ms set to 5000
And the upstream service takes 10 seconds to respond
When I invoke the tool via POST /servers/{id}/tools/{tool}/call
Then the request should fail after ~5 seconds
And the response should have status 504 or JSON-RPC error
And the error message should indicate "Tool invocation timed out after 5000ms"
Scenario: Use global timeout when per-tool timeout not set
Given a tool "default_api" with no timeout_ms configured
And TOOL_TIMEOUT is set to 60 seconds
When I invoke the tool and it takes 70 seconds
Then the request should fail after ~60 seconds
And the error should indicate timeout
Scenario: Per-tool timeout overrides global
Given TOOL_TIMEOUT is 60 seconds
And a tool has timeout_ms set to 10000
When I invoke the tool and it takes 15 seconds
Then the request should fail after ~10 seconds
Technical Requirements:
- Wrap tool invocation in asyncio.wait_for()
- Use tool.timeout_ms if set, else settings.tool_timeout
- Return consistent JSON-RPC error code for timeout (-32000 or -32603)
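The requirements above could be sketched roughly as follows. This is an illustration only, not the gateway's code: `ToolTimeoutError`, `resolve_timeout_seconds`, and `invoke_with_timeout` are hypothetical names, and the sketch assumes `timeout_ms` is in milliseconds while the global `TOOL_TIMEOUT` is in seconds, as the scenarios suggest.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class ToolTimeoutError(Exception):
    """Hypothetical exception carrying the timeout that was applied."""

    def __init__(self, tool_id: str, timeout_ms: int) -> None:
        super().__init__(f"Tool invocation timed out after {timeout_ms}ms")
        self.tool_id = tool_id
        self.timeout_ms = timeout_ms


def resolve_timeout_seconds(
    tool_timeout_ms: Optional[int], global_timeout_s: float
) -> float:
    """Per-tool timeout_ms (milliseconds) wins; otherwise use the global value (seconds)."""
    if tool_timeout_ms:  # assumption: None or 0 means "not set"
        return tool_timeout_ms / 1000.0
    return float(global_timeout_s)


async def invoke_with_timeout(
    tool_id: str,
    invoke: Callable[[], Awaitable[Any]],
    tool_timeout_ms: Optional[int],
    global_timeout_s: float,
) -> Any:
    """Run the invocation under asyncio.wait_for and translate the timeout."""
    timeout_s = resolve_timeout_seconds(tool_timeout_ms, global_timeout_s)
    try:
        return await asyncio.wait_for(invoke(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the inner awaitable before raising.
        raise ToolTimeoutError(tool_id, int(round(timeout_s * 1000))) from None
```

Keeping the unit conversion in one helper avoids mixing milliseconds and seconds at the call sites.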
US-2: Operator - Circuit Breaker Prevents Cascading Failures
As a Platform Operator
I want failing tools to be temporarily disabled
So that repeated failures don't impact overall gateway health
Acceptance Criteria:
Scenario: Circuit opens after threshold failures
Given TOOL_CIRCUIT_BREAKER_THRESHOLD is 5
And TOOL_CIRCUIT_BREAKER_RESET is 60 seconds
When a tool fails 5 consecutive times
Then the circuit should open for that tool
And subsequent calls should fail immediately with "Circuit breaker open"
And the failure should be logged with tool_id and failure count
Scenario: Circuit resets after cooldown period
Given a tool's circuit breaker is open
When 60 seconds have elapsed
Then the next call should attempt the tool (half-open state)
And if successful, the circuit should close
And if failed, the circuit should reopen for another reset period
Scenario: Successful calls reset failure counter
Given a tool has 3 consecutive failures
When the next call succeeds
Then the failure counter should reset to 0
Technical Requirements:
- Track failures per tool_id (in-memory dict or Redis)
- Configurable threshold and reset duration
- Three states: closed (normal), open (fast-fail), half-open (testing)
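A minimal in-memory sketch of the three states, assuming consecutive-failure counting as in the scenarios above. The class name `ConsecutiveFailureBreaker`, the `allow()` and `retry_after_seconds()` methods, and the injectable `clock` are hypothetical. Locking and a Redis backend are left out, and the sketch does not limit the number of concurrent probes in the half-open state.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class _Entry:
    state: str = "closed"    # "closed" | "open" | "half_open"
    failures: int = 0        # consecutive failures since the last success
    open_until: float = 0.0  # clock value after which an open circuit may be probed


class ConsecutiveFailureBreaker:
    """Per-tool breaker keyed by tool_id (hypothetical sketch, not thread-safe)."""

    def __init__(
        self,
        threshold: int = 5,
        reset_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = threshold
        self._reset_seconds = reset_seconds
        self._clock = clock
        self._entries: Dict[str, _Entry] = {}

    def _entry(self, tool_id: str) -> _Entry:
        return self._entries.setdefault(tool_id, _Entry())

    def state(self, tool_id: str) -> str:
        return self._entry(tool_id).state

    def allow(self, tool_id: str) -> bool:
        """Return False to fast-fail; move open -> half_open once the cooldown elapsed."""
        entry = self._entry(tool_id)
        if entry.state == "open":
            if self._clock() < entry.open_until:
                return False
            entry.state = "half_open"  # let a probe call through
        return True

    def record_success(self, tool_id: str) -> None:
        entry = self._entry(tool_id)
        entry.failures = 0
        entry.state = "closed"

    def record_failure(self, tool_id: str) -> bool:
        """Return True if this failure opened (or reopened) the circuit."""
        entry = self._entry(tool_id)
        entry.failures += 1
        if entry.state == "half_open" or entry.failures >= self._threshold:
            just_opened = entry.state != "open"
            entry.state = "open"
            entry.open_until = self._clock() + self._reset_seconds
            return just_opened
        return False

    def retry_after_seconds(self, tool_id: str) -> float:
        entry = self._entry(tool_id)
        if entry.state != "open":
            return 0.0
        return max(0.0, entry.open_until - self._clock())
```

Injecting the clock lets unit tests advance time without sleeping through the reset period.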
US-3: Developer - Clear Timeout Error Messages
As a Developer debugging tool failures
I want timeout errors to be clearly distinguished from other errors
So that I can quickly identify and fix timeout issues
Acceptance Criteria:
Scenario: Timeout error has distinct error code
When a tool invocation times out
Then the JSON-RPC response should have:
| Field | Value |
| error.code | -32000 |
| error.message | Tool invocation timed out |
| error.data.timeout_ms | 5000 |
| error.data.tool_id | {tool_id} |
Scenario: Circuit breaker error is distinct
When a tool call is rejected due to open circuit
Then the JSON-RPC response should have:
| Field | Value |
| error.code | -32001 |
| error.message | Circuit breaker open |
| error.data.retry_after_seconds | 45 |
🏗 Architecture
Timeout Enforcement Flow
sequenceDiagram
participant Client
participant Gateway
participant TimeoutWrapper
participant Tool as Tool/MCP Server
Client->>Gateway: POST /servers/{id}/tools/{tool}/call
Gateway->>Gateway: Get tool.timeout_ms or TOOL_TIMEOUT
Gateway->>TimeoutWrapper: asyncio.wait_for(invoke(), timeout)
alt Tool responds in time
TimeoutWrapper->>Tool: Execute tool
Tool-->>TimeoutWrapper: Result
TimeoutWrapper-->>Gateway: Result
Gateway-->>Client: Success response
else Timeout exceeded
TimeoutWrapper--xGateway: asyncio.TimeoutError
Gateway-->>Client: JSON-RPC error (-32000)
end
Circuit Breaker State Machine
stateDiagram-v2
[*] --> Closed
Closed --> Open: failures >= threshold
Open --> HalfOpen: reset_time elapsed
HalfOpen --> Closed: success
HalfOpen --> Open: failure
Closed --> Closed: success (reset counter)
Code Structure
# mcpgateway/services/circuit_breaker.py
from typing import Dict, List


class ToolCircuitBreaker:
    """Per-tool circuit breaker with configurable threshold and reset."""

    def __init__(
        self,
        threshold: int = 5,
        reset_seconds: float = 60.0,
        window_seconds: float = 300.0,  # sliding window for failures
    ):
        self._threshold = threshold
        self._reset_seconds = reset_seconds
        self._window_seconds = window_seconds
        self._failures: Dict[str, List[float]] = {}  # tool_id -> [failure timestamps]
        self._open_until: Dict[str, float] = {}  # tool_id -> time until which the circuit stays open

    def is_open(self, tool_id: str) -> bool:
        """Check if circuit is open for tool."""
        ...

    def record_success(self, tool_id: str) -> None:
        """Record successful call, reset failure count."""
        ...

    def record_failure(self, tool_id: str) -> bool:
        """Record failure, return True if circuit just opened."""
        ...
📋 Implementation Tasks
Phase 1: Timeout Enforcement
- Add asyncio.wait_for() wrapper around REST tool invocation in tool_service.py
- Add asyncio.wait_for() wrapper around MCP tool invocation
- Use tool.timeout_ms if set, else settings.tool_timeout
- Convert asyncio.TimeoutError to JSON-RPC error response
- Log timeout events with tool_id and configured timeout
Phase 2: Circuit Breaker Service
- Create mcpgateway/services/circuit_breaker.py
- Implement ToolCircuitBreaker class with closed/open/half-open states
- Track failures per tool_id with sliding window
- Add thread-safe access (asyncio.Lock or threading.Lock)
- Implement is_open(), record_success(), record_failure() methods
Phase 3: Configuration
- Add TOOL_CIRCUIT_BREAKER_ENABLED setting (default: True)
- Add TOOL_CIRCUIT_BREAKER_THRESHOLD setting (default: 5)
- Add TOOL_CIRCUIT_BREAKER_RESET setting (default: 60 seconds)
- Add TOOL_CIRCUIT_BREAKER_WINDOW setting (default: 300 seconds)
- Document settings in .env.example
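For illustration, the four settings could be parsed like this. The gateway's actual settings mechanism in config.py may differ. This is a standard-library stand-in, and `ToolBreakerSettings` and `load_breaker_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ToolBreakerSettings:
    """Hypothetical container; defaults mirror the values listed in Phase 3."""

    enabled: bool = True
    threshold: int = 5
    reset_seconds: float = 60.0
    window_seconds: float = 300.0


def _as_bool(raw: str) -> bool:
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_breaker_settings(env: Mapping[str, str] = os.environ) -> ToolBreakerSettings:
    """Read the TOOL_CIRCUIT_BREAKER_* variables, falling back to the defaults."""
    defaults = ToolBreakerSettings()
    settings = ToolBreakerSettings(
        enabled=_as_bool(env.get("TOOL_CIRCUIT_BREAKER_ENABLED", str(defaults.enabled))),
        threshold=int(env.get("TOOL_CIRCUIT_BREAKER_THRESHOLD", defaults.threshold)),
        reset_seconds=float(env.get("TOOL_CIRCUIT_BREAKER_RESET", defaults.reset_seconds)),
        window_seconds=float(env.get("TOOL_CIRCUIT_BREAKER_WINDOW", defaults.window_seconds)),
    )
    if settings.threshold < 1:
        raise ValueError("TOOL_CIRCUIT_BREAKER_THRESHOLD must be >= 1")
    return settings
```

Passing the environment as a mapping keeps the loader testable without mutating `os.environ`.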
Phase 4: Integration
- Inject circuit breaker checks before tool invocation
- Record success/failure after invocation completes
- Return immediate error when circuit is open
- Include retry_after_seconds in error response
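One possible shape for the integration steps above, as a sketch. `is_open()`, `record_success()`, and `record_failure()` come from the Code Structure section. `guarded_call`, `CircuitOpenError`, and the `retry_after_seconds()` helper are hypothetical. The sketch also assumes that timeouts count as failures, which the issue does not state explicitly.

```python
import asyncio
from typing import Any, Awaitable, Callable, Protocol


class Breaker(Protocol):
    """Interface from the Code Structure section, plus one hypothetical helper."""

    def is_open(self, tool_id: str) -> bool: ...
    def record_success(self, tool_id: str) -> None: ...
    def record_failure(self, tool_id: str) -> bool: ...
    def retry_after_seconds(self, tool_id: str) -> float: ...  # hypothetical


class CircuitOpenError(Exception):
    """Hypothetical error raised when a call is rejected without being attempted."""

    def __init__(self, tool_id: str, retry_after_seconds: float) -> None:
        super().__init__("Circuit breaker open")
        self.tool_id = tool_id
        self.retry_after_seconds = retry_after_seconds


async def guarded_call(
    tool_id: str,
    invoke: Callable[[], Awaitable[Any]],
    timeout_s: float,
    breaker: Breaker,
) -> Any:
    # 1. Fast-fail before touching the upstream tool.
    if breaker.is_open(tool_id):
        raise CircuitOpenError(tool_id, breaker.retry_after_seconds(tool_id))
    # 2. Invoke under the timeout. Any exception, including
    #    asyncio.TimeoutError, is recorded as a failure (assumption).
    try:
        result = await asyncio.wait_for(invoke(), timeout=timeout_s)
    except Exception:
        breaker.record_failure(tool_id)
        raise
    # 3. A completed call resets the failure count.
    breaker.record_success(tool_id)
    return result
```

Client cancellation (`asyncio.CancelledError`) is not an `Exception` subclass on current Python versions, so in this sketch it is not counted against the tool.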
Phase 5: Metrics and Logging
- Add metrics: tool_timeout_total, circuit_breaker_open_total, circuit_breaker_state
- Log circuit state transitions (opened, half-open, closed)
- Include tool_id in all circuit breaker log messages
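A rough sketch of the bookkeeping behind these metrics, using only the standard library as a stand-in for whatever metrics backend the gateway uses. `BreakerMetrics` and its method names are hypothetical. Only the metric names come from the list above.

```python
import logging
from collections import Counter
from typing import Dict, Tuple

logger = logging.getLogger("tool_circuit_breaker")


class BreakerMetrics:
    """In-memory stand-in for real counters and gauges (hypothetical)."""

    def __init__(self) -> None:
        # (metric_name, tool_id) -> count
        self.counters: Counter = Counter()
        # tool_id -> current state, the value a circuit_breaker_state gauge would expose
        self.circuit_breaker_state: Dict[str, str] = {}

    def on_timeout(self, tool_id: str, timeout_ms: int) -> None:
        key: Tuple[str, str] = ("tool_timeout_total", tool_id)
        self.counters[key] += 1
        logger.warning("tool timeout tool_id=%s timeout_ms=%d", tool_id, timeout_ms)

    def on_transition(self, tool_id: str, new_state: str) -> None:
        old_state = self.circuit_breaker_state.get(tool_id, "closed")
        if old_state == new_state:
            return  # not a transition, nothing to log
        self.circuit_breaker_state[tool_id] = new_state
        if new_state == "open":
            self.counters[("circuit_breaker_open_total", tool_id)] += 1
        logger.warning(
            "circuit breaker transition tool_id=%s from=%s to=%s",
            tool_id, old_state, new_state,
        )
```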
Phase 6: Testing
- Unit tests for timeout wrapper
- Unit tests for circuit breaker state transitions
- Integration test for timeout behavior
- Integration test for circuit breaker opening after failures
- Integration test for circuit recovery
⚙️ Configuration Example
# .env.example additions
# Tool invocation timeout (seconds) - applies when tool.timeout_ms not set
TOOL_TIMEOUT=60
# Circuit breaker settings
TOOL_CIRCUIT_BREAKER_ENABLED=true
TOOL_CIRCUIT_BREAKER_THRESHOLD=5 # Failures before circuit opens
TOOL_CIRCUIT_BREAKER_RESET=60 # Seconds before testing recovery
TOOL_CIRCUIT_BREAKER_WINDOW=300 # Sliding window for counting failures
✅ Success Criteria
- All tool calls (REST and MCP) respect timeout settings
- Per-tool timeout_ms overrides global TOOL_TIMEOUT
- Timeout errors return consistent JSON-RPC error code
- Circuit breaker opens after configured failure threshold
- Circuit breaker resets after cooldown period
- Metrics exposed for monitoring timeout and circuit breaker events
- Settings documented in .env.example
- Unit test coverage > 90% for new code
🏁 Definition of Done
- Timeout enforcement implemented for REST and MCP tools
- Circuit breaker service implemented
- Configuration settings added
- Metrics and logging integrated
- Unit tests written and passing
- Integration tests written and passing
- Code passes make verify
- Settings documented in .env.example
- PR reviewed and approved
📝 Additional Notes
Current State (Gaps to Address)
| Component | Status | Gap |
|---|---|---|
| TOOL_TIMEOUT config | Exists | Not enforced in invocation code |
| tool.timeout_ms field | Exists in DbTool | Not used during invocation |
| MCP session pool circuit breaker | Implemented | Per-pool, not per-tool |
| REST tool timeout | HTTPX global timeout | Not per-tool configurable |
Error Codes
| Code | Meaning |
|---|---|
| -32000 | Tool invocation timed out |
| -32001 | Circuit breaker open |
| -32603 | Internal error (existing) |
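To make the two new codes concrete, here is a sketch of the response bodies built as JSON-RPC 2.0 error objects. Both codes fall in the -32000 to -32099 range that JSON-RPC 2.0 reserves for implementation-defined server errors. The helper names are hypothetical. The field names follow the US-3 tables.

```python
import json
from typing import Any, Dict, Optional, Union

RequestId = Optional[Union[str, int]]

TOOL_TIMEOUT_CODE = -32000
CIRCUIT_OPEN_CODE = -32001


def _error_response(
    request_id: RequestId, code: int, message: str, data: Dict[str, Any]
) -> Dict[str, Any]:
    """Wrap an error object in a JSON-RPC 2.0 response envelope."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {"code": code, "message": message, "data": data},
    }


def timeout_error(request_id: RequestId, tool_id: str, timeout_ms: int) -> Dict[str, Any]:
    return _error_response(
        request_id,
        TOOL_TIMEOUT_CODE,
        "Tool invocation timed out",
        {"timeout_ms": timeout_ms, "tool_id": tool_id},
    )


def circuit_open_error(request_id: RequestId, retry_after_seconds: int) -> Dict[str, Any]:
    return _error_response(
        request_id,
        CIRCUIT_OPEN_CODE,
        "Circuit breaker open",
        {"retry_after_seconds": retry_after_seconds},
    )


if __name__ == "__main__":
    print(json.dumps(timeout_error(1, "slow_api", 5000), indent=2))
```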
🔗 Related Issues
- MCP session pool already has circuit breaker (reference implementation)
- settings.tool_timeout exists at config.py:1286
- DbTool.timeout_ms exists at db.py:2803