
[FEATURE]: Tool invocation timeouts and circuit breaker #2078

@crivetimihai

⏱️ Feature: Tool Invocation Timeouts and Circuit Breaker

Goal

Implement enforced timeouts for all tool invocations (MCP and REST) and add a per-tool circuit breaker to prevent cascading failures from repeatedly failing tools. This ensures predictable behavior and system resilience.

Why Now?

  1. Hanging Requests: Tool calls without timeouts can block indefinitely, exhausting connection pools and degrading UX
  2. Cascading Failures: A single failing tool can impact gateway performance when retried repeatedly
  3. Config Exists But Unused: TOOL_TIMEOUT, timeout_ms field, and retry settings exist but are not enforced in invocation code
  4. Session Pool Has It: The MCP session pool already has a circuit breaker (mcp_session_pool_circuit_breaker_threshold); tools need the same protection

📖 User Stories

US-1: Developer - Predictable Tool Timeout Behavior

As a Developer integrating with the gateway
I want tool calls to time out predictably
So that my application doesn't hang waiting for unresponsive tools

Acceptance Criteria:

Scenario: Tool times out after configured duration
  Given a tool "slow_api" with timeout_ms set to 5000
  And the upstream service takes 10 seconds to respond
  When I invoke the tool via POST /servers/{id}/tools/{tool}/call
  Then the request should fail after ~5 seconds
  And the response should be an HTTP 504 or a JSON-RPC error
  And the error message should indicate "Tool invocation timed out after 5000ms"

Scenario: Use global timeout when per-tool timeout not set
  Given a tool "default_api" with no timeout_ms configured
  And TOOL_TIMEOUT is set to 60 seconds
  When I invoke the tool and it takes 70 seconds
  Then the request should fail after ~60 seconds
  And the error should indicate timeout

Scenario: Per-tool timeout overrides global
  Given TOOL_TIMEOUT is 60 seconds
  And a tool has timeout_ms set to 10000
  When I invoke the tool and it takes 15 seconds
  Then the request should fail after ~10 seconds

Technical Requirements:

  • Wrap tool invocation in asyncio.wait_for()
  • Use tool.timeout_ms if set, else settings.tool_timeout
  • Return a consistent JSON-RPC error code for timeouts (-32000; -32603 remains the generic internal error)

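
The technical requirements above could be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: `ToolTimeoutError`, `resolve_timeout_seconds`, and `invoke_with_timeout` are hypothetical names, and the sketch assumes `timeout_ms` is in milliseconds while the global `TOOL_TIMEOUT` is in seconds, as the scenarios suggest.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class ToolTimeoutError(Exception):
    """Hypothetical exception carrying the details needed for the error response."""

    def __init__(self, tool_id: str, timeout_ms: int) -> None:
        super().__init__(f"Tool invocation timed out after {timeout_ms}ms")
        self.tool_id = tool_id
        self.timeout_ms = timeout_ms


def resolve_timeout_seconds(timeout_ms: Optional[int], global_timeout_s: float) -> float:
    """Prefer the per-tool timeout_ms; otherwise fall back to the global timeout."""
    if timeout_ms is not None and timeout_ms > 0:
        return timeout_ms / 1000.0
    return float(global_timeout_s)


async def invoke_with_timeout(
    invoke: Callable[[], Awaitable[Any]],
    tool_id: str,
    timeout_ms: Optional[int],
    global_timeout_s: float,
) -> Any:
    """Run invoke() under asyncio.wait_for and translate a timeout into ToolTimeoutError."""
    timeout_s = resolve_timeout_seconds(timeout_ms, global_timeout_s)
    try:
        return await asyncio.wait_for(invoke(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending call before raising, so nothing is left running.
        raise ToolTimeoutError(tool_id, int(round(timeout_s * 1000))) from None
```

Raising a dedicated exception keeps the timeout distinguishable from other failures, so the layer that builds the JSON-RPC response can map it to its own error code.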
US-2: Operator - Circuit Breaker Prevents Cascading Failures

As a Platform Operator
I want failing tools to be temporarily disabled
So that repeated failures don't impact overall gateway health

Acceptance Criteria:

Scenario: Circuit opens after threshold failures
  Given TOOL_CIRCUIT_BREAKER_THRESHOLD is 5
  And TOOL_CIRCUIT_BREAKER_RESET is 60 seconds
  When a tool fails 5 consecutive times
  Then the circuit should open for that tool
  And subsequent calls should fail immediately with "Circuit breaker open"
  And the failure should be logged with tool_id and failure count

Scenario: Circuit resets after cooldown period
  Given a tool's circuit breaker is open
  When 60 seconds have elapsed
  Then the next call should attempt the tool (half-open state)
  And if successful, the circuit should close
  And if it fails, the circuit should reopen for another reset period

Scenario: Successful calls reset failure counter
  Given a tool has 3 consecutive failures
  When the next call succeeds
  Then the failure counter should reset to 0

Technical Requirements:

  • Track failures per tool_id (in-memory dict or Redis)
  • Configurable threshold and reset duration
  • Three states: closed (normal), open (fast-fail), half-open (testing)

US-3: Developer - Clear Timeout Error Messages

As a Developer debugging tool failures
I want timeout errors to be clearly distinguished from other errors
So that I can quickly identify and fix timeout issues

Acceptance Criteria:

Scenario: Timeout error has distinct error code
  When a tool invocation times out
  Then the JSON-RPC response should have:
    | Field | Value |
    | error.code | -32000 |
    | error.message | Tool invocation timed out |
    | error.data.timeout_ms | 5000 |
    | error.data.tool_id | {tool_id} |

Scenario: Circuit breaker error is distinct
  When a tool call is rejected due to open circuit
  Then the JSON-RPC response should have:
    | Field | Value |
    | error.code | -32001 |
    | error.message | Circuit breaker open |
    | error.data.retry_after_seconds | 45 |
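
As a rough illustration of the two error shapes above, the payloads could be built with small helpers like these. The function and constant names are hypothetical; the field names and codes are taken from the scenarios. Both codes fall in the -32000 to -32099 range that JSON-RPC 2.0 reserves for implementation-defined server errors.

```python
from typing import Any, Dict, Optional, Union

JsonRpcId = Optional[Union[int, str]]

# Codes proposed in this issue.
TOOL_TIMEOUT_CODE = -32000
CIRCUIT_OPEN_CODE = -32001


def timeout_error(request_id: JsonRpcId, tool_id: str, timeout_ms: int) -> Dict[str, Any]:
    """Build the JSON-RPC error response for a timed-out tool invocation."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {
            "code": TOOL_TIMEOUT_CODE,
            "message": "Tool invocation timed out",
            "data": {"timeout_ms": timeout_ms, "tool_id": tool_id},
        },
    }


def circuit_open_error(request_id: JsonRpcId, retry_after_seconds: int) -> Dict[str, Any]:
    """Build the JSON-RPC error response for a call rejected by an open circuit."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "error": {
            "code": CIRCUIT_OPEN_CODE,
            "message": "Circuit breaker open",
            "data": {"retry_after_seconds": retry_after_seconds},
        },
    }
```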

🏗 Architecture

Timeout Enforcement Flow

sequenceDiagram
    participant Client
    participant Gateway
    participant TimeoutWrapper
    participant Tool as Tool/MCP Server

    Client->>Gateway: POST /servers/{id}/tools/{tool}/call
    Gateway->>Gateway: Get tool.timeout_ms or TOOL_TIMEOUT
    Gateway->>TimeoutWrapper: asyncio.wait_for(invoke(), timeout)
    
    alt Tool responds in time
        TimeoutWrapper->>Tool: Execute tool
        Tool-->>TimeoutWrapper: Result
        TimeoutWrapper-->>Gateway: Result
        Gateway-->>Client: Success response
    else Timeout exceeded
        TimeoutWrapper--xGateway: asyncio.TimeoutError
        Gateway-->>Client: JSON-RPC error (-32000)
    end

Circuit Breaker State Machine

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: failures >= threshold
    Open --> HalfOpen: reset_time elapsed
    HalfOpen --> Closed: success
    HalfOpen --> Open: failure
    Closed --> Closed: success (reset counter)
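
The state machine above can also be written down as a transition table, which is handy for unit-testing transitions in isolation. This is only a sketch: the `State` and `Event` enums and `next_state` are hypothetical names, and any (state, event) pair not drawn in the diagram is assumed to leave the state unchanged.

```python
from enum import Enum
from typing import Dict, Tuple


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class Event(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    THRESHOLD_REACHED = "threshold_reached"  # failures >= threshold
    RESET_ELAPSED = "reset_elapsed"  # reset_time elapsed


# One entry per arrow in the state diagram.
TRANSITIONS: Dict[Tuple[State, Event], State] = {
    (State.CLOSED, Event.SUCCESS): State.CLOSED,
    (State.CLOSED, Event.THRESHOLD_REACHED): State.OPEN,
    (State.OPEN, Event.RESET_ELAPSED): State.HALF_OPEN,
    (State.HALF_OPEN, Event.SUCCESS): State.CLOSED,
    (State.HALF_OPEN, Event.FAILURE): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state; pairs not in the table keep the current state."""
    return TRANSITIONS.get((state, event), state)
```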

Code Structure

# mcpgateway/services/circuit_breaker.py
from typing import Dict, List


class ToolCircuitBreaker:
    """Per-tool circuit breaker with configurable threshold and reset."""

    def __init__(
        self,
        threshold: int = 5,
        reset_seconds: float = 60.0,
        window_seconds: float = 300.0,  # sliding window for failures
    ):
        self._threshold = threshold
        self._reset_seconds = reset_seconds
        self._window_seconds = window_seconds
        self._failures: Dict[str, List[float]] = {}  # tool_id -> failure timestamps
        self._open_until: Dict[str, float] = {}  # tool_id -> time the open state ends
    
    def is_open(self, tool_id: str) -> bool:
        """Check if circuit is open for tool."""
        ...
    
    def record_success(self, tool_id: str) -> None:
        """Record successful call, reset failure count."""
        ...
    
    def record_failure(self, tool_id: str) -> bool:
        """Record failure, return True if circuit just opened."""
        ...
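
One way the elided method bodies might be filled in is sketched below. It is an assumption-laden illustration rather than the intended implementation: the class name `InMemoryToolCircuitBreaker`, the injectable `clock` parameter, and the `retry_after_seconds()` helper are additions for this example. The half-open state is implicit here (an `_open_until` entry whose deadline has passed), and locking is left out, so concurrent probes in the half-open state are not limited.

```python
import time
from typing import Callable, Dict, List


class InMemoryToolCircuitBreaker:
    """Sliding-window, per-tool circuit breaker (single-process sketch, no locking)."""

    def __init__(
        self,
        threshold: int = 5,
        reset_seconds: float = 60.0,
        window_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self._threshold = threshold
        self._reset_seconds = reset_seconds
        self._window_seconds = window_seconds
        self._clock = clock
        self._failures: Dict[str, List[float]] = {}
        self._open_until: Dict[str, float] = {}

    def is_open(self, tool_id: str) -> bool:
        """True while the open period is still running; False once half-open."""
        open_until = self._open_until.get(tool_id)
        return open_until is not None and self._clock() < open_until

    def retry_after_seconds(self, tool_id: str) -> float:
        """Seconds left until the next probe is allowed (0.0 if not open)."""
        open_until = self._open_until.get(tool_id)
        if open_until is None:
            return 0.0
        return max(0.0, open_until - self._clock())

    def record_success(self, tool_id: str) -> None:
        """Close the circuit and forget earlier failures."""
        self._failures.pop(tool_id, None)
        self._open_until.pop(tool_id, None)

    def record_failure(self, tool_id: str) -> bool:
        """Record a failure; return True if this call opened (or reopened) the circuit."""
        now = self._clock()
        open_until = self._open_until.get(tool_id)
        if open_until is not None:
            if now < open_until:
                return False  # already open
            # Half-open probe failed: reopen for another reset period.
            self._open_until[tool_id] = now + self._reset_seconds
            return True
        # Keep only failures inside the sliding window, then add this one.
        recent = [t for t in self._failures.get(tool_id, []) if now - t < self._window_seconds]
        recent.append(now)
        self._failures[tool_id] = recent
        if len(recent) >= self._threshold:
            self._open_until[tool_id] = now + self._reset_seconds
            self._failures.pop(tool_id, None)
            return True
        return False
```

Note that this counts failures inside a sliding window, as the constructor signature implies, and any success clears the window, which also satisfies the "consecutive failures" wording in US-2.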

📋 Implementation Tasks

Phase 1: Timeout Enforcement

  • Add asyncio.wait_for() wrapper around REST tool invocation in tool_service.py
  • Add asyncio.wait_for() wrapper around MCP tool invocation
  • Use tool.timeout_ms if set, else settings.tool_timeout
  • Convert asyncio.TimeoutError to JSON-RPC error response
  • Log timeout events with tool_id and configured timeout

Phase 2: Circuit Breaker Service

  • Create mcpgateway/services/circuit_breaker.py
  • Implement ToolCircuitBreaker class with closed/open/half-open states
  • Track failures per tool_id with sliding window
  • Guard shared state against concurrent access (asyncio.Lock for coroutines on one event loop, threading.Lock if accessed from multiple threads)
  • Implement is_open(), record_success(), record_failure() methods

Phase 3: Configuration

  • Add TOOL_CIRCUIT_BREAKER_ENABLED setting (default: True)
  • Add TOOL_CIRCUIT_BREAKER_THRESHOLD setting (default: 5)
  • Add TOOL_CIRCUIT_BREAKER_RESET setting (default: 60 seconds)
  • Add TOOL_CIRCUIT_BREAKER_WINDOW setting (default: 300 seconds)
  • Document settings in .env.example

Phase 4: Integration

  • Inject circuit breaker checks before tool invocation
  • Record success/failure after invocation completes
  • Return immediate error when circuit is open
  • Include retry_after_seconds in error response
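
The integration steps above might fit together along these lines. This is a sketch under assumptions: `Breaker` is a hypothetical protocol mirroring the methods proposed under Code Structure plus an assumed `retry_after_seconds()` helper, and `CircuitOpenError` and `guarded_invoke` are illustrative names, not existing gateway APIs.

```python
import asyncio
import math
from typing import Any, Awaitable, Callable, Protocol


class Breaker(Protocol):
    """Hypothetical interface the circuit breaker service would expose."""

    def is_open(self, tool_id: str) -> bool: ...
    def retry_after_seconds(self, tool_id: str) -> float: ...
    def record_success(self, tool_id: str) -> None: ...
    def record_failure(self, tool_id: str) -> bool: ...


class CircuitOpenError(Exception):
    """Raised without calling the tool when its circuit is open."""

    def __init__(self, tool_id: str, retry_after_seconds: int) -> None:
        super().__init__("Circuit breaker open")
        self.tool_id = tool_id
        self.retry_after_seconds = retry_after_seconds


async def guarded_invoke(
    breaker: Breaker,
    tool_id: str,
    invoke: Callable[[], Awaitable[Any]],
    timeout_s: float,
) -> Any:
    """Check the breaker, run the call under a timeout, then record the outcome."""
    if breaker.is_open(tool_id):
        # Fast-fail: round up so clients never retry too early.
        raise CircuitOpenError(tool_id, math.ceil(breaker.retry_after_seconds(tool_id)))
    try:
        result = await asyncio.wait_for(invoke(), timeout=timeout_s)
    except Exception:
        # Timeouts and tool errors both count as failures; task cancellation
        # (a BaseException) deliberately does not.
        breaker.record_failure(tool_id)
        raise
    breaker.record_success(tool_id)
    return result
```

Whether every exception type should count toward the threshold (for example, client-side validation errors) is a design decision this sketch does not settle.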

Phase 5: Metrics and Logging

  • Add metrics: tool_timeout_total, circuit_breaker_open_total, circuit_breaker_state
  • Log circuit state transitions (opened, half-open, closed)
  • Include tool_id in all circuit breaker log messages
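
To make the metric and logging tasks above concrete, here is a small in-memory stand-in. It only illustrates what would be counted and logged: `BreakerMetrics` and its methods are hypothetical, and a real implementation would presumably publish through whatever metrics backend the gateway already uses rather than a `Counter`.

```python
import logging
from collections import Counter
from typing import Dict

logger = logging.getLogger("tool_resilience_sketch")


class BreakerMetrics:
    """In-memory stand-in for the proposed metrics, keyed by tool_id."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()  # (metric_name, tool_id) -> count
        self.state: Dict[str, str] = {}  # tool_id -> circuit_breaker_state

    def record_timeout(self, tool_id: str, timeout_ms: int) -> None:
        """Count a timeout and log it with the configured timeout."""
        self.counters[("tool_timeout_total", tool_id)] += 1
        logger.warning("Tool invocation timed out: tool_id=%s timeout_ms=%d", tool_id, timeout_ms)

    def record_transition(self, tool_id: str, new_state: str, failure_count: int = 0) -> None:
        """Track the current state and log every transition with the tool_id."""
        old_state = self.state.get(tool_id, "closed")
        self.state[tool_id] = new_state
        if new_state == "open":
            self.counters[("circuit_breaker_open_total", tool_id)] += 1
        logger.info(
            "Circuit breaker %s -> %s: tool_id=%s failures=%d",
            old_state, new_state, tool_id, failure_count,
        )
```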

Phase 6: Testing

  • Unit tests for timeout wrapper
  • Unit tests for circuit breaker state transitions
  • Integration test for timeout behavior
  • Integration test for circuit breaker opening after failures
  • Integration test for circuit recovery

⚙️ Configuration Example

# .env.example additions

# Tool invocation timeout (seconds) - applies when tool.timeout_ms not set
TOOL_TIMEOUT=60

# Circuit breaker settings
TOOL_CIRCUIT_BREAKER_ENABLED=true
TOOL_CIRCUIT_BREAKER_THRESHOLD=5      # Failures before circuit opens
TOOL_CIRCUIT_BREAKER_RESET=60         # Seconds before testing recovery
TOOL_CIRCUIT_BREAKER_WINDOW=300       # Sliding window for counting failures
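
For illustration, the settings above could be read and validated as shown here. This sketch uses only the standard library and a hypothetical `ToolResilienceSettings` dataclass; the gateway has its own settings module (the issue references `settings.tool_timeout`), so the real change would add fields there instead. The defaults mirror the values proposed in Phase 3.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ToolResilienceSettings:
    tool_timeout: float = 60.0  # seconds
    circuit_breaker_enabled: bool = True
    circuit_breaker_threshold: int = 5
    circuit_breaker_reset: float = 60.0  # seconds
    circuit_breaker_window: float = 300.0  # seconds


def _parse_bool(raw: str) -> bool:
    """Accept the usual truthy spellings; everything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> ToolResilienceSettings:
    """Read the TOOL_* variables from env (default: os.environ) and validate them."""
    env = os.environ if env is None else env
    settings = ToolResilienceSettings(
        tool_timeout=float(env.get("TOOL_TIMEOUT", "60")),
        circuit_breaker_enabled=_parse_bool(env.get("TOOL_CIRCUIT_BREAKER_ENABLED", "true")),
        circuit_breaker_threshold=int(env.get("TOOL_CIRCUIT_BREAKER_THRESHOLD", "5")),
        circuit_breaker_reset=float(env.get("TOOL_CIRCUIT_BREAKER_RESET", "60")),
        circuit_breaker_window=float(env.get("TOOL_CIRCUIT_BREAKER_WINDOW", "300")),
    )
    if settings.circuit_breaker_threshold < 1:
        raise ValueError("TOOL_CIRCUIT_BREAKER_THRESHOLD must be >= 1")
    durations = (settings.tool_timeout, settings.circuit_breaker_reset, settings.circuit_breaker_window)
    if min(durations) <= 0:
        raise ValueError("Timeout, reset, and window durations must be positive")
    return settings
```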

✅ Success Criteria

  • All tool calls (REST and MCP) respect timeout settings
  • Per-tool timeout_ms overrides global TOOL_TIMEOUT
  • Timeout errors return consistent JSON-RPC error code
  • Circuit breaker opens after configured failure threshold
  • Circuit breaker resets after cooldown period
  • Metrics exposed for monitoring timeout and circuit breaker events
  • Settings documented in .env.example
  • Unit test coverage > 90% for new code

🏁 Definition of Done

  • Timeout enforcement implemented for REST and MCP tools
  • Circuit breaker service implemented
  • Configuration settings added
  • Metrics and logging integrated
  • Unit tests written and passing
  • Integration tests written and passing
  • Code passes make verify
  • Settings documented in .env.example
  • PR reviewed and approved

📝 Additional Notes

Current State (Gaps to Address)

| Component | Status | Gap |
| TOOL_TIMEOUT config | Exists | Not enforced in invocation code |
| tool.timeout_ms field | Exists in DbTool | Not used during invocation |
| MCP session pool circuit breaker | Implemented | Per-pool, not per-tool |
| REST tool timeout | HTTPX global timeout | Not per-tool configurable |

Error Codes

| Code | Meaning |
| -32000 | Tool invocation timed out |
| -32001 | Circuit breaker open |
| -32603 | Internal error (existing) |

🔗 Related Issues

  • MCP session pool already has circuit breaker (reference implementation)
  • settings.tool_timeout exists at config.py:1286
  • DbTool.timeout_ms exists at db.py:2803

Metadata

Labels

  • MUST: P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
  • enhancement: New feature or request
  • ica: ICA related issues
  • python: Python / backend development (FastAPI)
