[BUG][PERFORMANCE][DATABASE]: DB sessions held during external HTTP calls cause connection pool exhaustion #2518

@crivetimihai

Description

Summary

Database sessions are held open during external HTTP/MCP calls, leaving connections "idle in transaction" and exhausting the connection pool. This adds ~1800ms of latency per request and triggers PostgreSQL/PgBouncer idle transaction timeout errors under load.

Impact

Metric                             Value
Tool invocation latency            ~2000ms avg (vs ~100ms direct to MCP servers)
Max response times                 23-30 seconds
Idle transaction timeout errors    ~100+/minute across gateway instances
RPC vs REST disparity              1636ms vs 4ms for equivalent list operations

Evidence

Load Test Results (3.2M requests, ~2500 RPS)

Endpoint                              Avg (ms)    Max (ms)    Requests
/tools (REST list)                         4.0       5920        266K
/rpc fast-time-get-system-time          1871      23453        214K
/rpc tools/list                         1636      22103        161K
/rpc fast-test-echo                     2160      30002         88K

Error Logs

psycopg.errors.ProtocolViolation: idle transaction timeout
# ~102 errors/minute across 3 gateway instances

Direct vs Gateway Latency

# Direct call to MCP server (bypassing gateway)
curl -X POST http://localhost:9001/rpc ...
# Result: ~100ms

# Same call through gateway
curl -X POST http://localhost:8080/rpc ...
# Result: ~2000ms

Gateway adds ~1800ms overhead due to DB connection contention.

Root Cause

SQLAlchemy sessions injected via FastAPI Depends(get_db) are held open throughout request handling. When making external HTTP calls, the DB connection remains in an open transaction even though no database work is being done.

async def invoke_tool(db: Session, ...):  # Session opened by FastAPI
    tool = db.execute(query).scalar()      # DB query - transaction starts
    
    # !!! HTTP call with DB session still open !!!
    result = await http_client.post(url, json=arguments)  # 100-2000ms
    
    return result  # db.close() happens much later
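
A minimal sketch of the alternative ordering: read what the call needs, end the transaction, and only then do the network request. `FakeSession`, `fake_http_post`, and `invoke_tool_released` are hypothetical stand-ins for illustration, not gateway code, and the simulated HTTP call only records whether the session had already been released.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy Session that records its lifecycle."""

    def __init__(self):
        self.events = []
        self.closed = False

    def fetch_tool_url(self, name):
        # Represents db.execute(query).scalar(); a real session starts a transaction here.
        self.events.append("query")
        return f"http://example.invalid/{name}"

    def commit(self):
        self.events.append("commit")

    def close(self):
        self.events.append("close")
        self.closed = True


async def fake_http_post(url, payload, db):
    # Simulated external call; notes whether the session was released beforehand.
    db.events.append("http(closed)" if db.closed else "http(open)")
    await asyncio.sleep(0.01)
    return {"url": url, "echo": payload}


async def invoke_tool_released(db, name, arguments):
    # Phase 1: read everything needed into plain local variables.
    url = db.fetch_tool_url(name)
    # Phase 2: end the transaction and hand the connection back to the pool.
    db.commit()
    db.close()
    # Phase 3: slow network I/O runs with no connection held.
    return await fake_http_post(url, arguments, db)


db = FakeSession()
result = asyncio.run(invoke_tool_released(db, "echo", {"msg": "hi"}))
print(db.events)
```

The recorded order is `['query', 'commit', 'close', 'http(closed)']`: the session is already closed when the external call starts, which is the property the fixes below aim for.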

Bugs Identified

Bug 1: A2A Tool Invocation Early Return

Location: mcpgateway/services/tool_service.py:2589-2591

# Early return BEFORE db.close() at line 2673
if tool_integration_type == "A2A" and tool_annotations and "a2a_agent_id" in tool_annotations:
    tool_stub = tool if tool is not None else SimpleNamespace(...)
    return await self._invoke_a2a_tool(db=db, tool=tool_stub, arguments=arguments)
    # ^^^ Returns here, bypassing:
    #   - db.commit() and db.close() at lines 2672-2673
    #   - Plugin pre-invoke and post-invoke hooks
    #   - Metrics recording in finally block

Bug 2: RPC Handler Missing db.commit()/db.close()

Location: mcpgateway/main.py RPC handler

The REST endpoint explicitly commits after list operations; the RPC handler does not.


Implementation Plan

Fix 1: Restructure A2A Tool Invocation (Preferred)

Remove early return and integrate A2A into the standard invoke_tool flow:

  1. Remove the early return at lines 2589-2591
  2. Add A2A data extraction in Phase 2 (before db.close()):
    • Query for A2A agent
    • Extract all needed data to local variables: a2a_agent_name, a2a_agent_endpoint_url, a2a_agent_type, a2a_agent_protocol_version, a2a_agent_auth_type, a2a_agent_auth_value, a2a_agent_auth_query_params
  3. Add elif tool_integration_type == "A2A": branch after the MCP branch (around line 3205)
    • Use pre-extracted local variables
    • Go through plugin pre-invoke hook
    • Make HTTP call
    • Set success flag for metrics
    • Go through plugin post-invoke hook

This ensures A2A tools:

  • Release DB before HTTP call ✓
  • Go through plugin hooks ✓
  • Have metrics recorded in finally block ✓
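
The restructured flow can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual `tool_service.py` code: `pre_invoke`, `post_invoke`, `http_post`, and `record_metric` are hypothetical placeholders, the tool and session are `SimpleNamespace` stubs, and only the A2A branch is shown.

```python
import asyncio
from types import SimpleNamespace

calls = []  # records the order of steps, for illustration only


async def pre_invoke(name):
    calls.append("pre_invoke")


async def post_invoke(name, result):
    calls.append("post_invoke")
    return result


async def http_post(url, payload):
    # Placeholder for the real HTTP request to the A2A agent.
    calls.append("http")
    return {"endpoint": url, "payload": payload}


def record_metric(name, success):
    calls.append(f"metric(success={success})")


async def invoke_tool(db, tool, arguments):
    success = False
    try:
        a2a_agent_endpoint_url = None
        if tool.integration_type == "A2A":
            # Copy ORM attributes to locals while the session is still open.
            a2a_agent_endpoint_url = tool.agent.endpoint_url
        db.commit()
        db.close()
        calls.append("db_released")

        await pre_invoke(tool.name)
        if tool.integration_type == "A2A" and a2a_agent_endpoint_url:
            result = await http_post(a2a_agent_endpoint_url, arguments)
        else:
            raise ValueError(f"unsupported integration type: {tool.integration_type}")
        result = await post_invoke(tool.name, result)
        success = True
        return result
    finally:
        # Runs for every branch because no branch returns before the try block.
        record_metric(tool.name, success)


db = SimpleNamespace(commit=lambda: None, close=lambda: None)
agent = SimpleNamespace(endpoint_url="http://example.invalid/agent")
tool = SimpleNamespace(name="a2a-demo", integration_type="A2A", agent=agent)
result = asyncio.run(invoke_tool(db, tool, {"q": "ping"}))
print(calls)
```

Because the A2A path no longer returns early, it passes through the same release point, hooks, and `finally` block as every other integration type.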

Fix 2: RPC Handler db.commit()/db.close()

Add db.commit() and db.close() after each RPC list operation:

  • tools/list (2 code paths)
  • mcp/tools/list (2 code paths)
  • list_gateways
  • resources/list (2 code paths)
  • prompts/list (2 code paths)

Total: 11 locations
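
Since the same commit-and-close step repeats at every location, one option is a small shared helper. The sketch below is an assumption about how that could look, not the code in the diff: `release_session`, `RecordingSession`, and `list_tools_rpc` are hypothetical names, and the rollback-on-failure behaviour is a design choice added here rather than something the plan specifies.

```python
def release_session(db):
    """Commit and close so the connection returns to the pool right away."""
    try:
        db.commit()
    except Exception:
        # Assumed policy: roll back on a failed commit, then re-raise.
        db.rollback()
        raise
    finally:
        db.close()


class RecordingSession:
    """Hypothetical stand-in for a Session that records which methods ran."""

    def __init__(self, fail_commit=False):
        self.fail_commit = fail_commit
        self.events = []

    def commit(self):
        if self.fail_commit:
            raise RuntimeError("commit failed")
        self.events.append("commit")

    def rollback(self):
        self.events.append("rollback")

    def close(self):
        self.events.append("close")


def list_tools_rpc(db, rows):
    # Build plain dicts first so nothing touches the session after release.
    tools = [{"name": row} for row in rows]
    release_session(db)
    return {"tools": tools}


ok = RecordingSession()
response = list_tools_rpc(ok, ["a", "b"])

failing = RecordingSession(fail_commit=True)
try:
    list_tools_rpc(failing, ["a"])
except RuntimeError:
    pass

print(ok.events, failing.events)
```

Materializing the response data before calling the helper matters: once the session is closed, lazily loaded ORM attributes can no longer be read.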


Diff Saved

Implementation diff saved to: todo/diff-db.diff

Files Modified

File                                   Changes
mcpgateway/services/tool_service.py    Restructure A2A to use standard invoke flow, add Phase 2 data extraction
mcpgateway/main.py                     Add db.commit()/db.close() to 11 RPC list operations

Key Changes in tool_service.py

  1. Removed the early A2A return at lines 2589-2591
  2. Added A2A agent data extraction in Phase 2 (lines 2663-2697):
    if tool_integration_type == "A2A" and "a2a_agent_id" in tool_annotations:
        # Query agent and extract to local variables BEFORE db.close()
        a2a_agent_name = a2a_agent.name
        a2a_agent_endpoint_url = a2a_agent.endpoint_url
        # ... etc
  3. Added A2A branch after MCP branch (lines 3238-3310):
    elif tool_integration_type == "A2A" and a2a_agent_endpoint_url:
        # Plugin pre-invoke hook
        # Build request data based on agent type
        # Add authentication
        # Make HTTP request
        # Convert response to ToolResult

Key Changes in main.py

Added db.commit() and db.close() after 11 RPC list operations to release connections early.


Verification

After applying the fix:

  1. Run load test: make load-test-ui (100+ users, spawn rate 10/s)
  2. Monitor PostgreSQL: SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'mcp' GROUP BY state;
  3. Check for timeout errors: docker logs mcp-context-forge-gateway-1 2>&1 | grep "idle transaction timeout" | wc -l

Expected outcomes:

  • "idle in transaction" count should stay below the pool size
  • Tool invocation latency should drop to ~100-200ms
  • Few or no "idle transaction timeout" errors
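
The first expected outcome can be checked mechanically against the rows returned by the `pg_stat_activity` query in step 2. The helper below is a hypothetical sketch: it takes `(state, count)` pairs however they were fetched, and the sample rows and the pool size of 20 are illustrative values, not measurements from this issue.

```python
def idle_in_transaction_ok(rows, pool_size):
    """Return (ok, idle_count) for (state, count) rows from pg_stat_activity."""
    counts = {state: count for state, count in rows}
    # A missing row means no sessions are currently in that state.
    idle = counts.get("idle in transaction", 0)
    return idle < pool_size, idle


# Illustrative rows only; real values come from the query in step 2.
sample_rows = [
    ("active", 12),
    ("idle", 30),
    ("idle in transaction", 3),
]

ok, idle = idle_in_transaction_ok(sample_rows, pool_size=20)
print(ok, idle)
```

Running a check like this periodically during the load test gives a pass/fail signal instead of relying on reading the query output by eye.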

Metadata


Labels

  • SHOULD: P2: Important but not vital; high-value items that are not crucial for the immediate release
  • bug: Something isn't working
  • database
  • performance: Performance related items
  • python: Python / backend development (FastAPI)
