[BUG][PERFORMANCE][DATABASE]: DB sessions held during external HTTP calls cause connection pool exhaustion #2518
Summary
Database sessions are held open during external HTTP/MCP calls, causing "idle in transaction" connection pool exhaustion. This adds ~1800ms latency per request and triggers PostgreSQL/PgBouncer timeout errors under load.
Impact
| Metric | Value |
|---|---|
| Tool invocation latency | ~2000ms avg (vs ~100ms direct to MCP servers) |
| Max response times | 23-30 seconds |
| Idle transaction timeout errors | ~100+/minute across gateway instances |
| RPC vs REST disparity | 1636ms vs 4ms for equivalent list operations |
Evidence
Load Test Results (3.2M requests, ~2500 RPS)
| Endpoint | Avg (ms) | Max (ms) | Requests |
|---|---|---|---|
| /tools (REST list) | 4.0 | 5920 | 266K |
| /rpc fast-time-get-system-time | 1871 | 23453 | 214K |
| /rpc tools/list | 1636 | 22103 | 161K |
| /rpc fast-test-echo | 2160 | 30002 | 88K |
Error Logs
psycopg.errors.ProtocolViolation: idle transaction timeout
# ~102 errors/minute across 3 gateway instances
Direct vs Gateway Latency
# Direct call to MCP server (bypassing gateway)
curl -X POST http://localhost:9001/rpc ...
# Result: ~100ms
# Same call through gateway
curl -X POST http://localhost:8080/rpc ...
# Result: ~2000ms
Gateway adds ~1800ms overhead due to DB connection contention.
Root Cause
SQLAlchemy sessions injected via FastAPI Depends(get_db) are held open throughout request handling. When making external HTTP calls, the DB connection remains in an open transaction even though no database work is being done.
async def invoke_tool(db: Session, ...):  # Session opened by FastAPI
    tool = db.execute(query).scalar()  # DB query - transaction starts
    # !!! HTTP call with DB session still open !!!
    result = await http_client.post(url, json=arguments)  # 100-2000ms
    return result  # db.close() happens much later
Bugs Identified
Bug 1: A2A Tool Invocation Early Return
Location: mcpgateway/services/tool_service.py:2589-2591
# Early return BEFORE db.close() at line 2673
if tool_integration_type == "A2A" and tool_annotations and "a2a_agent_id" in tool_annotations:
    tool_stub = tool if tool is not None else SimpleNamespace(...)
    return await self._invoke_a2a_tool(db=db, tool=tool_stub, arguments=arguments)
    # ^^^ Returns here, bypassing:
    # - db.commit() and db.close() at lines 2672-2673
    # - Plugin pre-invoke and post-invoke hooks
    # - Metrics recording in finally block
Bug 2: RPC Handler Missing db.commit()/db.close()
Location: mcpgateway/main.py RPC handler
The REST endpoint explicitly commits after list operations; the RPC handler does not, so its session stays in an open transaction until the request finishes.
Implementation Plan
Fix 1: Restructure A2A Tool Invocation (Preferred)
Remove early return and integrate A2A into the standard invoke_tool flow:
- Remove the early return at lines 2589-2591
- Add A2A data extraction in Phase 2 (before db.close()):
  - Query for the A2A agent
  - Extract all needed data to local variables: a2a_agent_name, a2a_agent_endpoint_url, a2a_agent_type, a2a_agent_protocol_version, a2a_agent_auth_type, a2a_agent_auth_value, a2a_agent_auth_query_params
- Add an elif tool_integration_type == "A2A": branch after the MCP branch (around line 3205):
  - Use pre-extracted local variables
  - Go through plugin pre-invoke hook
  - Make HTTP call
  - Set success flag for metrics
  - Go through plugin post-invoke hook
This ensures A2A tools:
- Release DB before HTTP call ✓
- Go through plugin hooks ✓
- Have metrics recorded in finally block ✓
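The ordering this fix aims for can be sketched as follows. This is a minimal illustration, not the code in tool_service.py: FakeSession, get_agent, and fake_http_post are hypothetical stand-ins for the SQLAlchemy session, the agent query, and the HTTP client, and the plugin hooks are left out for brevity.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy Session that records call order."""

    def __init__(self):
        self.events = []

    def get_agent(self, agent_id):
        self.events.append("query")
        return {"name": "demo-agent", "endpoint_url": "http://localhost:9001/rpc"}

    def commit(self):
        self.events.append("commit")

    def close(self):
        self.events.append("close")


async def fake_http_post(events, url, payload):
    """Hypothetical stand-in for the external HTTP call to the agent."""
    events.append("http")
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return {"url": url, "echo": payload}


async def invoke_a2a_tool(db, agent_id, arguments):
    # Read everything the call needs into plain local variables.
    agent = db.get_agent(agent_id)
    a2a_agent_name = agent["name"]
    a2a_agent_endpoint_url = agent["endpoint_url"]

    # End the transaction and release the connection before any network I/O.
    db.commit()
    db.close()

    success = False
    try:
        response = await fake_http_post(db.events, a2a_agent_endpoint_url, arguments)
        success = True
        return {"agent": a2a_agent_name, "response": response}
    finally:
        # Metrics are recorded whether the call succeeded or raised.
        db.events.append(f"metrics success={success}")


db = FakeSession()
result = asyncio.run(invoke_a2a_tool(db, "agent-1", {"text": "hi"}))
print(db.events)
```

The recorded events show the session being committed and closed before the HTTP call starts, with the metrics step running last from the finally block.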
Fix 2: RPC Handler db.commit()/db.close()
Add db.commit() and db.close() after each RPC list operation:
- tools/list (2 code paths)
- mcp/tools/list (2 code paths)
- list_gateways
- resources/list (2 code paths)
- prompts/list (2 code paths)
Total: 11 locations
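Because the same commit-and-close step repeats at every location, one option is a small shared helper. The sketch below is an assumption about how that could look, not the change in main.py: list_and_release and RecordingSession are hypothetical names, and the rollback on failure is an addition beyond what the plan above describes.

```python
class RecordingSession:
    """Hypothetical stand-in for a SQLAlchemy Session that records calls."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


def list_and_release(db, list_fn):
    """Run a list operation, then end the transaction and release the connection."""
    try:
        items = list_fn(db)  # should return plain data, not lazy ORM objects
        db.commit()
        return items
    except Exception:
        db.rollback()  # assumption: discard the failed transaction
        raise
    finally:
        db.close()  # always hand the connection back to the pool


db = RecordingSession()
tools = list_and_release(db, lambda _db: [{"name": "fast-test-echo"}])
print(tools, db.calls)
```

With a real SQLAlchemy session, results would likely need to be converted to plain dicts or schema objects inside list_fn, since ORM attributes are expired on commit by default and may not load once the session is closed.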
Diff Saved
Implementation diff saved to: todo/diff-db.diff
Files Modified
| File | Changes |
|---|---|
| mcpgateway/services/tool_service.py | Restructure A2A to use standard invoke flow, add Phase 2 data extraction |
| mcpgateway/main.py | Add db.commit()/db.close() to 11 RPC list operations |
Key Changes in tool_service.py
- Removed early A2A return at lines 2589-2591
- Added A2A agent data extraction in Phase 2 (lines 2663-2697):
if tool_integration_type == "A2A" and "a2a_agent_id" in tool_annotations:
    # Query agent and extract to local variables BEFORE db.close()
    a2a_agent_name = a2a_agent.name
    a2a_agent_endpoint_url = a2a_agent.endpoint_url
    # ... etc
- Added A2A branch after MCP branch (lines 3238-3310):
elif tool_integration_type == "A2A" and a2a_agent_endpoint_url:
    # Plugin pre-invoke hook
    # Build request data based on agent type
    # Add authentication
    # Make HTTP request
    # Convert response to ToolResult
Key Changes in main.py
Added db.commit() and db.close() after 11 RPC list operations to release connections early.
Verification
After applying the fix:
- Run load test (100+ users, spawn rate 10/s):
  make load-test-ui
- Monitor PostgreSQL:
  SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'mcp' GROUP BY state;
- Check for timeout errors:
  docker logs mcp-context-forge-gateway-1 2>&1 | grep "idle transaction timeout" | wc -l
Expected outcomes:
- "idle in transaction" count should stay < pool size
- Tool invocation latency should drop to ~100-200ms
- No/minimal "idle transaction timeout" errors
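The first expected outcome could be checked automatically against the rows returned by the pg_stat_activity query. The sketch below assumes the rows arrive as (state, count) pairs; the function name, the pool size of 20, and the sample rows are made up for illustration and are not measurements from this issue.

```python
def idle_in_transaction_ok(rows, pool_size):
    """Return True if sessions idling inside a transaction stay below pool_size.

    rows: iterable of (state, count) pairs as produced by
          SELECT state, count(*) FROM pg_stat_activity ... GROUP BY state
    """
    idle_in_tx = sum(
        count
        for state, count in rows
        # Covers "idle in transaction" and "idle in transaction (aborted)";
        # state can be NULL (None) for some backends.
        if state is not None and state.startswith("idle in transaction")
    )
    return idle_in_tx < pool_size


# Illustrative rows only, not real results.
sample_rows = [("active", 12), ("idle", 30), ("idle in transaction", 3), (None, 1)]
print(idle_in_transaction_ok(sample_rows, pool_size=20))
```

Running this periodically during the load test gives a simple pass/fail signal in place of reading the query output by eye.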