-
Notifications
You must be signed in to change notification settings - Fork 614
[BUG][PERFORMANCE][DB]: Endpoint handlers hold DB sessions during slow MCP backend calls #2323
Description
Summary
Endpoint handlers in mcpgateway/routers/ acquire database sessions via Depends(get_db) at request start, fetch data, then hold the session while making slow MCP backend calls (1-30+ seconds). This causes "idle in transaction" connections to accumulate, eventually hitting PostgreSQL's idle_in_transaction_session_timeout and causing ProtocolViolation errors.
Related
This is a follow-up to #2318 which fixed the same pattern in RBAC middleware. That fix is working - no permission queries appear in long-running transactions anymore. However, endpoint handlers still exhibit the same anti-pattern.
Evidence from Load Testing (4000 users)
After RBAC fix (#2319), remaining long-running transactions are all from endpoint handlers:
source | count | avg_age | max_age
-----------+-------+---------+---------
teams | 20 | 41 | 52s
tools | 10 | 36 | 50s
resources | 8 | 43 | 53s
prompts | 7 | 45 | 52s
servers | 5 | 42 | 52s
Example queries held during slow MCP calls:
SELECT resources.id, resources.uri, resources.name, ... -- held 63s
SELECT email_teams.id, email_teams.name, ... -- held 62s
SELECT prompts.id, prompts.original_name, ... -- held 62s
SELECT tools.id, tools.original_name, ... -- held 60sRoot Cause
Endpoint pattern that holds sessions:
# Current pattern in routers/*.py
@router.get("/tools/{tool_id}/call")
async def call_tool(
tool_id: str,
db: Session = Depends(get_db), # Session acquired here
user: dict = Depends(get_current_user_with_permissions),
):
tool = db.query(Tool).filter(Tool.id == tool_id).first() # Quick query
# Session still held during slow MCP call (1-30+ seconds)
result = await mcp_client.call_tool(tool.gateway_url, tool.name, arguments)
return result # Session finally releasedProposed Solution
Apply same pattern as RBAC fix - use fresh_db_session() for data fetching, close before slow operations:
@router.get("/tools/{tool_id}/call")
async def call_tool(
tool_id: str,
user: dict = Depends(get_current_user_with_permissions),
):
# Fetch data with short-lived session
with fresh_db_session() as db:
tool = db.query(Tool).filter(Tool.id == tool_id).first()
if not tool:
raise HTTPException(404, "Tool not found")
# Extract needed data before session closes
gateway_url = tool.gateway_url
tool_name = tool.name
# Session closed here - BEFORE slow MCP call
# Now make slow MCP call without holding DB connection
result = await mcp_client.call_tool(gateway_url, tool_name, arguments)
return resultFiles to Modify
| File | Endpoints |
|---|---|
mcpgateway/routers/tools.py |
call_tool, list_tools, tool operations |
mcpgateway/routers/prompts.py |
get_prompt, list_prompts, prompt operations |
mcpgateway/routers/resources.py |
read_resource, list_resources, resource operations |
mcpgateway/routers/teams.py |
Team CRUD operations |
mcpgateway/routers/servers.py |
Server operations with MCP calls |
mcpgateway/routers/gateways.py |
Gateway operations with backend calls |
Expected Impact
| Metric | Current (after RBAC fix) | Expected (after this fix) |
|---|---|---|
| Idle-in-txn count | 50 | < 20 |
| Max transaction age | 50-100s | < 5s |
| ProtocolViolation errors | Still occurring | None |
| Stability at 4000 users | Degrades over time | Stable |
Acceptance Criteria
- No endpoint handlers hold sessions during MCP backend calls
- Max transaction age stays < 10s under load
- No ProtocolViolation errors during 15-min load test at 4000 users
- All existing tests pass
References
- [BUG][PERFORMANCE]: RBAC middleware holds DB sessions for entire request lifecycle causing pool exhaustion #2318 - RBAC session lifetime fix (same pattern)
- fix: use short-lived DB sessions in RBAC permission checks #2319 - RBAC fix PR
mcpgateway/db.py-fresh_db_session()implementation