-
Notifications
You must be signed in to change notification settings - Fork 615
[EPIC][PERFORMANCE]: Database Session Management - Eliminate Transaction Leaks Under Load #2660
Description
Goal
Eliminate database transaction leaks that cause connection pool exhaustion and idle-in-transaction connections under high load. Ensure every endpoint that uses Depends(get_db) explicitly releases the transaction before response serialization or streaming.
Current Status (as of 2026-02-03)
A full audit of mcpgateway/ identified 354 handlers using Depends(get_db):
- Missing both
db.commit(); db.close(): 199 - Missing
db.close()(commit present): 21 - Already correct: 108
- Exceptions: 26 (unused
_dbparams or dependency factory:middleware/rbac.py:get_permission_service)
Hotspots:
mcpgateway/admin.py: 112 missing both + 18 missing closemcpgateway/main.py: 39 missing bothmcpgateway/routers/llm_admin_router.py: 11 missing bothmcpgateway/routers/observability.py: 8 missing bothmcpgateway/routers/email_auth.py: 6 missing both + 1 missing closemcpgateway/routers/oauth_router.py: 5 missing both + 1 missing closemcpgateway/routers/log_search.py: 5 missing bothmcpgateway/routers/sso.py: 4 missing both + 1 missing closemcpgateway/routers/toolops_router.py: 3 missing bothmcpgateway/routers/llm_proxy_router.py: 2 missing bothmcpgateway/routers/server_well_known.py: 2 missing bothmcpgateway/routers/auth.py: 1 missing bothmcpgateway/routers/well_known.py: 1 missing both
Full audit report: todo/db-audit-results.md.
Why This Matters
Load tests (4000 concurrent users) previously revealed:
- 150+ connections stuck in
idle in transactionfor up to 28 seconds - RPS drop from 2000+ to ~1414
- P99 latency spiking to ~30s on Teams/Resources endpoints
The root cause is deferring transaction cleanup to FastAPI dependency teardown, which happens after response serialization.
✅ Desired Pattern (Mandatory)
# 1) Perform DB ops
# 2) Materialize response data
# 3) db.commit(); db.close()
# 4) return responseAvoid returning ORM objects directly after closing. Convert to Pydantic models or dicts first.
Manual Nuance Review (All Missing Endpoints)
Streaming / generator responses (must close before stream starts)
routers/llm_proxy_router.py:chat_completions— resolve provider/model, close session, then stream.routers/observability.py:export_traces—ndjsonbranch returnsStreamingResponse; materialize trace data first, then close.
TemplateResponse endpoints (close before template render)
admin.py:get_overview_partial,admin_servers_partial_html,admin_ui,admin_teams_partial_html,admin_list_teams,admin_users_partial_html,admin_team_members_partial_html,admin_team_non_members_partial_html,admin_tools_partial_html,admin_tool_ops_partial,admin_prompts_partial_html,admin_gateways_partial_html,admin_resources_partial_html,admin_a2a_partial_html,admin_metrics_partial_html,catalog_partial,get_system_stats,get_performance_statsrouters/llm_admin_router.py:get_providers_partial,get_models_partial,set_provider_state_html,set_model_state_html,get_api_info_partial
RedirectResponse endpoints (multi-return branches; unify commit+close)
admin.py:admin_set_server_state,admin_delete_server,admin_set_gateway_state,admin_login_handler,change_password_required_handler,admin_update_team,admin_delete_tool,admin_set_tool_state,admin_delete_gateway,admin_delete_resource,admin_set_resource_state,admin_delete_prompt,admin_set_prompt_state,admin_set_a2a_agent_state,admin_delete_a2a_agentrouters/oauth_router.py:initiate_oauth_flowrouters/sso.py:handle_sso_callback
ORM-to-response conversion before close
routers/observability.pyendpoints return ORM objects; must materialize response models/dicts before closing.
Long-running external calls while DB open
routers/toolops_router.py(ToolOps calls),routers/oauth_router.py:initiate_oauth_flow(DCR),routers/sso.py:handle_sso_callback(OAuth exchange).
Concrete Fix Plan (By Category)
Category A: Standard JSON/Pydantic responses
- Build response payloads (lists/dicts/Pydantic objects) immediately after DB queries.
- Insert
db.commit(); db.close()just beforereturn. - Ensure every early-return branch converges through the same commit+close path.
- Examples:
main.pylist/read routes,routers/log_search.py,routers/server_well_known.py.
- Examples:
Category B: TemplateResponse (admin + llm_admin)
- Build template context dicts fully (no ORM objects).
db.commit(); db.close()beforeTemplateResponse(...).- Confirm templates only use plain data.
Category C: RedirectResponse / multi-branch control flow
- Convert to “build response → commit/close → return”.
- Ensure all branches (success/error/early returns) share cleanup.
Category D: Streaming/Generator responses
- Extract DB-dependent data first, close session, then stream.
- If generator must access DB, use
fresh_db_session()inside generator withtry/finally.
Category E: Long-running external calls while DB open
- Extract DB data to locals.
- Close session before external calls.
- Re-open session only for DB updates afterwards.
Category F: Exceptions/Unused DB params
- Rename unused
dbto_dbor removeDepends(get_db)where safe. - Keep
middleware/rbac.py:get_permission_serviceallowlisted.
Implementation Tasks (Updated)
Phase 0: Guardrails
- Add AST-based static check (see “Static Check Proposal” below)
- Add allowlist for intentional exceptions (dependency factories, streaming endpoints)
Phase 1: Core API (mcpgateway/main.py)
- Fix all remaining list/read/create/update/delete routes missing explicit commit/close
- Pay special attention to RPC handlers and multi-return branches
Phase 2: Admin UI/API (mcpgateway/admin.py)
- Fix all TemplateResponse endpoints (close before render)
- Fix RedirectResponse endpoints with unified cleanup
- Fix JSON/ORJSON responses with early returns
Phase 3: Routers
-
routers/observability.py(materialize results before close, streaming NDJSON) -
routers/oauth_router.py(DCR path + unified cleanup) -
routers/sso.py(callback flow) -
routers/log_search.py -
routers/llm_admin_router.py(partials) -
routers/llm_proxy_router.py(streaming path) -
routers/toolops_router.py(external calls) -
routers/server_well_known.py,routers/well_known.py,routers/auth.py,routers/email_auth.py
Phase 4: Tests & Verification
- Update unit tests that call routers directly (ensure
db=mock_dband new cleanup does not break) - Run
pytest tests/unit/mcpgateway/routers/ -v - Load test + verify
pg_stat_activity(idle-in-transaction < 50, max age < 5s)
Static Check Proposal (Prevent Regression)
- Add an AST-based linter/test that scans for
db: Session = Depends(get_db)and asserts that handlers either:- call
db.commit()anddb.close()(or on an alias), OR - are in an allowlist of exceptions (dependency factories, unused
_db, streaming endpoints).
- call
- Integrate into
make verifyor pre-commit hook. - Keep allowlist in a JSON file (e.g.,
scripts/db_session_audit_allowlist.json).
Success Criteria
- All endpoints with
Depends(get_db)explicitly close transactions before serialization/streaming - Load test: idle-in-transaction connections < 50; max age < 5s; avg age < 2s
- RPS > 2000 under 4000 concurrent users
- P99 latency < 5s
- No functional regressions; all existing tests pass
Notes
- FastAPI dependency cleanup is a safety net, not a substitute for explicit cleanup.
expire_on_commit=Falseis already configured, so committing read-only transactions is safe.- Keep the pattern consistent and avoid leaving ORM objects unmaterialized after closing sessions.