Skip to content

[EPIC][PERFORMANCE]: Database Session Management - Eliminate Transaction Leaks Under Load #2660

@crivetimihai

Description

@crivetimihai

Goal

Eliminate database transaction leaks that cause connection pool exhaustion and idle-in-transaction connections under high load. Ensure every endpoint that uses Depends(get_db) explicitly releases the transaction before response serialization or streaming.

Current Status (as of 2026-02-03)

A full audit of mcpgateway/ identified 354 handlers using Depends(get_db):

  • Missing both db.commit(); db.close(): 199
  • Missing db.close() (commit present): 21
  • Already correct: 108
  • Exceptions: 26 (unused _db params or dependency factory: middleware/rbac.py:get_permission_service)

Hotspots:

  • mcpgateway/admin.py: 112 missing both + 18 missing close
  • mcpgateway/main.py: 39 missing both
  • mcpgateway/routers/llm_admin_router.py: 11 missing both
  • mcpgateway/routers/observability.py: 8 missing both
  • mcpgateway/routers/email_auth.py: 6 missing both + 1 missing close
  • mcpgateway/routers/oauth_router.py: 5 missing both + 1 missing close
  • mcpgateway/routers/log_search.py: 5 missing both
  • mcpgateway/routers/sso.py: 4 missing both + 1 missing close
  • mcpgateway/routers/toolops_router.py: 3 missing both
  • mcpgateway/routers/llm_proxy_router.py: 2 missing both
  • mcpgateway/routers/server_well_known.py: 2 missing both
  • mcpgateway/routers/auth.py: 1 missing both
  • mcpgateway/routers/well_known.py: 1 missing both

Full audit report: todo/db-audit-results.md.

Why This Matters

Load tests (4000 concurrent users) previously revealed:

  • 150+ connections stuck in idle in transaction for up to 28 seconds
  • RPS drop from 2000+ to ~1414
  • P99 latency spiking to ~30s on Teams/Resources endpoints

The root cause is deferring transaction cleanup to FastAPI dependency teardown, which happens after response serialization.


✅ Desired Pattern (Mandatory)

# 1) Perform DB ops
# 2) Materialize response data
# 3) db.commit(); db.close()
# 4) return response

Avoid returning ORM objects directly after closing. Convert to Pydantic models or dicts first.


Manual Nuance Review (All Missing Endpoints)

Streaming / generator responses (must close before stream starts)

  • routers/llm_proxy_router.py:chat_completions — resolve provider/model, close session, then stream.
  • routers/observability.py:export_tracesndjson branch returns StreamingResponse; materialize trace data first, then close.

TemplateResponse endpoints (close before template render)

  • admin.py: get_overview_partial, admin_servers_partial_html, admin_ui, admin_teams_partial_html, admin_list_teams, admin_users_partial_html, admin_team_members_partial_html, admin_team_non_members_partial_html, admin_tools_partial_html, admin_tool_ops_partial, admin_prompts_partial_html, admin_gateways_partial_html, admin_resources_partial_html, admin_a2a_partial_html, admin_metrics_partial_html, catalog_partial, get_system_stats, get_performance_stats
  • routers/llm_admin_router.py: get_providers_partial, get_models_partial, set_provider_state_html, set_model_state_html, get_api_info_partial

RedirectResponse endpoints (multi-return branches; unify commit+close)

  • admin.py: admin_set_server_state, admin_delete_server, admin_set_gateway_state, admin_login_handler, change_password_required_handler, admin_update_team, admin_delete_tool, admin_set_tool_state, admin_delete_gateway, admin_delete_resource, admin_set_resource_state, admin_delete_prompt, admin_set_prompt_state, admin_set_a2a_agent_state, admin_delete_a2a_agent
  • routers/oauth_router.py: initiate_oauth_flow
  • routers/sso.py: handle_sso_callback

ORM-to-response conversion before close

  • routers/observability.py endpoints return ORM objects; must materialize response models/dicts before closing.

Long-running external calls while DB open

  • routers/toolops_router.py (ToolOps calls), routers/oauth_router.py:initiate_oauth_flow (DCR), routers/sso.py:handle_sso_callback (OAuth exchange).

Concrete Fix Plan (By Category)

Category A: Standard JSON/Pydantic responses

  1. Build response payloads (lists/dicts/Pydantic objects) immediately after DB queries.
  2. Insert db.commit(); db.close() just before return.
  3. Ensure every early-return branch converges through the same commit+close path.
    • Examples: main.py list/read routes, routers/log_search.py, routers/server_well_known.py.

Category B: TemplateResponse (admin + llm_admin)

  1. Build template context dicts fully (no ORM objects).
  2. db.commit(); db.close() before TemplateResponse(...).
  3. Confirm templates only use plain data.

Category C: RedirectResponse / multi-branch control flow

  1. Convert to “build response → commit/close → return”.
  2. Ensure all branches (success/error/early returns) share cleanup.

Category D: Streaming/Generator responses

  1. Extract DB-dependent data first, close session, then stream.
  2. If generator must access DB, use fresh_db_session() inside generator with try/finally.

Category E: Long-running external calls while DB open

  1. Extract DB data to locals.
  2. Close session before external calls.
  3. Re-open session only for DB updates afterwards.

Category F: Exceptions/Unused DB params

  1. Rename unused db to _db or remove Depends(get_db) where safe.
  2. Keep middleware/rbac.py:get_permission_service allowlisted.

Implementation Tasks (Updated)

Phase 0: Guardrails

  • Add AST-based static check (see “Static Check Proposal” below)
  • Add allowlist for intentional exceptions (dependency factories, streaming endpoints)

Phase 1: Core API (mcpgateway/main.py)

  • Fix all remaining list/read/create/update/delete routes missing explicit commit/close
  • Pay special attention to RPC handlers and multi-return branches

Phase 2: Admin UI/API (mcpgateway/admin.py)

  • Fix all TemplateResponse endpoints (close before render)
  • Fix RedirectResponse endpoints with unified cleanup
  • Fix JSON/ORJSON responses with early returns

Phase 3: Routers

  • routers/observability.py (materialize results before close, streaming NDJSON)
  • routers/oauth_router.py (DCR path + unified cleanup)
  • routers/sso.py (callback flow)
  • routers/log_search.py
  • routers/llm_admin_router.py (partials)
  • routers/llm_proxy_router.py (streaming path)
  • routers/toolops_router.py (external calls)
  • routers/server_well_known.py, routers/well_known.py, routers/auth.py, routers/email_auth.py

Phase 4: Tests & Verification

  • Update unit tests that call routers directly (ensure db=mock_db and new cleanup does not break)
  • Run pytest tests/unit/mcpgateway/routers/ -v
  • Load test + verify pg_stat_activity (idle-in-transaction < 50, max age < 5s)

Static Check Proposal (Prevent Regression)

  • Add an AST-based linter/test that scans for db: Session = Depends(get_db) and asserts that handlers either:
    • call db.commit() and db.close() (or on an alias), OR
    • are in an allowlist of exceptions (dependency factories, unused _db, streaming endpoints).
  • Integrate into make verify or pre-commit hook.
  • Keep allowlist in a JSON file (e.g., scripts/db_session_audit_allowlist.json).

Success Criteria

  • All endpoints with Depends(get_db) explicitly close transactions before serialization/streaming
  • Load test: idle-in-transaction connections < 50; max age < 5s; avg age < 2s
  • RPS > 2000 under 4000 concurrent users
  • P99 latency < 5s
  • No functional regressions; all existing tests pass

Notes

  • FastAPI dependency cleanup is a safety net, not a substitute for explicit cleanup.
  • expire_on_commit=False is already configured, so committing read-only transactions is safe.
  • Keep the pattern consistent and avoid leaving ORM objects unmaterialized after closing sessions.

Metadata

Metadata

Assignees

Labels

MUSTP1: Non-negotiable, critical requirements without which the product is non-functional or unsafebugSomething isn't workingenhancementNew feature or requestepicLarge feature spanning multiple issuesperformancePerformance related itemspythonPython / backend development (FastAPI)readyValidated, ready-to-work-on items

Type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions