Skip to content

[BUG]: Apply fresh_db_session() to admin.py endpoints (135 usages) #2335

@crivetimihai

Description

@crivetimihai

Summary

The admin UI (mcpgateway/admin.py) holds database sessions for entire request lifecycles via Depends(get_db), causing:

  • 245+ idle-in-transaction connections during load testing
  • OOM kills when memory pressure from held sessions exceeds container limits
  • idle transaction timeout errors after 300 seconds
  • Cascading failures that crash the gateway within ~2 minutes under load

Root Cause

The Problem with Depends(get_db)

FastAPI's dependency injection holds sessions until after the response is fully sent:

Request arrives
  → Depends(get_db) creates session
  → Endpoint runs queries
  → Template rendering begins (100ms - 5s)
  → Response streaming to client
  → ONLY NOW: dependency cleanup runs, session closes

For admin UI endpoints that render HTML templates, sessions are held for:

  • Query execution time
  • Template rendering time (~100ms for simple, 1-5s for complex)
  • Response transmission time (depends on client speed)

The Scale of the Problem

Metric Value
Total Depends(get_db) usages 135
Admin.py file size 18,299 lines
GET endpoints 116
POST endpoints 62
HTMX partial endpoints ~20
Endpoints with db.commit() workaround 47

High-Risk Endpoints (HTMX Partials)

These are called repeatedly for pagination, infinite scroll, and UI updates:

Line Function db.commit() workaround?
836 get_overview_partial ❌ No
1531 admin_servers_partial_html ✅ Yes
2908 admin_ui (main page) ❌ No
4302 admin_teams_partial_html ❌ No
6224 admin_users_partial_html ✅ Yes
7150 admin_tools_partial_html ✅ Yes
7643 admin_prompts_partial_html ✅ Yes
7836 admin_gateways_partial_html ✅ Yes
8308 admin_resources_partial_html ✅ Yes
8887 admin_a2a_partial_html ✅ Yes
12297 admin_metrics_partial_html ❌ No

Why db.commit() Workaround Isn't Enough

Some endpoints call db.commit() before template rendering:

# End the read-only transaction before template rendering
db.commit()
return templates.TemplateResponse(...)

This releases the PostgreSQL transaction but the SQLAlchemy Session object remains in memory until Depends(get_db)'s cleanup runs after the response is sent. Under load:

  • Session objects consume memory
  • Connection pool slots remain checked out
  • PgBouncer can't reassign the backend connection

Evidence from Load Testing

Connection Buildup

state               | count | max_age_s
--------------------+-------+-----------
idle in transaction |   245 |        40
idle                |    56 |         1
active              |     6 |         3

Stuck Queries (All from admin endpoints)

SELECT email_teams.* FROM email_teams WHERE is_active ORDER BY name  -- 87 connections
SELECT servers_1.*, tools.* ...                                       -- 19 connections
SELECT servers_1.*, a2a_agents.* ...                                  -- 18 connections

OOM Kills (from dmesg)

Memory cgroup out of memory: Killed process 1171829 (mcpgateway work)
total-vm:6606356kB, anon-rss:729012kB

Proposed Fix

Pattern: Replace Depends(get_db) with fresh_db_session()

Before:

@admin_router.get("/overview/partial")
async def get_overview_partial(
    request: Request,
    db: Session = Depends(get_db),  # Held until response sent
    user=Depends(get_current_user_with_permissions),
) -> HTMLResponse:
    servers_total = db.query(func.count(DbServer.id)).scalar()
    tools_total = db.query(func.count(DbTool.id)).scalar()
    # ... more queries ...
    return templates.TemplateResponse("overview.html", context)

After:

@admin_router.get("/overview/partial")
async def get_overview_partial(
    request: Request,
    user=Depends(get_current_user_with_permissions),
) -> HTMLResponse:
    with fresh_db_session() as db:  # Released immediately after block
        servers_total = db.query(func.count(DbServer.id)).scalar()
        tools_total = db.query(func.count(DbTool.id)).scalar()
        # ... more queries ...
        context = {"servers_total": servers_total, ...}
    # Session is now CLOSED
    return templates.TemplateResponse("overview.html", context)

Migration Priority

  1. Phase 1 (Critical): HTMX partial endpoints - called most frequently

    • get_overview_partial
    • admin_*_partial_html (12 endpoints)
  2. Phase 2 (High): Main pages and CRUD operations

    • admin_ui (main dashboard)
    • admin_add_*, admin_delete_*, admin_update_*
  3. Phase 3 (Medium): State changes and utility endpoints

    • admin_set_*_state
    • Import/export endpoints
  4. Phase 4 (Low): Remaining endpoints

    • Login, password change
    • Plugin management

Considerations

  1. Services that take db parameter: Most services like TeamManagementService(db) store the session. Pass fresh session each time:

    with fresh_db_session() as db:
        team_service = TeamManagementService(db)
        teams = await team_service.list_teams(...)
  2. Multiple service calls: Group all DB work in one with block when possible:

    with fresh_db_session() as db:
        servers = await server_service.list_servers(db, ...)
        tools = await tool_service.list_tools(db, ...)
        context = {"servers": servers, "tools": tools}
    return templates.TemplateResponse("admin.html", context)
  3. Write operations: Ensure proper commit/rollback handling (already done by fresh_db_session)

Related Issues

Acceptance Criteria

  • All 135 Depends(get_db) usages in admin.py replaced with fresh_db_session()
  • No idle in transaction connections during admin UI usage
  • No OOM kills during load testing with admin UI traffic
  • All existing admin UI tests pass
  • Load test shows stable memory under admin traffic (target: <3GB per gateway)

Metadata

Metadata

Assignees

Labels

SHOULDP2: Important but not vital; high-value items that are not crucial for the immediate releasebugSomething isn't workingdatabasepythonPython / backend development (FastAPI)

Type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions