[BUG]: Apply fresh_db_session() to remaining 271 endpoints using Depends(get_db) #2334

@crivetimihai

Summary

Load testing with 4000 Locust users shows that 271 endpoints still use `Depends(get_db)`, which holds a database session for the entire request lifecycle. This causes:

  1. OOM kills - Gateway workers exceed 8GB container limit under load
  2. Idle-in-transaction buildup - 120+ connections stuck for 60-290 seconds
  3. Memory pressure - Held sessions + pending responses consume RAM
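The difference in hold time can be illustrated without any framework. The sketch below is a simulation only: `fake_session`, `dependency_style`, and `scoped_style` are hypothetical names, and the exact point at which FastAPI tears down a `Depends` generator may differ from this simplified ordering.

```python
from contextlib import contextmanager

events = []  # ordered log of what happened during one simulated request


@contextmanager
def fake_session():
    """Hypothetical stand-in for a database session; only records open/close."""
    events.append("open")
    try:
        yield "session"
    finally:
        events.append("close")


def dependency_style():
    """Session is owned by the caller and outlives the handler's work."""
    with fake_session() as db:
        events.append(f"query via {db}")
        events.append("send_response")  # session is still open here


def scoped_style():
    """Session is owned by the handler and closed before the response goes out."""
    with fake_session() as db:
        events.append(f"query via {db}")
    events.append("send_response")  # session was closed one step earlier


def run(style):
    """Run one simulated request and return a copy of its event log."""
    events.clear()
    style()
    return list(events)


print(run(dependency_style))
print(run(scoped_style))
```

In the first log `close` comes after `send_response`; in the second it comes before. Under load, that gap multiplied across thousands of pending responses is what keeps connections checked out.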

Evidence

From profiling RCA (todo/claude-rca-2026-01-22-profiling.md):

Memory cgroup out of memory: Killed process 1172024 (mcpgateway work)
total-vm:7470296kB, anon-rss:1102856kB

Stuck queries in pg_stat_activity:

  • `SELECT tools.*` - stuck 64-221 seconds
  • `SELECT email_teams.*` - stuck 69-215 seconds
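A check like the one above could be scripted roughly as follows. This is a sketch, not the query used in the RCA: it assumes a psycopg-style DB-API cursor (`%s` placeholders) connected to PostgreSQL, and `find_stuck_sessions` is a hypothetical helper name.

```python
# Assumption: psycopg-style placeholders; the RCA's actual query is not shown in this issue.
STUCK_SESSIONS_SQL = """
SELECT pid,
       EXTRACT(EPOCH FROM (now() - state_change)) AS idle_seconds,
       left(query, 80) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - state_change > make_interval(secs => %s)
ORDER BY idle_seconds DESC
"""


def find_stuck_sessions(cursor, min_idle_seconds=60):
    """Return idle-in-transaction backends that have been idle longer than the threshold."""
    cursor.execute(STUCK_SESSIONS_SQL, (min_idle_seconds,))
    return [
        {"pid": pid, "idle_seconds": float(idle), "last_query": query}
        for pid, idle, query in cursor.fetchall()
    ]
```

Note that `query` in `pg_stat_activity` shows the last statement the backend ran, so a `SELECT tools.*` entry means the transaction was left open after that statement finished, not that the statement itself is still executing.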

Endpoints to Fix (Priority Order)

High Priority (main.py - 52 endpoints)

```bash
grep -n "Depends(get_db)" mcpgateway/main.py | wc -l
# 52 endpoints
```

Key endpoints:

  • `/tools`, `/tools/{id}` - CRUD operations
  • `/servers`, `/servers/{id}` - CRUD operations
  • `/resources`, `/resources/{id}` - CRUD operations
  • `/prompts`, `/prompts/{id}` - CRUD operations
  • `/health`, `/ready` - Health checks

Medium Priority (admin.py)

  • `/admin/` - Heavy HTML template rendering (5-7s response times)
  • `/admin/tools/partial` - HTMX partials
  • `/admin/resources/partial` - HTMX partials
  • `/admin/prompts/partial` - HTMX partials

Lower Priority (routers/)

  • `mcpgateway/routers/tokens.py` - 10 endpoints
  • `mcpgateway/routers/sso.py` - 9 endpoints
  • `mcpgateway/routers/oauth_router.py` - 7 endpoints
  • Other routers
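To track the remaining work per file, the counts above could be regenerated with a small script. This is a sketch under the assumption that a plain text search is sufficient; `count_get_db_usages` is a hypothetical helper, and it counts occurrences rather than matching lines, so it can differ from `grep | wc -l` when a single line contains the pattern twice.

```python
from collections import Counter
from pathlib import Path

PATTERN = "Depends(get_db)"


def count_get_db_usages(root):
    """Return a Counter of {relative_path: occurrences} for .py files under root."""
    root = Path(root)
    counts = Counter()
    for path in sorted(root.rglob("*.py")):
        hits = path.read_text(encoding="utf-8", errors="ignore").count(PATTERN)
        if hits:
            counts[path.relative_to(root).as_posix()] = hits
    return counts


if __name__ == "__main__":
    counts = count_get_db_usages("mcpgateway")
    # Largest offenders first, then the overall total.
    for name, hits in counts.most_common():
        print(f"{hits:4d}  {name}")
    print(f"{sum(counts.values()):4d}  total")
```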

Pattern to Apply

Replace:
```python
@router.get("/tools")
async def list_tools(db: Session = Depends(get_db)):
    # The injected session stays open for the whole request lifecycle.
    tools = tool_service.list_tools(db)
    return tools
```

With:
```python
@router.get("/tools")
async def list_tools():
    # Hold the session only while querying and serializing.
    with fresh_db_session() as db:
        tools = tool_service.list_tools(db)
        return [t.model_dump() for t in tools]  # Serialize inside context
```
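For readers unfamiliar with the helper, a context manager of this kind typically looks something like the sketch below. This is not the project's actual `fresh_db_session()` implementation, which is not shown in this issue; `fresh_db_session_sketch` and its `session_factory` parameter are assumptions, and the session object is only expected to provide `commit()`, `rollback()`, and `close()` as a SQLAlchemy `Session` does.

```python
from contextlib import contextmanager


@contextmanager
def fresh_db_session_sketch(session_factory):
    """Yield a short-lived session; commit on success, roll back on error, always close."""
    db = session_factory()  # assumption: a callable returning a new session
    try:
        yield db
        db.commit()
    except Exception:
        db.rollback()
        raise
    finally:
        db.close()  # releases the connection as soon as the block exits
```

Serializing inside the `with` block matters because ORM objects may be expired or detached once the session is committed and closed, so attribute access after the block can fail or trigger further queries.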

Acceptance Criteria

  • All 271 `Depends(get_db)` usages replaced with `fresh_db_session()`
  • Unit tests pass
  • Load test (4000 users) runs for 10+ minutes without OOM kills
  • No `idle in transaction` connections older than 60 seconds

Related Issues

References

  • `todo/claude-rca-2026-01-22-profiling.md` - Full profiling report
  • `todo/rca-part-2.md` - Initial RCA
  • `docs/docs/development/profiling.md` - Profiling guide

Metadata

Labels

  • SHOULD - P2: Important but not vital; high-value items that are not crucial for the immediate release
  • bug - Something isn't working
  • database
  • python - Python / backend development (FastAPI)
