[BUG]: Apply fresh_db_session() to remaining 52 REST endpoints in main.py #2336

@crivetimihai

Summary

The load test crashes even without admin traffic because 52 REST endpoints in main.py still use Depends(get_db), which holds a database session for the entire request lifecycle.

Root Cause Analysis

Session Pattern Status

Pattern                        | Count | Status
with fresh_db_session() as db: | 22    | ✅ Fixed
db: Session = Depends(get_db)  | 52    | ❌ Broken

Evidence from Load Test

Without admin traffic (4000 users, ~300 RPS):

Connection States:
  idle in transaction | 227 | max_age: 219s

Stuck Queries (approaching 300s timeout):
  SELECT servers_1.*, a2a_agents.*  | 114 connections
  SELECT email_teams.*              |  38 connections  
  SELECT resources.*                |  36 connections
  SELECT prompts.*                  |  24 connections

These queries come from:

  • /servers/{id}/tools - uses Depends(get_db)
  • /tools/{id} - uses Depends(get_db)
  • Team membership checks in middleware
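
The connection-state figures above can be gathered by grouping pg_stat_activity by state. The following is a minimal sketch, not the script used for this load test: it assumes any DB-API 2.0 cursor connected to the gateway's PostgreSQL database, and the names connection_states and format_states are hypothetical.

```python
from collections import namedtuple

# Uses standard pg_stat_activity columns; groups backends of the
# current database by state and reports the oldest one per state.
STATE_SUMMARY_SQL = """
SELECT state,
       count(*) AS connections,
       COALESCE(max(EXTRACT(EPOCH FROM now() - state_change)), 0) AS max_age_s
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY connections DESC
"""

StateRow = namedtuple("StateRow", "state connections max_age_s")


def connection_states(cursor):
    """Run the summary query on a DB-API cursor and return StateRow tuples."""
    cursor.execute(STATE_SUMMARY_SQL)
    return [StateRow(state, int(n), float(age)) for state, n, age in cursor.fetchall()]


def format_states(rows):
    """Render rows in the 'state | count | max_age' layout used in this issue."""
    return "\n".join(
        f"{r.state or 'unknown':<20}| {r.connections:>3} | max_age: {r.max_age_s:.0f}s"
        for r in rows
    )
```

Polling this every few seconds during a run should show whether the idle in transaction count keeps climbing or stays flat.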

Broken Endpoints (52 total)

Get Single Item (6 endpoints)

  • get_server(server_id) - /servers/{server_id}
  • get_tool(tool_id) - /tools/{tool_id}
  • get_resource_info(resource_id) - /resources/{resource_id}
  • get_prompt_no_args(prompt_id) - /prompts/{prompt_id}
  • get_gateway(gateway_id) - /gateways/{gateway_id}
  • get_a2a_agent(agent_id) - /a2a/{agent_id}

Server Sub-Resources (3 endpoints) - HIGH PRIORITY

These are called by Locust with high frequency:

  • server_get_tools(server_id) - /servers/{server_id}/tools
  • server_get_resources(server_id) - /servers/{server_id}/resources
  • server_get_prompts(server_id) - /servers/{server_id}/prompts

State Change Operations (12 endpoints)

  • set_server_state, toggle_server_status
  • set_tool_state, toggle_tool_status
  • set_resource_state, toggle_resource_status
  • set_prompt_state, toggle_prompt_status
  • set_gateway_state, toggle_gateway_status
  • set_a2a_agent_state, toggle_a2a_agent_status

CRUD Operations (12+ endpoints)

  • create_server, update_server, delete_server
  • create_tool, update_tool, delete_tool
  • create_resource, delete_resource
  • create_prompt, delete_prompt
  • register_gateway, delete_gateway
  • create_a2a_agent, update_a2a_agent, delete_a2a_agent, invoke_a2a_agent

Protocol Handlers (2 endpoints)

  • handle_completion - /completion/complete
  • handle_sampling - /sampling/createMessage

Other Endpoints

  • read_resource - /resources/{resource_id}/read
  • list_resource_templates
  • get_metrics, reset_metrics
  • Export/import endpoints

Why This Causes Crashes

  1. High-frequency endpoints like /servers/{id}/tools use Depends(get_db)
  2. Sessions are held during response serialization and network transmission
  3. 227+ sessions accumulate as idle in transaction
  4. The oldest sessions (219s in this run) approach PostgreSQL's 300s idle_in_transaction_session_timeout
  5. Memory builds up from held sessions → OOM kills
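
The ordering problem in steps 1 and 2 can be illustrated without a database. This is a simplified simulation, not the gateway's code: tracked_session and send_response are hypothetical stand-ins, and the request-scoped variant only models the behavior this issue attributes to Depends(get_db).

```python
from contextlib import contextmanager

events = []


@contextmanager
def tracked_session():
    """Hypothetical stand-in for a DB session; records open and close."""
    events.append("session opened")
    try:
        yield "db"
    finally:
        events.append("session closed")


def send_response(payload):
    """Stand-in for response serialization plus network transmission."""
    events.append("response sent")
    return payload


def request_scoped():
    # Models the request-scoped pattern: the session wraps the whole
    # request, including sending the response.
    with tracked_session() as db:
        result = [f"row from {db}"]
        return send_response(result)


def block_scoped():
    # Models the fresh_db_session() pattern: the session is closed
    # before the response leaves the handler.
    with tracked_session() as db:
        result = [f"row from {db}"]
    return send_response(result)


def run(handler):
    """Run one handler and return the ordered list of recorded events."""
    events.clear()
    handler()
    return list(events)
```

run(request_scoped) records the close after the response is sent, while run(block_scoped) records it before. With slow clients under load, that gap is where sessions sit idle in transaction.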

Proposed Fix

Apply the same pattern used for list endpoints:

Before:

@server_router.get("/{server_id}/tools")
async def server_get_tools(
    server_id: str,
    db: Session = Depends(get_db),  # Held until response sent
    user=Depends(get_current_user_with_permissions),
):
    tools = await tool_service.list_server_tools(db, server_id)
    return [tool.model_dump(by_alias=True) for tool in tools]

After:

@server_router.get("/{server_id}/tools")
async def server_get_tools(
    server_id: str,
    user=Depends(get_current_user_with_permissions),
):
    with fresh_db_session() as db:  # Released immediately after block
        tools = await tool_service.list_server_tools(db, server_id)
        result = [tool.model_dump(by_alias=True) for tool in tools]
    return result  # Response sent after session closed
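
This issue does not show how fresh_db_session() is implemented. As a rough sketch of what such a helper typically does, the hypothetical short_lived_session below commits on success, rolls back on error, and always closes. FakeSession stands in for a real SQLAlchemy Session so the example runs on its own, and session_factory is an assumed parameter rather than part of the project's API.

```python
from contextlib import contextmanager


class FakeSession:
    """Hypothetical stand-in for a SQLAlchemy Session; records calls made on it."""

    def __init__(self):
        self.calls = []

    def commit(self):
        self.calls.append("commit")

    def rollback(self):
        self.calls.append("rollback")

    def close(self):
        self.calls.append("close")


@contextmanager
def short_lived_session(session_factory):
    """Open a session, commit on success, roll back on error, always close."""
    session = session_factory()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


# Usage: the session is closed as soon as the block exits.
session = FakeSession()
with short_lived_session(lambda: session) as db:
    pass  # queries would run here
```

Whatever the real helper looks like, the handler has to turn ORM objects into plain data inside the block, as the model_dump() call in the After example does, because attributes that load lazily may fail once the session is closed.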

Priority Order

  1. Critical: Server sub-resources (/servers/{id}/tools, etc.) - highest load
  2. High: Get single item endpoints (/tools/{id}, /servers/{id}, etc.)
  3. Medium: Protocol handlers, state changes
  4. Lower: CRUD operations (less frequent)

Related Issues

Acceptance Criteria

  • All 52 Depends(get_db) usages in main.py converted to fresh_db_session()
  • No idle in transaction buildup during load testing without admin traffic
  • Load test runs stable for 10+ minutes with 4000 users
  • All existing tests pass
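
Progress on the first criterion can be tracked by counting both patterns in the source text. This is a rough sketch with a hypothetical helper name. It matches on text with regular expressions, so occurrences inside comments or strings are counted too, and a dependency aliased under another name would be missed.

```python
import re
from collections import Counter

# Patterns from the Session Pattern Status table in this issue.
PATTERNS = {
    "Depends(get_db)": re.compile(r"Depends\(\s*get_db\s*\)"),
    "fresh_db_session()": re.compile(r"\bwith\s+fresh_db_session\(\s*\)"),
}


def count_session_patterns(source: str) -> Counter:
    """Count occurrences of each session pattern in Python source text."""
    return Counter({name: len(rx.findall(source)) for name, rx in PATTERNS.items()})
```

Calling count_session_patterns(Path("main.py").read_text()) with the repository's actual path to main.py should report 0 for Depends(get_db) once the conversion is complete.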

Labels

  • SHOULD - P2: Important but not vital; high-value items that are not crucial for the immediate release
  • bug - Something isn't working
  • database
  • python - Python / backend development (FastAPI)
