Skip to content

fix: use fresh_db_session() in list endpoints to prevent connection holding#2326

Closed
crivetimihai wants to merge 11 commits intomainfrom
fix/endpoint-handler-session-lifetime
Closed

fix: use fresh_db_session() in list endpoints to prevent connection holding#2326
crivetimihai wants to merge 11 commits intomainfrom
fix/endpoint-handler-session-lifetime

Conversation

@crivetimihai
Copy link
Copy Markdown
Member

Summary

  • Updated 6 high-traffic list endpoints in main.py to use fresh_db_session() instead of Depends(get_db), preventing database connections from being held during response serialization.

Endpoints modified:

  • list_servers
  • list_a2a_agents
  • list_tools
  • list_resources
  • list_prompts
  • list_gateways

Each endpoint now:

  1. Uses fresh_db_session() context manager for short-lived DB access
  2. Converts ORM objects to dicts (model_dump) before session closes
  3. Returns dict representations to avoid lazy loading after session close

This is a follow-up to PR #2319 (RBAC fix) to address the remaining endpoint handlers that were still holding database sessions during slow operations, causing connection pool exhaustion under high load.

Test plan

  • All 277 unit tests pass
  • Load test with 4000+ concurrent users to verify connection pool stability
  • Monitor PostgreSQL for "idle in transaction" connections during load test

Closes #2323

…olding

Updated 6 high-traffic list endpoints in main.py to use fresh_db_session()
instead of Depends(get_db), preventing database connections from being held
during response serialization.

Endpoints modified:
- list_servers
- list_a2a_agents
- list_tools
- list_resources
- list_prompts
- list_gateways

Each endpoint now:
1. Uses fresh_db_session() context manager for short-lived DB access
2. Converts ORM objects to dicts (model_dump) before session closes
3. Returns dict representations to avoid lazy loading after session close

Closes #2323

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai self-assigned this Jan 22, 2026
@crivetimihai crivetimihai added the performance Performance related items label Jan 22, 2026
Added set_fresh_session_factory() and reset_fresh_session_factory() functions
to allow tests to redirect fresh_db_session() to use a test database instead
of the production database.

Updated tests/e2e/test_main_apis.py to use this mechanism, fixing the
"no such table" errors that occurred when fresh_db_session() was introduced
in the endpoint handler fix.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Updated tests/e2e/test_oauth_protected_resource.py to also override the
fresh_db_session factory, ensuring consistency across all e2e tests that
use test databases.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Updated tests/conftest.py app and app_with_temp_db fixtures to also
  override the fresh_db_session factory, ensuring all tests that use
  these fixtures work correctly with the new endpoint handler changes.

- Fixed doctest example in set_fresh_session_factory() that was using
  undefined test_engine variable.

- Added pylint disable comments for intentional global statement usage
  in the session factory override functions.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Apply the fresh_db_session() pattern to prevent connection pool exhaustion
under high load (Issue #2323, #2330):

RPC handler (main.py):
- Remove Depends(get_db) from handle_rpc() signature
- Wrap all database operations in fresh_db_session() context managers
- For tools/call, connection is released before downstream HTTP calls
- Affects: tools/list, resources/list, prompts/list, tools/call,
  resources/read, sampling/createMessage, completion/complete, etc.

Token scoping middleware:
- Use fresh_db_session() for team membership checks
- Use fresh_db_session() for resource ownership checks
- Connection released before call_next() processes the request

docker-compose.yml:
- Increase DB_POOL_SIZE from 20 to 80 per worker
- Increase DB_MAX_OVERFLOW from 10 to 20
- Adjust PostgreSQL and pgBouncer timeout settings

This prevents the pattern where connections were held for the entire
request lifecycle (including 500-5000ms downstream HTTP calls), which
caused QueuePool exhaustion and system crashes under load.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the fix/endpoint-handler-session-lifetime branch from 9535bcc to 21ae64d Compare January 22, 2026 18:10
Load testing with 4000 Locust users revealed gateway workers being OOM
killed when container memory exceeded 8GB limit.

Memory analysis:
- Each worker uses ~350-450MB RSS baseline, spiking to 0.7-1.9GB under load
- 16 workers × 450MB = 7.2GB baseline, only 0.8GB headroom → OOM kills
- 8 workers × 450MB = 3.6GB baseline, 4.4GB headroom → safe

Changes:
- GRANIAN_WORKERS: 16 → 8
- GRANIAN_BACKPRESSURE: 128 → 256 (maintains 2048 concurrent capacity)

With 3 replicas, total worker count is 24 (was 48), which is still
sufficient for high concurrency while avoiding memory pressure.

Related: #2334, #2323, #2330
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Add comprehensive memray section covering:
- Installation instructions for local and container environments
- Local profiling with memray run
- Attaching to running processes
- Report generation (flamegraph, table, tree, stats, summary)
- Live mode for real-time monitoring
- Container profiling workflow
- Container profiling limitations and workarounds
- py-spy vs memray comparison table

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ulation

The RBAC middleware was holding database sessions for the entire request
duration via Depends(get_db), causing idle-in-transaction connections to
accumulate under high load.

Changes:
- Remove Depends(get_db) from get_current_user_with_permissions()
- Remove "db" from user context return dicts
- Update require_permission() to use fresh_db_session()
- Update require_admin_permission() to use fresh_db_session()
- Update require_any_permission() to use fresh_db_session()
- Update PermissionChecker class to use fresh_db_session()
- Update admin.py to not rely on user["db"]
- Update tests to mock fresh_db_session()

Closes #2340

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Transforms 267 endpoints from Depends(get_db) to fresh_db_session()
to prevent database session accumulation under high load.

Files modified:
- main.py (52 endpoints)
- admin.py (133 endpoints)
- routers/auth.py (1 endpoint)
- routers/email_auth.py (4 endpoints)
- routers/llm_config_router.py (1 endpoint)
- routers/llm_proxy_router.py (2 endpoints)
- routers/log_search.py (5 endpoints)
- routers/oauth_router.py (7 endpoints)
- routers/observability.py (8 endpoints)
- routers/rbac.py (12 endpoints)
- routers/reverse_proxy.py (1 endpoint)
- routers/server_well_known.py (2 endpoints)
- routers/sso.py (10 endpoints)
- routers/teams.py (15 endpoints)
- routers/tokens.py (10 endpoints)
- routers/toolops_router.py (3 endpoints)
- routers/well_known.py (1 endpoint)

Before: idle_in_tx grew to 100+ connections with 80-120s age, causing
system degradation from 2180 RPS to 300 RPS under load.

After: idle_in_tx stays at 12-34 connections with 0-17s max age,
RPS stable at 1600-2260 with 0 failures.

Closes #2335
Closes #2336

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Copy link
Copy Markdown
Member Author

Closing — superseded by other DB session optimization approaches.

bradmcnew pushed a commit to bradmcnew/mcp-context-forge that referenced this pull request Feb 18, 2026
…pool exhaustion

Replace next(get_db()) with fresh_db_session() context manager in all
three database access points within TokenScopingMiddleware. Extract
query logic into dedicated helper methods (_query_team_membership and
_query_resource_ownership) for cleaner session lifecycle management.

Previously each request held 2+ concurrent connections (middleware +
endpoint handler), causing QueuePool exhaustion under load. With this
fix the middleware connection is returned to the pool before the
endpoint handler runs.

Applies the same pattern established in PR IBM#2326 and PR IBM#2319.

Closes IBM#2330

Signed-off-by: Brad McNew <brad@axite.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance related items

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG][PERFORMANCE][DB]: Endpoint handlers hold DB sessions during slow MCP backend calls

1 participant