fix: use fresh_db_session() in list endpoints to prevent connection holding by crivetimihai · Pull Request #2326 · IBM/mcp-context-forge

crivetimihai · 2026-01-22T10:05:58Z

Summary

Updated 6 high-traffic list endpoints in main.py to use fresh_db_session() instead of Depends(get_db), preventing database connections from being held during response serialization.

Endpoints modified:

list_servers
list_a2a_agents
list_tools
list_resources
list_prompts
list_gateways

Each endpoint now:

Uses fresh_db_session() context manager for short-lived DB access
Converts ORM objects to dicts (model_dump) before session closes
Returns dict representations to avoid lazy loading after session close

This is a follow-up to PR #2319 (RBAC fix) to address the remaining endpoint handlers that were still holding database sessions during slow operations, causing connection pool exhaustion under high load.

Test plan

All 277 unit tests pass
Load test with 4000+ concurrent users to verify connection pool stability
Monitor PostgreSQL for "idle in transaction" connections during load test

Closes #2323

…olding Updated 6 high-traffic list endpoints in main.py to use fresh_db_session() instead of Depends(get_db), preventing database connections from being held during response serialization. Endpoints modified: - list_servers - list_a2a_agents - list_tools - list_resources - list_prompts - list_gateways Each endpoint now: 1. Uses fresh_db_session() context manager for short-lived DB access 2. Converts ORM objects to dicts (model_dump) before session closes 3. Returns dict representations to avoid lazy loading after session close Closes #2323 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Added set_fresh_session_factory() and reset_fresh_session_factory() functions to allow tests to redirect fresh_db_session() to use a test database instead of the production database. Updated tests/e2e/test_main_apis.py to use this mechanism, fixing the "no such table" errors that occurred when fresh_db_session() was introduced in the endpoint handler fix. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Updated tests/e2e/test_oauth_protected_resource.py to also override the fresh_db_session factory, ensuring consistency across all e2e tests that use test databases. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

- Updated tests/conftest.py app and app_with_temp_db fixtures to also override the fresh_db_session factory, ensuring all tests that use these fixtures work correctly with the new endpoint handler changes. - Fixed doctest example in set_fresh_session_factory() that was using undefined test_engine variable. - Added pylint disable comments for intentional global statement usage in the session factory override functions. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Apply the fresh_db_session() pattern to prevent connection pool exhaustion under high load (Issue #2323, #2330): RPC handler (main.py): - Remove Depends(get_db) from handle_rpc() signature - Wrap all database operations in fresh_db_session() context managers - For tools/call, connection is released before downstream HTTP calls - Affects: tools/list, resources/list, prompts/list, tools/call, resources/read, sampling/createMessage, completion/complete, etc. Token scoping middleware: - Use fresh_db_session() for team membership checks - Use fresh_db_session() for resource ownership checks - Connection released before call_next() processes the request docker-compose.yml: - Increase DB_POOL_SIZE from 20 to 80 per worker - Increase DB_MAX_OVERFLOW from 10 to 20 - Adjust PostgreSQL and pgBouncer timeout settings This prevents the pattern where connections were held for the entire request lifecycle (including 500-5000ms downstream HTTP calls), which caused QueuePool exhaustion and system crashes under load. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Load testing with 4000 Locust users revealed gateway workers being OOM killed when container memory exceeded 8GB limit. Memory analysis: - Each worker uses ~350-450MB RSS baseline, spiking to 0.7-1.9GB under load - 16 workers × 450MB = 7.2GB baseline, only 0.8GB headroom → OOM kills - 8 workers × 450MB = 3.6GB baseline, 4.4GB headroom → safe Changes: - GRANIAN_WORKERS: 16 → 8 - GRANIAN_BACKPRESSURE: 128 → 256 (maintains 2048 concurrent capacity) With 3 replicas, total worker count is 24 (was 48), which is still sufficient for high concurrency while avoiding memory pressure. Related: #2334, #2323, #2330 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Add comprehensive memray section covering: - Installation instructions for local and container environments - Local profiling with memray run - Attaching to running processes - Report generation (flamegraph, table, tree, stats, summary) - Live mode for real-time monitoring - Container profiling workflow - Container profiling limitations and workarounds - py-spy vs memray comparison table Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

…ulation The RBAC middleware was holding database sessions for the entire request duration via Depends(get_db), causing idle-in-transaction connections to accumulate under high load. Changes: - Remove Depends(get_db) from get_current_user_with_permissions() - Remove "db" from user context return dicts - Update require_permission() to use fresh_db_session() - Update require_admin_permission() to use fresh_db_session() - Update require_any_permission() to use fresh_db_session() - Update PermissionChecker class to use fresh_db_session() - Update admin.py to not rely on user["db"] - Update tests to mock fresh_db_session() Closes #2340 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Transforms 267 endpoints from Depends(get_db) to fresh_db_session() to prevent database session accumulation under high load. Files modified: - main.py (52 endpoints) - admin.py (133 endpoints) - routers/auth.py (1 endpoint) - routers/email_auth.py (4 endpoints) - routers/llm_config_router.py (1 endpoint) - routers/llm_proxy_router.py (2 endpoints) - routers/log_search.py (5 endpoints) - routers/oauth_router.py (7 endpoints) - routers/observability.py (8 endpoints) - routers/rbac.py (12 endpoints) - routers/reverse_proxy.py (1 endpoint) - routers/server_well_known.py (2 endpoints) - routers/sso.py (10 endpoints) - routers/teams.py (15 endpoints) - routers/tokens.py (10 endpoints) - routers/toolops_router.py (3 endpoints) - routers/well_known.py (1 endpoint) Before: idle_in_tx grew to 100+ connections with 80-120s age, causing system degradation from 2180 RPS to 300 RPS under load. After: idle_in_tx stays at 12-34 connections with 0-17s max age, RPS stable at 1600-2260 with 0 failures. Closes #2335 Closes #2336 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

crivetimihai · 2026-02-15T12:39:25Z

Closing — superseded by other DB session optimization approaches.

…pool exhaustion Replace next(get_db()) with fresh_db_session() context manager in all three database access points within TokenScopingMiddleware. Extract query logic into dedicated helper methods (_query_team_membership and _query_resource_ownership) for cleaner session lifecycle management. Previously each request held 2+ concurrent connections (middleware + endpoint handler), causing QueuePool exhaustion under load. With this fix the middleware connection is returned to the pool before the endpoint handler runs. Applies the same pattern established in PR IBM#2326 and PR IBM#2319. Closes IBM#2330 Signed-off-by: Brad McNew <brad@axite.ai>

crivetimihai self-assigned this Jan 22, 2026

crivetimihai added the performance Performance related items label Jan 22, 2026

crivetimihai requested review from kevalmahajan and madhav165 as code owners January 22, 2026 10:35

crivetimihai added 2 commits January 22, 2026 10:43

crivetimihai mentioned this pull request Jan 22, 2026

[BUG][PERFORMANCE]: TokenScopingMiddleware causes connection pool exhaustion under load #2330

Open

crivetimihai force-pushed the fix/endpoint-handler-session-lifetime branch from 9535bcc to 21ae64d Compare January 22, 2026 18:10

crivetimihai added 6 commits January 22, 2026 18:42

Update compose

3032623

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

Messy test fixes - not all done

c1e0f31

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

crivetimihai modified the milestones: Release 1.0.0-RC1, Release 1.0.0-GA Jan 24, 2026

crivetimihai modified the milestones: Release 1.0.0-GA, Release 1.1.0 Feb 15, 2026

crivetimihai closed this Feb 15, 2026

crivetimihai modified the milestones: Release 1.1.0, Release 1.0.0-RC1 Feb 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: use fresh_db_session() in list endpoints to prevent connection holding#2326

fix: use fresh_db_session() in list endpoints to prevent connection holding#2326
crivetimihai wants to merge 11 commits intomainfrom
fix/endpoint-handler-session-lifetime

crivetimihai commented Jan 22, 2026

Uh oh!

crivetimihai commented Feb 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

crivetimihai commented Jan 22, 2026

Summary

Test plan

Uh oh!

crivetimihai commented Feb 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant