fix: use fresh_db_session() in list endpoints to prevent connection holding#2326
Closed
crivetimihai wants to merge 11 commits intomainfrom
Closed
fix: use fresh_db_session() in list endpoints to prevent connection holding#2326crivetimihai wants to merge 11 commits intomainfrom
crivetimihai wants to merge 11 commits intomainfrom
Conversation
…olding Updated 6 high-traffic list endpoints in main.py to use fresh_db_session() instead of Depends(get_db), preventing database connections from being held during response serialization. Endpoints modified: - list_servers - list_a2a_agents - list_tools - list_resources - list_prompts - list_gateways Each endpoint now: 1. Uses fresh_db_session() context manager for short-lived DB access 2. Converts ORM objects to dicts (model_dump) before session closes 3. Returns dict representations to avoid lazy loading after session close Closes #2323 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Added set_fresh_session_factory() and reset_fresh_session_factory() functions to allow tests to redirect fresh_db_session() to use a test database instead of the production database. Updated tests/e2e/test_main_apis.py to use this mechanism, fixing the "no such table" errors that occurred when fresh_db_session() was introduced in the endpoint handler fix. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Updated tests/e2e/test_oauth_protected_resource.py to also override the fresh_db_session factory, ensuring consistency across all e2e tests that use test databases. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Updated tests/conftest.py app and app_with_temp_db fixtures to also override the fresh_db_session factory, ensuring all tests that use these fixtures work correctly with the new endpoint handler changes. - Fixed doctest example in set_fresh_session_factory() that was using undefined test_engine variable. - Added pylint disable comments for intentional global statement usage in the session factory override functions. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Apply the fresh_db_session() pattern to prevent connection pool exhaustion under high load (Issue #2323, #2330): RPC handler (main.py): - Remove Depends(get_db) from handle_rpc() signature - Wrap all database operations in fresh_db_session() context managers - For tools/call, connection is released before downstream HTTP calls - Affects: tools/list, resources/list, prompts/list, tools/call, resources/read, sampling/createMessage, completion/complete, etc. Token scoping middleware: - Use fresh_db_session() for team membership checks - Use fresh_db_session() for resource ownership checks - Connection released before call_next() processes the request docker-compose.yml: - Increase DB_POOL_SIZE from 20 to 80 per worker - Increase DB_MAX_OVERFLOW from 10 to 20 - Adjust PostgreSQL and pgBouncer timeout settings This prevents the pattern where connections were held for the entire request lifecycle (including 500-5000ms downstream HTTP calls), which caused QueuePool exhaustion and system crashes under load. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
9535bcc to
21ae64d
Compare
Load testing with 4000 Locust users revealed gateway workers being OOM killed when container memory exceeded 8GB limit. Memory analysis: - Each worker uses ~350-450MB RSS baseline, spiking to 0.7-1.9GB under load - 16 workers × 450MB = 7.2GB baseline, only 0.8GB headroom → OOM kills - 8 workers × 450MB = 3.6GB baseline, 4.4GB headroom → safe Changes: - GRANIAN_WORKERS: 16 → 8 - GRANIAN_BACKPRESSURE: 128 → 256 (maintains 2048 concurrent capacity) With 3 replicas, total worker count is 24 (was 48), which is still sufficient for high concurrency while avoiding memory pressure. Related: #2334, #2323, #2330 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Add comprehensive memray section covering: - Installation instructions for local and container environments - Local profiling with memray run - Attaching to running processes - Report generation (flamegraph, table, tree, stats, summary) - Live mode for real-time monitoring - Container profiling workflow - Container profiling limitations and workarounds - py-spy vs memray comparison table Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ulation The RBAC middleware was holding database sessions for the entire request duration via Depends(get_db), causing idle-in-transaction connections to accumulate under high load. Changes: - Remove Depends(get_db) from get_current_user_with_permissions() - Remove "db" from user context return dicts - Update require_permission() to use fresh_db_session() - Update require_admin_permission() to use fresh_db_session() - Update require_any_permission() to use fresh_db_session() - Update PermissionChecker class to use fresh_db_session() - Update admin.py to not rely on user["db"] - Update tests to mock fresh_db_session() Closes #2340 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Transforms 267 endpoints from Depends(get_db) to fresh_db_session() to prevent database session accumulation under high load. Files modified: - main.py (52 endpoints) - admin.py (133 endpoints) - routers/auth.py (1 endpoint) - routers/email_auth.py (4 endpoints) - routers/llm_config_router.py (1 endpoint) - routers/llm_proxy_router.py (2 endpoints) - routers/log_search.py (5 endpoints) - routers/oauth_router.py (7 endpoints) - routers/observability.py (8 endpoints) - routers/rbac.py (12 endpoints) - routers/reverse_proxy.py (1 endpoint) - routers/server_well_known.py (2 endpoints) - routers/sso.py (10 endpoints) - routers/teams.py (15 endpoints) - routers/tokens.py (10 endpoints) - routers/toolops_router.py (3 endpoints) - routers/well_known.py (1 endpoint) Before: idle_in_tx grew to 100+ connections with 80-120s age, causing system degradation from 2180 RPS to 300 RPS under load. After: idle_in_tx stays at 12-34 connections with 0-17s max age, RPS stable at 1600-2260 with 0 failures. Closes #2335 Closes #2336 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Member
Author
|
Closing — superseded by other DB session optimization approaches. |
bradmcnew
pushed a commit
to bradmcnew/mcp-context-forge
that referenced
this pull request
Feb 18, 2026
…pool exhaustion Replace next(get_db()) with fresh_db_session() context manager in all three database access points within TokenScopingMiddleware. Extract query logic into dedicated helper methods (_query_team_membership and _query_resource_ownership) for cleaner session lifecycle management. Previously each request held 2+ concurrent connections (middleware + endpoint handler), causing QueuePool exhaustion under load. With this fix the middleware connection is returned to the pool before the endpoint handler runs. Applies the same pattern established in PR IBM#2326 and PR IBM#2319. Closes IBM#2330 Signed-off-by: Brad McNew <brad@axite.ai>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
fresh_db_session()instead ofDepends(get_db), preventing database connections from being held during response serialization.Endpoints modified:
list_serverslist_a2a_agentslist_toolslist_resourceslist_promptslist_gatewaysEach endpoint now:
fresh_db_session()context manager for short-lived DB accessmodel_dump) before session closesThis is a follow-up to PR #2319 (RBAC fix) to address the remaining endpoint handlers that were still holding database sessions during slow operations, causing connection pool exhaustion under high load.
Test plan
Closes #2323