fix(db): release DB sessions before external HTTP calls (#2518) #2529
Merged
crivetimihai merged 4 commits into main on Jan 27, 2026
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ol exhaustion

This commit addresses issue #2518 where DB connection pool exhaustion occurred during A2A and RPC tool calls due to sessions being held during slow upstream HTTP requests.

Changes:
- tool_service.py: Extract A2A agent data to local variables before calling db.commit(), allowing HTTP calls to proceed without holding the DB session. The A2A tool invocation logic now uses pre-extracted data instead of querying during the HTTP call phase.
- rbac.py: Add db.commit() and db.close() calls before returning user context in all authentication paths (proxy, anonymous, disabled auth). This ensures DB sessions are released early and not held during subsequent request processing.
- test_rbac.py: Update test to provide mock db parameter and verify that db.commit() and db.close() are called for proper session cleanup.

The fix follows the pattern established in other services: extract all needed data from ORM objects, call db.commit() to release the transaction, then proceed with external HTTP calls. This prevents "idle in transaction" states that exhaust PgBouncer's connection pool under high load.

Load test results (4000 concurrent users, 1M+ requests):
- Success rate: 99.81%
- 502 errors reduced to 0.02% (edge cases with very slow upstreams)
- P50: 450ms, P95: 4300ms

Closes #2518

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Based on profiling with 4000 concurrent users (~2000 RPS):
- MCP_SESSION_POOL_MAX_PER_KEY: 50 → 200 (reduce session creation)
- IDLE_TRANSACTION_TIMEOUT: 120s → 300s (handle slow MCP calls)
- CLIENT_IDLE_TIMEOUT: 120s → 300s (align with transaction timeout)
- HTTPX_MAX_CONNECTIONS: 200 → 500 (more outbound capacity)
- HTTPX_MAX_KEEPALIVE_CONNECTIONS: 100 → 300
- REDIS_MAX_CONNECTIONS: 150 → 100 (stay under maxclients)

Results:
- Failure rate: 0.446% → 0.102% (4.4x improvement)
- RPC latency: 3,014ms → 1,740ms (42% faster)
- CRUD latency: 1,207ms → 508ms (58% faster)

See: todo/profile-full.md for detailed analysis

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Summary
Closes #2518 - DB connection pool exhaustion during A2A and RPC tool calls.
Root Cause: DB sessions were held in "idle in transaction" state during slow upstream HTTP calls, exhausting PgBouncer's connection pool under high load.
Solution: Extract all needed data from ORM objects before calling db.commit(), then proceed with external HTTP calls without holding the DB session.

Key Changes
- mcpgateway/services/tool_service.py: Extract A2A agent data (endpoint URL, auth, protocol version) to local variables before db.commit(). A2A tool invocation now uses pre-extracted data.
- mcpgateway/middleware/rbac.py: Add early db.commit() and db.close() in all authentication paths (proxy, anonymous, disabled auth) to release sessions before request processing continues.
- tests/unit/mcpgateway/middleware/test_rbac.py: Updated to verify proper session cleanup.

Load Test Results (4000 concurrent users)
Before this fix, connection pool exhaustion caused widespread 502 errors under load. After the fix, only edge cases (very slow upstream calls >120s) trigger timeouts.
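The extract-then-release pattern described in the Summary can be sketched roughly as follows. This is not the code from tool_service.py: `AgentRow`, `FakeSession`, `fake_http_call`, and `invoke_agent` are hypothetical stand-ins, and the "session" only tracks whether a transaction is open so the ordering is visible without a real database.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AgentRow:
    """Stand-in for an ORM row; the field names are assumptions."""
    endpoint_url: str
    auth_value: str
    protocol_version: str


class FakeSession:
    """Stand-in for a DB session that records whether a transaction is open."""

    def __init__(self, row):
        self._row = row
        self.in_transaction = False

    def get_agent(self):
        self.in_transaction = True  # a read implicitly opens a transaction
        return self._row

    def commit(self):
        self.in_transaction = False  # commit ends the transaction


async def fake_http_call(url, headers, session):
    """Simulated slow upstream call; reports whether the DB was held during it."""
    await asyncio.sleep(0.01)
    return {"url": url, "headers": headers, "db_held": session.in_transaction}


async def invoke_agent(db):
    agent = db.get_agent()
    # 1. Copy everything the HTTP call needs into plain local variables.
    endpoint_url = agent.endpoint_url
    headers = {"Authorization": agent.auth_value}
    protocol_version = agent.protocol_version
    # 2. End the transaction before any network I/O.
    db.commit()
    # 3. Make the slow call using only the extracted values.
    result = await fake_http_call(endpoint_url, headers, db)
    result["protocol_version"] = protocol_version
    return result


row = AgentRow("https://agent.example/a2a", "Bearer token", "1.0")
result = asyncio.run(invoke_agent(FakeSession(row)))
```

The point of the ordering is that nothing after step 2 touches the ORM object, so the connection is not sitting "idle in transaction" while the upstream call is in flight.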
Test Plan
make test: 4994 passed
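As an illustration of the kind of check the updated test_rbac.py performs, here is a minimal sketch using a mock session. `get_user_context` and the fields of the returned context are hypothetical and are not taken from rbac.py; only the idea of asserting that db.commit() and db.close() were called comes from the PR description.

```python
from unittest.mock import MagicMock


def get_user_context(db, email="anonymous"):
    """Hypothetical auth path: build the context, then release the session early."""
    context = {"email": email, "is_admin": False}
    db.commit()  # end the transaction
    db.close()   # return the connection to the pool
    return context


def test_session_released():
    db = MagicMock()  # mock db parameter, as in the updated test
    context = get_user_context(db)
    db.commit.assert_called_once()
    db.close.assert_called_once()
    assert context["email"] == "anonymous"


test_session_released()
```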