fix(session-pool): prevent broken session recycling in MCPSessionPool#3605
Closed
gandhipratik203 wants to merge 4 commits intomainfrom
Closed
fix(session-pool): prevent broken session recycling in MCPSessionPool#3605gandhipratik203 wants to merge 4 commits intomainfrom
gandhipratik203 wants to merge 4 commits intomainfrom
Conversation
bf6b328 to
c58a2dd
Compare
c58a2dd to
165134d
Compare
…cycling Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…cess changes Update test_session_context_manager_releases_on_exception to expect release(discard=True) on exception, matching the new discard behaviour added to the session() context manager. Add is_success = True to DummyResponse stubs in test_execute_forwarded_request_success_result and test_execute_forwarded_request_error_result to match the response.is_success guard introduced in mcp_session_pool.py. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…ing", "skip"] Updates docker-compose.yml and docker-compose-verbose-logging.yml to use ["ping", "skip"] instead of ["skip"], aligning with the config.py default. This enables ping-based health checks on session acquisition so broken sessions are detected before being handed to callers, with graceful fallback to skip for servers that do not support the ping method. Closes #3520 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
165134d to
7168d44
Compare
2 tasks
crivetimihai
added a commit
that referenced
this pull request
Mar 19, 2026
#3737) The MCP session pool manually called transport_ctx.__aenter__() and session.__aenter__(), attaching anyio cancel scopes to the HTTP request handler task. When a child task in the transport's TaskGroup failed, it cancelled the request handler, causing ~20-45% tool call failures. Changes: - Run transport/session lifecycle in dedicated asyncio.Task per pooled session so cancel scopes bind to the owner task, not request handlers - Add transport-aware is_closed detection (write stream state, owner task health, receive channel count) from PR #3605 - Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in tool_service.py (4 sites), mcp_session_pool.py health checks (4), and gateway_service.py explicit health RPC (1) - Fix release() to clean up dead-owner sessions instead of leaking them in _active with consumed semaphore slots - Fix _create_session() to clean up owner task on CancelledError via finally block (not just Exception) - Add 18 tests for owner task lifecycle, transport-aware is_closed, release with dead owner, cancellation cleanup, and prompt pooled regression coverage Tested: 607K+ requests across 200-1000 concurrent users with 0% pool-related failures. All servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE) verified. Closes #3737 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
4 tasks
crivetimihai
added a commit
that referenced
this pull request
Mar 19, 2026
#3739) * fix(session-pool): isolate cancel scopes via background task ownership (#3737) The MCP session pool manually called transport_ctx.__aenter__() and session.__aenter__(), attaching anyio cancel scopes to the HTTP request handler task. When a child task in the transport's TaskGroup failed, it cancelled the request handler, causing ~20-45% tool call failures. Changes: - Run transport/session lifecycle in dedicated asyncio.Task per pooled session so cancel scopes bind to the owner task, not request handlers - Add transport-aware is_closed detection (write stream state, owner task health, receive channel count) from PR #3605 - Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in tool_service.py (4 sites), mcp_session_pool.py health checks (4), and gateway_service.py explicit health RPC (1) - Fix release() to clean up dead-owner sessions instead of leaking them in _active with consumed semaphore slots - Fix _create_session() to clean up owner task on CancelledError via finally block (not just Exception) - Add 18 tests for owner task lifecycle, transport-aware is_closed, release with dead owner, cancellation cleanup, and prompt pooled regression coverage Tested: 607K+ requests across 200-1000 concurrent users with 0% pool-related failures. All servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE) verified. Closes #3737 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(test): exercise actual PromptService fallback path in prompt pooled regression test The prompt fallback test was only asserting that get_mcp_session_pool() raises RuntimeError, not that PromptService._fetch_gateway_prompt_result() actually falls through to the non-pooled sse_client path. Now calls the real method with mocked pool unavailability and verifies the SSE fallback session is initialized and get_prompt is invoked. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(test): improve diff coverage for session pool cancel scope fix - Add test for is_closed graceful degradation when SDK internals raise - Exercise actual PromptService._fetch_gateway_prompt_result fallback path when pool is unavailable (not just bare get_mcp_session_pool) - Improve force-cancel timeout test with shorter cleanup timeout - Add test for _create_session finally-block BaseException handling Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Member
|
Incorporated into #3739 |
1 similar comment
Member
|
Incorporated into #3739 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Related Issue
Closes #3520
Summary
Fixes remaining root causes of broken session detection in
MCPSessionPooland aligns the verbose-logging compose health check default with the recommended production setting.Change details
Part 1 — Transport-aware
is_closedPooledSession.is_closedonly checked the internal_closedflag, missing cases where the underlying transport stream had silently broken. Sessions with a closed write stream or zero open receive channels were incorrectly considered healthy and handed to callers.is_closedto inspectsession._write_stream._closedand_state.open_receive_channelsis Trueandisinstance(open_rx, int)checks to avoid false positives with mock objects in testsgetattrlookups include safe fallbacks so the property degrades gracefully if MCP internals changePart 2 — Health check default (
docker-compose-verbose-logging.yml)MCP_SESSION_POOL_HEALTH_CHECK_METHODSwas set to["skip"]indocker-compose-verbose-logging.yml, contradicting the["ping", "skip"]default already set inconfig.pyanddocker-compose.yml. Updated to["ping", "skip"]so that a ping is attempted on session acquisition (with graceful fallback to skip for servers that do not support it).Part 3 — Tests
Added unit tests covering all
is_closedtransport-inspection paths,release(discard=True)behaviour, and context manager discard on bothExceptionandBaseException. FixedDummyResponsestubs intest_mcp_session_pool_coverage.pyto includeis_success = True.Scope
is_closedinspects actual transport stream statedocker-compose-verbose-logging.ymlhealth check default aligned to["ping", "skip"]is_closedtransport-inspection pathsrelease(discard=True)closes instead of recyclingExceptionandBaseException🔬 Steps to Test
python3 -m pytest tests/unit/mcpgateway/services/test_mcp_session_pool.py tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py -vdocker-compose -f docker-compose-verbose-logging.yml upand confirmMCP_SESSION_POOL_HEALTH_CHECK_METHODSresolves to["ping", "skip"].locustfile_echo_delay.pylocustfile andfast_test_serverdelay support introduced in that PR). Start the stack withLOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up, run 10 users at spawn rate 2 in the Locust UI athttp://localhost:8089, and confirm noClosedResourceErroror empty"Tool invocation failed: "errors appear in the gateway logs.Manual test — idle session with dead backend
Manual test results
Tested against a live gateway (
make dev) with an SSE translator wrappingmcp-server-giton port 9000.Prerequisites
http://localhost:9000/sse(SSE transport) — this federates thegit_logtool into the gatewayuvxavailable (pip install uvif not)Setup — start services, create virtual server, associate tool
Test — health check on idle session (backend killed after 10s idle)
Results summary
isError: falseisError: false✅ClosedResourceErrorhanded out"Tool invocation failed: Connection closed"✅is_closedtransport stream inspection is covered by unit tests only — a silently dead transport with no exception is not reliably triggerable via curl.🏷️ Type of Change
🧪 Verification
make lintmake testpython3 -m pytest tests/unit/mcpgateway/services/test_mcp_session_pool.py tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py✅ Checklist
make black isort pre-commit)📓 Notes
The
["ping", "skip"]health check chain adds a ping on every session acquisition. Servers that do not support thepingmethod will fall through toskipautomatically, so there is no regression for environments using non-standard MCP servers.