fix(session-pool): isolate cancel scopes via background task ownership #3739

Merged
crivetimihai merged 3 commits into main from fix/mcp-session-pool-cancel-scope-leak
Mar 19, 2026

Conversation

@crivetimihai (Member) commented Mar 19, 2026

Summary

Fixes the MCP session pool cancel scope leak that caused ~20-45% tool call failures across all servers and transports when pooling was enabled.

  • Run transport/session lifecycle in a dedicated asyncio.Task per pooled session, isolating anyio cancel scopes from request handler tasks
  • Add transport-aware is_closed detection (incorporates PR #3605, "fix(session-pool): prevent broken session recycling in MCPSessionPool")
  • Replace asyncio.wait_for with anyio.fail_after at 9 MCP SDK call sites
  • Fix release() dead-owner leak and _create_session() CancelledError leak
  • Add 18 lifecycle/regression tests
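
The owner-task idea in the first bullet can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not the pool's actual code: `fake_transport` and `OwnedSession` are hypothetical names, and a simple async context manager stands in for the real transport and session. The point it shows is that the context manager is entered and exited inside one dedicated task, so any cancel scope it creates belongs to that task rather than to whichever request handler happened to create the session.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_transport(log):
    # Hypothetical stand-in for a transport/session context manager.
    log.append("transport opened")
    try:
        yield "stream"
    finally:
        log.append("transport closed")


class OwnedSession:
    """Runs a context manager's whole lifecycle inside one dedicated task."""

    def __init__(self, log):
        self._log = log
        self._ready = asyncio.Event()
        self._shutdown = asyncio.Event()
        self.stream = None
        self.owner = None
        self.error = None

    async def _run(self):
        try:
            # Enter and exit both happen in this task, so scopes opened by
            # the context manager are bound here, not to the caller.
            async with fake_transport(self._log) as stream:
                self.stream = stream
                self._ready.set()
                await self._shutdown.wait()
        except Exception as exc:
            self.error = exc
        finally:
            self._ready.set()  # never leave start() waiting forever

    async def start(self):
        self.owner = asyncio.create_task(self._run())
        await self._ready.wait()
        if self.error is not None:
            raise self.error

    async def close(self):
        self._shutdown.set()
        await self.owner


async def demo():
    log = []
    session = OwnedSession(log)
    await session.start()
    owner_is_separate = session.owner is not asyncio.current_task()
    await session.close()
    return log, owner_is_separate, session.owner.done()
```

Calling `asyncio.run(demo())` opens and closes the stand-in transport from the owner task while the calling task only signals start and shutdown.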

Closes #3737
Closes #3520
Closes #3605

Supersedes #3520
Incorporates #3605

Test plan

  • 258 unit/e2e pool tests pass (including 18 new cancel scope tests)
  • 607K+ requests at 200-1000 concurrent users, 0% pool-related failures
  • All servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE) verified
  • pylint 10.00/10, zero W0212, flake8/black/isort/bandit clean

#3737)

The MCP session pool manually called transport_ctx.__aenter__() and
session.__aenter__(), attaching anyio cancel scopes to the HTTP request
handler task. When a child task in the transport's TaskGroup failed, it
cancelled the request handler, causing ~20-45% tool call failures.

Changes:
- Run transport/session lifecycle in dedicated asyncio.Task per pooled
  session so cancel scopes bind to the owner task, not request handlers
- Add transport-aware is_closed detection (write stream state, owner
  task health, receive channel count) from PR #3605
- Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in
  tool_service.py (4 sites), mcp_session_pool.py health checks (4),
  and gateway_service.py explicit health RPC (1)
- Fix release() to clean up dead-owner sessions instead of leaking
  them in _active with consumed semaphore slots
- Fix _create_session() to clean up owner task on CancelledError via
  finally block (not just Exception)
- Add 18 tests for owner task lifecycle, transport-aware is_closed,
  release with dead owner, cancellation cleanup, and prompt pooled
  regression coverage
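
The release() fix in the list above can be illustrated roughly as follows. This is a sketch under assumptions, not the code from mcp_session_pool.py: `TinyPool` and `PooledSession` are hypothetical, the semaphore is assumed to bound checked-out sessions, and "owner task finished" is assumed to be the only health signal. It shows a release path that always removes the session from the active set and always returns the semaphore slot, recycling the session only when its owner is still alive.

```python
import asyncio


class PooledSession:
    def __init__(self, owner):
        self.owner = owner  # task that owns the session's lifecycle

    @property
    def is_closed(self):
        # Assumed health rule: a finished owner task means a dead session.
        return self.owner.done()


class TinyPool:
    def __init__(self, max_sessions):
        self._slots = asyncio.Semaphore(max_sessions)
        self._active = set()
        self._idle = []

    async def acquire(self, factory):
        await self._slots.acquire()
        try:
            session = self._idle.pop() if self._idle else await factory()
        except BaseException:
            self._slots.release()  # do not leak the slot if creation fails
            raise
        self._active.add(session)
        return session

    def release(self, session):
        self._active.discard(session)
        if not session.is_closed:
            self._idle.append(session)  # healthy: keep for reuse
        # Dead-owner sessions are simply dropped; the slot is returned
        # in both cases so later callers are not starved.
        self._slots.release()


async def demo():
    pool = TinyPool(max_sessions=1)

    async def factory():
        # Owner that finishes almost immediately, simulating a crash.
        return PooledSession(asyncio.create_task(asyncio.sleep(0)))

    session = await pool.acquire(factory)
    await session.owner
    pool.release(session)
    # The slot was returned, so a second acquire does not block.
    second = await asyncio.wait_for(pool.acquire(factory), timeout=1)
    return len(pool._active), len(pool._idle), second is not session
```

If release() returned early for a dead session without these two steps, the second acquire in `demo()` would hit the one-second timeout.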

Tested: 607K+ requests across 200-1000 concurrent users with 0%
pool-related failures. All servers (Fast Test, Fast Time), all
transports (StreamableHTTP, SSE) verified.

Closes #3737

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai self-assigned this Mar 19, 2026
@crivetimihai added the revisit label ("Revisit this PR at a later date to address further issues, or if problems arise.") Mar 19, 2026
@crivetimihai added this to the Release 1.0.0 milestone Mar 19, 2026
…ed regression test

The prompt fallback test was only asserting that get_mcp_session_pool()
raises RuntimeError, not that PromptService._fetch_gateway_prompt_result()
actually falls through to the non-pooled sse_client path. Now calls the
real method with mocked pool unavailability and verifies the SSE fallback
session is initialized and get_prompt is invoked.
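
The shape of such a test can be sketched with unittest.mock. Everything here is hypothetical: `fetch_prompt` is a simplified stand-in for the real method, and the pool getter and fallback opener are injected as parameters rather than patched. It only shows the assertion style described above, where the test drives the real code path and checks that the fallback session was initialized and queried.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def fetch_prompt(name, get_pool, open_fallback_session):
    """Hypothetical pooled fetch with a non-pooled fallback."""
    try:
        pool = get_pool()
    except RuntimeError:
        pool = None  # pool unavailable: fall through to the fallback path
    if pool is not None:
        return await pool.get_prompt(name)
    session = await open_fallback_session()
    await session.initialize()
    return await session.get_prompt(name)


async def run_fallback_check():
    # Simulate pool unavailability.
    get_pool = MagicMock(side_effect=RuntimeError("pool not initialized"))
    session = AsyncMock()
    session.get_prompt.return_value = "prompt body"
    open_fallback_session = AsyncMock(return_value=session)

    result = await fetch_prompt("greeting", get_pool, open_fallback_session)

    # Assert on the fallback behaviour, not just on the RuntimeError.
    open_fallback_session.assert_awaited_once()
    session.initialize.assert_awaited_once()
    session.get_prompt.assert_awaited_once_with("greeting")
    return result
```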

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Add test for is_closed graceful degradation when SDK internals raise
- Exercise actual PromptService._fetch_gateway_prompt_result fallback
  path when pool is unavailable (not just bare get_mcp_session_pool)
- Improve force-cancel timeout test with shorter cleanup timeout
- Add test for _create_session finally-block BaseException handling
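
Why the cleanup belongs in a finally block rather than in `except Exception` can be shown with a small asyncio sketch. The function below is a hypothetical stand-in, not the real _create_session(): it starts an owner task, waits on a handshake that never completes, and is cancelled from outside. Since Python 3.8, asyncio.CancelledError derives from BaseException, so an `except Exception` handler would not run here, while the finally block does.

```python
import asyncio


async def create_session(started, cleaned_up):
    """Hypothetical: start an owner task, then wait for a slow handshake."""
    owner = asyncio.create_task(asyncio.sleep(3600))
    ready = False
    try:
        started.set()
        await asyncio.sleep(3600)  # stands in for a handshake that hangs
        ready = True
        return owner
    finally:
        # Runs for ordinary exceptions and for cancellation alike.
        if not ready:
            owner.cancel()
            await asyncio.gather(owner, return_exceptions=True)
            cleaned_up.append(owner.cancelled())


async def demo():
    started = asyncio.Event()
    cleaned_up = []
    task = asyncio.create_task(create_session(started, cleaned_up))
    await started.wait()
    task.cancel()  # cancel the creator mid-handshake
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled(), cleaned_up
```

After `asyncio.run(demo())`, the creator task is cancelled and the owner task has been cancelled and awaited, so nothing is left running in the background.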

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai merged commit bbc1792 into main Mar 19, 2026
30 of 31 checks passed
@crivetimihai deleted the fix/mcp-session-pool-cancel-scope-leak branch March 19, 2026 14:05
