
[BUG][PERFORMANCE]: MCP session pool cancel scope leak causes ~20-45% tool call failures on /servers/{id}/mcp #3737

@crivetimihai

Description

Bug Summary

The MCP session pool (mcpgateway/services/mcp_session_pool.py) causes ~20-45% of tool calls proxied through /servers/{id}/mcp to fail with ToolInvocationError("Tool invocation failed: "). The failure reproduces consistently across all backend servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE), and even with sequential single-user calls. Disabling the pool (MCP_SESSION_POOL_ENABLED=false) eliminates the failures entirely; no pool configuration parameter reduces the failure rate.

The root cause is an architectural mismatch: _create_session manually calls transport_ctx.__aenter__() and session.__aenter__(), which attaches anyio cancel scopes to the HTTP request handler task. When a child task in the transport's internal TaskGroup fails, it cancels the host task, killing in-progress call_tool() operations.

Relation to #3520: This issue provides a deeper root cause analysis. #3520 identified broken session recycling as the symptom; this issue identifies the cancel scope leak as the underlying cause. PR #3605 partially addresses #3520 but does not fix the cancel scope issue.


Affected Component

  • mcpgateway - API
  • Federation or Transports

Steps to Reproduce

Minimal curl reproduction (no load test needed):

# Generate token
export TOKEN=$(python -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --admin --exp 10080 --secret my-test-key --algo HS256 2>/dev/null)

# 30 sequential calls — expect ~20-45% failure
for i in $(seq 1 30); do
  curl -s -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -X POST "http://localhost:8080/servers/b8e3f1a2c4d5e6f7a1b2c3d4e5f6a7b8/mcp" \
    -d "{\"jsonrpc\":\"2.0\",\"id\":$i,\"method\":\"tools/call\",\"params\":{\"name\":\"fast-test-echo\",\"arguments\":{\"message\":\"test-$i\",\"delay\":0}}}" \
    | python3 -c "import sys,json; r=json.load(sys.stdin); print('Call $i:', 'FAIL' if r.get('result',{}).get('isError') else 'OK')"
done

With pool disabled (100% success):

# Restart gateway with pool disabled
docker compose stop gateway
MCP_SESSION_POOL_ENABLED=false docker compose up -d gateway
# Wait for healthy, then repeat the same 30 calls — all succeed

Expected Behavior

All 30 sequential tool calls should succeed (as they do with pool disabled or when calling the backend server directly).


Logs / Error Output

Client-visible error:

{"jsonrpc":"2.0","id":3,"result":{"content":[{"type":"text","text":"Tool invocation failed: "}],"isError":true}}

Gateway stack trace:

mcpgateway.services.tool_service.py:4291 invoke_tool
  → asyncio.wait_for(pooled.session.call_tool(...), timeout=effective_timeout)
  → mcp/client/session.py:383 call_tool → send_request
  → mcp/shared/session.py:281 write_stream.send()
  → anyio.ClosedResourceError

mcp.server.streamable_http_manager - ERROR - Stateless session crashed
  → mcp/server/lowlevel/server.py:695 _handle_message
  → mcp/shared/session.py:117 RequestResponder.__exit__
  → RuntimeError: Attempted to exit a cancel scope that isn't the current task's current cancel scope

Comprehensive Test Matrix

Pool ENABLED (30 sequential calls each)

| Server | Transport | Tool | Success |
|---|---|---|---|
| Fast Test | StreamableHTTP | echo (0ms) | 23/30 (77%) |
| Fast Test | StreamableHTTP | echo (500ms) | 17/30 (57%) |
| Fast Test | StreamableHTTP | get-stats | 16/30 (53%) |
| Fast Test | StreamableHTTP | get-system-time | 17/30 (57%) |
| Fast Time | StreamableHTTP | get-system-time | 25/30 (83%) |
| Fast Time | StreamableHTTP | convert-time | 20/30 (67%) |
| Fast Time | SSE | get-system-time | 26/30 (87%) |
| Fast Time | SSE | convert-time | 19/30 (63%) |

Pool DISABLED: ALL servers, ALL tools → 30/30 (100%)

Direct to backend (bypass gateway): 20/20 (100%)

SDK isolation tests

| Test | Result |
|---|---|
| MCP SDK → fast_test_server directly (pool-style reuse) | 20/20 (100%) |
| MCP SDK → gateway (pool ENABLED) | 11/20 (55%) |
| MCP SDK → gateway (pool DISABLED) | 20/20 (100%) |

The MCP SDK handles session reuse correctly when used directly. The bug is in the gateway's pool.


Configuration Sweep (No Config Helps)

Every pool parameter was tested. None significantly reduce failures:

| Config | Result (30 calls) |
|---|---|
| Baseline (defaults) | 26/30 (4 fail) |
| HEALTH_CHECK_INTERVAL=0 | 24/30 (6 fail) |
| TTL=1s | 25/30 (5 fail) |
| MAX_PER_KEY=1 | 25/30 (5 fail) |
| EXPLICIT_HEALTH_RPC=true | 27/30 (3 fail) |
| HEALTH_METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[ping] | 26/30 (4 fail) |
| TTL=0 (never reuse) | 23/30 (7 fail — worse) |
| POOL_ENABLED=false | 30/30 (0 fail) |

TTL=0 makes things worse because forcing fresh session creation on every call increases the window for cancel scope conflicts.


Root Cause Analysis

The cancel scope leak

_create_session (mcp_session_pool.py:1137, 1151) manually enters transport and session contexts:

# Line 1137 — enters transport TaskGroup cancel scope on the request handler task
read_stream, write_stream, _ = await transport_ctx.__aenter__()
# Line 1151 — enters session TaskGroup cancel scope on the request handler task
await session.__aenter__()

This attaches anyio cancel scopes to the HTTP request handler task:

[FastAPI/middleware scopes]
  └── Transport TaskGroup cancel scope (from SDK streamable_http.py)
        └── Session TaskGroup cancel scope (from SDK session.py)

How it kills tool calls

  1. post_writer (SDK streamable_http.py) spawns handle_request_async tasks with no try/except:

    async def handle_request_async():
        await self._handle_post_request(ctx)  # NO error handling
    if isinstance(message.root, JSONRPCRequest):
        tg.start_soon(handle_request_async)   # Spawned in transport TaskGroup
  2. If _handle_post_request raises (HTTP error, connection error), the exception propagates to the TaskGroup, which cancels the transport scope — and with it, the host task (the request handler):

    # anyio _asyncio.py, TaskGroup._spawn/task_done:
    self.cancel_scope.cancel()  # Cancels transport scope → cancels host task
  3. The host task (running call_tool()) receives CancelledError. asyncio.wait_for does NOT convert it to TimeoutError (its own timeout didn't fire). The error surfaces as ClosedResourceError or CancelledError.
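
Step 3 can be demonstrated with plain asyncio, no gateway required. asyncio.wait_for only raises TimeoutError when its own timer fires; cancellation arriving from outside (as the transport scope delivers to the host task) propagates as CancelledError. An illustrative sketch:

```python
# Sketch: external cancellation of a wait_for wrapper surfaces as
# CancelledError, not TimeoutError — the pool's failure mode.
import asyncio

async def slow_call_tool():
    await asyncio.sleep(10)  # stand-in for a pooled session.call_tool()

async def main() -> str:
    task = asyncio.create_task(asyncio.wait_for(slow_call_tool(), timeout=5))
    await asyncio.sleep(0)   # let the task start
    task.cancel()            # external cancellation, like the transport scope
    try:
        await task
    except asyncio.TimeoutError:
        return "TimeoutError"
    except asyncio.CancelledError:
        return "CancelledError"
    return "completed"

outcome = asyncio.run(main())
print(outcome)  # prints "CancelledError"
```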

Why the non-pool path survives

Without pooling, async with context managers properly unwind cancel scopes via __aexit__:

async with streamablehttp_client(...) as streams:       # Scope A
    async with ClientSession(*streams) as session:       # Scope B
        await session.call_tool(...)                     # Error here...
    # B.__aexit__ properly unwinds session scope
# A.__aexit__ properly unwinds transport scope, collects child errors into BaseExceptionGroup

In the pool path, scopes A and B are ENTERED but __aexit__ is deferred. The cancel scope hierarchy leaks onto the host task, allowing child task failures to cancel the request handler directly.


Relationship to #3520 and PR #3605

#3520 identified broken session recycling as the symptom; this issue identifies the cancel scope leak described above as the underlying cause. PR #3605 partially addresses #3520 but does not fix the cancel scope leak, so failures persist with the pool enabled.

Cancel Scope Cleanup Noise

With the pool enabled, the gateway logs Stateless session crashed plus the RuntimeError about cancel scopes on failing requests. With the pool disabled, these entries did not appear in testing: the noise is coupled to pooled-session failures rather than occurring on successful pool-disabled runs.


Related Upstream Issues

MCP Python SDK (modelcontextprotocol/python-sdk):

| Issue | Status | Relevance |
|---|---|---|
| #577 — Cancel scope crash with multiple MCPClient instances | OPEN P1 | Core upstream issue: BaseSession.__aenter__() creates TaskGroup cancel scopes bound to the current task |
| #922 — Multiple client sessions → cancel scope error | CLOSED | Pure-anyio repro proving the TaskGroup lifecycle is the constraint |
| #915 — ClientSessionGroup exception if server unavailable | OPEN P1 | Connection error + scope stack corruption |
| #1805 — Resource leak in streamable_http_client | OPEN | Thread leaks + CPU spikes after context exit |
| #1811 — read_stream_writer hangs after SSE disconnect | OPEN P1 | call_tool() hangs forever after connection loss |
| #2114 — ExceptionGroup wrapping obscures errors | OPEN P1 | Tagged for v2 |

anyio (agronholm/anyio):

| Issue | Status | Relevance |
|---|---|---|
| #415 — asyncio.wait_for incompatible with anyio cancel scopes | OPEN | Directly relevant |
| #787 — Child tasks don't cancel group scope correctly on asyncio | OPEN | Timing issues |

Recommended Fix Options

  1. Dedicated background task per pooled session (most likely architectural fix; not yet validated): Run transport/session lifecycle in a dedicated asyncio.create_task() per pooled session, keeping the request handler task outside the transport's cancel scope hierarchy.

  2. Merge PR #3605 — fix(session-pool): prevent broken session recycling in MCPSessionPool (partial fix): a transport-aware is_closed check catches sessions broken between calls.

  3. Replace asyncio.wait_for with anyio.fail_after (complementary): Prevents cancel scope corruption on timeout paths. Secondary, not sufficient alone.

  4. Disable pool (proven workaround): MCP_SESSION_POOL_ENABLED=false. 100% success. Performance cost: extra initialize round-trip per call.
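
Option 1 can be sketched as follows. This is an illustrative outline, not gateway code: SessionHost, open_session, and the event names are hypothetical. The point is that the entire async-with stack (and therefore every cancel scope) lives on one dedicated task, while request handlers only borrow the session object:

```python
# Sketch of fix option 1: own the transport/session context managers in a
# dedicated task so their cancel scopes never attach to request handlers.
# SessionHost and open_session are hypothetical names, not gateway APIs.
import asyncio
from contextlib import asynccontextmanager

class SessionHost:
    def __init__(self, open_session):
        self._open_session = open_session  # async CM factory (transport+session)
        self._ready = asyncio.Event()
        self._closing = asyncio.Event()
        self.session = None
        self._task = None

    async def start(self) -> None:
        self._task = asyncio.create_task(self._run())
        await self._ready.wait()  # usable once the CM stack is entered

    async def _run(self) -> None:
        # The async-with stack lives on THIS task, so any cancel scopes
        # it creates are confined here, not on request handler tasks.
        async with self._open_session() as session:
            self.session = session
            self._ready.set()
            await self._closing.wait()  # keep scopes open until close()

    async def close(self) -> None:
        self._closing.set()
        await self._task  # unwinds the CM stack in LIFO order on its own task

# Tiny usage demo with a dummy session in place of a real MCP transport.
@asynccontextmanager
async def open_dummy_session():
    yield {"name": "dummy-session"}

async def demo() -> str:
    host = SessionHost(open_dummy_session)
    await host.start()
    name = host.session["name"]  # handlers would call session methods here
    await host.close()
    return name

result = asyncio.run(demo())
print(result)  # prints "dummy-session"
```

A real implementation would also need error propagation from _run to callers and per-session health tracking; the sketch only shows the task-ownership structure.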


Workaround

Set MCP_SESSION_POOL_ENABLED=false in .env or docker-compose.yml.
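
For example, in .env:

```shell
# Proven workaround: disable the MCP session pool entirely
MCP_SESSION_POOL_ENABLED=false
```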


Environment Info

| Key | Value |
|---|---|
| MCP SDK | 1.26.0 |
| anyio | 4.12.1 |
| Runtime | Python 3.12 (container) |
| Platform / OS | Linux (Docker Compose) |
| Container | Docker, 1-3 gateway workers behind nginx |
| Session pool | enabled (default), max 200 per key |
| Backend servers | Rust fast_test_server, Rust fast_time_server |
| Transports tested | StreamableHTTP, SSE — both affected |

Labels

  • MUST
  • P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
  • bug: Something isn't working
  • mcp-protocol: Alignment with MCP protocol or specification
  • performance: Performance related items
  • python: Python / backend development (FastAPI)
  • ready: Validated, ready-to-work-on items
