[BUG][PERFORMANCE]: MCP session pool cancel scope leak causes ~20-45% tool call failures on /servers/{id}/mcp #3737
Description
Bug Summary
The MCP session pool (mcpgateway/services/mcp_session_pool.py) causes ~20-45% of tool calls proxied through /servers/{id}/mcp to fail with ToolInvocationError("Tool invocation failed: "). The failure is consistently reproduced across all backend servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE), and even sequential single-user calls. Disabling the pool (MCP_SESSION_POOL_ENABLED=false) eliminates the failures entirely. No pool configuration parameter reduces the failure rate.
The root cause is an architectural mismatch: _create_session manually calls transport_ctx.__aenter__() and session.__aenter__(), which attaches anyio cancel scopes to the HTTP request handler task. When a child task in the transport's internal TaskGroup fails, it cancels the host task, killing in-progress call_tool() operations.
Relation to #3520: This issue provides a deeper root cause analysis. #3520 identified broken session recycling as the symptom; this issue identifies the cancel scope leak as the underlying cause. PR #3605 partially addresses #3520 but does not fix the cancel scope issue.
Affected Component
- mcpgateway - API - Federation or Transports
Steps to Reproduce
Minimal curl reproduction (no load test needed):
```shell
# Generate token
export TOKEN=$(python -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --admin --exp 10080 --secret my-test-key --algo HS256 2>/dev/null)

# 30 sequential calls — expect ~20-45% failure
for i in $(seq 1 30); do
  curl -s -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -X POST "http://localhost:8080/servers/b8e3f1a2c4d5e6f7a1b2c3d4e5f6a7b8/mcp" \
    -d "{\"jsonrpc\":\"2.0\",\"id\":$i,\"method\":\"tools/call\",\"params\":{\"name\":\"fast-test-echo\",\"arguments\":{\"message\":\"test-$i\",\"delay\":0}}}" \
    | python3 -c "import sys,json; r=json.load(sys.stdin); print(f'Call {$i}: {\"FAIL\" if r.get(\"result\",{}).get(\"isError\") else \"OK\"}')"
done
```

With pool disabled (100% success):

```shell
# Restart gateway with pool disabled
docker compose stop gateway
MCP_SESSION_POOL_ENABLED=false docker compose up -d gateway
# Wait for healthy, then repeat the same 30 calls — all succeed
```

Expected Behavior
All 30 sequential tool calls should succeed (as they do with pool disabled or when calling the backend server directly).
Logs / Error Output
Client-visible error:
```json
{"jsonrpc":"2.0","id":3,"result":{"content":[{"type":"text","text":"Tool invocation failed: "}],"isError":true}}
```

Gateway stack trace:

```text
mcpgateway.services.tool_service.py:4291 invoke_tool
→ asyncio.wait_for(pooled.session.call_tool(...), timeout=effective_timeout)
→ mcp/client/session.py:383 call_tool → send_request
→ mcp/shared/session.py:281 write_stream.send()
→ anyio.ClosedResourceError

mcp.server.streamable_http_manager - ERROR - Stateless session crashed
→ mcp/server/lowlevel/server.py:695 _handle_message
→ mcp/shared/session.py:117 RequestResponder.__exit__
→ RuntimeError: Attempted to exit a cancel scope that isn't the current task's current cancel scope
```
Comprehensive Test Matrix
Pool ENABLED (30 sequential calls each)
| Server | Transport | Tool | Success |
|---|---|---|---|
| Fast Test | StreamableHTTP | echo (0ms) | 23/30 (77%) |
| Fast Test | StreamableHTTP | echo (500ms) | 17/30 (57%) |
| Fast Test | StreamableHTTP | get-stats | 16/30 (53%) |
| Fast Test | StreamableHTTP | get-system-time | 17/30 (57%) |
| Fast Time | StreamableHTTP | get-system-time | 25/30 (83%) |
| Fast Time | StreamableHTTP | convert-time | 20/30 (67%) |
| Fast Time | SSE | get-system-time | 26/30 (87%) |
| Fast Time | SSE | convert-time | 19/30 (63%) |
Pool DISABLED: ALL servers, ALL tools → 30/30 (100%)
Direct to backend (bypass gateway): 20/20 (100%)
SDK isolation tests
| Test | Result |
|---|---|
| MCP SDK → fast_test_server directly (pool-style reuse) | 20/20 (100%) |
| MCP SDK → gateway (pool ENABLED) | 11/20 (55%) |
| MCP SDK → gateway (pool DISABLED) | 20/20 (100%) |
The MCP SDK handles session reuse correctly when used directly. The bug is in the gateway's pool.
Configuration Sweep (No Config Helps)
Every pool parameter was tested. None significantly reduce failures:
| Config | Result (30 calls) |
|---|---|
| Baseline (defaults) | 26/30 (4 fail) |
| HEALTH_CHECK_INTERVAL=0 | 24/30 (6 fail) |
| TTL=1s | 25/30 (5 fail) |
| MAX_PER_KEY=1 | 25/30 (5 fail) |
| EXPLICIT_HEALTH_RPC=true | 27/30 (3 fail) |
| HEALTH_METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[ping] | 26/30 (4 fail) |
| TTL=0 (never reuse) | 23/30 (7 fail — worse) |
| POOL_ENABLED=false | 30/30 (0 fail) |
TTL=0 makes things worse because forcing fresh session creation on every call increases the window for cancel scope conflicts.
Root Cause Analysis
The cancel scope leak
_create_session (mcp_session_pool.py:1137, 1151) manually enters transport and session contexts:
```python
# Line 1137 — enters transport TaskGroup cancel scope on the request handler task
read_stream, write_stream, _ = await transport_ctx.__aenter__()

# Line 1151 — enters session TaskGroup cancel scope on the request handler task
await session.__aenter__()
```

This attaches anyio cancel scopes to the HTTP request handler task:

```text
[FastAPI/middleware scopes]
└── Transport TaskGroup cancel scope (from SDK streamable_http.py)
    └── Session TaskGroup cancel scope (from SDK session.py)
```
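The failure mode this creates can be modeled without anyio or the SDK at all. The following toy (plain asyncio; `LeakyTransport` and `request_handler` are illustrative names, not gateway code) mimics a cancel scope bound to the entering task: the context is manually entered, its exit is deferred pool-style, and a failing child then cancels the request handler mid-call:

```python
import asyncio


class LeakyTransport:
    """Toy model: a transport whose internal worker, on failure, cancels
    the task that ENTERED the context — mimicking an anyio TaskGroup
    cancel scope bound to the entering task."""

    async def __aenter__(self):
        self._host = asyncio.current_task()       # scope binds to entering task
        self._worker = asyncio.create_task(self._run())
        return self

    async def _run(self):
        await asyncio.sleep(0.01)                 # a child task fails...
        self._host.cancel()                       # ...and cancels the host scope

    async def __aexit__(self, *exc_info):
        self._worker.cancel()


async def request_handler():
    transport = LeakyTransport()
    await transport.__aenter__()                  # pool-style manual enter; __aexit__ deferred
    try:
        await asyncio.sleep(1)                    # models an in-flight call_tool()
        return "OK"
    except asyncio.CancelledError:
        return "host cancelled by transport child"


print(asyncio.run(request_handler()))             # → host cancelled by transport child
```

With a proper `async with`, `__aexit__` would run on the same task and convert the child failure into an exception at the exit point instead of a cancellation in the middle of the call.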
How it kills tool calls
- `post_writer` (SDK `streamable_http.py`) spawns `handle_request_async` tasks with no try/except:

  ```python
  async def handle_request_async():
      await self._handle_post_request(ctx)  # NO error handling

  if isinstance(message.root, JSONRPCRequest):
      tg.start_soon(handle_request_async)  # Spawned in transport TaskGroup
  ```

- If `_handle_post_request` raises (HTTP error, connection error), the exception propagates to the TaskGroup, which cancels the transport scope — and with it, the host task (the request handler):

  ```python
  # anyio _asyncio.py, TaskGroup._spawn/task_done:
  self.cancel_scope.cancel()  # Cancels transport scope → cancels host task
  ```

- The host task (running `call_tool()`) receives `CancelledError`. `asyncio.wait_for` does NOT convert it to `TimeoutError` (its own timeout didn't fire). The error surfaces as `ClosedResourceError` or `CancelledError`.
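The `asyncio.wait_for` behavior in the last step is easy to verify in isolation: only `wait_for`'s own timer produces `TimeoutError`; an external cancellation is re-raised as `CancelledError`. A minimal check, independent of the gateway (`slow_call` stands in for the pooled `call_tool()`):

```python
import asyncio


async def slow_call():
    await asyncio.sleep(10)      # stands in for pooled.session.call_tool(...)


async def main():
    # Wrap the call the way tool_service.invoke_tool does, with a timeout
    # that will NOT fire within this test.
    task = asyncio.create_task(asyncio.wait_for(slow_call(), timeout=5))
    await asyncio.sleep(0.01)
    task.cancel()                # external cancel, like the transport TaskGroup
    try:
        await task
        return "completed"
    except asyncio.TimeoutError:
        return "TimeoutError"
    except asyncio.CancelledError:
        return "CancelledError"  # wait_for does not convert external cancels


print(asyncio.run(main()))       # → CancelledError
```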
Why the non-pool path survives
Without pooling, async with context managers properly unwind cancel scopes via __aexit__:
```python
async with streamablehttp_client(...) as streams:      # Scope A
    async with ClientSession(*streams) as session:     # Scope B
        await session.call_tool(...)                   # Error here...
    # B.__aexit__ properly unwinds session scope
# A.__aexit__ properly unwinds transport scope, collects child errors into BaseExceptionGroup
```

In the pool path, scopes A and B are ENTERED but `__aexit__` is deferred. The cancel scope hierarchy leaks onto the host task, allowing child task failures to cancel the request handler directly.
Relationship to #3520 and PR #3605
- [BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520 identified the symptom: broken sessions recycled by the pool.
- PR fix(session-pool): prevent broken session recycling in MCPSessionPool #3605 (open, not merged) adds transport-aware `is_closed` detection. The `discard=True` logic in the `session()` context manager is already on main (lines 1894-1900).
- PR fix(session-pool): prevent broken session recycling in MCPSessionPool #3605 is a partial fix: it catches sessions broken BETWEEN calls but does not prevent cancel scope corruption DURING calls. It should reduce the failure rate but likely won't eliminate it.
Cancel Scope Cleanup Noise
With pool enabled, gateway logs Stateless session crashed + RuntimeError: cancel scope on failing requests. With pool disabled, these entries did not appear in testing. The noise appears coupled to pooled-session failures rather than appearing on successful pool-disabled runs.
Related Upstream Issues
MCP Python SDK (modelcontextprotocol/python-sdk):
| Issue | Status | Relevance |
|---|---|---|
| #577 — Cancel scope crash with multiple MCPClient instances | OPEN P1 | Core upstream issue: BaseSession.__aenter__() creates TaskGroup cancel scopes bound to the current task |
| #922 — Multiple client sessions → cancel scope error | CLOSED | Pure-anyio repro proving TaskGroup lifecycle is the constraint |
| #915 — ClientSessionGroup exception if server unavailable | OPEN P1 | Connection error + scope stack corruption |
| #1805 — Resource leak in streamable_http_client | OPEN | Thread leaks + CPU spikes after context exit |
| #1811 — read_stream_writer hangs after SSE disconnect | OPEN P1 | call_tool() hangs forever after connection loss |
| #2114 — ExceptionGroup wrapping obscures errors | OPEN P1 | Tagged for v2 |
anyio (agronholm/anyio):
| Issue | Status | Relevance |
|---|---|---|
| #415 — asyncio.wait_for incompatible with anyio cancel scopes | OPEN | Directly relevant |
| #787 — Child tasks don't cancel group scope correctly on asyncio | OPEN | Timing issues |
Recommended Fix Options
- Dedicated background task per pooled session (most likely architectural fix; not yet validated): Run the transport/session lifecycle in a dedicated `asyncio.create_task()` per pooled session, keeping the request handler task outside the transport's cancel scope hierarchy.
- Merge PR fix(session-pool): prevent broken session recycling in MCPSessionPool #3605 (partial fix): Transport-aware `is_closed` catches sessions broken between calls.
- Replace `asyncio.wait_for` with `anyio.fail_after` (complementary): Prevents cancel scope corruption on timeout paths. Secondary, not sufficient on its own.
- Disable pool (proven workaround): `MCP_SESSION_POOL_ENABLED=false`. 100% success. Performance cost: an extra initialize round-trip per call.
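The first option can be sketched with a toy stand-in for the transport stack (`PooledSession`, `fake_transport`, and `acquire` are hypothetical names, not the gateway's API). The point is that every `__aenter__`/`__aexit__` happens inside one dedicated keeper task, so any cancel scopes are created and unwound on that task; request handlers only await a future:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_transport():
    # Hypothetical stand-in for streamablehttp_client + ClientSession;
    # in a real fix those contexts would be entered here.
    yield "live-session"


class PooledSession:
    """Sketch: transport lifecycle confined to one keeper task, so its
    cancel scopes never attach to a request handler task."""

    def __init__(self) -> None:
        loop = asyncio.get_running_loop()
        self._ready: asyncio.Future = loop.create_future()
        self._stop = asyncio.Event()
        self._keeper = asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            async with fake_transport() as session:   # scopes enter on THIS task
                self._ready.set_result(session)
                await self._stop.wait()               # hold the session open
            # scopes exit on THIS task too — no cross-task unwinding
        except BaseException as exc:
            if not self._ready.done():
                self._ready.set_exception(exc)
            raise

    async def acquire(self):
        # Handlers never enter the transport's cancel scope hierarchy.
        return await asyncio.shield(self._ready)

    async def close(self) -> None:
        self._stop.set()
        await self._keeper


async def demo():
    pooled = PooledSession()
    session = await pooled.acquire()
    await pooled.close()
    return session


print(asyncio.run(demo()))   # → live-session
```

If a child task inside the transport fails under this design, the cancellation stays inside the keeper task; handlers observe it as an exception on the future rather than being cancelled mid-`call_tool()`.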
Workaround
Set MCP_SESSION_POOL_ENABLED=false in .env or docker-compose.yml.
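For Docker Compose setups like the one in the repro above, a minimal sketch of the workaround (assuming the service is named `gateway`, as in the compose commands earlier):

```yaml
# docker-compose.yml fragment — disable the session pool for the gateway service
services:
  gateway:
    environment:
      MCP_SESSION_POOL_ENABLED: "false"
```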
Environment Info
| Key | Value |
|---|---|
| MCP SDK | 1.26.0 |
| anyio | 4.12.1 |
| Runtime | Python 3.12 (container) |
| Platform / OS | Linux (Docker Compose) |
| Container | Docker, 1-3 gateway workers behind nginx |
| Session pool | enabled (default), max 200 per key |
| Backend servers | Rust fast_test_server, Rust fast_time_server |
| Transports tested | StreamableHTTP, SSE — both affected |
Additional Context
- This issue supersedes the symptom-level description in [BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520 with a deeper root cause analysis.
- The `/rpc` endpoint shares the same backend pool path (`tool_service.invoke_tool`) and may be exposed, but the same failure rate was not reliably reproduced on `/rpc` in repeated runs.
- The RCA document is at `todo/rca-echo-failure.md` in the repository.