[BUG][PERFORMANCE]: MCP session pool cancel scope leak causes ~20-45% tool call failures on /servers/{id}/mcp #3737
Description
Bug Summary
The MCP session pool (mcpgateway/services/mcp_session_pool.py) causes ~20-45% of tool calls proxied through /servers/{id}/mcp to fail with ToolInvocationError("Tool invocation failed: "). The failure is consistently reproduced across all backend servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE), and even sequential single-user calls. Disabling the pool (MCP_SESSION_POOL_ENABLED=false) eliminates the failures entirely. No pool configuration parameter reduces the failure rate.
The root cause is an architectural mismatch: _create_session manually calls transport_ctx.__aenter__() and session.__aenter__(), which attaches anyio cancel scopes to the HTTP request handler task. When a child task in the transport's internal TaskGroup fails, it cancels the host task, killing in-progress call_tool() operations.
Relation to #3520: This issue provides a deeper root cause analysis. #3520 identified broken session recycling as the symptom; this issue identifies the cancel scope leak as the underlying cause. PR #3605 partially addresses #3520 but does not fix the cancel scope issue.
Affected Component
- mcpgateway - API - Federation or Transports
Steps to Reproduce
Minimal curl reproduction (no load test needed):
```shell
# Generate token
export TOKEN=$(python -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --admin --exp 10080 --secret my-test-key --algo HS256 2>/dev/null)

# 30 sequential calls — expect ~20-45% failure
for i in $(seq 1 30); do
  curl -s -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -H "Accept: application/json, text/event-stream" \
    -X POST "http://localhost:8080/servers/b8e3f1a2c4d5e6f7a1b2c3d4e5f6a7b8/mcp" \
    -d "{\"jsonrpc\":\"2.0\",\"id\":$i,\"method\":\"tools/call\",\"params\":{\"name\":\"fast-test-echo\",\"arguments\":{\"message\":\"test-$i\",\"delay\":0}}}" \
    | python3 -c "import sys,json; r=json.load(sys.stdin); print(f'Call {$i}: {\"FAIL\" if r.get(\"result\",{}).get(\"isError\") else \"OK\"}')"
done
```

With pool disabled (100% success):

```shell
# Restart gateway with pool disabled
docker compose stop gateway
MCP_SESSION_POOL_ENABLED=false docker compose up -d gateway
# Wait for healthy, then repeat the same 30 calls — all succeed
```

Expected Behavior
All 30 sequential tool calls should succeed (as they do with pool disabled or when calling the backend server directly).
Logs / Error Output
Client-visible error:
```json
{"jsonrpc":"2.0","id":3,"result":{"content":[{"type":"text","text":"Tool invocation failed: "}],"isError":true}}
```

Gateway stack trace:

```text
mcpgateway.services.tool_service.py:4291 invoke_tool
→ asyncio.wait_for(pooled.session.call_tool(...), timeout=effective_timeout)
→ mcp/client/session.py:383 call_tool → send_request
→ mcp/shared/session.py:281 write_stream.send()
→ anyio.ClosedResourceError

mcp.server.streamable_http_manager - ERROR - Stateless session crashed
→ mcp/server/lowlevel/server.py:695 _handle_message
→ mcp/shared/session.py:117 RequestResponder.__exit__
→ RuntimeError: Attempted to exit a cancel scope that isn't the current task's current cancel scope
```
Comprehensive Test Matrix
Pool ENABLED (30 sequential calls each)
| Server | Transport | Tool | Success |
|---|---|---|---|
| Fast Test | StreamableHTTP | echo (0ms) | 23/30 (77%) |
| Fast Test | StreamableHTTP | echo (500ms) | 17/30 (57%) |
| Fast Test | StreamableHTTP | get-stats | 16/30 (53%) |
| Fast Test | StreamableHTTP | get-system-time | 17/30 (57%) |
| Fast Time | StreamableHTTP | get-system-time | 25/30 (83%) |
| Fast Time | StreamableHTTP | convert-time | 20/30 (67%) |
| Fast Time | SSE | get-system-time | 26/30 (87%) |
| Fast Time | SSE | convert-time | 19/30 (63%) |
Pool DISABLED: ALL servers, ALL tools → 30/30 (100%)
Direct to backend (bypass gateway): 20/20 (100%)
SDK isolation tests
| Test | Result |
|---|---|
| MCP SDK → fast_test_server directly (pool-style reuse) | 20/20 (100%) |
| MCP SDK → gateway (pool ENABLED) | 11/20 (55%) |
| MCP SDK → gateway (pool DISABLED) | 20/20 (100%) |
The MCP SDK handles session reuse correctly when used directly. The bug is in the gateway's pool.
Configuration Sweep (No Config Helps)
Every pool parameter was tested. None significantly reduce failures:
| Config | Result (30 calls) |
|---|---|
| Baseline (defaults) | 26/30 (4 fail) |
| HEALTH_CHECK_INTERVAL=0 | 24/30 (6 fail) |
| TTL=1s | 25/30 (5 fail) |
| MAX_PER_KEY=1 | 25/30 (5 fail) |
| EXPLICIT_HEALTH_RPC=true | 27/30 (3 fail) |
| HEALTH_METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[list_tools] | 27/30 (3 fail) |
| INTERVAL=0 + METHODS=[ping] | 26/30 (4 fail) |
| TTL=0 (never reuse) | 23/30 (7 fail — worse) |
| POOL_ENABLED=false | 30/30 (0 fail) |
TTL=0 makes things worse because forcing fresh session creation on every call increases the window for cancel scope conflicts.
Root Cause Analysis
The cancel scope leak
_create_session (mcp_session_pool.py:1137, 1151) manually enters transport and session contexts:
```python
# Line 1137 — enters transport TaskGroup cancel scope on the request handler task
read_stream, write_stream, _ = await transport_ctx.__aenter__()

# Line 1151 — enters session TaskGroup cancel scope on the request handler task
await session.__aenter__()
```

This attaches anyio cancel scopes to the HTTP request handler task:

```text
[FastAPI/middleware scopes]
└── Transport TaskGroup cancel scope (from SDK streamable_http.py)
    └── Session TaskGroup cancel scope (from SDK session.py)
```
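The failure mode this creates can be modeled without anyio or the SDK at all. The following toy (plain asyncio; `LeakyTransport` and `request_handler` are illustrative names, not gateway code) mimics a cancel scope bound to the entering task: the context is manually entered, its exit is deferred pool-style, and a failing child then cancels the request handler mid-call:

```python
import asyncio


class LeakyTransport:
    """Toy model: a transport whose internal worker, on failure, cancels
    the task that ENTERED the context — mimicking an anyio TaskGroup
    cancel scope bound to the entering task."""

    async def __aenter__(self):
        self._host = asyncio.current_task()       # scope binds to entering task
        self._worker = asyncio.create_task(self._run())
        return self

    async def _run(self):
        await asyncio.sleep(0.01)                 # a child task fails...
        self._host.cancel()                       # ...and cancels the host scope

    async def __aexit__(self, *exc_info):
        self._worker.cancel()


async def request_handler():
    transport = LeakyTransport()
    await transport.__aenter__()                  # pool-style manual enter; __aexit__ deferred
    try:
        await asyncio.sleep(1)                    # models an in-flight call_tool()
        return "OK"
    except asyncio.CancelledError:
        return "host cancelled by transport child"


print(asyncio.run(request_handler()))             # → host cancelled by transport child
```

With a proper `async with`, `__aexit__` would run on the same task and convert the child failure into an exception at the exit point instead of a cancellation in the middle of the call.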
How it kills tool calls
- `post_writer` (SDK `streamable_http.py`) spawns `handle_request_async` tasks with no try/except:

  ```python
  async def handle_request_async():
      await self._handle_post_request(ctx)  # NO error handling

  if isinstance(message.root, JSONRPCRequest):
      tg.start_soon(handle_request_async)  # Spawned in transport TaskGroup
  ```

- If `_handle_post_request` raises (HTTP error, connection error), the exception propagates to the TaskGroup, which cancels the transport scope — and with it, the host task (the request handler):

  ```python
  # anyio _asyncio.py, TaskGroup._spawn/task_done:
  self.cancel_scope.cancel()  # Cancels transport scope → cancels host task
  ```

- The host task (running `call_tool()`) receives `CancelledError`. `asyncio.wait_for` does NOT convert it to `TimeoutError` (its own timeout didn't fire). The error surfaces as `ClosedResourceError` or `CancelledError`.
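The `asyncio.wait_for` behavior in the last step is easy to verify in isolation: only `wait_for`'s own timer produces `TimeoutError`; an external cancellation is re-raised as `CancelledError`. A minimal check, independent of the gateway (`slow_call` stands in for the pooled `call_tool()`):

```python
import asyncio


async def slow_call():
    await asyncio.sleep(10)      # stands in for pooled.session.call_tool(...)


async def main():
    # Wrap the call the way tool_service.invoke_tool does, with a timeout
    # that will NOT fire within this test.
    task = asyncio.create_task(asyncio.wait_for(slow_call(), timeout=5))
    await asyncio.sleep(0.01)
    task.cancel()                # external cancel, like the transport TaskGroup
    try:
        await task
        return "completed"
    except asyncio.TimeoutError:
        return "TimeoutError"
    except asyncio.CancelledError:
        return "CancelledError"  # wait_for does not convert external cancels


print(asyncio.run(main()))       # → CancelledError
```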
Why the non-pool path survives
Without pooling, async with context managers properly unwind cancel scopes via __aexit__:
```python
async with streamablehttp_client(...) as streams:      # Scope A
    async with ClientSession(*streams) as session:     # Scope B
        await session.call_tool(...)                   # Error here...
    # B.__aexit__ properly unwinds session scope
# A.__aexit__ properly unwinds transport scope, collects child errors into BaseExceptionGroup
```

In the pool path, scopes A and B are ENTERED but `__aexit__` is deferred. The cancel scope hierarchy leaks onto the host task, allowing child task failures to cancel the request handler directly.
Relationship to #3520 and PR #3605
- [BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520 identified the symptom: broken sessions recycled by the pool.
- PR fix(session-pool): prevent broken session recycling in MCPSessionPool #3605 (open, not merged) adds transport-aware `is_closed` detection. The `discard=True` logic in the `session()` context manager is already on main (lines 1894-1900).
- PR fix(session-pool): prevent broken session recycling in MCPSessionPool #3605 is a partial fix: it catches sessions broken BETWEEN calls but does not prevent cancel scope corruption DURING calls. It should reduce the failure rate but likely won't eliminate it.
Cancel Scope Cleanup Noise
With pool enabled, gateway logs Stateless session crashed + RuntimeError: cancel scope on failing requests. With pool disabled, these entries did not appear in testing. The noise appears coupled to pooled-session failures rather than appearing on successful pool-disabled runs.
Related Upstream Issues
MCP Python SDK (modelcontextprotocol/python-sdk):
| Issue | Status | Relevance |
|---|---|---|
| #577 — Cancel scope crash with multiple MCPClient instances | OPEN P1 | Core upstream issue: BaseSession.__aenter__() creates TaskGroup cancel scopes bound to the current task |
| #922 — Multiple client sessions → cancel scope error | CLOSED | Pure-anyio repro proving TaskGroup lifecycle is the constraint |
| #915 — ClientSessionGroup exception if server unavailable | OPEN P1 | Connection error + scope stack corruption |
| #1805 — Resource leak in streamable_http_client | OPEN | Thread leaks + CPU spikes after context exit |
| #1811 — read_stream_writer hangs after SSE disconnect | OPEN P1 | call_tool() hangs forever after connection loss |
| #2114 — ExceptionGroup wrapping obscures errors | OPEN P1 | Tagged for v2 |
anyio (agronholm/anyio):
| Issue | Status | Relevance |
|---|---|---|
| #415 — asyncio.wait_for incompatible with anyio cancel scopes | OPEN | Directly relevant |
| #787 — Child tasks don't cancel group scope correctly on asyncio | OPEN | Timing issues |
Recommended Fix Options
- Dedicated background task per pooled session (most likely architectural fix; not yet validated): Run the transport/session lifecycle in a dedicated `asyncio.create_task()` per pooled session, keeping the request handler task outside the transport's cancel scope hierarchy.
- Merge PR fix(session-pool): prevent broken session recycling in MCPSessionPool #3605 (partial fix): Transport-aware `is_closed` catches sessions broken between calls.
- Replace `asyncio.wait_for` with `anyio.fail_after` (complementary): Prevents cancel scope corruption on timeout paths. Secondary, not sufficient on its own.
- Disable pool (proven workaround): `MCP_SESSION_POOL_ENABLED=false`. 100% success. Performance cost: an extra initialize round-trip per call.
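The first option can be sketched with a toy stand-in for the transport stack (`PooledSession`, `fake_transport`, and `acquire` are hypothetical names, not the gateway's API). The point is that every `__aenter__`/`__aexit__` happens inside one dedicated keeper task, so any cancel scopes are created and unwound on that task; request handlers only await a future:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def fake_transport():
    # Hypothetical stand-in for streamablehttp_client + ClientSession;
    # in a real fix those contexts would be entered here.
    yield "live-session"


class PooledSession:
    """Sketch: transport lifecycle confined to one keeper task, so its
    cancel scopes never attach to a request handler task."""

    def __init__(self) -> None:
        loop = asyncio.get_running_loop()
        self._ready: asyncio.Future = loop.create_future()
        self._stop = asyncio.Event()
        self._keeper = asyncio.create_task(self._run())

    async def _run(self) -> None:
        try:
            async with fake_transport() as session:   # scopes enter on THIS task
                self._ready.set_result(session)
                await self._stop.wait()               # hold the session open
            # scopes exit on THIS task too — no cross-task unwinding
        except BaseException as exc:
            if not self._ready.done():
                self._ready.set_exception(exc)
            raise

    async def acquire(self):
        # Handlers never enter the transport's cancel scope hierarchy.
        return await asyncio.shield(self._ready)

    async def close(self) -> None:
        self._stop.set()
        await self._keeper


async def demo():
    pooled = PooledSession()
    session = await pooled.acquire()
    await pooled.close()
    return session


print(asyncio.run(demo()))   # → live-session
```

If a child task inside the transport fails under this design, the cancellation stays inside the keeper task; handlers observe it as an exception on the future rather than being cancelled mid-`call_tool()`.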
Workaround
Set MCP_SESSION_POOL_ENABLED=false in .env or docker-compose.yml.
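For Docker Compose setups like the one in the repro above, a minimal sketch of the workaround (assuming the service is named `gateway`, as in the compose commands earlier):

```yaml
# docker-compose.yml fragment — disable the session pool for the gateway service
services:
  gateway:
    environment:
      MCP_SESSION_POOL_ENABLED: "false"
```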
Environment Info
| Key | Value |
|---|---|
| MCP SDK | 1.26.0 |
| anyio | 4.12.1 |
| Runtime | Python 3.12 (container) |
| Platform / OS | Linux (Docker Compose) |
| Container | Docker, 1-3 gateway workers behind nginx |
| Session pool | enabled (default), max 200 per key |
| Backend servers | Rust fast_test_server, Rust fast_time_server |
| Transports tested | StreamableHTTP, SSE — both affected |
Additional Context
- This issue supersedes the symptom-level description in [BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520 with a deeper root cause analysis.
- The `/rpc` endpoint shares the same backend pool path (`tool_service.invoke_tool`) and may be exposed, but the same failure rate was not reliably reproduced on `/rpc` in repeated runs.
- The RCA document is at `todo/rca-echo-failure.md` in the repository.