[BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520

@jonpspri

Description

Bug Summary

The MCP client session pool (mcpgateway/services/mcp_session_pool.py) returns broken sessions to the pool after transport failures, causing cascading ClosedResourceError failures. When a pooled session's underlying transport dies (network error, backend disconnect, resource exhaustion), the session() context manager's finally block unconditionally calls release(), which puts the dead session back into the pool. The next caller gets the same broken session, fails immediately, returns it again — creating a failure loop that produces 100% error rates even at low concurrency.

Impact: All MCP tool calls via pooled Streamable HTTP sessions fail with ToolInvocationError("Tool invocation failed: ") (empty message because ClosedResourceError has no string representation). Disabling the pool (MCP_SESSION_POOL_ENABLED=false) resolves the issue but sacrifices the 10-20x latency improvement the pool provides.

Steps to Reproduce

Requires PR #3353 (echo delay load test and fast_test_server delay support).

  1. Start the testing stack with the echo delay locustfile:

     ```shell
     LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up
     ```

  2. Open the Locust web UI at http://localhost:8089

  3. Start a test with 10 users, spawn rate 2

  4. Observe: nearly all MCP requests fail with "MCP tool error: Tool invocation failed: "

  5. Check gateway logs to confirm:

     ```shell
     docker compose logs gateway --tail=200 --no-log-prefix | grep "ClosedResourceError\|Tool invocation failed"
     ```

  6. Stop the stack and restart with pooling disabled:

     ```shell
     MCP_SESSION_POOL_ENABLED=false LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up
     ```

  7. Repeat the 10-user test: requests now succeed (at higher latency due to per-request session creation).

Root Cause

Three gaps in the session pool combine to create the failure loop:

1. No error-aware release (mcp_session_pool.py ~line 1873-1877)

The session() context manager has no except clause. It cannot distinguish a successful call from one that threw ClosedResourceError. The broken session is returned to the pool identically to a healthy one.

```python
# Current code — no error handling
pooled = await self.acquire(...)
try:
    yield pooled
finally:
    await self.release(pooled)  # Always returns to pool
```

2. Health checks skip recently-used sessions (mcp_session_pool.py ~line 975-999)

_validate_session() only runs health checks when idle_seconds > health_check_interval (default 60s). A broken session that is immediately re-acquired (idle < 60s) passes validation with no check. The default health check method ["skip"] compounds this — even sessions idle for 60+ seconds pass without a real liveness test.

```python
# Session used 1 second ago — no health check, returned as valid
if pooled.idle_seconds > self._health_check_interval:  # 1 > 60 is False
    return await self._run_health_check_chain(pooled)
return True  # Broken session handed out
```

3. is_closed only checks a flag (mcp_session_pool.py ~line 140-146)

PooledSession.is_closed returns self._closed, a boolean set only by explicit mark_closed() calls. It does not inspect the underlying transport streams. A session with dead read/write streams still reports is_closed = False.

Observable Symptoms

| Symptom | Explanation |
| --- | --- |
| 100% `ClosedResourceError` on `call_tool` | Broken sessions recycled indefinitely |
| `fast_test_server` logs show `initialize` but no tool calls | Requests never reach the backend tool handler |
| `"Tool invocation failed: "` with empty message | `ClosedResourceError` has no message string |
| Sub-500ms response times despite 500ms delay | Broken sessions fail fast (~50ms) |
| Gateway returns HTTP 500 with `ExceptionGroup` | MCP SDK's `send_log_message` error handler also hits closed stream |

Error Chain (from gateway logs)

```
tool_service.py:3704  →  pooled.session.call_tool(...)
mcp/client/session.py:383  →  self.send_request(...)
mcp/shared/session.py:281  →  self._write_stream.send(...)
anyio/streams/memory.py:218  →  raise ClosedResourceError
  ↓
ExceptionGroup wraps the error
  ↓
tool_service.py:4031  →  raise ToolInvocationError("Tool invocation failed: ")
```
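The empty message at the end of the chain is easy to reproduce: a Python exception raised with no arguments stringifies to an empty string, which is exactly what happens when `ClosedResourceError` is interpolated into the `ToolInvocationError` text. A minimal stdlib sketch (using a stand-in class rather than importing anyio):

```python
# Stand-in for anyio.ClosedResourceError, which is raised with no arguments.
class ClosedResourceError(Exception):
    """Simulates anyio's exception; defined here only for illustration."""


try:
    raise ClosedResourceError
except ClosedResourceError as exc:
    # str() of an argument-less exception is "", so the wrapped message
    # becomes "Tool invocation failed: " with nothing after the colon.
    message = f"Tool invocation failed: {exc}"

print(message)  # → "Tool invocation failed: "
```

This is why the gateway logs and Locust failures show the trailing colon with no detail.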

Recommended Fix

1. Error-aware session release

Modify session() to catch exceptions and signal release() to discard the broken session:

```python
pooled = await self.acquire(...)
failed = False
try:
    yield pooled
except BaseException:
    failed = True
    raise
finally:
    await self.release(pooled, discard=failed)
```

Add a discard parameter to release() that closes and evicts the session instead of returning it to the pool.
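A toy sketch of the discard semantics, assuming a simple idle queue; the `SessionPool` internals and the fake session's `close()` here are placeholders for illustration, not the real `mcp_session_pool` implementation:

```python
import asyncio
from contextlib import asynccontextmanager


class SessionPool:
    """Toy pool illustrating error-aware release with a discard flag."""

    def __init__(self) -> None:
        self._idle: asyncio.Queue = asyncio.Queue()

    async def acquire(self):
        # Real code would create a new session when the pool is empty.
        return await self._idle.get()

    async def release(self, pooled, discard: bool = False) -> None:
        if discard:
            # Close and evict the broken session instead of recycling it.
            await pooled.close()
            return
        await self._idle.put(pooled)

    @asynccontextmanager
    async def session(self):
        pooled = await self.acquire()
        failed = False
        try:
            yield pooled
        except BaseException:
            failed = True
            raise
        finally:
            await self.release(pooled, discard=failed)
```

With this shape, a caller that raises inside `async with pool.session() as s:` causes the session to be closed and dropped, so the next caller never sees the dead transport.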

2. Better health check defaults

Change MCP_SESSION_POOL_HEALTH_CHECK_METHODS from ["skip"] to ["ping", "skip"] so idle sessions get a real liveness check.
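A hypothetical sketch of what a `["ping", "skip"]` chain could look like. The chain structure and semantics here are assumptions (a failed ping evicts rather than falling through to `skip`); only the existence of an MCP ping request is taken from the SDK:

```python
import asyncio


async def run_health_check_chain(pooled, methods: list[str]) -> bool:
    """Return True if the session should be handed out, False to evict it.

    Hypothetical chain: methods are tried in order; "ping" is conclusive
    either way, "skip" unconditionally trusts the session.
    """
    for method in methods:
        if method == "skip":
            return True  # no liveness test at all
        if method == "ping":
            try:
                # MCP client sessions support a ping request; a dead
                # transport raises (e.g. ClosedResourceError) or times out.
                await asyncio.wait_for(pooled.session.send_ping(), timeout=2.0)
                return True
            except Exception:
                return False  # dead transport: evict the session
    return False
```

Under this sketch, `["skip"]` (the current default) always returns True, while `["ping", "skip"]` actually exercises the transport for idle sessions.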

3. Transport-aware is_closed (longer term)

Make is_closed inspect the actual state of the underlying read/write streams rather than relying solely on a manually-set flag.
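One possible shape for this, sketched with placeholder internals: the `_read_stream`/`_write_stream` attribute names and the `_closed` attribute probed on anyio memory streams are assumptions about the transport, not a stable public API, so the check falls back to "open" when the attribute is absent:

```python
def stream_is_closed(stream) -> bool:
    # anyio memory streams track closure internally; treat unknown
    # transports as open rather than evicting them spuriously.
    return bool(getattr(stream, "_closed", False))


class PooledSession:
    """Simplified sketch; the real class carries much more state."""

    def __init__(self, read_stream, write_stream):
        self._closed = False
        self._read_stream = read_stream
        self._write_stream = write_stream

    def mark_closed(self) -> None:
        self._closed = True

    @property
    def is_closed(self) -> bool:
        # Closed if explicitly marked OR if either transport stream died.
        return (
            self._closed
            or stream_is_closed(self._read_stream)
            or stream_is_closed(self._write_stream)
        )
```

With this, a session whose write stream died mid-call reports `is_closed = True` on the next validation pass even if `mark_closed()` was never invoked.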

Affected Code

| File | Lines | Component |
| --- | --- | --- |
| mcpgateway/services/mcp_session_pool.py | ~1873-1877 | `session()` context manager |
| mcpgateway/services/mcp_session_pool.py | ~840-895 | `release()` method |
| mcpgateway/services/mcp_session_pool.py | ~975-999 | `_validate_session()` |
| mcpgateway/services/mcp_session_pool.py | ~140-146 | `PooledSession.is_closed` |
| docker-compose.yml | ~484 | MCP_SESSION_POOL_HEALTH_CHECK_METHODS default |

Discovered During

Load testing with locustfile_echo_delay.py (PR #3353) against the fast_test_server echo tool with a 500ms delay, exercising the Streamable HTTP path through /servers/{id}/mcp.

Labels

- MUSTP1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
- bug: Something isn't working
- performance: Performance related items
- ready: Validated, ready-to-work-on items