[BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520

@jonpspri

Description

Bug Summary

The MCP client session pool (mcpgateway/services/mcp_session_pool.py) returns broken sessions to the pool after transport failures, causing cascading ClosedResourceError failures. When a pooled session's underlying transport dies (network error, backend disconnect, resource exhaustion), the session() context manager's finally block unconditionally calls release(), which puts the dead session back into the pool. The next caller gets the same broken session, fails immediately, returns it again — creating a failure loop that produces 100% error rates even at low concurrency.

Impact: All MCP tool calls via pooled Streamable HTTP sessions fail with ToolInvocationError("Tool invocation failed: ") (empty message because ClosedResourceError has no string representation). Disabling the pool (MCP_SESSION_POOL_ENABLED=false) resolves the issue but sacrifices the 10-20x latency improvement the pool provides.

Steps to Reproduce

Requires PR #3353 (echo delay load test and fast_test_server delay support).

  1. Start the testing stack with the echo delay locustfile:

     ```shell
     LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up
     ```

  2. Open the Locust web UI at http://localhost:8089

  3. Start a test with 10 users, spawn rate 2

  4. Observe: nearly all MCP requests fail with "MCP tool error: Tool invocation failed: "

  5. Check gateway logs to confirm:

     ```shell
     docker compose logs gateway --tail=200 --no-log-prefix | grep "ClosedResourceError\|Tool invocation failed"
     ```

  6. Stop the stack and restart with pooling disabled:

     ```shell
     MCP_SESSION_POOL_ENABLED=false LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up
     ```

  7. Repeat the 10-user test: requests now succeed (at higher latency due to per-request session creation).

Root Cause

Three gaps in the session pool combine to create the failure loop:

1. No error-aware release (mcp_session_pool.py ~line 1873-1877)

The session() context manager has no except clause. It cannot distinguish a successful call from one that threw ClosedResourceError. The broken session is returned to the pool identically to a healthy one.

```python
# Current code — no error handling
pooled = await self.acquire(...)
try:
    yield pooled
finally:
    await self.release(pooled)  # Always returns to pool
```

2. Health checks skip recently-used sessions (mcp_session_pool.py ~line 975-999)

_validate_session() only runs health checks when idle_seconds > health_check_interval (default 60s). A broken session that is immediately re-acquired (idle < 60s) passes validation with no check. The default health check method ["skip"] compounds this — even sessions idle for 60+ seconds pass without a real liveness test.

```python
# Session used 1 second ago — no health check, returned as valid
if pooled.idle_seconds > self._health_check_interval:  # 1 > 60 is False
    return await self._run_health_check_chain(pooled)
return True  # Broken session handed out
```

3. is_closed only checks a flag (mcp_session_pool.py ~line 140-146)

PooledSession.is_closed returns self._closed, a boolean set only by explicit mark_closed() calls. It does not inspect the underlying transport streams. A session with dead read/write streams still reports is_closed = False.

Observable Symptoms

| Symptom | Explanation |
| --- | --- |
| 100% `ClosedResourceError` on `call_tool` | Broken sessions recycled indefinitely |
| `fast_test_server` logs show `initialize` but no tool calls | Requests never reach the backend tool handler |
| `"Tool invocation failed: "` with empty message | `ClosedResourceError` has no message string |
| Sub-500ms response times despite 500ms delay | Broken sessions fail fast (~50ms) |
| Gateway returns HTTP 500 with `ExceptionGroup` | MCP SDK's `send_log_message` error handler also hits closed stream |

Error Chain (from gateway logs)

```
tool_service.py:3704  →  pooled.session.call_tool(...)
mcp/client/session.py:383  →  self.send_request(...)
mcp/shared/session.py:281  →  self._write_stream.send(...)
anyio/streams/memory.py:218  →  raise ClosedResourceError
  ↓
ExceptionGroup wraps the error
  ↓
tool_service.py:4031  →  raise ToolInvocationError("Tool invocation failed: ")
```
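The empty message at the end of the chain is easy to reproduce: a Python exception raised with no arguments stringifies to an empty string, which is exactly what happens when `ClosedResourceError` is interpolated into the `ToolInvocationError` text. A minimal stdlib sketch (using a stand-in class rather than importing anyio):

```python
# Stand-in for anyio.ClosedResourceError, which is raised with no arguments.
class ClosedResourceError(Exception):
    """Simulates anyio's exception; defined here only for illustration."""


try:
    raise ClosedResourceError
except ClosedResourceError as exc:
    # str() of an argument-less exception is "", so the wrapped message
    # becomes "Tool invocation failed: " with nothing after the colon.
    message = f"Tool invocation failed: {exc}"

print(message)  # → "Tool invocation failed: "
```

This is why the gateway logs and Locust failures show the trailing colon with no detail.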

Recommended Fix

1. Error-aware session release

Modify session() to catch exceptions and signal release() to discard the broken session:

```python
pooled = await self.acquire(...)
failed = False
try:
    yield pooled
except BaseException:
    failed = True
    raise
finally:
    await self.release(pooled, discard=failed)
```

Add a discard parameter to release() that closes and evicts the session instead of returning it to the pool.
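A toy sketch of the discard semantics, assuming a simple idle queue; the `SessionPool` internals and the fake session's `close()` here are placeholders for illustration, not the real `mcp_session_pool` implementation:

```python
import asyncio
from contextlib import asynccontextmanager


class SessionPool:
    """Toy pool illustrating error-aware release with a discard flag."""

    def __init__(self) -> None:
        self._idle: asyncio.Queue = asyncio.Queue()

    async def acquire(self):
        # Real code would create a new session when the pool is empty.
        return await self._idle.get()

    async def release(self, pooled, discard: bool = False) -> None:
        if discard:
            # Close and evict the broken session instead of recycling it.
            await pooled.close()
            return
        await self._idle.put(pooled)

    @asynccontextmanager
    async def session(self):
        pooled = await self.acquire()
        failed = False
        try:
            yield pooled
        except BaseException:
            failed = True
            raise
        finally:
            await self.release(pooled, discard=failed)
```

With this shape, a caller that raises inside `async with pool.session() as s:` causes the session to be closed and dropped, so the next caller never sees the dead transport.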

2. Better health check defaults

Change MCP_SESSION_POOL_HEALTH_CHECK_METHODS from ["skip"] to ["ping", "skip"] so idle sessions get a real liveness check.
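A hypothetical sketch of what a `["ping", "skip"]` chain could look like. The chain structure and semantics here are assumptions (a failed ping evicts rather than falling through to `skip`); only the existence of an MCP ping request is taken from the SDK:

```python
import asyncio


async def run_health_check_chain(pooled, methods: list[str]) -> bool:
    """Return True if the session should be handed out, False to evict it.

    Hypothetical chain: methods are tried in order; "ping" is conclusive
    either way, "skip" unconditionally trusts the session.
    """
    for method in methods:
        if method == "skip":
            return True  # no liveness test at all
        if method == "ping":
            try:
                # MCP client sessions support a ping request; a dead
                # transport raises (e.g. ClosedResourceError) or times out.
                await asyncio.wait_for(pooled.session.send_ping(), timeout=2.0)
                return True
            except Exception:
                return False  # dead transport: evict the session
    return False
```

Under this sketch, `["skip"]` (the current default) always returns True, while `["ping", "skip"]` actually exercises the transport for idle sessions.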

3. Transport-aware is_closed (longer term)

Make is_closed inspect the actual state of the underlying read/write streams rather than relying solely on a manually-set flag.
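One possible shape for this, sketched with placeholder internals: the `_read_stream`/`_write_stream` attribute names and the `_closed` attribute probed on anyio memory streams are assumptions about the transport, not a stable public API, so the check falls back to "open" when the attribute is absent:

```python
def stream_is_closed(stream) -> bool:
    # anyio memory streams track closure internally; treat unknown
    # transports as open rather than evicting them spuriously.
    return bool(getattr(stream, "_closed", False))


class PooledSession:
    """Simplified sketch; the real class carries much more state."""

    def __init__(self, read_stream, write_stream):
        self._closed = False
        self._read_stream = read_stream
        self._write_stream = write_stream

    def mark_closed(self) -> None:
        self._closed = True

    @property
    def is_closed(self) -> bool:
        # Closed if explicitly marked OR if either transport stream died.
        return (
            self._closed
            or stream_is_closed(self._read_stream)
            or stream_is_closed(self._write_stream)
        )
```

With this, a session whose write stream died mid-call reports `is_closed = True` on the next validation pass even if `mark_closed()` was never invoked.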

Affected Code

| File | Lines | Component |
| --- | --- | --- |
| mcpgateway/services/mcp_session_pool.py | ~1873-1877 | `session()` context manager |
| mcpgateway/services/mcp_session_pool.py | ~840-895 | `release()` method |
| mcpgateway/services/mcp_session_pool.py | ~975-999 | `_validate_session()` |
| mcpgateway/services/mcp_session_pool.py | ~140-146 | `PooledSession.is_closed` |
| docker-compose.yml | ~484 | MCP_SESSION_POOL_HEALTH_CHECK_METHODS default |

Discovered During

Load testing with locustfile_echo_delay.py (PR #3353) against the fast_test_server echo tool with a 500ms delay, exercising the Streamable HTTP path through /servers/{id}/mcp.

Labels

- MUSTP1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
- bug: Something isn't working
- performance: Performance related items
- ready: Validated, ready-to-work-on items