[BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load #3520
Description
Bug Summary
The MCP client session pool (mcpgateway/services/mcp_session_pool.py) returns broken sessions to the pool after transport failures, causing cascading ClosedResourceError failures. When a pooled session's underlying transport dies (network error, backend disconnect, resource exhaustion), the session() context manager's finally block unconditionally calls release(), which puts the dead session back into the pool. The next caller gets the same broken session, fails immediately, returns it again — creating a failure loop that produces 100% error rates even at low concurrency.
Impact: All MCP tool calls via pooled Streamable HTTP sessions fail with ToolInvocationError("Tool invocation failed: ") (empty message because ClosedResourceError has no string representation). Disabling the pool (MCP_SESSION_POOL_ENABLED=false) resolves the issue but sacrifices the 10-20x latency improvement the pool provides.
Steps to Reproduce
Requires PR #3353 (echo delay load test and fast_test_server delay support).
1. Start the testing stack with the echo delay locustfile:

   ```
   LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up
   ```

2. Open the Locust web UI at http://localhost:8089.
3. Start a test with 10 users, spawn rate 2.
4. Observe: nearly all MCP requests fail with "MCP tool error: Tool invocation failed: ".
5. Check gateway logs to confirm:

   ```
   docker compose logs gateway --tail=200 --no-log-prefix | grep "ClosedResourceError\|Tool invocation failed"
   ```

6. Stop the stack and restart with pooling disabled:

   ```
   MCP_SESSION_POOL_ENABLED=false LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up
   ```

7. Repeat the 10-user test: requests now succeed (at higher latency due to per-request session creation).
Root Cause
Two gaps in the session pool combine to create the failure loop:
1. No error-aware release (mcp_session_pool.py ~line 1873-1877)
The session() context manager has no except clause. It cannot distinguish a successful call from one that threw ClosedResourceError. The broken session is returned to the pool identically to a healthy one.
```python
# Current code — no error handling
pooled = await self.acquire(...)
try:
    yield pooled
finally:
    await self.release(pooled)  # Always returns to pool
```

2. Health checks skip recently-used sessions (mcp_session_pool.py ~line 975-999)
_validate_session() only runs health checks when idle_seconds > health_check_interval (default 60s). A broken session that is immediately re-acquired (idle < 60s) passes validation with no check. The default health check method ["skip"] compounds this — even sessions idle for 60+ seconds pass without a real liveness test.
```python
# Session used 1 second ago — no health check, returned as valid
if pooled.idle_seconds > self._health_check_interval:  # 1 < 60
    return await self._run_health_check_chain(pooled)
return True  # Broken session handed out
```

3. is_closed only checks a flag (mcp_session_pool.py ~line 140-146)

PooledSession.is_closed returns self._closed, a boolean set only by explicit mark_closed() calls. It does not inspect the underlying transport streams. A session with dead read/write streams still reports is_closed = False.
Observable Symptoms
| Symptom | Explanation |
|---|---|
| 100% ClosedResourceError on call_tool | Broken sessions recycled indefinitely |
| fast_test_server logs show initialize but no tool calls | Requests never reach the backend tool handler |
| "Tool invocation failed: " with empty message | ClosedResourceError has no message string |
| Sub-500ms response times despite 500ms delay | Broken sessions fail fast (~50ms) |
| Gateway returns HTTP 500 with ExceptionGroup | MCP SDK's send_log_message error handler also hits closed stream |
Error Chain (from gateway logs)
```
tool_service.py:3704 → pooled.session.call_tool(...)
mcp/client/session.py:383 → self.send_request(...)
mcp/shared/session.py:281 → self._write_stream.send(...)
anyio/streams/memory.py:218 → raise ClosedResourceError
        ↓
ExceptionGroup wraps the error
        ↓
tool_service.py:4031 → raise ToolInvocationError("Tool invocation failed: ")
```
Recommended Fix
1. Error-aware session release
Modify session() to catch exceptions and signal release() to discard the broken session:
```python
pooled = await self.acquire(...)
failed = False
try:
    yield pooled
except BaseException:
    failed = True
    raise
finally:
    await self.release(pooled, discard=failed)
```

Add a discard parameter to release() that closes and evicts the session instead of returning it to the pool.
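The pool-side discard path could look like the following minimal sketch. FakeSession, SessionPoolSketch, and the attribute names here are illustrative stand-ins, not the real mcp_session_pool API:

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for a pooled MCP session."""

    def __init__(self):
        self.closed = False

    @property
    def is_closed(self):
        return self.closed

    async def close(self):
        self.closed = True


class SessionPoolSketch:
    """Illustrative pool: release() either recycles or evicts."""

    def __init__(self):
        self.idle = []

    async def release(self, pooled, discard=False):
        if discard or pooled.is_closed:
            # Broken session: close it and evict it instead of recycling.
            await pooled.close()
            return
        self.idle.append(pooled)


async def demo():
    pool = SessionPoolSketch()
    healthy, broken = FakeSession(), FakeSession()
    await pool.release(healthy)               # healthy call: back to the pool
    await pool.release(broken, discard=True)  # failed call: evicted and closed
    return pool, healthy, broken


pool, healthy, broken = asyncio.run(demo())
print(len(pool.idle), broken.closed)  # 1 True
```

With this shape, a session that raised during yield never re-enters the idle list, which breaks the recycle loop at its source.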
2. Better health check defaults
Change MCP_SESSION_POOL_HEALTH_CHECK_METHODS from ["skip"] to ["ping", "skip"] so idle sessions get a real liveness check.
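As a sketch, the override in docker-compose.yml might look like this (hypothetical fragment; whether the value is parsed as a JSON list or a comma-separated string depends on the gateway's settings parser, so verify the expected encoding):

```yaml
# Hypothetical override; confirm the value format against
# mcpgateway's settings parsing before adopting.
services:
  gateway:
    environment:
      - MCP_SESSION_POOL_HEALTH_CHECK_METHODS=["ping", "skip"]
```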
3. Transport-aware is_closed (longer term)
Make is_closed inspect the actual state of the underlying read/write streams rather than relying solely on a manually-set flag.
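A hedged sketch of that direction is below. The _read_stream/_write_stream attributes and the closed flag on the stream objects are assumptions for illustration, not the real PooledSession internals:

```python
class FakeStream:
    """Stand-in for an anyio-style transport stream."""

    def __init__(self):
        self.closed = False


class PooledSessionSketch:
    """Illustrative only: not the real PooledSession implementation."""

    def __init__(self, read_stream, write_stream):
        self._read_stream = read_stream
        self._write_stream = write_stream
        self._closed = False  # explicit flag, set only by mark_closed()

    def mark_closed(self):
        self._closed = True

    @property
    def is_closed(self):
        # The explicit flag still short-circuits.
        if self._closed:
            return True
        # Additionally report closed when either underlying stream is
        # missing or has been torn down (assumes a queryable close state).
        for stream in (self._read_stream, self._write_stream):
            if stream is None or getattr(stream, "closed", False):
                return True
        return False


session = PooledSessionSketch(FakeStream(), FakeStream())
print(session.is_closed)            # False while both streams are live
session._write_stream.closed = True
print(session.is_closed)            # True once a stream dies, flag or not
```

The key change is that a dead transport makes is_closed return True even when mark_closed() was never called, so validation catches sessions the explicit flag misses.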
Affected Code
| File | Lines | Component |
|---|---|---|
| mcpgateway/services/mcp_session_pool.py | ~1873-1877 | session() context manager |
| mcpgateway/services/mcp_session_pool.py | ~840-895 | release() method |
| mcpgateway/services/mcp_session_pool.py | ~975-999 | _validate_session() |
| mcpgateway/services/mcp_session_pool.py | ~140-146 | PooledSession.is_closed |
| docker-compose.yml | ~484 | MCP_SESSION_POOL_HEALTH_CHECK_METHODS default |
Discovered During
Load testing with locustfile_echo_delay.py (PR #3353) against the fast_test_server echo tool with a 500ms delay, exercising the Streamable HTTP path through /servers/{id}/mcp.