Skip to content

fix(session-pool): prevent broken session recycling in MCPSessionPool#3605

Closed
gandhipratik203 wants to merge 4 commits intomainfrom
fix/mcp-session-pool-broken-session-recycling
Closed

fix(session-pool): prevent broken session recycling in MCPSessionPool#3605
gandhipratik203 wants to merge 4 commits intomainfrom
fix/mcp-session-pool-broken-session-recycling

Conversation

@gandhipratik203
Copy link
Copy Markdown
Collaborator

@gandhipratik203 gandhipratik203 commented Mar 11, 2026

Related Issue

Closes #3520

Summary

Fixes remaining root causes of broken session detection in MCPSessionPool and aligns the verbose-logging compose health check default with the recommended production setting.

Change details

Part 1 — Transport-aware is_closed

PooledSession.is_closed only checked the internal _closed flag, missing cases where the underlying transport stream had silently broken. Sessions with a closed write stream or zero open receive channels were incorrectly considered healthy and handed to callers.

  • Extended is_closed to inspect session._write_stream._closed and _state.open_receive_channels
  • Uses strict is True and isinstance(open_rx, int) checks to avoid false positives with mock objects in tests
  • All getattr lookups include safe fallbacks so the property degrades gracefully if MCP internals change

Part 2 — Health check default (docker-compose-verbose-logging.yml)

MCP_SESSION_POOL_HEALTH_CHECK_METHODS was set to ["skip"] in docker-compose-verbose-logging.yml, contradicting the ["ping", "skip"] default already set in config.py and docker-compose.yml. Updated to ["ping", "skip"] so that a ping is attempted on session acquisition (with graceful fallback to skip for servers that do not support it).

Part 3 — Tests

Added unit tests covering all is_closed transport-inspection paths, release(discard=True) behaviour, and context manager discard on both Exception and BaseException. Fixed DummyResponse stubs in test_mcp_session_pool_coverage.py to include is_success = True.


Scope

Requirement Status
is_closed inspects actual transport stream state
docker-compose-verbose-logging.yml health check default aligned to ["ping", "skip"]
Unit tests cover all is_closed transport-inspection paths
Unit tests confirm release(discard=True) closes instead of recycling
Unit tests confirm context manager discards on Exception and BaseException

🔬 Steps to Test

  1. Run the session pool unit tests: python3 -m pytest tests/unit/mcpgateway/services/test_mcp_session_pool.py tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py -v
  2. Verify the health check default: start with docker-compose -f docker-compose-verbose-logging.yml up and confirm MCP_SESSION_POOL_HEALTH_CHECK_METHODS resolves to ["ping", "skip"].
  3. Optionally reproduce with the echo-delay load test from PR refactor(makefile): parameterize target families and add deprecation framework #3353 (requires the locustfile_echo_delay.py locustfile and fast_test_server delay support introduced in that PR). Start the stack with LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up, run 10 users at spawn rate 2 in the Locust UI at http://localhost:8089, and confirm no ClosedResourceError or empty "Tool invocation failed: " errors appear in the gateway logs.
Manual test — idle session with dead backend

Manual test results

Tested against a live gateway (make dev) with an SSE translator wrapping mcp-server-git on port 9000.

Prerequisites

  • A gateway registered in the UI pointing to http://localhost:9000/sse (SSE transport) — this federates the git_log tool into the gateway
  • uvx available (pip install uv if not)

Setup — start services, create virtual server, associate tool

# Terminal 1 — start the gateway
make dev

# Terminal 2 — start the SSE translator (wraps mcp-server-git on port 9000)
python3 -m mcpgateway.translate --stdio "uvx mcp-server-git" --port 9000

# Terminal 3 — set up auth and test resources
export TOKEN=$(python3 -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --exp 10080 2>/dev/null | tail -1)

# Create a virtual server
export SERVER_ID=$(curl -s -X POST http://localhost:8000/servers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"server": {"name": "test-server", "description": "Session pool fix test"}}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

# Get the git_log tool ID and name (federated from the registered gateway)
read TOOL_ID TOOL_NAME < <(curl -s http://localhost:8000/tools \
  -H "Authorization: Bearer $TOKEN" | python3 -c \
  "import sys,json; ts=json.load(sys.stdin); \
  t=next(t for t in ts if t['originalName']=='git_log'); \
  print(t['id'], t['name'])")
export TOOL_ID TOOL_NAME
echo "Tool: $TOOL_NAME (id: $TOOL_ID)"

# Associate the tool with the virtual server
curl -s -X PUT "http://localhost:8000/servers/${SERVER_ID}" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"server\": {\"name\": \"test-server\"}, \"associated_tools\": [\"${TOOL_ID}\"]}" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('Tools:', d['associatedTools'])"

Test — health check on idle session (backend killed after 10s idle)

This test lowers MCP_SESSION_POOL_HEALTH_CHECK_INTERVAL to 5s for the duration of the test only.
Restart make dev normally afterwards to restore the 60s default.

# B1. Restart the gateway with a reduced health check interval
# (Ctrl+C the existing make dev first, then:)
MCP_SESSION_POOL_HEALTH_CHECK_INTERVAL=5 make dev

# B2. Happy path — establishes a pooled session
curl -s -X POST "http://localhost:8000/servers/${SERVER_ID}/mcp" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d "{\"jsonrpc\":\"2.0\",\"id\":4,\"method\":\"tools/call\",\"params\":{\"name\":\"${TOOL_NAME}\",\"arguments\":{\"repo_path\":\"/tmp\"}}}" \
  | python3 -m json.tool
# Expected: "isError": false, "text": "/tmp"

# B3. Wait 10s idle, then kill the translator (no calls during idle)
sleep 10 && pkill -f "mcpgateway.translate" && echo "Translator killed"

# B4. Call immediately — ping health check should detect dead session before the call
curl -s -X POST "http://localhost:8000/servers/${SERVER_ID}/mcp" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d "{\"jsonrpc\":\"2.0\",\"id\":2,\"method\":\"tools/call\",\"params\":{\"name\":\"${TOOL_NAME}\",\"arguments\":{\"repo_path\":\"/tmp\"}}}" \
  | python3 -m json.tool
# Expected: "isError": true, "text": "Tool invocation failed: Connection closed"
# (pre-fix: idle session handed out unchecked, silent ClosedResourceError)

# B5. Restore the gateway with the default 60s interval
# (Ctrl+C make dev, then:)
make dev

Results summary

Test Scenario Before fix After fix
B2 Tool call — backend running isError: false isError: false
B4 10s idle, backend killed silently, call Silent ClosedResourceError handed out "Tool invocation failed: Connection closed"

is_closed transport stream inspection is covered by unit tests only — a silently dead transport with no exception is not reliably triggerable via curl.


🏷️ Type of Change

  • Bug fix
  • Feature / Enhancement
  • Documentation
  • Refactor
  • Chore (deps, CI, tooling)
  • Other (describe below)

🧪 Verification

Check Command Status
Lint suite make lint
Unit tests make test
Session pool tests python3 -m pytest tests/unit/mcpgateway/services/test_mcp_session_pool.py tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py

✅ Checklist

  • Code formatted (make black isort pre-commit)
  • Tests added/updated for changes
  • Documentation updated (if applicable)
  • No secrets or credentials committed

📓 Notes

The ["ping", "skip"] health check chain adds a ping on every session acquisition. Servers that do not support the ping method will fall through to skip automatically, so there is no regression for environments using non-standard MCP servers.

@gandhipratik203 gandhipratik203 marked this pull request as ready for review March 11, 2026 08:39
@gandhipratik203 gandhipratik203 changed the title Fix/mcp session pool broken session recycling fix(session-pool): prevent broken session recycling in MCPSessionPool Mar 11, 2026
@gandhipratik203 gandhipratik203 added bug Something isn't working performance Performance related items labels Mar 11, 2026
@gandhipratik203 gandhipratik203 force-pushed the fix/mcp-session-pool-broken-session-recycling branch from bf6b328 to c58a2dd Compare March 11, 2026 22:45
@gandhipratik203 gandhipratik203 added the MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe label Mar 11, 2026
@gandhipratik203 gandhipratik203 marked this pull request as draft March 12, 2026 09:32
@gandhipratik203 gandhipratik203 marked this pull request as ready for review March 12, 2026 15:08
@gandhipratik203 gandhipratik203 added this to the Release 1.0.0 milestone Mar 12, 2026
@gandhipratik203 gandhipratik203 self-assigned this Mar 12, 2026
@gandhipratik203 gandhipratik203 force-pushed the fix/mcp-session-pool-broken-session-recycling branch from c58a2dd to 165134d Compare March 12, 2026 21:48
@gandhipratik203 gandhipratik203 added release-fix Critical bugfix required for the release ready Validated, ready-to-work-on items labels Mar 13, 2026
@msureshkumar88 msureshkumar88 self-requested a review March 18, 2026 11:13
…cycling

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…cess changes

Update test_session_context_manager_releases_on_exception to expect
release(discard=True) on exception, matching the new discard behaviour
added to the session() context manager.

Add is_success = True to DummyResponse stubs in
test_execute_forwarded_request_success_result and
test_execute_forwarded_request_error_result to match the
response.is_success guard introduced in mcp_session_pool.py.

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
…ing", "skip"]

Updates docker-compose.yml and docker-compose-verbose-logging.yml to use
["ping", "skip"] instead of ["skip"], aligning with the config.py default.

This enables ping-based health checks on session acquisition so broken
sessions are detected before being handed to callers, with graceful fallback
to skip for servers that do not support the ping method.

Closes #3520

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>
@gandhipratik203 gandhipratik203 force-pushed the fix/mcp-session-pool-broken-session-recycling branch from 165134d to 7168d44 Compare March 19, 2026 07:37
crivetimihai added a commit that referenced this pull request Mar 19, 2026
#3737)

The MCP session pool manually called transport_ctx.__aenter__() and
session.__aenter__(), attaching anyio cancel scopes to the HTTP request
handler task. When a child task in the transport's TaskGroup failed, it
cancelled the request handler, causing ~20-45% tool call failures.

Changes:
- Run transport/session lifecycle in dedicated asyncio.Task per pooled
  session so cancel scopes bind to the owner task, not request handlers
- Add transport-aware is_closed detection (write stream state, owner
  task health, receive channel count) from PR #3605
- Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in
  tool_service.py (4 sites), mcp_session_pool.py health checks (4),
  and gateway_service.py explicit health RPC (1)
- Fix release() to clean up dead-owner sessions instead of leaking
  them in _active with consumed semaphore slots
- Fix _create_session() to clean up owner task on CancelledError via
  finally block (not just Exception)
- Add 18 tests for owner task lifecycle, transport-aware is_closed,
  release with dead owner, cancellation cleanup, and prompt pooled
  regression coverage

Tested: 607K+ requests across 200-1000 concurrent users with 0%
pool-related failures. All servers (Fast Test, Fast Time), all
transports (StreamableHTTP, SSE) verified.

Closes #3737

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
crivetimihai added a commit that referenced this pull request Mar 19, 2026
#3739)

* fix(session-pool): isolate cancel scopes via background task ownership (#3737)

The MCP session pool manually called transport_ctx.__aenter__() and
session.__aenter__(), attaching anyio cancel scopes to the HTTP request
handler task. When a child task in the transport's TaskGroup failed, it
cancelled the request handler, causing ~20-45% tool call failures.

Changes:
- Run transport/session lifecycle in dedicated asyncio.Task per pooled
  session so cancel scopes bind to the owner task, not request handlers
- Add transport-aware is_closed detection (write stream state, owner
  task health, receive channel count) from PR #3605
- Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in
  tool_service.py (4 sites), mcp_session_pool.py health checks (4),
  and gateway_service.py explicit health RPC (1)
- Fix release() to clean up dead-owner sessions instead of leaking
  them in _active with consumed semaphore slots
- Fix _create_session() to clean up owner task on CancelledError via
  finally block (not just Exception)
- Add 18 tests for owner task lifecycle, transport-aware is_closed,
  release with dead owner, cancellation cleanup, and prompt pooled
  regression coverage

Tested: 607K+ requests across 200-1000 concurrent users with 0%
pool-related failures. All servers (Fast Test, Fast Time), all
transports (StreamableHTTP, SSE) verified.

Closes #3737

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(test): exercise actual PromptService fallback path in prompt pooled regression test

The prompt fallback test was only asserting that get_mcp_session_pool()
raises RuntimeError, not that PromptService._fetch_gateway_prompt_result()
actually falls through to the non-pooled sse_client path. Now calls the
real method with mocked pool unavailability and verifies the SSE fallback
session is initialized and get_prompt is invoked.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(test): improve diff coverage for session pool cancel scope fix

- Add test for is_closed graceful degradation when SDK internals raise
- Exercise actual PromptService._fetch_gateway_prompt_result fallback
  path when pool is unavailable (not just bare get_mcp_session_pool)
- Improve force-cancel timeout test with shorter cleanup timeout
- Add test for _create_session finally-block BaseException handling

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Copy link
Copy Markdown
Member

Incorporated into #3739

1 similar comment
@crivetimihai
Copy link
Copy Markdown
Member

Incorporated into #3739

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe performance Performance related items ready Validated, ready-to-work-on items release-fix Critical bugfix required for the release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG][PERFORMANCE]: MCP session pool recycles broken sessions, causing cascading ClosedResourceError failures under load

2 participants