fix(session-pool): prevent broken session recycling in MCPSessionPool by gandhipratik203 · Pull Request #3605 · IBM/mcp-context-forge

gandhipratik203 · 2026-03-11T08:38:45Z

Related Issue

Summary

Fixes remaining root causes of broken session detection in MCPSessionPool and aligns the verbose-logging compose health check default with the recommended production setting.

Change details

Part 1 — Transport-aware is_closed

PooledSession.is_closed only checked the internal _closed flag, missing cases where the underlying transport stream had silently broken. Sessions with a closed write stream or zero open receive channels were incorrectly considered healthy and handed to callers.

Extended is_closed to inspect session._write_stream._closed and _state.open_receive_channels
Uses strict is True and isinstance(open_rx, int) checks to avoid false positives with mock objects in tests
All getattr lookups include safe fallbacks so the property degrades gracefully if MCP internals change

Part 2 — Health check default (docker-compose-verbose-logging.yml)

MCP_SESSION_POOL_HEALTH_CHECK_METHODS was set to ["skip"] in docker-compose-verbose-logging.yml, contradicting the ["ping", "skip"] default already set in config.py and docker-compose.yml. Updated to ["ping", "skip"] so that a ping is attempted on session acquisition (with graceful fallback to skip for servers that do not support it).

Part 3 — Tests

Added unit tests covering all is_closed transport-inspection paths, release(discard=True) behaviour, and context manager discard on both Exception and BaseException. Fixed DummyResponse stubs in test_mcp_session_pool_coverage.py to include is_success = True.

Scope

Requirement	Status
`is_closed` inspects actual transport stream state	✅
`docker-compose-verbose-logging.yml` health check default aligned to `["ping", "skip"]`	✅
Unit tests cover all `is_closed` transport-inspection paths	✅
Unit tests confirm `release(discard=True)` closes instead of recycling	✅
Unit tests confirm context manager discards on `Exception` and `BaseException`	✅

🔬 Steps to Test

Run the session pool unit tests: python3 -m pytest tests/unit/mcpgateway/services/test_mcp_session_pool.py tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py -v
Verify the health check default: start with docker-compose -f docker-compose-verbose-logging.yml up and confirm MCP_SESSION_POOL_HEALTH_CHECK_METHODS resolves to ["ping", "skip"].
Optionally reproduce with the echo-delay load test from PR refactor(makefile): parameterize target families and add deprecation framework #3353 (requires the locustfile_echo_delay.py locustfile and fast_test_server delay support introduced in that PR). Start the stack with LOCUST_LOCUSTFILE=locustfile_echo_delay.py make testing-up, run 10 users at spawn rate 2 in the Locust UI at http://localhost:8089, and confirm no ClosedResourceError or empty "Tool invocation failed: " errors appear in the gateway logs.

Manual test — idle session with dead backend

Manual test results

Tested against a live gateway (make dev) with an SSE translator wrapping mcp-server-git on port 9000.

Prerequisites

A gateway registered in the UI pointing to http://localhost:9000/sse (SSE transport) — this federates the git_log tool into the gateway
uvx available (pip install uv if not)

Setup — start services, create virtual server, associate tool

# Terminal 1 — start the gateway
make dev

# Terminal 2 — start the SSE translator (wraps mcp-server-git on port 9000)
python3 -m mcpgateway.translate --stdio "uvx mcp-server-git" --port 9000

# Terminal 3 — set up auth and test resources
export TOKEN=$(python3 -m mcpgateway.utils.create_jwt_token \
  --username admin@example.com --exp 10080 2>/dev/null | tail -1)

# Create a virtual server
export SERVER_ID=$(curl -s -X POST http://localhost:8000/servers \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"server": {"name": "test-server", "description": "Session pool fix test"}}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['id'])")

# Get the git_log tool ID and name (federated from the registered gateway)
read TOOL_ID TOOL_NAME < <(curl -s http://localhost:8000/tools \
  -H "Authorization: Bearer $TOKEN" | python3 -c \
  "import sys,json; ts=json.load(sys.stdin); \
  t=next(t for t in ts if t['originalName']=='git_log'); \
  print(t['id'], t['name'])")
export TOOL_ID TOOL_NAME
echo "Tool: $TOOL_NAME (id: $TOOL_ID)"

# Associate the tool with the virtual server
curl -s -X PUT "http://localhost:8000/servers/${SERVER_ID}" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"server\": {\"name\": \"test-server\"}, \"associated_tools\": [\"${TOOL_ID}\"]}" \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('Tools:', d['associatedTools'])"

Test — health check on idle session (backend killed after 10s idle)

This test lowers MCP_SESSION_POOL_HEALTH_CHECK_INTERVAL to 5s for the duration of the test only.
Restart make dev normally afterwards to restore the 60s default.

# B1. Restart the gateway with a reduced health check interval
# (Ctrl+C the existing make dev first, then:)
MCP_SESSION_POOL_HEALTH_CHECK_INTERVAL=5 make dev

# B2. Happy path — establishes a pooled session
curl -s -X POST "http://localhost:8000/servers/${SERVER_ID}/mcp" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d "{\"jsonrpc\":\"2.0\",\"id\":4,\"method\":\"tools/call\",\"params\":{\"name\":\"${TOOL_NAME}\",\"arguments\":{\"repo_path\":\"/tmp\"}}}" \
  | python3 -m json.tool
# Expected: "isError": false, "text": "/tmp"

# B3. Wait 10s idle, then kill the translator (no calls during idle)
sleep 10 && pkill -f "mcpgateway.translate" && echo "Translator killed"

# B4. Call immediately — ping health check should detect dead session before the call
curl -s -X POST "http://localhost:8000/servers/${SERVER_ID}/mcp" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d "{\"jsonrpc\":\"2.0\",\"id\":2,\"method\":\"tools/call\",\"params\":{\"name\":\"${TOOL_NAME}\",\"arguments\":{\"repo_path\":\"/tmp\"}}}" \
  | python3 -m json.tool
# Expected: "isError": true, "text": "Tool invocation failed: Connection closed"
# (pre-fix: idle session handed out unchecked, silent ClosedResourceError)

# B5. Restore the gateway with the default 60s interval
# (Ctrl+C make dev, then:)
make dev

Results summary

Test	Scenario	Before fix	After fix
B2	Tool call — backend running	`isError: false`	`isError: false` ✅
B4	10s idle, backend killed silently, call	Silent `ClosedResourceError` handed out	`"Tool invocation failed: Connection closed"` ✅

is_closed transport stream inspection is covered by unit tests only — a silently dead transport with no exception is not reliably triggerable via curl.

🏷️ Type of Change

🧪 Verification

Check	Command	Status
Lint suite	`make lint`	✅
Unit tests	`make test`	✅
Session pool tests	`python3 -m pytest tests/unit/mcpgateway/services/test_mcp_session_pool.py tests/unit/mcpgateway/services/test_mcp_session_pool_coverage.py`	✅

✅ Checklist

Code formatted (make black isort pre-commit)
Tests added/updated for changes
Documentation updated (if applicable)
No secrets or credentials committed

📓 Notes

The ["ping", "skip"] health check chain adds a ping on every session acquisition. Servers that do not support the ping method will fall through to skip automatically, so there is no regression for environments using non-standard MCP servers.

…cycling Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

…cess changes Update test_session_context_manager_releases_on_exception to expect release(discard=True) on exception, matching the new discard behaviour added to the session() context manager. Add is_success = True to DummyResponse stubs in test_execute_forwarded_request_success_result and test_execute_forwarded_request_error_result to match the response.is_success guard introduced in mcp_session_pool.py. Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

…ing", "skip"] Updates docker-compose.yml and docker-compose-verbose-logging.yml to use ["ping", "skip"] instead of ["skip"], aligning with the config.py default. This enables ping-based health checks on session acquisition so broken sessions are detected before being handed to callers, with graceful fallback to skip for servers that do not support the ping method. Closes #3520 Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

#3737) The MCP session pool manually called transport_ctx.__aenter__() and session.__aenter__(), attaching anyio cancel scopes to the HTTP request handler task. When a child task in the transport's TaskGroup failed, it cancelled the request handler, causing ~20-45% tool call failures. Changes: - Run transport/session lifecycle in dedicated asyncio.Task per pooled session so cancel scopes bind to the owner task, not request handlers - Add transport-aware is_closed detection (write stream state, owner task health, receive channel count) from PR #3605 - Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in tool_service.py (4 sites), mcp_session_pool.py health checks (4), and gateway_service.py explicit health RPC (1) - Fix release() to clean up dead-owner sessions instead of leaking them in _active with consumed semaphore slots - Fix _create_session() to clean up owner task on CancelledError via finally block (not just Exception) - Add 18 tests for owner task lifecycle, transport-aware is_closed, release with dead owner, cancellation cleanup, and prompt pooled regression coverage Tested: 607K+ requests across 200-1000 concurrent users with 0% pool-related failures. All servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE) verified. Closes #3737 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

#3739) * fix(session-pool): isolate cancel scopes via background task ownership (#3737) The MCP session pool manually called transport_ctx.__aenter__() and session.__aenter__(), attaching anyio cancel scopes to the HTTP request handler task. When a child task in the transport's TaskGroup failed, it cancelled the request handler, causing ~20-45% tool call failures. Changes: - Run transport/session lifecycle in dedicated asyncio.Task per pooled session so cancel scopes bind to the owner task, not request handlers - Add transport-aware is_closed detection (write stream state, owner task health, receive channel count) from PR #3605 - Replace asyncio.wait_for with anyio.fail_after for MCP SDK calls in tool_service.py (4 sites), mcp_session_pool.py health checks (4), and gateway_service.py explicit health RPC (1) - Fix release() to clean up dead-owner sessions instead of leaking them in _active with consumed semaphore slots - Fix _create_session() to clean up owner task on CancelledError via finally block (not just Exception) - Add 18 tests for owner task lifecycle, transport-aware is_closed, release with dead owner, cancellation cleanup, and prompt pooled regression coverage Tested: 607K+ requests across 200-1000 concurrent users with 0% pool-related failures. All servers (Fast Test, Fast Time), all transports (StreamableHTTP, SSE) verified. Closes #3737 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(test): exercise actual PromptService fallback path in prompt pooled regression test The prompt fallback test was only asserting that get_mcp_session_pool() raises RuntimeError, not that PromptService._fetch_gateway_prompt_result() actually falls through to the non-pooled sse_client path. Now calls the real method with mocked pool unavailability and verifies the SSE fallback session is initialized and get_prompt is invoked. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(test): improve diff coverage for session pool cancel scope fix - Add test for is_closed graceful degradation when SDK internals raise - Exercise actual PromptService._fetch_gateway_prompt_result fallback path when pool is unavailable (not just bare get_mcp_session_pool) - Improve force-cancel timeout test with shorter cleanup timeout - Add test for _create_session finally-block BaseException handling Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> --------- Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

crivetimihai · 2026-03-19T14:05:44Z

Incorporated into #3739

crivetimihai · 2026-03-19T14:06:00Z

Incorporated into #3739

gandhipratik203 marked this pull request as ready for review March 11, 2026 08:39

gandhipratik203 requested review from crivetimihai, kevalmahajan and madhav165 as code owners March 11, 2026 08:39

gandhipratik203 changed the title ~~Fix/mcp session pool broken session recycling~~ fix(session-pool): prevent broken session recycling in MCPSessionPool Mar 11, 2026

gandhipratik203 added bug Something isn't working performance Performance related items labels Mar 11, 2026

gandhipratik203 force-pushed the fix/mcp-session-pool-broken-session-recycling branch from bf6b328 to c58a2dd Compare March 11, 2026 22:45

gandhipratik203 added the MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe label Mar 11, 2026

gandhipratik203 requested a review from jonpspri March 12, 2026 07:48

gandhipratik203 marked this pull request as draft March 12, 2026 09:32

gandhipratik203 marked this pull request as ready for review March 12, 2026 15:08

gandhipratik203 added this to the Release 1.0.0 milestone Mar 12, 2026

gandhipratik203 self-assigned this Mar 12, 2026

gandhipratik203 requested review from Lang-Akshay and cafalchio March 12, 2026 15:29

gandhipratik203 force-pushed the fix/mcp-session-pool-broken-session-recycling branch from c58a2dd to 165134d Compare March 12, 2026 21:48

gandhipratik203 added release-fix Critical bugfix required for the release ready Validated, ready-to-work-on items labels Mar 13, 2026

msureshkumar88 self-requested a review March 18, 2026 11:13

gandhipratik203 added 4 commits March 19, 2026 07:30

fix(session-pool): discard broken sessions on exception instead of re…

ca9ba65

…cycling Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

fix(session-pool): detect broken transport streams via is_closed

ca3c384

Signed-off-by: Pratik Gandhi <gandhipratik203@gmail.com>

gandhipratik203 force-pushed the fix/mcp-session-pool-broken-session-recycling branch from 165134d to 7168d44 Compare March 19, 2026 07:37

crivetimihai mentioned this pull request Mar 19, 2026

[BUG][PERFORMANCE]: MCP session pool cancel scope leak causes ~20-45% tool call failures on /servers/{id}/mcp #3737

Closed

2 tasks

crivetimihai mentioned this pull request Mar 19, 2026

fix(session-pool): isolate cancel scopes via background task ownership #3739

Merged

4 tasks

crivetimihai closed this in #3739 Mar 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(session-pool): prevent broken session recycling in MCPSessionPool#3605

fix(session-pool): prevent broken session recycling in MCPSessionPool#3605
gandhipratik203 wants to merge 4 commits intomainfrom
fix/mcp-session-pool-broken-session-recycling

gandhipratik203 commented Mar 11, 2026 •

edited

Loading

Uh oh!

crivetimihai commented Mar 19, 2026

Uh oh!

crivetimihai commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

gandhipratik203 commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Related Issue

Summary

Scope

🔬 Steps to Test

Manual test results

🏷️ Type of Change

🧪 Verification

✅ Checklist

📓 Notes

Uh oh!

crivetimihai commented Mar 19, 2026

Uh oh!

crivetimihai commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gandhipratik203 commented Mar 11, 2026 •

edited

Loading