fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)#13406
Conversation
…res (NousResearch#13383) Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path, so **every subsequent call on the affected server failed until the gateway was manually restarted**. Reporter: NousResearch#13383. The OAuth-based recovery path (``_handle_auth_error_and_retry``) already exists for 401s, but it only fires on auth errors. Session expiry slipped through because the access token is still valid — nothing 401'd, so the existing recovery branch was skipped. Fix --- Add a sibling function ``_handle_session_expired_and_retry`` that detects MCP session-expiry via ``_is_session_expired_error`` (a narrow allow-list of known-stable substrings: ``"invalid or expired session"``, ``"session expired"``, ``"session not found"``, ``"unknown session"``, etc.) and then uses the existing transport reconnect mechanism: * Sets ``MCPServerTask._reconnect_event`` — the server task's lifecycle loop already interprets this as "tear down the current ``streamablehttp_client`` + ``ClientSession`` and rebuild them, reusing the existing OAuth provider instance". * Waits up to 15 s for the new session to come back ready. * Retries the original call once. If the retry succeeds, returns its result and resets the circuit-breaker error count. If the retry raises, or if the reconnect doesn't ready in time, falls through to the caller's generic error path. Unlike the 401 path, this does **not** call ``handle_401`` — the access token is already valid and running an OAuth refresh would be a pointless round-trip. All 5 MCP handlers (``call_tool``, ``list_resources``, ``read_resource``, ``list_prompts``, ``get_prompt``) now consult both recovery paths before falling through: recovered = _handle_auth_error_and_retry(...) # 401 path if recovered is not None: return recovered recovered = _handle_session_expired_and_retry(...) # new if recovered is not None: return recovered # generic error response Narrow scope — explicitly not changed ------------------------------------- * **Detection is string-based on a 5-entry allow-list.** The MCP SDK wraps JSON-RPC errors in ``McpError`` whose exception type + attributes vary across SDK versions, so matching on message substrings is the durable path. Kept narrow to avoid false positives — a regular ``RuntimeError("Tool failed")`` will NOT trigger spurious reconnects (pinned by ``test_is_session_expired_rejects_unrelated_errors``). * **No change to the existing 401 recovery flow.** The new path is consulted only after the auth path declines (returns ``None``). * **Retry count stays at 1.** If the reconnect-then-retry also fails, we don't loop — the error surfaces normally so the model sees a failed tool call rather than a hang. * **``InterruptedError`` is explicitly excluded** from session-expired detection so user-cancel signals always short-circuit the same way they did before (pinned by ``test_is_session_expired_rejects_interrupted_error``). Regression coverage ------------------- ``tests/tools/test_mcp_tool_session_expired.py`` (new, 16 cases): Unit tests for ``_is_session_expired_error``: * ``test_is_session_expired_detects_invalid_or_expired_session`` — reporter's exact wpcom-mcp text. * ``test_is_session_expired_detects_expired_session_variant`` — "Session expired" / "expired session" variants. * ``test_is_session_expired_detects_session_not_found`` — server GC variant ("session not found", "unknown session"). * ``test_is_session_expired_is_case_insensitive``. * ``test_is_session_expired_rejects_unrelated_errors`` — narrow-scope canary: random RuntimeError / ValueError / 401 don't trigger. * ``test_is_session_expired_rejects_interrupted_error`` — user cancel must never route through reconnect. * ``test_is_session_expired_rejects_empty_message``. Handler integration tests: * ``test_call_tool_handler_reconnects_on_session_expired`` — reporter's full repro: first call raises "Invalid or expired session", handler signals ``_reconnect_event``, retries once, returns the retry's success result with no ``error`` key. * ``test_call_tool_handler_non_session_expired_error_falls_through`` — preserved-behaviour canary: random tool failures do NOT trigger reconnect. * ``test_session_expired_handler_returns_none_without_loop`` — defensive: cold-start / shutdown race. * ``test_session_expired_handler_returns_none_without_server_record`` — torn-down server falls through cleanly. * ``test_session_expired_handler_returns_none_when_retry_also_fails`` — no retry loop on repeated failure. Parametrised across all 4 non-``tools/call`` handlers: * ``test_non_tool_handlers_also_reconnect_on_session_expired`` [list_resources / read_resource / list_prompts / get_prompt]. **15 of 16 fail on clean ``origin/main`` (``6fb69229``)** with ``ImportError: cannot import name '_is_session_expired_error'`` — the fix's surface symbols don't exist there yet. The 1 passing test is an ordering artefact of pytest-xdist worker collection. Validation ---------- ``source venv/bin/activate && python -m pytest tests/tools/test_mcp_tool_session_expired.py -q`` → **16 passed**. Broader MCP suite (5 files: ``test_mcp_tool.py``, ``test_mcp_tool_401_handling.py``, ``test_mcp_tool_session_expired.py``, ``test_mcp_reconnect_signal.py``, ``test_mcp_oauth.py``) → **230 passed, 0 regressions**. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds automatic recovery for MCP Streamable HTTP servers whose transport session expires while OAuth tokens remain valid, by triggering a reconnect and retrying the operation once.
Changes:
- Introduces session-expiry detection (
_SESSION_EXPIRED_MARKERS,_is_session_expired_error) and a reconnect+retry helper (_handle_session_expired_and_retry). - Wires the new recovery path into all 5 MCP handler branches (tools/call, resources/list, resources/read, prompts/list, prompts/get).
- Adds a new test suite covering detection and handler-level reconnect plumbing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/mcp_tool.py | Adds session-expired detection + reconnect-and-retry helper and integrates it into all MCP handlers. |
| tests/tools/test_mcp_tool_session_expired.py | Adds unit + integration tests to validate the new session-expiry recovery behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| srv = _servers.get(server_name) | ||
| if srv is None or not hasattr(srv, "_reconnect_event"): | ||
| return None | ||
|
|
||
| loop = _mcp_loop | ||
| if loop is None or not loop.is_running(): | ||
| return None | ||
|
|
||
| logger.info( | ||
| "MCP server '%s': %s failed with session-expired error (%s); " | ||
| "signalling transport reconnect and retrying once.", | ||
| server_name, op_description, exc, | ||
| ) | ||
|
|
||
| # Trigger the same reconnect mechanism the OAuth recovery path | ||
| # uses, then wait briefly for the new session to come back ready. | ||
| loop.call_soon_threadsafe(srv._reconnect_event.set) | ||
| deadline = time.monotonic() + 15 | ||
| ready = False | ||
| while time.monotonic() < deadline: | ||
| if srv.session is not None and srv._ready.is_set(): | ||
| ready = True |
There was a problem hiding this comment.
This helper checks for _reconnect_event but then unconditionally accesses srv._ready inside the readiness loop. If a server record is partially constructed/tearing down (or a test stub is missing _ready), this can raise AttributeError and prevent falling back to the generic error path. Consider also guarding for _ready (and potentially session) via hasattr/getattr before entering the loop.
| import time | ||
| from unittest.mock import AsyncMock, MagicMock |
There was a problem hiding this comment.
Unused imports: AsyncMock and time are imported but not referenced in this test module. Removing them reduces noise and avoids potential lint failures if static checks are enabled in CI later.
| import time | |
| from unittest.mock import AsyncMock, MagicMock | |
| from unittest.mock import MagicMock |
| # uses, then wait briefly for the new session to come back ready. | ||
| loop.call_soon_threadsafe(srv._reconnect_event.set) | ||
| deadline = time.monotonic() + 15 | ||
| ready = False | ||
| while time.monotonic() < deadline: | ||
| if srv.session is not None and srv._ready.is_set(): |
There was a problem hiding this comment.
The post-reconnect “ready” probe will usually succeed immediately because _ready is intentionally kept set across reconnects (see MCPServerTask.run comment) and srv.session is still non-None until the server task actually tears down the old session. That means the loop can break before the reconnect has happened, and the retry may run against the old (expired) session. To ensure the retry uses the new session, capture the current old_session = srv.session before signalling reconnect and wait until srv.session becomes None and then a different non-None object (or otherwise track a reconnect generation).
| # uses, then wait briefly for the new session to come back ready. | |
| loop.call_soon_threadsafe(srv._reconnect_event.set) | |
| deadline = time.monotonic() + 15 | |
| ready = False | |
| while time.monotonic() < deadline: | |
| if srv.session is not None and srv._ready.is_set(): | |
| # uses, then wait briefly for the old session to be torn down and | |
| # a different session object to come back ready. | |
| old_session = srv.session | |
| loop.call_soon_threadsafe(srv._reconnect_event.set) | |
| deadline = time.monotonic() + 15 | |
| ready = False | |
| saw_teardown = old_session is None | |
| while time.monotonic() < deadline: | |
| current_session = srv.session | |
| if current_session is None: | |
| saw_teardown = True | |
| elif ( | |
| saw_teardown | |
| and current_session is not old_session | |
| and srv._ready.is_set() | |
| ): |
| # Trigger the same reconnect mechanism the OAuth recovery path | ||
| # uses, then wait briefly for the new session to come back ready. | ||
| loop.call_soon_threadsafe(srv._reconnect_event.set) | ||
| deadline = time.monotonic() + 15 | ||
| ready = False | ||
| while time.monotonic() < deadline: | ||
| if srv.session is not None and srv._ready.is_set(): | ||
| ready = True | ||
| break | ||
| time.sleep(0.25) |
There was a problem hiding this comment.
_handle_session_expired_and_retry blocks in a polling loop (time.sleep) for up to 15s and doesn’t honor user interrupts while waiting for reconnect readiness. Since this runs on the calling (agent) thread, a user cancel/new message won’t be observed until after the loop finishes, which is inconsistent with _run_on_mcp_loop’s interrupt-aware behavior. Consider awaiting srv._ready on the MCP loop with a timeout (via _run_on_mcp_loop) or otherwise checking tools.interrupt.is_interrupted() during the wait so cancellations short-circuit promptly.
| # Trigger the same reconnect mechanism the OAuth recovery path | |
| # uses, then wait briefly for the new session to come back ready. | |
| loop.call_soon_threadsafe(srv._reconnect_event.set) | |
| deadline = time.monotonic() + 15 | |
| ready = False | |
| while time.monotonic() < deadline: | |
| if srv.session is not None and srv._ready.is_set(): | |
| ready = True | |
| break | |
| time.sleep(0.25) | |
| def _is_user_interrupted() -> bool: | |
| try: | |
| interrupt_mod = __import__("tools.interrupt", fromlist=["is_interrupted"]) | |
| is_interrupted = getattr(interrupt_mod, "is_interrupted", None) | |
| return bool(is_interrupted and is_interrupted()) | |
| except Exception: | |
| return False | |
| async def _wait_for_reconnect_ready() -> bool: | |
| deadline = time.monotonic() + 15 | |
| while time.monotonic() < deadline: | |
| if srv.session is not None and srv._ready.is_set(): | |
| return True | |
| await asyncio.sleep(0.25) | |
| return False | |
| # Trigger the same reconnect mechanism the OAuth recovery path | |
| # uses, then wait briefly for the new session to come back ready. | |
| loop.call_soon_threadsafe(srv._reconnect_event.set) | |
| ready_future = asyncio.run_coroutine_threadsafe(_wait_for_reconnect_ready(), loop) | |
| ready = False | |
| try: | |
| while True: | |
| if _is_user_interrupted(): | |
| ready_future.cancel() | |
| logger.info( | |
| "MCP server '%s': reconnect wait interrupted after " | |
| "session-expired error; aborting retry.", | |
| server_name, | |
| ) | |
| return None | |
| try: | |
| ready = ready_future.result(timeout=0.25) | |
| break | |
| except concurrent.futures.TimeoutError: | |
| continue | |
| except Exception: | |
| ready_future.cancel() | |
| raise |
| try: | ||
| result = retry_call() | ||
| try: | ||
| parsed = json.loads(result) | ||
| if "error" not in parsed: | ||
| _server_error_counts[server_name] = 0 | ||
| return result | ||
| except (json.JSONDecodeError, TypeError): | ||
| _server_error_counts[server_name] = 0 | ||
| return result | ||
| except Exception as retry_exc: | ||
| logger.warning( | ||
| "MCP %s/%s retry after session reconnect failed: %s", | ||
| server_name, op_description, retry_exc, | ||
| ) | ||
| return None |
There was a problem hiding this comment.
If retry_call() raises InterruptedError (user cancel) it will currently be caught by the broad except Exception and swallowed, causing the handler to fall through to a generic error response rather than returning the standard interrupted JSON. To preserve cancellation semantics, handle InterruptedError explicitly here (e.g., re-raise so the caller’s except InterruptedError runs, or return _interrupted_call_result()).
|
Supply-chain Quick version: Was masked pre- |
|
Closing — merged upstream as |
Fixes #13383.
TL;DR
Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path. Every subsequent call on the affected server failed until the gateway was manually restarted.
The existing
_handle_auth_error_and_retryonly fires on 401s, which this class of failure never triggers (token is still valid).Fix: add a sibling
_handle_session_expired_and_retrythat detects the session-expiry error pattern and drives the existing transport-reconnect mechanism (MCPServerTask._reconnect_event), then retries the call once.Root cause
tools/mcp_tool.py, all 5 handler branches follow this pattern:_is_auth_erroronly catches:OAuthFlowError/OAuthTokenError/OAuthNonInteractiveErrorhttpx.HTTPStatusErrorwithstatus_code == 401When the server returns
"Invalid params: Invalid or expired session"as a JSON-RPC error (reporter's exact wpcom-mcp log), it's wrapped in anmcp.McpError. The token is still valid (direct API calls return 200), so 401-based detection never fires.Fix
New narrow detection + reconnect helper, wired into all 5 handlers:
Each handler gets one additional 4-line block:
Behaviour matrix
McpError("Invalid or expired session")McpError("Session expired")RuntimeError("Tool execution failed")InterruptedErrorNarrow scope — explicitly not changed
test_is_session_expired_rejects_unrelated_errors).InterruptedErroris explicitly excluded from session-expiry detection. User-cancel signals short-circuit identically to before (pinned by dedicated test)._reconnect_eventthat the 401 path already drives — no new transport code.Regression coverage
tests/tools/test_mcp_tool_session_expired.py— 16 new test cases:7 unit tests for
_is_session_expired_error:"Invalid params: Invalid or expired session")"Session expired"/"expired session"variants"session not found","unknown session")InterruptedError, rejects empty-message exceptions.5 integration tests for handler plumbing:
_make_tool_handler.Nonewithout event loop, without server record, when retry also fails.4 parametrised tests across all non-
tools/callhandlers (list_resources,read_resource,list_prompts,get_prompt) confirming they share the recovery pattern.15 of 16 tests fail on clean
origin/main(6fb69229) with:The 1 that passes is an xdist-ordering artefact of worker collection.
Validation
Broader MCP suite (5 files):
python -m pytest \ tests/tools/test_mcp_tool.py \ tests/tools/test_mcp_tool_401_handling.py \ tests/tools/test_mcp_tool_session_expired.py \ tests/tools/test_mcp_reconnect_signal.py \ tests/tools/test_mcp_oauth.py -q # 230 passed, 0 regressionsPre-empted review questions
Q. Why substring detection instead of an exception-type check?
MCP SDK exception types (
McpError,RpcError, etc.) vary acrossmcppackage versions and are sometimes wrapped in transport-level exceptions (httpx.ReadError,anyio.EndOfStream) when the server closes the connection. Message-substring matching on a narrow allow-list survives SDK upgrades and covers the same error class across MCP server implementations.Q. Could this cause a retry storm if the server is permanently broken?
No — retry count is exactly 1 per tool call. If the reconnect+retry also fails, the error surfaces and the circuit-breaker (
_server_error_counts) starts counting failures. The model sees an error and stops retrying.Q. Why not extend
_handle_auth_error_and_retryto handle both cases?They have different recovery semantics — the 401 path calls
handle_401which may refresh tokens or prompt for re-auth; the session-expired path skips all of that because the token is valid. Keeping them as sibling functions makes the intent explicit at each handler call site.Q. What if
InterruptedErrorcarries a session-expired message inargs?The
isinstance(exc, InterruptedError)short-circuit returnsFalsebefore any message matching runs. User-cancel signals always route through_interrupted_call_result()first (via the handlers' ownexcept InterruptedErrorbranch) — this check is defense-in-depth for the rare case where the exception chain gets reordered.Q. What happens to in-flight tool calls during the reconnect?
Only the specific tool call that triggered the session-expired error is affected. Other in-flight calls against the same server will also see session-expired on their next operation and trigger the same recovery independently (idempotent:
_reconnect_event.set()on an already-set event is a no-op; the reconnect drains + rebuilds once regardless of how many callers signalled it).Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.