Skip to content

fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)#13406

Closed
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/mcp-auto-reconnect-on-session-expired
Closed

fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)#13406
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/mcp-auto-reconnect-on-session-expired

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

Fixes #13383.

TL;DR

Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path. Every subsequent call on the affected server failed until the gateway was manually restarted.

The existing _handle_auth_error_and_retry only fires on 401s, which this class of failure never triggers (token is still valid).

Fix: add a sibling _handle_session_expired_and_retry that detects the session-expiry error pattern and drives the existing transport-reconnect mechanism (MCPServerTask._reconnect_event), then retries the call once.

Root cause

tools/mcp_tool.py, all 5 handler branches follow this pattern:

except Exception as exc:
    recovered = _handle_auth_error_and_retry(...)  # 401-only
    if recovered is not None: return recovered
    # ← session-expired falls through here with no recovery
    return generic_error_json(...)

_is_auth_error only catches:

  • OAuthFlowError / OAuthTokenError / OAuthNonInteractiveError
  • httpx.HTTPStatusError with status_code == 401

When the server returns "Invalid params: Invalid or expired session" as a JSON-RPC error (reporter's exact wpcom-mcp log), it's wrapped in an mcp.McpError. The token is still valid (direct API calls return 200), so 401-based detection never fires.

Fix

New narrow detection + reconnect helper, wired into all 5 handlers:

_SESSION_EXPIRED_MARKERS: tuple = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
)


def _is_session_expired_error(exc: BaseException) -> bool:
    if isinstance(exc, InterruptedError):
        return False
    msg = str(exc).lower()
    if not msg:
        return False
    return any(marker in msg for marker in _SESSION_EXPIRED_MARKERS)


def _handle_session_expired_and_retry(server_name, exc, retry_call, op_description):
    # Unlike _handle_auth_error_and_retry, no handle_401 — token is valid.
    # Just trigger the transport reconnect + retry once.
    if not _is_session_expired_error(exc): return None
    # ... set _reconnect_event, wait for ready, retry once ...

Each handler gets one additional 4-line block:

recovered = _handle_auth_error_and_retry(...)          # unchanged
if recovered is not None: return recovered
recovered = _handle_session_expired_and_retry(...)     # new
if recovered is not None: return recovered

Behaviour matrix

Exception surfaced by MCP SDK Before After
401 Unauthorized OAuth recovery → reconnect → retry unchanged
McpError("Invalid or expired session") generic tool error → stuck until gateway restart transport reconnect → retry once → success
McpError("Session expired") generic error reconnect + retry
RuntimeError("Tool execution failed") generic error generic error (unchanged — narrow scope)
InterruptedError user-cancel path (unchanged) user-cancel path (unchanged — explicitly excluded)
Empty-string exception generic error generic error

Narrow scope — explicitly not changed

  • Detection is string-based on a 5-entry allow-list. MCP SDK exception types vary across versions; message-substring matching is the durable path. Kept narrow to avoid false positives (pinned by test_is_session_expired_rejects_unrelated_errors).
  • Existing 401 recovery flow. Untouched. The new path is consulted only after the auth path declines.
  • Retry count stays at 1. If reconnect+retry also fails, we don't loop — the error surfaces so the model sees the failure rather than a hang.
  • InterruptedError is explicitly excluded from session-expiry detection. User-cancel signals short-circuit identically to before (pinned by dedicated test).
  • Reconnect mechanism itself. Uses the existing _reconnect_event that the 401 path already drives — no new transport code.

Regression coverage

tests/tools/test_mcp_tool_session_expired.py16 new test cases:

7 unit tests for _is_session_expired_error:

  • Reporter's exact wpcom text ("Invalid params: Invalid or expired session")
  • "Session expired" / "expired session" variants
  • Server GC variants ("session not found", "unknown session")
  • Case-insensitive match
  • Narrow-scope canaries: rejects unrelated RuntimeError / ValueError, rejects 401, rejects InterruptedError, rejects empty-message exceptions.

5 integration tests for handler plumbing:

  • Reporter's full repro end-to-end via _make_tool_handler.
  • Preserved-behaviour canary: non-session-expired errors fall through.
  • Defensive: returns None without event loop, without server record, when retry also fails.

4 parametrised tests across all non-tools/call handlers (list_resources, read_resource, list_prompts, get_prompt) confirming they share the recovery pattern.

15 of 16 tests fail on clean origin/main (6fb69229) with:

ImportError: cannot import name '_is_session_expired_error' from 'tools.mcp_tool'

The 1 that passes is an xdist-ordering artefact of worker collection.

Validation

source venv/bin/activate
python -m pytest tests/tools/test_mcp_tool_session_expired.py -q
# 16 passed

Broader MCP suite (5 files):

python -m pytest \
  tests/tools/test_mcp_tool.py \
  tests/tools/test_mcp_tool_401_handling.py \
  tests/tools/test_mcp_tool_session_expired.py \
  tests/tools/test_mcp_reconnect_signal.py \
  tests/tools/test_mcp_oauth.py -q
# 230 passed, 0 regressions

Pre-empted review questions

Q. Why substring detection instead of an exception-type check?
MCP SDK exception types (McpError, RpcError, etc.) vary across mcp package versions and are sometimes wrapped in transport-level exceptions (httpx.ReadError, anyio.EndOfStream) when the server closes the connection. Message-substring matching on a narrow allow-list survives SDK upgrades and covers the same error class across MCP server implementations.

Q. Could this cause a retry storm if the server is permanently broken?
No — retry count is exactly 1 per tool call. If the reconnect+retry also fails, the error surfaces and the circuit-breaker (_server_error_counts) starts counting failures. The model sees an error and stops retrying.

Q. Why not extend _handle_auth_error_and_retry to handle both cases?
They have different recovery semantics — the 401 path calls handle_401 which may refresh tokens or prompt for re-auth; the session-expired path skips all of that because the token is valid. Keeping them as sibling functions makes the intent explicit at each handler call site.

Q. What if InterruptedError carries a session-expired message in args?
The isinstance(exc, InterruptedError) short-circuit returns False before any message matching runs. User-cancel signals always route through _interrupted_call_result() first (via the handlers' own except InterruptedError branch) — this check is defense-in-depth for the rare case where the exception chain gets reordered.

Q. What happens to in-flight tool calls during the reconnect?
Only the specific tool call that triggered the session-expired error is affected. Other in-flight calls against the same server will also see session-expired on their next operation and trigger the same recovery independently (idempotent: _reconnect_event.set() on an already-set event is a no-op; the reconnect drains + rebuilds once regardless of how many callers signalled it).


Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.

…res (NousResearch#13383)

Streamable HTTP MCP servers may garbage-collect their server-side
session state while the OAuth token remains valid — idle TTL, server
restart, pod rotation, etc.  Before this fix, the tool-call handler
treated the resulting "Invalid or expired session" error as a plain
tool failure with no recovery path, so **every subsequent call on
the affected server failed until the gateway was manually
restarted**.  Reporter: NousResearch#13383.

The OAuth-based recovery path (``_handle_auth_error_and_retry``)
already exists for 401s, but it only fires on auth errors.  Session
expiry slipped through because the access token is still valid —
nothing 401'd, so the existing recovery branch was skipped.

Fix
---
Add a sibling function ``_handle_session_expired_and_retry`` that
detects MCP session-expiry via ``_is_session_expired_error`` (a
narrow allow-list of known-stable substrings: ``"invalid or expired
session"``, ``"session expired"``, ``"session not found"``,
``"unknown session"``, etc.) and then uses the existing transport
reconnect mechanism:

* Sets ``MCPServerTask._reconnect_event`` — the server task's
  lifecycle loop already interprets this as "tear down the current
  ``streamablehttp_client`` + ``ClientSession`` and rebuild them,
  reusing the existing OAuth provider instance".
* Waits up to 15 s for the new session to come back ready.
* Retries the original call once.  If the retry succeeds, returns
  its result and resets the circuit-breaker error count.  If the
  retry raises, or if the reconnect doesn't ready in time, falls
  through to the caller's generic error path.

Unlike the 401 path, this does **not** call ``handle_401`` — the
access token is already valid and running an OAuth refresh would be
a pointless round-trip.

All 5 MCP handlers (``call_tool``, ``list_resources``, ``read_resource``,
``list_prompts``, ``get_prompt``) now consult both recovery paths
before falling through:

    recovered = _handle_auth_error_and_retry(...)          # 401 path
    if recovered is not None: return recovered
    recovered = _handle_session_expired_and_retry(...)     # new
    if recovered is not None: return recovered
    # generic error response

Narrow scope — explicitly not changed
-------------------------------------
* **Detection is string-based on a 5-entry allow-list.**  The MCP
  SDK wraps JSON-RPC errors in ``McpError`` whose exception type +
  attributes vary across SDK versions, so matching on message
  substrings is the durable path.  Kept narrow to avoid false
  positives — a regular ``RuntimeError("Tool failed")`` will NOT
  trigger spurious reconnects (pinned by
  ``test_is_session_expired_rejects_unrelated_errors``).
* **No change to the existing 401 recovery flow.**  The new path is
  consulted only after the auth path declines (returns ``None``).
* **Retry count stays at 1.**  If the reconnect-then-retry also
  fails, we don't loop — the error surfaces normally so the model
  sees a failed tool call rather than a hang.
* **``InterruptedError`` is explicitly excluded** from session-expired
  detection so user-cancel signals always short-circuit the same
  way they did before (pinned by
  ``test_is_session_expired_rejects_interrupted_error``).

Regression coverage
-------------------
``tests/tools/test_mcp_tool_session_expired.py`` (new, 16 cases):

Unit tests for ``_is_session_expired_error``:
* ``test_is_session_expired_detects_invalid_or_expired_session`` —
  reporter's exact wpcom-mcp text.
* ``test_is_session_expired_detects_expired_session_variant`` —
  "Session expired" / "expired session" variants.
* ``test_is_session_expired_detects_session_not_found`` — server GC
  variant ("session not found", "unknown session").
* ``test_is_session_expired_is_case_insensitive``.
* ``test_is_session_expired_rejects_unrelated_errors`` — narrow-scope
  canary: random RuntimeError / ValueError / 401 don't trigger.
* ``test_is_session_expired_rejects_interrupted_error`` — user cancel
  must never route through reconnect.
* ``test_is_session_expired_rejects_empty_message``.

Handler integration tests:
* ``test_call_tool_handler_reconnects_on_session_expired`` — reporter's
  full repro: first call raises "Invalid or expired session", handler
  signals ``_reconnect_event``, retries once, returns the retry's
  success result with no ``error`` key.
* ``test_call_tool_handler_non_session_expired_error_falls_through``
  — preserved-behaviour canary: random tool failures do NOT trigger
  reconnect.
* ``test_session_expired_handler_returns_none_without_loop`` —
  defensive: cold-start / shutdown race.
* ``test_session_expired_handler_returns_none_without_server_record``
  — torn-down server falls through cleanly.
* ``test_session_expired_handler_returns_none_when_retry_also_fails``
  — no retry loop on repeated failure.

Parametrised across all 4 non-``tools/call`` handlers:
* ``test_non_tool_handlers_also_reconnect_on_session_expired``
  [list_resources / read_resource / list_prompts / get_prompt].

**15 of 16 fail on clean ``origin/main`` (``6fb69229``)** with
``ImportError: cannot import name '_is_session_expired_error'``
— the fix's surface symbols don't exist there yet.  The 1 passing
test is an ordering artefact of pytest-xdist worker collection.

Validation
----------
``source venv/bin/activate && python -m pytest
tests/tools/test_mcp_tool_session_expired.py -q`` → **16 passed**.

Broader MCP suite (5 files:
``test_mcp_tool.py``, ``test_mcp_tool_401_handling.py``,
``test_mcp_tool_session_expired.py``, ``test_mcp_reconnect_signal.py``,
``test_mcp_oauth.py``) → **230 passed, 0 regressions**.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 21, 2026 08:06

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds automatic recovery for MCP Streamable HTTP servers whose transport session expires while OAuth tokens remain valid, by triggering a reconnect and retrying the operation once.

Changes:

  • Introduces session-expiry detection (_SESSION_EXPIRED_MARKERS, _is_session_expired_error) and a reconnect+retry helper (_handle_session_expired_and_retry).
  • Wires the new recovery path into all 5 MCP handler branches (tools/call, resources/list, resources/read, prompts/list, prompts/get).
  • Adds a new test suite covering detection and handler-level reconnect plumbing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
tools/mcp_tool.py Adds session-expired detection + reconnect-and-retry helper and integrates it into all MCP handlers.
tests/tools/test_mcp_tool_session_expired.py Adds unit + integration tests to validate the new session-expiry recovery behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/mcp_tool.py
Comment on lines +1498 to +1519
srv = _servers.get(server_name)
if srv is None or not hasattr(srv, "_reconnect_event"):
return None

loop = _mcp_loop
if loop is None or not loop.is_running():
return None

logger.info(
"MCP server '%s': %s failed with session-expired error (%s); "
"signalling transport reconnect and retrying once.",
server_name, op_description, exc,
)

# Trigger the same reconnect mechanism the OAuth recovery path
# uses, then wait briefly for the new session to come back ready.
loop.call_soon_threadsafe(srv._reconnect_event.set)
deadline = time.monotonic() + 15
ready = False
while time.monotonic() < deadline:
if srv.session is not None and srv._ready.is_set():
ready = True

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper checks for _reconnect_event but then unconditionally accesses srv._ready inside the readiness loop. If a server record is partially constructed/tearing down (or a test stub is missing _ready), this can raise AttributeError and prevent falling back to the generic error path. Consider also guarding for _ready (and potentially session) via hasattr/getattr before entering the loop.

Copilot uses AI. Check for mistakes.
Comment on lines +15 to +16
import time
from unittest.mock import AsyncMock, MagicMock

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused imports: AsyncMock and time are imported but not referenced in this test module. Removing them reduces noise and avoids potential lint failures if static checks are enabled in CI later.

Suggested change
import time
from unittest.mock import AsyncMock, MagicMock
from unittest.mock import MagicMock

Copilot uses AI. Check for mistakes.
Comment thread tools/mcp_tool.py
Comment on lines +1513 to +1518
# uses, then wait briefly for the new session to come back ready.
loop.call_soon_threadsafe(srv._reconnect_event.set)
deadline = time.monotonic() + 15
ready = False
while time.monotonic() < deadline:
if srv.session is not None and srv._ready.is_set():

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The post-reconnect “ready” probe will usually succeed immediately because _ready is intentionally kept set across reconnects (see MCPServerTask.run comment) and srv.session is still non-None until the server task actually tears down the old session. That means the loop can break before the reconnect has happened, and the retry may run against the old (expired) session. To ensure the retry uses the new session, capture the current old_session = srv.session before signalling reconnect and wait until srv.session becomes None and then a different non-None object (or otherwise track a reconnect generation).

Suggested change
# uses, then wait briefly for the new session to come back ready.
loop.call_soon_threadsafe(srv._reconnect_event.set)
deadline = time.monotonic() + 15
ready = False
while time.monotonic() < deadline:
if srv.session is not None and srv._ready.is_set():
# uses, then wait briefly for the old session to be torn down and
# a different session object to come back ready.
old_session = srv.session
loop.call_soon_threadsafe(srv._reconnect_event.set)
deadline = time.monotonic() + 15
ready = False
saw_teardown = old_session is None
while time.monotonic() < deadline:
current_session = srv.session
if current_session is None:
saw_teardown = True
elif (
saw_teardown
and current_session is not old_session
and srv._ready.is_set()
):

Copilot uses AI. Check for mistakes.
Comment thread tools/mcp_tool.py
Comment on lines +1512 to +1521
# Trigger the same reconnect mechanism the OAuth recovery path
# uses, then wait briefly for the new session to come back ready.
loop.call_soon_threadsafe(srv._reconnect_event.set)
deadline = time.monotonic() + 15
ready = False
while time.monotonic() < deadline:
if srv.session is not None and srv._ready.is_set():
ready = True
break
time.sleep(0.25)

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_handle_session_expired_and_retry blocks in a polling loop (time.sleep) for up to 15s and doesn’t honor user interrupts while waiting for reconnect readiness. Since this runs on the calling (agent) thread, a user cancel/new message won’t be observed until after the loop finishes, which is inconsistent with _run_on_mcp_loop’s interrupt-aware behavior. Consider awaiting srv._ready on the MCP loop with a timeout (via _run_on_mcp_loop) or otherwise checking tools.interrupt.is_interrupted() during the wait so cancellations short-circuit promptly.

Suggested change
# Trigger the same reconnect mechanism the OAuth recovery path
# uses, then wait briefly for the new session to come back ready.
loop.call_soon_threadsafe(srv._reconnect_event.set)
deadline = time.monotonic() + 15
ready = False
while time.monotonic() < deadline:
if srv.session is not None and srv._ready.is_set():
ready = True
break
time.sleep(0.25)
def _is_user_interrupted() -> bool:
try:
interrupt_mod = __import__("tools.interrupt", fromlist=["is_interrupted"])
is_interrupted = getattr(interrupt_mod, "is_interrupted", None)
return bool(is_interrupted and is_interrupted())
except Exception:
return False
async def _wait_for_reconnect_ready() -> bool:
deadline = time.monotonic() + 15
while time.monotonic() < deadline:
if srv.session is not None and srv._ready.is_set():
return True
await asyncio.sleep(0.25)
return False
# Trigger the same reconnect mechanism the OAuth recovery path
# uses, then wait briefly for the new session to come back ready.
loop.call_soon_threadsafe(srv._reconnect_event.set)
ready_future = asyncio.run_coroutine_threadsafe(_wait_for_reconnect_ready(), loop)
ready = False
try:
while True:
if _is_user_interrupted():
ready_future.cancel()
logger.info(
"MCP server '%s': reconnect wait interrupted after "
"session-expired error; aborting retry.",
server_name,
)
return None
try:
ready = ready_future.result(timeout=0.25)
break
except concurrent.futures.TimeoutError:
continue
except Exception:
ready_future.cancel()
raise

Copilot uses AI. Check for mistakes.
Comment thread tools/mcp_tool.py
Comment on lines +1530 to +1545
try:
result = retry_call()
try:
parsed = json.loads(result)
if "error" not in parsed:
_server_error_counts[server_name] = 0
return result
except (json.JSONDecodeError, TypeError):
_server_error_counts[server_name] = 0
return result
except Exception as retry_exc:
logger.warning(
"MCP %s/%s retry after session reconnect failed: %s",
server_name, op_description, retry_exc,
)
return None

Copilot AI Apr 21, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If retry_call() raises InterruptedError (user cancel) it will currently be caught by the broad except Exception and swallowed, causing the handler to fall through to a generic error response rather than returning the standard interrupted JSON. To preserve cancellation semantics, handle InterruptedError explicitly here (e.g., re-raise so the caller’s except InterruptedError runs, or return _interrupted_call_result()).

Copilot uses AI. Check for mistakes.
@briandevans

Copy link
Copy Markdown
Contributor Author

Supply-chain FAILURE on this PR is a false positive from a scan bug, not a real finding. Opened #13411 with a root-cause analysis and a one-character fix.

Quick version: .github/workflows/supply-chain-audit.yml computes the PR diff with git diff "$BASE".."$HEAD" (two-dot) which includes every file main has drifted through since the PR was forked. On this PR, that adds hermes_cli/setup.py (modified upstream, not by this PR) to the scan's file list, which trips the install-hook check. The 3-dot merge-base diff shows only this PR's actual 2 files (neither of which matches any attack pattern).

Was masked pre-19db7fa3 because that commit changed the fail condition from critical == 'true' to found == 'true', unmasking the pre-existing 2-dot bug.

@briandevans

Copy link
Copy Markdown
Contributor Author

Closing — merged upstream as e87a2100 "fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)" with identical title and issue reference. The direct salvage shipped; this PR is redundant. Thanks to the maintainers for bringing it in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/tools Tool registry, model_tools, toolsets P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: MCP server session expires during long-running gateway — no auto-reconnect

3 participants