fix(mcp): auto-reconnect + retry once when the transport session expires (#13383) by briandevans · Pull Request #13406 · NousResearch/hermes-agent

briandevans · 2026-04-21T08:06:35Z

TL;DR

Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path. Every subsequent call on the affected server failed until the gateway was manually restarted.

The existing _handle_auth_error_and_retry only fires on 401s, which this class of failure never triggers (token is still valid).

Fix: add a sibling _handle_session_expired_and_retry that detects the session-expiry error pattern and drives the existing transport-reconnect mechanism (MCPServerTask._reconnect_event), then retries the call once.

Root cause

tools/mcp_tool.py, all 5 handler branches follow this pattern:

except Exception as exc:
    recovered = _handle_auth_error_and_retry(...)  # 401-only
    if recovered is not None: return recovered
    # ← session-expired falls through here with no recovery
    return generic_error_json(...)

_is_auth_error only catches:

OAuthFlowError / OAuthTokenError / OAuthNonInteractiveError
httpx.HTTPStatusError with status_code == 401

When the server returns "Invalid params: Invalid or expired session" as a JSON-RPC error (reporter's exact wpcom-mcp log), it's wrapped in an mcp.McpError. The token is still valid (direct API calls return 200), so 401-based detection never fires.

Fix

New narrow detection + reconnect helper, wired into all 5 handlers:

_SESSION_EXPIRED_MARKERS: tuple = (
    "invalid or expired session",
    "expired session",
    "session expired",
    "session not found",
    "unknown session",
)


def _is_session_expired_error(exc: BaseException) -> bool:
    if isinstance(exc, InterruptedError):
        return False
    msg = str(exc).lower()
    if not msg:
        return False
    return any(marker in msg for marker in _SESSION_EXPIRED_MARKERS)


def _handle_session_expired_and_retry(server_name, exc, retry_call, op_description):
    # Unlike _handle_auth_error_and_retry, no handle_401 — token is valid.
    # Just trigger the transport reconnect + retry once.
    if not _is_session_expired_error(exc): return None
    # ... set _reconnect_event, wait for ready, retry once ...

Each handler gets one additional 4-line block:

recovered = _handle_auth_error_and_retry(...)          # unchanged
if recovered is not None: return recovered
recovered = _handle_session_expired_and_retry(...)     # new
if recovered is not None: return recovered

Behaviour matrix

Exception surfaced by MCP SDK	Before	After
401 Unauthorized	OAuth recovery → reconnect → retry	unchanged
`McpError("Invalid or expired session")`	generic tool error → stuck until gateway restart	transport reconnect → retry once → success
`McpError("Session expired")`	generic error	reconnect + retry
`RuntimeError("Tool execution failed")`	generic error	generic error (unchanged — narrow scope)
`InterruptedError`	user-cancel path (unchanged)	user-cancel path (unchanged — explicitly excluded)
Empty-string exception	generic error	generic error

Narrow scope — explicitly not changed

Detection is string-based on a 5-entry allow-list. MCP SDK exception types vary across versions; message-substring matching is the durable path. Kept narrow to avoid false positives (pinned by test_is_session_expired_rejects_unrelated_errors).
Existing 401 recovery flow. Untouched. The new path is consulted only after the auth path declines.
Retry count stays at 1. If reconnect+retry also fails, we don't loop — the error surfaces so the model sees the failure rather than a hang.
InterruptedError is explicitly excluded from session-expiry detection. User-cancel signals short-circuit identically to before (pinned by dedicated test).
Reconnect mechanism itself. Uses the existing _reconnect_event that the 401 path already drives — no new transport code.

Regression coverage

tests/tools/test_mcp_tool_session_expired.py — 16 new test cases:

7 unit tests for _is_session_expired_error:

Reporter's exact wpcom text ("Invalid params: Invalid or expired session")
"Session expired" / "expired session" variants
Server GC variants ("session not found", "unknown session")
Case-insensitive match
Narrow-scope canaries: rejects unrelated RuntimeError / ValueError, rejects 401, rejects InterruptedError, rejects empty-message exceptions.

5 integration tests for handler plumbing:

Reporter's full repro end-to-end via _make_tool_handler.
Preserved-behaviour canary: non-session-expired errors fall through.
Defensive: returns None without event loop, without server record, when retry also fails.

4 parametrised tests across all non-tools/call handlers (list_resources, read_resource, list_prompts, get_prompt) confirming they share the recovery pattern.

15 of 16 tests fail on clean origin/main (6fb69229) with:

ImportError: cannot import name '_is_session_expired_error' from 'tools.mcp_tool'

The 1 that passes is an xdist-ordering artefact of worker collection.

Validation

source venv/bin/activate
python -m pytest tests/tools/test_mcp_tool_session_expired.py -q
# 16 passed

Broader MCP suite (5 files):

python -m pytest \
  tests/tools/test_mcp_tool.py \
  tests/tools/test_mcp_tool_401_handling.py \
  tests/tools/test_mcp_tool_session_expired.py \
  tests/tools/test_mcp_reconnect_signal.py \
  tests/tools/test_mcp_oauth.py -q
# 230 passed, 0 regressions

Pre-empted review questions

Q. Why substring detection instead of an exception-type check?
MCP SDK exception types (McpError, RpcError, etc.) vary across mcp package versions and are sometimes wrapped in transport-level exceptions (httpx.ReadError, anyio.EndOfStream) when the server closes the connection. Message-substring matching on a narrow allow-list survives SDK upgrades and covers the same error class across MCP server implementations.

Q. Could this cause a retry storm if the server is permanently broken?
No — retry count is exactly 1 per tool call. If the reconnect+retry also fails, the error surfaces and the circuit-breaker (_server_error_counts) starts counting failures. The model sees an error and stops retrying.

Q. Why not extend _handle_auth_error_and_retry to handle both cases?
They have different recovery semantics — the 401 path calls handle_401 which may refresh tokens or prompt for re-auth; the session-expired path skips all of that because the token is valid. Keeping them as sibling functions makes the intent explicit at each handler call site.

Q. What if InterruptedError carries a session-expired message in args?
The isinstance(exc, InterruptedError) short-circuit returns False before any message matching runs. User-cancel signals always route through _interrupted_call_result() first (via the handlers' own except InterruptedError branch) — this check is defense-in-depth for the rare case where the exception chain gets reordered.

Q. What happens to in-flight tool calls during the reconnect?
Only the specific tool call that triggered the session-expired error is affected. Other in-flight calls against the same server will also see session-expired on their next operation and trigger the same recovery independently (idempotent: _reconnect_event.set() on an already-set event is a no-op; the reconnect drains + rebuilds once regardless of how many callers signalled it).

_{Co-authored via LLM assistance; I've reviewed every line and am responsible for correctness.}

…res (NousResearch#13383) Streamable HTTP MCP servers may garbage-collect their server-side session state while the OAuth token remains valid — idle TTL, server restart, pod rotation, etc. Before this fix, the tool-call handler treated the resulting "Invalid or expired session" error as a plain tool failure with no recovery path, so **every subsequent call on the affected server failed until the gateway was manually restarted**. Reporter: NousResearch#13383. The OAuth-based recovery path (``_handle_auth_error_and_retry``) already exists for 401s, but it only fires on auth errors. Session expiry slipped through because the access token is still valid — nothing 401'd, so the existing recovery branch was skipped. Fix --- Add a sibling function ``_handle_session_expired_and_retry`` that detects MCP session-expiry via ``_is_session_expired_error`` (a narrow allow-list of known-stable substrings: ``"invalid or expired session"``, ``"session expired"``, ``"session not found"``, ``"unknown session"``, etc.) and then uses the existing transport reconnect mechanism: * Sets ``MCPServerTask._reconnect_event`` — the server task's lifecycle loop already interprets this as "tear down the current ``streamablehttp_client`` + ``ClientSession`` and rebuild them, reusing the existing OAuth provider instance". * Waits up to 15 s for the new session to come back ready. * Retries the original call once. If the retry succeeds, returns its result and resets the circuit-breaker error count. If the retry raises, or if the reconnect doesn't ready in time, falls through to the caller's generic error path. Unlike the 401 path, this does **not** call ``handle_401`` — the access token is already valid and running an OAuth refresh would be a pointless round-trip. All 5 MCP handlers (``call_tool``, ``list_resources``, ``read_resource``, ``list_prompts``, ``get_prompt``) now consult both recovery paths before falling through: recovered = _handle_auth_error_and_retry(...) # 401 path if recovered is not None: return recovered recovered = _handle_session_expired_and_retry(...) # new if recovered is not None: return recovered # generic error response Narrow scope — explicitly not changed ------------------------------------- * **Detection is string-based on a 5-entry allow-list.** The MCP SDK wraps JSON-RPC errors in ``McpError`` whose exception type + attributes vary across SDK versions, so matching on message substrings is the durable path. Kept narrow to avoid false positives — a regular ``RuntimeError("Tool failed")`` will NOT trigger spurious reconnects (pinned by ``test_is_session_expired_rejects_unrelated_errors``). * **No change to the existing 401 recovery flow.** The new path is consulted only after the auth path declines (returns ``None``). * **Retry count stays at 1.** If the reconnect-then-retry also fails, we don't loop — the error surfaces normally so the model sees a failed tool call rather than a hang. * **``InterruptedError`` is explicitly excluded** from session-expired detection so user-cancel signals always short-circuit the same way they did before (pinned by ``test_is_session_expired_rejects_interrupted_error``). Regression coverage ------------------- ``tests/tools/test_mcp_tool_session_expired.py`` (new, 16 cases): Unit tests for ``_is_session_expired_error``: * ``test_is_session_expired_detects_invalid_or_expired_session`` — reporter's exact wpcom-mcp text. * ``test_is_session_expired_detects_expired_session_variant`` — "Session expired" / "expired session" variants. * ``test_is_session_expired_detects_session_not_found`` — server GC variant ("session not found", "unknown session"). * ``test_is_session_expired_is_case_insensitive``. * ``test_is_session_expired_rejects_unrelated_errors`` — narrow-scope canary: random RuntimeError / ValueError / 401 don't trigger. * ``test_is_session_expired_rejects_interrupted_error`` — user cancel must never route through reconnect. * ``test_is_session_expired_rejects_empty_message``. Handler integration tests: * ``test_call_tool_handler_reconnects_on_session_expired`` — reporter's full repro: first call raises "Invalid or expired session", handler signals ``_reconnect_event``, retries once, returns the retry's success result with no ``error`` key. * ``test_call_tool_handler_non_session_expired_error_falls_through`` — preserved-behaviour canary: random tool failures do NOT trigger reconnect. * ``test_session_expired_handler_returns_none_without_loop`` — defensive: cold-start / shutdown race. * ``test_session_expired_handler_returns_none_without_server_record`` — torn-down server falls through cleanly. * ``test_session_expired_handler_returns_none_when_retry_also_fails`` — no retry loop on repeated failure. Parametrised across all 4 non-``tools/call`` handlers: * ``test_non_tool_handlers_also_reconnect_on_session_expired`` [list_resources / read_resource / list_prompts / get_prompt]. **15 of 16 fail on clean ``origin/main`` (``6fb69229``)** with ``ImportError: cannot import name '_is_session_expired_error'`` — the fix's surface symbols don't exist there yet. The 1 passing test is an ordering artefact of pytest-xdist worker collection. Validation ---------- ``source venv/bin/activate && python -m pytest tests/tools/test_mcp_tool_session_expired.py -q`` → **16 passed**. Broader MCP suite (5 files: ``test_mcp_tool.py``, ``test_mcp_tool_401_handling.py``, ``test_mcp_tool_session_expired.py``, ``test_mcp_reconnect_signal.py``, ``test_mcp_oauth.py``) → **230 passed, 0 regressions**. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds automatic recovery for MCP Streamable HTTP servers whose transport session expires while OAuth tokens remain valid, by triggering a reconnect and retrying the operation once.

Changes:

Introduces session-expiry detection (_SESSION_EXPIRED_MARKERS, _is_session_expired_error) and a reconnect+retry helper (_handle_session_expired_and_retry).
Wires the new recovery path into all 5 MCP handler branches (tools/call, resources/list, resources/read, prompts/list, prompts/get).
Adds a new test suite covering detection and handler-level reconnect plumbing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
tools/mcp_tool.py	Adds session-expired detection + reconnect-and-retry helper and integrates it into all MCP handlers.
tests/tools/test_mcp_tool_session_expired.py	Adds unit + integration tests to validate the new session-expiry recovery behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-21T08:11:26Z

+        srv = _servers.get(server_name)
+    if srv is None or not hasattr(srv, "_reconnect_event"):
+        return None
+
+    loop = _mcp_loop
+    if loop is None or not loop.is_running():
+        return None
+
+    logger.info(
+        "MCP server '%s': %s failed with session-expired error (%s); "
+        "signalling transport reconnect and retrying once.",
+        server_name, op_description, exc,
+    )
+
+    # Trigger the same reconnect mechanism the OAuth recovery path
+    # uses, then wait briefly for the new session to come back ready.
+    loop.call_soon_threadsafe(srv._reconnect_event.set)
+    deadline = time.monotonic() + 15
+    ready = False
+    while time.monotonic() < deadline:
+        if srv.session is not None and srv._ready.is_set():
+            ready = True


This helper checks for _reconnect_event but then unconditionally accesses srv._ready inside the readiness loop. If a server record is partially constructed/tearing down (or a test stub is missing _ready), this can raise AttributeError and prevent falling back to the generic error path. Consider also guarding for _ready (and potentially session) via hasattr/getattr before entering the loop.

Copilot · 2026-04-21T08:11:27Z

+import time
+from unittest.mock import AsyncMock, MagicMock


Unused imports: AsyncMock and time are imported but not referenced in this test module. Removing them reduces noise and avoids potential lint failures if static checks are enabled in CI later.

Suggested change

import time

from unittest.mock import AsyncMock, MagicMock

from unittest.mock import MagicMock

Copilot · 2026-04-21T08:11:27Z

+    # uses, then wait briefly for the new session to come back ready.
+    loop.call_soon_threadsafe(srv._reconnect_event.set)
+    deadline = time.monotonic() + 15
+    ready = False
+    while time.monotonic() < deadline:
+        if srv.session is not None and srv._ready.is_set():


The post-reconnect “ready” probe will usually succeed immediately because _ready is intentionally kept set across reconnects (see MCPServerTask.run comment) and srv.session is still non-None until the server task actually tears down the old session. That means the loop can break before the reconnect has happened, and the retry may run against the old (expired) session. To ensure the retry uses the new session, capture the current old_session = srv.session before signalling reconnect and wait until srv.session becomes None and then a different non-None object (or otherwise track a reconnect generation).

Suggested change

# uses, then wait briefly for the new session to come back ready.

loop.call_soon_threadsafe(srv._reconnect_event.set)

deadline = time.monotonic() + 15

ready = False

while time.monotonic() < deadline:

if srv.session is not None and srv._ready.is_set():

# uses, then wait briefly for the old session to be torn down and

# a different session object to come back ready.

old_session = srv.session

loop.call_soon_threadsafe(srv._reconnect_event.set)

deadline = time.monotonic() + 15

ready = False

saw_teardown = old_session is None

while time.monotonic() < deadline:

current_session = srv.session

if current_session is None:

saw_teardown = True

elif (

saw_teardown

and current_session is not old_session

and srv._ready.is_set()

):

Copilot · 2026-04-21T08:11:27Z

+    # Trigger the same reconnect mechanism the OAuth recovery path
+    # uses, then wait briefly for the new session to come back ready.
+    loop.call_soon_threadsafe(srv._reconnect_event.set)
+    deadline = time.monotonic() + 15
+    ready = False
+    while time.monotonic() < deadline:
+        if srv.session is not None and srv._ready.is_set():
+            ready = True
+            break
+        time.sleep(0.25)


_handle_session_expired_and_retry blocks in a polling loop (time.sleep) for up to 15s and doesn’t honor user interrupts while waiting for reconnect readiness. Since this runs on the calling (agent) thread, a user cancel/new message won’t be observed until after the loop finishes, which is inconsistent with _run_on_mcp_loop’s interrupt-aware behavior. Consider awaiting srv._ready on the MCP loop with a timeout (via _run_on_mcp_loop) or otherwise checking tools.interrupt.is_interrupted() during the wait so cancellations short-circuit promptly.

Suggested change

# Trigger the same reconnect mechanism the OAuth recovery path

# uses, then wait briefly for the new session to come back ready.

loop.call_soon_threadsafe(srv._reconnect_event.set)

deadline = time.monotonic() + 15

ready = False

while time.monotonic() < deadline:

if srv.session is not None and srv._ready.is_set():

ready = True

break

time.sleep(0.25)

def _is_user_interrupted() -> bool:

try:

interrupt_mod = __import__("tools.interrupt", fromlist=["is_interrupted"])

is_interrupted = getattr(interrupt_mod, "is_interrupted", None)

return bool(is_interrupted and is_interrupted())

except Exception:

return False

async def _wait_for_reconnect_ready() -> bool:

deadline = time.monotonic() + 15

while time.monotonic() < deadline:

if srv.session is not None and srv._ready.is_set():

return True

await asyncio.sleep(0.25)

return False

# Trigger the same reconnect mechanism the OAuth recovery path

# uses, then wait briefly for the new session to come back ready.

loop.call_soon_threadsafe(srv._reconnect_event.set)

ready_future = asyncio.run_coroutine_threadsafe(_wait_for_reconnect_ready(), loop)

ready = False

try:

while True:

if _is_user_interrupted():

ready_future.cancel()

logger.info(

"MCP server '%s': reconnect wait interrupted after "

"session-expired error; aborting retry.",

server_name,

)

return None

try:

ready = ready_future.result(timeout=0.25)

break

except concurrent.futures.TimeoutError:

continue

except Exception:

ready_future.cancel()

raise

Copilot · 2026-04-21T08:11:28Z

+    try:
+        result = retry_call()
+        try:
+            parsed = json.loads(result)
+            if "error" not in parsed:
+                _server_error_counts[server_name] = 0
+                return result
+        except (json.JSONDecodeError, TypeError):
+            _server_error_counts[server_name] = 0
+            return result
+    except Exception as retry_exc:
+        logger.warning(
+            "MCP %s/%s retry after session reconnect failed: %s",
+            server_name, op_description, retry_exc,
+        )
+    return None


If retry_call() raises InterruptedError (user cancel) it will currently be caught by the broad except Exception and swallowed, causing the handler to fall through to a generic error response rather than returning the standard interrupted JSON. To preserve cancellation semantics, handle InterruptedError explicitly here (e.g., re-raise so the caller’s except InterruptedError runs, or return _interrupted_call_result()).

briandevans · 2026-04-21T08:21:12Z

Supply-chain FAILURE on this PR is a false positive from a scan bug, not a real finding. Opened #13411 with a root-cause analysis and a one-character fix.

Quick version: .github/workflows/supply-chain-audit.yml computes the PR diff with git diff "$BASE".."$HEAD" (two-dot) which includes every file main has drifted through since the PR was forked. On this PR, that adds hermes_cli/setup.py (modified upstream, not by this PR) to the scan's file list, which trips the install-hook check. The 3-dot merge-base diff shows only this PR's actual 2 files (neither of which matches any attack pattern).

Was masked pre-19db7fa3 because that commit changed the fail condition from critical == 'true' to found == 'true', unmasking the pre-existing 2-dot bug.

briandevans · 2026-04-24T13:08:21Z

Closing — merged upstream as e87a2100 "fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)" with identical title and issue reference. The direct salvage shipped; this PR is redundant. Thanks to the maintainers for bringing it in.

Copilot AI review requested due to automatic review settings April 21, 2026 08:06

Copilot started reviewing on behalf of briandevans April 21, 2026 08:07 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

briandevans mentioned this pull request Apr 21, 2026

ci(supply-chain): use merge-base diff (A...B) so upstream drift doesn't false-positive #13411

Closed

19 tasks

alt-glitch mentioned this pull request Apr 22, 2026

fix(mcp): reconnect on session identity and transport errors #13795

Open

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth comp/tools Tool registry, model_tools, toolsets labels Apr 22, 2026

alt-glitch mentioned this pull request Apr 22, 2026

fix(mcp): auto-reconnect on expired session during tool call #13402

Closed

teknium1 mentioned this pull request Apr 24, 2026

fix(mcp-oauth): Pydantic serialization + port hardening + cross-origin security + session auto-reconnect #15125

Merged

briandevans closed this Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)#13406

fix(mcp): auto-reconnect + retry once when the transport session expires (#13383)#13406
briandevans wants to merge 1 commit into
NousResearch:mainfrom
briandevans:fix/mcp-auto-reconnect-on-session-expired

briandevans commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

Copilot AI Apr 21, 2026

Uh oh!

briandevans commented Apr 21, 2026

Uh oh!

briandevans commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	import time
	from unittest.mock import AsyncMock, MagicMock
	from unittest.mock import MagicMock

Conversation

briandevans commented Apr 21, 2026

TL;DR

Root cause

Fix

Behaviour matrix

Narrow scope — explicitly not changed

Regression coverage

Validation

Pre-empted review questions

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

Uh oh!

briandevans commented Apr 21, 2026

Uh oh!

briandevans commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants