Skip to content

fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect#32111

Open
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/mcp-anyio-cleanup-31987
Open

fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect#32111
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/mcp-anyio-cleanup-31987

Conversation

@briandevans

Copy link
Copy Markdown
Contributor

What does this PR do?

When a Streamable HTTP MCP server tears down its session for a planned
reconnect (e.g. after the keepalive timer triggers, default 180s),
async with streamable_http_client(...) unwinds and anyio's TaskGroup
cleanup can raise RuntimeError("The current task is not holding this lock") from a Lock that was acquired by a different coroutine inside
the TaskGroup. The session shut down cleanly, but the cleanup raises.

Previously the reconnect loop's except Exception treated this as a
real connection failure and burned a slot in the
_MAX_RECONNECT_RETRIES budget. Every keepalive-driven reconnect cost
one retry attempt, and after ~5 cleanup events the server was
permanently disconnected for the rest of the gateway lifetime.

Fix: add _ANYIO_CLEANUP_MARKERS + _is_anyio_cleanup_artifact() that
walk BaseExceptionGroup chains (the SDK wraps the inner
RuntimeError in an ExceptionGroup("unhandled errors in a TaskGroup", ...)), and in the retry loop skip the retry-counter
increment when the artifact fires after the server was already ready.
The helper mirrors the existing _SESSION_EXPIRED_MARKERS design.

Related Issue

Fixes #31987

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • tools/mcp_tool.py — add _ANYIO_CLEANUP_MARKERS + _is_anyio_cleanup_artifact() (walks BaseExceptionGroup), and short-circuit the reconnect retry counter when the artifact fires after _ready is set.
  • tests/tools/test_mcp_anyio_cleanup_artifact.py — six regression tests: bare RuntimeError match, ExceptionGroup unwrap (one + nested levels), rejection of unrelated RuntimeError/ConnectionError, the full reconnect-loop scenario showing >_MAX_RECONNECT_RETRIES cleanup events survive, and a control test confirming real ConnectionError failures still exhaust the budget at _MAX_RECONNECT_RETRIES + 1 attempts.

How to Test

  1. uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/tools/test_mcp_anyio_cleanup_artifact.py -v — 6 passed
  2. uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/tools/test_mcp_reconnect_signal.py tests/tools/test_mcp_stability.py tests/tools/test_mcp_tool_session_expired.py tests/tools/test_mcp_cancelled_error_propagation.py tests/tools/test_mcp_circuit_breaker.py tests/tools/test_mcp_tool.py -v — 239 passed, no regressions
  3. Regression guard: with tools/mcp_tool.py reverted to origin/main, test_reconnect_loop_ignores_anyio_cleanup_artifact fails with assert 6 > 6 (server gives up after 5 cleanup events) — confirms the test exercises the fix.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run focused tests for the touched code and all pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 15.x

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — the anyio/asyncio interaction is OS-independent
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

Reproduces the user-reported log from #31987:

keepalive failed, triggering reconnect
reconnect requested — tearing down HTTP session
connection lost (attempt 1/5): unhandled errors in a TaskGroup
...
failed after 5 reconnection attempts, giving up

With the fix, those RuntimeError("not holding this lock") cleanup
artifacts drop to DEBUG-level and don't consume the retry budget.

Related / Positioning

  • daimon-nous (issue commenter) suggested extending _SESSION_EXPIRED_MARKERS to cover this RuntimeError. I chose a separate _ANYIO_CLEANUP_MARKERS tuple + dedicated helper because (a) _is_session_expired_error is only consulted by tool-call handlers (_handle_session_expired_and_retry at lines 2422/2484/2544/2607/2678), not by the reconnect retry loop where the actual budget burn happens; (b) the failure mode is cleanup-after-clean-teardown, not the server rejecting a request — different semantics worth keeping distinct. The two could be merged later if more anyio cleanup signatures need handling.
  • fix(tools): keep cua-driver MCP session task-local #28556 (fix(tools): keep cua-driver MCP session task-local) touches a related anyio cancel-scope problem in tools/computer_use/cua_backend.py. Different file, different code path; no overlap.

Audited siblings: confirmed every async with streamable_http_client(...) block in _run_http shares the same outer reconnect retry loop in run(), so the single check in the except Exception handler covers all three call sites (lines 1463, plus the wrapping retries at 1594-1654 surface the same exception class). Stdio transport (_run_stdio) does not use streamable_http_client, so the artifact cannot fire from that path. No widening needed.

Copilot AI review requested due to automatic review settings May 25, 2026 14:20

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR prevents benign anyio teardown errors during HTTP/StreamableHTTP reconnects from consuming the MCP reconnect retry budget (fixing the “permanent disconnect after N keepalive reconnects” regression in #31987), and adds regression tests to lock in the behavior.

Changes:

  • Detect and ignore the anyio “not holding this lock” teardown artifact during reconnect loops.
  • Add _is_anyio_cleanup_artifact() to unwrap BaseExceptionGroup and match known anyio cleanup markers.
  • Introduce regression tests covering both the marker detection and reconnect-budget behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
tools/mcp_tool.py Adds detection logic to ignore benign anyio teardown errors and avoid burning reconnect retries.
tests/tools/test_mcp_anyio_cleanup_artifact.py Adds regression tests for marker matching and reconnect retry-budget behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/mcp_tool.py
"reconnect tear-down (not counting against retries): %s",
self.name, exc,
)
backoff = 1.0
Comment thread tools/mcp_tool.py
Comment on lines +1999 to +2010
Walks ``BaseExceptionGroup`` chains because the top-level error
from a ``streamable_http_client`` teardown is typically
``ExceptionGroup("unhandled errors in a TaskGroup", [...])`` with
the real anyio RuntimeError nested inside. See #31987.
"""
stack: list[BaseException] = [exc]
while stack:
cur = stack.pop()
msg = str(cur).lower()
if msg and any(marker in msg for marker in _ANYIO_CLEANUP_MARKERS):
return True
if isinstance(cur, BaseExceptionGroup):
Comment thread tools/mcp_tool.py
``ExceptionGroup("unhandled errors in a TaskGroup", [...])`` with
the real anyio RuntimeError nested inside. See #31987.
"""
stack: list[BaseException] = [exc]
Comment thread tools/mcp_tool.py
Comment on lines +2004 to +2010
stack: list[BaseException] = [exc]
while stack:
cur = stack.pop()
msg = str(cur).lower()
if msg and any(marker in msg for marker in _ANYIO_CLEANUP_MARKERS):
return True
if isinstance(cur, BaseExceptionGroup):
# And no error was latched onto the server.
assert server._error is None

asyncio.new_event_loop().run_until_complete(_run())
f"failures; got {call_count}"
)

asyncio.new_event_loop().run_until_complete(_run())
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth labels May 25, 2026
@briandevans

Copy link
Copy Markdown
Contributor Author

@copilot All actionable findings addressed in commit afebfd883:

  • tools/mcp_tool.py:1660 (tight-reconnect-loop): added await asyncio.sleep(0) before continue so a transport that fails synchronously inside its __aenter__ cannot starve the event loop with a tight reconnect spin. The backoff reset to 1.0s is preserved (a transient cleanup artifact should not delay a healthy reconnect by the prior backoff).
  • tests/tools/test_mcp_anyio_cleanup_artifact.py:119, 164 (leaked event loops): replaced asyncio.new_event_loop().run_until_complete(_run()) with asyncio.run(_run()) in both tests, which closes the loop in a try/finally internally.
  • tools/mcp_tool.py:2004, 2010 (BaseExceptionGroup Python 3.11+): not actionable — pyproject.toml pins requires-python = ">=3.11", so the symbol is always available at import and runtime in this repo.

The CI test (6) failure on the prior commit is tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resize_escape_is_forwarded timing out after 30s on a starlette.testclient WebSocket portal — unrelated to tools/mcp_tool.py or this PR's test file; appears to be a pre-existing flake in slice 6/6 of the web-server pty suite.

@hclsys hclsys left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix is well-scoped and the matcher is appropriately narrow — _ANYIO_CLEANUP_MARKERS = ("current task is not holding this lock",) won't false-positive on unrelated standalone errors, and the negative tests (DNS resolution failed, ECONNRESET) confirm that. The ExceptionGroup-unwrap (positive/nested/group tests) is the right call since streamable_http_client.__aexit__ surfaces the error wrapped in a TaskGroup group. Good.

One correctness edge worth considering before merge — mixed ExceptionGroups:

_is_anyio_cleanup_artifact returns True as soon as any node in the group tree matches the marker (tools/mcp_tool.py, the any(marker in msg ...) inside the while stack walk). It does not check that the artifact is the only exception in the group. But anyio's TaskGroup __aexit__ bundles all child-task exceptions into a single ExceptionGroup. So a teardown where the benign lock error and a genuine transport failure both surface — e.g.

ExceptionGroup("unhandled errors in a TaskGroup", [
    RuntimeError("The current task is not holding this lock"),  # benign
    ConnectionError("Connection refused"),                      # real
])

— would match on the first node and hit the continue, so the reconnect loop skips burning a retry slot on a real failure that's co-bundled with the cleanup noise. That's the inverse of the bug you're fixing (here you'd want to count it).

This is plausibly reachable: the reconnect that triggers the teardown can itself fail for a real reason in the same unwind. Suggest tightening to "benign only if every leaf in the group is an artifact (or the group's non-artifact leaves are empty)" — i.e. walk the group and require all real (non-group) exceptions to match the marker, rather than any. A test with a mixed group asserting False would lock it in.

Standalone-artifact path and the budget/backoff logic look correct otherwise.

When `streamable_http_client.__aexit__` unwinds for a planned reconnect
on a Streamable HTTP MCP server (e.g. after a keepalive-triggered
tear-down), anyio's TaskGroup cleanup can raise
`RuntimeError("The current task is not holding this lock")` even though
the session shut down cleanly.  The reconnect loop's `except Exception`
treated this as a real connection failure and burned a slot in the
`_MAX_RECONNECT_RETRIES` budget, so every keepalive-driven reconnect
cost one retry attempt and the server was permanently disconnected
after 5 cleanup events.

Add `_ANYIO_CLEANUP_MARKERS` + `_is_anyio_cleanup_artifact()` that walk
`BaseExceptionGroup` chains (the SDK wraps the inner RuntimeError in a
TaskGroup ExceptionGroup), and in the retry loop skip the retry-counter
increment when the artifact fires after the server was already ready.

Mirrors the existing `_SESSION_EXPIRED_MARKERS` design.  Fixes NousResearch#31987.
Copilot review feedback on NousResearch#32111:

- mcp_tool.py: the cleanup-artifact branch went straight to `continue`
  with no await, which would let a transport that fails synchronously
  inside `streamable_http_client.__aenter__` spin tightly without
  yielding to the event loop. Add an `await asyncio.sleep(0)` before
  the `continue` so other tasks (including shutdown delivery) can run.
- test_mcp_anyio_cleanup_artifact.py: switch
  `asyncio.new_event_loop().run_until_complete(_run())` to
  `asyncio.run(_run())` so the event loop is properly closed and
  doesn't trigger ResourceWarnings.

The Python-3.11+ `BaseExceptionGroup` concern Copilot also flagged is
not actionable: `pyproject.toml` pins `requires-python = ">=3.11"`,
so the symbol is always available.
@briandevans briandevans force-pushed the fix/mcp-anyio-cleanup-31987 branch from 752261b to 103468d Compare June 5, 2026 23:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP HTTP transport: anyio RuntimeError during streamable_http_client cleanup causes reconnect failure loop

4 participants