fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect by briandevans · Pull Request #32111 · NousResearch/hermes-agent

briandevans · 2026-05-25T14:20:15Z

What does this PR do?

When a Streamable HTTP MCP server tears down its session for a planned
reconnect (e.g. after the keepalive timer triggers, default 180s),
async with streamable_http_client(...) unwinds and anyio's TaskGroup
cleanup can raise RuntimeError("The current task is not holding this lock") from a Lock that was acquired by a different coroutine inside
the TaskGroup. The session shut down cleanly, but the cleanup raises.

Previously the reconnect loop's except Exception treated this as a
real connection failure and burned a slot in the
_MAX_RECONNECT_RETRIES budget. Every keepalive-driven reconnect cost
one retry attempt, and after ~5 cleanup events the server was
permanently disconnected for the rest of the gateway lifetime.

Fix: add _ANYIO_CLEANUP_MARKERS + _is_anyio_cleanup_artifact() that
walk BaseExceptionGroup chains (the SDK wraps the inner
RuntimeError in an ExceptionGroup("unhandled errors in a TaskGroup", ...)), and in the retry loop skip the retry-counter
increment when the artifact fires after the server was already ready.
The helper mirrors the existing _SESSION_EXPIRED_MARKERS design.

Related Issue

Fixes #31987

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

tools/mcp_tool.py — add _ANYIO_CLEANUP_MARKERS + _is_anyio_cleanup_artifact() (walks BaseExceptionGroup), and short-circuit the reconnect retry counter when the artifact fires after _ready is set.
tests/tools/test_mcp_anyio_cleanup_artifact.py — six regression tests: bare RuntimeError match, ExceptionGroup unwrap (one + nested levels), rejection of unrelated RuntimeError/ConnectionError, the full reconnect-loop scenario showing >_MAX_RECONNECT_RETRIES cleanup events survive, and a control test confirming real ConnectionError failures still exhaust the budget at _MAX_RECONNECT_RETRIES + 1 attempts.

How to Test

uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/tools/test_mcp_anyio_cleanup_artifact.py -v — 6 passed
uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/tools/test_mcp_reconnect_signal.py tests/tools/test_mcp_stability.py tests/tools/test_mcp_tool_session_expired.py tests/tools/test_mcp_cancelled_error_propagation.py tests/tools/test_mcp_circuit_breaker.py tests/tools/test_mcp_tool.py -v — 239 passed, no regressions
Regression guard: with tools/mcp_tool.py reverted to origin/main, test_reconnect_loop_ignores_anyio_cleanup_artifact fails with assert 6 > 6 (server gives up after 5 cleanup events) — confirms the test exercises the fix.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run focused tests for the touched code and all pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: macOS 15.x

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — the anyio/asyncio interaction is OS-independent
I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

Reproduces the user-reported log from #31987:

keepalive failed, triggering reconnect
reconnect requested — tearing down HTTP session
connection lost (attempt 1/5): unhandled errors in a TaskGroup
...
failed after 5 reconnection attempts, giving up

With the fix, those RuntimeError("not holding this lock") cleanup
artifacts drop to DEBUG-level and don't consume the retry budget.

Related / Positioning

daimon-nous (issue commenter) suggested extending _SESSION_EXPIRED_MARKERS to cover this RuntimeError. I chose a separate _ANYIO_CLEANUP_MARKERS tuple + dedicated helper because (a) _is_session_expired_error is only consulted by tool-call handlers (_handle_session_expired_and_retry at lines 2422/2484/2544/2607/2678), not by the reconnect retry loop where the actual budget burn happens; (b) the failure mode is cleanup-after-clean-teardown, not the server rejecting a request — different semantics worth keeping distinct. The two could be merged later if more anyio cleanup signatures need handling.
fix(tools): keep cua-driver MCP session task-local #28556 (fix(tools): keep cua-driver MCP session task-local) touches a related anyio cancel-scope problem in tools/computer_use/cua_backend.py. Different file, different code path; no overlap.

Audited siblings: confirmed every async with streamable_http_client(...) block in _run_http shares the same outer reconnect retry loop in run(), so the single check in the except Exception handler covers all three call sites (lines 1463, plus the wrapping retries at 1594-1654 surface the same exception class). Stdio transport (_run_stdio) does not use streamable_http_client, so the artifact cannot fire from that path. No widening needed.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR prevents benign anyio teardown errors during HTTP/StreamableHTTP reconnects from consuming the MCP reconnect retry budget (fixing the “permanent disconnect after N keepalive reconnects” regression in #31987), and adds regression tests to lock in the behavior.

Changes:

Detect and ignore the anyio “not holding this lock” teardown artifact during reconnect loops.
Add _is_anyio_cleanup_artifact() to unwrap BaseExceptionGroup and match known anyio cleanup markers.
Introduce regression tests covering both the marker detection and reconnect-budget behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File	Description
tools/mcp_tool.py	Adds detection logic to ignore benign anyio teardown errors and avoid burning reconnect retries.
tests/tools/test_mcp_anyio_cleanup_artifact.py	Adds regression tests for marker matching and reconnect retry-budget behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+                        "reconnect tear-down (not counting against retries): %s",
+                        self.name, exc,
+                    )
+                    backoff = 1.0


+    Walks ``BaseExceptionGroup`` chains because the top-level error
+    from a ``streamable_http_client`` teardown is typically
+    ``ExceptionGroup("unhandled errors in a TaskGroup", [...])`` with
+    the real anyio RuntimeError nested inside.  See #31987.
+    """
+    stack: list[BaseException] = [exc]
+    while stack:
+        cur = stack.pop()
+        msg = str(cur).lower()
+        if msg and any(marker in msg for marker in _ANYIO_CLEANUP_MARKERS):
+            return True
+        if isinstance(cur, BaseExceptionGroup):


+    ``ExceptionGroup("unhandled errors in a TaskGroup", [...])`` with
+    the real anyio RuntimeError nested inside.  See #31987.
+    """
+    stack: list[BaseException] = [exc]


+    stack: list[BaseException] = [exc]
+    while stack:
+        cur = stack.pop()
+        msg = str(cur).lower()
+        if msg and any(marker in msg for marker in _ANYIO_CLEANUP_MARKERS):
+            return True
+        if isinstance(cur, BaseExceptionGroup):


+        # And no error was latched onto the server.
+        assert server._error is None
+
+    asyncio.new_event_loop().run_until_complete(_run())


+            f"failures; got {call_count}"
+        )
+
+    asyncio.new_event_loop().run_until_complete(_run())


briandevans · 2026-05-25T15:10:29Z

@copilot All actionable findings addressed in commit afebfd883:

tools/mcp_tool.py:1660 (tight-reconnect-loop): added await asyncio.sleep(0) before continue so a transport that fails synchronously inside its __aenter__ cannot starve the event loop with a tight reconnect spin. The backoff reset to 1.0s is preserved (a transient cleanup artifact should not delay a healthy reconnect by the prior backoff).
tests/tools/test_mcp_anyio_cleanup_artifact.py:119, 164 (leaked event loops): replaced asyncio.new_event_loop().run_until_complete(_run()) with asyncio.run(_run()) in both tests, which closes the loop in a try/finally internally.
tools/mcp_tool.py:2004, 2010 (BaseExceptionGroup Python 3.11+): not actionable — pyproject.toml pins requires-python = ">=3.11", so the symbol is always available at import and runtime in this repo.

The CI test (6) failure on the prior commit is tests/hermes_cli/test_web_server.py::TestPtyWebSocket::test_resize_escape_is_forwarded timing out after 30s on a starlette.testclient WebSocket portal — unrelated to tools/mcp_tool.py or this PR's test file; appears to be a pre-existing flake in slice 6/6 of the web-server pty suite.

hclsys

The fix is well-scoped and the matcher is appropriately narrow — _ANYIO_CLEANUP_MARKERS = ("current task is not holding this lock",) won't false-positive on unrelated standalone errors, and the negative tests (DNS resolution failed, ECONNRESET) confirm that. The ExceptionGroup-unwrap (positive/nested/group tests) is the right call since streamable_http_client.__aexit__ surfaces the error wrapped in a TaskGroup group. Good.

One correctness edge worth considering before merge — mixed ExceptionGroups:

_is_anyio_cleanup_artifact returns True as soon as any node in the group tree matches the marker (tools/mcp_tool.py, the any(marker in msg ...) inside the while stack walk). It does not check that the artifact is the only exception in the group. But anyio's TaskGroup __aexit__ bundles all child-task exceptions into a single ExceptionGroup. So a teardown where the benign lock error and a genuine transport failure both surface — e.g.

ExceptionGroup("unhandled errors in a TaskGroup", [
    RuntimeError("The current task is not holding this lock"),  # benign
    ConnectionError("Connection refused"),                      # real
])

— would match on the first node and hit the continue, so the reconnect loop skips burning a retry slot on a real failure that's co-bundled with the cleanup noise. That's the inverse of the bug you're fixing (here you'd want to count it).

This is plausibly reachable: the reconnect that triggers the teardown can itself fail for a real reason in the same unwind. Suggest tightening to "benign only if every leaf in the group is an artifact (or the group's non-artifact leaves are empty)" — i.e. walk the group and require all real (non-group) exceptions to match the marker, rather than any. A test with a mixed group asserting False would lock it in.

Standalone-artifact path and the budget/backoff logic look correct otherwise.

When `streamable_http_client.__aexit__` unwinds for a planned reconnect on a Streamable HTTP MCP server (e.g. after a keepalive-triggered tear-down), anyio's TaskGroup cleanup can raise `RuntimeError("The current task is not holding this lock")` even though the session shut down cleanly. The reconnect loop's `except Exception` treated this as a real connection failure and burned a slot in the `_MAX_RECONNECT_RETRIES` budget, so every keepalive-driven reconnect cost one retry attempt and the server was permanently disconnected after 5 cleanup events. Add `_ANYIO_CLEANUP_MARKERS` + `_is_anyio_cleanup_artifact()` that walk `BaseExceptionGroup` chains (the SDK wraps the inner RuntimeError in a TaskGroup ExceptionGroup), and in the retry loop skip the retry-counter increment when the artifact fires after the server was already ready. Mirrors the existing `_SESSION_EXPIRED_MARKERS` design. Fixes NousResearch#31987.

Copilot review feedback on NousResearch#32111: - mcp_tool.py: the cleanup-artifact branch went straight to `continue` with no await, which would let a transport that fails synchronously inside `streamable_http_client.__aenter__` spin tightly without yielding to the event loop. Add an `await asyncio.sleep(0)` before the `continue` so other tasks (including shutdown delivery) can run. - test_mcp_anyio_cleanup_artifact.py: switch `asyncio.new_event_loop().run_until_complete(_run())` to `asyncio.run(_run())` so the event loop is properly closed and doesn't trigger ResourceWarnings. The Python-3.11+ `BaseExceptionGroup` concern Copilot also flagged is not actionable: `pyproject.toml` pins `requires-python = ">=3.11"`, so the symbol is always available.

Copilot AI review requested due to automatic review settings May 25, 2026 14:20

Copilot AI reviewed May 25, 2026

View reviewed changes

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth labels May 25, 2026

hclsys reviewed May 25, 2026

View reviewed changes

briandevans force-pushed the fix/mcp-anyio-cleanup-31987 branch from afebfd8 to 3286875 Compare May 28, 2026 06:09

briandevans force-pushed the fix/mcp-anyio-cleanup-31987 branch from 3286875 to 5ffc4e7 Compare May 29, 2026 06:14

Urfread mentioned this pull request May 30, 2026

[Reliability] Repeated tool-call errors (path/retry patterns) inflate latency; need stronger fail-fast + dedupe #35114

Closed

briandevans force-pushed the fix/mcp-anyio-cleanup-31987 branch from 5ffc4e7 to ce631b5 Compare May 30, 2026 07:10

briandevans force-pushed the fix/mcp-anyio-cleanup-31987 branch from ce631b5 to 752261b Compare June 2, 2026 11:18

briandevans added 2 commits June 5, 2026 16:15

briandevans force-pushed the fix/mcp-anyio-cleanup-31987 branch from 752261b to 103468d Compare June 5, 2026 23:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect#32111

fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect#32111
briandevans wants to merge 2 commits into
NousResearch:mainfrom
briandevans:fix/mcp-anyio-cleanup-31987

briandevans commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

briandevans commented May 25, 2026

Uh oh!

hclsys left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

briandevans commented May 25, 2026

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

Screenshots / Logs

Related / Positioning

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

briandevans commented May 25, 2026

Uh oh!

hclsys left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants