fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect#32111
fix(tools): ignore anyio cleanup artifact during MCP HTTP reconnect#32111briandevans wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR prevents benign anyio teardown errors during HTTP/StreamableHTTP reconnects from consuming the MCP reconnect retry budget (fixing the “permanent disconnect after N keepalive reconnects” regression in #31987), and adds regression tests to lock in the behavior.
Changes:
- Detect and ignore the anyio “not holding this lock” teardown artifact during reconnect loops.
- Add
_is_anyio_cleanup_artifact()to unwrapBaseExceptionGroupand match known anyio cleanup markers. - Introduce regression tests covering both the marker detection and reconnect-budget behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tools/mcp_tool.py | Adds detection logic to ignore benign anyio teardown errors and avoid burning reconnect retries. |
| tests/tools/test_mcp_anyio_cleanup_artifact.py | Adds regression tests for marker matching and reconnect retry-budget behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "reconnect tear-down (not counting against retries): %s", | ||
| self.name, exc, | ||
| ) | ||
| backoff = 1.0 |
| Walks ``BaseExceptionGroup`` chains because the top-level error | ||
| from a ``streamable_http_client`` teardown is typically | ||
| ``ExceptionGroup("unhandled errors in a TaskGroup", [...])`` with | ||
| the real anyio RuntimeError nested inside. See #31987. | ||
| """ | ||
| stack: list[BaseException] = [exc] | ||
| while stack: | ||
| cur = stack.pop() | ||
| msg = str(cur).lower() | ||
| if msg and any(marker in msg for marker in _ANYIO_CLEANUP_MARKERS): | ||
| return True | ||
| if isinstance(cur, BaseExceptionGroup): |
| ``ExceptionGroup("unhandled errors in a TaskGroup", [...])`` with | ||
| the real anyio RuntimeError nested inside. See #31987. | ||
| """ | ||
| stack: list[BaseException] = [exc] |
| stack: list[BaseException] = [exc] | ||
| while stack: | ||
| cur = stack.pop() | ||
| msg = str(cur).lower() | ||
| if msg and any(marker in msg for marker in _ANYIO_CLEANUP_MARKERS): | ||
| return True | ||
| if isinstance(cur, BaseExceptionGroup): |
| # And no error was latched onto the server. | ||
| assert server._error is None | ||
|
|
||
| asyncio.new_event_loop().run_until_complete(_run()) |
| f"failures; got {call_count}" | ||
| ) | ||
|
|
||
| asyncio.new_event_loop().run_until_complete(_run()) |
|
@copilot All actionable findings addressed in commit afebfd883:
The CI test (6) failure on the prior commit is |
hclsys
left a comment
There was a problem hiding this comment.
The fix is well-scoped and the matcher is appropriately narrow — _ANYIO_CLEANUP_MARKERS = ("current task is not holding this lock",) won't false-positive on unrelated standalone errors, and the negative tests (DNS resolution failed, ECONNRESET) confirm that. The ExceptionGroup-unwrap (positive/nested/group tests) is the right call since streamable_http_client.__aexit__ surfaces the error wrapped in a TaskGroup group. Good.
One correctness edge worth considering before merge — mixed ExceptionGroups:
_is_anyio_cleanup_artifact returns True as soon as any node in the group tree matches the marker (tools/mcp_tool.py, the any(marker in msg ...) inside the while stack walk). It does not check that the artifact is the only exception in the group. But anyio's TaskGroup __aexit__ bundles all child-task exceptions into a single ExceptionGroup. So a teardown where the benign lock error and a genuine transport failure both surface — e.g.
ExceptionGroup("unhandled errors in a TaskGroup", [
RuntimeError("The current task is not holding this lock"), # benign
ConnectionError("Connection refused"), # real
])
— would match on the first node and hit the continue, so the reconnect loop skips burning a retry slot on a real failure that's co-bundled with the cleanup noise. That's the inverse of the bug you're fixing (here you'd want to count it).
This is plausibly reachable: the reconnect that triggers the teardown can itself fail for a real reason in the same unwind. Suggest tightening to "benign only if every leaf in the group is an artifact (or the group's non-artifact leaves are empty)" — i.e. walk the group and require all real (non-group) exceptions to match the marker, rather than any. A test with a mixed group asserting False would lock it in.
Standalone-artifact path and the budget/backoff logic look correct otherwise.
afebfd8 to
3286875
Compare
3286875 to
5ffc4e7
Compare
5ffc4e7 to
ce631b5
Compare
ce631b5 to
752261b
Compare
When `streamable_http_client.__aexit__` unwinds for a planned reconnect
on a Streamable HTTP MCP server (e.g. after a keepalive-triggered
tear-down), anyio's TaskGroup cleanup can raise
`RuntimeError("The current task is not holding this lock")` even though
the session shut down cleanly. The reconnect loop's `except Exception`
treated this as a real connection failure and burned a slot in the
`_MAX_RECONNECT_RETRIES` budget, so every keepalive-driven reconnect
cost one retry attempt and the server was permanently disconnected
after 5 cleanup events.
Add `_ANYIO_CLEANUP_MARKERS` + `_is_anyio_cleanup_artifact()` that walk
`BaseExceptionGroup` chains (the SDK wraps the inner RuntimeError in a
TaskGroup ExceptionGroup), and in the retry loop skip the retry-counter
increment when the artifact fires after the server was already ready.
Mirrors the existing `_SESSION_EXPIRED_MARKERS` design. Fixes NousResearch#31987.
Copilot review feedback on NousResearch#32111: - mcp_tool.py: the cleanup-artifact branch went straight to `continue` with no await, which would let a transport that fails synchronously inside `streamable_http_client.__aenter__` spin tightly without yielding to the event loop. Add an `await asyncio.sleep(0)` before the `continue` so other tasks (including shutdown delivery) can run. - test_mcp_anyio_cleanup_artifact.py: switch `asyncio.new_event_loop().run_until_complete(_run())` to `asyncio.run(_run())` so the event loop is properly closed and doesn't trigger ResourceWarnings. The Python-3.11+ `BaseExceptionGroup` concern Copilot also flagged is not actionable: `pyproject.toml` pins `requires-python = ">=3.11"`, so the symbol is always available.
752261b to
103468d
Compare
What does this PR do?
When a Streamable HTTP MCP server tears down its session for a planned
reconnect (e.g. after the keepalive timer triggers, default 180s),
async with streamable_http_client(...)unwinds and anyio's TaskGroupcleanup can raise
RuntimeError("The current task is not holding this lock")from a Lock that was acquired by a different coroutine insidethe TaskGroup. The session shut down cleanly, but the cleanup raises.
Previously the reconnect loop's
except Exceptiontreated this as areal connection failure and burned a slot in the
_MAX_RECONNECT_RETRIESbudget. Every keepalive-driven reconnect costone retry attempt, and after ~5 cleanup events the server was
permanently disconnected for the rest of the gateway lifetime.
Fix: add
_ANYIO_CLEANUP_MARKERS+_is_anyio_cleanup_artifact()thatwalk
BaseExceptionGroupchains (the SDK wraps the innerRuntimeErrorin anExceptionGroup("unhandled errors in a TaskGroup", ...)), and in the retry loop skip the retry-counterincrement when the artifact fires after the server was already ready.
The helper mirrors the existing
_SESSION_EXPIRED_MARKERSdesign.Related Issue
Fixes #31987
Type of Change
Changes Made
tools/mcp_tool.py— add_ANYIO_CLEANUP_MARKERS+_is_anyio_cleanup_artifact()(walksBaseExceptionGroup), and short-circuit the reconnect retry counter when the artifact fires after_readyis set.tests/tools/test_mcp_anyio_cleanup_artifact.py— six regression tests: bareRuntimeErrormatch,ExceptionGroupunwrap (one + nested levels), rejection of unrelatedRuntimeError/ConnectionError, the full reconnect-loop scenario showing >_MAX_RECONNECT_RETRIES cleanup events survive, and a control test confirming realConnectionErrorfailures still exhaust the budget at_MAX_RECONNECT_RETRIES + 1attempts.How to Test
uv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/tools/test_mcp_anyio_cleanup_artifact.py -v— 6 passeduv run --with pytest --with pytest-xdist --with pytest-asyncio python3 -m pytest tests/tools/test_mcp_reconnect_signal.py tests/tools/test_mcp_stability.py tests/tools/test_mcp_tool_session_expired.py tests/tools/test_mcp_cancelled_error_propagation.py tests/tools/test_mcp_circuit_breaker.py tests/tools/test_mcp_tool.py -v— 239 passed, no regressionstools/mcp_tool.pyreverted toorigin/main,test_reconnect_loop_ignores_anyio_cleanup_artifactfails withassert 6 > 6(server gives up after 5 cleanup events) — confirms the test exercises the fix.Checklist
Code
fix(scope):,feat(scope):, etc.)Documentation & Housekeeping
docs/, docstrings) — or N/Acli-config.yaml.exampleif I added/changed config keys — N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/AScreenshots / Logs
Reproduces the user-reported log from #31987:
With the fix, those
RuntimeError("not holding this lock")cleanupartifacts drop to DEBUG-level and don't consume the retry budget.
Related / Positioning
_SESSION_EXPIRED_MARKERSto cover this RuntimeError. I chose a separate_ANYIO_CLEANUP_MARKERStuple + dedicated helper because (a)_is_session_expired_erroris only consulted by tool-call handlers (_handle_session_expired_and_retryat lines 2422/2484/2544/2607/2678), not by the reconnect retry loop where the actual budget burn happens; (b) the failure mode is cleanup-after-clean-teardown, not the server rejecting a request — different semantics worth keeping distinct. The two could be merged later if more anyio cleanup signatures need handling.fix(tools): keep cua-driver MCP session task-local) touches a related anyio cancel-scope problem intools/computer_use/cua_backend.py. Different file, different code path; no overlap.