fix(mcp): don't cancel registration on stale per-thread interrupt flag#14173
Open
bobashopcashier wants to merge 1 commit into
Open
fix(mcp): don't cancel registration on stale per-thread interrupt flag#14173bobashopcashier wants to merge 1 commit into
bobashopcashier wants to merge 1 commit into
Conversation
_run_on_mcp_loop() polls is_interrupted() and cancels the submitted coroutine if the calling thread's tid is in _interrupted_threads. That cancellation channel exists for in-flight tool calls so a new user message aborts a long-running tool, but it was also active for the discovery / probe paths that run during gateway startup and /reload-mcp. A stale per-tid flag left behind on a recycled default-executor worker (clear_interrupt() already warns about this failure mode in run_agent.py) cancels the stdio handshake before it finishes and surfaces as a bare "CancelledError" in gateway.error.log. Fix: respect_interrupt kwarg on _run_on_mcp_loop, defaulting to True for tool-call callers so user interrupts still propagate mid-call. Pass False from register_mcp_servers() and probe_mcp_server_tools() — those paths must never be cancelled by a stale flag. Also: _format_connect_error() now emits an actionable hint when the exception is a naked CancelledError instead of just the class name. Tests: regression guards for both respect_interrupt branches and for the new CancelledError path (including a cause-preservation case and the existing missing-executable path). Refs NousResearch#14113.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Narrow fix for #14113 where stdio MCP servers silently fail to register in the launchd-managed gateway with a bare
CancelledError, while the same config works viahermes mcp test,hermes chat -q, and the gateway's HTTP transport.What's happening
_run_on_mcp_loop()(tools/mcp_tool.py:1544) pollsis_interrupted()in a 0.1s loop and callsfuture.cancel()on the submitted coroutine if the calling thread's tid is in_interrupted_threads. That cancellation channel exists so a new user message aborts a long-running tool call mid-flight — exactly the right semantics for_make_tool_handler.The same plumbing was also being used for registration / discovery / probe — call paths that run during gateway startup and
/reload-mcp, scheduled onto the default executor's worker pool.run_agent.py's ownclear_interrupt()(line 3619–3623) already warns that stale per-tid flags can survive a turn boundary and fire on unrelated work scheduled onto the recycled worker:When a prior agent turn's cleanup was missed (crash, abandoned task, whatever), the executor worker's tid stays in
_interrupted_threads. The next thing to run on that worker —discover_mcp_tools()at gateway boot or/reload-mcp— gets its stdio handshake cancelled before it finishes.asyncio.gather(..., return_exceptions=True)collects the bareCancelledErrorand the warning in the report is emitted.The fix
One kwarg on
_run_on_mcp_loop:True— tool-call callers (_make_tool_handler._call_onceand its five variants) keep the existing interrupt behavior; user interrupts still abort in-flight tool calls.Falsefromregister_mcp_servers()andprobe_mcp_server_tools()— registration / probe cannot be cancelled by a stale flag on a recycled worker.Also:
_format_connect_error()now recognises a nakedasyncio.CancelledError(no__cause__, no__context__) and emits a short actionable message instead of the bare class name. A chainedCancelledErrorwith a real cause still falls through to the existing flattening path so the underlying reason is preserved.Scope limits
gateway/run.py,run_agent.py, ormodel_tools.py. The fix lives where the bug lives._format_connect_errormessage will point them to the next datapoint (parent-task cancel vs. some other source).Related Issue
Fixes #14113
Type of Change
Changes Made
tools/mcp_tool.py:_run_on_mcp_loop: newrespect_interruptkwarg (keyword-only, defaultTrue)register_mcp_serversandprobe_mcp_server_tools: passrespect_interrupt=False_format_connect_error: recognize nakedCancelledErrorand emit an actionable hinttests/tools/test_mcp_tool.py:TestRunOnMCPLoopInterrupts.test_respect_interrupt_false_ignores_stale_flag— simulates stale flag, asserts registration proceedsTestRunOnMCPLoopInterrupts.test_respect_interrupt_default_still_cancels— regression guard, tool-call path still raisesInterruptedErrorTestFormatConnectError.test_naked_cancelled_error_has_actionable_message— new CancelledError branchTestFormatConnectError.test_cancelled_error_with_cause_falls_through— chained cause still renderedTestFormatConnectError.test_missing_node_executable_unchanged— existing FileNotFoundError path unaffectedHow to Test
source .venv/bin/activate scripts/run_tests.sh tests/tools/test_mcp_tool.py -qExpected:
171 passed.Checklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests pass — onlytests/tools/test_mcp_tool.pywas run (171 passed); the full suite was not runDocumentation & Housekeeping
docs/, docstrings) — added explanatory docstrings on the newrespect_interruptkwargcli-config.yaml.exampleif I added/changed config keys — N/ACONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/AScreenshots / Logs