Skip to content

fix(mcp): don't cancel registration on stale per-thread interrupt flag#14173

Open
bobashopcashier wants to merge 1 commit into
NousResearch:mainfrom
bobashopcashier:fix/mcp-registration-stale-interrupt-14113
Open

fix(mcp): don't cancel registration on stale per-thread interrupt flag#14173
bobashopcashier wants to merge 1 commit into
NousResearch:mainfrom
bobashopcashier:fix/mcp-registration-stale-interrupt-14113

Conversation

@bobashopcashier

Copy link
Copy Markdown
Contributor

What does this PR do?

Narrow fix for #14113 where stdio MCP servers silently fail to register in the launchd-managed gateway with a bare CancelledError, while the same config works via hermes mcp test, hermes chat -q, and the gateway's HTTP transport.

What's happening

_run_on_mcp_loop() (tools/mcp_tool.py:1544) polls is_interrupted() in a 0.1s loop and calls future.cancel() on the submitted coroutine if the calling thread's tid is in _interrupted_threads. That cancellation channel exists so a new user message aborts a long-running tool call mid-flight — exactly the right semantics for _make_tool_handler.

The same plumbing was also being used for registration / discovery / probe — call paths that run during gateway startup and /reload-mcp, scheduled onto the default executor's worker pool. run_agent.py's own clear_interrupt() (line 3619–3623) already warns that stale per-tid flags can survive a turn boundary and fire on unrelated work scheduled onto the recycled worker:

"guarantees no stale interrupt can survive a turn boundary and fire on a subsequent, unrelated tool call that happens to get scheduled onto the same recycled worker tid."

When a prior agent turn's cleanup was missed (crash, abandoned task, whatever), the executor worker's tid stays in _interrupted_threads. The next thing to run on that worker — discover_mcp_tools() at gateway boot or /reload-mcp — gets its stdio handshake cancelled before it finishes. asyncio.gather(..., return_exceptions=True) collects the bare CancelledError and the warning in the report is emitted.

The fix

One kwarg on _run_on_mcp_loop:

def _run_on_mcp_loop(coro, timeout: float = 30, *, respect_interrupt: bool = True):
    ...
    if respect_interrupt and is_interrupted():
        future.cancel()
        raise InterruptedError("User sent a new message")
  • Default True — tool-call callers (_make_tool_handler._call_once and its five variants) keep the existing interrupt behavior; user interrupts still abort in-flight tool calls.
  • Explicit False from register_mcp_servers() and probe_mcp_server_tools() — registration / probe cannot be cancelled by a stale flag on a recycled worker.

Also: _format_connect_error() now recognises a naked asyncio.CancelledError (no __cause__, no __context__) and emits a short actionable message instead of the bare class name. A chained CancelledError with a real cause still falls through to the existing flattening path so the underlying reason is preserved.

Scope limits

  • Does not touch gateway/run.py, run_agent.py, or model_tools.py. The fix lives where the bug lives.
  • Does not change tool-call interrupt semantics.
  • Does not empirically confirm the stale-flag hypothesis against a live launchd environment — I don't have one. The fix is defensible on its own terms (registration shouldn't be cancellable by a stale user-interrupt flag), and removes the most plausible trigger for the reported symptom. If the reporter sees residual cancellations after this lands, the improved _format_connect_error message will point them to the next datapoint (parent-task cancel vs. some other source).

Related Issue

Fixes #14113

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • tools/mcp_tool.py:
    • _run_on_mcp_loop: new respect_interrupt kwarg (keyword-only, default True)
    • register_mcp_servers and probe_mcp_server_tools: pass respect_interrupt=False
    • _format_connect_error: recognize naked CancelledError and emit an actionable hint
  • tests/tools/test_mcp_tool.py:
    • TestRunOnMCPLoopInterrupts.test_respect_interrupt_false_ignores_stale_flag — simulates stale flag, asserts registration proceeds
    • TestRunOnMCPLoopInterrupts.test_respect_interrupt_default_still_cancels — regression guard, tool-call path still raises InterruptedError
    • TestFormatConnectError.test_naked_cancelled_error_has_actionable_message — new CancelledError branch
    • TestFormatConnectError.test_cancelled_error_with_cause_falls_through — chained cause still rendered
    • TestFormatConnectError.test_missing_node_executable_unchanged — existing FileNotFoundError path unaffected

How to Test

source .venv/bin/activate
scripts/run_tests.sh tests/tools/test_mcp_tool.py -q

Expected: 171 passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass — only tests/tools/test_mcp_tool.py was run (171 passed); the full suite was not run
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS (Darwin 25.4.0). No live launchd repro — I don't have the reporter's environment.

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — added explanatory docstrings on the new respect_interrupt kwarg
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact (Windows, macOS) — pure Python, no platform-specific code
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

Screenshots / Logs

▶ running pytest with 4 workers, hermetic env, in /…/hermes-agent
  (TZ=UTC LANG=C.UTF-8 PYTHONHASHSEED=0; all credential env vars unset)
........................................................................ [ 42%]
........................................................................ [ 84%]
...........................                                              [100%]
171 passed, 5 warnings in 8.01s

_run_on_mcp_loop() polls is_interrupted() and cancels the submitted
coroutine if the calling thread's tid is in _interrupted_threads. That
cancellation channel exists for in-flight tool calls so a new user
message aborts a long-running tool, but it was also active for the
discovery / probe paths that run during gateway startup and /reload-mcp.
A stale per-tid flag left behind on a recycled default-executor worker
(clear_interrupt() already warns about this failure mode in run_agent.py)
cancels the stdio handshake before it finishes and surfaces as a bare
"CancelledError" in gateway.error.log.

Fix: respect_interrupt kwarg on _run_on_mcp_loop, defaulting to True for
tool-call callers so user interrupts still propagate mid-call. Pass False
from register_mcp_servers() and probe_mcp_server_tools() — those paths
must never be cancelled by a stale flag.

Also: _format_connect_error() now emits an actionable hint when the
exception is a naked CancelledError instead of just the class name.

Tests: regression guards for both respect_interrupt branches and for the
new CancelledError path (including a cause-preservation case and the
existing missing-executable path).

Refs NousResearch#14113.
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Gateway fails to register stdio MCP servers silently on macOS (launchd)

2 participants