
[BUG]: anyio cancel scope spin loop causes 100% CPU after load test stops #2360

@crivetimihai

Description

Summary

When MCP client transports (sse_client or streamablehttp_client) are closed while tasks are blocked on HTTP streaming reads, anyio's cancel scope enters a CPU spin loop consuming 100% CPU per affected worker. This occurs because _deliver_cancellation repeatedly reschedules itself when tasks cannot acknowledge cancellation.

Environment

  • MCP SDK: 1.25.0
  • anyio: 4.10.0
  • httpx: 0.28.1
  • httpx-sse: 0.4.0
  • Python: 3.12.12
  • ASGI Server: Gunicorn/Granian with Uvicorn workers

Reproduction Steps

  1. Deploy MCP Gateway with multiple workers (e.g., 16 workers × 3 replicas)
  2. Run high-concurrency load test (4000+ virtual users) using Streamable HTTP transport
  3. Stop the load test abruptly (clients disconnect without clean shutdown)
  4. Observe CPU usage: all gateway workers spike to ~50% CPU each (~800% per container)
  5. CPU remains pinned indefinitely until workers are restarted

Root Cause

Both sse_client and streamablehttp_client in the MCP SDK use anyio task groups:

async with anyio.create_task_group() as tg:
    try:
        yield read_stream, write_stream
    finally:
        tg.cancel_scope.cancel()  # <-- Triggers the spin

When SSE/HTTP readers are blocked on aiter_sse() / aiter_lines() and cannot acknowledge cancellation, anyio's _deliver_cancellation() enters a tight loop:

# anyio/_backends/_asyncio.py (simplified)
def _deliver_cancellation(self, origin):
    should_retry = False
    for task in self._tasks:
        should_retry = True  # set for every task still inside the scope
        task.cancel(...)

    if should_retry:
        # reschedules itself on the next event loop iteration -> busy spin
        get_running_loop().call_soon(self._deliver_cancellation, origin)

Evidence

py-spy profiling shows MainThread stuck in _deliver_cancellation:

Thread 15 (active+gil): "MainThread"
    _deliver_cancellation (anyio/_backends/_asyncio.py:569)
    run (asyncio/runners.py:118)
    run (uvicorn/workers.py:104)

Upstream Issues

  • anyio#695 (see the reporting plan below)

Proposed Fix

Add a move_on_after timeout to _close_session() in mcp_session_pool.py:

async def _close_session(self, pooled: PooledSession) -> None:
    if pooled.is_closed:
        return
    pooled.mark_closed()

    with anyio.move_on_after(5.0):
        try:
            await pooled.session.__aexit__(None, None, None)
        except Exception as e:
            logger.debug(f"Error closing session: {e}")

    with anyio.move_on_after(5.0):
        try:
            await pooled.transport_context.__aexit__(None, None, None)
        except Exception as e:
            logger.debug(f"Error closing transport: {e}")

Current Workaround

Worker recycling limits spin duration but doesn't prevent it:

GUNICORN_MAX_REQUESTS=100000
GUNICORN_MAX_REQUESTS_JITTER=10000
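
These environment variables map onto Gunicorn's max_requests / max_requests_jitter settings, which restart each worker after a bounded number of requests. A sketch of the equivalent gunicorn.conf.py, assuming Uvicorn workers:

```python
# gunicorn.conf.py -- worker recycling as a stopgap, not a fix
worker_class = "uvicorn.workers.UvicornWorker"
max_requests = 100000        # recycle each worker after ~100k requests
max_requests_jitter = 10000  # stagger restarts to avoid a thundering herd
```

Recycling caps how long a spinning worker survives, but a worker that hits the spin immediately after restart still burns CPU until its next recycle.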

Upstream Reporting Plan

  1. Report to MCP SDK: Suggest adding move_on_after to transport cleanup
  2. Comment on anyio#695 with this reproduction case
  3. Consider httpx-sse improvement to allow cancellation of aiter_lines()

Labels

  • MUSTP1: Non-negotiable, critical requirements without which the product is non-functional or unsafe
  • bug: Something isn't working
  • performance: Performance related items
  • python: Python / backend development (FastAPI)
