[BUG]: anyio cancel scope spin loop causes 100% CPU after load test stops #2360
Closed
Labels
MUST (P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe) · bug (Something isn't working) · performance (Performance related items) · python (Python / backend development (FastAPI))
Description
Summary
When MCP client transports (sse_client or streamablehttp_client) are closed while tasks are blocked on HTTP streaming reads, anyio's cancel scope enters a CPU spin loop consuming 100% CPU per affected worker. This occurs because _deliver_cancellation repeatedly reschedules itself when tasks cannot acknowledge cancellation.
Environment
- MCP SDK: 1.25.0
- anyio: 4.10.0
- httpx: 0.28.1
- httpx-sse: 0.4.0
- Python: 3.12.12
- ASGI Server: Gunicorn/Granian with Uvicorn workers
Reproduction Steps
- Deploy MCP Gateway with multiple workers (e.g., 16 workers × 3 replicas)
- Run high-concurrency load test (4000+ virtual users) using Streamable HTTP transport
- Stop the load test abruptly (clients disconnect without clean shutdown)
- Observe CPU usage: all gateway workers spike to ~50% CPU each (~800% per container)
- CPU remains pinned indefinitely until workers are restarted
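The abrupt-disconnect condition in steps 3-5 can be emulated in-process without a full load test. The sketch below is a simplified, hypothetical stand-in using plain asyncio (the real blocked readers are httpx-sse iterators inside anyio task groups): a reader task that does not promptly honor cancellation leaves its cancellation pending, matching the condition described under Root Cause.

```python
import asyncio
import time

async def stubborn_reader():
    # Hypothetical stand-in for a reader blocked in aiter_lines():
    # it swallows the first cancellation and keeps "cleaning up".
    try:
        await asyncio.sleep(3600)  # blocked on a read that never returns
    except asyncio.CancelledError:
        await asyncio.sleep(0.3)   # cancellation not acknowledged promptly

async def main() -> float:
    task = asyncio.create_task(stubborn_reader())
    await asyncio.sleep(0.05)      # let the reader block
    start = time.monotonic()
    task.cancel()                  # the client "disconnects"
    try:
        await task
    except asyncio.CancelledError:
        pass
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(elapsed >= 0.25)  # True: cancellation took ~0.3 s to take effect
```

During that window, anyio in the real system keeps re-delivering the cancellation on every event-loop iteration, which is the spin this issue describes.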
Root Cause
Both sse_client and streamablehttp_client in the MCP SDK use anyio task groups:
```python
async with anyio.create_task_group() as tg:
    try:
        yield read_stream, write_stream
    finally:
        tg.cancel_scope.cancel()  # <-- Triggers the spin
```

When SSE/HTTP readers are blocked on aiter_sse() → aiter_lines() and can't acknowledge cancellation, anyio's _deliver_cancellation() enters a tight loop:
```python
# anyio/_backends/_asyncio.py
def _deliver_cancellation(self, origin):
    should_retry = False
    for task in self._tasks:
        should_retry = True  # Set for EVERY task
        task.cancel(...)
    if should_retry:
        get_running_loop().call_soon(self._deliver_cancellation, origin)  # SPIN
```

Evidence
py-spy profiling shows MainThread stuck in _deliver_cancellation:
```
Thread 15 (active+gil): "MainThread"
    _deliver_cancellation (anyio/_backends/_asyncio.py:569)
    run (asyncio/runners.py:118)
    run (uvicorn/workers.py:104)
```
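The self-rescheduling at the heart of _deliver_cancellation can be isolated in a plain-asyncio sketch (spin is a hypothetical helper, not anyio's code): a callback that re-arms itself via call_soon runs once per event-loop iteration, so the loop polls with a zero timeout instead of sleeping, which is exactly the 100% CPU symptom.

```python
import asyncio

def spin(loop, counter, deadline):
    # Mimics _deliver_cancellation's retry: do a little work, then
    # immediately reschedule this same callback on the event loop.
    counter[0] += 1
    if loop.time() < deadline:
        loop.call_soon(spin, loop, counter, deadline)

async def main() -> int:
    loop = asyncio.get_running_loop()
    counter = [0]
    loop.call_soon(spin, loop, counter, loop.time() + 0.1)
    # While spin() keeps re-arming itself, the loop never goes idle.
    await asyncio.sleep(0.2)
    return counter[0]

iterations = asyncio.run(main())
print(iterations > 100)  # True: thousands of iterations in 0.1 s is typical
```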
Upstream Issues
- anyio#695: 100% CPU load after cancel (March 2024, still open)
- claude-agent-sdk-python#378: Same pattern, 66% CPU in _deliver_cancellation
Proposed Fix
Add a move_on_after timeout in mcp_session_pool.py:_close_session():

```python
async def _close_session(self, pooled: PooledSession) -> None:
    if pooled.is_closed:
        return
    pooled.mark_closed()
    with anyio.move_on_after(5.0):
        try:
            await pooled.session.__aexit__(None, None, None)
        except Exception as e:
            logger.debug(f"Error closing session: {e}")
    with anyio.move_on_after(5.0):
        try:
            await pooled.transport_context.__aexit__(None, None, None)
        except Exception as e:
            logger.debug(f"Error closing transport: {e}")
```

Current Workaround
Worker recycling limits spin duration but doesn't prevent it:
```
GUNICORN_MAX_REQUESTS=100000
GUNICORN_MAX_REQUESTS_JITTER=10000
```

Upstream Reporting Plan
- Report to MCP SDK: Suggest adding move_on_after to transport cleanup
- Comment on anyio#695 with this reproduction case
- Consider httpx-sse improvement to allow cancellation of aiter_lines()
Related
- Detailed analysis: todo/mcp-sdk-issue.md
- Part of [PERFORMANCE]: PR #2211 causes FOR UPDATE lock contention and CPU spin loop under high load #2355 (high-load performance degradation)