fix-2360: prevent CPU spin loop after SSE client disconnect #2506

Merged
crivetimihai merged 30 commits into main from asyncio-cpu-spin-fix-2360
Jan 27, 2026

Conversation

@crivetimihai
Member

@crivetimihai crivetimihai commented Jan 26, 2026

Summary

Fixes the CPU spin loop issue where Gunicorn workers consume 100%+ CPU when idle after load tests stop. The root cause was fire-and-forget asyncio.create_task() patterns leaving orphaned tasks in anyio's _deliver_cancellation spin loop.

Key fixes:

  • Track respond tasks in _respond_tasks dict for proper lifecycle management
  • Cancel respond tasks before removing sessions (prevents orphaning)
  • Escalation path: cancel → timeout → disconnect transport → retry cancel → move to stuck_tasks
  • Add cleanup on SSE response creation failure (prevents orphaned tasks)
  • Add stuck task reaper (30s interval) to clean completed tasks and retry cancellation
  • Redis respond loop now uses timeout-based polling with session existence check
  • SSE generator's finally block now also invokes disconnect callback
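The last item - an idempotent disconnect callback driven from the generator's finally block - can be sketched roughly as follows (illustrative names and signatures, not the actual implementation):

```python
import asyncio

def make_disconnect_callback(session_id, registry):
    """Build an idempotent disconnect callback (illustrative).

    Safe to invoke from both on_client_close and the generator's
    finally block; only the first call does any work.
    """
    called = False

    async def on_disconnect():
        nonlocal called
        if called:
            return  # already cleaned up; later calls are no-ops
        called = True
        await registry.remove_session(session_id)

    return on_disconnect

async def event_generator(queue, on_disconnect):
    """SSE event generator sketch: the finally block fires cleanup on
    CancelledError, GeneratorExit, exceptions, and normal completion."""
    try:
        while True:
            event = await queue.get()
            if event is None:  # sentinel: normal completion
                return
            yield event
    finally:
        await on_disconnect()
```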

Load testing:

  • Updated make load-test-spin-detector to be full-featured (JWT auth, all user classes, 4000-user baseline)
  • Spike/drop pattern stress tests session cleanup

Closes #2360 - [BUG]: anyio cancel scope spin loop causes 100% CPU after load test stops
Closes #2357 - [BUG]: (sse): Granian CPU spikes to 800% after load stops, recovers when load resumes

Test plan

  • Unit tests for task tracking/cancellation (9 new tests)
  • Test for escalation path with fake transport verifying disconnect() is called
  • Tests for stuck task reaper lifecycle
  • All existing tests pass
  • Run make load-test-spin-detector to verify CPU returns to idle during pause phases

Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned
tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker.

Changes:
- Add _respond_tasks dict to track respond tasks by session_id
- Cancel respond tasks explicitly before session cleanup in remove_session()
- Cancel all respond tasks during shutdown()
- Pass disconnect callback to SSE transport for defensive cleanup
- Convert database backend from fire-and-forget to structured concurrency

The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect,
and awaited to completion, preventing orphaned tasks from spinning the event loop.
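A minimal sketch of that lifecycle (the `_respond_tasks` name follows the commit message; the registry class itself is a stand-in, not the real implementation):

```python
import asyncio
from typing import Dict

class SessionRegistry:
    """Illustrative sketch: track respond tasks so they can be cancelled."""

    def __init__(self) -> None:
        self._respond_tasks: Dict[str, asyncio.Task] = {}

    def add_session(self, session_id: str, coro) -> asyncio.Task:
        # Track instead of fire-and-forget asyncio.create_task(coro)
        task = asyncio.create_task(coro)
        self._respond_tasks[session_id] = task
        return task

    async def remove_session(self, session_id: str) -> None:
        # Cancel the respond task BEFORE dropping the session,
        # so it can never be left orphaned in a cancel scope.
        task = self._respond_tasks.pop(session_id, None)
        if task is not None and not task.done():
            task.cancel()
            try:
                await asyncio.wait_for(task, timeout=5.0)
            except (asyncio.CancelledError, asyncio.TimeoutError):
                pass  # best-effort: the task acknowledged or timed out
```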

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Follow-up fixes based on testing and review:

1. Cancellation timeout escalation (Finding 1):
   - _cancel_respond_task() now escalates on timeout by calling transport.disconnect()
   - Retries cancellation after escalation
   - Always removes task from tracking to prevent buildup

2. Redis respond loop exit path (Finding 2):
   - Changed from infinite pubsub.listen() to timeout-based get_message() polling
   - Added session existence check - loop exits if session removed
   - Allows loop to exit even without cancellation

3. Generator finally block cleanup (Finding 3):
   - Added on_disconnect_callback() in event_generator() finally block
   - Covers: CancelledError, GeneratorExit, exceptions, and normal completion
   - Idempotent - safe if callback already ran from on_client_close

4. Added load-test-spin-detector make target:
   - Spike/drop pattern to stress test session cleanup
   - Docker stats monitoring at each phase
   - Color-coded output with pass/fail indicators
   - Log file output to /tmp
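The escalation in (1) might look roughly like this - a sketch using a stand-in transport object; the real `_cancel_respond_task()` internals are assumptions:

```python
import asyncio

async def cancel_with_escalation(task, transport, timeout=5.0):
    """Sketch: cancel -> wait -> on timeout, disconnect transport -> retry.

    Returns True if the task finished; False means it is still stuck and
    the caller should track it separately instead of silently dropping it.
    """
    for attempt in (1, 2):
        task.cancel()
        done, _ = await asyncio.wait({task}, timeout=timeout)
        if task in done:
            return True
        if attempt == 1:
            # Escalate: tearing down the transport unblocks reads that
            # were keeping the task from observing the CancelledError.
            await transport.disconnect()
    return False
```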

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task
was created AFTER create_sse_response(). If client disconnected during
response setup, the disconnect callback ran before the task existed,
leaving it orphaned. Now matches utility_sse_endpoint ordering:
1. Compute user_with_token
2. Create and register respond task
3. Call create_sse_response()

Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't
be cancelled after escalation. Previously these were dropped from tracking
entirely, losing visibility. Now they're moved to _stuck_tasks for
monitoring and final cleanup during shutdown().

Updated tests to verify escalation behavior.

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response()
fails. Added try/except around create_sse_response() in both sse_endpoint
and utility_sse_endpoint - on failure, calls remove_session() to clean up
the task and session before re-raising.

Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to:
- Remove completed tasks from _stuck_tasks
- Retry cancellation for still-stuck tasks
- Prevent memory leaks from tasks that eventually complete
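Such a reaper can be sketched as a simple periodic loop (illustrative structure; interval and behavior follow the commit message):

```python
import asyncio

async def stuck_task_reaper(stuck_tasks, interval=30.0):
    """Illustrative reaper loop: every `interval` seconds, drop tasks that
    have completed since being marked stuck, and retry cancellation on
    the rest so they cannot leak for the life of the worker."""
    while True:
        await asyncio.sleep(interval)
        for session_id, task in list(stuck_tasks.items()):
            if task.done():
                stuck_tasks.pop(session_id, None)  # finished: forget it
            else:
                task.cancel()                      # retry cancellation
```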

Finding 3 (LOW): Added test for escalation path with fake transport to
verify transport.disconnect() is called during escalation. Also added
tests for the stuck task reaper lifecycle.

Also updated load-test-spin-detector to be a full-featured test matching
load-test-ui with JWT auth, all user classes, entity ID fetching, and
the same 4000-user baseline.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Reduce logging level to WARNING to suppress noisy worker messages
- Only run entity fetching and cleanup on master/standalone nodes
- Reduce cycle sizes from 4000 to 1000 peak users for faster iteration
- Update banner to reflect new cycle pattern (500 -> 750 -> 1000)
- Remove verbose JWT token generation log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE
endpoints. In Python 3.8+, CancelledError inherits from BaseException,
not Exception, so the previous except block wouldn't catch it. Now
cleanup runs even when requests are cancelled during SSE handshake.

Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None
to prevent tight loop. The loop now has guaranteed minimum sleep even
when Redis returns immediately in certain states.

Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops
to exit early. remove_session() now marks the session as closing BEFORE
attempting task cancellation, so the respond loop (Redis and DB backends)
can exit immediately without waiting for the full cancellation timeout.
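A sketch of that ordering (flag first, then cancel), with a hypothetical `get_message` poll standing in for the Redis/DB respond loops:

```python
import asyncio

class Registry:
    """Sketch: mark a session as closing BEFORE cancelling its task,
    so the respond loop can observe the flag and exit on its own."""

    def __init__(self):
        self._closing_sessions = set()

    async def respond_loop(self, session_id, get_message):
        # Poll with a timeout; check the closing flag on every pass.
        while session_id not in self._closing_sessions:
            try:
                message = await asyncio.wait_for(get_message(), timeout=1.0)
            except asyncio.TimeoutError:
                continue
            if message is None:
                await asyncio.sleep(0.1)  # avoid a tight loop

    async def remove_session(self, session_id, task):
        self._closing_sessions.add(session_id)  # flag first
        task.cancel()                           # then cancel
        try:
            await task
        except asyncio.CancelledError:
            pass
        self._closing_sessions.discard(session_id)
```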

Finding 4 (LOW): Already addressed in previous commit with test
test_cancel_respond_task_escalation_calls_transport_disconnect.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Cycles now repeat indefinitely instead of stopping after 5
- Fixed log file path to /tmp/spin_detector.log for easy monitoring
- Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts
- Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)")
- Banner shows monitoring command: tail -f /tmp/spin_detector.log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
CancelledError inherits from BaseException in Python 3.8+, so it won't
be caught by 'except Exception' handlers. The explicit handlers were
unnecessary and triggered pylint W0706 (try-except-raise).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…dlers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai added the performance (Performance related items) and revisit (Revisit this PR at a later date to address further issues, or if problems arise) labels Jan 26, 2026
@crivetimihai crivetimihai added this to the Release 1.0.0-RC1 milestone Jan 26, 2026
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
The blocking `async for message in pubsub.listen()` pattern doesn't
respond to asyncio cancellation properly. When anyio's cancel scope
tries to cancel tasks using this pattern, the tasks don't respond
because the async iterator is blocked waiting for Redis messages.

This causes anyio's `_deliver_cancellation` to continuously reschedule
itself with `call_soon()`, creating a CPU spin loop that consumes
100% CPU per affected worker.

Changed to timeout-based polling pattern:
- Use `get_message(timeout=1.0)` with `asyncio.wait_for()`
- Loop allows cancellation check every ~1 second
- Added sleep on None/non-message responses to prevent edge case spins

Files fixed:
- mcpgateway/services/cancellation_service.py
- mcpgateway/services/event_service.py

Closes #2360 (partial - additional spin sources may exist)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

Update: Fixed blocking pubsub.listen() pattern

Root Cause Identified

The CPU spin loop is caused by anyio's _deliver_cancellation function continuously rescheduling itself when tasks don't respond to cancellation.

When a cancel scope is triggered (e.g., SSE client disconnects):

  1. anyio calls task.cancel() for each task in the scope
  2. If tasks don't respond (stuck in blocking operations), should_retry stays True
  3. It reschedules itself with call_soon() immediately
  4. This creates a tight CPU spin loop

Problematic Pattern

async for message in pubsub.listen():  # Blocks indefinitely

This async iterator blocks waiting for Redis and doesn't respond to CancelledError until the next message arrives.

Fix Applied

Changed to timeout-based polling in:

  • mcpgateway/services/cancellation_service.py
  • mcpgateway/services/event_service.py
while True:
    try:
        message = await asyncio.wait_for(
            pubsub.get_message(timeout=1.0), timeout=1.5)
    except asyncio.TimeoutError:
        continue  # Loop back, allowing cancellation check
    if message is None:
        await asyncio.sleep(0.1)  # Guard against tight-loop edge cases

Status: Partial Fix

Testing shows this helped delay the spin loop but containers are still hitting ~800% CPU eventually. There are likely additional spin sources to investigate:

  • sse_starlette task group cancellation behavior
  • Other blocking async operations
  • Possible issues in the MCP client HTTP streaming

Continuing investigation...

@crivetimihai
Member Author

Investigation Update: Remaining Spin Sources

Current Status

The pubsub.listen() fix helped but containers still reach ~800% CPU after load tests complete.

Evidence

  • Load test finished (no active nginx traffic in last 10 seconds)
  • Locust has completed but containers still spinning
  • All 24 workers per container at ~17% CPU each = ~800% total
  • Each worker has 1 running thread (main event loop) and ~45 sleeping threads
  • No new logs being generated (silent CPU spin)

Root Cause Pattern

anyio's _deliver_cancellation() keeps rescheduling itself with call_soon() when:

  1. A cancel scope is triggered (e.g., SSE client disconnects)
  2. Tasks in the scope don't respond to cancellation (stuck in blocking operations)
  3. should_retry stays True → reschedules immediately → CPU spin

Remaining Investigation Areas

  1. sse_starlette task groups - When SSE connections close, sse_starlette's task group cancellation may have tasks that don't respond

  2. MCP client HTTP streaming - response.aiter_lines() and similar patterns in mcpgateway/translate.py may block on HTTP connections

  3. Other async iterators - Multiple async for patterns in codebase that may block on external I/O:

    • async for line in response.aiter_lines() (translate.py, llm_proxy_service.py)
    • async for event in self._agent.astream_events() (mcp_client_chat_service.py)
    • async for key in redis.scan_iter() (auth_cache.py, registry_cache.py)

Next Steps

  1. Add timeout-based polling to remaining blocking patterns
  2. Check if MCP client connections are properly cleaned up after session ends
  3. Consider adding explicit cancellation checks in long-running async generators

The MCP session/transport __aexit__ methods can block indefinitely when
internal tasks don't respond to cancellation. This causes anyio's
_deliver_cancellation to spin in a tight loop, consuming ~800% CPU.

Root cause: When calling session.__aexit__() or transport.__aexit__(),
they attempt to cancel internal tasks (like post_writer waiting on
memory streams). If these tasks don't respond to CancelledError, anyio's
cancel scope keeps calling call_soon() to reschedule _deliver_cancellation,
creating a CPU spin loop.

Changes:
- Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py
- Wrap all __aexit__ calls in asyncio.wait_for() with timeout
- Add timeout to pubsub cleanup in session_registry.py and registry_cache.py
- Add timeout to streamable HTTP context cleanup in translate.py

This is a continuation of the fix for issue #2360.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

Additional Fix: Timeouts for Session/Transport Cleanup

py-spy profiling after the previous fix still showed CPU spin at ~800% with _deliver_cancellation in the call stack. The root cause was identified in MCP session cleanup operations.

Root Cause

When closing MCP sessions, session.__aexit__() and transport.__aexit__() try to cancel internal tasks. If these tasks (like post_writer waiting on anyio.streams.memory.receive()) don't respond to CancelledError, anyio's cancel scope keeps calling call_soon() to reschedule _deliver_cancellation, creating a CPU spin loop.

Stack Trace Pattern (from py-spy)

limited_check (gateway_service.py:3349)
  → _check_single_gateway_health (3571)
    → session (mcp_session_pool.py:1005)
      → acquire (519)
        → _close_session (902)
          → session.__aexit__
            → anyio _deliver_cancellation (spin loop)

Fix

Added asyncio.wait_for() with 5-second timeout to all __aexit__ calls:

  1. mcp_session_pool.py: _close_session() and _create_session() cleanup
  2. session_registry.py: Redis pubsub cleanup
  3. registry_cache.py: Redis pubsub cleanup
  4. translate.py: Streamable HTTP context cleanup

If cleanup times out, we log a warning and proceed anyway (best-effort cleanup). This prevents indefinite blocking and the resulting CPU spin.
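The best-effort pattern can be sketched as follows (the helper name and constant are illustrative; the PR later makes the timeout configurable as MCP_SESSION_POOL_CLEANUP_TIMEOUT):

```python
import asyncio
import logging

logger = logging.getLogger(__name__)
SESSION_CLEANUP_TIMEOUT = 5.0  # seconds

async def close_with_timeout(cm, timeout=SESSION_CLEANUP_TIMEOUT):
    """Best-effort __aexit__ (sketch): log and proceed if cleanup hangs,
    instead of letting cancel-scope retries spin the event loop."""
    try:
        await asyncio.wait_for(cm.__aexit__(None, None, None), timeout)
    except asyncio.TimeoutError:
        logger.warning("cleanup timed out after %.1fs; proceeding", timeout)
    except Exception as exc:
        logger.warning("cleanup failed: %s", exc)
```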

Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to
control how long cleanup operations wait for session/transport __aexit__
calls to complete.

Clarification: This timeout does NOT affect tool execution time (which
uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions
to prevent CPU spin loops when internal tasks don't respond to cancel.

Changes:
- Add mcp_session_pool_cleanup_timeout to config.py
- Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs
- Add to charts/mcp-stack/values.yaml
- Update mcp_session_pool.py to use _get_cleanup_timeout() helper
- Update session_registry.py and registry_cache.py to use config
- Update translate.py to use config with fallback

When to adjust:
- Increase if you see frequent "cleanup timed out" warnings in logs
- Decrease for faster shutdown (at risk of resource leaks)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

Configuration: Cleanup Timeout Now Configurable

Made SESSION_CLEANUP_TIMEOUT configurable via MCP_SESSION_POOL_CLEANUP_TIMEOUT (default: 5.0 seconds).

Clarification on Implications

| Timeout | What it controls | Default |
|---|---|---|
| `TOOL_TIMEOUT` | How long a tool call can run | 60s |
| `MCP_SESSION_POOL_CLEANUP_TIMEOUT` | How long to wait for `__aexit__` cleanup | 5s |

The cleanup timeout does NOT affect tool execution. It only applies to:

  • Session eviction (stale/invalid sessions)
  • Session release after use (returning to pool)
  • Pool shutdown (graceful close)
  • Failed session creation cleanup

Configuration

# .env
MCP_SESSION_POOL_CLEANUP_TIMEOUT=5.0
# Helm values.yaml
mcpContextForge:
  config:
    MCP_SESSION_POOL_CLEANUP_TIMEOUT: "5.0"

When to Adjust

  • Increase (e.g., 10.0): If you see frequent "cleanup timed out" warnings in logs
  • Decrease (e.g., 2.0): For faster shutdown, accepting risk of resource leaks

Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at
100% CPU when SSE task group tasks don't respond to cancellation.

Root cause: When an SSE connection ends, sse_starlette's task group
tries to cancel all tasks. If a task (like _listen_for_disconnect
waiting on receive()) doesn't respond to cancellation, anyio's
_deliver_cancellation keeps rescheduling itself in a tight loop.

Fix: Override EventSourceResponse.__call__ to set a deadline on the
cancel scope when cancellation starts. This ensures that if tasks
don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds),
the scope times out instead of spinning indefinitely.

References:
- agronholm/anyio#695
- anthropics/claude-agent-sdk-python#378

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

crivetimihai commented Jan 26, 2026

Fix: SSE Cancel Scope Deadline

Found the root cause via web search - this is a known issue:

Similar issues exist in other Python projects:

Problem

When an SSE connection ends, sse_starlette's task group tries to cancel all tasks. If a task (like _listen_for_disconnect waiting on receive()) doesn't respond to cancellation, anyio's _deliver_cancellation keeps calling call_soon() to reschedule itself, causing a CPU spin loop.

Fix

Created a patched EventSourceResponse that overrides __call__ to set a deadline on the cancel scope when cancellation starts:

async def cancel_on_finish(coro):
    await coro()
    # Set deadline to prevent indefinite spin if tasks don't respond
    task_group.cancel_scope.deadline = anyio.current_time() + SSE_TASK_GROUP_CLEANUP_TIMEOUT
    task_group.cancel_scope.cancel()

This ensures that if tasks don't respond within 5 seconds, the scope times out and exits cleanly instead of spinning.

Configuration

SSE_TASK_GROUP_CLEANUP_TIMEOUT = 5.0 (hardcoded for now, can be made configurable if needed)

Testing Needed

Rebuild containers and verify the spin no longer occurs after load tests.

translate.py was importing EventSourceResponse directly from sse_starlette,
bypassing the patched version in sse_transport.py that prevents the anyio
_deliver_cancellation CPU spin loop (anyio#695).

This change ensures all SSE connections in the translate module (stdio-to-SSE
bridge) also benefit from the cancel scope deadline fix.

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
With many concurrent connections (691 TCP sockets observed), each cancelled
SSE task group spinning for up to 5 seconds caused sustained high CPU usage.
Reducing the timeout to 0.5s minimizes CPU waste during spin loops while
still allowing normal cleanup to complete.

The cleanup timeout only affects cleanup of cancelled/released connections,
not normal operation or tool execution time.

Changes:
- SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds
- mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds
- Updated .env.example and charts/mcp-stack/values.yaml

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…faults

- Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s)
- Make sse_transport.py read timeout from config via lazy loader
- Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default
- Override both to 0.5s in docker-compose.yml for testing

The 5.0s default is safe for production. The 0.5s override in
docker-compose.yml allows testing aggressive cleanup to verify
it doesn't affect normal operation.

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
The MCP SDK's internal anyio task groups don't respond to cancellation
properly, causing CPU spin loops in _deliver_cancellation. This spin
happens inside the MCP SDK (streamablehttp_client, sse_client) which
we cannot patch.

Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are
recycled frequently, cleaning up any accumulated stuck task groups.

Root cause chain observed:
1. PostgreSQL idle transaction timeout
2. Gateway state change failures
3. SSE connections terminated
4. MCP SDK task groups spin (anyio#695)

This is a workaround until the MCP SDK properly handles cancellation.

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Root cause: anyio's _deliver_cancellation has no iteration limit.
When tasks don't respond to CancelledError, it schedules call_soon()
callbacks indefinitely, causing 100% CPU spin (anyio#695).

Solution:
- Monkey-patch CancelScope._deliver_cancellation to track iterations
- Give up after 100 iterations and log warning
- Clear _cancel_handle to stop further call_soon() callbacks

Also switched from asyncio.wait_for() to anyio.move_on_after() for
MCP session cleanup, which better propagates cancellation through
anyio's cancel scope system.

Trade-off: If cancellation gives up after 100 iterations, some tasks
may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000
worker recycling will eventually clean up orphaned tasks.

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…ed by default

The anyio monkey-patch is now feature-flagged and disabled by default:
- ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default)
- ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100

This allows testing performance with and without the patch, and easy
rollback if upstream anyio/MCP SDK fixes the issue.

Added:
- Config settings for enabling/disabling the patch
- apply_anyio_cancel_delivery_patch() function for explicit control
- remove_anyio_cancel_delivery_patch() to restore original behavior
- Documentation in .env.example and docker-compose.yml

To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
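The apply/remove mechanics can be sketched generically; note the class below is a stand-in, not anyio's real CancelScope (whose private internals vary by version), so this only illustrates the iteration-limit-with-rollback pattern:

```python
_original_deliver = None

class Deliverer:
    """Stand-in for the patched class (not anyio's actual CancelScope)."""
    def deliver(self):
        return "retry"  # the real method reschedules itself via call_soon()

def apply_patch(max_iterations: int = 100) -> None:
    """Wrap deliver() so it gives up after max_iterations calls per object."""
    global _original_deliver
    if _original_deliver is not None:
        return  # already applied (idempotent)
    _original_deliver = Deliverer.deliver
    counts: dict = {}

    def limited_deliver(self):
        counts[id(self)] = counts.get(id(self), 0) + 1
        if counts[id(self)] > max_iterations:
            return "gave-up"  # stop rescheduling; log a warning here
        return _original_deliver(self)

    Deliverer.deliver = limited_deliver

def remove_patch() -> None:
    """Restore the original method (the rollback path)."""
    global _original_deliver
    if _original_deliver is not None:
        Deliverer.deliver = _original_deliver
        _original_deliver = None
```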
Add multi-layered documentation for CPU spin loop mitigation settings
across all configuration files. This ensures operators understand and
can tune the workarounds for anyio#695.

Changes:
- .env.example: Add Layer 1/2/3 headers with cross-references to docs
  and issue #2360, document all 6 mitigation variables
- README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers,
  configuration tables, and tuning tips
- docker-compose.yml: Consolidate all mitigation variables into one
  section with SSE protection (Layer 1), cleanup timeouts (Layer 2),
  and experimental anyio patch (Layer 3)
- charts/mcp-stack/values.yaml: Add comprehensive mitigation section
  with layer documentation and cross-references
- docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide
  with root cause analysis, 4-layer defense diagram, configuration
  tables, diagnostic commands, and tuning recommendations
- docs/docs/.pages: Add Operations section to navigation
- docs/docs/operations/.pages: Add nav for operations docs

Mitigation variables documented:
- Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX
- Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT
- Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS

Related: #2360, anyio#695, claude-agent-sdk#378
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the asyncio-cpu-spin-fix-2360 branch from 698891c to c08bb95 on January 26, 2026 at 20:43
Update spin detector load test for faster issue reproduction:
- Increase user counts: 4000 → 4000 → 10000 pattern
- Fast spawn rate: 1000 users/s
- Shorter wait times: 0.01-0.1s between requests
- Reduced connection timeouts: 5s (fail fast)

Related: #2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai crivetimihai force-pushed the asyncio-cpu-spin-fix-2360 branch from f4b566f to 1226dd5 on January 26, 2026 at 21:33
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

PR Summary: CPU Spin Loop Mitigation (Issue #2360)

Problem

After load tests or sustained traffic, Gunicorn workers enter 100% CPU spin loops while appearing idle. Root cause: anyio's _deliver_cancellation enters an infinite loop when MCP SDK tasks don't respond to CancelledError (anyio#695).

Before fix:

  • CPU: ~800% per gateway (24 workers × ~33% each) when idle after load test
  • 37.5% of workers (9/24) stuck in _deliver_cancellation spin
  • 8,000+ epoll_pwait syscalls/second per spinning worker

After fix:

  • CPU: ~2.5% per gateway when idle
  • Zero spinning workers
  • Clean recovery after load tests

Changes Overview

| File | Changes |
|---|---|
| docker-compose.yml | Added 3-layer mitigation config with documented env vars |
| mcpgateway/config.py | +41 lines: new config options for timeouts and patch |
| mcpgateway/transports/sse_transport.py | +249 lines: anyio monkey-patch implementation |
| mcpgateway/services/mcp_session_pool.py | +77 lines: cleanup timeout support |
| mcpgateway/cache/session_registry.py | +396 lines: task tracking and cleanup |
| tests/loadtest/locustfile_spin_detector.py | +1608 lines: new load test for spin detection |
| Makefile | +58 lines: `make load-test-spin-detector` target |
| docs/docs/operations/cpu-spin-loop-mitigation.md | +396 lines: comprehensive documentation |

New Configuration Options

Layer 1: SSE Connection Protection

- SSE_SEND_TIMEOUT=30.0              # ASGI send() timeout
- SSE_RAPID_YIELD_WINDOW_MS=1000     # Detection window
- SSE_RAPID_YIELD_MAX=50             # Max yields before disconnect

Layer 2: Cleanup Timeouts

- MCP_SESSION_POOL_CLEANUP_TIMEOUT=0.5   # Session __aexit__ timeout
- SSE_TASK_GROUP_CLEANUP_TIMEOUT=0.5     # SSE task group timeout

Layer 3: anyio Monkey-Patch (Experimental)

- ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true   # Enable workaround
- ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=500   # ~60ms recovery time

Layer 4: Worker Recycling

- GUNICORN_MAX_REQUESTS=1000000      # Recycle workers (reduced from 10M)
- GUNICORN_MAX_REQUESTS_JITTER=100000

How MAX_ITERATIONS Works

| Value | Recovery Time | Use Case |
|---|---|---|
| 100 | ~12ms | Aggressive: fastest spin recovery |
| 500 | ~60ms | Recommended: balanced |
| 1000 | ~121ms | Conservative |

Normal cancellations complete in 1-10 iterations. Values above 100 handle edge cases.


New Load Test: make load-test-spin-detector

Escalating spike/drop pattern specifically designed to trigger and detect CPU spin loops:

Wave 1:  4,000 users for 30s → 10s pause (check CPU)
Wave 2:  6,000 users for 45s → 15s pause
Wave 3:  8,000 users for 60s → 20s pause
Wave 4: 10,000 users for 75s → 30s pause
Wave 5: 10,000 users for 90s → 30s pause
→ Repeats forever (Ctrl+C to stop)

Pass criteria: CPU drops to <10% during pause phases


Testing Results

| Scenario | CPU After Load | Spinning Workers |
|---|---|---|
| Without mitigation | ~800% | 9/24 (37.5%) |
| With mitigation | ~2.5% | 0/24 (0%) |

Tested with:

  • 4,000-10,000 concurrent users
  • 3,000+ RPS sustained
  • Multiple spike/drop cycles

Documentation

New operations guide: docs/docs/operations/cpu-spin-loop-mitigation.md

  • Problem description and root cause analysis
  • Layer-by-layer mitigation explanation
  • Configuration reference
  • Troubleshooting guide
  • Verification steps

Related Issues

  • #2360 - [BUG]: anyio cancel scope spin loop causes 100% CPU after load test stops
  • #2357 - [BUG]: (sse): Granian CPU spikes to 800% after load stops, recovers when load resumes

Add docstring to nested cancel_on_finish function in
EventSourceResponse.__call__ to achieve 100% interrogate coverage.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Collaborator

@jonpspri jonpspri left a comment


Let's be clear. I hate that we've gotten to this. But the patch is as clean as it can get without introducing a new object to encapsulate the process. Let's be sure we're monitoring the upstream so we can rip this out when it's corrected.

@crivetimihai crivetimihai merged commit 8257613 into main Jan 27, 2026
53 checks passed
@crivetimihai crivetimihai deleted the asyncio-cpu-spin-fix-2360 branch January 27, 2026 12:52
hughhennelly pushed a commit to hughhennelly/mcp-context-forge that referenced this pull request Feb 8, 2026
* fix-2360: prevent asyncio CPU spin loop after SSE client disconnect

Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned
tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker.

Changes:
- Add _respond_tasks dict to track respond tasks by session_id
- Cancel respond tasks explicitly before session cleanup in remove_session()
- Cancel all respond tasks during shutdown()
- Pass disconnect callback to SSE transport for defensive cleanup
- Convert database backend from fire-and-forget to structured concurrency

The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect,
and awaited to completion, preventing orphaned tasks from spinning the event loop.

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: additional fixes for CPU spin loop after SSE disconnect

Follow-up fixes based on testing and review:

1. Cancellation timeout escalation (Finding 1):
   - _cancel_respond_task() now escalates on timeout by calling transport.disconnect()
   - Retries cancellation after escalation
   - Always removes task from tracking to prevent buildup

2. Redis respond loop exit path (Finding 2):
   - Changed from infinite pubsub.listen() to timeout-based get_message() polling
   - Added session existence check - loop exits if session removed
   - Allows loop to exit even without cancellation

3. Generator finally block cleanup (Finding 3):
   - Added on_disconnect_callback() in event_generator() finally block
   - Covers: CancelledError, GeneratorExit, exceptions, and normal completion
   - Idempotent - safe if callback already ran from on_client_close

4. Added load-test-spin-detector make target:
   - Spike/drop pattern to stress test session cleanup
   - Docker stats monitoring at each phase
   - Color-coded output with pass/fail indicators
   - Log file output to /tmp

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: fix race condition in sse_endpoint and add stuck task tracking

Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task
was created AFTER create_sse_response(). If client disconnected during
response setup, the disconnect callback ran before the task existed,
leaving it orphaned. Now matches utility_sse_endpoint ordering:
1. Compute user_with_token
2. Create and register respond task
3. Call create_sse_response()

Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't
be cancelled after escalation. Previously these were dropped from tracking
entirely, losing visibility. Now they're moved to _stuck_tasks for
monitoring and final cleanup during shutdown().

Updated tests to verify escalation behavior.

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add SSE failure cleanup, stuck task reaper, and full load test

Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response()
fails. Added try/except around create_sse_response() in both sse_endpoint
and utility_sse_endpoint - on failure, calls remove_session() to clean up
the task and session before re-raising.

Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to:
- Remove completed tasks from _stuck_tasks
- Retry cancellation for still-stuck tasks
- Prevent memory leaks from tasks that eventually complete

Finding 3 (LOW): Added test for escalation path with fake transport to
verify transport.disconnect() is called during escalation. Also added
tests for the stuck task reaper lifecycle.

Also updated load-test-spin-detector to be a full-featured test matching
load-test-ui with JWT auth, all user classes, entity ID fetching, and
the same 4000-user baseline.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: improve load-test-spin-detector output and reduce cycle sizes

- Reduce logging level to WARNING to suppress noisy worker messages
- Only run entity fetching and cleanup on master/standalone nodes
- Reduce cycle sizes from 4000 to 1000 peak users for faster iteration
- Update banner to reflect new cycle pattern (500 -> 750 -> 1000)
- Remove verbose JWT token generation log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: address remaining CPU spin loop findings

Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE
endpoints. In Python 3.8+, CancelledError inherits from BaseException,
not Exception, so the previous except block wouldn't catch it. Now
cleanup runs even when requests are cancelled during SSE handshake.

Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None
to prevent tight loop. The loop now has guaranteed minimum sleep even
when Redis returns immediately in certain states.

Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops
to exit early. remove_session() now marks the session as closing BEFORE
attempting task cancellation, so the respond loop (Redis and DB backends)
can exit immediately without waiting for the full cancellation timeout.

Finding 4 (LOW): Already addressed in previous commit with test
test_cancel_respond_task_escalation_calls_transport_disconnect.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
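Finding 3's early-exit mechanism can be sketched like this (illustrative names; the real loop polls Redis or the DB rather than sleeping). Because the session is flagged as closing before any cancel attempt, the loop observes the flag at its next iteration and returns on its own, with no cancellation needed:

```python
import asyncio

# Minimal sketch: a _closing_sessions set lets the respond loop exit
# cooperatively instead of waiting out the cancellation timeout.
_closing_sessions: set[str] = set()


async def respond_loop(session_id: str, polls: list) -> None:
    while session_id not in _closing_sessions:
        polls.append(1)
        await asyncio.sleep(0.01)  # stand-in for one backend poll


async def main():
    polls: list = []
    task = asyncio.create_task(respond_loop("s1", polls))
    await asyncio.sleep(0.05)
    _closing_sessions.add("s1")  # mark closing BEFORE any cancel attempt
    await asyncio.wait_for(task, timeout=1.0)  # exits on its own
    return len(polls)


n = asyncio.run(main())
print(n > 0)  # → True
```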

* fix-2360: make load-test-spin-detector run unlimited cycles

- Cycles now repeat indefinitely instead of stopping after 5
- Fixed log file path to /tmp/spin_detector.log for easy monitoring
- Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts
- Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)")
- Banner shows monitoring command: tail -f /tmp/spin_detector.log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add asyncio.CancelledError to SSE endpoint Raises docs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: remove redundant asyncio.CancelledError handlers

CancelledError inherits from BaseException in Python 3.8+, so it won't
be caught by 'except Exception' handlers. The explicit handlers were
unnecessary and triggered pylint W0706 (try-except-raise).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: restore asyncio.CancelledError in Raises docs for inner handlers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add sleep on non-message Redis pubsub types to prevent spin

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(pubsub): replace blocking listen() with timeout-based get_message()

The blocking `async for message in pubsub.listen()` pattern doesn't
respond to asyncio cancellation properly. When anyio's cancel scope
tries to cancel tasks using this pattern, the tasks don't respond
because the async iterator is blocked waiting for Redis messages.

This causes anyio's `_deliver_cancellation` to continuously reschedule
itself with `call_soon()`, creating a CPU spin loop that consumes
100% CPU per affected worker.

Changed to timeout-based polling pattern:
- Use `get_message(timeout=1.0)` with `asyncio.wait_for()`
- Loop allows cancellation check every ~1 second
- Added sleep on None/non-message responses to prevent edge case spins

Files fixed:
- mcpgateway/services/cancellation_service.py
- mcpgateway/services/event_service.py

Closes IBM#2360 (partial - additional spin sources may exist)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
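The polling pattern described above can be sketched with a fake pubsub standing in for `redis.asyncio`'s `PubSub` (the `FakePubSub` class is illustrative; `get_message(timeout=...)` matches the real client's API shape). The key property is that the loop reaches a cancellation/exit check at least once per timeout interval instead of parking forever inside `listen()`:

```python
import asyncio

class FakePubSub:
    """Illustrative stand-in for redis.asyncio's PubSub."""

    def __init__(self):
        self.queue = asyncio.Queue()

    async def get_message(self, timeout=1.0):
        try:
            return await asyncio.wait_for(self.queue.get(), timeout)
        except asyncio.TimeoutError:
            return None  # mirrors redis: no message within the timeout


async def respond_loop(pubsub, handled: list, stop: asyncio.Event):
    while not stop.is_set():
        # Bounded wait instead of `async for ... in pubsub.listen()`.
        message = await pubsub.get_message(timeout=0.05)
        if message is None or message.get("type") != "message":
            await asyncio.sleep(0.01)  # guard against tight-loop edge cases
            continue
        handled.append(message["data"])


async def main():
    pubsub, handled, stop = FakePubSub(), [], asyncio.Event()
    task = asyncio.create_task(respond_loop(pubsub, handled, stop))
    await pubsub.queue.put({"type": "message", "data": "hello"})
    await asyncio.sleep(0.1)
    stop.set()  # the loop notices within one poll interval
    await asyncio.wait_for(task, timeout=1.0)
    return handled


handled = asyncio.run(main())
print(handled)  # → ['hello']
```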

* fix(cleanup): add timeouts to __aexit__ calls to prevent CPU spin loops

The MCP session/transport __aexit__ methods can block indefinitely when
internal tasks don't respond to cancellation. This causes anyio's
_deliver_cancellation to spin in a tight loop, consuming ~800% CPU.

Root cause: When calling session.__aexit__() or transport.__aexit__(),
they attempt to cancel internal tasks (like post_writer waiting on
memory streams). If these tasks don't respond to CancelledError, anyio's
cancel scope keeps calling call_soon() to reschedule _deliver_cancellation,
creating a CPU spin loop.

Changes:
- Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py
- Wrap all __aexit__ calls in asyncio.wait_for() with timeout
- Add timeout to pubsub cleanup in session_registry.py and registry_cache.py
- Add timeout to streamable HTTP context cleanup in translate.py

This is a continuation of the fix for issue IBM#2360.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
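The `asyncio.wait_for()` wrapping can be sketched as below. `StuckTransport` is a hypothetical stand-in whose `__aexit__` blocks indefinitely but does still respond to cancellation; note that if an inner task also ignored `CancelledError`, `wait_for` alone would not be a complete fix, which is why the later commits add further layers.

```python
import asyncio

SESSION_CLEANUP_TIMEOUT = 0.1  # the real constant defaults to 5 seconds


class StuckTransport:
    """Illustrative transport whose __aexit__ never returns on its own."""

    async def __aexit__(self, *exc_info):
        await asyncio.Event().wait()  # blocks until cancelled


async def close_transport(transport) -> bool:
    """Bound cleanup so a stuck __aexit__ cannot wedge the caller."""
    try:
        await asyncio.wait_for(
            transport.__aexit__(None, None, None), SESSION_CLEANUP_TIMEOUT
        )
        return True
    except asyncio.TimeoutError:
        # The real code logs a warning here and moves on.
        return False


timed_out = not asyncio.run(close_transport(StuckTransport()))
print(timed_out)  # → True
```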

* feat(config): make session cleanup timeout configurable

Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to
control how long cleanup operations wait for session/transport __aexit__
calls to complete.

Clarification: This timeout does NOT affect tool execution time (which
uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions
to prevent CPU spin loops when internal tasks don't respond to cancel.

Changes:
- Add mcp_session_pool_cleanup_timeout to config.py
- Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs
- Add to charts/mcp-stack/values.yaml
- Update mcp_session_pool.py to use _get_cleanup_timeout() helper
- Update session_registry.py and registry_cache.py to use config
- Update translate.py to use config with fallback

When to adjust:
- Increase if you see frequent "cleanup timed out" warnings in logs
- Decrease for faster shutdown (at risk of resource leaks)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
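In `.env` terms, the setting described above would look like this (the variable name and default are from the commit; the comment restates the clarification):

```
# Bounds cleanup of idle/released sessions only.
# Does NOT affect tool execution time (that is TOOL_TIMEOUT).
MCP_SESSION_POOL_CLEANUP_TIMEOUT=5.0
```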

* fix(sse): add deadline to cancel scope to prevent CPU spin loop

Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at
100% CPU when SSE task group tasks don't respond to cancellation.

Root cause: When an SSE connection ends, sse_starlette's task group
tries to cancel all tasks. If a task (like _listen_for_disconnect
waiting on receive()) doesn't respond to cancellation, anyio's
_deliver_cancellation keeps rescheduling itself in a tight loop.

Fix: Override EventSourceResponse.__call__ to set a deadline on the
cancel scope when cancellation starts. This ensures that if tasks
don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds),
the scope times out instead of spinning indefinitely.

References:
- agronholm/anyio#695
- anthropics/claude-agent-sdk-python#378

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(translate): use patched EventSourceResponse to prevent CPU spin

translate.py was importing EventSourceResponse directly from sse_starlette,
bypassing the patched version in sse_transport.py that prevents the anyio
_deliver_cancellation CPU spin loop (anyio#695).

This change ensures all SSE connections in the translate module (stdio-to-SSE
bridge) also benefit from the cancel scope deadline fix.

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(cleanup): reduce cleanup timeouts from 5s to 0.5s

With many concurrent connections (691 TCP sockets observed), each cancelled
SSE task group spinning for up to 5 seconds caused sustained high CPU usage.
Reducing the timeout to 0.5s minimizes CPU waste during spin loops while
still allowing normal cleanup to complete.

The cleanup timeout only affects cleanup of cancelled/released connections,
not normal operation or tool execution time.

Changes:
- SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds
- mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds
- Updated .env.example and charts/mcp-stack/values.yaml

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor(cleanup): make SSE cleanup timeout configurable with safe defaults

- Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s)
- Make sse_transport.py read timeout from config via lazy loader
- Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default
- Override both to 0.5s in docker-compose.yml for testing

The 5.0s default is safe for production. The 0.5s override in
docker-compose.yml allows testing aggressive cleanup to verify
it doesn't affect normal operation.

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(gunicorn): reduce max_requests to recycle stuck workers

The MCP SDK's internal anyio task groups don't respond to cancellation
properly, causing CPU spin loops in _deliver_cancellation. This spin
happens inside the MCP SDK (streamablehttp_client, sse_client) which
we cannot patch.

Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are
recycled frequently, cleaning up any accumulated stuck task groups.

Root cause chain observed:
1. PostgreSQL idle transaction timeout
2. Gateway state change failures
3. SSE connections terminated
4. MCP SDK task groups spin (anyio#695)

This is a workaround until the MCP SDK properly handles cancellation.

Relates to: IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(anyio): monkey-patch _deliver_cancellation to prevent CPU spin

Root cause: anyio's _deliver_cancellation has no iteration limit.
When tasks don't respond to CancelledError, it schedules call_soon()
callbacks indefinitely, causing 100% CPU spin (anyio#695).

Solution:
- Monkey-patch CancelScope._deliver_cancellation to track iterations
- Give up after 100 iterations and log warning
- Clear _cancel_handle to stop further call_soon() callbacks

Also switched from asyncio.wait_for() to anyio.move_on_after() for
MCP session cleanup, which better propagates cancellation through
anyio's cancel scope system.

Trade-off: If cancellation gives up after 100 iterations, some tasks
may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000
worker recycling will eventually clean up orphaned tasks.

Closes IBM#2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
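The shape of the patch can be modeled in pure asyncio. This is a conceptual sketch, not the actual anyio internals: a self-rescheduling `call_soon()` callback (the same shape as anyio's `_deliver_cancellation`) that, in the patched path, gives up after a bounded number of iterations instead of rescheduling forever.

```python
import asyncio

MAX_ITERATIONS = 100  # mirrors ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS


def deliver_cancellation(loop, state):
    state["iterations"] += 1
    if state["done"] or state["iterations"] >= MAX_ITERATIONS:
        # Patched path: stop rescheduling (the real patch also clears
        # _cancel_handle and logs a warning), so the spin ends.
        state["gave_up"] = not state["done"]
        loop.stop()
        return
    # Unpatched anyio reschedules unconditionally here while tasks
    # ignore CancelledError — this line IS the 100% CPU spin loop.
    loop.call_soon(deliver_cancellation, loop, state)


loop = asyncio.new_event_loop()
state = {"iterations": 0, "done": False, "gave_up": False}
loop.call_soon(deliver_cancellation, loop, state)
loop.run_forever()
loop.close()
print(state["iterations"], state["gave_up"])  # → 100 True
```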

* refactor(anyio): make _deliver_cancellation patch optional and disabled by default

The anyio monkey-patch is now feature-flagged and disabled by default:
- ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default)
- ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100

This allows testing performance with and without the patch, and easy
rollback if upstream anyio/MCP SDK fixes the issue.

Added:
- Config settings for enabling/disabling the patch
- apply_anyio_cancel_delivery_patch() function for explicit control
- remove_anyio_cancel_delivery_patch() to restore original behavior
- Documentation in .env.example and docker-compose.yml

To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
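As an `.env` fragment, enabling the patch per the commit above looks like:

```
# Experimental workaround for anyio#695; disabled by default.
ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true
ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100
```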

* docs: add comprehensive CPU spin loop mitigation documentation (IBM#2360)

Add multi-layered documentation for CPU spin loop mitigation settings
across all configuration files. This ensures operators understand and
can tune the workarounds for anyio#695.

Changes:
- .env.example: Add Layer 1/2/3 headers with cross-references to docs
  and issue IBM#2360, document all 6 mitigation variables
- README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers,
  configuration tables, and tuning tips
- docker-compose.yml: Consolidate all mitigation variables into one
  section with SSE protection (Layer 1), cleanup timeouts (Layer 2),
  and experimental anyio patch (Layer 3)
- charts/mcp-stack/values.yaml: Add comprehensive mitigation section
  with layer documentation and cross-references
- docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide
  with root cause analysis, 4-layer defense diagram, configuration
  tables, diagnostic commands, and tuning recommendations
- docs/docs/.pages: Add Operations section to navigation
- docs/docs/operations/.pages: Add nav for operations docs

Mitigation variables documented:
- Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX
- Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT
- Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS

Related: IBM#2360, anyio#695, claude-agent-sdk#378
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(loadtest): aggressive spin detector with configurable timings

Update spin detector load test for faster issue reproduction:
- Increase user counts: 4000 → 4000 → 10000 pattern
- Fast spawn rate: 1000 users/s
- Shorter wait times: 0.01-0.1s between requests
- Reduced connection timeouts: 5s (fail fast)

Related: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* compose mitigation

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* load test

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Defaults

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Defaults

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstring to cancel_on_finish for interrogate coverage

Add docstring to nested cancel_on_finish function in
EventSourceResponse.__call__ to achieve 100% interrogate coverage.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
kcostell06 pushed a commit to kcostell06/mcp-context-forge that referenced this pull request Feb 24, 2026
Labels

- performance: Performance related items
- revisit: Revisit this PR at a later date to address further issues, or if problems arise.
