fix-2360: prevent CPU spin loop after SSE client disconnect #2506
crivetimihai merged 30 commits into main
Conversation
Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker.

Changes:
- Add _respond_tasks dict to track respond tasks by session_id
- Cancel respond tasks explicitly before session cleanup in remove_session()
- Cancel all respond tasks during shutdown()
- Pass disconnect callback to SSE transport for defensive cleanup
- Convert database backend from fire-and-forget to structured concurrency

The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect, and awaited to completion, preventing orphaned tasks from spinning the event loop.

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
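The tracking idea in this commit can be sketched roughly as follows. This is a minimal illustration, not the gateway's code: `SessionTasks`, `register()` and `respond_forever()` are hypothetical names, and only `_respond_tasks`, `remove_session()` and `shutdown()` echo names from the commit message.

```python
import asyncio
from typing import Coroutine, Dict


class SessionTasks:
    """Hypothetical registry: one tracked respond task per session."""

    def __init__(self) -> None:
        self._respond_tasks: Dict[str, asyncio.Task] = {}

    def register(self, session_id: str, coro: Coroutine) -> asyncio.Task:
        # Keep a reference instead of fire-and-forget create_task().
        task = asyncio.create_task(coro)
        self._respond_tasks[session_id] = task
        return task

    async def remove_session(self, session_id: str) -> None:
        task = self._respond_tasks.pop(session_id, None)
        if task is None or task.done():
            return
        task.cancel()
        try:
            await task  # wait until the task has really finished
        except asyncio.CancelledError:
            pass  # expected: the task acknowledged the cancellation

    async def shutdown(self) -> None:
        # Cancel every remaining respond task.
        for session_id in list(self._respond_tasks):
            await self.remove_session(session_id)


async def respond_forever() -> None:
    """Stand-in for a respond loop that runs until cancelled."""
    while True:
        await asyncio.sleep(0.01)


async def demo() -> bool:
    registry = SessionTasks()
    task = registry.register("session-1", respond_forever())
    await asyncio.sleep(0)  # let the task start
    await registry.remove_session("session-1")
    return task.cancelled() and not registry._respond_tasks
```

Awaiting the task after `cancel()` is the important step: it confirms the task has exited rather than assuming the cancellation was delivered.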
Follow-up fixes based on testing and review:

1. Cancellation timeout escalation (Finding 1):
   - _cancel_respond_task() now escalates on timeout by calling transport.disconnect()
   - Retries cancellation after escalation
   - Always removes task from tracking to prevent buildup
2. Redis respond loop exit path (Finding 2):
   - Changed from infinite pubsub.listen() to timeout-based get_message() polling
   - Added session existence check: loop exits if session removed
   - Allows loop to exit even without cancellation
3. Generator finally block cleanup (Finding 3):
   - Added on_disconnect_callback() in event_generator() finally block
   - Covers: CancelledError, GeneratorExit, exceptions, and normal completion
   - Idempotent: safe if callback already ran from on_client_close
4. Added load-test-spin-detector make target:
   - Spike/drop pattern to stress test session cleanup
   - Docker stats monitoring at each phase
   - Color-coded output with pass/fail indicators
   - Log file output to /tmp

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
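The escalation path in item 1 might look something like the sketch below. `FakeTransport`, `stubborn_respond()` and `cancel_with_escalation()` are hypothetical stand-ins written for this illustration. The real `_cancel_respond_task()` is not shown in this thread, so the timeout value and return convention here are assumptions.

```python
import asyncio


class FakeTransport:
    """Hypothetical stand-in for the SSE transport."""

    def __init__(self) -> None:
        self.disconnected = asyncio.Event()

    async def disconnect(self) -> None:
        self.disconnected.set()


async def stubborn_respond(transport: FakeTransport) -> None:
    """Misbehaving task: ignores cancellation until the transport closes."""
    while not transport.disconnected.is_set():
        try:
            await asyncio.sleep(0.01)
        except asyncio.CancelledError:
            continue  # deliberately swallows the cancellation


async def cancel_with_escalation(task: asyncio.Task, transport: FakeTransport,
                                 timeout: float = 0.05) -> bool:
    """Cancel, wait, then disconnect the transport and retry once."""
    task.cancel()
    done, _ = await asyncio.wait({task}, timeout=timeout)
    if done:
        return True
    await transport.disconnect()  # escalation step
    task.cancel()                 # retry after escalation
    done, _ = await asyncio.wait({task}, timeout=timeout)
    return bool(done)


async def demo() -> tuple:
    transport = FakeTransport()
    task = asyncio.create_task(stubborn_respond(transport))
    await asyncio.sleep(0)  # let the task start
    stopped = await cancel_with_escalation(task, transport)
    return stopped, transport.disconnected.is_set()
```

`asyncio.wait()` is used instead of `asyncio.wait_for()` here because it neither raises on timeout nor cancels the task again, which keeps the two attempts explicit.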
Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task was created AFTER create_sse_response(). If client disconnected during response setup, the disconnect callback ran before the task existed, leaving it orphaned. Now matches utility_sse_endpoint ordering:
1. Compute user_with_token
2. Create and register respond task
3. Call create_sse_response()

Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't be cancelled after escalation. Previously these were dropped from tracking entirely, losing visibility. Now they're moved to _stuck_tasks for monitoring and final cleanup during shutdown().

Updated tests to verify escalation behavior.

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response() fails. Added try/except around create_sse_response() in both sse_endpoint and utility_sse_endpoint. On failure, it calls remove_session() to clean up the task and session before re-raising.

Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to:
- Remove completed tasks from _stuck_tasks
- Retry cancellation for still-stuck tasks
- Prevent memory leaks from tasks that eventually complete

Finding 3 (LOW): Added test for escalation path with fake transport to verify transport.disconnect() is called during escalation. Also added tests for the stuck task reaper lifecycle.

Also updated load-test-spin-detector to be a full-featured test matching load-test-ui with JWT auth, all user classes, entity ID fetching, and the same 4000-user baseline.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
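A reaper of the kind described in Finding 2 could be sketched as below. The function names are hypothetical and the dict simply maps a session id to its task. The 30-second interval comes from the commit message, and everything else is an assumption.

```python
import asyncio
from typing import Dict


def reap_stuck_tasks(stuck: Dict[str, asyncio.Task]) -> int:
    """Drop finished tasks, re-cancel the rest, return how many remain."""
    for session_id, task in list(stuck.items()):
        if task.done():
            del stuck[session_id]  # finished on its own: stop tracking it
        else:
            task.cancel()          # still stuck: retry cancellation
    return len(stuck)


async def reaper_loop(stuck: Dict[str, asyncio.Task], interval: float = 30.0) -> None:
    """Periodic driver: runs one reaping pass every `interval` seconds."""
    while True:
        await asyncio.sleep(interval)
        reap_stuck_tasks(stuck)


async def demo() -> tuple:
    finished = asyncio.create_task(asyncio.sleep(0))
    pending = asyncio.create_task(asyncio.sleep(60))
    await asyncio.sleep(0.01)         # "finished" completes, "pending" keeps sleeping
    stuck = {"a": finished, "b": pending}
    first = reap_stuck_tasks(stuck)   # removes "a", re-cancels "b"
    await asyncio.sleep(0.01)         # give the cancellation time to land
    second = reap_stuck_tasks(stuck)  # "b" is done now and gets removed
    return first, second
```

Keeping the single pass (`reap_stuck_tasks`) separate from the timer loop makes the reaping logic testable without waiting for the interval.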
- Reduce logging level to WARNING to suppress noisy worker messages
- Only run entity fetching and cleanup on master/standalone nodes
- Reduce cycle sizes from 4000 to 1000 peak users for faster iteration
- Update banner to reflect new cycle pattern (500 -> 750 -> 1000)
- Remove verbose JWT token generation log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE endpoints. In Python 3.8+, CancelledError inherits from BaseException, not Exception, so the previous except block wouldn't catch it. Now cleanup runs even when requests are cancelled during SSE handshake.

Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None to prevent tight loop. The loop now has guaranteed minimum sleep even when Redis returns immediately in certain states.

Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops to exit early. remove_session() now marks the session as closing BEFORE attempting task cancellation, so the respond loop (Redis and DB backends) can exit immediately without waiting for the full cancellation timeout.

Finding 4 (LOW): Already addressed in previous commit with test test_cancel_respond_task_escalation_calls_transport_disconnect.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
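The inheritance point in Finding 1 is easy to check in isolation. The sketch below is only a demonstration of the language behaviour, not the endpoint code. `handshake()` is a hypothetical stand-in for the SSE setup step.

```python
import asyncio
from typing import List


async def handshake(cleanup_log: List[str]) -> None:
    """Hypothetical stand-in for SSE response setup."""
    try:
        await asyncio.sleep(60)
    except Exception:
        # Not reached on cancellation: CancelledError is a BaseException.
        cleanup_log.append("except Exception")
        raise
    finally:
        # Runs for cancellation, errors, and normal completion alike.
        cleanup_log.append("finally")


async def demo() -> List[str]:
    log: List[str] = []
    task = asyncio.create_task(handshake(log))
    await asyncio.sleep(0)  # let the task reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return log
```

Only the `finally` branch records anything, which is why cleanup that must survive cancellation needs either `finally` or an explicit `except asyncio.CancelledError` that re-raises.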
- Cycles now repeat indefinitely instead of stopping after 5
- Fixed log file path to /tmp/spin_detector.log for easy monitoring
- Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts
- Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)")
- Banner shows monitoring command: tail -f /tmp/spin_detector.log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
CancelledError inherits from BaseException in Python 3.8+, so it won't be caught by 'except Exception' handlers. The explicit handlers were unnecessary and triggered pylint W0706 (try-except-raise). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…dlers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
The blocking `async for message in pubsub.listen()` pattern doesn't respond to asyncio cancellation properly. When anyio's cancel scope tries to cancel tasks using this pattern, the tasks don't respond because the async iterator is blocked waiting for Redis messages. This causes anyio's `_deliver_cancellation` to continuously reschedule itself with `call_soon()`, creating a CPU spin loop that consumes 100% CPU per affected worker.

Changed to timeout-based polling pattern:
- Use `get_message(timeout=1.0)` with `asyncio.wait_for()`
- Loop allows cancellation check every ~1 second
- Added sleep on None/non-message responses to prevent edge case spins

Files fixed:
- mcpgateway/services/cancellation_service.py
- mcpgateway/services/event_service.py

Closes #2360 (partial - additional spin sources may exist)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Update: Fixed blocking pubsub.listen() pattern

Root Cause Identified
The CPU spin loop is caused by anyio's
When a cancel scope is triggered (e.g., SSE client disconnects):

Problematic Pattern

    async for message in pubsub.listen():  # Blocks indefinitely

This async iterator blocks waiting for Redis and doesn't respond to CancelledError until the next message arrives.

Fix Applied
Changed to timeout-based polling in:

    while True:
        try:
            message = await asyncio.wait_for(
                pubsub.get_message(timeout=1.0), timeout=1.5)
        except asyncio.TimeoutError:
            continue  # Loop back, allowing cancellation check

Status: Partial Fix
Testing shows this helped delay the spin loop but containers are still hitting ~800% CPU eventually. There are likely additional spin sources to investigate:

Continuing investigation...
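For readers who want to run the polling pattern without a Redis server, here is a self-contained sketch. `FakePubSub` is a hypothetical stand-in that only mimics the shape of a pubsub `get_message(timeout=...)` call. The `closing` set plays the role of the `_closing_sessions` check from the earlier commit, and the short poll interval is chosen only to keep the demo fast.

```python
import asyncio
from typing import List, Optional, Set


class FakePubSub:
    """Hypothetical stand-in for a Redis pubsub object."""

    def __init__(self, messages: List[dict]) -> None:
        self._messages = list(messages)

    async def get_message(self, timeout: float = 1.0) -> Optional[dict]:
        if self._messages:
            return self._messages.pop(0)
        await asyncio.sleep(timeout)  # nothing queued: wait, then give up
        return None


async def respond_loop(pubsub: FakePubSub, session_id: str, closing: Set[str],
                       received: List[str], poll: float = 0.01) -> None:
    # Each pass re-checks the closing set, so the loop can exit without cancellation.
    while session_id not in closing:
        try:
            message = await asyncio.wait_for(
                pubsub.get_message(timeout=poll), timeout=poll * 1.5)
        except asyncio.TimeoutError:
            continue  # loop back, giving cancellation a chance to land
        if message is None or message.get("type") != "message":
            await asyncio.sleep(poll)  # guaranteed pause, avoids a tight loop
            continue
        received.append(message["data"])


async def demo() -> List[str]:
    pubsub = FakePubSub([{"type": "subscribe"},
                         {"type": "message", "data": "hello"}])
    closing: Set[str] = set()
    received: List[str] = []
    task = asyncio.create_task(respond_loop(pubsub, "session-1", closing, received))
    await asyncio.sleep(0.1)
    closing.add("session-1")                    # what remove_session() would do first
    await asyncio.wait_for(task, timeout=1.0)   # loop exits on its own
    return received
```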
Investigation Update: Remaining Spin Sources

Current Status
The pubsub.listen() fix helped but containers still reach ~800% CPU after load tests complete.

Evidence

Root Cause Pattern
anyio's

Remaining Investigation Areas

Next Steps
The MCP session/transport __aexit__ methods can block indefinitely when internal tasks don't respond to cancellation. This causes anyio's _deliver_cancellation to spin in a tight loop, consuming ~800% CPU.

Root cause: When calling session.__aexit__() or transport.__aexit__(), they attempt to cancel internal tasks (like post_writer waiting on memory streams). If these tasks don't respond to CancelledError, anyio's cancel scope keeps calling call_soon() to reschedule _deliver_cancellation, creating a CPU spin loop.

Changes:
- Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py
- Wrap all __aexit__ calls in asyncio.wait_for() with timeout
- Add timeout to pubsub cleanup in session_registry.py and registry_cache.py
- Add timeout to streamable HTTP context cleanup in translate.py

This is a continuation of the fix for issue #2360.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
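Wrapping an `__aexit__` call in a timeout can be sketched as follows. `HangingSession` and `close_with_timeout()` are hypothetical names, and the timeout is shortened from the commit's 5 seconds so the example finishes quickly.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

# The commit uses 5 seconds; a short value keeps this demo fast.
SESSION_CLEANUP_TIMEOUT = 0.05


class HangingSession:
    """Hypothetical session whose __aexit__ never returns on its own."""

    async def __aenter__(self) -> "HangingSession":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        await asyncio.sleep(3600)


async def close_with_timeout(session, timeout: float = SESSION_CLEANUP_TIMEOUT) -> bool:
    """Best-effort cleanup: return True if __aexit__ finished in time."""
    try:
        await asyncio.wait_for(session.__aexit__(None, None, None), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("Session cleanup timed out after %.2fs", timeout)
        return False
```

A later commit in this PR switches the session cleanup from `asyncio.wait_for()` to `anyio.move_on_after()`; the shape of the wrapper stays the same.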
Additional Fix: Timeouts for Session/Transport Cleanup

py-spy profiling after the previous fix still showed CPU spin at ~800% with

Root Cause
When closing MCP sessions,

Stack Trace Pattern (from py-spy)

Fix
Added

If cleanup times out, we log a warning and proceed anyway (best-effort cleanup). This prevents indefinite blocking and the resulting CPU spin.
Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to control how long cleanup operations wait for session/transport __aexit__ calls to complete.

Clarification: This timeout does NOT affect tool execution time (which uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions to prevent CPU spin loops when internal tasks don't respond to cancel.

Changes:
- Add mcp_session_pool_cleanup_timeout to config.py
- Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs
- Add to charts/mcp-stack/values.yaml
- Update mcp_session_pool.py to use _get_cleanup_timeout() helper
- Update session_registry.py and registry_cache.py to use config
- Update translate.py to use config with fallback

When to adjust:
- Increase if you see frequent "cleanup timed out" warnings in logs
- Decrease for faster shutdown (at risk of resource leaks)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
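A "config with fallback" lookup might be sketched as below. The project reads this value through its own config.py, which is not shown in this thread, so this plain-environment version, the `get_cleanup_timeout()` name and the rule of rejecting non-positive values are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional

DEFAULT_CLEANUP_TIMEOUT = 5.0  # default stated in the commit message


def get_cleanup_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Read MCP_SESSION_POOL_CLEANUP_TIMEOUT, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("MCP_SESSION_POOL_CLEANUP_TIMEOUT")
    if raw is None:
        return DEFAULT_CLEANUP_TIMEOUT
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_CLEANUP_TIMEOUT  # unparsable value: use the default
    # Assumption: a zero or negative timeout is treated as a misconfiguration.
    return value if value > 0 else DEFAULT_CLEANUP_TIMEOUT
```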
Configuration: Cleanup Timeout Now Configurable

Made

Clarification on Implications

The cleanup timeout does NOT affect tool execution. It only applies to:

Configuration

    # .env
    MCP_SESSION_POOL_CLEANUP_TIMEOUT=5.0

    # Helm values.yaml
    mcpContextForge:
      config:
        MCP_SESSION_POOL_CLEANUP_TIMEOUT: "5.0"

When to Adjust
Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at 100% CPU when SSE task group tasks don't respond to cancellation.

Root cause: When an SSE connection ends, sse_starlette's task group tries to cancel all tasks. If a task (like _listen_for_disconnect waiting on receive()) doesn't respond to cancellation, anyio's _deliver_cancellation keeps rescheduling itself in a tight loop.

Fix: Override EventSourceResponse.__call__ to set a deadline on the cancel scope when cancellation starts. This ensures that if tasks don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds), the scope times out instead of spinning indefinitely.

References:
- agronholm/anyio#695
- anthropics/claude-agent-sdk-python#378

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Fix: SSE Cancel Scope Deadline

Found the root cause via web search - this is a known issue:

Similar issues exist in other Python projects:

Problem
When an SSE connection ends,

Fix
Created a patched

    async def cancel_on_finish(coro):
        await coro()
        # Set deadline to prevent indefinite spin if tasks don't respond
        task_group.cancel_scope.deadline = anyio.current_time() + SSE_TASK_GROUP_CLEANUP_TIMEOUT
        task_group.cancel_scope.cancel()

This ensures that if tasks don't respond within 5 seconds, the scope times out and exits cleanly instead of spinning.

Configuration

Testing Needed
Rebuild containers and verify the spin no longer occurs after load tests.
translate.py was importing EventSourceResponse directly from sse_starlette, bypassing the patched version in sse_transport.py that prevents the anyio _deliver_cancellation CPU spin loop (anyio#695). This change ensures all SSE connections in the translate module (stdio-to-SSE bridge) also benefit from the cancel scope deadline fix. Relates to: #2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
With many concurrent connections (691 TCP sockets observed), each cancelled SSE task group spinning for up to 5 seconds caused sustained high CPU usage. Reducing the timeout to 0.5s minimizes CPU waste during spin loops while still allowing normal cleanup to complete.

The cleanup timeout only affects cleanup of cancelled/released connections, not normal operation or tool execution time.

Changes:
- SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds
- mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds
- Updated .env.example and charts/mcp-stack/values.yaml

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
…faults

- Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s)
- Make sse_transport.py read timeout from config via lazy loader
- Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default
- Override both to 0.5s in docker-compose.yml for testing

The 5.0s default is safe for production. The 0.5s override in docker-compose.yml allows testing aggressive cleanup to verify it doesn't affect normal operation.

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
The MCP SDK's internal anyio task groups don't respond to cancellation properly, causing CPU spin loops in _deliver_cancellation. This spin happens inside the MCP SDK (streamablehttp_client, sse_client) which we cannot patch.

Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are recycled frequently, cleaning up any accumulated stuck task groups.

Root cause chain observed:
1. PostgreSQL idle transaction timeout
2. Gateway state change failures
3. SSE connections terminated
4. MCP SDK task groups spin (anyio#695)

This is a workaround until the MCP SDK properly handles cancellation.

Relates to: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
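In a Gunicorn config file, which is plain Python, worker recycling is controlled by the `max_requests` and `max_requests_jitter` settings. The fragment below is a sketch of how the two environment variables could feed them. How the project actually wires these values is not shown here, and the jitter default of 500 is an assumption.

```python
import os

# Restart each worker after it has handled this many requests.
# 5000 mirrors the value named in this commit.
max_requests = int(os.environ.get("GUNICORN_MAX_REQUESTS", "5000"))

# Random spread added per worker so they do not all restart at once.
# The default of 500 is an assumption made for this sketch.
max_requests_jitter = int(os.environ.get("GUNICORN_MAX_REQUESTS_JITTER", "500"))
```

A recycled worker starts with a fresh event loop, which is what clears any task groups that were left spinning in the old process.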
Root cause: anyio's _deliver_cancellation has no iteration limit. When tasks don't respond to CancelledError, it schedules call_soon() callbacks indefinitely, causing 100% CPU spin (anyio#695).

Solution:
- Monkey-patch CancelScope._deliver_cancellation to track iterations
- Give up after 100 iterations and log warning
- Clear _cancel_handle to stop further call_soon() callbacks

Also switched from asyncio.wait_for() to anyio.move_on_after() for MCP session cleanup, which better propagates cancellation through anyio's cancel scope system.

Trade-off: If cancellation gives up after 100 iterations, some tasks may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000 worker recycling will eventually clean up orphaned tasks.

Closes #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
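The iteration-limit idea can be illustrated on a toy class. `ToyScope` below is NOT anyio's `CancelScope` and does not reproduce its internals. It only imitates the "reschedule yourself with `call_soon()` while a task is stuck" behaviour so that the wrapper technique can be run and observed. `limit_iterations()` is likewise a hypothetical helper.

```python
import asyncio
from typing import Callable


class ToyScope:
    """Toy stand-in (not anyio code): reschedules itself while a task is stuck."""

    def __init__(self, task_is_stuck: Callable[[], bool]) -> None:
        self.task_is_stuck = task_is_stuck
        self.calls = 0
        self._cancel_handle = None

    def _deliver_cancellation(self) -> None:
        self.calls += 1
        if self.task_is_stuck():
            loop = asyncio.get_running_loop()
            # Without a limit this repeats forever: the CPU spin.
            self._cancel_handle = loop.call_soon(self._deliver_cancellation)
        else:
            self._cancel_handle = None


def limit_iterations(cls, max_iterations: int = 100):
    """Wrap cls._deliver_cancellation so it gives up after max_iterations."""
    original = cls._deliver_cancellation

    def patched(self) -> None:
        count = getattr(self, "_delivery_iterations", 0) + 1
        self._delivery_iterations = count
        if count > max_iterations:
            self._cancel_handle = None  # stop rescheduling: give up
            return
        original(self)

    cls._deliver_cancellation = patched
    return original  # keep this if the patch should be removable later


limit_iterations(ToyScope, max_iterations=100)


async def demo() -> int:
    scope = ToyScope(task_is_stuck=lambda: True)  # a task that never responds
    scope._deliver_cancellation()
    await asyncio.sleep(0.05)  # give the callbacks time to run out
    return scope.calls
```

Because the rescheduled callback is looked up on the instance each time, it resolves to the patched method, so the counter sees every iteration and the loop ends after the limit instead of running indefinitely.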
…ed by default

The anyio monkey-patch is now feature-flagged and disabled by default:
- ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default)
- ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100

This allows testing performance with and without the patch, and easy rollback if upstream anyio/MCP SDK fixes the issue.

Added:
- Config settings for enabling/disabling the patch
- apply_anyio_cancel_delivery_patch() function for explicit control
- remove_anyio_cancel_delivery_patch() to restore original behavior
- Documentation in .env.example and docker-compose.yml

To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Add multi-layered documentation for CPU spin loop mitigation settings across all configuration files. This ensures operators understand and can tune the workarounds for anyio#695.

Changes:
- .env.example: Add Layer 1/2/3 headers with cross-references to docs and issue #2360, document all 6 mitigation variables
- README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers, configuration tables, and tuning tips
- docker-compose.yml: Consolidate all mitigation variables into one section with SSE protection (Layer 1), cleanup timeouts (Layer 2), and experimental anyio patch (Layer 3)
- charts/mcp-stack/values.yaml: Add comprehensive mitigation section with layer documentation and cross-references
- docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide with root cause analysis, 4-layer defense diagram, configuration tables, diagnostic commands, and tuning recommendations
- docs/docs/.pages: Add Operations section to navigation
- docs/docs/operations/.pages: Add nav for operations docs

Mitigation variables documented:
- Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX
- Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT
- Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS

Related: #2360, anyio#695, claude-agent-sdk#378

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
698891c to c08bb95
Update spin detector load test for faster issue reproduction:
- Increase user counts: 4000 → 4000 → 10000 pattern
- Fast spawn rate: 1000 users/s
- Shorter wait times: 0.01-0.1s between requests
- Reduced connection timeouts: 5s (fail fast)

Related: #2360

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
f4b566f to 1226dd5
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
PR Summary: CPU Spin Loop Mitigation (Issue #2360)

Problem
After load tests or sustained traffic, Gunicorn workers enter 100% CPU spin loops while appearing idle. Root cause: anyio's

Before fix:

After fix:

Changes Overview

New Configuration Options

Layer 1: SSE Connection Protection

    - SSE_SEND_TIMEOUT=30.0              # ASGI send() timeout
    - SSE_RAPID_YIELD_WINDOW_MS=1000     # Detection window
    - SSE_RAPID_YIELD_MAX=50             # Max yields before disconnect

Layer 2: Cleanup Timeouts

    - MCP_SESSION_POOL_CLEANUP_TIMEOUT=0.5   # Session __aexit__ timeout
    - SSE_TASK_GROUP_CLEANUP_TIMEOUT=0.5     # SSE task group timeout

Layer 3: anyio Monkey-Patch (Experimental)

    - ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true   # Enable workaround
    - ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=500   # ~60ms recovery time

Layer 4: Worker Recycling

    - GUNICORN_MAX_REQUESTS=1000000      # Recycle workers (reduced from 10M)
    - GUNICORN_MAX_REQUESTS_JITTER=100000

How MAX_ITERATIONS Works

Normal cancellations complete in 1-10 iterations. Values above 100 handle edge cases.

New Load Test:

| Scenario | CPU After Load | Spinning Workers |
|---|---|---|
| Without mitigation | ~800% | 9/24 (37.5%) |
| With mitigation | ~2.5% | 0/24 (0%) |
Tested with:
- 4,000-10,000 concurrent users
- 3,000+ RPS sustained
- Multiple spike/drop cycles
Documentation
New operations guide: docs/docs/operations/cpu-spin-loop-mitigation.md
- Problem description and root cause analysis
- Layer-by-layer mitigation explanation
- Configuration reference
- Troubleshooting guide
- Verification steps
Related Issues
- Closes [BUG]: anyio cancel scope spin loop causes 100% CPU after load test stops #2360
- Upstream: anyio#695
- Related: Claude SDK#378
Add docstring to nested cancel_on_finish function in EventSourceResponse.__call__ to achieve 100% interrogate coverage. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
jonpspri
left a comment
Let's be clear. I hate that we've gotten to this. But the patch is as clean as it can get without introducing a new object to encapsulate the process. Let's be sure we're monitoring the upstream so we can rip this out when it's corrected.
* fix-2360: prevent asyncio CPU spin loop after SSE client disconnect Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker. Changes: - Add _respond_tasks dict to track respond tasks by session_id - Cancel respond tasks explicitly before session cleanup in remove_session() - Cancel all respond tasks during shutdown() - Pass disconnect callback to SSE transport for defensive cleanup - Convert database backend from fire-and-forget to structured concurrency The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect, and awaited to completion, preventing orphaned tasks from spinning the event loop. Closes IBM#2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: additional fixes for CPU spin loop after SSE disconnect Follow-up fixes based on testing and review: 1. Cancellation timeout escalation (Finding 1): - _cancel_respond_task() now escalates on timeout by calling transport.disconnect() - Retries cancellation after escalation - Always removes task from tracking to prevent buildup 2. Redis respond loop exit path (Finding 2): - Changed from infinite pubsub.listen() to timeout-based get_message() polling - Added session existence check - loop exits if session removed - Allows loop to exit even without cancellation 3. Generator finally block cleanup (Finding 3): - Added on_disconnect_callback() in event_generator() finally block - Covers: CancelledError, GeneratorExit, exceptions, and normal completion - Idempotent - safe if callback already ran from on_client_close 4. 
Added load-test-spin-detector make target: - Spike/drop pattern to stress test session cleanup - Docker stats monitoring at each phase - Color-coded output with pass/fail indicators - Log file output to /tmp Closes IBM#2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: fix race condition in sse_endpoint and add stuck task tracking Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task was created AFTER create_sse_response(). If client disconnected during response setup, the disconnect callback ran before the task existed, leaving it orphaned. Now matches utility_sse_endpoint ordering: 1. Compute user_with_token 2. Create and register respond task 3. Call create_sse_response() Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't be cancelled after escalation. Previously these were dropped from tracking entirely, losing visibility. Now they're moved to _stuck_tasks for monitoring and final cleanup during shutdown(). Updated tests to verify escalation behavior. Closes IBM#2360 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add SSE failure cleanup, stuck task reaper, and full load test Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response() fails. Added try/except around create_sse_response() in both sse_endpoint and utility_sse_endpoint - on failure, calls remove_session() to clean up the task and session before re-raising. Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to: - Remove completed tasks from _stuck_tasks - Retry cancellation for still-stuck tasks - Prevent memory leaks from tasks that eventually complete Finding 3 (LOW): Added test for escalation path with fake transport to verify transport.disconnect() is called during escalation. Also added tests for the stuck task reaper lifecycle. 
Also updated load-test-spin-detector to be a full-featured test matching load-test-ui with JWT auth, all user classes, entity ID fetching, and the same 4000-user baseline. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: improve load-test-spin-detector output and reduce cycle sizes - Reduce logging level to WARNING to suppress noisy worker messages - Only run entity fetching and cleanup on master/standalone nodes - Reduce cycle sizes from 4000 to 1000 peak users for faster iteration - Update banner to reflect new cycle pattern (500 -> 750 -> 1000) - Remove verbose JWT token generation log Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: address remaining CPU spin loop findings Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE endpoints. In Python 3.8+, CancelledError inherits from BaseException, not Exception, so the previous except block wouldn't catch it. Now cleanup runs even when requests are cancelled during SSE handshake. Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None to prevent tight loop. The loop now has guaranteed minimum sleep even when Redis returns immediately in certain states. Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops to exit early. remove_session() now marks the session as closing BEFORE attempting task cancellation, so the respond loop (Redis and DB backends) can exit immediately without waiting for the full cancellation timeout. Finding 4 (LOW): Already addressed in previous commit with test test_cancel_respond_task_escalation_calls_transport_disconnect. 
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: make load-test-spin-detector run unlimited cycles - Cycles now repeat indefinitely instead of stopping after 5 - Fixed log file path to /tmp/spin_detector.log for easy monitoring - Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts - Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)") - Banner shows monitoring command: tail -f /tmp/spin_detector.log Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add asyncio.CancelledError to SSE endpoint Raises docs Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * Linting Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: remove redundant asyncio.CancelledError handlers CancelledError inherits from BaseException in Python 3.8+, so it won't be caught by 'except Exception' handlers. The explicit handlers were unnecessary and triggered pylint W0706 (try-except-raise). Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: restore asyncio.CancelledError in Raises docs for inner handlers Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix-2360: add sleep on non-message Redis pubsub types to prevent spin Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(pubsub): replace blocking listen() with timeout-based get_message() The blocking `async for message in pubsub.listen()` pattern doesn't respond to asyncio cancellation properly. When anyio's cancel scope tries to cancel tasks using this pattern, the tasks don't respond because the async iterator is blocked waiting for Redis messages. This causes anyio's `_deliver_cancellation` to continuously reschedule itself with `call_soon()`, creating a CPU spin loop that consumes 100% CPU per affected worker. 
Changed to timeout-based polling pattern: - Use `get_message(timeout=1.0)` with `asyncio.wait_for()` - Loop allows cancellation check every ~1 second - Added sleep on None/non-message responses to prevent edge case spins Files fixed: - mcpgateway/services/cancellation_service.py - mcpgateway/services/event_service.py Closes IBM#2360 (partial - additional spin sources may exist) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(cleanup): add timeouts to __aexit__ calls to prevent CPU spin loops The MCP session/transport __aexit__ methods can block indefinitely when internal tasks don't respond to cancellation. This causes anyio's _deliver_cancellation to spin in a tight loop, consuming ~800% CPU. Root cause: When calling session.__aexit__() or transport.__aexit__(), they attempt to cancel internal tasks (like post_writer waiting on memory streams). If these tasks don't respond to CancelledError, anyio's cancel scope keeps calling call_soon() to reschedule _deliver_cancellation, creating a CPU spin loop. Changes: - Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py - Wrap all __aexit__ calls in asyncio.wait_for() with timeout - Add timeout to pubsub cleanup in session_registry.py and registry_cache.py - Add timeout to streamable HTTP context cleanup in translate.py This is a continuation of the fix for issue IBM#2360. Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * feat(config): make session cleanup timeout configurable Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to control how long cleanup operations wait for session/transport __aexit__ calls to complete. Clarification: This timeout does NOT affect tool execution time (which uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions to prevent CPU spin loops when internal tasks don't respond to cancel. 
Changes: - Add mcp_session_pool_cleanup_timeout to config.py - Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs - Add to charts/mcp-stack/values.yaml - Update mcp_session_pool.py to use _get_cleanup_timeout() helper - Update session_registry.py and registry_cache.py to use config - Update translate.py to use config with fallback When to adjust: - Increase if you see frequent "cleanup timed out" warnings in logs - Decrease for faster shutdown (at risk of resource leaks) Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(sse): add deadline to cancel scope to prevent CPU spin loop Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at 100% CPU when SSE task group tasks don't respond to cancellation. Root cause: When an SSE connection ends, sse_starlette's task group tries to cancel all tasks. If a task (like _listen_for_disconnect waiting on receive()) doesn't respond to cancellation, anyio's _deliver_cancellation keeps rescheduling itself in a tight loop. Fix: Override EventSourceResponse.__call__ to set a deadline on the cancel scope when cancellation starts. This ensures that if tasks don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds), the scope times out instead of spinning indefinitely. References: - agronholm/anyio#695 - anthropics/claude-agent-sdk-python#378 Signed-off-by: Mihai Criveti <crivetimihai@gmail.com> * fix(translate): use patched EventSourceResponse to prevent CPU spin translate.py was importing EventSourceResponse directly from sse_starlette, bypassing the patched version in sse_transport.py that prevents the anyio _deliver_cancellation CPU spin loop (anyio#695). This change ensures all SSE connections in the translate module (stdio-to-SSE bridge) also benefit from the cancel scope deadline fix. 
Relates to: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(cleanup): reduce cleanup timeouts from 5s to 0.5s

With many concurrent connections (691 TCP sockets observed), each cancelled SSE task group spinning for up to 5 seconds caused sustained high CPU usage. Reducing the timeout to 0.5s minimizes CPU waste during spin loops while still allowing normal cleanup to complete.

The cleanup timeout only affects cleanup of cancelled/released connections, not normal operation or tool execution time.

Changes:
- SSE_TASK_GROUP_CLEANUP_TIMEOUT: 5.0 -> 0.5 seconds
- mcp_session_pool_cleanup_timeout: 5.0 -> 0.5 seconds
- Updated .env.example and charts/mcp-stack/values.yaml

Relates to: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor(cleanup): make SSE cleanup timeout configurable with safe defaults

- Add SSE_TASK_GROUP_CLEANUP_TIMEOUT setting (default: 5.0s)
- Make sse_transport.py read timeout from config via lazy loader
- Keep MCP_SESSION_POOL_CLEANUP_TIMEOUT at 5.0s default
- Override both to 0.5s in docker-compose.yml for testing

The 5.0s default is safe for production. The 0.5s override in docker-compose.yml allows testing aggressive cleanup to verify it doesn't affect normal operation.

Relates to: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(gunicorn): reduce max_requests to recycle stuck workers

The MCP SDK's internal anyio task groups don't respond to cancellation properly, causing CPU spin loops in _deliver_cancellation. This spin happens inside the MCP SDK (streamablehttp_client, sse_client) which we cannot patch.

Reduce GUNICORN_MAX_REQUESTS from 10M to 5K to ensure workers are recycled frequently, cleaning up any accumulated stuck task groups.

Root cause chain observed:
1. PostgreSQL idle transaction timeout
2. Gateway state change failures
3. SSE connections terminated
4. MCP SDK task groups spin (anyio#695)

This is a workaround until the MCP SDK properly handles cancellation.

Relates to: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(anyio): monkey-patch _deliver_cancellation to prevent CPU spin

Root cause: anyio's _deliver_cancellation has no iteration limit. When tasks don't respond to CancelledError, it schedules call_soon() callbacks indefinitely, causing 100% CPU spin (anyio#695).

Solution:
- Monkey-patch CancelScope._deliver_cancellation to track iterations
- Give up after 100 iterations and log warning
- Clear _cancel_handle to stop further call_soon() callbacks

Also switched from asyncio.wait_for() to anyio.move_on_after() for MCP session cleanup, which better propagates cancellation through anyio's cancel scope system.

Trade-off: If cancellation gives up after 100 iterations, some tasks may not be properly cancelled. However, GUNICORN_MAX_REQUESTS=5000 worker recycling will eventually clean up orphaned tasks.

Closes IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* refactor(anyio): make _deliver_cancellation patch optional and disabled by default

The anyio monkey-patch is now feature-flagged and disabled by default:
- ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=false (default)
- ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS=100

This allows testing performance with and without the patch, and easy rollback if upstream anyio/MCP SDK fixes the issue.

Added:
- Config settings for enabling/disabling the patch
- apply_anyio_cancel_delivery_patch() function for explicit control
- remove_anyio_cancel_delivery_patch() to restore original behavior
- Documentation in .env.example and docker-compose.yml

To enable: set ANYIO_CANCEL_DELIVERY_PATCH_ENABLED=true

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add comprehensive CPU spin loop mitigation documentation (IBM#2360)

Add multi-layered documentation for CPU spin loop mitigation settings across all configuration files. This ensures operators understand and can tune the workarounds for anyio#695.

Changes:
- .env.example: Add Layer 1/2/3 headers with cross-references to docs and issue IBM#2360, document all 6 mitigation variables
- README.md: Expand "CPU Spin Loop Mitigation" section with all 3 layers, configuration tables, and tuning tips
- docker-compose.yml: Consolidate all mitigation variables into one section with SSE protection (Layer 1), cleanup timeouts (Layer 2), and experimental anyio patch (Layer 3)
- charts/mcp-stack/values.yaml: Add comprehensive mitigation section with layer documentation and cross-references
- docs/docs/operations/cpu-spin-loop-mitigation.md: NEW - Full guide with root cause analysis, 4-layer defense diagram, configuration tables, diagnostic commands, and tuning recommendations
- docs/docs/.pages: Add Operations section to navigation
- docs/docs/operations/.pages: Add nav for operations docs

Mitigation variables documented:
- Layer 1: SSE_SEND_TIMEOUT, SSE_RAPID_YIELD_WINDOW_MS, SSE_RAPID_YIELD_MAX
- Layer 2: MCP_SESSION_POOL_CLEANUP_TIMEOUT, SSE_TASK_GROUP_CLEANUP_TIMEOUT
- Layer 3: ANYIO_CANCEL_DELIVERY_PATCH_ENABLED, ANYIO_CANCEL_DELIVERY_MAX_ITERATIONS

Related: IBM#2360, anyio#695, claude-agent-sdk#378
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(loadtest): aggressive spin detector with configurable timings

Update spin detector load test for faster issue reproduction:
- Increase user counts: 4000 → 4000 → 10000 pattern
- Fast spawn rate: 1000 users/s
- Shorter wait times: 0.01-0.1s between requests
- Reduced connection timeouts: 5s (fail fast)

Related: IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* compose mitigation

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* load test

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Defaults

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Defaults

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* docs: add docstring to cancel_on_finish for interrogate coverage

Add docstring to nested cancel_on_finish function in EventSourceResponse.__call__ to achieve 100% interrogate coverage.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

---------

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Signed-off-by: hughhennnelly <hughhennelly06@gmail.com>
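The iteration cap described in the fix(anyio) commit above can be illustrated with a toy model. This is only a sketch under stated assumptions: `BoundedRedelivery` is a hypothetical class written for this note, it is not anyio's `CancelScope` and not the patch shipped in this PR. It shows a callback that re-schedules itself with `call_soon()` until a condition holds, and that gives up after `max_iterations` attempts instead of spinning.

```python
import asyncio


class BoundedRedelivery:
    """Toy model of a callback that re-schedules itself until a condition holds.

    Hypothetical class for illustration only; it is not anyio's CancelScope.
    """

    def __init__(self, is_done, max_iterations=100):
        self.is_done = is_done              # callable: have all tasks exited?
        self.max_iterations = max_iterations
        self.iterations = 0
        self.gave_up = False
        self._handle = None                 # pending call_soon() handle, if any
        self._finished = None

    def start(self):
        """Schedule the first attempt; the returned future resolves to True or False."""
        loop = asyncio.get_running_loop()
        self._finished = loop.create_future()
        self._handle = loop.call_soon(self._step)
        return self._finished

    def _step(self):
        self._handle = None
        if self.is_done():
            self._finished.set_result(True)
            return
        self.iterations += 1
        if self.iterations >= self.max_iterations:
            # Give up: the handle stays cleared, so nothing is scheduled again.
            self.gave_up = True
            self._finished.set_result(False)
            return
        # Without the cap above, this re-scheduling would continue indefinitely.
        self._handle = asyncio.get_running_loop().call_soon(self._step)
```

With `is_done` always returning False the future resolves to False once the cap is reached, which mirrors the trade-off noted in the commit: the loop stops burning CPU, but the unresponsive work is left behind for some other mechanism (here, worker recycling) to clean up.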
* fix-2360: prevent asyncio CPU spin loop after SSE client disconnect

Root cause: Fire-and-forget asyncio.create_task() patterns left orphaned tasks that caused anyio _deliver_cancellation to spin at 100% CPU per worker.

Changes:
- Add _respond_tasks dict to track respond tasks by session_id
- Cancel respond tasks explicitly before session cleanup in remove_session()
- Cancel all respond tasks during shutdown()
- Pass disconnect callback to SSE transport for defensive cleanup
- Convert database backend from fire-and-forget to structured concurrency

The fix ensures all asyncio tasks are properly tracked, cancelled on disconnect, and awaited to completion, preventing orphaned tasks from spinning the event loop.

Closes IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: additional fixes for CPU spin loop after SSE disconnect

Follow-up fixes based on testing and review:

1. Cancellation timeout escalation (Finding 1):
   - _cancel_respond_task() now escalates on timeout by calling transport.disconnect()
   - Retries cancellation after escalation
   - Always removes task from tracking to prevent buildup
2. Redis respond loop exit path (Finding 2):
   - Changed from infinite pubsub.listen() to timeout-based get_message() polling
   - Added session existence check - loop exits if session removed
   - Allows loop to exit even without cancellation
3. Generator finally block cleanup (Finding 3):
   - Added on_disconnect_callback() in event_generator() finally block
   - Covers: CancelledError, GeneratorExit, exceptions, and normal completion
   - Idempotent - safe if callback already ran from on_client_close
4. Added load-test-spin-detector make target:
   - Spike/drop pattern to stress test session cleanup
   - Docker stats monitoring at each phase
   - Color-coded output with pass/fail indicators
   - Log file output to /tmp

Closes IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: fix race condition in sse_endpoint and add stuck task tracking

Finding 1 (HIGH): Fixed race condition in sse_endpoint where respond task was created AFTER create_sse_response(). If client disconnected during response setup, the disconnect callback ran before the task existed, leaving it orphaned. Now matches utility_sse_endpoint ordering:
1. Compute user_with_token
2. Create and register respond task
3. Call create_sse_response()

Finding 2 (MEDIUM): Added _stuck_tasks dict to track tasks that couldn't be cancelled after escalation. Previously these were dropped from tracking entirely, losing visibility. Now they're moved to _stuck_tasks for monitoring and final cleanup during shutdown().

Updated tests to verify escalation behavior.

Closes IBM#2360
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add SSE failure cleanup, stuck task reaper, and full load test

Finding 1 (HIGH): Fixed orphaned respond task when create_sse_response() fails. Added try/except around create_sse_response() in both sse_endpoint and utility_sse_endpoint - on failure, calls remove_session() to clean up the task and session before re-raising.

Finding 2 (MEDIUM): Added stuck task reaper that runs every 30 seconds to:
- Remove completed tasks from _stuck_tasks
- Retry cancellation for still-stuck tasks
- Prevent memory leaks from tasks that eventually complete

Finding 3 (LOW): Added test for escalation path with fake transport to verify transport.disconnect() is called during escalation. Also added tests for the stuck task reaper lifecycle.

Also updated load-test-spin-detector to be a full-featured test matching load-test-ui with JWT auth, all user classes, entity ID fetching, and the same 4000-user baseline.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: improve load-test-spin-detector output and reduce cycle sizes

- Reduce logging level to WARNING to suppress noisy worker messages
- Only run entity fetching and cleanup on master/standalone nodes
- Reduce cycle sizes from 4000 to 1000 peak users for faster iteration
- Update banner to reflect new cycle pattern (500 -> 750 -> 1000)
- Remove verbose JWT token generation log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: address remaining CPU spin loop findings

Finding 1 (HIGH): Add explicit asyncio.CancelledError handling in SSE endpoints. In Python 3.8+, CancelledError inherits from BaseException, not Exception, so the previous except block wouldn't catch it. Now cleanup runs even when requests are cancelled during SSE handshake.

Finding 2 (MEDIUM): Add sleep(0.1) when Redis get_message returns None to prevent tight loop. The loop now has guaranteed minimum sleep even when Redis returns immediately in certain states.

Finding 3 (MEDIUM): Add _closing_sessions set to allow respond loops to exit early. remove_session() now marks the session as closing BEFORE attempting task cancellation, so the respond loop (Redis and DB backends) can exit immediately without waiting for the full cancellation timeout.

Finding 4 (LOW): Already addressed in previous commit with test test_cancel_respond_task_escalation_calls_transport_disconnect.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: make load-test-spin-detector run unlimited cycles

- Cycles now repeat indefinitely instead of stopping after 5
- Fixed log file path to /tmp/spin_detector.log for easy monitoring
- Added periodic summary every 5 cycles showing PASS/WARN/FAIL counts
- Cycle numbering now shows total count and pattern letter (e.g., "CYCLE 6 (A)")
- Banner shows monitoring command: tail -f /tmp/spin_detector.log

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add asyncio.CancelledError to SSE endpoint Raises docs

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* Linting

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: remove redundant asyncio.CancelledError handlers

CancelledError inherits from BaseException in Python 3.8+, so it won't be caught by 'except Exception' handlers. The explicit handlers were unnecessary and triggered pylint W0706 (try-except-raise).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: restore asyncio.CancelledError in Raises docs for inner handlers

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix-2360: add sleep on non-message Redis pubsub types to prevent spin

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(pubsub): replace blocking listen() with timeout-based get_message()

The blocking `async for message in pubsub.listen()` pattern doesn't respond to asyncio cancellation properly. When anyio's cancel scope tries to cancel tasks using this pattern, the tasks don't respond because the async iterator is blocked waiting for Redis messages. This causes anyio's `_deliver_cancellation` to continuously reschedule itself with `call_soon()`, creating a CPU spin loop that consumes 100% CPU per affected worker.

Changed to timeout-based polling pattern:
- Use `get_message(timeout=1.0)` with `asyncio.wait_for()`
- Loop allows cancellation check every ~1 second
- Added sleep on None/non-message responses to prevent edge case spins

Files fixed:
- mcpgateway/services/cancellation_service.py
- mcpgateway/services/event_service.py

Closes IBM#2360 (partial - additional spin sources may exist)
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(cleanup): add timeouts to __aexit__ calls to prevent CPU spin loops

The MCP session/transport __aexit__ methods can block indefinitely when internal tasks don't respond to cancellation. This causes anyio's _deliver_cancellation to spin in a tight loop, consuming ~800% CPU.

Root cause: When calling session.__aexit__() or transport.__aexit__(), they attempt to cancel internal tasks (like post_writer waiting on memory streams). If these tasks don't respond to CancelledError, anyio's cancel scope keeps calling call_soon() to reschedule _deliver_cancellation, creating a CPU spin loop.

Changes:
- Add SESSION_CLEANUP_TIMEOUT constant (5 seconds) to mcp_session_pool.py
- Wrap all __aexit__ calls in asyncio.wait_for() with timeout
- Add timeout to pubsub cleanup in session_registry.py and registry_cache.py
- Add timeout to streamable HTTP context cleanup in translate.py

This is a continuation of the fix for issue IBM#2360.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* feat(config): make session cleanup timeout configurable

Add MCP_SESSION_POOL_CLEANUP_TIMEOUT setting (default: 5.0 seconds) to control how long cleanup operations wait for session/transport __aexit__ calls to complete.

Clarification: This timeout does NOT affect tool execution time (which uses TOOL_TIMEOUT). It only affects cleanup of idle/released sessions to prevent CPU spin loops when internal tasks don't respond to cancel.

Changes:
- Add mcp_session_pool_cleanup_timeout to config.py
- Add MCP_SESSION_POOL_CLEANUP_TIMEOUT to .env.example with docs
- Add to charts/mcp-stack/values.yaml
- Update mcp_session_pool.py to use _get_cleanup_timeout() helper
- Update session_registry.py and registry_cache.py to use config
- Update translate.py to use config with fallback

When to adjust:
- Increase if you see frequent "cleanup timed out" warnings in logs
- Decrease for faster shutdown (at risk of resource leaks)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(sse): add deadline to cancel scope to prevent CPU spin loop

Fixes CPU spin loop (anyio#695) where _deliver_cancellation spins at 100% CPU when SSE task group tasks don't respond to cancellation.

Root cause: When an SSE connection ends, sse_starlette's task group tries to cancel all tasks. If a task (like _listen_for_disconnect waiting on receive()) doesn't respond to cancellation, anyio's _deliver_cancellation keeps rescheduling itself in a tight loop.

Fix: Override EventSourceResponse.__call__ to set a deadline on the cancel scope when cancellation starts. This ensures that if tasks don't respond within SSE_TASK_GROUP_CLEANUP_TIMEOUT (5 seconds), the scope times out instead of spinning indefinitely.

References:
- agronholm/anyio#695
- anthropics/claude-agent-sdk-python#378

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>

* fix(translate): use patched EventSourceResponse to prevent CPU spin

translate.py was importing EventSourceResponse directly from sse_starlette, bypassing the patched version in sse_transport.py that prevents the anyio _deliver_cancellation CPU spin loop (anyio#695).

This change ensures all SSE connections in the translate module (stdio-to-SSE bridge) also benefit from the cancel scope deadline fix.
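The timeout-based polling pattern from the fix(pubsub) commit above can be sketched roughly as follows. This is a minimal illustration, not the code in cancellation_service.py or event_service.py: `poll_pubsub`, `handle`, `should_stop`, and the `poll_timeout + 1.0` guard margin are assumptions made for the example. The only library call assumed is redis-py's asyncio `PubSub.get_message(ignore_subscribe_messages=..., timeout=...)`; any object exposing that coroutine will work.

```python
import asyncio


async def poll_pubsub(pubsub, handle, should_stop, poll_timeout=1.0, idle_sleep=0.1):
    """Poll a pub/sub object until should_stop() returns True or the task is cancelled.

    pubsub      -- object with an async get_message(ignore_subscribe_messages, timeout)
    handle      -- async callable invoked with each message payload (hypothetical)
    should_stop -- callable checked once per iteration, e.g. "is the session closing?"
    """
    while not should_stop():
        try:
            # Outer wait_for is a safety net in case get_message() itself hangs.
            message = await asyncio.wait_for(
                pubsub.get_message(ignore_subscribe_messages=True, timeout=poll_timeout),
                timeout=poll_timeout + 1.0,
            )
        except asyncio.TimeoutError:
            continue
        if message is None or message.get("type") != "message":
            # Sleep so an immediately-returning get_message() cannot cause a tight loop.
            await asyncio.sleep(idle_sleep)
            continue
        await handle(message["data"])
```

Because every iteration awaits something with a bounded duration, the loop reaches a cancellation point (and re-checks `should_stop`) at least about once per `poll_timeout`, which is the property the commit relies on.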
Summary
Fixes the CPU spin loop issue where Gunicorn workers consume 100%+ CPU when idle after load tests stop. The root cause was fire-and-forget `asyncio.create_task()` patterns leaving orphaned tasks in anyio's `_deliver_cancellation` spin loop.

Key fixes:
- `_respond_tasks` dict for proper lifecycle management
- `finally` block now also invokes disconnect callback

Load testing:
- Updated `make load-test-spin-detector` to be full-featured (JWT auth, all user classes, 4000-user baseline)

Closes #2360 - [BUG]: anyio cancel scope spin loop causes 100% CPU after load test stops
Closes #2357 - [BUG]: (sse): Granian CPU spikes to 800% after load stops, recovers when load resumes
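As an illustration of the `finally`-block fix listed under Key fixes, here is a minimal sketch of a generator that always runs its disconnect callback, paired with an idempotent wrapper so a second invocation is harmless. The names `make_idempotent` and `event_generator` (with a queue argument) are hypothetical stand-ins for this example, not the gateway's actual signatures.

```python
import asyncio


def make_idempotent(callback):
    """Wrap an async callback so that only the first call has any effect."""
    called = False

    async def wrapper():
        nonlocal called
        if called:
            return
        called = True
        await callback()

    return wrapper


async def event_generator(queue, on_disconnect):
    """Yield queued events; run on_disconnect however the generator ends."""
    try:
        while True:
            yield await queue.get()
    finally:
        # Runs on GeneratorExit (aclose), CancelledError, other exceptions,
        # and normal completion alike.
        await on_disconnect()
```

Since `asyncio.CancelledError` derives from `BaseException` in Python 3.8+, an `except Exception` clause would not see it; a `finally` block does, which is why the cleanup is placed there.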
Test plan
- `disconnect()` is called
- `make load-test-spin-detector` to verify CPU returns to idle during pause phases
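The escalation check in the test plan can be approximated with a small stand-alone sketch. It assumes a hypothetical helper `cancel_with_escalation` and a fake transport; the real `_cancel_respond_task()` in the gateway may differ in structure and naming. The helper cancels a task, waits up to `timeout`, and if the task is still pending it calls `transport.disconnect()` and retries once.

```python
import asyncio


async def cancel_with_escalation(task, transport, timeout=5.0):
    """Cancel task; on timeout, disconnect the transport and retry once.

    Returns True if the task finished, False if it is still stuck.
    """
    task.cancel()
    _, pending = await asyncio.wait({task}, timeout=timeout)
    if not pending:
        return True
    # Escalate: close the transport so the task's blocking wait can end.
    await transport.disconnect()
    task.cancel()
    _, pending = await asyncio.wait({task}, timeout=timeout)
    return not pending


class FakeTransport:
    """Test double that records disconnect() calls (hypothetical)."""

    def __init__(self):
        self.calls = 0
        self.closed = asyncio.Event()

    async def disconnect(self):
        self.calls += 1
        self.closed.set()


async def stubborn_respond(transport):
    """Ignores cancellation until the transport has been closed."""
    while not transport.closed.is_set():
        try:
            await transport.closed.wait()
        except asyncio.CancelledError:
            pass
```

`asyncio.wait()` is used rather than `asyncio.wait_for()` so that a timeout neither raises nor cancels the task a second time; a caller could move a task that is still pending after the retry into a separate collection for later inspection.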