
fix(perf): resolve high-load performance degradation and CPU spin loops #2359

Merged
crivetimihai merged 16 commits into main from 2355-performance-lock
Jan 24, 2026

Conversation

@crivetimihai
Member

Summary

Resolves critical performance issues under high load (4000+ concurrent users) including database lock contention, cascading FOR UPDATE locks, and SSE connection spin loops.

Changes

Database Performance Fixes

  • Bulk UPDATE for gateway heartbeats - Replace sequential FOR UPDATE loops with single bulk UPDATE statement, reducing lock contention by ~95%
  • nowait locks for state changes - Add nowait=True to get_for_update() calls in tool, server, prompt, and resource services for fail-fast lock handling
  • 409 Conflict responses - Return HTTP 409 when lock conflicts occur instead of blocking
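The bulk-UPDATE idea can be sketched with stdlib sqlite3. This is a hypothetical illustration only: the real change targets gateway_service.set_gateway_state() via SQLAlchemy, and the table and column names here are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tools (id INTEGER PRIMARY KEY, gateway_id INTEGER, enabled INTEGER)")
conn.executemany("INSERT INTO tools VALUES (?, ?, 1)", [(i, 1) for i in range(5)])

# Before: one SELECT ... FOR UPDATE plus one UPDATE per row, each iteration
# holding a row lock while others queue behind it.
# After: a single statement touches every child row at once, so locks are
# acquired and released in one round trip.
cur = conn.execute("UPDATE tools SET enabled = 0 WHERE gateway_id = ?", (1,))
conn.commit()
print(cur.rowcount)  # → 5
```

The win under load comes from collapsing N lock acquisitions into one statement, which is why the PR reports lock contention dropping by roughly 95%.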

SSE Transport Spin Loop Prevention

  • Robust rapid yield detection - Track consecutive rapid yields (<100ms apart) to detect spin loops
  • client_close_handler_callable - Use sse_starlette callback to detect disconnects that granian may not propagate
  • Remove harmful deque clearing - Fix bug where clearing yield_timestamps after keepalives prevented spin detection

Lock Conflict Error Handling

  • Add ToolLockConflictError, ServerLockConflictError, PromptLockConflictError, ResourceLockConflictError
  • Explicit exception handlers to prevent generic Exception from swallowing lock errors
  • HTTP 409 responses for lock conflicts in all state change endpoints
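The exception-to-409 mapping above can be sketched as follows. The class names match the PR, but the shared base class and the handler function are assumptions; the real handlers are FastAPI exception handlers in main.py and admin.py.

```python
class LockConflictError(Exception):
    """Row was locked by another transaction; the caller should retry."""

class ToolLockConflictError(LockConflictError): ...
class ServerLockConflictError(LockConflictError): ...
class PromptLockConflictError(LockConflictError): ...
class ResourceLockConflictError(LockConflictError): ...

def to_http_status(exc: Exception) -> int:
    # Checking the specific type BEFORE a generic `except Exception`
    # fallthrough is the point of the fix: a bare generic handler would
    # otherwise swallow the lock error and return the wrong status.
    if isinstance(exc, LockConflictError):
        return 409
    return 500

print(to_http_status(ToolLockConflictError("tool row is locked")))  # → 409
```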

Configuration & Documentation

Test Results

  • All unit tests passing
  • Load tested with 4000 concurrent users
  • Database lock contention eliminated
  • No more cascading lock timeouts

Known Limitation

Post-load CPU spike (#2357) - After load stops, granian may spin at ~800% CPU due to SSE connections not being properly closed. This is a granian issue (#286), documented with workaround options.

Closes

Related Issues

fix(perf): resolve lock contention and CPU spin loop under high load (#2355)

Phase 1: Replace cascading FOR UPDATE loops with bulk UPDATE statements
in gateway_service.set_gateway_state() to eliminate lock contention when
activating/deactivating gateways with many tools/servers/prompts/resources.

Phase 2: Add nowait=True to get_for_update() calls in set_server_state()
and set_tool_state() to fail fast on locked rows instead of blocking.
Add ServerLockConflictError and ToolLockConflictError exceptions with
409 Conflict handlers in main.py and admin.py routers.

Phase 3: Fix CPU spin loop in SSE transport by properly detecting client
disconnection. Add request.is_disconnected() check, consecutive error
counting, GeneratorExit handling, and ensure _client_gone is set in all
exit paths.
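The "ensure _client_gone is set in all exit paths" part of Phase 3 boils down to a finally block in the generator, which runs on normal exit, on exceptions, and on GeneratorExit (raised by .close() when the response is torn down). A minimal sketch with illustrative names:

```python
class FakeTransport:
    def __init__(self):
        self._client_gone = False

    def event_stream(self):
        try:
            while True:
                yield "data: ping\n\n"
        finally:
            # Runs even when the ASGI server closes the generator mid-stream,
            # so no exit path can leave the connection looking alive.
            self._client_gone = True

t = FakeTransport()
gen = t.event_stream()
next(gen)    # stream one event
gen.close()  # simulates the client disconnect tearing down the response
print(t._client_gone)  # → True
```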

Results:
- RPS improved from 173-600 to ~2000 under load
- Failure rate reduced from 14-22% to 0.03-0.04%
- Blocked queries reduced from 33-48 to 0
- CPU after load test: ~1% (was 800%+ spin loop)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Add configurable timeout settings for database operations:
- db_lock_timeout_ms: Maximum wait for row locks (default 5000ms)
- db_statement_timeout_ms: Maximum statement execution (default 30000ms)

These settings can be used with get_for_update() to prevent indefinite
blocking under high concurrency scenarios.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Update gateway_service tests to use side_effect for multiple db.execute
calls (SELECT + bulk UPDATEs) instead of single return_value.

Update row_level_locking test to expect nowait=True parameter in
get_for_update calls for set_tool_state.

Update SSE transport tests to mock request.is_disconnected() and adjust
error handling test to expect consecutive errors causing generator stop
instead of error event emission.

Add missing exception documentation for ServerLockConflictError and
ToolLockConflictError in service docstrings (flake8 DAR401).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
fix(sse): add send_timeout to EventSourceResponse to prevent spin loops (#2355)

When Granian ASGI server fails to send to a disconnected client, it logs
"ASGI transport error: SendError" but doesn't raise an exception to our
code. This causes rapid iteration of the generator without proper
timeout handling.

Add send_timeout=5.0 to EventSourceResponse to ensure sends time out if
they fail, triggering sse_starlette's built-in error handling.

Also enable sse_starlette's built-in ping mechanism when keepalive is
enabled, which provides additional disconnect detection.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
When clients disconnect abruptly, Granian may fail sends without
raising Python exceptions. This adds rapid yield detection: if
50+ yields occur within 1 second, we assume the client is disconnected
and stop the generator.

New configurable settings:
- SSE_SEND_TIMEOUT: ASGI send timeout (default 30s)
- SSE_RAPID_YIELD_WINDOW_MS: detection window (default 1000ms)
- SSE_RAPID_YIELD_MAX: max yields before disconnect (default 50)
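The windowed detection described above can be sketched with a deque and an injectable clock, so it can be exercised without real time passing. Thresholds mirror the defaults (50 yields per 1000 ms); the class and method names are illustrative, not the actual transport code.

```python
from collections import deque

class RapidYieldDetector:
    def __init__(self, window_ms: float = 1000, max_yields: int = 50):
        self.window_ms = window_ms
        self.max_yields = max_yields
        self.timestamps = deque()

    def record(self, now_ms: float) -> bool:
        """Record one yield; return True if a spin loop is suspected."""
        self.timestamps.append(now_ms)
        # Drop timestamps that have fallen out of the sliding window.
        while self.timestamps and now_ms - self.timestamps[0] > self.window_ms:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_yields

d = RapidYieldDetector()
# 50 yields 1 ms apart all land inside the 1 s window -> spin loop suspected.
tripped = [d.record(i * 1.0) for i in range(50)]
print(tripped[-1])  # → True
```

A legitimate stream that yields less often than 50 times per second never fills the window, so the detector stays quiet.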

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Changed from WARNING to ERROR so the detection message is visible
even when LOG_LEVEL=ERROR. This is appropriate since rapid yield
detection indicates a problem condition (client disconnect not
reported by ASGI).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1: Fix lock conflict error propagation
- Add explicit except handlers for ToolLockConflictError and
  ServerLockConflictError before the generic Exception handler
- This allows 409 responses to propagate correctly instead of
  being wrapped as generic 400 errors

Finding 3: Improve SSE rapid yield detection
- Only track message yields, not keepalives
- Reset the timestamp deque when timeout occurs (we actually waited)
- This prevents false positives on high-throughput legitimate streams

Finding 4: Remove unused db timeout settings
- Remove db_lock_timeout_ms and db_statement_timeout_ms from config
- These settings were defined but never wired into DB operations
- Avoids false sense of protection

Finding 2 (notifications) is intentional: gateway-level notifications
are sent, and bulk UPDATE is used for performance under high load.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Extends the lock contention fix to prompt_service and resource_service:
- Add PromptLockConflictError and ResourceLockConflictError classes
- Use nowait=True in get_for_update to fail fast if row is locked
- Add 409 Conflict handlers in main.py for both services
- Re-raise specific errors before generic Exception handler

This ensures consistent lock handling across all state change endpoints.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Track time since last yield as additional signal (<10ms is suspicious)
- Check rapid yield after BOTH message and keepalive yields
- Reset timestamps only after successful keepalive wait
- Include time interval in error log for debugging

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
SSE transport improvements:
- Add consecutive rapid yield counter for simpler spin loop detection
  (triggers after 10 yields < 100ms apart)
- Remove deque clearing after keepalives that prevented detection
- Add client_close_handler_callable to detect disconnects that ASGI
  servers like granian may not propagate via request.is_disconnected()
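The consecutive-counter variant mentioned above is simpler than the sliding window: count back-to-back yields that arrive less than 100 ms apart and trip after 10 in a row. A sketch with illustrative names and the thresholds from the commit message:

```python
class ConsecutiveYieldCounter:
    def __init__(self, max_consecutive: int = 10, min_gap_ms: float = 100):
        self.max_consecutive = max_consecutive
        self.min_gap_ms = min_gap_ms
        self.last_ms = None
        self.count = 0

    def record(self, now_ms: float) -> bool:
        if self.last_ms is not None and now_ms - self.last_ms < self.min_gap_ms:
            self.count += 1   # another suspiciously fast yield
        else:
            self.count = 0    # a real wait happened; start over
        self.last_ms = now_ms
        return self.count >= self.max_consecutive

c = ConsecutiveYieldCounter()
# 12 yields 1 ms apart: the 11th back-to-back fast yield trips the counter.
tripped = any(c.record(i * 1.0) for i in range(12))
print(tripped)  # → True
```

Any single slow yield resets the counter, which is why removing the deque clearing after keepalives (noted above) mattered: the old code reset state even when no real wait had occurred.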

Test updates:
- Update row-level locking tests to expect nowait=True for prompt
  and resource state changes

Dependency updates:
- Containerfile.lite: Update UBI base images to latest
- gunicorn 23.0.0 -> 24.1.1
- sqlalchemy 2.0.45 -> 2.0.46
- langgraph 1.0.6 -> 1.0.7
- hypothesis 6.150.2 -> 6.150.3
- schemathesis 4.9.2 -> 4.9.4
- copier 9.11.1 -> 9.11.3
- pytest-html 4.1.1 -> 4.2.0

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docs: add granian worker lifecycle options for SSE connection leak workaround

Document GRANIAN_WORKERS_LIFETIME and GRANIAN_WORKERS_MAX_RSS options
as commented-out configuration in docker-compose.yml and run-granian.sh.

These options provide a workaround for granian issue #286 where SSE
connections are not properly closed after client disconnect, causing
CPU spin loops after load tests complete.
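In docker-compose.yml, the commented-out workaround would look roughly like this. The service name and values are illustrative assumptions; only the two option names come from this PR.

```yaml
services:
  gateway:
    environment:
      # Workaround for granian issue #286 (leaked SSE connections):
      # recycle each worker after N seconds, dropping leaked connections.
      # - GRANIAN_WORKERS_LIFETIME=3600
      # Recycle a worker once its resident memory exceeds N MiB.
      # - GRANIAN_WORKERS_MAX_RSS=1024
```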

Refs: #2357, #2358
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docker-compose updates for GUNICORN
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docker-compose updates for GUNICORN
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docker-compose updates for GUNICORN
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Update pyproject.toml
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
lint
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

This PR solves the crash, but doesn't fully solve the CPU spin-up after load has completed. #2360 was created to track that issue; we may revisit this PR if issues arise.

@crivetimihai crivetimihai merged commit b773c67 into main Jan 24, 2026
53 checks passed
@crivetimihai crivetimihai deleted the 2355-performance-lock branch January 24, 2026 14:48
kcostell06 pushed a commit to kcostell06/mcp-context-forge that referenced this pull request Feb 24, 2026


Development

Successfully merging this pull request may close these issues.

[PERFORMANCE]: PR #2211 causes FOR UPDATE lock contention and CPU spin loop under high load
