
fix(perf): resolve high-load performance degradation and CPU spin loops #2359

Merged
crivetimihai merged 16 commits into main from 2355-performance-lock
Jan 24, 2026

Conversation

@crivetimihai
Member

Summary

Resolves critical performance issues under high load (4000+ concurrent users) including database lock contention, cascading FOR UPDATE locks, and SSE connection spin loops.

Changes

Database Performance Fixes

  • Bulk UPDATE for gateway heartbeats - Replace sequential FOR UPDATE loops with single bulk UPDATE statement, reducing lock contention by ~95%
  • nowait locks for state changes - Add nowait=True to get_for_update() calls in tool, server, prompt, and resource services for fail-fast lock handling
  • 409 Conflict responses - Return HTTP 409 when lock conflicts occur instead of blocking
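The bulk-UPDATE idea can be sketched with stdlib sqlite3. This is a hypothetical illustration only: the real change targets gateway_service.set_gateway_state() via SQLAlchemy, and the table and column names here are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tools (id INTEGER PRIMARY KEY, gateway_id INTEGER, enabled INTEGER)")
conn.executemany("INSERT INTO tools VALUES (?, ?, 1)", [(i, 1) for i in range(5)])

# Before: one SELECT ... FOR UPDATE plus one UPDATE per row, each iteration
# holding a row lock while others queue behind it.
# After: a single statement touches every child row at once, so locks are
# acquired and released in one round trip.
cur = conn.execute("UPDATE tools SET enabled = 0 WHERE gateway_id = ?", (1,))
conn.commit()
print(cur.rowcount)  # → 5
```

The win under load comes from collapsing N lock acquisitions into one statement, which is why the PR reports lock contention dropping by roughly 95%.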

SSE Transport Spin Loop Prevention

  • Robust rapid yield detection - Track consecutive rapid yields (<100ms apart) to detect spin loops
  • client_close_handler_callable - Use sse_starlette callback to detect disconnects that granian may not propagate
  • Remove harmful deque clearing - Fix bug where clearing yield_timestamps after keepalives prevented spin detection

Lock Conflict Error Handling

  • Add ToolLockConflictError, ServerLockConflictError, PromptLockConflictError, ResourceLockConflictError
  • Explicit exception handlers to prevent generic Exception from swallowing lock errors
  • HTTP 409 responses for lock conflicts in all state change endpoints
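The exception-to-409 mapping above can be sketched as follows. The class names match the PR, but the shared base class and the handler function are assumptions; the real handlers are FastAPI exception handlers in main.py and admin.py.

```python
class LockConflictError(Exception):
    """Row was locked by another transaction; the caller should retry."""

class ToolLockConflictError(LockConflictError): ...
class ServerLockConflictError(LockConflictError): ...
class PromptLockConflictError(LockConflictError): ...
class ResourceLockConflictError(LockConflictError): ...

def to_http_status(exc: Exception) -> int:
    # Checking the specific type BEFORE a generic `except Exception`
    # fallthrough is the point of the fix: a bare generic handler would
    # otherwise swallow the lock error and return the wrong status.
    if isinstance(exc, LockConflictError):
        return 409
    return 500

print(to_http_status(ToolLockConflictError("tool row is locked")))  # → 409
```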

Configuration & Documentation

Test Results

  • All unit tests passing
  • Load tested with 4000 concurrent users
  • Database lock contention eliminated
  • No more cascading lock timeouts

Known Limitation

Post-load CPU spike (#2357) - After load stops, granian may spin at ~800% CPU due to SSE connections not being properly closed. This is a granian issue (#286), documented with workaround options.

Closes

Related Issues

fix(perf): resolve lock contention and CPU spin loop under high load (#2355)

Phase 1: Replace cascading FOR UPDATE loops with bulk UPDATE statements
in gateway_service.set_gateway_state() to eliminate lock contention when
activating/deactivating gateways with many tools/servers/prompts/resources.

Phase 2: Add nowait=True to get_for_update() calls in set_server_state()
and set_tool_state() to fail fast on locked rows instead of blocking.
Add ServerLockConflictError and ToolLockConflictError exceptions with
409 Conflict handlers in main.py and admin.py routers.

Phase 3: Fix CPU spin loop in SSE transport by properly detecting client
disconnection. Add request.is_disconnected() check, consecutive error
counting, GeneratorExit handling, and ensure _client_gone is set in all
exit paths.
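The "ensure _client_gone is set in all exit paths" part of Phase 3 boils down to a finally block in the generator, which runs on normal exit, on exceptions, and on GeneratorExit (raised by .close() when the response is torn down). A minimal sketch with illustrative names:

```python
class FakeTransport:
    def __init__(self):
        self._client_gone = False

    def event_stream(self):
        try:
            while True:
                yield "data: ping\n\n"
        finally:
            # Runs even when the ASGI server closes the generator mid-stream,
            # so no exit path can leave the connection looking alive.
            self._client_gone = True

t = FakeTransport()
gen = t.event_stream()
next(gen)    # stream one event
gen.close()  # simulates the client disconnect tearing down the response
print(t._client_gone)  # → True
```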

Results:
- RPS improved from 173-600 to ~2000 under load
- Failure rate reduced from 14-22% to 0.03-0.04%
- Blocked queries reduced from 33-48 to 0
- CPU after load test: ~1% (was 800%+ spin loop)

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Add configurable timeout settings for database operations:
- db_lock_timeout_ms: Maximum wait for row locks (default 5000ms)
- db_statement_timeout_ms: Maximum statement execution (default 30000ms)

These settings can be used with get_for_update() to prevent indefinite
blocking under high concurrency scenarios.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Update gateway_service tests to use side_effect for multiple db.execute
calls (SELECT + bulk UPDATEs) instead of single return_value.

Update row_level_locking test to expect nowait=True parameter in
get_for_update calls for set_tool_state.

Update SSE transport tests to mock request.is_disconnected() and adjust
error handling test to expect consecutive errors causing generator stop
instead of error event emission.

Add missing exception documentation for ServerLockConflictError and
ToolLockConflictError in service docstrings (flake8 DAR401).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
fix(sse): add send_timeout to EventSourceResponse to prevent spin loops (#2355)

When Granian ASGI server fails to send to a disconnected client, it logs
"ASGI transport error: SendError" but doesn't raise an exception to our
code. This causes rapid iteration of the generator without proper
timeout handling.

Add send_timeout=5.0 to EventSourceResponse to ensure sends time out if
they fail, triggering sse_starlette's built-in error handling.

Also enable sse_starlette's built-in ping mechanism when keepalive is
enabled, which provides additional disconnect detection.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
When clients disconnect abruptly, Granian may fail sends without
raising Python exceptions. This adds rapid yield detection: if
50+ yields occur within 1 second, we assume the client is disconnected
and stop the generator.

New configurable settings:
- SSE_SEND_TIMEOUT: ASGI send timeout (default 30s)
- SSE_RAPID_YIELD_WINDOW_MS: detection window (default 1000ms)
- SSE_RAPID_YIELD_MAX: max yields before disconnect (default 50)
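The windowed detection described above can be sketched with a deque and an injectable clock, so it can be exercised without real time passing. Thresholds mirror the defaults (50 yields per 1000 ms); the class and method names are illustrative, not the actual transport code.

```python
from collections import deque

class RapidYieldDetector:
    def __init__(self, window_ms: float = 1000, max_yields: int = 50):
        self.window_ms = window_ms
        self.max_yields = max_yields
        self.timestamps = deque()

    def record(self, now_ms: float) -> bool:
        """Record one yield; return True if a spin loop is suspected."""
        self.timestamps.append(now_ms)
        # Drop timestamps that have fallen out of the sliding window.
        while self.timestamps and now_ms - self.timestamps[0] > self.window_ms:
            self.timestamps.popleft()
        return len(self.timestamps) >= self.max_yields

d = RapidYieldDetector()
# 50 yields 1 ms apart all land inside the 1 s window -> spin loop suspected.
tripped = [d.record(i * 1.0) for i in range(50)]
print(tripped[-1])  # → True
```

A legitimate stream that yields less often than 50 times per second never fills the window, so the detector stays quiet.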

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Changed from WARNING to ERROR so the detection message is visible
even when LOG_LEVEL=ERROR. This is appropriate since rapid yield
detection indicates a problem condition (client disconnect not
reported by ASGI).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Finding 1: Fix lock conflict error propagation
- Add explicit except handlers for ToolLockConflictError and
  ServerLockConflictError before the generic Exception handler
- This allows 409 responses to propagate correctly instead of
  being wrapped as generic 400 errors

Finding 3: Improve SSE rapid yield detection
- Only track message yields, not keepalives
- Reset the timestamp deque when timeout occurs (we actually waited)
- This prevents false positives on high-throughput legitimate streams

Finding 4: Remove unused db timeout settings
- Remove db_lock_timeout_ms and db_statement_timeout_ms from config
- These settings were defined but never wired into DB operations
- Avoids false sense of protection

Finding 2 (notifications) is intentional: gateway-level notifications
are sent, and bulk UPDATE is used for performance under high load.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Extends the lock contention fix to prompt_service and resource_service:
- Add PromptLockConflictError and ResourceLockConflictError classes
- Use nowait=True in get_for_update to fail fast if row is locked
- Add 409 Conflict handlers in main.py for both services
- Re-raise specific errors before generic Exception handler

This ensures consistent lock handling across all state change endpoints.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- Track time since last yield as additional signal (<10ms is suspicious)
- Check rapid yield after BOTH message and keepalive yields
- Reset timestamps only after successful keepalive wait
- Include time interval in error log for debugging

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
SSE transport improvements:
- Add consecutive rapid yield counter for simpler spin loop detection
  (triggers after 10 yields < 100ms apart)
- Remove deque clearing after keepalives that prevented detection
- Add client_close_handler_callable to detect disconnects that ASGI
  servers like granian may not propagate via request.is_disconnected()
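The consecutive-counter variant mentioned above is simpler than the sliding window: count back-to-back yields that arrive less than 100 ms apart and trip after 10 in a row. A sketch with illustrative names and the thresholds from the commit message:

```python
class ConsecutiveYieldCounter:
    def __init__(self, max_consecutive: int = 10, min_gap_ms: float = 100):
        self.max_consecutive = max_consecutive
        self.min_gap_ms = min_gap_ms
        self.last_ms = None
        self.count = 0

    def record(self, now_ms: float) -> bool:
        if self.last_ms is not None and now_ms - self.last_ms < self.min_gap_ms:
            self.count += 1   # another suspiciously fast yield
        else:
            self.count = 0    # a real wait happened; start over
        self.last_ms = now_ms
        return self.count >= self.max_consecutive

c = ConsecutiveYieldCounter()
# 12 yields 1 ms apart: the 11th back-to-back fast yield trips the counter.
tripped = any(c.record(i * 1.0) for i in range(12))
print(tripped)  # → True
```

Any single slow yield resets the counter, which is why removing the deque clearing after keepalives (noted above) mattered: the old code reset state even when no real wait had occurred.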

Test updates:
- Update row-level locking tests to expect nowait=True for prompt
  and resource state changes

Dependency updates:
- Containerfile.lite: Update UBI base images to latest
- gunicorn 23.0.0 -> 24.1.1
- sqlalchemy 2.0.45 -> 2.0.46
- langgraph 1.0.6 -> 1.0.7
- hypothesis 6.150.2 -> 6.150.3
- schemathesis 4.9.2 -> 4.9.4
- copier 9.11.1 -> 9.11.3
- pytest-html 4.1.1 -> 4.2.0

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docs: add granian worker lifecycle options for SSE connection leak workaround

Document GRANIAN_WORKERS_LIFETIME and GRANIAN_WORKERS_MAX_RSS options
as commented-out configuration in docker-compose.yml and run-granian.sh.

These options provide a workaround for granian issue #286 where SSE
connections are not properly closed after client disconnect, causing
CPU spin loops after load tests complete.
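In docker-compose.yml, the commented-out workaround would look roughly like this. The service name and values are illustrative assumptions; only the two option names come from this PR.

```yaml
services:
  gateway:
    environment:
      # Workaround for granian issue #286 (leaked SSE connections):
      # recycle each worker after N seconds, dropping leaked connections.
      # - GRANIAN_WORKERS_LIFETIME=3600
      # Recycle a worker once its resident memory exceeds N MiB.
      # - GRANIAN_WORKERS_MAX_RSS=1024
```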

Refs: #2357, #2358
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docker-compose updates for GUNICORN
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docker-compose updates for GUNICORN
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
docker-compose updates for GUNICORN
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Update pyproject.toml
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
lint
Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
@crivetimihai
Member Author

This PR solves the crash, but doesn't fully solve the CPU spin-up after load has completed. #2360 was created to track that issue; we may revisit this PR if issues arise.

@crivetimihai crivetimihai merged commit b773c67 into main Jan 24, 2026
53 checks passed
@crivetimihai crivetimihai deleted the 2355-performance-lock branch January 24, 2026 14:48
kcostell06 pushed a commit to kcostell06/mcp-context-forge that referenced this pull request Feb 24, 2026


Development

Successfully merging this pull request may close these issues.

[PERFORMANCE]: PR #2211 causes FOR UPDATE lock contention and CPU spin loop under high load
