
[PERFORMANCE]: PR #2211 causes FOR UPDATE lock contention and CPU spin loop under high load #2355

@crivetimihai

Description


Summary

Three PRs were identified that cause severe performance degradation and a CPU spin loop under high concurrent load:

  1. PR #2211 ("Synchronize Gateway Activation State Across Tools, Prompts, and Resources", 96db39855) - introduced cascading FOR UPDATE lock contention
  2. PR #2253 ("Resolve server deactivation error for servers with email teams", 591644d78) - WORST: 22.5% failures, 381 connections stuck for 3.7 minutes
  3. PR #2170 ("1865_logging-cpu-optimization", 2b9256737) - 14.3% failures, full CPU spin loop

Environment

  • 3 gateway replicas (Granian or Gunicorn - both affected)
  • PostgreSQL with PgBouncer
  • nginx load balancer
  • 4000 concurrent users (Locust load test)

Binary Search Test Results

2026-01-23 #29 5757e7c9f Linting and doc updates ← KNOWN BAD
    │
    ├── 2026-01-22 #28 2b9256737 1865_logging-cpu-optimization (#2170)
    │       ↑
    │       └── ❌ TESTED: HARD CRASH
    │           └── RPS: 173-600, Failures: 14.3%, Blocked: 48 (111s), Idle-in-Tx: 140 (246s)
    │           └── CPU after load: 391-777% (SPIN LOOP)
    │
    ├── 2026-01-21 #20 591644d78 Resolve server deactivation error (#2253)
    │       ↑
    │       └── ❌ TESTED: **WORST OF ALL!**
    │           └── RPS: 180-1586, Failures: **22.5%**, Blocked: 48 (120s), Idle-in-Tx: **381 (222s!)**
    │           └── CPU after load: 97-295% (elevated, slowly recovering)
    │           └── ⚠️ CATASTROPHIC: 381 connections stuck for 3.7 minutes!
    │
    ├── 2026-01-21 #16 96db39855 fix(gateway): synchronize activation state (#2211)
    │       ↑
    │       └── ❌ TESTED: Performance regression + CPU spin loop
    │           └── RPS: 1200-1700 (↓45%), Failures: 0.8%, Blocked: 33 (104s), Idle-in-Tx: 70 (106s)
    │           └── CPU after load: 810-830% (SPIN LOOP)
    │           └── ⚠️ INTRODUCED: FOR UPDATE lock contention
    │
    ├── 2026-01-21 #15 ff4011699 fix(perf): resolve N+1 queries (#2204)
    │       ↑
    │       └── ✅ TESTED: Healthy
    │           └── RPS: 2000-2200, Failures: 0.1%, Blocked: 0, Idle-in-Tx: 50-70 (cycling OK)
    │           └── CPU after load: 1-2% (normal)
    │
2026-01-19 #0  5c33f658 Baseline ← KNOWN GOOD
    │       ↑
    │       └── ✅ TESTED: Stable
    │           └── RPS: 2370+, Failures: ~0%, Blocked: 0, Idle-in-Tx: 0
    │           └── CPU after load: 1-2% (normal)

Summary Table

| Commit | RPS | Fail% | Blocked Queries | Idle-in-Tx | CPU After Load | Status |
|--------|-----|-------|-----------------|------------|----------------|--------|
| #0 5c33f658 Baseline | 2370+ | ~0% | 0 | 0 | 1-2% | ✅ GOOD |
| #15 ff4011699 N+1 fix (#2204) | 2000-2200 | 0.1% | 0 | 50-70 | 1-2% | ✅ GOOD |
| #16 96db39855 State sync (#2211) | 1200-1700 | 0.8% | 33 (104s) | 70 (106s) | 810-830% | ❌ BAD |
| #20 591644d78 Server deact (#2253) | 180-1586 | 22.5% | 48 (120s) | 381 (222s!) | 97-295% | WORST |
| #28 2b9256737 Logging opt (#2170) | 173-600 | 14.3% | 48 (111s) | 140 (246s) | 391-777% | ❌ CRASH |

Root Cause #1: PR #2211 - Cascading FOR UPDATE Locks

PR #2211 added cascading state updates that acquire FOR UPDATE locks on every related entity:

# gateway_service.py - set_gateway_state() now does:
async def set_gateway_state(...):
    gateway = get_for_update(db, gateway_id)  # Lock #1: gateway row
    
    for tool in tools:
        await tool_service.set_tool_state(...)  # Lock #2-N: each tool row
    
    # NEW in PR #2211:
    for prompt in prompts:
        await prompt_service.set_prompt_state(...)  # Lock #N+1 to N+M: each prompt
    
    for resource in resources:
        await resource_service.set_resource_state(...)  # Lock #N+M+1 to N+M+K: each resource

Impact: a single gateway state change now acquires 1 + N_tools + N_prompts + N_resources FOR UPDATE locks sequentially (one per related row), causing massive lock contention under concurrent load.
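A contributing hazard with sequential per-row locking is that two concurrent gateway updates can touch overlapping rows in different orders and deadlock (or queue behind each other for the full lock chain). A minimal stdlib-only simulation of the standard mitigation - always acquiring row locks in one global order, e.g. sorted by primary key (hypothetical names, not the project's code):

```python
import threading

# Hypothetical stand-ins for the row locks that FOR UPDATE would take.
row_locks = {row_id: threading.Lock() for row_id in range(4)}

def update_rows(row_ids):
    """Acquire per-row locks in sorted order so no two workers can deadlock."""
    acquired = []
    try:
        for row_id in sorted(row_ids):  # consistent global ordering
            row_locks[row_id].acquire()
            acquired.append(row_id)
        # ... perform the state update on the locked rows here ...
    finally:
        for row_id in reversed(acquired):
            row_locks[row_id].release()

# Without the sort, these two workers (opposite traversal order over the
# same rows) could deadlock; with it, one simply waits for the other.
t1 = threading.Thread(target=update_rows, args=([0, 1, 2],))
t2 = threading.Thread(target=update_rows, args=([2, 1, 0],))
t1.start(); t2.start()
t1.join(); t2.join()
print("both workers finished")
```

This does not remove the contention that PR #2211 introduced, but it turns a potential deadlock into plain waiting, which the bulk-UPDATE fix below then shortens.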


Root Cause #2: PR #2253 - Server Deactivation (WORST)

PR #2253 ("Resolve server deactivation error for servers with email teams") caused the worst degradation:

  • 381 idle-in-transaction connections stuck for 3.7 minutes
  • 22.5% failure rate (worst of all tested commits)
  • RPS oscillates wildly between 180-1586

This PR modifies server_service.py and likely adds additional database operations that compound the lock contention from PR #2211.


Root Cause #3: PR #2170 - Logging Changes

PR #2170 ("logging-cpu-optimization") also worsens the problem:

  • 14.3% failure rate
  • Full CPU spin loop (391-777%) after load ends
  • Idle-in-transaction connections stuck for 4+ minutes

Suggested Fixes

Immediate (PR #2211)

  1. Replace sequential FOR UPDATE with bulk UPDATE

    # Instead of:
    for prompt in prompts:
        await prompt_service.set_prompt_state(db, prompt.id, activate)
    
    # Use:
    from sqlalchemy import update
    db.execute(
        update(DbPrompt)
        .where(DbPrompt.gateway_id == gateway_id)
        .values(enabled=activate)
    )
  2. Use skip_locked=True for non-critical state updates

    stmt = stmt.with_for_update(skip_locked=True)
    # Return 409 Conflict if row is locked instead of waiting
  3. Add a statement timeout

    from sqlalchemy import text
    db.execute(text("SET LOCAL statement_timeout = '5s'"))
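The spirit of fix #1 - one set-based statement instead of N row-at-a-time updates - can be demonstrated end to end with plain SQL (`sqlite3` here purely so the sketch is runnable; the production path would go through SQLAlchemy as shown above, with hypothetical table/column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prompts (id INTEGER PRIMARY KEY, gateway_id INT, enabled INT)")
conn.executemany("INSERT INTO prompts VALUES (?, ?, 1)", [(i, i % 2) for i in range(10)])

# One bulk UPDATE: a single statement and a single, brief lock footprint,
# instead of one round-trip plus one row lock per prompt.
cur = conn.execute("UPDATE prompts SET enabled = 0 WHERE gateway_id = ?", (1,))
conn.commit()
print(cur.rowcount)  # prints 5: all matching prompts flipped in one statement
```

The database still locks the affected rows, but only for the duration of one statement rather than holding earlier locks while later rows are fetched, updated, and awaited one by one.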

Short-term

  1. Investigate PR #2253 ("Resolve server deactivation error for servers with email teams") - why does server deactivation leave 381 connections stuck in transaction?

  2. Investigate PR #2170 ("1865_logging-cpu-optimization") - why do the logging changes lengthen idle-in-transaction times?

  3. Add database lock timeout

    SET lock_timeout = '5s';

Long-term

  1. Use optimistic locking with version columns instead of FOR UPDATE
  2. Implement proper transaction timeouts in async context
  3. Add circuit breaker for database operations under high load
  4. Create load test regression suite to catch similar issues
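The long-term idea in (1) can be sketched concretely: with a version column, the UPDATE only succeeds if the row is unchanged since it was read, so no FOR UPDATE lock is held at all (`sqlite3` for runnability; names are illustrative - SQLAlchemy offers the same pattern via its `version_id_col` mapper option):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gateways (id INTEGER PRIMARY KEY, enabled INT, version INT)")
conn.execute("INSERT INTO gateways VALUES (1, 1, 0)")

def set_gateway_state(conn, gateway_id, enabled, expected_version):
    """Optimistic update: succeeds only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE gateways SET enabled = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (enabled, gateway_id, expected_version),
    )
    conn.commit()
    # False means a concurrent writer won the race; caller retries or returns 409.
    return cur.rowcount == 1

print(set_gateway_state(conn, 1, 0, expected_version=0))  # True: first writer wins
print(set_gateway_state(conn, 1, 1, expected_version=0))  # False: version is stale
```

Losers of the race fail fast instead of queueing on a lock, which is exactly the behavior wanted under the 4000-user load profile above.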

Reproduction

  1. Deploy with 3+ gateway replicas: make testing-up
  2. Run load test: make load-test-ui with 4000 users
  3. Monitor PostgreSQL:
    SELECT state, wait_event_type, COUNT(*), 
           MAX(EXTRACT(EPOCH FROM (NOW() - state_change)))::int as max_age
    FROM pg_stat_activity WHERE datname = 'mcp' 
    GROUP BY state, wait_event_type ORDER BY count DESC;
  4. Stop load test
  5. Check container CPU: docker stats --no-stream
  6. Observe gateways at elevated CPU with 0 users
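To make step 6 checkable in a regression suite rather than by eyeball, a small helper that flags containers still burning CPU after load has stopped (assumes the usual `docker stats --no-stream` column layout; threshold and names are illustrative):

```python
SPIN_THRESHOLD = 50.0  # % CPU considered suspicious when zero users are connected

def spinning_containers(docker_stats_output, threshold=SPIN_THRESHOLD):
    """Parse `docker stats --no-stream` text; return (name, cpu%) pairs over threshold."""
    hot = []
    for line in docker_stats_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        name, cpu = fields[1], float(fields[2].rstrip("%"))
        if cpu >= threshold:
            hot.append((name, cpu))
    return hot

sample = """\
CONTAINER ID   NAME        CPU %     MEM USAGE / LIMIT   MEM %   NET I/O   BLOCK I/O   PIDS
abc123         gateway-1   812.44%   1.2GiB / 4GiB       30%     1MB/2MB   0B/0B       44
def456         gateway-2   1.75%     1.1GiB / 4GiB       28%     1MB/2MB   0B/0B       40
"""
print(spinning_containers(sample))  # [('gateway-1', 812.44)]
```

Wired into CI after a load test run, a non-empty result fails the build, catching regressions like the 810-830% spin loop above automatically.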


Labels

bug (Something isn't working) · database · performance (Performance related items) · python (Python / backend development, FastAPI)
