-
Notifications
You must be signed in to change notification settings - Fork 615
[PERFORMANCE]: PR #2211 causes FOR UPDATE lock contention and CPU spin loop under high load #2355
Description
Summary
THREE PRs identified that cause severe performance degradation and CPU spin loop under high concurrent load:
- PR Synchronize Gateway Activation State Across Tools, Prompts, and Resources #2211 (
96db39855) - Introduced cascading FOR UPDATE lock contention - PR 2251 - Resolve server deactivation error for servers with email teams #2253 (
591644d78) - WORST! 22.5% failures, 381 connections stuck 3.7 minutes - PR 1865_logging-cpu-optimization #2170 (
2b9256737) - 14.3% failures, full CPU spin loop
Environment
- 3 gateway replicas (Granian or Gunicorn - both affected)
- PostgreSQL with PgBouncer
- nginx load balancer
- 4000 concurrent users (Locust load test)
Binary Search Test Results
2026-01-23 #29 5757e7c9f Linting and doc updates ← KNOWN BAD
│
├── 2026-01-22 #28 2b9256737 1865_logging-cpu-optimization (#2170)
│ ↑
│ └── ❌ TESTED: HARD CRASH
│ └── RPS: 173-600, Failures: 14.3%, Blocked: 48 (111s), Idle-in-Tx: 140 (246s)
│ └── CPU after load: 391-777% (SPIN LOOP)
│
├── 2026-01-21 #20 591644d78 Resolve server deactivation error (#2253)
│ ↑
│ └── ❌ TESTED: **WORST OF ALL!**
│ └── RPS: 180-1586, Failures: **22.5%**, Blocked: 48 (120s), Idle-in-Tx: **381 (222s!)**
│ └── CPU after load: 97-295% (elevated, slowly recovering)
│ └── ⚠️ CATASTROPHIC: 381 connections stuck for 3.7 minutes!
│
├── 2026-01-21 #16 96db39855 fix(gateway): synchronize activation state (#2211)
│ ↑
│ └── ❌ TESTED: Performance regression + CPU spin loop
│ └── RPS: 1200-1700 (↓45%), Failures: 0.8%, Blocked: 33 (104s), Idle-in-Tx: 70 (106s)
│ └── CPU after load: 810-830% (SPIN LOOP)
│ └── ⚠️ INTRODUCED: FOR UPDATE lock contention
│
├── 2026-01-21 #15 ff4011699 fix(perf): resolve N+1 queries (#2204)
│ ↑
│ └── ✅ TESTED: Healthy
│ └── RPS: 2000-2200, Failures: 0.1%, Blocked: 0, Idle-in-Tx: 50-70 (cycling OK)
│ └── CPU after load: 1-2% (normal)
│
2026-01-19 #0 5c33f658 Baseline ← KNOWN GOOD
│ ↑
│ └── ✅ TESTED: Stable
│ └── RPS: 2370+, Failures: ~0%, Blocked: 0, Idle-in-Tx: 0
│ └── CPU after load: 1-2% (normal)
Summary Table
| Commit | RPS | Fail% | Blocked Queries | Idle-in-Tx | CPU After Load | Status |
|---|---|---|---|---|---|---|
#0 5c33f658 Baseline |
2370+ | ~0% | 0 | 0 | 1-2% | ✅ GOOD |
#15 ff4011699 N+1 fix (#2204) |
2000-2200 | 0.1% | 0 | 50-70 | 1-2% | ✅ GOOD |
#16 96db39855 State sync (#2211) |
1200-1700 | 0.8% | 33 (104s) | 70 (106s) | 810-830% | ❌ BAD |
#20 591644d78 Server deact (#2253) |
180-1586 | 22.5% | 48 (120s) | 381 (222s!) | 97-295% | ❌ WORST |
#28 2b9256737 Logging opt (#2170) |
173-600 | 14.3% | 48 (111s) | 140 (246s) | 391-777% | ❌ CRASH |
Root Cause #1: PR #2211 - Cascading FOR UPDATE Locks
PR #2211 added cascading state updates that acquire FOR UPDATE locks on every related entity:
# gateway_service.py - set_gateway_state() now does:
async def set_gateway_state(...):
gateway = get_for_update(db, gateway_id) # Lock #1: gateway row
for tool in tools:
await tool_service.set_tool_state(...) # Lock #2-N: each tool row
# NEW in PR #2211:
for prompt in prompts:
await prompt_service.set_prompt_state(...) # Lock #N+1 to N+M: each prompt
for resource in resources:
await resource_service.set_resource_state(...) # Lock #N+M+1 to N+M+K: each resourceImpact: A single gateway state change now acquires 1 + tools + prompts + resources FOR UPDATE locks sequentially, causing massive contention under concurrent load.
Root Cause #2: PR #2253 - Server Deactivation (WORST)
PR #2253 ("Resolve server deactivation error for servers with email teams") caused the worst degradation:
- 381 idle-in-transaction connections stuck for 3.7 minutes
- 22.5% failure rate (worst of all tested commits)
- RPS oscillates wildly between 180-1586
This PR modifies server_service.py and likely adds additional database operations that compound the lock contention from PR #2211.
Root Cause #3: PR #2170 - Logging Changes
PR #2170 ("logging-cpu-optimization") also worsens the problem:
- 14.3% failure rate
- Full CPU spin loop (391-777%) after load ends
- Idle-in-transaction connections stuck for 4+ minutes
Suggested Fixes
Immediate (PR #2211)
-
Replace sequential FOR UPDATE with bulk UPDATE
# Instead of: for prompt in prompts: await prompt_service.set_prompt_state(db, prompt.id, activate) # Use: from sqlalchemy import update db.execute( update(DbPrompt) .where(DbPrompt.gateway_id == gateway_id) .values(enabled=activate) )
-
Use
skip_locked=Truefor non-critical state updatesstmt = stmt.with_for_update(skip_locked=True) # Return 409 Conflict if row is locked instead of waiting
-
Add statement timeout
db.execute(text("SET LOCAL statement_timeout = '5s'"))
Short-term
-
Investigate PR 2251 - Resolve server deactivation error for servers with email teams #2253 - Why does server deactivation cause 381 stuck connections?
-
Investigate PR 1865_logging-cpu-optimization #2170 - Why do logging changes cause longer idle-in-transaction?
-
Add database lock timeout
SET lock_timeout = '5s';
Long-term
- Use optimistic locking with version columns instead of FOR UPDATE
- Implement proper transaction timeouts in async context
- Add circuit breaker for database operations under high load
- Create load test regression suite to catch similar issues
Reproduction
- Deploy with 3+ gateway replicas:
make testing-up - Run load test:
make load-test-uiwith 4000 users - Monitor PostgreSQL:
SELECT state, wait_event_type, COUNT(*), MAX(EXTRACT(EPOCH FROM (NOW() - state_change)))::int as max_age FROM pg_stat_activity WHERE datname = 'mcp' GROUP BY state, wait_event_type ORDER BY count DESC;
- Stop load test
- Check container CPU:
docker stats --no-stream - Observe gateways at elevated CPU with 0 users
References
- PR Synchronize Gateway Activation State Across Tools, Prompts, and Resources #2211: Synchronize Gateway Activation State Across Tools, Prompts, and Resources #2211
- PR 2251 - Resolve server deactivation error for servers with email teams #2253: 2251 - Resolve server deactivation error for servers with email teams #2253
- PR 1865_logging-cpu-optimization #2170: 1865_logging-cpu-optimization #2170
- RCA document:
todo/rca-crash.md - Binary search results:
todo/commit-id-status.md