
[PERFORMANCE]: PR #2211 causes FOR UPDATE lock contention and CPU spin loop under high load #2355

@crivetimihai

Description


Summary

Three PRs were identified that cause severe performance degradation and a CPU spin loop under high concurrent load:

  1. PR #2211 ("Synchronize Gateway Activation State Across Tools, Prompts, and Resources", 96db39855) - introduced cascading FOR UPDATE lock contention
  2. PR #2253 ("Resolve server deactivation error for servers with email teams", 591644d78) - WORST: 22.5% failures, 381 connections stuck for 3.7 minutes
  3. PR #2170 ("1865_logging-cpu-optimization", 2b9256737) - 14.3% failures, full CPU spin loop

Environment

  • 3 gateway replicas (Granian or Gunicorn - both affected)
  • PostgreSQL with PgBouncer
  • nginx load balancer
  • 4000 concurrent users (Locust load test)

Binary Search Test Results

2026-01-23 #29 5757e7c9f Linting and doc updates ← KNOWN BAD
    │
    ├── 2026-01-22 #28 2b9256737 1865_logging-cpu-optimization (#2170)
    │       ↑
    │       └── ❌ TESTED: HARD CRASH
    │           └── RPS: 173-600, Failures: 14.3%, Blocked: 48 (111s), Idle-in-Tx: 140 (246s)
    │           └── CPU after load: 391-777% (SPIN LOOP)
    │
    ├── 2026-01-21 #20 591644d78 Resolve server deactivation error (#2253)
    │       ↑
    │       └── ❌ TESTED: **WORST OF ALL!**
    │           └── RPS: 180-1586, Failures: **22.5%**, Blocked: 48 (120s), Idle-in-Tx: **381 (222s!)**
    │           └── CPU after load: 97-295% (elevated, slowly recovering)
    │           └── ⚠️ CATASTROPHIC: 381 connections stuck for 3.7 minutes!
    │
    ├── 2026-01-21 #16 96db39855 fix(gateway): synchronize activation state (#2211)
    │       ↑
    │       └── ❌ TESTED: Performance regression + CPU spin loop
    │           └── RPS: 1200-1700 (↓45%), Failures: 0.8%, Blocked: 33 (104s), Idle-in-Tx: 70 (106s)
    │           └── CPU after load: 810-830% (SPIN LOOP)
    │           └── ⚠️ INTRODUCED: FOR UPDATE lock contention
    │
    ├── 2026-01-21 #15 ff4011699 fix(perf): resolve N+1 queries (#2204)
    │       ↑
    │       └── ✅ TESTED: Healthy
    │           └── RPS: 2000-2200, Failures: 0.1%, Blocked: 0, Idle-in-Tx: 50-70 (cycling OK)
    │           └── CPU after load: 1-2% (normal)
    │
2026-01-19 #0  5c33f658 Baseline ← KNOWN GOOD
    │       ↑
    │       └── ✅ TESTED: Stable
    │           └── RPS: 2370+, Failures: ~0%, Blocked: 0, Idle-in-Tx: 0
    │           └── CPU after load: 1-2% (normal)

Summary Table

| Commit | RPS | Fail% | Blocked Queries | Idle-in-Tx | CPU After Load | Status |
|--------|-----|-------|-----------------|------------|----------------|--------|
| #0 5c33f658 Baseline | 2370+ | ~0% | 0 | 0 | 1-2% | ✅ GOOD |
| #15 ff4011699 N+1 fix (#2204) | 2000-2200 | 0.1% | 0 | 50-70 | 1-2% | ✅ GOOD |
| #16 96db39855 State sync (#2211) | 1200-1700 | 0.8% | 33 (104s) | 70 (106s) | 810-830% | ❌ BAD |
| #20 591644d78 Server deact (#2253) | 180-1586 | 22.5% | 48 (120s) | 381 (222s!) | 97-295% | WORST |
| #28 2b9256737 Logging opt (#2170) | 173-600 | 14.3% | 48 (111s) | 140 (246s) | 391-777% | ❌ CRASH |

Root Cause #1: PR #2211 - Cascading FOR UPDATE Locks

PR #2211 added cascading state updates that acquire FOR UPDATE locks on every related entity:

# gateway_service.py - set_gateway_state() now does:
async def set_gateway_state(...):
    gateway = get_for_update(db, gateway_id)  # Lock #1: gateway row
    
    for tool in tools:
        await tool_service.set_tool_state(...)  # Lock #2-N: each tool row
    
    # NEW in PR #2211:
    for prompt in prompts:
        await prompt_service.set_prompt_state(...)  # Lock #N+1 to N+M: each prompt
    
    for resource in resources:
        await resource_service.set_resource_state(...)  # Lock #N+M+1 to N+M+K: each resource

Impact: a single gateway state change now acquires 1 + N_tools + N_prompts + N_resources FOR UPDATE locks sequentially (one per related row), causing massive lock contention under concurrent load.
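A contributing hazard with sequential per-row locking is that two concurrent gateway updates can touch overlapping rows in different orders and deadlock (or queue behind each other for the full lock chain). A minimal stdlib-only simulation of the standard mitigation - always acquiring row locks in one global order, e.g. sorted by primary key (hypothetical names, not the project's code):

```python
import threading

# Hypothetical stand-ins for the row locks that FOR UPDATE would take.
row_locks = {row_id: threading.Lock() for row_id in range(4)}

def update_rows(row_ids):
    """Acquire per-row locks in sorted order so no two workers can deadlock."""
    acquired = []
    try:
        for row_id in sorted(row_ids):  # consistent global ordering
            row_locks[row_id].acquire()
            acquired.append(row_id)
        # ... perform the state update on the locked rows here ...
    finally:
        for row_id in reversed(acquired):
            row_locks[row_id].release()

# Without the sort, these two workers (opposite traversal order over the
# same rows) could deadlock; with it, one simply waits for the other.
t1 = threading.Thread(target=update_rows, args=([0, 1, 2],))
t2 = threading.Thread(target=update_rows, args=([2, 1, 0],))
t1.start(); t2.start()
t1.join(); t2.join()
print("both workers finished")
```

This does not remove the contention that PR #2211 introduced, but it turns a potential deadlock into plain waiting, which the bulk-UPDATE fix below then shortens.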


Root Cause #2: PR #2253 - Server Deactivation (WORST)

PR #2253 ("Resolve server deactivation error for servers with email teams") caused the worst degradation:

  • 381 idle-in-transaction connections stuck for 3.7 minutes
  • 22.5% failure rate (worst of all tested commits)
  • RPS oscillates wildly between 180-1586

This PR modifies server_service.py and likely adds additional database operations that compound the lock contention from PR #2211.


Root Cause #3: PR #2170 - Logging Changes

PR #2170 ("logging-cpu-optimization") also worsens the problem:

  • 14.3% failure rate
  • Full CPU spin loop (391-777%) after load ends
  • Idle-in-transaction connections stuck for 4+ minutes

Suggested Fixes

Immediate (PR #2211)

  1. Replace sequential FOR UPDATE with bulk UPDATE

    # Instead of:
    for prompt in prompts:
        await prompt_service.set_prompt_state(db, prompt.id, activate)
    
    # Use:
    from sqlalchemy import update
    db.execute(
        update(DbPrompt)
        .where(DbPrompt.gateway_id == gateway_id)
        .values(enabled=activate)
    )
  2. Use skip_locked=True for non-critical state updates

    stmt = stmt.with_for_update(skip_locked=True)
    # Return 409 Conflict if row is locked instead of waiting
  3. Add a statement timeout

    from sqlalchemy import text
    db.execute(text("SET LOCAL statement_timeout = '5s'"))
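The spirit of fix #1 - one set-based statement instead of N row-at-a-time updates - can be demonstrated end to end with plain SQL (`sqlite3` here purely so the sketch is runnable; the production path would go through SQLAlchemy as shown above, with hypothetical table/column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prompts (id INTEGER PRIMARY KEY, gateway_id INT, enabled INT)")
conn.executemany("INSERT INTO prompts VALUES (?, ?, 1)", [(i, i % 2) for i in range(10)])

# One bulk UPDATE: a single statement and a single, brief lock footprint,
# instead of one round-trip plus one row lock per prompt.
cur = conn.execute("UPDATE prompts SET enabled = 0 WHERE gateway_id = ?", (1,))
conn.commit()
print(cur.rowcount)  # prints 5: all matching prompts flipped in one statement
```

The database still locks the affected rows, but only for the duration of one statement rather than holding earlier locks while later rows are fetched, updated, and awaited one by one.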

Short-term

  1. Investigate PR #2253 ("Resolve server deactivation error for servers with email teams") - why does server deactivation leave 381 connections stuck in transaction?

  2. Investigate PR #2170 ("1865_logging-cpu-optimization") - why do the logging changes lengthen idle-in-transaction times?

  3. Add database lock timeout

    SET lock_timeout = '5s';

Long-term

  1. Use optimistic locking with version columns instead of FOR UPDATE
  2. Implement proper transaction timeouts in async context
  3. Add circuit breaker for database operations under high load
  4. Create load test regression suite to catch similar issues
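The long-term idea in (1) can be sketched concretely: with a version column, the UPDATE only succeeds if the row is unchanged since it was read, so no FOR UPDATE lock is held at all (`sqlite3` for runnability; names are illustrative - SQLAlchemy offers the same pattern via its `version_id_col` mapper option):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gateways (id INTEGER PRIMARY KEY, enabled INT, version INT)")
conn.execute("INSERT INTO gateways VALUES (1, 1, 0)")

def set_gateway_state(conn, gateway_id, enabled, expected_version):
    """Optimistic update: succeeds only if nobody changed the row since we read it."""
    cur = conn.execute(
        "UPDATE gateways SET enabled = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (enabled, gateway_id, expected_version),
    )
    conn.commit()
    # False means a concurrent writer won the race; caller retries or returns 409.
    return cur.rowcount == 1

print(set_gateway_state(conn, 1, 0, expected_version=0))  # True: first writer wins
print(set_gateway_state(conn, 1, 1, expected_version=0))  # False: version is stale
```

Losers of the race fail fast instead of queueing on a lock, which is exactly the behavior wanted under the 4000-user load profile above.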

Reproduction

  1. Deploy with 3+ gateway replicas: make testing-up
  2. Run load test: make load-test-ui with 4000 users
  3. Monitor PostgreSQL:
    SELECT state, wait_event_type, COUNT(*), 
           MAX(EXTRACT(EPOCH FROM (NOW() - state_change)))::int as max_age
    FROM pg_stat_activity WHERE datname = 'mcp' 
    GROUP BY state, wait_event_type ORDER BY count DESC;
  4. Stop load test
  5. Check container CPU: docker stats --no-stream
  6. Observe gateways at elevated CPU with 0 users
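To make step 6 checkable in a regression suite rather than by eyeball, a small helper that flags containers still burning CPU after load has stopped (assumes the usual `docker stats --no-stream` column layout; threshold and names are illustrative):

```python
SPIN_THRESHOLD = 50.0  # % CPU considered suspicious when zero users are connected

def spinning_containers(docker_stats_output, threshold=SPIN_THRESHOLD):
    """Parse `docker stats --no-stream` text; return (name, cpu%) pairs over threshold."""
    hot = []
    for line in docker_stats_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        name, cpu = fields[1], float(fields[2].rstrip("%"))
        if cpu >= threshold:
            hot.append((name, cpu))
    return hot

sample = """\
CONTAINER ID   NAME        CPU %     MEM USAGE / LIMIT   MEM %   NET I/O   BLOCK I/O   PIDS
abc123         gateway-1   812.44%   1.2GiB / 4GiB       30%     1MB/2MB   0B/0B       44
def456         gateway-2   1.75%     1.1GiB / 4GiB       28%     1MB/2MB   0B/0B       40
"""
print(spinning_containers(sample))  # [('gateway-1', 812.44)]
```

Wired into CI after a load test run, a non-empty result fails the build, catching regressions like the 810-830% spin loop above automatically.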


Labels

bug (Something isn't working) · database · performance (Performance related items) · python (Python / backend development, FastAPI)
