[TESTING][RESILIENCE]: PostgreSQL Resilience Manual Test Plan (Connection Loss, Failover, Recovery) #2466
Goal
Produce a comprehensive manual test plan for PostgreSQL resilience testing including connection loss handling, failover behavior, recovery procedures, and transaction safety under failure conditions.
Why Now?
Database resilience is critical for production stability:
- Availability: Gateway must handle database outages gracefully
- Data Integrity: No data loss or corruption during failures
- Recovery: Automatic reconnection when database returns
- User Experience: Meaningful errors during degraded state
📖 User Stories
US-1: Platform Operator - Connection Loss Handling
As a Platform Operator
I want the gateway to handle database connection loss gracefully
So that users receive appropriate errors and the system recovers automatically
Acceptance Criteria:
Feature: Database Connection Loss
Scenario: Database becomes unreachable
Given the gateway is running with PostgreSQL
And the database becomes unreachable
When a request requiring database access is made
Then the request should fail with 503 Service Unavailable
And the error message should indicate database issues
And the health check should report unhealthy
Scenario: Database recovers
Given the database was unreachable
When the database becomes available again
Then the connection pool should recover
And subsequent requests should succeed
And health check should return healthy
US-2: DBA - Connection Pool Management
As a Database Administrator
I want connection pool to handle failures properly
So that resources are managed efficiently
Acceptance Criteria:
Feature: Connection Pool Resilience
Scenario: Connection pool exhaustion
Given all connections in the pool are in use
When a new request arrives
Then request should wait up to timeout
And if timeout exceeded, return 503
And pool should recover when connections freed
Scenario: Stale connections
Given idle connections in the pool
When database restarts
Then stale connections should be detected
And new connections should be established
US-3: SRE - Failover Handling
As an SRE
I want the gateway to handle primary/replica failover
So that service continues during database maintenance
Acceptance Criteria:
Feature: Database Failover
Scenario: Primary failover to replica
Given a primary-replica PostgreSQL setup
When the primary fails
And replica is promoted
Then gateway should reconnect to new primary
And writes should resume
And no data corruption should occur
🏗 Architecture
Database Connection Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATABASE CONNECTION ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
GATEWAY CONNECTION POOL DATABASE
─────── ─────────────── ────────
┌────────────┐ ┌─────────────────┐ ┌─────────────┐
│ Request │ │ SQLAlchemy │ │ PostgreSQL │
│ Handler │──────────▶│ Connection │────────▶│ Primary │
└────────────┘ │ Pool │ └─────────────┘
│ │ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Connection 1 │ │ │
│ │ ├───────────────┤ │ ┌─────▼─────┐
│ │ │ Connection 2 │ │ │ Streaming │
│ │ ├───────────────┤ │ │ Replica │
│ │ │ Connection N │ │ └───────────┘
│ │ └───────────────┘ │
│ │ │
│ │ Pool Settings: │
│ │ - pool_size=5 │
│ │ - max_overflow=10 │
│ │ - pool_timeout=30 │
│ │ - pool_recycle= │
│ │ 1800 │
│ └─────────────────┬─┘
│ │
│ ┌───────▼───────┐
│ │ Health Check │
│ │ pool_pre_ping │
│ └───────────────┘
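The pool settings in the diagram map directly onto SQLAlchemy's `create_engine()` keywords. A minimal configuration sketch, assuming an illustrative DSN (the gateway's actual values live in `mcpgateway/config.py`):

```python
# Sketch of the pool configuration shown above, using SQLAlchemy.
# DSN and values are illustrative, not the gateway's real settings.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@localhost:5432/mcpgateway",
    pool_size=5,         # persistent connections kept in the pool
    max_overflow=10,     # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection before TimeoutError
    pool_recycle=1800,   # recycle connections older than 30 minutes
    pool_pre_ping=True,  # lightweight liveness check before each checkout
)
```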
Failure Detection Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ FAILURE DETECTION AND RECOVERY │
└─────────────────────────────────────────────────────────────────────────────┘
NORMAL FAILURE RECOVERY
────── ─────── ────────
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Request │ │ Request │ │ Request │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Get Conn │ │Get Conn │ │Get Conn │
│from Pool│ │from Pool│ │from Pool│
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Execute │ │ Execute │ │ Execute │
│ Query │ │ Query │ │ Query │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Return │ │CONN ERR │ │ Return │
│ Result │ │DETECTED │ │ Result │
└─────────┘ └────┬────┘ └─────────┘
│
┌────▼────┐
│Invalidate│
│Connection│
└────┬────┘
│
┌────▼────┐
│ Return │
│ 503 │
└─────────┘
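The failure branch above can be sketched in a few lines. This is a pure-Python stand-in with hypothetical names (`Conn`, `run_query`); in the real stack SQLAlchemy performs the invalidation via `Connection.invalidate()`:

```python
# Minimal stand-in for the failure branch: execute a query, and on a
# connection-level error invalidate the connection and map the failure
# to a 503 response.

class Conn:
    def __init__(self):
        self.valid = True

    def invalidate(self):
        # Mark the connection so the pool discards it instead of reusing it.
        self.valid = False

def run_query(conn, query_fn):
    try:
        return 200, query_fn()
    except ConnectionError:
        conn.invalidate()  # drop the broken connection from the pool
        return 503, "database unavailable"
```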
📋 Test Environment Setup
Prerequisites
# Environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/mcpgateway"
export GATEWAY_URL="http://localhost:8000"
# PostgreSQL container for testing
docker run -d --name postgres-test \
-e POSTGRES_USER=user \
-e POSTGRES_PASSWORD=pass \
-e POSTGRES_DB=mcpgateway \
-p 5432:5432 \
postgres:15
# Alternative: Docker Compose with replica
docker-compose -f docker-compose.pg-ha.yml up -d
# Start gateway
make dev
Network Fault Injection Tools
# Install toxiproxy for network simulation
docker run -d --name toxiproxy \
-p 8474:8474 \
-p 15432:15432 \
shopify/toxiproxy
# Configure proxy
toxiproxy-cli create postgres -l 0.0.0.0:15432 -u postgres-test:5432
# Use proxied connection
export DATABASE_URL="postgresql://user:pass@localhost:15432/mcpgateway"
🧪 Manual Test Cases
Section 1: Connection Loss
| Case | Scenario | Trigger | Expected | Validation |
|---|---|---|---|---|
| CL-01 | Database shutdown | Stop container | 503 errors | Error messages |
| CL-02 | Network partition | Block port | 503 + timeout | Response time |
| CL-03 | Connection drop | Kill connections | Retry/fail | Reconnection |
| CL-04 | Slow network | Add latency | Timeouts | Query timing |
CL-01: Database Shutdown
Preconditions:
- Gateway running with PostgreSQL
- Gateway operational and healthy
Steps:
# Step 1: Verify gateway is healthy
curl -s "$GATEWAY_URL/health/ready" | jq .
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq .
# Step 2: Stop PostgreSQL
docker stop postgres-test
# Step 3: Wait for detection (connection timeout)
sleep 5
# Step 4: Verify health check fails
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
echo "Health check status: $HTTP_CODE"
[ "$HTTP_CODE" = "503" ] && echo "PASS: Health check failed" || echo "FAIL"
# Step 5: Make API request
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
echo "API response: $HTTP_CODE"
[ "$HTTP_CODE" = "503" ] && echo "PASS: 503 returned" || echo "FAIL"
# Step 6: Restart PostgreSQL
docker start postgres-test
# Step 7: Wait for recovery
sleep 10
# Step 8: Verify recovery
curl -s "$GATEWAY_URL/health/ready" | jq .
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq '.[:1]'Expected Result:
- Health check returns 503 when DB is down
- API requests return 503 with meaningful error
- Gateway recovers automatically when DB returns
- No manual intervention required
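Rather than the fixed `sleep 10` in Step 7, recovery time can be measured by polling the readiness probe. A hypothetical helper, where `check` is any zero-argument callable that returns True once `$GATEWAY_URL/health/ready` answers 200:

```python
# Poll a readiness check until it passes or a deadline expires.
# Returns True on recovery, False if the timeout is exceeded.
import time

def wait_until_healthy(check, timeout=60.0, interval=1.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

In CL-01, `check` would wrap the curl (or `urllib.request`) call from Step 8, and the elapsed time gives the observed recovery window.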
CL-02: Network Partition
Preconditions:
- Gateway running with toxiproxy
- Proxy configured for database
Steps:
# Step 1: Verify normal operation
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq '.total'
# Step 2: Add network partition via toxiproxy
toxiproxy-cli toxic add postgres -t timeout -a timeout=0
# Step 3: Wait for connection attempts to timeout
sleep 35 # Connection timeout typically 30s
# Step 4: Verify failure
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN")
echo "Response during partition: $HTTP_CODE"
# Step 5: Remove partition
toxiproxy-cli toxic delete postgres -n timeout_downstream
# Step 6: Verify recovery
sleep 5
curl -s "$GATEWAY_URL/health/ready" | jq .Expected Result:
- Requests timeout during partition
- Proper timeout error returned
- Recovery after partition heals
Section 2: Connection Pool
| Case | Scenario | Condition | Expected | Validation |
|---|---|---|---|---|
| CP-01 | Pool exhaustion | Concurrent requests | Queue/503 | Wait behavior |
| CP-02 | Stale connections | DB restart | Reconnect | No stale errors |
| CP-03 | Pool recovery | After failure | Replenish | Connection count |
CP-01: Connection Pool Exhaustion
Preconditions:
- Pool size = 5, max_overflow = 10
- Slow endpoint or artificial delay
Steps:
# Step 1: Create slow query endpoint or use existing one
# (May need to add artificial delay for testing)
# Step 2: Launch concurrent requests exceeding pool size
for i in {1..20}; do
curl -s "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" &
done
# Step 3: Observe pool behavior
# Check logs for "pool limit reached" or similar
# Step 4: Wait for all to complete
wait
# Step 5: Verify all eventually succeeded or failed with 503
# (Should not have connection errors)
Expected Result:
- Requests queue when pool full
- Requests timeout after pool_timeout
- 503 returned for timed-out requests
- No connection leaks
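The queue-then-503 behaviour expected in CP-01 can be modelled with a bounded queue standing in for the pool. A pure-Python sketch (`TinyPool` is illustrative; SQLAlchemy raises `TimeoutError` in the exhausted case):

```python
# Bounded pool sketch: checkout blocks up to pool_timeout; a miss maps
# to 503, and checking a connection back in lets waiters proceed.
import queue

class TinyPool:
    def __init__(self, size, pool_timeout):
        self._conns = queue.Queue()
        self.pool_timeout = pool_timeout
        for i in range(size):
            self._conns.put(f"conn-{i}")

    def checkout(self):
        try:
            return 200, self._conns.get(timeout=self.pool_timeout)
        except queue.Empty:
            return 503, None  # pool exhausted and pool_timeout exceeded

    def checkin(self, conn):
        self._conns.put(conn)
```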
CP-02: Stale Connection Detection
Preconditions:
- pool_pre_ping enabled
- Idle connections in pool
Steps:
# Step 1: Make initial request (populates pool)
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN"
# Step 2: Check pool statistics
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity WHERE datname = 'mcpgateway';"
# Step 3: Restart database (invalidates connections)
docker restart postgres-test
sleep 5
# Step 4: Make request (should detect stale and reconnect)
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
echo "After restart: $HTTP_CODE"
# Step 5: Verify success (pool_pre_ping detected stale)
[ "$HTTP_CODE" = "200" ] && echo "PASS: Stale connection handled" || echo "FAIL"Expected Result:
- pool_pre_ping detects dead connections
- New connections established automatically
- Request succeeds without stale connection error
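What `pool_pre_ping` does in CP-02 amounts to a cheap liveness probe before each checkout. A hedged sketch with stub objects (`StubConn` and `make_conn` are hypothetical; SQLAlchemy's real ping is driver-level, roughly `SELECT 1`):

```python
# Pre-ping sketch: ping the pooled connection before handing it out;
# on failure, discard it and open a fresh one transparently.

class StubConn:
    def __init__(self, alive=True):
        self.alive = alive

    def ping(self):
        if not self.alive:
            raise ConnectionError("server closed the connection unexpectedly")

def checkout_with_preping(pooled, make_conn):
    try:
        pooled.ping()        # cheap liveness check, e.g. SELECT 1
        return pooled
    except ConnectionError:
        return make_conn()   # stale connection replaced with a fresh one
```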
Section 3: Failover
| Case | Scenario | Setup | Expected | Validation |
|---|---|---|---|---|
| FO-01 | Primary failure | Primary/replica | Switch to replica | Writes work |
| FO-02 | Replica lag | Async replication | Reads consistent | Data accuracy |
| FO-03 | Split brain | Network partition | No dual writes | Data integrity |
FO-01: Primary Failover
Preconditions:
- PostgreSQL primary + standby configured
- Gateway connected to primary
Steps:
# Step 1: Verify current primary
psql $DATABASE_URL -c "SELECT pg_is_in_recovery();"
# Should return 'f' for false (is primary)
# Step 2: Create test data
curl -s -X POST "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "failover-test", "url": "http://test.com"}'
# Step 3: Simulate primary failure
docker stop postgres-primary
# Step 4: Promote standby (run as postgres; data dir path per the official image)
docker exec -u postgres postgres-standby pg_ctl promote -D /var/lib/postgresql/data
# Step 5: Update connection string (or use PgBouncer/HAProxy)
# This depends on your HA setup
# Step 6: Verify writes work on new primary
curl -s -X POST "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "after-failover", "url": "http://new.com"}'
# Step 7: Verify no data loss
curl -s "$GATEWAY_URL/api/servers?name=failover-test" \
-H "Authorization: Bearer $TOKEN" | jq .Expected Result:
- Failover completes within RTO
- No data loss during failover
- Writes resume on new primary
- Gateway reconnects automatically
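Step 5's "update connection string" can be sketched as a candidate scan: try each DSN and accept the first writable primary, i.e. the first node where `pg_is_in_recovery()` is false. In practice this role check usually sits in PgBouncer/HAProxy or in libpq's `target_session_attrs=read-write` rather than in the gateway; `find_primary` and `connect` below are hypothetical names:

```python
# Failover reconnect sketch: return the first reachable DSN whose node
# is a writable primary, or None if every candidate is down or a replica.

def find_primary(candidates, connect):
    """connect(dsn) -> object with .in_recovery; raises ConnectionError if down."""
    for dsn in candidates:
        try:
            conn = connect(dsn)
        except ConnectionError:
            continue              # node unreachable, try the next candidate
        if not conn.in_recovery:  # primary accepts writes
            return dsn
    return None
```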
Section 4: Transaction Safety
| Case | Scenario | Condition | Expected | Validation |
|---|---|---|---|---|
| TX-01 | Deadlock | Concurrent updates | One aborted | No corruption |
| TX-02 | Long transaction | Timeout | Rollback | Clean state |
| TX-03 | Mid-transaction fail | Connection drop | Rollback | No partial |
TX-01: Deadlock Detection
Preconditions:
- Two concurrent sessions
- Rows that can be locked
Steps:
# In terminal 1:
psql $DATABASE_URL << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock1' WHERE name = 'server-a';
-- Wait here for terminal 2
SELECT pg_sleep(5);
UPDATE servers SET description = 'lock1' WHERE name = 'server-b';
COMMIT;
EOF
# In terminal 2 (run simultaneously):
psql $DATABASE_URL << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock2' WHERE name = 'server-b';
-- This will create deadlock with terminal 1
UPDATE servers SET description = 'lock2' WHERE name = 'server-a';
COMMIT;
EOF
# One should succeed; the other should fail with a deadlock error
Expected Result:
- PostgreSQL detects deadlock
- One transaction aborted
- Other transaction completes
- No data corruption
- Gateway logs deadlock event
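A deadlock-aborted transaction is safe to retry, because PostgreSQL rolls the loser back completely (SQLSTATE 40P01). An illustrative retry wrapper; `DeadlockDetected` stands in for the driver's exception (psycopg2 surfaces it as an error with `pgcode == "40P01"`):

```python
# Retry a whole transaction on deadlock abort, giving up after N attempts.
# Only the losing transaction is retried; its effects were fully rolled back.

class DeadlockDetected(Exception):
    pass

def run_with_retry(txn, attempts=3):
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
```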
TX-02: Transaction Timeout
Preconditions:
- statement_timeout configured
- Long-running query possible
Steps:
# Step 1: Check current timeout
psql $DATABASE_URL -c "SHOW statement_timeout;"
# Step 2: Create scenario that times out via API
# (Depends on implementation - may need slow endpoint)
# Step 3: Verify timeout error returned
# Should be a clear timeout error, not generic 500
# Step 4: Verify transaction rolled back
# Subsequent queries should work normally
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq .Expected Result:
- Long transactions timeout
- Clean rollback on timeout
- Connection returned to pool
- No lingering locks
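If `statement_timeout` is to be enforced from the client side rather than in `postgresql.conf`, one option is passing it as a libpq startup option through the driver. An illustrative fragment with assumed values, not the gateway's actual configuration:

```python
# Client-side statement_timeout via libpq startup options (psycopg2
# accepts these through connect_args). Value is in milliseconds.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@localhost:5432/mcpgateway",
    connect_args={"options": "-c statement_timeout=30000"},  # 30 s
)
```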
📊 Test Matrix
| Test Case | Connection | Pool | Failover | Transaction | Docker | K8s |
|---|---|---|---|---|---|---|
| CL-01 | ✓ | | | | ✓ | |
| CL-02 | ✓ | | | | ✓ | |
| CL-03 | ✓ | ✓ | | | ✓ | |
| CL-04 | ✓ | | | | ✓ | |
| CP-01 | ✓ | ✓ | | | ✓ | |
| CP-02 | | ✓ | | | ✓ | |
| CP-03 | | ✓ | | | ✓ | |
| FO-01 | ✓ | | ✓ | | ✓ | |
| FO-02 | | | ✓ | | | |
| FO-03 | | | ✓ | | | |
| TX-01 | | | | ✓ | | |
| TX-02 | | | | ✓ | | |
| TX-03 | ✓ | | | ✓ | | |
✅ Success Criteria
- All 13 test cases pass
- Gateway survives database restarts
- Connection pool recovers automatically
- Stale connections detected and replaced
- Failover works with minimal downtime
- No data corruption during any failure
- Proper error messages returned
- Health checks reflect actual state
🔗 Related Files
- mcpgateway/db.py - Database configuration
- mcpgateway/config.py - Pool settings
- mcpgateway/services/health_service.py - Health checks