[TESTING][RESILIENCE]: PostgreSQL Resilience Manual Test Plan (Connection Loss, Failover, Recovery) #2466
Goal
Produce a comprehensive manual test plan for PostgreSQL resilience testing including connection loss handling, failover behavior, recovery procedures, and transaction safety under failure conditions.
Why Now?
Database resilience is critical for production stability:
- Availability: Gateway must handle database outages gracefully
- Data Integrity: No data loss or corruption during failures
- Recovery: Automatic reconnection when database returns
- User Experience: Meaningful errors during degraded state
📖 User Stories
US-1: Platform Operator - Connection Loss Handling
As a Platform Operator
I want the gateway to handle database connection loss gracefully
So that users receive appropriate errors and the system recovers automatically
Acceptance Criteria:
Feature: Database Connection Loss
Scenario: Database becomes unreachable
Given the gateway is running with PostgreSQL
And the database becomes unreachable
When a request requiring database access is made
Then the request should fail with 503 Service Unavailable
And the error message should indicate database issues
And the health check should report unhealthy
Scenario: Database recovers
Given the database was unreachable
When the database becomes available again
Then the connection pool should recover
And subsequent requests should succeed
And health check should return healthy
US-2: DBA - Connection Pool Management
As a Database Administrator
I want connection pool to handle failures properly
So that resources are managed efficiently
Acceptance Criteria:
Feature: Connection Pool Resilience
Scenario: Connection pool exhaustion
Given all connections in the pool are in use
When a new request arrives
Then request should wait up to timeout
And if timeout exceeded, return 503
And pool should recover when connections freed
Scenario: Stale connections
Given idle connections in the pool
When database restarts
Then stale connections should be detected
And new connections should be established
US-3: SRE - Failover Handling
As an SRE
I want the gateway to handle primary/replica failover
So that service continues during database maintenance
Acceptance Criteria:
Feature: Database Failover
Scenario: Primary failover to replica
Given a primary-replica PostgreSQL setup
When the primary fails
And replica is promoted
Then gateway should reconnect to new primary
And writes should resume
And no data corruption should occur
🏗 Architecture
Database Connection Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATABASE CONNECTION ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
GATEWAY CONNECTION POOL DATABASE
─────── ─────────────── ────────
┌────────────┐ ┌─────────────────┐ ┌─────────────┐
│ Request │ │ SQLAlchemy │ │ PostgreSQL │
│ Handler │──────────▶│ Connection │────────▶│ Primary │
└────────────┘ │ Pool │ └─────────────┘
│ │ │ │
│ │ ┌───────────────┐ │ │
│ │ │ Connection 1 │ │ │
│ │ ├───────────────┤ │ ┌─────▼─────┐
│ │ │ Connection 2 │ │ │ Streaming │
│ │ ├───────────────┤ │ │ Replica │
│ │ │ Connection N │ │ └───────────┘
│ │ └───────────────┘ │
│ │ │
│ │ Pool Settings: │
│ │ - pool_size=5 │
│ │ - max_overflow=10 │
│ │ - pool_timeout=30 │
│ │ - pool_recycle= │
│ │ 1800 │
│ └─────────────────┬─┘
│ │
│ ┌───────▼───────┐
│ │ Health Check │
│ │ pool_pre_ping │
│ └───────────────┘
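The pool settings in the diagram map directly onto SQLAlchemy's `create_engine()` keywords. A minimal configuration sketch, assuming an illustrative DSN (the gateway's actual values live in `mcpgateway/config.py`):

```python
# Sketch of the pool configuration shown above, using SQLAlchemy.
# DSN and values are illustrative, not the gateway's real settings.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@localhost:5432/mcpgateway",
    pool_size=5,         # persistent connections kept in the pool
    max_overflow=10,     # extra connections allowed under burst load
    pool_timeout=30,     # seconds to wait for a free connection before TimeoutError
    pool_recycle=1800,   # recycle connections older than 30 minutes
    pool_pre_ping=True,  # lightweight liveness check before each checkout
)
```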
Failure Detection Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ FAILURE DETECTION AND RECOVERY │
└─────────────────────────────────────────────────────────────────────────────┘
NORMAL FAILURE RECOVERY
────── ─────── ────────
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Request │ │ Request │ │ Request │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Get Conn │ │Get Conn │ │Get Conn │
│from Pool│ │from Pool│ │from Pool│
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Execute │ │ Execute │ │ Execute │
│ Query │ │ Query │ │ Query │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ Return │ │CONN ERR │ │ Return │
│ Result │ │DETECTED │ │ Result │
└─────────┘ └────┬────┘ └─────────┘
│
┌────▼────┐
│Invalidate│
│Connection│
└────┬────┘
│
┌────▼────┐
│ Return │
│ 503 │
└─────────┘
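The failure branch above can be sketched in a few lines. This is a pure-Python stand-in with hypothetical names (`Conn`, `run_query`); in the real stack SQLAlchemy performs the invalidation via `Connection.invalidate()`:

```python
# Minimal stand-in for the failure branch: execute a query, and on a
# connection-level error invalidate the connection and map the failure
# to a 503 response.

class Conn:
    def __init__(self):
        self.valid = True

    def invalidate(self):
        # Mark the connection so the pool discards it instead of reusing it.
        self.valid = False

def run_query(conn, query_fn):
    try:
        return 200, query_fn()
    except ConnectionError:
        conn.invalidate()  # drop the broken connection from the pool
        return 503, "database unavailable"
```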
📋 Test Environment Setup
Prerequisites
# Environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/mcpgateway"
export GATEWAY_URL="http://localhost:8000"
# PostgreSQL container for testing
docker run -d --name postgres-test \
-e POSTGRES_USER=user \
-e POSTGRES_PASSWORD=pass \
-e POSTGRES_DB=mcpgateway \
-p 5432:5432 \
postgres:15
# Alternative: Docker Compose with replica
docker-compose -f docker-compose.pg-ha.yml up -d
# Start gateway
make dev
Network Fault Injection Tools
# Install toxiproxy for network simulation
docker run -d --name toxiproxy \
-p 8474:8474 \
-p 15432:15432 \
shopify/toxiproxy
# Configure proxy
toxiproxy-cli create postgres -l 0.0.0.0:15432 -u postgres-test:5432
# Use proxied connection
export DATABASE_URL="postgresql://user:pass@localhost:15432/mcpgateway"
🧪 Manual Test Cases
Section 1: Connection Loss
| Case | Scenario | Trigger | Expected | Validation |
|---|---|---|---|---|
| CL-01 | Database shutdown | Stop container | 503 errors | Error messages |
| CL-02 | Network partition | Block port | 503 + timeout | Response time |
| CL-03 | Connection drop | Kill connections | Retry/fail | Reconnection |
| CL-04 | Slow network | Add latency | Timeouts | Query timing |
CL-01: Database Shutdown
Preconditions:
- Gateway running with PostgreSQL
- Gateway operational and healthy
Steps:
# Step 1: Verify gateway is healthy
curl -s "$GATEWAY_URL/health/ready" | jq .
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq .
# Step 2: Stop PostgreSQL
docker stop postgres-test
# Step 3: Wait for detection (connection timeout)
sleep 5
# Step 4: Verify health check fails
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
echo "Health check status: $HTTP_CODE"
[ "$HTTP_CODE" = "503" ] && echo "PASS: Health check failed" || echo "FAIL"
# Step 5: Make API request
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
echo "API response: $HTTP_CODE"
[ "$HTTP_CODE" = "503" ] && echo "PASS: 503 returned" || echo "FAIL"
# Step 6: Restart PostgreSQL
docker start postgres-test
# Step 7: Wait for recovery
sleep 10
# Step 8: Verify recovery
curl -s "$GATEWAY_URL/health/ready" | jq .
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq '.[:1]'Expected Result:
- Health check returns 503 when DB is down
- API requests return 503 with meaningful error
- Gateway recovers automatically when DB returns
- No manual intervention required
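Rather than the fixed `sleep 10` in Step 7, recovery time can be measured by polling the readiness probe. A hypothetical helper, where `check` is any zero-argument callable that returns True once `$GATEWAY_URL/health/ready` answers 200:

```python
# Poll a readiness check until it passes or a deadline expires.
# Returns True on recovery, False if the timeout is exceeded.
import time

def wait_until_healthy(check, timeout=60.0, interval=1.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

In CL-01, `check` would wrap the curl (or `urllib.request`) call from Step 8, and the elapsed time gives the observed recovery window.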
CL-02: Network Partition
Preconditions:
- Gateway running with toxiproxy
- Proxy configured for database
Steps:
# Step 1: Verify normal operation
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq '.total'
# Step 2: Add network partition via toxiproxy
toxiproxy-cli toxic add postgres -t timeout -a timeout=0
# Step 3: Wait for connection attempts to timeout
sleep 35 # Connection timeout typically 30s
# Step 4: Verify failure
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN")
echo "Response during partition: $HTTP_CODE"
# Step 5: Remove partition
toxiproxy-cli toxic delete postgres -n timeout_downstream
# Step 6: Verify recovery
sleep 5
curl -s "$GATEWAY_URL/health/ready" | jq .Expected Result:
- Requests timeout during partition
- Proper timeout error returned
- Recovery after partition heals
Section 2: Connection Pool
| Case | Scenario | Condition | Expected | Validation |
|---|---|---|---|---|
| CP-01 | Pool exhaustion | Concurrent requests | Queue/503 | Wait behavior |
| CP-02 | Stale connections | DB restart | Reconnect | No stale errors |
| CP-03 | Pool recovery | After failure | Replenish | Connection count |
CP-01: Connection Pool Exhaustion
Preconditions:
- Pool size = 5, max_overflow = 10
- Slow endpoint or artificial delay
Steps:
# Step 1: Create slow query endpoint or use existing one
# (May need to add artificial delay for testing)
# Step 2: Launch concurrent requests exceeding pool size
for i in {1..20}; do
curl -s "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" &
done
# Step 3: Observe pool behavior
# Check logs for "pool limit reached" or similar
# Step 4: Wait for all to complete
wait
# Step 5: Verify all eventually succeeded or failed with 503
# (Should not have connection errors)
Expected Result:
- Requests queue when pool full
- Requests timeout after pool_timeout
- 503 returned for timed-out requests
- No connection leaks
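The queue-then-503 behaviour expected in CP-01 can be modelled with a bounded queue standing in for the pool. A pure-Python sketch (`TinyPool` is illustrative; SQLAlchemy raises `TimeoutError` in the exhausted case):

```python
# Bounded pool sketch: checkout blocks up to pool_timeout; a miss maps
# to 503, and checking a connection back in lets waiters proceed.
import queue

class TinyPool:
    def __init__(self, size, pool_timeout):
        self._conns = queue.Queue()
        self.pool_timeout = pool_timeout
        for i in range(size):
            self._conns.put(f"conn-{i}")

    def checkout(self):
        try:
            return 200, self._conns.get(timeout=self.pool_timeout)
        except queue.Empty:
            return 503, None  # pool exhausted and pool_timeout exceeded

    def checkin(self, conn):
        self._conns.put(conn)
```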
CP-02: Stale Connection Detection
Preconditions:
- pool_pre_ping enabled
- Idle connections in pool
Steps:
# Step 1: Make initial request (populates pool)
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN"
# Step 2: Check pool statistics
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity WHERE datname = 'mcpgateway';"
# Step 3: Restart database (invalidates connections)
docker restart postgres-test
sleep 5
# Step 4: Make request (should detect stale and reconnect)
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
echo "After restart: $HTTP_CODE"
# Step 5: Verify success (pool_pre_ping detected stale)
[ "$HTTP_CODE" = "200" ] && echo "PASS: Stale connection handled" || echo "FAIL"Expected Result:
- pool_pre_ping detects dead connections
- New connections established automatically
- Request succeeds without stale connection error
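What `pool_pre_ping` does in CP-02 amounts to a cheap liveness probe before each checkout. A hedged sketch with stub objects (`StubConn` and `make_conn` are hypothetical; SQLAlchemy's real ping is driver-level, roughly `SELECT 1`):

```python
# Pre-ping sketch: ping the pooled connection before handing it out;
# on failure, discard it and open a fresh one transparently.

class StubConn:
    def __init__(self, alive=True):
        self.alive = alive

    def ping(self):
        if not self.alive:
            raise ConnectionError("server closed the connection unexpectedly")

def checkout_with_preping(pooled, make_conn):
    try:
        pooled.ping()        # cheap liveness check, e.g. SELECT 1
        return pooled
    except ConnectionError:
        return make_conn()   # stale connection replaced with a fresh one
```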
Section 3: Failover
| Case | Scenario | Setup | Expected | Validation |
|---|---|---|---|---|
| FO-01 | Primary failure | Primary/replica | Switch to replica | Writes work |
| FO-02 | Replica lag | Async replication | Reads consistent | Data accuracy |
| FO-03 | Split brain | Network partition | No dual writes | Data integrity |
FO-01: Primary Failover
Preconditions:
- PostgreSQL primary + standby configured
- Gateway connected to primary
Steps:
# Step 1: Verify current primary
psql $DATABASE_URL -c "SELECT pg_is_in_recovery();"
# Should return 'f' for false (is primary)
# Step 2: Create test data
curl -s -X POST "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "failover-test", "url": "http://test.com"}'
# Step 3: Simulate primary failure
docker stop postgres-primary
# Step 4: Promote standby (run as postgres; data dir path per the official image)
docker exec -u postgres postgres-standby pg_ctl promote -D /var/lib/postgresql/data
# Step 5: Update connection string (or use PgBouncer/HAProxy)
# This depends on your HA setup
# Step 6: Verify writes work on new primary
curl -s -X POST "$GATEWAY_URL/api/servers" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "after-failover", "url": "http://new.com"}'
# Step 7: Verify no data loss
curl -s "$GATEWAY_URL/api/servers?name=failover-test" \
-H "Authorization: Bearer $TOKEN" | jq .Expected Result:
- Failover completes within RTO
- No data loss during failover
- Writes resume on new primary
- Gateway reconnects automatically
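Step 5's "update connection string" can be sketched as a candidate scan: try each DSN and accept the first writable primary, i.e. the first node where `pg_is_in_recovery()` is false. In practice this role check usually sits in PgBouncer/HAProxy or in libpq's `target_session_attrs=read-write` rather than in the gateway; `find_primary` and `connect` below are hypothetical names:

```python
# Failover reconnect sketch: return the first reachable DSN whose node
# is a writable primary, or None if every candidate is down or a replica.

def find_primary(candidates, connect):
    """connect(dsn) -> object with .in_recovery; raises ConnectionError if down."""
    for dsn in candidates:
        try:
            conn = connect(dsn)
        except ConnectionError:
            continue              # node unreachable, try the next candidate
        if not conn.in_recovery:  # primary accepts writes
            return dsn
    return None
```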
Section 4: Transaction Safety
| Case | Scenario | Condition | Expected | Validation |
|---|---|---|---|---|
| TX-01 | Deadlock | Concurrent updates | One aborted | No corruption |
| TX-02 | Long transaction | Timeout | Rollback | Clean state |
| TX-03 | Mid-transaction fail | Connection drop | Rollback | No partial |
TX-01: Deadlock Detection
Preconditions:
- Two concurrent sessions
- Rows that can be locked
Steps:
# In terminal 1:
psql $DATABASE_URL << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock1' WHERE name = 'server-a';
-- Wait here for terminal 2
SELECT pg_sleep(5);
UPDATE servers SET description = 'lock1' WHERE name = 'server-b';
COMMIT;
EOF
# In terminal 2 (run simultaneously):
psql $DATABASE_URL << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock2' WHERE name = 'server-b';
-- This will create deadlock with terminal 1
UPDATE servers SET description = 'lock2' WHERE name = 'server-a';
COMMIT;
EOF
# One should succeed; the other should fail with a deadlock error
Expected Result:
- PostgreSQL detects deadlock
- One transaction aborted
- Other transaction completes
- No data corruption
- Gateway logs deadlock event
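A deadlock-aborted transaction is safe to retry, because PostgreSQL rolls the loser back completely (SQLSTATE 40P01). An illustrative retry wrapper; `DeadlockDetected` stands in for the driver's exception (psycopg2 surfaces it as an error with `pgcode == "40P01"`):

```python
# Retry a whole transaction on deadlock abort, giving up after N attempts.
# Only the losing transaction is retried; its effects were fully rolled back.

class DeadlockDetected(Exception):
    pass

def run_with_retry(txn, attempts=3):
    for attempt in range(attempts):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == attempts - 1:
                raise  # give up after the final attempt
```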
TX-02: Transaction Timeout
Preconditions:
- statement_timeout configured
- Long-running query possible
Steps:
# Step 1: Check current timeout
psql $DATABASE_URL -c "SHOW statement_timeout;"
# Step 2: Create scenario that times out via API
# (Depends on implementation - may need slow endpoint)
# Step 3: Verify timeout error returned
# Should be a clear timeout error, not generic 500
# Step 4: Verify transaction rolled back
# Subsequent queries should work normally
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq .Expected Result:
- Long transactions timeout
- Clean rollback on timeout
- Connection returned to pool
- No lingering locks
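If `statement_timeout` is to be enforced from the client side rather than in `postgresql.conf`, one option is passing it as a libpq startup option through the driver. An illustrative fragment with assumed values, not the gateway's actual configuration:

```python
# Client-side statement_timeout via libpq startup options (psycopg2
# accepts these through connect_args). Value is in milliseconds.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql://user:pass@localhost:5432/mcpgateway",
    connect_args={"options": "-c statement_timeout=30000"},  # 30 s
)
```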
📊 Test Matrix
| Test Case | Connection | Pool | Failover | Transaction | Docker | K8s |
|---|---|---|---|---|---|---|
| CL-01 | ✓ | | | | ✓ | |
| CL-02 | ✓ | | | | ✓ | |
| CL-03 | ✓ | ✓ | | | ✓ | |
| CL-04 | ✓ | | | | ✓ | |
| CP-01 | ✓ | ✓ | | | ✓ | |
| CP-02 | | ✓ | | | ✓ | |
| CP-03 | | ✓ | | | ✓ | |
| FO-01 | ✓ | | ✓ | | ✓ | |
| FO-02 | | | ✓ | | | |
| FO-03 | | | ✓ | | | |
| TX-01 | | | | ✓ | | |
| TX-02 | | | | ✓ | | |
| TX-03 | ✓ | | | ✓ | | |
✅ Success Criteria
- All 13 test cases pass
- Gateway survives database restarts
- Connection pool recovers automatically
- Stale connections detected and replaced
- Failover works with minimal downtime
- No data corruption during any failure
- Proper error messages returned
- Health checks reflect actual state
🔗 Related Files
- mcpgateway/db.py - Database configuration
- mcpgateway/config.py - Pool settings
- mcpgateway/services/health_service.py - Health checks