[TESTING][RESILIENCE]: PostgreSQL Resilience Manual Test Plan (Connection Loss, Failover, Recovery) #2466

@crivetimihai

Description

Goal

Produce a comprehensive manual test plan for PostgreSQL resilience testing including connection loss handling, failover behavior, recovery procedures, and transaction safety under failure conditions.

Why Now?

Database resilience is critical for production stability:

  1. Availability: Gateway must handle database outages gracefully
  2. Data Integrity: No data loss or corruption during failures
  3. Recovery: Automatic reconnection when database returns
  4. User Experience: Meaningful errors during degraded state

📖 User Stories

US-1: Platform Operator - Connection Loss Handling

As a Platform Operator
I want the gateway to handle database connection loss gracefully
So that users receive appropriate errors and the system recovers automatically

Acceptance Criteria:

Feature: Database Connection Loss

  Scenario: Database becomes unreachable
    Given the gateway is running with PostgreSQL
    And the database becomes unreachable
    When a request requiring database access is made
    Then the request should fail with 503 Service Unavailable
    And the error message should indicate database issues
    And the health check should report unhealthy

  Scenario: Database recovers
    Given the database was unreachable
    When the database becomes available again
    Then the connection pool should recover
    And subsequent requests should succeed
    And health check should return healthy

US-2: DBA - Connection Pool Management

As a Database Administrator
I want the connection pool to handle failures properly
So that resources are managed efficiently

Acceptance Criteria:

Feature: Connection Pool Resilience

  Scenario: Connection pool exhaustion
    Given all connections in the pool are in use
    When a new request arrives
    Then the request should wait up to the pool timeout
    And if the timeout is exceeded, a 503 should be returned
    And the pool should recover when connections are freed

  Scenario: Stale connections
    Given idle connections in the pool
    When database restarts
    Then stale connections should be detected
    And new connections should be established

US-3: SRE - Failover Handling

As an SRE
I want the gateway to handle primary/replica failover
So that service continues during database maintenance

Acceptance Criteria:

Feature: Database Failover

  Scenario: Primary failover to replica
    Given a primary-replica PostgreSQL setup
    When the primary fails
    And replica is promoted
    Then gateway should reconnect to new primary
    And writes should resume
    And no data corruption should occur

🏗 Architecture

Database Connection Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    DATABASE CONNECTION ARCHITECTURE                          │
└─────────────────────────────────────────────────────────────────────────────┘

    GATEWAY                    CONNECTION POOL              DATABASE
    ───────                    ───────────────              ────────

  ┌────────────┐            ┌─────────────────┐         ┌─────────────┐
  │ Request    │           │  SQLAlchemy      │         │ PostgreSQL  │
  │ Handler    │──────────▶│  Connection      │────────▶│ Primary     │
  └────────────┘           │  Pool            │         └─────────────┘
        │                  │                   │               │
        │                  │ ┌───────────────┐ │               │
        │                  │ │ Connection 1  │ │               │
        │                  │ ├───────────────┤ │         ┌─────▼─────┐
        │                  │ │ Connection 2  │ │         │ Streaming │
        │                  │ ├───────────────┤ │         │ Replica   │
        │                  │ │ Connection N  │ │         └───────────┘
        │                  │ └───────────────┘ │
        │                  │                   │
        │                  │ Pool Settings:    │
        │                  │ - pool_size=5     │
        │                  │ - max_overflow=10 │
        │                  │ - pool_timeout=30 │
        │                  │ - pool_recycle=   │
        │                  │     1800          │
        │                  └─────────────────┬─┘
        │                                    │
        │                            ┌───────▼───────┐
        │                            │ Health Check  │
        │                            │ pool_pre_ping │
        │                            └───────────────┘

Failure Detection Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                    FAILURE DETECTION AND RECOVERY                            │
└─────────────────────────────────────────────────────────────────────────────┘

    NORMAL                  FAILURE                  RECOVERY
    ──────                  ───────                  ────────

  ┌─────────┐           ┌─────────┐            ┌─────────┐
  │ Request │           │ Request │            │ Request │
  └────┬────┘           └────┬────┘            └────┬────┘
       │                     │                      │
  ┌────▼────┐           ┌────▼────┐           ┌────▼────┐
  │Get Conn │           │Get Conn │           │Get Conn │
  │from Pool│           │from Pool│           │from Pool│
  └────┬────┘           └────┬────┘           └────┬────┘
       │                     │                      │
  ┌────▼────┐           ┌────▼────┐           ┌────▼────┐
  │ Execute │           │ Execute │           │ Execute │
  │ Query   │           │ Query   │           │ Query   │
  └────┬────┘           └────┬────┘           └────┬────┘
       │                     │                      │
  ┌────▼────┐           ┌────▼────┐           ┌────▼────┐
  │ Return  │           │CONN ERR │           │ Return  │
  │ Result  │           │DETECTED │           │ Result  │
  └─────────┘           └────┬────┘           └─────────┘
                             │
                         ┌────▼─────┐
                         │Invalidate│
                         │Connection│
                         └────┬─────┘
                             │
                        ┌────▼────┐
                        │ Return  │
                        │   503   │
                        └─────────┘

📋 Test Environment Setup

Prerequisites

# Environment variables
export DATABASE_URL="postgresql://user:pass@localhost:5432/mcpgateway"
export GATEWAY_URL="http://localhost:8000"

# PostgreSQL container for testing
docker run -d --name postgres-test \
  -e POSTGRES_USER=user \
  -e POSTGRES_PASSWORD=pass \
  -e POSTGRES_DB=mcpgateway \
  -p 5432:5432 \
  postgres:15

# Alternative: Docker Compose with replica
docker-compose -f docker-compose.pg-ha.yml up -d

# Start gateway
make dev

Network Fault Injection Tools

# Install toxiproxy for network simulation
docker run -d --name toxiproxy \
  -p 8474:8474 \
  -p 15432:15432 \
  shopify/toxiproxy

# Configure proxy
toxiproxy-cli create postgres -l 0.0.0.0:15432 -u postgres-test:5432

# Use proxied connection
export DATABASE_URL="postgresql://user:pass@localhost:15432/mcpgateway"
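The same proxy can also inject latency for the slow-network case (CL-04 below); the toxic type and attribute names follow toxiproxy's documented latency toxic:

```shell
# Add 2s of latency (with 500ms jitter) to traffic toward PostgreSQL
toxiproxy-cli toxic add postgres -t latency -a latency=2000 -a jitter=500

# ...run requests and observe query timing...

# Remove the toxic when done (default name is <type>_downstream)
toxiproxy-cli toxic delete postgres -n latency_downstream
```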

🧪 Manual Test Cases

Section 1: Connection Loss

| Case  | Scenario          | Trigger          | Expected      | Validation     |
|-------|-------------------|------------------|---------------|----------------|
| CL-01 | Database shutdown | Stop container   | 503 errors    | Error messages |
| CL-02 | Network partition | Block port       | 503 + timeout | Response time  |
| CL-03 | Connection drop   | Kill connections | Retry/fail    | Reconnection   |
| CL-04 | Slow network      | Add latency      | Timeouts      | Query timing   |

CL-01: Database Shutdown

Preconditions:

  • Gateway running with PostgreSQL
  • Gateway operational and healthy

Steps:

# Step 1: Verify gateway is healthy
curl -s "$GATEWAY_URL/health/ready" | jq .
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq .

# Step 2: Stop PostgreSQL
docker stop postgres-test

# Step 3: Wait for detection (connection timeout)
sleep 5

# Step 4: Verify health check fails
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
echo "Health check status: $HTTP_CODE"
[ "$HTTP_CODE" = "503" ] && echo "PASS: Health check failed" || echo "FAIL"

# Step 5: Make API request
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/api/servers" \
  -H "Authorization: Bearer $TOKEN")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
echo "API response: $HTTP_CODE"
[ "$HTTP_CODE" = "503" ] && echo "PASS: 503 returned" || echo "FAIL"

# Step 6: Restart PostgreSQL
docker start postgres-test

# Step 7: Wait for recovery
sleep 10

# Step 8: Verify recovery
curl -s "$GATEWAY_URL/health/ready" | jq .
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq '.[:1]'
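The fixed sleep in Step 7 can make this test flaky on slower machines; a polling loop (bounded at roughly 60 seconds here) makes recovery detection deterministic:

```shell
# Poll the readiness endpoint until it recovers, up to ~60 seconds
for i in $(seq 1 12); do
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
  [ "$HTTP_CODE" = "200" ] && break
  sleep 5
done
[ "$HTTP_CODE" = "200" ] && echo "PASS: recovered" || echo "FAIL: still unhealthy"
```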

Expected Result:

  • Health check returns 503 when DB is down
  • API requests return 503 with meaningful error
  • Gateway recovers automatically when DB returns
  • No manual intervention required

CL-02: Network Partition

Preconditions:

  • Gateway running with toxiproxy
  • Proxy configured for database

Steps:

# Step 1: Verify normal operation
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq '.total'

# Step 2: Add network partition via toxiproxy
toxiproxy-cli toxic add postgres -t timeout -a timeout=0

# Step 3: Wait for connection attempts to timeout
sleep 35  # Connection timeout typically 30s

# Step 4: Verify failure
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/api/servers" \
  -H "Authorization: Bearer $TOKEN")
echo "Response during partition: $HTTP_CODE"

# Step 5: Remove partition
toxiproxy-cli toxic delete postgres -n timeout_downstream

# Step 6: Verify recovery
sleep 5
curl -s "$GATEWAY_URL/health/ready" | jq .

Expected Result:

  • Requests timeout during partition
  • Proper timeout error returned
  • Recovery after partition heals

Section 2: Connection Pool

| Case  | Scenario          | Condition           | Expected  | Validation       |
|-------|-------------------|---------------------|-----------|------------------|
| CP-01 | Pool exhaustion   | Concurrent requests | Queue/503 | Wait behavior    |
| CP-02 | Stale connections | DB restart          | Reconnect | No stale errors  |
| CP-03 | Pool recovery     | After failure       | Replenish | Connection count |

CP-01: Connection Pool Exhaustion

Preconditions:

  • Pool size = 5, max_overflow = 10
  • Slow endpoint or artificial delay

Steps:

# Step 1: Create slow query endpoint or use existing one
# (May need to add artificial delay for testing)

# Step 2: Launch concurrent requests exceeding pool size
for i in {1..20}; do
  curl -s "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" &
done

# Step 3: Observe pool behavior
# Check logs for "pool limit reached" or similar

# Step 4: Wait for all to complete
wait

# Step 5: Verify all eventually succeeded or failed with 503
# (Should not have connection errors)
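To make Step 5 verifiable rather than eyeballed, each request's status code can be captured; this variant assumes the same endpoint and token as above:

```shell
# Fire 20 concurrent requests and record every status code
rm -f /tmp/cp01-codes.txt
for i in $(seq 1 20); do
  curl -s -o /dev/null -w "%{http_code}\n" \
    "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" \
    >> /tmp/cp01-codes.txt &
done
wait

# Every request should end as 200 (served) or 503 (pool timeout) - nothing else
sort /tmp/cp01-codes.txt | uniq -c
```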

Expected Result:

  • Requests queue when pool full
  • Requests timeout after pool_timeout
  • 503 returned for timed-out requests
  • No connection leaks

CP-02: Stale Connection Detection

Preconditions:

  • pool_pre_ping enabled
  • Idle connections in pool

Steps:

# Step 1: Make initial request (populates pool)
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN"

# Step 2: Check pool statistics
psql $DATABASE_URL -c "SELECT * FROM pg_stat_activity WHERE datname = 'mcpgateway';"

# Step 3: Restart database (invalidates connections)
docker restart postgres-test
sleep 5

# Step 4: Make request (should detect stale and reconnect)
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/api/servers" \
  -H "Authorization: Bearer $TOKEN")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
echo "After restart: $HTTP_CODE"

# Step 5: Verify success (pool_pre_ping detected stale)
[ "$HTTP_CODE" = "200" ] && echo "PASS: Stale connection handled" || echo "FAIL"
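To confirm the restart in Step 3 actually invalidated the pool (and Step 4 did not just hit a server that never went down), the postmaster start time can be compared before and after:

```shell
# Should show a start time later than when Step 1 ran
psql "$DATABASE_URL" -c "SELECT pg_postmaster_start_time();"
```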

Expected Result:

  • pool_pre_ping detects dead connections
  • New connections established automatically
  • Request succeeds without stale connection error

Section 3: Failover

| Case  | Scenario        | Setup             | Expected          | Validation     |
|-------|-----------------|-------------------|-------------------|----------------|
| FO-01 | Primary failure | Primary/replica   | Switch to replica | Writes work    |
| FO-02 | Replica lag     | Async replication | Reads consistent  | Data accuracy  |
| FO-03 | Split brain     | Network partition | No dual writes    | Data integrity |

FO-01: Primary Failover

Preconditions:

  • PostgreSQL primary + standby configured
  • Gateway connected to primary

Steps:

# Step 1: Verify current primary
psql $DATABASE_URL -c "SELECT pg_is_in_recovery();"
# Should return 'f' for false (is primary)

# Step 2: Create test data
curl -s -X POST "$GATEWAY_URL/api/servers" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "failover-test", "url": "http://test.com"}'

# Step 3: Simulate primary failure
docker stop postgres-primary

# Step 4: Promote standby (run as the postgres user; the official image sets PGDATA)
docker exec -u postgres postgres-standby pg_ctl promote

# Step 5: Update connection string (or use PgBouncer/HAProxy)
# This depends on your HA setup

# Step 6: Verify writes work on new primary
curl -s -X POST "$GATEWAY_URL/api/servers" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "after-failover", "url": "http://new.com"}'

# Step 7: Verify no data loss
curl -s "$GATEWAY_URL/api/servers?name=failover-test" \
  -H "Authorization: Bearer $TOKEN" | jq .
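Step 7 can be tightened into an explicit no-data-loss check; the jq filter below assumes the list endpoint returns a JSON array of objects with a name field:

```shell
# Both records - written before and after failover - should be present
COUNT=$(curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" \
  | jq '[.[] | select(.name == "failover-test" or .name == "after-failover")] | length')
[ "$COUNT" = "2" ] && echo "PASS: no data loss" || echo "FAIL: missing records"
```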

Expected Result:

  • Failover completes within RTO
  • No data loss during failover
  • Writes resume on new primary
  • Gateway reconnects automatically

Section 4: Transaction Safety

| Case  | Scenario             | Condition          | Expected    | Validation    |
|-------|----------------------|--------------------|-------------|---------------|
| TX-01 | Deadlock             | Concurrent updates | One aborted | No corruption |
| TX-02 | Long transaction     | Timeout            | Rollback    | Clean state   |
| TX-03 | Mid-transaction fail | Connection drop    | Rollback    | No partial    |

TX-01: Deadlock Detection

Preconditions:

  • Two concurrent sessions
  • Rows that can be locked

Steps:

# In terminal 1:
psql $DATABASE_URL << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock1' WHERE name = 'server-a';
-- Wait here for terminal 2
SELECT pg_sleep(5);
UPDATE servers SET description = 'lock1' WHERE name = 'server-b';
COMMIT;
EOF

# In terminal 2 (run simultaneously):
psql $DATABASE_URL << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock2' WHERE name = 'server-b';
-- This will create deadlock with terminal 1
UPDATE servers SET description = 'lock2' WHERE name = 'server-a';
COMMIT;
EOF

# One should succeed, one should get deadlock error
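The two-terminal dance can be automated in a single script by backgrounding the first session; the sleep is generous so the lock orders reliably cross:

```shell
# Session 1 in the background: lock server-a, pause, then try server-b
psql "$DATABASE_URL" << 'EOF' &
BEGIN;
UPDATE servers SET description = 'lock1' WHERE name = 'server-a';
SELECT pg_sleep(5);
UPDATE servers SET description = 'lock1' WHERE name = 'server-b';
COMMIT;
EOF

sleep 1  # let session 1 take its first lock

# Session 2: lock server-b, then try server-a -> deadlock
psql "$DATABASE_URL" << 'EOF'
BEGIN;
UPDATE servers SET description = 'lock2' WHERE name = 'server-b';
UPDATE servers SET description = 'lock2' WHERE name = 'server-a';
COMMIT;
EOF

wait
# Exactly one session should report: ERROR: deadlock detected
```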

Expected Result:

  • PostgreSQL detects deadlock
  • One transaction aborted
  • Other transaction completes
  • No data corruption
  • Gateway logs deadlock event

TX-02: Transaction Timeout

Preconditions:

  • statement_timeout configured
  • Long-running query possible

Steps:

# Step 1: Check current timeout
psql $DATABASE_URL -c "SHOW statement_timeout;"

# Step 2: Create scenario that times out via API
# (Depends on implementation - may need slow endpoint)

# Step 3: Verify timeout error returned
# Should be a clear timeout error, not generic 500

# Step 4: Verify transaction rolled back
# Subsequent queries should work normally
curl -s "$GATEWAY_URL/api/servers" -H "Authorization: Bearer $TOKEN" | jq .
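The timeout-then-rollback behavior can also be demonstrated directly in psql, independent of the gateway, by setting a short session-level statement_timeout:

```shell
# Force a timeout: a 1s limit against a 5s query
psql "$DATABASE_URL" -c "SET statement_timeout = '1s'; SELECT pg_sleep(5);"
# Expect: ERROR: canceling statement due to statement timeout

# The aborted statement is rolled back; the connection remains usable
psql "$DATABASE_URL" -c "SELECT 1;"
```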

Expected Result:

  • Long transactions timeout
  • Clean rollback on timeout
  • Connection returned to pool
  • No lingering locks

📊 Test Matrix

| Test Case | Connection | Pool | Failover | Transaction | Docker | K8s |
|-----------|------------|------|----------|-------------|--------|-----|
| CL-01     | ✓          |      |          |             |        |     |
| CL-02     | ✓          |      |          |             |        |     |
| CL-03     | ✓          |      |          |             |        |     |
| CL-04     | ✓          |      |          |             |        |     |
| CP-01     |            | ✓    |          |             |        |     |
| CP-02     |            | ✓    |          |             |        |     |
| CP-03     |            | ✓    |          |             |        |     |
| FO-01     |            |      | ✓        |             |        |     |
| FO-02     |            |      | ✓        |             |        |     |
| FO-03     |            |      | ✓        |             |        |     |
| TX-01     |            |      |          | ✓           |        |     |
| TX-02     |            |      |          | ✓           |        |     |
| TX-03     |            |      |          | ✓           |        |     |

✅ Success Criteria

  • All 13 test cases pass
  • Gateway survives database restarts
  • Connection pool recovers automatically
  • Stale connections detected and replaced
  • Failover works with minimal downtime
  • No data corruption during any failure
  • Proper error messages returned
  • Health checks reflect actual state

🔗 Related Files

  • mcpgateway/db.py - Database configuration
  • mcpgateway/config.py - Pool settings
  • mcpgateway/services/health_service.py - Health checks

🔗 Related Issues

Metadata

Assignees

No one assigned

Labels

  • SHOULD
  • P2: Important but not vital; high-value items that are not crucial for the immediate release
  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • manual-testing: Manual testing / test planning issues
  • ready: Validated, ready-to-work-on items
  • testing: Testing (unit, e2e, manual, automated, etc)
