[TESTING][OPERATIONS]: Health Monitoring Manual Test Plan (Liveness, Readiness, Dependencies) #2462

@crivetimihai

Goal

Produce a comprehensive manual test plan for health monitoring including liveness probes, readiness probes, dependency health checks, and the health dashboard.

Why Now?

Production deployments require reliable health monitoring:

  1. Kubernetes/Orchestration: Health probes drive container lifecycle
  2. Load Balancing: Readiness determines traffic routing
  3. Alerting: Health status triggers operational alerts
  4. Dependencies: External service health affects availability

📖 User Stories

US-1: Platform Operator - Liveness Monitoring

As a Platform Operator
I want a liveness probe endpoint
So that orchestrators can restart unhealthy instances

Acceptance Criteria:

Feature: Liveness Probe

  Scenario: Healthy instance returns 200
    Given the application is running normally
    When I call GET /health/live
    Then response should be 200 OK
    And body should indicate healthy status

  Scenario: Unhealthy instance returns 503
    Given the application has a critical failure
    When I call GET /health/live
    Then response should be 503 Service Unavailable
    And body should indicate unhealthy status

  Scenario: Liveness is lightweight
    Given the application is under load
    When liveness is checked frequently
    Then response should be fast (<50ms)
    And resource usage should be minimal

Technical Requirements:

  • Endpoint: /health/live or /healthz
  • No external dependency checks
  • Fast response (<50ms)
  • Returns 200 or 503
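
The requirements above can be turned into a small check on a single observation. This is a minimal sketch, not part of the gateway: `liveness_contract_ok` is a hypothetical helper name, and the 50 ms budget is taken from the list above.

```shell
# liveness_contract_ok CODE SECONDS
# Checks one liveness observation against the requirements above:
# the status must be 200 or 503, and the total time must be under 50 ms.
# Note: a 503 satisfies the contract here; it says nothing about health.
liveness_contract_ok() {
  case "$1" in
    200|503) ;;
    *) echo "FAIL: unexpected status $1"; return 1 ;;
  esac
  # awk handles the floating-point comparison (seconds converted to ms)
  if awk -v t="$2" 'BEGIN { exit !(t * 1000 < 50) }'; then
    echo "PASS: status $1 in ${2}s"
  else
    echo "FAIL: ${2}s exceeds 50 ms"
    return 1
  fi
}
```

Possible usage (the command substitution is left unquoted on purpose so that the code and the time become two arguments): `liveness_contract_ok $(curl -s -o /dev/null -w '%{http_code} %{time_total}' "$GATEWAY_URL/health/live")`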

US-2: Load Balancer - Readiness Monitoring

As a Load Balancer
I want a readiness probe endpoint
So that traffic routes only to ready instances

Acceptance Criteria:

Feature: Readiness Probe

  Scenario: Ready instance returns 200
    Given the application is fully initialized
    And all dependencies are healthy
    When I call GET /health/ready
    Then response should be 200 OK

  Scenario: Initializing instance returns 503
    Given the application is still starting
    When I call GET /health/ready
    Then response should be 503 Service Unavailable

  Scenario: Dependency failure affects readiness
    Given the database is unreachable
    When I call GET /health/ready
    Then response should be 503 Service Unavailable
    And body should indicate which dependency failed

US-3: Operations Engineer - Dependency Health

As an Operations Engineer
I want detailed dependency health checks
So that I can diagnose connectivity issues

Acceptance Criteria:

Feature: Dependency Health

  Scenario: Database health check
    Given database connection is configured
    When I call GET /health/dependencies
    Then database health should be reported
    And connection pool status should be shown

  Scenario: Redis health check
    Given Redis is configured
    When I call GET /health/dependencies
    Then Redis connectivity should be reported
    And latency should be measured

  Scenario: MCP server health check
    Given MCP servers are registered
    When I call GET /health/dependencies
    Then each server's health should be reported
    And unreachable servers should be flagged

🏗 Architecture

Health Check Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HEALTH CHECK ARCHITECTURE                            │
└─────────────────────────────────────────────────────────────────────────────┘

    ORCHESTRATOR/LB              GATEWAY                    DEPENDENCIES
    ─────────────────           ───────                    ────────────

  ┌──────────────┐           ┌─────────────┐
  │ Kubernetes   │──────────▶│ /health/live│ (Liveness)
  │ kubelet      │           │             │──▶ Internal state only
  └──────────────┘           └─────────────┘    (fast, no deps)

  ┌──────────────┐           ┌─────────────┐         ┌─────────────┐
  │ Load         │──────────▶│/health/ready│────────▶│ Database    │
  │ Balancer     │           │             │         └─────────────┘
  └──────────────┘           │ (Readiness) │         ┌─────────────┐
                             │             │────────▶│ Redis       │
                             └─────────────┘         └─────────────┘

  ┌──────────────┐           ┌─────────────┐         ┌─────────────┐
  │ Monitoring   │──────────▶│/health/deps │────────▶│ MCP Servers │
  │ Dashboard    │           │             │         ├─────────────┤
  └──────────────┘           │ (Detailed)  │────────▶│ A2A Agents  │
                             │             │         ├─────────────┤
                             └─────────────┘         │ Federation  │
                                                     └─────────────┘

Health Response Structure

┌─────────────────────────────────────────────────────────────────────────────┐
│                         HEALTH RESPONSE FORMAT                               │
└─────────────────────────────────────────────────────────────────────────────┘

    LIVENESS (/health/live):
    ┌────────────────────────────────────────────────────────────────────┐
    │ {                                                                  │
    │   "status": "healthy",                                            │
    │   "timestamp": "2024-01-15T10:30:00Z"                            │
    │ }                                                                  │
    └────────────────────────────────────────────────────────────────────┘

    READINESS (/health/ready):
    ┌────────────────────────────────────────────────────────────────────┐
    │ {                                                                  │
    │   "status": "ready",                                              │
    │   "checks": {                                                      │
    │     "database": "ok",                                             │
    │     "redis": "ok"                                                 │
    │   }                                                                │
    │ }                                                                  │
    └────────────────────────────────────────────────────────────────────┘

    DETAILED (/health/dependencies):
    ┌────────────────────────────────────────────────────────────────────┐
    │ {                                                                  │
    │   "database": {                                                    │
    │     "status": "healthy",                                          │
    │     "latency_ms": 2,                                              │
    │     "pool_size": 10,                                              │
    │     "pool_available": 8                                           │
    │   },                                                               │
    │   "redis": {                                                       │
    │     "status": "healthy",                                          │
    │     "latency_ms": 1                                               │
    │   },                                                               │
    │   "mcp_servers": {                                                 │
    │     "total": 5,                                                    │
    │     "healthy": 4,                                                  │
    │     "unhealthy": ["server-xyz"]                                   │
    │   }                                                                │
    │ }                                                                  │
    └────────────────────────────────────────────────────────────────────┘
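
The DETAILED format above can be condensed into a one-line summary with jq. This is a sketch that assumes the field names shown in the diagram; the real response may differ, and `summarize_deps` is a hypothetical helper name.

```shell
# summarize_deps: read a /health/dependencies response on stdin and print
# a one-line summary. Field names follow the format sketched above and
# are assumptions about the actual response.
summarize_deps() {
  jq -r '"database=\(.database.status) redis=\(.redis.status) mcp=\(.mcp_servers.healthy)/\(.mcp_servers.total)"'
}
```

Possible usage: `curl -s "$GATEWAY_URL/health/dependencies" -H "Authorization: Bearer $ADMIN_TOKEN" | summarize_deps`. A missing section shows up as `null` in the summary.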

📋 Test Environment Setup

Prerequisites

export GATEWAY_URL="http://localhost:8000"
export DATABASE_URL="postgresql://..."
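export REDIS_URL="redis://localhost:6379"
# The authenticated test cases below also expect TOKEN and ADMIN_TOKEN
# to hold valid bearer tokens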

# Ensure all dependencies are running
docker-compose up -d postgres redis

# Start gateway
make dev

🧪 Manual Test Cases

Section 1: Liveness Probe

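| Case  | Scenario         | Condition        | Expected  | Validation    |
|-------|------------------|------------------|-----------|---------------|
| LP-01 | Healthy instance | Normal operation | 200 OK    | Response body |
| LP-02 | Response time    | Under load       | <50ms     | Timing        |
| LP-03 | No external deps | Network isolated | Still 200 | Independence  |

LP-01: Healthy Instance Returns 200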

Preconditions:

  • Gateway is running
  • Application fully started

Steps:

# Step 1: Check liveness endpoint
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/live")
HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
# Drop the status-code line; keeps multi-line JSON bodies intact
BODY=$(echo "$RESPONSE" | sed '$d')

echo "HTTP Code: $HTTP_CODE"
echo "Body: $BODY"

# Step 2: Verify 200 status
[ "$HTTP_CODE" = "200" ] && echo "PASS: Status 200" || echo "FAIL: Status $HTTP_CODE"

# Step 3: Verify response body
echo "$BODY" | jq -e '.status == "healthy"' > /dev/null && \
  echo "PASS: Status healthy" || echo "FAIL: Status not healthy"

Expected Result:

  • HTTP 200 returned
  • Body contains healthy status
  • Response is valid JSON
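
The same fetch-and-split pattern recurs in the readiness cases, so it may be convenient to wrap it once. This is a minimal sketch; `probe` is a hypothetical helper name, not something the gateway or the repository provides.

```shell
# probe URL: fetch a health endpoint and print "<http_code> <body>".
# The status code is appended by curl on its own final line and split off
# afterwards, so multi-line JSON bodies are handled; the body is flattened
# to a single line for easy logging.
probe() {
  probe_resp=$(curl -s -w '\n%{http_code}' "$1") || return 1
  probe_code=$(printf '%s\n' "$probe_resp" | tail -n 1)
  probe_body=$(printf '%s\n' "$probe_resp" | sed '$d' | tr -d '\n')
  printf '%s %s\n' "$probe_code" "$probe_body"
}
```

Possible usage: `probe "$GATEWAY_URL/health/live"` or `probe "$GATEWAY_URL/health/ready"`.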

LP-02: Response Time Under Load

Preconditions:

  • Gateway under simulated load

Steps:

# Step 1: Generate background load
for i in {1..100}; do
  # Discard response bodies so they do not flood the terminal
  curl -s -o /dev/null "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" &
done

# Step 2: Measure liveness response time
for i in {1..10}; do
  TIME=$(curl -s -o /dev/null -w "%{time_total}" "$GATEWAY_URL/health/live")
  TIME_MS=$(echo "$TIME * 1000" | bc)
  echo "Request $i: ${TIME_MS}ms"

  # Verify under 50ms (quoted so an empty bc result fails cleanly)
  [ "$(echo "$TIME_MS < 50" | bc)" -eq 1 ] && echo "PASS" || echo "FAIL: Too slow"
done

# Step 3: Wait for background requests
wait

Expected Result:

  • All responses under 50ms
  • Consistent response times
  • No degradation under load
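
LP-03 appears in the Section 1 table without steps. One possible way to exercise it is sketched below, assuming that stopping the `postgres` and `redis` containers from Prerequisites is an acceptable stand-in for network isolation. The function name and the `LP03_SETTLE_SECONDS` variable are hypothetical.

```shell
# lp03_liveness_independent: stop the backing services, confirm that the
# liveness endpoint still returns 200, then start the services again.
# Assumption: service names match the docker-compose command in
# Prerequisites.
lp03_liveness_independent() {
  docker-compose stop postgres redis >/dev/null 2>&1
  # Give the gateway a moment to notice the outage
  sleep "${LP03_SETTLE_SECONDS:-5}"
  lp03_code=$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/live")
  # Restore dependencies before reporting, regardless of the result
  docker-compose start postgres redis >/dev/null 2>&1
  if [ "$lp03_code" = "200" ]; then
    echo "PASS: liveness 200 with dependencies stopped"
  else
    echo "FAIL: liveness returned $lp03_code"
    return 1
  fi
}
```

Calling `lp03_liveness_independent` runs the whole case. A FAIL here would suggest that the liveness handler touches an external dependency.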

Section 2: Readiness Probe

| Case  | Scenario       | Condition          | Expected   | Validation        |
|-------|----------------|--------------------|------------|-------------------|
| RP-01 | Ready instance | All deps healthy   | 200 OK     | Response body     |
| RP-02 | Database down  | DB unreachable     | 503        | Error details     |
| RP-03 | Redis down     | Redis unreachable  | 503 or 200 | Depends on config |
| RP-04 | Startup phase  | Still initializing | 503        | Not ready         |

RP-01: Ready Instance Returns 200

Preconditions:

  • Gateway running
  • All dependencies healthy

Steps:

# Step 1: Check readiness endpoint
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/ready")
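HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
# Drop the status-code line; keeps multi-line JSON bodies intact
BODY=$(echo "$RESPONSE" | sed '$d')

echo "HTTP Code: $HTTP_CODE"
# Print the label separately so jq receives only the JSON body
echo "Body:"
echo "$BODY" | jq .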

# Step 2: Verify all checks pass
echo "$BODY" | jq -e '.checks | to_entries | all(.value == "ok")' && \
  echo "PASS: All checks OK" || echo "FAIL: Some checks failed"

Expected Result:

  • HTTP 200 returned
  • All dependency checks pass
  • Response includes check details

RP-02: Database Down Returns 503

Preconditions:

  • Gateway running
  • Database container can be stopped

Steps:

# Step 1: Stop database
docker-compose stop postgres

# Step 2: Check readiness (wait for detection)
sleep 5
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/ready")
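HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
# Drop the status-code line; keeps multi-line JSON bodies intact
BODY=$(echo "$RESPONSE" | sed '$d')

echo "HTTP Code: $HTTP_CODE"
# Print the label separately so jq receives only the JSON body
echo "Body:"
echo "$BODY" | jq .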

# Step 3: Verify 503 and database failure indicated
[ "$HTTP_CODE" = "503" ] && echo "PASS: 503 returned" || echo "FAIL: Expected 503"
echo "$BODY" | jq -e '.checks.database != "ok"' && \
  echo "PASS: Database failure detected" || echo "FAIL: Database issue not detected"

# Step 4: Restart database
docker-compose start postgres
sleep 5

# Step 5: Verify recovery
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
[ "$HTTP_CODE" = "200" ] && echo "PASS: Recovered" || echo "FAIL: Not recovered"

Expected Result:

  • 503 returned when database down
  • Response indicates database failure
  • Recovery when database restored
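
RP-03 and RP-04 are listed in the Section 2 table without steps. The sketch below shows one way they might be exercised. All function and variable names are hypothetical. For RP-03, reading 503 as "Redis required" and 200 as "Redis optional" is an interpretation of "Depends on config", not documented gateway behavior.

```shell
# rp04_wait_until_ready [ATTEMPTS]: poll /health/ready right after starting
# the gateway. Prints each observed status code (503 is expected while
# initializing; curl reports 000 if the port is not open yet), succeeds
# once 200 is seen, and fails after the given number of attempts.
rp04_wait_until_ready() {
  rp04_attempts=${1:-30}
  rp04_i=1
  while [ "$rp04_i" -le "$rp04_attempts" ]; do
    rp04_code=$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/ready")
    echo "attempt $rp04_i: $rp04_code"
    if [ "$rp04_code" = "200" ]; then
      echo "PASS: ready after $rp04_i attempt(s)"
      return 0
    fi
    rp04_i=$((rp04_i + 1))
    sleep "${RP04_INTERVAL_SECONDS:-1}"
  done
  echo "FAIL: not ready after $rp04_attempts attempts"
  return 1
}

# rp03_redis_down: stop Redis, read the readiness code, restart Redis.
# The plan accepts either 503 or 200, so only other codes fail.
rp03_redis_down() {
  docker-compose stop redis >/dev/null 2>&1
  sleep "${RP03_SETTLE_SECONDS:-5}"
  rp03_code=$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/ready")
  docker-compose start redis >/dev/null 2>&1
  case "$rp03_code" in
    503) echo "PASS: 503 (Redis treated as required)" ;;
    200) echo "PASS: 200 (Redis treated as optional)" ;;
    *)   echo "FAIL: unexpected status $rp03_code"; return 1 ;;
  esac
}
```

For RP-04, start the gateway in another terminal and immediately run `rp04_wait_until_ready 60`; the log of attempts should show non-200 codes before the first 200.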

Section 3: Dependency Health

| Case  | Scenario         | Dependency | Expected         | Validation      |
|-------|------------------|------------|------------------|-----------------|
| DH-01 | All healthy      | All deps   | Full status      | All green       |
| DH-02 | Database latency | Slow DB    | Latency reported | Timing shown    |
| DH-03 | MCP server down  | One server | Partial healthy  | Server flagged  |
| DH-04 | Connection pool  | High usage | Pool stats       | Available count |

DH-01: All Dependencies Healthy

Preconditions:

  • All dependencies running
  • MCP servers registered

Steps:

# Step 1: Call dependency health endpoint
curl -s "$GATEWAY_URL/health/dependencies" \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq .

# Step 2: Verify all components reported
RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
  -H "Authorization: Bearer $ADMIN_TOKEN")

# Check database
echo "$RESPONSE" | jq -e '.database.status == "healthy"' && \
  echo "PASS: Database healthy" || echo "FAIL: Database unhealthy"

# Check Redis (if configured)
echo "$RESPONSE" | jq -e '.redis.status == "healthy"' && \
  echo "PASS: Redis healthy" || echo "WARN: Redis not configured/healthy"

# Check MCP servers
TOTAL=$(echo "$RESPONSE" | jq '.mcp_servers.total')
HEALTHY=$(echo "$RESPONSE" | jq '.mcp_servers.healthy')
echo "MCP Servers: $HEALTHY/$TOTAL healthy"

Expected Result:

  • All dependencies report healthy
  • Latency metrics included
  • Complete status overview

DH-03: MCP Server Down Detected

Preconditions:

  • Multiple MCP servers registered
  • One server can be stopped

Steps:

# Step 1: Check current server health
curl -s "$GATEWAY_URL/health/dependencies" \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.mcp_servers'

# Step 2: Stop one MCP server (simulate failure)
# (Depends on your setup - could be docker stop, kill process, etc.)
docker stop mcp-server-test

# Step 3: Wait for health check cycle
sleep 30  # Or configured health check interval

# Step 4: Verify unhealthy server detected
RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
  -H "Authorization: Bearer $ADMIN_TOKEN")

echo "$RESPONSE" | jq '.mcp_servers'

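# Check that at least one server is reported unhealthy
echo "$RESPONSE" | jq -e '.mcp_servers.unhealthy | length > 0' && \
  echo "PASS: Unhealthy server detected" || echo "FAIL: Not detected"

# Step 5: Restart server
docker start mcp-server-test

# Step 6: Verify recovery after the next health check cycle
sleep 30
curl -s "$GATEWAY_URL/health/dependencies" \
  -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.mcp_servers'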

Expected Result:

  • Stopped server detected as unhealthy
  • Other servers remain healthy
  • Recovery detected after restart
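
DH-02 and DH-04 are listed in the Section 3 table without steps. The checks below are a sketch that assumes the `latency_ms`, `pool_size` and `pool_available` fields from the response format shown earlier; the helper names are hypothetical. Producing a slow database or high pool usage is left to the tester's environment.

```shell
# dh02_latency_reported: read a /health/dependencies response on stdin and
# succeed only if the database latency is reported as a number.
dh02_latency_reported() {
  jq -e '.database.latency_ms | type == "number"' >/dev/null
}

# dh04_pool_stats: read the same response on stdin and print
# "<available>/<size>". Fails (no output, non-zero exit) if either field
# is missing or if available exceeds size.
dh04_pool_stats() {
  jq -er '.database
          | select((.pool_size | type) == "number"
                   and (.pool_available | type) == "number")
          | select(.pool_available <= .pool_size)
          | "\(.pool_available)/\(.pool_size)"'
}
```

Possible usage: store the response once with `RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" -H "Authorization: Bearer $ADMIN_TOKEN")`, then run `echo "$RESPONSE" | dh02_latency_reported && echo PASS` and `echo "$RESPONSE" | dh04_pool_stats`. For DH-04, the available count should drop while the gateway is under load.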

Section 4: Health Dashboard

| Case  | UI Element         | Action | Expected             |
|-------|--------------------|--------|----------------------|
| HD-01 | Dashboard          | Access | Shows all health     |
| HD-02 | Status indicators  | View   | Color-coded          |
| HD-03 | Dependency details | Click  | Expanded info        |
| HD-04 | Auto-refresh       | Wait   | Updates periodically |

HD-01: Health Dashboard Access

Preconditions:

  • Admin UI enabled
  • User logged in as admin

Steps:

1. Navigate to http://localhost:8080/admin/#health (adjust host and port if the Admin UI is served elsewhere, e.g. under $GATEWAY_URL)
2. Verify dashboard displays:
   - Overall system status (healthy/degraded/unhealthy)
   - Database connection status
   - Redis connection status (if configured)
   - MCP server count and health
   - Federation peer status (if configured)
3. Verify color-coded indicators:
   - Green = healthy
   - Yellow = degraded
   - Red = unhealthy
4. Verify refresh button works

Expected Result:

  • Dashboard loads with all health info
  • Status indicators color-coded
  • All dependencies shown
  • Manual refresh works

📊 Test Matrix

Test Case Liveness Readiness Dependencies UI Kubernetes
LP-01
LP-02
LP-03
RP-01
RP-02
RP-03
RP-04
DH-01
DH-02
DH-03
DH-04
HD-01
HD-02
HD-03
HD-04

✅ Success Criteria

  • All 15 test cases pass
  • Liveness probe responds in <50ms
  • Readiness reflects actual dependency status
  • Database failures detected
  • Redis failures detected (if configured)
  • MCP server health tracked
  • Health dashboard functional
  • Kubernetes integration works

🔗 Related Files

  • mcpgateway/routers/health.py
  • mcpgateway/services/health_service.py
  • mcpgateway/services/server_health_service.py
  • mcpgateway/admin.py (health dashboard)

🔗 Related Issues

  • Kubernetes deployment testing
  • Load balancer integration

Labels: SHOULD, chore, manual-testing, ready, testing
