[TESTING][OPERATIONS]: Health Monitoring Manual Test Plan (Liveness, Readiness, Dependencies) #2462
Goal
Produce a comprehensive manual test plan for health monitoring including liveness probes, readiness probes, dependency health checks, and the health dashboard.
Why Now?
Production deployments require reliable health monitoring:
- Kubernetes/Orchestration: Health probes drive container lifecycle
- Load Balancing: Readiness determines traffic routing
- Alerting: Health status triggers operational alerts
- Dependencies: External service health affects availability
📖 User Stories
US-1: Platform Operator - Liveness Monitoring
As a Platform Operator
I want a liveness probe endpoint
So that orchestrators can restart unhealthy instances
Acceptance Criteria:
Feature: Liveness Probe
Scenario: Healthy instance returns 200
Given the application is running normally
When I call GET /health/live
Then response should be 200 OK
And body should indicate healthy status
Scenario: Unhealthy instance returns 503
Given the application has a critical failure
When I call GET /health/live
Then response should be 503 Service Unavailable
And body should indicate unhealthy status
Scenario: Liveness is lightweight
Given the application is under load
When liveness is checked frequently
Then response should be fast (<50ms)
And resource usage should be minimal
Technical Requirements:
- Endpoint: /health/live or /healthz
- No external dependency checks
- Fast response (<50ms)
- Returns 200 or 503
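The "Returns 200 or 503" requirement can be checked mechanically. The sketch below is illustrative only: `classify_liveness` is a hypothetical helper (not part of the gateway) that interprets a status code already fetched with curl, and it treats any code other than 200 or 503 as a violation of the requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a liveness HTTP status code to a verdict.
classify_liveness() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "unhealthy" ;;
    *)   echo "unexpected" ;;  # violates the "Returns 200 or 503" requirement
  esac
}

# Example against a running gateway (GATEWAY_URL as set in Prerequisites):
#   code=$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/live")
#   classify_liveness "$code"
```

If the deployment also exposes /healthz, the same helper could be pointed at that path.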
US-2: Load Balancer - Readiness Monitoring
As a Load Balancer
I want a readiness probe endpoint
So that traffic routes only to ready instances
Acceptance Criteria:
Feature: Readiness Probe
Scenario: Ready instance returns 200
Given the application is fully initialized
And all dependencies are healthy
When I call GET /health/ready
Then response should be 200 OK
Scenario: Initializing instance returns 503
Given the application is still starting
When I call GET /health/ready
Then response should be 503 Service Unavailable
Scenario: Dependency failure affects readiness
Given the database is unreachable
When I call GET /health/ready
Then response should be 503 Service Unavailable
And body should indicate which dependency failed
US-3: Operations Engineer - Dependency Health
As an Operations Engineer
I want detailed dependency health checks
So that I can diagnose connectivity issues
Acceptance Criteria:
Feature: Dependency Health
Scenario: Database health check
Given database connection is configured
When I call GET /health/dependencies
Then database health should be reported
And connection pool status should be shown
Scenario: Redis health check
Given Redis is configured
When I call GET /health/dependencies
Then Redis connectivity should be reported
And latency should be measured
Scenario: MCP server health check
Given MCP servers are registered
When I call GET /health/dependencies
Then each server's health should be reported
And unreachable servers should be flagged
🏗 Architecture
Health Check Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ HEALTH CHECK ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
  ORCHESTRATOR/LB                GATEWAY              DEPENDENCIES
  ───────────────                ───────              ────────────
  ┌──────────────┐           ┌─────────────┐
  │  Kubernetes  │──────────▶│ /health/live│  (Liveness)
  │   kubelet    │           │             │──▶ Internal state only
  └──────────────┘           └─────────────┘    (fast, no deps)
  ┌──────────────┐           ┌─────────────┐         ┌─────────────┐
  │     Load     │──────────▶│/health/ready│────────▶│  Database   │
  │   Balancer   │           │             │         └─────────────┘
  └──────────────┘           │ (Readiness) │         ┌─────────────┐
                             │             │────────▶│    Redis    │
                             └─────────────┘         └─────────────┘
  ┌──────────────┐           ┌─────────────┐         ┌─────────────┐
  │  Monitoring  │──────────▶│/health/deps │────────▶│ MCP Servers │
  │  Dashboard   │           │             │         ├─────────────┤
  └──────────────┘           │ (Detailed)  │────────▶│ A2A Agents  │
                             │             │         ├─────────────┤
                             └─────────────┘         │ Federation  │
                                                     └─────────────┘
Health Response Structure
┌─────────────────────────────────────────────────────────────────────────────┐
│ HEALTH RESPONSE FORMAT │
└─────────────────────────────────────────────────────────────────────────────┘
LIVENESS (/health/live):
┌────────────────────────────────────────────────────────────────────┐
│ { │
│ "status": "healthy", │
│ "timestamp": "2024-01-15T10:30:00Z" │
│ } │
└────────────────────────────────────────────────────────────────────┘
READINESS (/health/ready):
┌────────────────────────────────────────────────────────────────────┐
│ { │
│ "status": "ready", │
│ "checks": { │
│ "database": "ok", │
│ "redis": "ok" │
│ } │
│ } │
└────────────────────────────────────────────────────────────────────┘
DETAILED (/health/dependencies):
┌────────────────────────────────────────────────────────────────────┐
│ { │
│ "database": { │
│ "status": "healthy", │
│ "latency_ms": 2, │
│ "pool_size": 10, │
│ "pool_available": 8 │
│ }, │
│ "redis": { │
│ "status": "healthy", │
│ "latency_ms": 1 │
│ }, │
│ "mcp_servers": { │
│ "total": 5, │
│ "healthy": 4, │
│ "unhealthy": ["server-xyz"] │
│ } │
│ } │
└────────────────────────────────────────────────────────────────────┘
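To make the response shapes above easier to assert on, the following sketch extracts the failing readiness checks and the number of database connections in use. It assumes the field names shown in the examples (`checks`, `database.pool_size`, `database.pool_available`) match the real responses and that `jq` is installed; `failed_checks` and `pool_in_use` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the name of every readiness check whose value is not "ok".
# $1: JSON body returned by /health/ready
failed_checks() {
  printf '%s' "$1" | jq -r '.checks // {} | to_entries[] | select(.value != "ok") | .key'
}

# Print how many database connections are currently in use.
# $1: JSON body returned by /health/dependencies
pool_in_use() {
  printf '%s' "$1" | jq -r '.database.pool_size - .database.pool_available'
}

# Example:
#   failed_checks "$(curl -s "$GATEWAY_URL/health/ready")"
```

An empty result from `failed_checks` means every reported check was "ok".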
📋 Test Environment Setup
Prerequisites
export GATEWAY_URL="http://localhost:8000"
export DATABASE_URL="postgresql://..."
export REDIS_URL="redis://localhost:6379"
# Ensure all dependencies are running
docker-compose up -d postgres redis
# Start gateway
make dev
🧪 Manual Test Cases
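Before running the cases below, it can help to confirm that the gateway has finished starting instead of relying on fixed sleeps. The sketch below is one way to do that: `retry_until` and `gateway_ready` are hypothetical helpers, and the attempt count and delay in the example are arbitrary assumptions.

```shell
#!/usr/bin/env bash
# retry_until MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS is reached.
retry_until() {
  local max="$1" delay="$2" attempt=1
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if "$@"; then
      return 0
    fi
    # Only sleep when another attempt will follow.
    [ "$attempt" -lt "$max" ] && sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1
}

# Succeeds once the readiness probe answers 200.
gateway_ready() {
  [ "$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/ready")" = "200" ]
}

# Example: poll every 2 seconds, up to 30 times.
#   retry_until 30 2 gateway_ready && echo "Gateway ready" || echo "Gateway not ready"
```

The same loop could support RP-04 by recording the status codes seen while the gateway is still initializing, where 503 responses are expected before the first 200.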
Section 1: Liveness Probe
| Case | Scenario | Condition | Expected | Validation |
|---|---|---|---|---|
| LP-01 | Healthy instance | Normal operation | 200 OK | Response body |
| LP-02 | Response time | Under load | <50ms | Timing |
| LP-03 | No external deps | Network isolated | Still 200 | Independence |
LP-01: Healthy Instance Returns 200
Preconditions:
- Gateway is running
- Application fully started
Steps:
# Step 1: Check liveness endpoint
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/live")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | sed '$d')  # everything except the trailing status-code line
echo "HTTP Code: $HTTP_CODE"
echo "Body: $BODY"
# Step 2: Verify 200 status
[ "$HTTP_CODE" = "200" ] && echo "PASS: Status 200" || echo "FAIL: Status $HTTP_CODE"
# Step 3: Verify response body
echo "$BODY" | jq -e '.status == "healthy"' > /dev/null && \
echo "PASS: Status healthy" || echo "FAIL: Status not healthy"
Expected Result:
- HTTP 200 returned
- Body contains healthy status
- Response is valid JSON
LP-02: Response Time Under Load
Preconditions:
- Gateway under simulated load
Steps:
# Step 1: Generate background load
# (TOKEN must hold a valid bearer token; it is not set in Prerequisites)
for i in {1..100}; do
curl -s -o /dev/null "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" &
done
# Step 2: Measure liveness response time
for i in {1..10}; do
TIME=$(curl -s -o /dev/null -w "%{time_total}" "$GATEWAY_URL/health/live")
TIME_MS=$(echo "$TIME * 1000" | bc)
echo "Request $i: ${TIME_MS}ms"
# Verify under 50ms
[ $(echo "$TIME_MS < 50" | bc) -eq 1 ] && echo "PASS" || echo "FAIL: Too slow"
done
# Step 3: Wait for background requests
wait
Expected Result:
- All responses under 50ms
- Consistent response times
- No degradation under load
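Checking each request individually makes it hard to judge consistency across the run. As a sketch, the samples can be summarized in one pass with awk; `latency_report` is a hypothetical helper, and it treats any sample at or above the limit as a failure because the requirement is strictly "<50ms".

```shell
#!/usr/bin/env bash
# latency_report [LIMIT_MS]
# Reads curl time_total values (seconds, one per line) on stdin, prints a
# one-line summary, and exits non-zero if any sample reaches the limit.
latency_report() {
  awk -v limit_ms="${1:-50}" '
    {
      ms = $1 * 1000
      if (ms > max) max = ms
      if (ms >= limit_ms) over++
    }
    END {
      printf "samples=%d max_ms=%.1f over_limit=%d\n", NR, max, over
      exit (over > 0)
    }
  '
}

# Example: ten liveness samples, 50 ms limit.
#   for i in $(seq 1 10); do
#     curl -s -o /dev/null -w '%{time_total}\n' "$GATEWAY_URL/health/live"
#   done | latency_report 50 && echo "PASS" || echo "FAIL: Too slow"
```

Running the summary while the background load from Step 1 is still active keeps the measurement faithful to the "under load" precondition.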
Section 2: Readiness Probe
| Case | Scenario | Condition | Expected | Validation |
|---|---|---|---|---|
| RP-01 | Ready instance | All deps healthy | 200 OK | Response body |
| RP-02 | Database down | DB unreachable | 503 | Error details |
| RP-03 | Redis down | Redis unreachable | 503 or 200 | Depends on config |
| RP-04 | Startup phase | Still initializing | 503 | Not ready |
RP-01: Ready Instance Returns 200
Preconditions:
- Gateway running
- All dependencies healthy
Steps:
# Step 1: Check readiness endpoint
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/ready")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | sed '$d')  # everything except the trailing status-code line
echo "HTTP Code: $HTTP_CODE"
echo "Body:"
echo "$BODY" | jq .
# Step 2: Verify all checks pass
echo "$BODY" | jq -e '.checks | to_entries | all(.value == "ok")' && \
echo "PASS: All checks OK" || echo "FAIL: Some checks failed"
Expected Result:
- HTTP 200 returned
- All dependency checks pass
- Response includes check details
RP-02: Database Down Returns 503
Preconditions:
- Gateway running
- Database container can be stopped
Steps:
# Step 1: Stop database
docker-compose stop postgres
# Step 2: Check readiness (wait for detection)
sleep 5
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/ready")
HTTP_CODE=$(echo "$RESPONSE" | tail -1)
BODY=$(echo "$RESPONSE" | sed '$d')  # everything except the trailing status-code line
echo "HTTP Code: $HTTP_CODE"
echo "Body:"
echo "$BODY" | jq .
# Step 3: Verify 503 and database failure indicated
[ "$HTTP_CODE" = "503" ] && echo "PASS: 503 returned" || echo "FAIL: Expected 503"
echo "$BODY" | jq -e '.checks.database != "ok"' && \
echo "PASS: Database failure detected" || echo "FAIL: Database issue not detected"
# Step 4: Restart database
docker-compose start postgres
sleep 5
# Step 5: Verify recovery
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
[ "$HTTP_CODE" = "200" ] && echo "PASS: Recovered" || echo "FAIL: Not recovered"
Expected Result:
- 503 returned when database down
- Response indicates database failure
- Recovery when database restored
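One risk with the steps above is that an aborted run leaves the database stopped and skews later cases. A possible guard, sketched below, wraps the outage so the dependency is always restarted regardless of the verdict. `run_outage_test` and the three callbacks are hypothetical names, and the 5-second wait in the example simply mirrors the sleep used in Step 2.

```shell
#!/usr/bin/env bash
# run_outage_test STOP_FN CHECK_FN START_FN
# Induces an outage, runs the check, and always attempts to restore.
# Returns the check's status, 2 if the outage could not be induced,
# or 3 if the restore step failed.
run_outage_test() {
  local stop_fn="$1" check_fn="$2" start_fn="$3" rc
  "$stop_fn" || return 2
  "$check_fn"
  rc=$?                      # remember the verdict, restore before returning
  "$start_fn" || return 3
  return "$rc"
}

# Example wiring for RP-02:
#   stop_db()    { docker-compose stop postgres; }
#   start_db()   { docker-compose start postgres; }
#   expect_503() {
#     sleep 5
#     [ "$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/ready")" = "503" ]
#   }
#   run_outage_test stop_db expect_503 start_db && echo "PASS: RP-02" || echo "FAIL: RP-02"
```

The same wrapper could drive RP-03 by swapping in callbacks that stop and start Redis, with the expected code chosen according to the deployment's configuration.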
Section 3: Dependency Health
| Case | Scenario | Dependency | Expected | Validation |
|---|---|---|---|---|
| DH-01 | All healthy | All deps | Full status | All green |
| DH-02 | Database latency | Slow DB | Latency reported | Timing shown |
| DH-03 | MCP server down | One server | Partial healthy | Server flagged |
| DH-04 | Connection pool | High usage | Pool stats | Available count |
DH-01: All Dependencies Healthy
Preconditions:
- All dependencies running
- MCP servers registered
Steps:
# Step 1: Call dependency health endpoint
curl -s "$GATEWAY_URL/health/dependencies" \
-H "Authorization: Bearer $ADMIN_TOKEN" | jq .
# Step 2: Verify all components reported
RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
-H "Authorization: Bearer $ADMIN_TOKEN")
# Check database
echo "$RESPONSE" | jq -e '.database.status == "healthy"' && \
echo "PASS: Database healthy" || echo "FAIL: Database unhealthy"
# Check Redis (if configured)
echo "$RESPONSE" | jq -e '.redis.status == "healthy"' && \
echo "PASS: Redis healthy" || echo "WARN: Redis not configured/healthy"
# Check MCP servers
TOTAL=$(echo "$RESPONSE" | jq '.mcp_servers.total')
HEALTHY=$(echo "$RESPONSE" | jq '.mcp_servers.healthy')
echo "MCP Servers: $HEALTHY/$TOTAL healthy"
Expected Result:
- All dependencies report healthy
- Latency metrics included
- Complete status overview
DH-03: MCP Server Down Detected
Preconditions:
- Multiple MCP servers registered
- One server can be stopped
Steps:
# Step 1: Check current server health
curl -s "$GATEWAY_URL/health/dependencies" \
-H "Authorization: Bearer $ADMIN_TOKEN" | jq '.mcp_servers'
# Step 2: Stop one MCP server (simulate failure)
# (Depends on your setup - could be docker stop, kill process, etc.)
docker stop mcp-server-test
# Step 3: Wait for health check cycle
sleep 30 # Or configured health check interval
# Step 4: Verify unhealthy server detected
RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
-H "Authorization: Bearer $ADMIN_TOKEN")
echo "$RESPONSE" | jq '.mcp_servers'
# Check unhealthy list includes stopped server
echo "$RESPONSE" | jq -e '.mcp_servers.unhealthy | length > 0' && \
echo "PASS: Unhealthy server detected" || echo "FAIL: Not detected"
# Step 5: Restart server
docker start mcp-server-test
Expected Result:
- Stopped server detected as unhealthy
- Other servers remain healthy
- Recovery detected after restart
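Step 4 passes as soon as the unhealthy list is non-empty, so it would also pass if a different server happened to be down. A stricter check, sketched below, looks for one specific name. It assumes the entries in `mcp_servers.unhealthy` are server names, as in the example response, and that the registered name is known to the tester (it may differ from the Docker container name); `server_flagged` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# server_flagged BODY NAME
# Succeeds if NAME appears in .mcp_servers.unhealthy of the JSON BODY.
server_flagged() {
  printf '%s' "$1" |
    jq -e --arg name "$2" 'any(.mcp_servers.unhealthy[]?; . == $name)' >/dev/null
}

# Example (the server name is an assumption for illustration):
#   RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
#     -H "Authorization: Bearer $ADMIN_TOKEN")
#   server_flagged "$RESPONSE" "server-xyz" && echo "PASS" || echo "FAIL: Not flagged"
```

Running the same check again after Step 5 and expecting it to fail would cover the "Recovery detected after restart" result.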
Section 4: Health Dashboard
| Case | UI Element | Action | Expected |
|---|---|---|---|
| HD-01 | Dashboard | Access | Shows all health |
| HD-02 | Status indicators | View | Color-coded |
| HD-03 | Dependency details | Click | Expanded info |
| HD-04 | Auto-refresh | Wait | Updates periodically |
HD-01: Health Dashboard Access
Preconditions:
- Admin UI enabled
- User logged in as admin
Steps:
1. Navigate to http://localhost:8080/admin/#health
2. Verify dashboard displays:
- Overall system status (healthy/degraded/unhealthy)
- Database connection status
- Redis connection status (if configured)
- MCP server count and health
- Federation peer status (if configured)
3. Verify color-coded indicators:
- Green = healthy
- Yellow = degraded
- Red = unhealthy
4. Verify refresh button works
Expected Result:
- Dashboard loads with all health info
- Status indicators color-coded
- All dependencies shown
- Manual refresh works
📊 Test Matrix
| Test Case | Liveness | Readiness | Dependencies | UI | Kubernetes |
|---|---|---|---|---|---|
| LP-01 | ✓ | ✓ | |||
| LP-02 | ✓ | | | | |
| LP-03 | ✓ | | | | |
| RP-01 | ✓ | ✓ | |||
| RP-02 | ✓ | ✓ | |||
| RP-03 | | ✓ | | | |
| RP-04 | ✓ | ✓ | |||
| DH-01 | | | ✓ | | |
| DH-02 | | | ✓ | | |
| DH-03 | | | ✓ | | |
| DH-04 | | | ✓ | | |
| HD-01 | | | | ✓ | |
| HD-02 | | | | ✓ | |
| HD-03 | | | | ✓ | |
| HD-04 | | | | ✓ | |
✅ Success Criteria
- All 15 test cases pass
- Liveness probe responds in <50ms
- Readiness reflects actual dependency status
- Database failures detected
- Redis failures detected (if configured)
- MCP server health tracked
- Health dashboard functional
- Kubernetes integration works
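Tracking the "all 15 test cases pass" criterion by eye is error-prone when results are scattered across terminal output. A small tally, sketched below, is one option. `record` and `summary` are hypothetical helpers, and each case is assumed to be wrapped in a command or function that returns zero on success.

```shell
#!/usr/bin/env bash
PASS_COUNT=0
FAIL_COUNT=0

# record CASE_ID CMD [ARGS...]
# Runs CMD and counts the outcome under CASE_ID.
record() {
  local id="$1"
  shift
  if "$@"; then
    PASS_COUNT=$((PASS_COUNT + 1))
    echo "PASS: $id"
  else
    FAIL_COUNT=$((FAIL_COUNT + 1))
    echo "FAIL: $id"
  fi
}

# Prints the totals and succeeds only if nothing failed.
summary() {
  echo "$PASS_COUNT passed, $FAIL_COUNT failed"
  [ "$FAIL_COUNT" -eq 0 ]
}

# Example:
#   record LP-01 test "$(curl -s -o /dev/null -w '%{http_code}' "$GATEWAY_URL/health/live")" = "200"
#   summary
```

Call `record` directly rather than inside `$(...)`, since a subshell would discard the counter updates.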
🔗 Related Files
- mcpgateway/routers/health.py
- mcpgateway/services/health_service.py
- mcpgateway/services/server_health_service.py
- mcpgateway/admin.py (health dashboard)
🔗 Related Issues
- Kubernetes deployment testing
- Load balancer integration