[TESTING][OPERATIONS]: Health Monitoring Manual Test Plan (Liveness, Readiness, Dependencies)

# 🏥 [TESTING][OPERATIONS]: Health Monitoring Manual Test Plan (Liveness, Readiness, Dependencies)

## Goal

Produce a **comprehensive manual test plan** for health monitoring including liveness probes, readiness probes, dependency health checks, and the health dashboard.

## Why Now?

Production deployments require reliable health monitoring:

1. **Kubernetes/Orchestration**: Health probes drive container lifecycle
2. **Load Balancing**: Readiness determines traffic routing
3. **Alerting**: Health status triggers operational alerts
4. **Dependencies**: External service health affects availability

---

## 📖 User Stories

<details>
<summary>US-1: Platform Operator - Liveness Monitoring</summary>

**As a** Platform Operator
**I want** a liveness probe endpoint
**So that** orchestrators can restart unhealthy instances

**Acceptance Criteria:**

```gherkin
Feature: Liveness Probe

  Scenario: Healthy instance returns 200
    Given the application is running normally
    When I call GET /health/live
    Then response should be 200 OK
    And body should indicate healthy status

  Scenario: Unhealthy instance returns 503
    Given the application has a critical failure
    When I call GET /health/live
    Then response should be 503 Service Unavailable
    And body should indicate unhealthy status

  Scenario: Liveness is lightweight
    Given the application is under load
    When liveness is checked frequently
    Then response should be fast (<50ms)
    And resource usage should be minimal
```

**Technical Requirements:**
- Endpoint: `/health/live` or `/healthz`
- No external dependency checks
- Fast response (<50ms)
- Returns 200 or 503
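
The 200-or-503 contract can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named `probe_verdict` (not part of the gateway) that maps the status code reported by `curl -w '%{http_code}'` to a verdict. `000` is what curl prints when it received no HTTP response at all.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code from a health probe to a verdict.
probe_verdict() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "unhealthy" ;;
    000) echo "unreachable" ;;   # curl reports 000 when it got no HTTP response
    *)   echo "unexpected" ;;    # any other code violates the 200-or-503 contract
  esac
}

# Example usage against a running gateway:
#   code=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/live")
#   probe_verdict "$code"
```

Treating every other code as `unexpected` keeps a stray 404 or 500 from being mistaken for a valid unhealthy response.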

</details>

<details>
<summary>US-2: Load Balancer - Readiness Monitoring</summary>

**As a** Load Balancer
**I want** a readiness probe endpoint
**So that** traffic routes only to ready instances

**Acceptance Criteria:**

```gherkin
Feature: Readiness Probe

  Scenario: Ready instance returns 200
    Given the application is fully initialized
    And all dependencies are healthy
    When I call GET /health/ready
    Then response should be 200 OK

  Scenario: Initializing instance returns 503
    Given the application is still starting
    When I call GET /health/ready
    Then response should be 503 Service Unavailable

  Scenario: Dependency failure affects readiness
    Given the database is unreachable
    When I call GET /health/ready
    Then response should be 503 Service Unavailable
    And body should indicate which dependency failed
```

</details>

<details>
<summary>US-3: Operations Engineer - Dependency Health</summary>

**As an** Operations Engineer
**I want** detailed dependency health checks
**So that** I can diagnose connectivity issues

**Acceptance Criteria:**

```gherkin
Feature: Dependency Health

  Scenario: Database health check
    Given database connection is configured
    When I call GET /health/dependencies
    Then database health should be reported
    And connection pool status should be shown

  Scenario: Redis health check
    Given Redis is configured
    When I call GET /health/dependencies
    Then Redis connectivity should be reported
    And latency should be measured

  Scenario: MCP server health check
    Given MCP servers are registered
    When I call GET /health/dependencies
    Then each server's health should be reported
    And unreachable servers should be flagged
```

</details>

---

## 🏗 Architecture

### Health Check Flow

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                          HEALTH CHECK ARCHITECTURE                          │
└─────────────────────────────────────────────────────────────────────────────┘

  ORCHESTRATOR/LB                GATEWAY              DEPENDENCIES
 ─────────────────               ───────              ────────────

  ┌──────────────┐           ┌─────────────┐
  │  Kubernetes  │──────────▶│ /health/live│  (Liveness)
  │   kubelet    │           │             │──▶ Internal state only
  └──────────────┘           └─────────────┘    (fast, no deps)

  ┌──────────────┐           ┌─────────────┐         ┌─────────────┐
  │     Load     │──────────▶│/health/ready│────────▶│  Database   │
  │   Balancer   │           │             │         └─────────────┘
  └──────────────┘           │ (Readiness) │         ┌─────────────┐
                             │             │────────▶│    Redis    │
                             └─────────────┘         └─────────────┘

  ┌──────────────┐           ┌─────────────┐         ┌─────────────┐
  │  Monitoring  │──────────▶│/health/deps │────────▶│ MCP Servers │
  │  Dashboard   │           │             │         ├─────────────┤
  └──────────────┘           │ (Detailed)  │────────▶│ A2A Agents  │
                             │             │         ├─────────────┤
                             └─────────────┘         │ Federation  │
                                                     └─────────────┘
```

### Health Response Structure

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                           HEALTH RESPONSE FORMAT                            │
└─────────────────────────────────────────────────────────────────────────────┘

  LIVENESS (/health/live):
  ┌────────────────────────────────────────────┐
  │  {                                         │
  │    "status": "healthy",                    │
  │    "timestamp": "2024-01-15T10:30:00Z"     │
  │  }                                         │
  └────────────────────────────────────────────┘

  READINESS (/health/ready):
  ┌────────────────────────────────────────────┐
  │  {                                         │
  │    "status": "ready",                      │
  │    "checks": {                             │
  │      "database": "ok",                     │
  │      "redis": "ok"                         │
  │    }                                       │
  │  }                                         │
  └────────────────────────────────────────────┘

  DETAILED (/health/dependencies):
  ┌────────────────────────────────────────────┐
  │  {                                         │
  │    "database": {                           │
  │      "status": "healthy",                  │
  │      "latency_ms": 2,                      │
  │      "pool_size": 10,                      │
  │      "pool_available": 8                   │
  │    },                                      │
  │    "redis": {                              │
  │      "status": "healthy",                  │
  │      "latency_ms": 1                       │
  │    },                                      │
  │    "mcp_servers": {                        │
  │      "total": 5,                           │
  │      "healthy": 4,                         │
  │      "unhealthy": ["server-xyz"]           │
  │    }                                       │
  │  }                                         │
  └────────────────────────────────────────────┘
```

---

## 📋 Test Environment Setup

### Prerequisites

```bash
export GATEWAY_URL="http://localhost:8000"
export DATABASE_URL="postgresql://..."
export REDIS_URL="redis://localhost:6379"

# Ensure all dependencies are running
docker-compose up -d postgres redis

# Start gateway
make dev
```
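
The test cases in this plan also reference `$TOKEN` and `$ADMIN_TOKEN`, and they rely on `curl`, `jq`, and `bc`. The following is a minimal preflight sketch that reports anything missing before a run starts. The helper names `require_vars` and `require_tools` are hypothetical and chosen for illustration only.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; the names are illustrative, not part of the gateway.

# Return non-zero if any named environment variable is unset or empty.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "MISSING variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Return non-zero if any named command is not on PATH.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" > /dev/null 2>&1; then
      echo "MISSING tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   require_vars GATEWAY_URL TOKEN ADMIN_TOKEN && require_tools curl jq bc docker-compose
```

Both helpers report every missing item before returning, so one run is enough to see everything that still needs to be set up.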

---

## 🧪 Manual Test Cases

### Section 1: Liveness Probe

| Case | Scenario | Condition | Expected | Validation |
|------|----------|-----------|----------|------------|
| LP-01 | Healthy instance | Normal operation | 200 OK | Response body |
| LP-02 | Response time | Under load | <50ms | Timing |
| LP-03 | No external deps | Network isolated | Still 200 | Independence |

<details>
<summary>LP-01: Healthy Instance Returns 200</summary>

**Preconditions:**
- Gateway is running
- Application fully started

**Steps:**

```bash
# Step 1: Check liveness endpoint
RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/live")
HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
BODY=$(echo "$RESPONSE" | sed '$d')  # drop the status line, keep the full (possibly multi-line) body

echo "HTTP Code: $HTTP_CODE"
echo "Body: $BODY"

# Step 2: Verify 200 status
[ "$HTTP_CODE" = "200" ] && echo "PASS: Status 200" || echo "FAIL: Status $HTTP_CODE"

# Step 3: Verify response body
echo "$BODY" | jq -e '.status == "healthy"' > /dev/null && \
 echo "PASS: Status healthy" || echo "FAIL: Status not healthy"
```

**Expected Result:**
- HTTP 200 returned
- Body contains healthy status
- Response is valid JSON

</details>

<details>
<summary>LP-02: Response Time Under Load</summary>

**Preconditions:**
- Gateway under simulated load

**Steps:**

```bash
# Step 1: Generate background load (assumes $TOKEN holds a valid API token)
for i in {1..100}; do
  curl -s -o /dev/null "$GATEWAY_URL/api/tools" -H "Authorization: Bearer $TOKEN" &
done

# Step 2: Measure liveness response time
for i in {1..10}; do
  TIME=$(curl -s -o /dev/null -w "%{time_total}" "$GATEWAY_URL/health/live")
  TIME_MS=$(echo "$TIME * 1000" | bc)
  echo "Request $i: ${TIME_MS}ms"

  # Verify under 50ms
  [ "$(echo "$TIME_MS < 50" | bc)" -eq 1 ] && echo "PASS" || echo "FAIL: Too slow"
done

# Step 3: Wait for background requests
wait
```

**Expected Result:**
- All responses under 50ms
- Consistent response times
- No degradation under load
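
"Consistent response times" is easier to judge from a summary than from ten separate lines. The following is a minimal sketch, assuming a hypothetical helper `latency_summary`. It reads curl `%{time_total}` values (seconds, one per line) on stdin and prints the sample count, mean, and maximum in milliseconds.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarize curl time_total samples (seconds, one per line).
latency_summary() {
  awk '
    NF {
      ms = $1 * 1000          # convert seconds to milliseconds
      sum += ms
      if (ms > max) max = ms
      n++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "n=%d mean=%.1fms max=%.1fms\n", n, sum / n, max
    }
  '
}

# Example usage:
#   for i in {1..10}; do
#     curl -s -o /dev/null -w "%{time_total}\n" "$GATEWAY_URL/health/live"
#   done | latency_summary
```

A maximum well above the mean points to occasional slow responses even when the mean stays under 50ms.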

</details>
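
LP-03 is listed in the table above without detailed steps. One possible procedure, offered as a sketch and not as the plan's official steps, is to stop the dependencies and confirm that liveness still answers 200. It assumes the `postgres` and `redis` Compose services from the prerequisites. The function names `liveness_code` and `lp03_check` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch for LP-03; function names are hypothetical.

# Print the liveness HTTP code (curl prints 000 if the gateway itself is unreachable).
# "|| true" keeps a failed connection from aborting scripts that run with set -e.
liveness_code() {
  curl -s -o /dev/null -m 5 -w "%{http_code}" "$GATEWAY_URL/health/live" || true
}

# Compare the observed code with the 200 that LP-03 expects.
lp03_check() {
  local code
  code=$(liveness_code)
  if [ "$code" = "200" ]; then
    echo "PASS: liveness independent of dependencies"
  else
    echo "FAIL: liveness returned $code with dependencies down"
    return 1
  fi
}

# Example run (stops and restarts local containers, so it is left commented out):
#   docker-compose stop postgres redis
#   sleep 5
#   lp03_check
#   docker-compose start postgres redis
```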

### Section 2: Readiness Probe

| Case | Scenario | Condition | Expected | Validation |
|------|----------|-----------|----------|------------|
| RP-01 | Ready instance | All deps healthy | 200 OK | Response body |
| RP-02 | Database down | DB unreachable | 503 | Error details |
| RP-03 | Redis down | Redis unreachable | 503 or 200 | Depends on config |
| RP-04 | Startup phase | Still initializing | 503 | Not ready |

<details>
<summary>RP-01: Ready Instance Returns 200</summary>

**Preconditions:**
- Gateway running
- All dependencies healthy

**Steps:**

```bash
# Step 1: Check readiness endpoint
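RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/ready")
HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
BODY=$(echo "$RESPONSE" | sed '$d')  # drop the status line, keep the full body

echo "HTTP Code: $HTTP_CODE"
echo "Body:"
echo "$BODY" | jq .  # pipe only the JSON to jq; a "Body: " prefix would break parsing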

# Step 2: Verify all checks pass
echo "$BODY" | jq -e '.checks | to_entries | all(.value == "ok")' && \
 echo "PASS: All checks OK" || echo "FAIL: Some checks failed"
```

**Expected Result:**
- HTTP 200 returned
- All dependency checks pass
- Response includes check details

</details>

<details>
<summary>RP-02: Database Down Returns 503</summary>

**Preconditions:**
- Gateway running
- Database container can be stopped

**Steps:**

```bash
# Step 1: Stop database
docker-compose stop postgres

# Step 2: Check readiness (wait for detection)
sleep 5
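RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY_URL/health/ready")
HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
BODY=$(echo "$RESPONSE" | sed '$d')  # drop the status line, keep the full body

echo "HTTP Code: $HTTP_CODE"
echo "Body:"
echo "$BODY" | jq .  # pipe only the JSON to jq; a "Body: " prefix would break parsing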

# Step 3: Verify 503 and database failure indicated
[ "$HTTP_CODE" = "503" ] && echo "PASS: 503 returned" || echo "FAIL: Expected 503"
echo "$BODY" | jq -e '.checks.database != "ok"' && \
 echo "PASS: Database failure detected" || echo "FAIL: Database issue not detected"

# Step 4: Restart database
docker-compose start postgres
sleep 5

# Step 5: Verify recovery
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/health/ready")
[ "$HTTP_CODE" = "200" ] && echo "PASS: Recovered" || echo "FAIL: Not recovered"
```

**Expected Result:**
- 503 returned when database down
- Response indicates database failure
- Recovery when database restored

</details>
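
The fixed `sleep 5` calls in RP-02 can be flaky on slow machines, and RP-04 needs to observe the 503-to-200 transition during startup. The following is a minimal polling sketch that retries until the expected code appears or a timeout expires. The helper names `ready_code` and `wait_for_status` and their defaults are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names and defaults are illustrative.

# Print the readiness HTTP code (000 if the gateway is unreachable).
ready_code() {
  curl -s -o /dev/null -m 5 -w "%{http_code}" "$GATEWAY_URL/health/ready" || true
}

# wait_for_status EXPECTED [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Poll ready_code until it prints EXPECTED; return 1 once the timeout is reached.
# Both durations must be whole seconds because shell arithmetic is integer-only.
wait_for_status() {
  local expected=$1 timeout=${2:-30} interval=${3:-1} elapsed=0 code
  while :; do
    code=$(ready_code)
    if [ "$code" = "$expected" ]; then
      echo "PASS: got $expected after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "FAIL: still $code after ${timeout}s (expected $expected)"
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Example usage:
#   docker-compose stop postgres  && wait_for_status 503 30
#   docker-compose start postgres && wait_for_status 200 60
```

For RP-04, calling `wait_for_status 200 60` right after starting the gateway shows how long the instance stays not ready. The reported time counts only the sleep intervals, so it is approximate.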

### Section 3: Dependency Health

| Case | Scenario | Dependency | Expected | Validation |
|------|----------|------------|----------|------------|
| DH-01 | All healthy | All deps | Full status | All green |
| DH-02 | Database latency | Slow DB | Latency reported | Timing shown |
| DH-03 | MCP server down | One server | Partial healthy | Server flagged |
| DH-04 | Connection pool | High usage | Pool stats | Available count |

<details>
<summary>DH-01: All Dependencies Healthy</summary>

**Preconditions:**
- All dependencies running
- MCP servers registered

**Steps:**

```bash
# Step 1: Call dependency health endpoint
curl -s "$GATEWAY_URL/health/dependencies" \
 -H "Authorization: Bearer $ADMIN_TOKEN" | jq .

# Step 2: Verify all components reported
RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
 -H "Authorization: Bearer $ADMIN_TOKEN")

# Check database
echo "$RESPONSE" | jq -e '.database.status == "healthy"' && \
 echo "PASS: Database healthy" || echo "FAIL: Database unhealthy"

# Check Redis (if configured)
echo "$RESPONSE" | jq -e '.redis.status == "healthy"' && \
 echo "PASS: Redis healthy" || echo "WARN: Redis not configured/healthy"

# Check MCP servers
TOTAL=$(echo "$RESPONSE" | jq '.mcp_servers.total')
HEALTHY=$(echo "$RESPONSE" | jq '.mcp_servers.healthy')
echo "MCP Servers: $HEALTHY/$TOTAL healthy"
```

**Expected Result:**
- All dependencies report healthy
- Latency metrics included
- Complete status overview

</details>

<details>
<summary>DH-03: MCP Server Down Detected</summary>

**Preconditions:**
- Multiple MCP servers registered
- One server can be stopped

**Steps:**

```bash
# Step 1: Check current server health
curl -s "$GATEWAY_URL/health/dependencies" \
 -H "Authorization: Bearer $ADMIN_TOKEN" | jq '.mcp_servers'

# Step 2: Stop one MCP server (simulate failure)
# (Depends on your setup - could be docker stop, kill process, etc.)
docker stop mcp-server-test

# Step 3: Wait for health check cycle
sleep 30 # Or configured health check interval

# Step 4: Verify unhealthy server detected
RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
 -H "Authorization: Bearer $ADMIN_TOKEN")

echo "$RESPONSE" | jq '.mcp_servers'

# Check unhealthy list includes stopped server
echo "$RESPONSE" | jq -e '.mcp_servers.unhealthy | length > 0' && \
 echo "PASS: Unhealthy server detected" || echo "FAIL: Not detected"

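# Step 5: Restart server
docker start mcp-server-test

# Step 6: Verify recovery after the next health check cycle
sleep 30 # Or configured health check interval
curl -s "$GATEWAY_URL/health/dependencies" \
  -H "Authorization: Bearer $ADMIN_TOKEN" | \
  jq -e '.mcp_servers.unhealthy | length == 0' > /dev/null && \
  echo "PASS: Recovery detected" || echo "FAIL: Not recovered"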
```

**Expected Result:**
- Stopped server detected as unhealthy
- Other servers remain healthy
- Recovery detected after restart

</details>
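
DH-04 is listed in the table above without detailed steps. The following sketch shows one way to turn the pool figures into a pass/fail signal. The field names `pool_size` and `pool_available` come from the response format shown earlier, while the helper `pool_usage_pct` and the 80% threshold are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: percentage of the connection pool currently in use.
# pool_usage_pct POOL_SIZE POOL_AVAILABLE  (integer arithmetic, rounds down)
pool_usage_pct() {
  local size=$1 available=$2
  if [ "$size" -le 0 ]; then
    echo "invalid pool size: $size" >&2
    return 1
  fi
  echo $(( (size - available) * 100 / size ))
}

# Example usage against a running gateway (80 is an illustrative threshold):
#   RESPONSE=$(curl -s "$GATEWAY_URL/health/dependencies" \
#     -H "Authorization: Bearer $ADMIN_TOKEN")
#   SIZE=$(echo "$RESPONSE" | jq '.database.pool_size')
#   AVAILABLE=$(echo "$RESPONSE" | jq '.database.pool_available')
#   USED=$(pool_usage_pct "$SIZE" "$AVAILABLE")
#   [ "$USED" -lt 80 ] && echo "PASS: pool ${USED}% used" || echo "WARN: pool ${USED}% used"
```

With the sample values from the response format (`pool_size` 10, `pool_available` 8), the helper prints 20.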

### Section 4: Health Dashboard

| Case | UI Element | Action | Expected |
|------|------------|--------|----------|
| HD-01 | Dashboard | Access | Shows all health |
| HD-02 | Status indicators | View | Color-coded |
| HD-03 | Dependency details | Click | Expanded info |
| HD-04 | Auto-refresh | Wait | Updates periodically |

<details>
<summary>HD-01: Health Dashboard Access</summary>

**Preconditions:**
- Admin UI enabled
- User logged in as admin

**Steps:**

```
1. Navigate to http://localhost:8080/admin/#health
2. Verify dashboard displays:
   - Overall system status (healthy/degraded/unhealthy)
   - Database connection status
   - Redis connection status (if configured)
   - MCP server count and health
   - Federation peer status (if configured)
3. Verify color-coded indicators:
   - Green = healthy
   - Yellow = degraded
   - Red = unhealthy
4. Verify refresh button works
```

**Expected Result:**
- Dashboard loads with all health info
- Status indicators color-coded
- All dependencies shown
- Manual refresh works

</details>

---

## 📊 Test Matrix

| Test Case | Liveness | Readiness | Dependencies | UI | Kubernetes |
|-----------|----------|-----------|--------------|-----|------------|
| LP-01 | ✓ | | | | ✓ |
| LP-02 | ✓ | | | | |
| LP-03 | ✓ | | | | |
| RP-01 | | ✓ | | | ✓ |
| RP-02 | | ✓ | | | ✓ |
| RP-03 | | ✓ | | | |
| RP-04 | | ✓ | | | ✓ |
| DH-01 | | | ✓ | | |
| DH-02 | | | ✓ | | |
| DH-03 | | | ✓ | | |
| DH-04 | | | ✓ | | |
| HD-01 | | | | ✓ | |
| HD-02 | | | | ✓ | |
| HD-03 | | | | ✓ | |
| HD-04 | | | | ✓ | |

---

## ✅ Success Criteria

- [ ] All 15 test cases pass
- [ ] Liveness probe responds in <50ms
- [ ] Readiness reflects actual dependency status
- [ ] Database failures detected
- [ ] Redis failures detected (if configured)
- [ ] MCP server health tracked
- [ ] Health dashboard functional
- [ ] Kubernetes integration works
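
The scripted steps print lines that start with `PASS` or `FAIL`, so a run can be summarized from its captured output. The following is a minimal sketch, assuming a hypothetical helper `tally_results` and a log file name chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count PASS/FAIL lines on stdin; non-zero exit if any FAIL.
tally_results() {
  awk '
    /^PASS/ { pass++ }
    /^FAIL/ { fail++ }
    END {
      printf "passed=%d failed=%d\n", pass, fail
      exit (fail > 0)
    }
  '
}

# Example usage (health-tests.log is an illustrative file name):
#   tally_results < health-tests.log
```

Lines that start with anything else, such as `WARN`, are ignored, so optional checks like Redis do not affect the exit status. Dashboard cases (HD-01 to HD-04) are checked by eye and still need to be ticked off by hand.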

---

## 🔗 Related Files

- `mcpgateway/routers/health.py`
- `mcpgateway/services/health_service.py`
- `mcpgateway/services/server_health_service.py`
- `mcpgateway/admin.py` (health dashboard)

---

## 🔗 Related Issues

- Kubernetes deployment testing
- Load balancer integration

[TESTING][OPERATIONS]: Health Monitoring Manual Test Plan (Liveness, Readiness, Dependencies) #2462
