
[TESTING][RESILIENCE]: Kubernetes Resilience Manual Test Plan (Pod Deletion, Node Failure, Rolling Updates) #2468


Goal

Produce a comprehensive manual test plan for Kubernetes resilience testing including pod lifecycle events, node failures, rolling deployments, and horizontal scaling behavior.

Why Now?

Kubernetes deployments require validated resilience:

  1. Zero-Downtime: Rolling updates shouldn't drop requests
  2. Self-Healing: Pod failures should trigger restarts
  3. Scaling: Horizontal scaling must work correctly
  4. Node Tolerance: Node failures shouldn't cause outages

📖 User Stories

US-1: Platform Operator - Pod Lifecycle

As a Platform Operator
I want pods to handle lifecycle events gracefully
So that deployments don't cause service disruptions

Acceptance Criteria:

Feature: Pod Lifecycle

  Scenario: Pod deletion with graceful shutdown
    Given a pod is running and serving traffic
    When the pod receives SIGTERM
    Then it should stop accepting new connections
    And finish processing existing requests
    And exit within terminationGracePeriodSeconds

  Scenario: Pod restart due to liveness failure
    Given a pod is in an unhealthy state
    When liveness probe fails
    Then Kubernetes should restart the pod
    And traffic should route to healthy pods meanwhile
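
The SIGTERM scenario above can be sketched as a toy shell process — a stand-in for the graceful-shutdown contract, assuming nothing about mcpgateway's actual handler: on SIGTERM, stop taking new work, finish in-flight work, exit cleanly within the grace period.

```shell
# Toy stand-in for the graceful-shutdown contract (NOT mcpgateway itself):
# on SIGTERM, finish in-flight work, then exit 0 well inside
# terminationGracePeriodSeconds.
server() {
  local draining=0
  trap 'draining=1' TERM
  while true; do
    sleep 0.1                      # stand-in for serving one request
    if [ "$draining" -eq 1 ]; then
      echo "drained"               # in-flight work finished
      return 0
    fi
  done
}

server & SRV=$!
sleep 0.3          # let it "serve" for a moment
kill -TERM $SRV    # what the kubelet sends before the grace period expires
wait $SRV
echo "exit=$?"     # prints "drained" then "exit=0"
```

A real readiness probe should start failing at the same moment, so the Service stops routing new connections to the draining pod.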
US-2: SRE - Rolling Deployments

As an SRE
I want rolling deployments to be zero-downtime
So that users don't experience service interruptions

Acceptance Criteria:

Feature: Rolling Deployments

  Scenario: Rolling update with no dropped requests
    Given a deployment with 3 replicas
    When I update the deployment image
    Then old pods should drain connections
    And new pods should be ready before old ones terminate
    And no requests should fail during rollout

🏗 Architecture

Kubernetes Deployment Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KUBERNETES DEPLOYMENT ARCHITECTURE                        │
└─────────────────────────────────────────────────────────────────────────────┘

    INGRESS                 SERVICE                  DEPLOYMENT
    ───────                 ───────                  ──────────

  ┌─────────────┐        ┌─────────────┐        ┌─────────────────────┐
  │   Ingress   │───────▶│   Service   │───────▶│    Pod 1            │
  │   Controller│        │  (ClusterIP)│        │  ┌───────────────┐  │
  └─────────────┘        │             │        │  │ mcpgateway    │  │
                         │   Selector: │        │  │ :8000         │  │
                         │   app=      │        │  └───────────────┘  │
                         │   mcpgateway│        │  Liveness: /health  │
                         │             │        │  Readiness: /ready  │
                         │             │        └─────────────────────┘
                         │             │
                         │             │        ┌─────────────────────┐
                         │             │───────▶│    Pod 2            │
                         │             │        │  ┌───────────────┐  │
                         │             │        │  │ mcpgateway    │  │
                         │             │        │  │ :8000         │  │
                         │             │        │  └───────────────┘  │
                         └─────────────┘        └─────────────────────┘


    ROLLING UPDATE SEQUENCE
    ───────────────────────

    Initial:  [Pod-v1] [Pod-v1] [Pod-v1]
                 │
    Step 1:   [Pod-v1] [Pod-v1] [Pod-v1] [Pod-v2 Starting]
                 │
    Step 2:   [Pod-v1] [Pod-v1] [Terminating] [Pod-v2 Ready]
                 │
    Step 3:   [Pod-v1] [Pod-v2 Ready] [Pod-v2 Starting]
                 │
    Step 4:   [Terminating] [Pod-v2 Ready] [Pod-v2 Ready]
                 │
    Final:    [Pod-v2] [Pod-v2] [Pod-v2]
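
The four-pod peak in Step 1 above falls out of the RollingUpdate parameters. Assuming the Kubernetes defaults of maxSurge=25% and maxUnavailable=25% (check the chart's deployment.yaml for the actual values), surge rounds up and unavailable rounds down:

```shell
# Pod-count bounds during a rolling update of replicas=3 with the default
# maxSurge=25% (rounds UP) and maxUnavailable=25% (rounds DOWN).
replicas=3
surge_pct=25
unavail_pct=25

max_surge=$(( (replicas * surge_pct + 99) / 100 ))    # ceil(0.75) -> 1
max_unavailable=$(( replicas * unavail_pct / 100 ))   # floor(0.75) -> 0

echo "max pods during rollout: $(( replicas + max_surge ))"            # 4
echo "min ready pods during rollout: $(( replicas - max_unavailable ))" # 3
```

This is why the sequence never drops below 3 ready pods: with maxUnavailable rounding to 0, an old pod may only terminate once a new one is Ready.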

📋 Test Environment Setup

Prerequisites

# Kubernetes cluster (kind, minikube, or real cluster)
kubectl cluster-info

# Deploy mcpgateway (charts/mcpgateway/ is a Helm chart, so install via helm)
helm upgrade --install mcpgateway charts/mcpgateway/

# Verify deployment
kubectl get pods -l app=mcpgateway
kubectl get svc mcpgateway

# Port forward for testing
kubectl port-forward svc/mcpgateway 8000:8000 &
export GATEWAY_URL="http://localhost:8000"
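
Some cases below (RD-01) call authenticated endpoints with a bearer token in `$TOKEN`, which the prerequisites above do not mint. How the token is obtained depends on your mcpgateway auth configuration — the export below is a placeholder, not a real command:

```shell
# RD-01 sends "Authorization: Bearer $TOKEN"; export a valid token first.
# Minting one depends on your mcpgateway auth setup -- placeholder value:
export TOKEN="<paste-a-valid-bearer-token>"

# Sanity check: should print 200 before you start the test cases
curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/api/tools" \
  -H "Authorization: Bearer $TOKEN"
```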

🧪 Manual Test Cases

Section 1: Pod Lifecycle

| Case  | Scenario          | Action              | Expected        | Validation      |
|-------|-------------------|---------------------|-----------------|-----------------|
| PL-01 | Pod deletion      | `kubectl delete`    | Graceful drain  | No 502s         |
| PL-02 | Liveness failure  | Simulate unhealthy  | Pod restarted   | RestartCount++  |
| PL-03 | Readiness failure | Simulate not ready  | Removed from LB | No traffic      |
| PL-04 | OOM kill          | Memory pressure     | Pod restarted   | OOMKilled reason|

PL-01: Pod Deletion with Graceful Shutdown

Preconditions:

  • Multiple pods running
  • Load generator ready

Steps:

# Step 1: Start continuous load (truncate any log left over from earlier runs)
: > /tmp/responses.log
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health" >> /tmp/responses.log
  sleep 0.1
done &
LOAD_PID=$!

# Step 2: Get pod name
POD=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].metadata.name}')
echo "Deleting pod: $POD"

# Step 3: Delete pod
kubectl delete pod $POD

# Step 4: Wait for new pod
kubectl wait --for=condition=ready pod -l app=mcpgateway --timeout=60s

# Step 5: Stop load generator
kill $LOAD_PID

# Step 6: Analyze responses
echo "Total requests: $(wc -l < /tmp/responses.log)"
echo "Failed requests: $(grep -cv '^200$' /tmp/responses.log)"
grep -v '^200$' /tmp/responses.log | sort | uniq -c
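
The counting in Step 6 is worth doing strictly (anchored match plus an error-rate percentage). Shown here against a synthetic log so it runs standalone — point it at /tmp/responses.log in the real test:

```shell
# Strict analysis of a one-status-code-per-line log. Synthetic sample data
# stands in for /tmp/responses.log so this can be run anywhere.
LOG=$(mktemp)
printf '200\n200\n503\n200\n502\n200\n' > "$LOG"

total=$(wc -l < "$LOG")
failed=$(grep -cv '^200$' "$LOG")
echo "total=$total failed=$failed"
awk -v t="$total" -v f="$failed" 'BEGIN { printf "error_rate=%.1f%%\n", 100*f/t }'
grep -v '^200$' "$LOG" | sort | uniq -c   # breakdown of failure codes

rm -f "$LOG"
```

Anchoring the pattern (`^200$`) avoids accidentally matching "200" inside another code or a stray log fragment.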

Expected Result:

  • Pod terminates gracefully
  • No 502/503 errors during termination
  • New pod starts and becomes ready
  • Traffic continues uninterrupted

PL-02: Liveness Probe Failure

Preconditions:

  • Pod with liveness probe configured
  • Ability to simulate unhealthy state

Steps:

# Step 1: Check current restart count
kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'

# Step 2: Simulate a failure
# Option A: Kill PID 1 in the container. Note: this exercises self-healing
# via a crash restart rather than the liveness probe path itself.
POD=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -- kill 1

# Option B: If the app supports it, make the liveness endpoint report
# unhealthy via an API -- this tests the probe path directly.

# Step 3: Watch pod status (Ctrl+C to stop, or run in a second terminal)
kubectl get pods -l app=mcpgateway -w

# Step 4: Wait for restart
sleep 30

# Step 5: Verify restart count increased
kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'

# Step 6: Verify pod is healthy again
kubectl get pods -l app=mcpgateway
curl -s "$GATEWAY_URL/health" | jq .

Expected Result:

  • Liveness probe detects failure
  • Kubernetes restarts the pod
  • Restart count increases
  • Pod becomes healthy after restart

Section 2: Rolling Deployments

| Case  | Scenario      | Strategy         | Expected         | Validation     |
|-------|---------------|------------------|------------------|----------------|
| RD-01 | Image update  | RollingUpdate    | Zero downtime    | No errors      |
| RD-02 | Config change | RollingUpdate    | Gradual rollout  | Config applied |
| RD-03 | Rollback      | `kubectl rollout`| Previous version | Quick rollback |

RD-01: Rolling Update Zero Downtime

Preconditions:

  • Deployment with 3+ replicas
  • RollingUpdate strategy configured

Steps:

# Step 1: Start continuous load in background
while true; do
  START=$(date +%s%N)
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/api/tools" \
    -H "Authorization: Bearer $TOKEN")
  END=$(date +%s%N)
  LATENCY=$(( ($END - $START) / 1000000 ))
  echo "$(date +%H:%M:%S) $HTTP_CODE ${LATENCY}ms"
  sleep 0.2
done > /tmp/rolling-update.log &
LOAD_PID=$!

# Step 2: Trigger rolling update (change image or env var)
kubectl set env deployment/mcpgateway ROLLING_UPDATE_TEST=$(date +%s)

# Step 3: Watch rollout
kubectl rollout status deployment/mcpgateway

# Step 4: Stop load generator
kill $LOAD_PID

# Step 5: Analyze results
echo "Total requests: $(wc -l < /tmp/rolling-update.log)"
echo "Failed requests: $(grep -v " 200 " /tmp/rolling-update.log | wc -l)"
echo "Max latency: $(awk '{print $3}' /tmp/rolling-update.log | sort -n | tail -1)"

# Step 6: Show any errors
grep -v " 200 " /tmp/rolling-update.log | head -10
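
Beyond max latency, the same log yields percentiles. A sketch using the nearest-rank method, shown against synthetic lines in the "HH:MM:SS CODE NNNms" format the loop above writes:

```shell
# p95 latency from lines shaped like "12:00:01 200 45ms" (field 3 = latency).
# Synthetic sample stands in for /tmp/rolling-update.log.
LOG=$(mktemp)
printf '12:00:00 200 40ms\n12:00:01 200 45ms\n12:00:02 200 300ms\n12:00:03 200 50ms\n' > "$LOG"

p95=$(awk '{ sub(/ms$/, "", $3); print $3 }' "$LOG" \
  | sort -n \
  | awk '{ v[NR] = $1 }
         END { idx = int(NR * 0.95); if (idx < NR * 0.95) idx++;
               if (idx < 1) idx = 1; print v[idx] "ms" }')
echo "p95=$p95"

rm -f "$LOG"
```

A p95 that spikes during the rollout while all codes stay 200 matches the "latency may spike briefly but no errors" expectation below.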

Expected Result:

  • All requests return 200
  • No dropped connections
  • Latency may spike briefly but no errors
  • Rollout completes successfully

RD-03: Rollback

Preconditions:

  • Previous deployment revision exists

Steps:

# Step 1: Check rollout history
kubectl rollout history deployment/mcpgateway

# Step 2: Start load generator
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health"
  sleep 0.1
done > /tmp/rollback.log &
LOAD_PID=$!

# Step 3: Trigger rollback
kubectl rollout undo deployment/mcpgateway

# Step 4: Wait for rollback
kubectl rollout status deployment/mcpgateway

# Step 5: Stop load
kill $LOAD_PID

# Step 6: Verify no errors (count of non-200 responses; expect 0)
grep -cv '^200$' /tmp/rollback.log

Expected Result:

  • Rollback completes quickly
  • No service interruption
  • Previous version restored

Section 3: Node Failure

| Case  | Scenario      | Trigger          | Expected             | Validation    |
|-------|---------------|------------------|----------------------|---------------|
| NF-01 | Node drain    | `kubectl drain`  | Pods rescheduled     | No downtime   |
| NF-02 | Node cordoned | `kubectl cordon` | No new pods          | Existing work |
| NF-03 | Node crash    | Kill node        | Pods reschedule      | Recovery time |

NF-01: Node Drain

Preconditions:

  • Multi-node cluster
  • Pods distributed across nodes
  • PodDisruptionBudget configured

Steps:

# Step 1: Identify node with gateway pods
NODE=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].spec.nodeName}')
echo "Draining node: $NODE"

# Step 2: Start load generator
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health"
  sleep 0.1
done > /tmp/drain.log &
LOAD_PID=$!

# Step 3: Drain the node
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data

# Step 4: Watch pods reschedule in the background
kubectl get pods -l app=mcpgateway -o wide -w &
WATCH_PID=$!

# Step 5: Wait for reschedule
sleep 30

# Step 6: Stop load and watch
kill $LOAD_PID $WATCH_PID

# Step 7: Analyze (count of non-200 responses)
grep -cv '^200$' /tmp/drain.log

# Step 8: Uncordon node
kubectl uncordon $NODE

Expected Result:

  • Pods evicted gracefully
  • Rescheduled to other nodes
  • PDB respected (min available)
  • Minimal or no errors during drain
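
NF-01 leans on the PodDisruptionBudget in charts/mcpgateway/templates/pdb.yaml: the drain may only evict as many pods as the budget allows. The arithmetic, assuming minAvailable is an absolute count (the chart's actual setting may differ):

```shell
# Eviction headroom under a PDB: a drain may evict at most
# (healthy - minAvailable) pods at a time. Illustrative values only.
healthy=3
min_available=2

allowed_disruptions=$(( healthy - min_available ))
echo "allowed disruptions: $allowed_disruptions"
# With 0 allowed, "kubectl drain" blocks and retries until another pod
# becomes Ready elsewhere -- that is the "PDB respected" behavior
# NF-01 validates.
```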

Section 4: Horizontal Scaling

| Case  | Scenario    | Trigger    | Expected             | Validation          |
|-------|-------------|------------|----------------------|---------------------|
| HS-01 | Scale up    | Manual/HPA | New pods ready       | Traffic distributed |
| HS-02 | Scale down  | Manual/HPA | Graceful termination | No errors           |
| HS-03 | HPA trigger | CPU load   | Auto scale           | Metrics accurate    |

HS-01: Scale Up

Preconditions:

  • Deployment with replicas=1

Steps:

# Step 1: Check current replicas
kubectl get deployment mcpgateway -o jsonpath='{.spec.replicas}'

# Step 2: Scale up
kubectl scale deployment mcpgateway --replicas=5

# Step 3: Watch pods come up
kubectl get pods -l app=mcpgateway -w

# Step 4: Wait for all ready
kubectl wait --for=condition=ready pod -l app=mcpgateway --timeout=120s

# Step 5: Verify traffic distribution
for i in {1..20}; do
  curl -s "$GATEWAY_URL/health" | jq -r '.hostname // empty'
done | sort | uniq -c

# Step 6: Verify all pods serving traffic

Expected Result:

  • New pods start quickly
  • Become ready and receive traffic
  • Load distributed across all pods
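
The `sort | uniq -c` output from Step 5 can be checked mechanically: every replica should appear at least once. A sketch against synthetic hostnames (three pods here; set `expected_pods` to the real replica count):

```shell
# Balance check over a list of responding hostnames, one per line.
# Synthetic data stands in for the real "$GATEWAY_URL/health" responses.
HOSTS=$(mktemp)
printf 'pod-a\npod-b\npod-c\npod-a\npod-b\npod-c\n' > "$HOSTS"

expected_pods=3
seen=$(sort -u "$HOSTS" | wc -l)
echo "distinct pods seen: $seen / $expected_pods"
if [ "$seen" -eq "$expected_pods" ]; then
  echo "load reached every pod"
else
  echo "WARNING: some pods received no traffic"
fi

rm -f "$HOSTS"
```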

📊 Test Matrix

| Test Case | Lifecycle | Rolling | Node | Scaling | GKE | EKS | AKS |
|-----------|:---------:|:-------:|:----:|:-------:|:---:|:---:|:---:|
| PL-01     | ✓         |         |      |         |     |     |     |
| PL-02     | ✓         |         |      |         |     |     |     |
| PL-03     | ✓         |         |      |         |     |     |     |
| PL-04     | ✓         |         |      |         |     |     |     |
| RD-01     |           | ✓       |      |         |     |     |     |
| RD-02     |           | ✓       |      |         |     |     |     |
| RD-03     |           | ✓       |      |         |     |     |     |
| NF-01     |           |         | ✓    |         |     |     |     |
| NF-02     |           |         | ✓    |         |     |     |     |
| NF-03     |           |         | ✓    |         |     |     |     |
| HS-01     |           |         |      | ✓       |     |     |     |
| HS-02     |           |         |      | ✓       |     |     |     |
| HS-03     |           |         |      | ✓       |     |     |     |

(GKE/EKS/AKS columns are left blank to record pass/fail per environment.)

✅ Success Criteria

  • All 13 test cases pass
  • Pod deletion is graceful (no dropped requests)
  • Liveness/readiness probes work correctly
  • Rolling updates are zero-downtime
  • Rollback works quickly
  • Node drain respects PDB
  • Scaling works correctly
  • HPA triggers appropriately

🔗 Related Files

  • charts/mcpgateway/ - Helm chart
  • charts/mcpgateway/templates/deployment.yaml
  • charts/mcpgateway/templates/pdb.yaml

🔗 Related Issues

Labels

  • SHOULD
  • P2: Important but not vital; high-value items that are not crucial for the immediate release
  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • manual-testing: Manual testing / test planning issues
  • ready: Validated, ready-to-work-on items
  • testing: Testing (unit, e2e, manual, automated, etc)
