[TESTING][RESILIENCE]: Kubernetes Resilience Manual Test Plan (Pod Deletion, Node Failure, Rolling Updates) #2468
Labels
SHOULD · P2: Important but not vital; high-value items that are not crucial for the immediate release · chore: Linting, formatting, dependency hygiene, or project maintenance chores · manual-testing: Manual testing / test planning issues · ready: Validated, ready-to-work-on items · testing: Testing (unit, e2e, manual, automated, etc)
Description
Goal
Produce a comprehensive manual test plan for Kubernetes resilience testing including pod lifecycle events, node failures, rolling deployments, and horizontal scaling behavior.
Why Now?
Kubernetes deployments require validated resilience:
- Zero-Downtime: Rolling updates shouldn't drop requests
- Self-Healing: Pod failures should trigger restarts
- Scaling: Horizontal scaling must work correctly
- Node Tolerance: Node failures shouldn't cause outages
📖 User Stories
US-1: Platform Operator - Pod Lifecycle
As a Platform Operator
I want pods to handle lifecycle events gracefully
So that deployments don't cause service disruptions
Acceptance Criteria:
Feature: Pod Lifecycle
Scenario: Pod deletion with graceful shutdown
Given a pod is running and serving traffic
When the pod receives SIGTERM
Then it should stop accepting new connections
And finish processing existing requests
And exit within terminationGracePeriodSeconds
Scenario: Pod restart due to liveness failure
Given a pod is in an unhealthy state
When liveness probe fails
Then Kubernetes should restart the pod
And traffic should route to healthy pods in the meantime

US-2: SRE - Rolling Deployments
As an SRE
I want rolling deployments to be zero-downtime
So that users don't experience service interruptions
Acceptance Criteria:
Feature: Rolling Deployments
Scenario: Rolling update with no dropped requests
Given a deployment with 3 replicas
When I update the deployment image
Then old pods should drain connections
And new pods should be ready before old ones terminate
And no requests should fail during rollout

🏗 Architecture
Kubernetes Deployment Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ KUBERNETES DEPLOYMENT ARCHITECTURE │
└─────────────────────────────────────────────────────────────────────────────┘
INGRESS SERVICE DEPLOYMENT
─────── ─────── ──────────
┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐
│ Ingress │───────▶│ Service │───────▶│ Pod 1 │
│ Controller│ │ (ClusterIP)│ │ ┌───────────────┐ │
└─────────────┘ │ │ │ │ mcpgateway │ │
│ Selector: │ │ │ :8000 │ │
│ app= │ │ └───────────────┘ │
│ mcpgateway│ │ Liveness: /health │
│ │ │ Readiness: /ready │
│ │ └─────────────────────┘
│ │
│ │ ┌─────────────────────┐
│ │───────▶│ Pod 2 │
│ │ │ ┌───────────────┐ │
│ │ │ │ mcpgateway │ │
│ │ │ │ :8000 │ │
│ │ │ └───────────────┘ │
└─────────────┘ └─────────────────────┘
ROLLING UPDATE SEQUENCE
───────────────────────
Initial: [Pod-v1] [Pod-v1] [Pod-v1]
│
Step 1: [Pod-v1] [Pod-v1] [Pod-v1] [Pod-v2 Starting]
│
Step 2: [Pod-v1] [Pod-v1] [Terminating] [Pod-v2 Ready]
│
Step 3: [Pod-v1] [Pod-v2 Ready] [Pod-v2 Starting]
│
Step 4: [Terminating] [Pod-v2 Ready] [Pod-v2 Ready]
│
Final: [Pod-v2] [Pod-v2] [Pod-v2]
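The pod counts in the sequence above are bounded by the RollingUpdate strategy fields maxSurge and maxUnavailable. A small sketch of that arithmetic (illustrative values; read the real ones from the Deployment with kubectl get deploy mcpgateway -o yaml):

```shell
#!/usr/bin/env bash
# Sketch: pod-count bounds during a RollingUpdate, derived from the
# strategy fields maxSurge and maxUnavailable. Values are illustrative,
# not read from a live cluster.
rollout_bounds() {
  local replicas=$1 max_surge=$2 max_unavailable=$3
  echo "max pods during rollout: $(( replicas + max_surge ))"
  echo "min ready pods during rollout: $(( replicas - max_unavailable ))"
}

rollout_bounds 3 1 1
```

With replicas=3, maxSurge=1, maxUnavailable=1 this reports a 4-pod ceiling and a 2-pod floor, matching the sequence above: never more than four pods in flight, never fewer than two ready.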
📋 Test Environment Setup
Prerequisites
# Kubernetes cluster (kind, minikube, or real cluster)
kubectl cluster-info
# Deploy mcpgateway
kubectl apply -f charts/mcpgateway/
# Verify deployment
kubectl get pods -l app=mcpgateway
kubectl get svc mcpgateway
# Port forward for testing
kubectl port-forward svc/mcpgateway 8000:8000 &
export GATEWAY_URL="http://localhost:8000"

🧪 Manual Test Cases
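Several cases below need to wait for the gateway to come back before sampling results. A reusable retry helper keeps that consistent (a sketch; the health endpoint and timings in the usage comment are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or attempts are exhausted.
# Example (hypothetical endpoint): retry 30 1 curl -fsS "$GATEWAY_URL/health"
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0          # command succeeded
    sleep "$delay"
  done
  echo "failed after $attempts attempts: $*" >&2
  return 1
}
```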
Section 1: Pod Lifecycle
| Case | Scenario | Action | Expected | Validation |
|---|---|---|---|---|
| PL-01 | Pod deletion | kubectl delete | Graceful drain | No 502s |
| PL-02 | Liveness failure | Simulate unhealthy | Pod restarted | RestartCount++ |
| PL-03 | Readiness failure | Simulate not ready | Removed from LB | No traffic |
| PL-04 | OOM kill | Memory pressure | Pod restarted | OOMKilled reason |
PL-01: Pod Deletion with Graceful Shutdown
Preconditions:
- Multiple pods running
- Load generator ready
Steps:
# Step 1: Start continuous load
while true; do
curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health" >> /tmp/responses.log
sleep 0.1
done &
LOAD_PID=$!
# Step 2: Get pod name
POD=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].metadata.name}')
echo "Deleting pod: $POD"
# Step 3: Delete pod
kubectl delete pod $POD
# Step 4: Wait for new pod
kubectl wait --for=condition=ready pod -l app=mcpgateway --timeout=60s
# Step 5: Stop load generator
kill $LOAD_PID
# Step 6: Analyze responses
echo "Total requests: $(wc -l < /tmp/responses.log)"
echo "Failed requests: $(grep -v 200 /tmp/responses.log | wc -l)"
grep -v 200 /tmp/responses.log | sort | uniq -c

Expected Result:
- Pod terminates gracefully
- No 502/503 errors during termination
- New pod starts and becomes ready
- Traffic continues uninterrupted
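The counting in Step 6 can be wrapped in a helper so every run reports the same way (a sketch; assumes the one-status-code-per-line log format the load loop above writes):

```shell
#!/usr/bin/env bash
# Sketch: summarize a response log with one HTTP status code per line
# (the format the load loop writes) into total/failed counts and a rate.
summarize_log() {
  local log=$1
  local total failed
  total=$(wc -l < "$log")
  failed=$(grep -cv '^200$' "$log" || true)   # grep -c prints 0 on no match
  echo "total=$total failed=$failed"
  awk -v t="$total" -v f="$failed" \
    'BEGIN { printf "error_rate=%.2f%%\n", (t ? 100 * f / t : 0) }'
}
```

Usage: `summarize_log /tmp/responses.log` after stopping the load generator.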
PL-02: Liveness Probe Failure
Preconditions:
- Pod with liveness probe configured
- Ability to simulate unhealthy state
Steps:
# Step 1: Check current restart count
kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
# Step 2: Simulate liveness failure
# Option A: Exec into pod and kill the main process (PID 1)
# Note: this crashes the container directly, so the restart is driven by the
# restartPolicy rather than the liveness probe; use Option B to exercise the
# probe path if the app supports it
POD=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -- kill 1
# Option B: If app supports it, trigger unhealthy state via API
# Step 3: Watch pod status
kubectl get pods -l app=mcpgateway -w
# Step 4: Wait for restart
sleep 30
# Step 5: Verify restart count increased
kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'
# Step 6: Verify pod is healthy again
kubectl get pods -l app=mcpgateway
curl -s "$GATEWAY_URL/health" | jq .

Expected Result:
- Liveness probe detects failure
- Kubernetes restarts the pod
- Restart count increases
- Pod becomes healthy after restart
Section 2: Rolling Deployments
| Case | Scenario | Strategy | Expected | Validation |
|---|---|---|---|---|
| RD-01 | Image update | RollingUpdate | Zero downtime | No errors |
| RD-02 | Config change | RollingUpdate | Gradual rollout | Config applied |
| RD-03 | Rollback | kubectl rollout | Previous version | Quick rollback |
RD-01: Rolling Update Zero Downtime
Preconditions:
- Deployment with 3+ replicas
- RollingUpdate strategy configured
Steps:
# Step 1: Start continuous load in background
while true; do
START=$(date +%s%N)
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/api/tools" \
-H "Authorization: Bearer $TOKEN")
END=$(date +%s%N)
LATENCY=$(( ($END - $START) / 1000000 ))
echo "$(date +%H:%M:%S) $HTTP_CODE ${LATENCY}ms"
sleep 0.2
done > /tmp/rolling-update.log &
LOAD_PID=$!
# Step 2: Trigger rolling update (change image or env var)
kubectl set env deployment/mcpgateway ROLLING_UPDATE_TEST=$(date +%s)
# Step 3: Watch rollout
kubectl rollout status deployment/mcpgateway
# Step 4: Stop load generator
kill $LOAD_PID
# Step 5: Analyze results
echo "Total requests: $(wc -l < /tmp/rolling-update.log)"
echo "Failed requests: $(grep -v " 200 " /tmp/rolling-update.log | wc -l)"
echo "Max latency: $(awk '{print $3}' /tmp/rolling-update.log | sort -n | tail -1)"
# Step 6: Show any errors
grep -v " 200 " /tmp/rolling-update.log | head -10

Expected Result:
- All requests return 200
- No dropped connections
- Latency may spike briefly but no errors
- Rollout completes successfully
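The max-latency check in Step 5 generalizes to percentiles, which catch brief spikes that a max alone can hide. A nearest-rank sketch (assumes the "HH:MM:SS CODE <N>ms" line format the load loop above writes):

```shell
#!/usr/bin/env bash
# Sketch: nearest-rank latency percentile from "HH:MM:SS CODE <N>ms" lines.
latency_percentile() {
  local log=$1 pct=$2
  awk '{ sub(/ms$/, "", $3); print $3 }' "$log" | sort -n |
    awk -v p="$pct" '{ v[NR] = $1 }
      END { if (NR) { i = int(NR * p / 100); if (i < 1) i = 1; print v[i] "ms" } }'
}
```

Usage: `latency_percentile /tmp/rolling-update.log 99` for the p99 during the rollout.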
RD-03: Rollback
Preconditions:
- Previous deployment revision exists
Steps:
# Step 1: Check rollout history
kubectl rollout history deployment/mcpgateway
# Step 2: Start load generator
while true; do
curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health"
sleep 0.1
done > /tmp/rollback.log &
LOAD_PID=$!
# Step 3: Trigger rollback
kubectl rollout undo deployment/mcpgateway
# Step 4: Wait for rollback
kubectl rollout status deployment/mcpgateway
# Step 5: Stop load
kill $LOAD_PID
# Step 6: Verify no errors
grep -v 200 /tmp/rollback.log | wc -l

Expected Result:
- Rollback completes quickly
- No service interruption
- Previous version restored
Section 3: Node Failure
| Case | Scenario | Trigger | Expected | Validation |
|---|---|---|---|---|
| NF-01 | Node drain | kubectl drain | Pods rescheduled | No downtime |
| NF-02 | Node cordoned | kubectl cordon | No new pods | Existing work |
| NF-03 | Node crash | Kill node | Pods reschedule | Recovery time |
NF-01: Node Drain
Preconditions:
- Multi-node cluster
- Pods distributed across nodes
- PodDisruptionBudget configured
Steps:
# Step 1: Identify node with gateway pods
NODE=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].spec.nodeName}')
echo "Draining node: $NODE"
# Step 2: Start load generator
while true; do
curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health"
sleep 0.1
done > /tmp/drain.log &
LOAD_PID=$!
# Step 3: Drain the node
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data
# Step 4: Watch pods reschedule
kubectl get pods -l app=mcpgateway -o wide -w &
# Step 5: Wait for reschedule
sleep 30
# Step 6: Stop load
kill $LOAD_PID
# Step 7: Analyze
grep -v 200 /tmp/drain.log | wc -l
# Step 8: Uncordon node
kubectl uncordon $NODE

Expected Result:
- Pods evicted gracefully
- Rescheduled to other nodes
- PDB respected (min available)
- Minimal or no errors during drain
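How many pods the drain may evict at once follows from the PDB arithmetic: with minAvailable set, allowed disruptions are ready replicas minus minAvailable, floored at zero. A sketch with illustrative numbers (the live values come from kubectl get pdb):

```shell
#!/usr/bin/env bash
# Sketch: disruptions a PodDisruptionBudget with minAvailable permits,
# given the currently ready replicas. Illustrative values only.
allowed_disruptions() {
  local ready=$1 min_available=$2
  local d=$(( ready - min_available ))
  (( d < 0 )) && d=0
  echo "$d"
}

allowed_disruptions 3 2   # 3 ready replicas, minAvailable=2 -> prints 1
```

When this reaches 0 (e.g. only 2 ready with minAvailable=2), the drain blocks until a replacement pod becomes ready elsewhere, which is exactly the behavior NF-01 validates.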
Section 4: Horizontal Scaling
| Case | Scenario | Trigger | Expected | Validation |
|---|---|---|---|---|
| HS-01 | Scale up | Manual/HPA | New pods ready | Traffic distributed |
| HS-02 | Scale down | Manual/HPA | Graceful termination | No errors |
| HS-03 | HPA trigger | CPU load | Auto scale | Metrics accurate |
HS-01: Scale Up
Preconditions:
- Deployment with replicas=1
Steps:
# Step 1: Check current replicas
kubectl get deployment mcpgateway -o jsonpath='{.spec.replicas}'
# Step 2: Scale up
kubectl scale deployment mcpgateway --replicas=5
# Step 3: Watch pods come up
kubectl get pods -l app=mcpgateway -w
# Step 4: Wait for all ready
kubectl wait --for=condition=ready pod -l app=mcpgateway --timeout=120s
# Step 5: Verify traffic distribution
for i in {1..20}; do
curl -s "$GATEWAY_URL/health" | jq -r '.hostname // empty'
done | sort | uniq -c
# Step 6: Verify all pods serving traffic

Expected Result:
- New pods start quickly
- Become ready and receive traffic
- Load distributed across all pods
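Step 5's `uniq -c` output can be checked mechanically (a sketch; it assumes the health response exposes a hostname field as in Step 5, which may not hold for every build):

```shell
#!/usr/bin/env bash
# Sketch: given "count hostname" lines on stdin (the sort | uniq -c output
# from Step 5), check that the expected number of distinct pods appeared.
check_distribution() {
  local expected=$1
  local seen
  seen=$(wc -l | tr -d ' ')     # one line per distinct hostname
  if [ "$seen" -eq "$expected" ]; then
    echo "OK: traffic reached all $expected pods"
  else
    echo "WARN: only $seen of $expected pods received traffic"
  fi
}
```

Usage: pipe the Step 5 loop into it, e.g. `... | sort | uniq -c | check_distribution 5`.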
📊 Test Matrix
| Test Case | Lifecycle | Rolling | Node | Scaling | GKE | EKS | AKS |
|---|---|---|---|---|---|---|---|
| PL-01 | ✓ | ✓ | ✓ | ✓ | | | |
| PL-02 | ✓ | ✓ | ✓ | ✓ | | | |
| PL-03 | ✓ | ✓ | ✓ | ✓ | | | |
| PL-04 | ✓ | ✓ | ✓ | ✓ | | | |
| RD-01 | ✓ | ✓ | ✓ | ✓ | | | |
| RD-02 | ✓ | ✓ | ✓ | ✓ | | | |
| RD-03 | ✓ | ✓ | ✓ | ✓ | | | |
| NF-01 | ✓ | ✓ | ✓ | ✓ | | | |
| NF-02 | ✓ | ✓ | ✓ | ✓ | | | |
| NF-03 | ✓ | ✓ | ✓ | ✓ | | | |
| HS-01 | ✓ | ✓ | ✓ | ✓ | | | |
| HS-02 | ✓ | ✓ | ✓ | ✓ | | | |
| HS-03 | ✓ | ✓ | ✓ | ✓ | | | |
✅ Success Criteria
- All 13 test cases pass
- Pod deletion is graceful (no dropped requests)
- Liveness/readiness probes work correctly
- Rolling updates are zero-downtime
- Rollback works quickly
- Node drain respects PDB
- Scaling works correctly
- HPA triggers appropriately
🔗 Related Files
- charts/mcpgateway/ - Helm chart
- charts/mcpgateway/templates/deployment.yaml
- charts/mcpgateway/templates/pdb.yaml
🔗 Related Issues
- [TESTING][OPERATIONS]: Health Monitoring Manual Test Plan (Liveness, Readiness, Dependencies) #2462 - Health Monitoring
- Helm chart testing