
[TESTING][RESILIENCE]: Kubernetes Resilience Manual Test Plan (Pod Deletion, Node Failure, Rolling Updates) #2468


Goal

Produce a comprehensive manual test plan for Kubernetes resilience testing including pod lifecycle events, node failures, rolling deployments, and horizontal scaling behavior.

Why Now?

Kubernetes deployments require validated resilience:

  1. Zero-Downtime: Rolling updates shouldn't drop requests
  2. Self-Healing: Pod failures should trigger restarts
  3. Scaling: Horizontal scaling must work correctly
  4. Node Tolerance: Node failures shouldn't cause outages

📖 User Stories

US-1: Platform Operator - Pod Lifecycle

As a Platform Operator
I want pods to handle lifecycle events gracefully
So that deployments don't cause service disruptions

Acceptance Criteria:

Feature: Pod Lifecycle

  Scenario: Pod deletion with graceful shutdown
    Given a pod is running and serving traffic
    When the pod receives SIGTERM
    Then it should stop accepting new connections
    And finish processing existing requests
    And exit within terminationGracePeriodSeconds

  Scenario: Pod restart due to liveness failure
    Given a pod is in an unhealthy state
    When liveness probe fails
    Then Kubernetes should restart the pod
    And traffic should route to healthy pods meanwhile
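
The SIGTERM scenario above can be sketched as a toy shell process — a stand-in for the graceful-shutdown contract, assuming nothing about mcpgateway's actual handler: on SIGTERM, stop taking new work, finish in-flight work, exit cleanly within the grace period.

```shell
# Toy stand-in for the graceful-shutdown contract (NOT mcpgateway itself):
# on SIGTERM, finish in-flight work, then exit 0 well inside
# terminationGracePeriodSeconds.
server() {
  local draining=0
  trap 'draining=1' TERM
  while true; do
    sleep 0.1                      # stand-in for serving one request
    if [ "$draining" -eq 1 ]; then
      echo "drained"               # in-flight work finished
      return 0
    fi
  done
}

server & SRV=$!
sleep 0.3          # let it "serve" for a moment
kill -TERM $SRV    # what the kubelet sends before the grace period expires
wait $SRV
echo "exit=$?"     # prints "drained" then "exit=0"
```

A real readiness probe should start failing at the same moment, so the Service stops routing new connections to the draining pod.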
US-2: SRE - Rolling Deployments

As an SRE
I want rolling deployments to be zero-downtime
So that users don't experience service interruptions

Acceptance Criteria:

Feature: Rolling Deployments

  Scenario: Rolling update with no dropped requests
    Given a deployment with 3 replicas
    When I update the deployment image
    Then old pods should drain connections
    And new pods should be ready before old ones terminate
    And no requests should fail during rollout

🏗 Architecture

Kubernetes Deployment Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KUBERNETES DEPLOYMENT ARCHITECTURE                        │
└─────────────────────────────────────────────────────────────────────────────┘

    INGRESS                 SERVICE                  DEPLOYMENT
    ───────                 ───────                  ──────────

  ┌─────────────┐        ┌─────────────┐        ┌─────────────────────┐
  │   Ingress   │───────▶│   Service   │───────▶│    Pod 1            │
  │   Controller│        │  (ClusterIP)│        │  ┌───────────────┐  │
  └─────────────┘        │             │        │  │ mcpgateway    │  │
                         │   Selector: │        │  │ :8000         │  │
                         │   app=      │        │  └───────────────┘  │
                         │   mcpgateway│        │  Liveness: /health  │
                         │             │        │  Readiness: /ready  │
                         │             │        └─────────────────────┘
                         │             │
                         │             │        ┌─────────────────────┐
                         │             │───────▶│    Pod 2            │
                         │             │        │  ┌───────────────┐  │
                         │             │        │  │ mcpgateway    │  │
                         │             │        │  │ :8000         │  │
                         │             │        │  └───────────────┘  │
                         └─────────────┘        └─────────────────────┘


    ROLLING UPDATE SEQUENCE
    ───────────────────────

    Initial:  [Pod-v1] [Pod-v1] [Pod-v1]
                 │
    Step 1:   [Pod-v1] [Pod-v1] [Pod-v1] [Pod-v2 Starting]
                 │
    Step 2:   [Pod-v1] [Pod-v1] [Terminating] [Pod-v2 Ready]
                 │
    Step 3:   [Pod-v1] [Pod-v2 Ready] [Pod-v2 Starting]
                 │
    Step 4:   [Terminating] [Pod-v2 Ready] [Pod-v2 Ready]
                 │
    Final:    [Pod-v2] [Pod-v2] [Pod-v2]
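
The four-pod peak in Step 1 above falls out of the RollingUpdate parameters. Assuming the Kubernetes defaults of maxSurge=25% and maxUnavailable=25% (check the chart's deployment.yaml for the actual values), surge rounds up and unavailable rounds down:

```shell
# Pod-count bounds during a rolling update of replicas=3 with the default
# maxSurge=25% (rounds UP) and maxUnavailable=25% (rounds DOWN).
replicas=3
surge_pct=25
unavail_pct=25

max_surge=$(( (replicas * surge_pct + 99) / 100 ))    # ceil(0.75) -> 1
max_unavailable=$(( replicas * unavail_pct / 100 ))   # floor(0.75) -> 0

echo "max pods during rollout: $(( replicas + max_surge ))"            # 4
echo "min ready pods during rollout: $(( replicas - max_unavailable ))" # 3
```

This is why the sequence never drops below 3 ready pods: with maxUnavailable rounding to 0, an old pod may only terminate once a new one is Ready.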

📋 Test Environment Setup

Prerequisites

# Kubernetes cluster (kind, minikube, or real cluster)
kubectl cluster-info

# Deploy mcpgateway (charts/mcpgateway/ is a Helm chart, so install via helm)
helm upgrade --install mcpgateway charts/mcpgateway/

# Verify deployment
kubectl get pods -l app=mcpgateway
kubectl get svc mcpgateway

# Port forward for testing
kubectl port-forward svc/mcpgateway 8000:8000 &
export GATEWAY_URL="http://localhost:8000"
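
Some cases below (RD-01) call authenticated endpoints with a bearer token in `$TOKEN`, which the prerequisites above do not mint. How the token is obtained depends on your mcpgateway auth configuration — the export below is a placeholder, not a real command:

```shell
# RD-01 sends "Authorization: Bearer $TOKEN"; export a valid token first.
# Minting one depends on your mcpgateway auth setup -- placeholder value:
export TOKEN="<paste-a-valid-bearer-token>"

# Sanity check: should print 200 before you start the test cases
curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/api/tools" \
  -H "Authorization: Bearer $TOKEN"
```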

🧪 Manual Test Cases

Section 1: Pod Lifecycle

| Case  | Scenario          | Action              | Expected        | Validation      |
|-------|-------------------|---------------------|-----------------|-----------------|
| PL-01 | Pod deletion      | `kubectl delete`    | Graceful drain  | No 502s         |
| PL-02 | Liveness failure  | Simulate unhealthy  | Pod restarted   | RestartCount++  |
| PL-03 | Readiness failure | Simulate not ready  | Removed from LB | No traffic      |
| PL-04 | OOM kill          | Memory pressure     | Pod restarted   | OOMKilled reason|

PL-01: Pod Deletion with Graceful Shutdown

Preconditions:

  • Multiple pods running
  • Load generator ready

Steps:

# Step 1: Start continuous load (truncate any log left over from earlier runs)
: > /tmp/responses.log
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health" >> /tmp/responses.log
  sleep 0.1
done &
LOAD_PID=$!

# Step 2: Get pod name
POD=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].metadata.name}')
echo "Deleting pod: $POD"

# Step 3: Delete pod
kubectl delete pod $POD

# Step 4: Wait for new pod
kubectl wait --for=condition=ready pod -l app=mcpgateway --timeout=60s

# Step 5: Stop load generator
kill $LOAD_PID

# Step 6: Analyze responses
echo "Total requests: $(wc -l < /tmp/responses.log)"
echo "Failed requests: $(grep -cv '^200$' /tmp/responses.log)"
grep -v '^200$' /tmp/responses.log | sort | uniq -c
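
The counting in Step 6 is worth doing strictly (anchored match plus an error-rate percentage). Shown here against a synthetic log so it runs standalone — point it at /tmp/responses.log in the real test:

```shell
# Strict analysis of a one-status-code-per-line log. Synthetic sample data
# stands in for /tmp/responses.log so this can be run anywhere.
LOG=$(mktemp)
printf '200\n200\n503\n200\n502\n200\n' > "$LOG"

total=$(wc -l < "$LOG")
failed=$(grep -cv '^200$' "$LOG")
echo "total=$total failed=$failed"
awk -v t="$total" -v f="$failed" 'BEGIN { printf "error_rate=%.1f%%\n", 100*f/t }'
grep -v '^200$' "$LOG" | sort | uniq -c   # breakdown of failure codes

rm -f "$LOG"
```

Anchoring the pattern (`^200$`) avoids accidentally matching "200" inside another code or a stray log fragment.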

Expected Result:

  • Pod terminates gracefully
  • No 502/503 errors during termination
  • New pod starts and becomes ready
  • Traffic continues uninterrupted

PL-02: Liveness Probe Failure

Preconditions:

  • Pod with liveness probe configured
  • Ability to simulate unhealthy state

Steps:

# Step 1: Check current restart count
kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'

# Step 2: Simulate a failure
# Option A: Kill PID 1 in the container. Note: this exercises self-healing
# via a crash restart rather than the liveness probe path itself.
POD=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].metadata.name}')
kubectl exec $POD -- kill 1

# Option B: If the app supports it, make the liveness endpoint report
# unhealthy via an API -- this tests the probe path directly.

# Step 3: Watch pod status (Ctrl+C to stop, or run in a second terminal)
kubectl get pods -l app=mcpgateway -w

# Step 4: Wait for restart
sleep 30

# Step 5: Verify restart count increased
kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].status.containerStatuses[0].restartCount}'

# Step 6: Verify pod is healthy again
kubectl get pods -l app=mcpgateway
curl -s "$GATEWAY_URL/health" | jq .

Expected Result:

  • Liveness probe detects failure
  • Kubernetes restarts the pod
  • Restart count increases
  • Pod becomes healthy after restart

Section 2: Rolling Deployments

| Case  | Scenario      | Strategy         | Expected         | Validation     |
|-------|---------------|------------------|------------------|----------------|
| RD-01 | Image update  | RollingUpdate    | Zero downtime    | No errors      |
| RD-02 | Config change | RollingUpdate    | Gradual rollout  | Config applied |
| RD-03 | Rollback      | `kubectl rollout`| Previous version | Quick rollback |

RD-01: Rolling Update Zero Downtime

Preconditions:

  • Deployment with 3+ replicas
  • RollingUpdate strategy configured

Steps:

# Step 1: Start continuous load in background
while true; do
  START=$(date +%s%N)
  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/api/tools" \
    -H "Authorization: Bearer $TOKEN")
  END=$(date +%s%N)
  LATENCY=$(( ($END - $START) / 1000000 ))
  echo "$(date +%H:%M:%S) $HTTP_CODE ${LATENCY}ms"
  sleep 0.2
done > /tmp/rolling-update.log &
LOAD_PID=$!

# Step 2: Trigger rolling update (change image or env var)
kubectl set env deployment/mcpgateway ROLLING_UPDATE_TEST=$(date +%s)

# Step 3: Watch rollout
kubectl rollout status deployment/mcpgateway

# Step 4: Stop load generator
kill $LOAD_PID

# Step 5: Analyze results
echo "Total requests: $(wc -l < /tmp/rolling-update.log)"
echo "Failed requests: $(grep -v " 200 " /tmp/rolling-update.log | wc -l)"
echo "Max latency: $(awk '{print $3}' /tmp/rolling-update.log | sort -n | tail -1)"

# Step 6: Show any errors
grep -v " 200 " /tmp/rolling-update.log | head -10
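
Beyond max latency, the same log yields percentiles. A sketch using the nearest-rank method, shown against synthetic lines in the "HH:MM:SS CODE NNNms" format the loop above writes:

```shell
# p95 latency from lines shaped like "12:00:01 200 45ms" (field 3 = latency).
# Synthetic sample stands in for /tmp/rolling-update.log.
LOG=$(mktemp)
printf '12:00:00 200 40ms\n12:00:01 200 45ms\n12:00:02 200 300ms\n12:00:03 200 50ms\n' > "$LOG"

p95=$(awk '{ sub(/ms$/, "", $3); print $3 }' "$LOG" \
  | sort -n \
  | awk '{ v[NR] = $1 }
         END { idx = int(NR * 0.95); if (idx < NR * 0.95) idx++;
               if (idx < 1) idx = 1; print v[idx] "ms" }')
echo "p95=$p95"

rm -f "$LOG"
```

A p95 that spikes during the rollout while all codes stay 200 matches the "latency may spike briefly but no errors" expectation below.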

Expected Result:

  • All requests return 200
  • No dropped connections
  • Latency may spike briefly but no errors
  • Rollout completes successfully

RD-03: Rollback

Preconditions:

  • Previous deployment revision exists

Steps:

# Step 1: Check rollout history
kubectl rollout history deployment/mcpgateway

# Step 2: Start load generator
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health"
  sleep 0.1
done > /tmp/rollback.log &
LOAD_PID=$!

# Step 3: Trigger rollback
kubectl rollout undo deployment/mcpgateway

# Step 4: Wait for rollback
kubectl rollout status deployment/mcpgateway

# Step 5: Stop load
kill $LOAD_PID

# Step 6: Verify no errors (count of non-200 responses; expect 0)
grep -cv '^200$' /tmp/rollback.log

Expected Result:

  • Rollback completes quickly
  • No service interruption
  • Previous version restored

Section 3: Node Failure

| Case  | Scenario      | Trigger          | Expected             | Validation    |
|-------|---------------|------------------|----------------------|---------------|
| NF-01 | Node drain    | `kubectl drain`  | Pods rescheduled     | No downtime   |
| NF-02 | Node cordoned | `kubectl cordon` | No new pods          | Existing work |
| NF-03 | Node crash    | Kill node        | Pods reschedule      | Recovery time |

NF-01: Node Drain

Preconditions:

  • Multi-node cluster
  • Pods distributed across nodes
  • PodDisruptionBudget configured

Steps:

# Step 1: Identify node with gateway pods
NODE=$(kubectl get pods -l app=mcpgateway -o jsonpath='{.items[0].spec.nodeName}')
echo "Draining node: $NODE"

# Step 2: Start load generator
while true; do
  curl -s -o /dev/null -w "%{http_code}\n" "$GATEWAY_URL/health"
  sleep 0.1
done > /tmp/drain.log &
LOAD_PID=$!

# Step 3: Drain the node
kubectl drain $NODE --ignore-daemonsets --delete-emptydir-data

# Step 4: Watch pods reschedule in the background
kubectl get pods -l app=mcpgateway -o wide -w &
WATCH_PID=$!

# Step 5: Wait for reschedule
sleep 30

# Step 6: Stop load and watch
kill $LOAD_PID $WATCH_PID

# Step 7: Analyze (count of non-200 responses)
grep -cv '^200$' /tmp/drain.log

# Step 8: Uncordon node
kubectl uncordon $NODE

Expected Result:

  • Pods evicted gracefully
  • Rescheduled to other nodes
  • PDB respected (min available)
  • Minimal or no errors during drain
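
NF-01 leans on the PodDisruptionBudget in charts/mcpgateway/templates/pdb.yaml: the drain may only evict as many pods as the budget allows. The arithmetic, assuming minAvailable is an absolute count (the chart's actual setting may differ):

```shell
# Eviction headroom under a PDB: a drain may evict at most
# (healthy - minAvailable) pods at a time. Illustrative values only.
healthy=3
min_available=2

allowed_disruptions=$(( healthy - min_available ))
echo "allowed disruptions: $allowed_disruptions"
# With 0 allowed, "kubectl drain" blocks and retries until another pod
# becomes Ready elsewhere -- that is the "PDB respected" behavior
# NF-01 validates.
```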

Section 4: Horizontal Scaling

| Case  | Scenario    | Trigger    | Expected             | Validation          |
|-------|-------------|------------|----------------------|---------------------|
| HS-01 | Scale up    | Manual/HPA | New pods ready       | Traffic distributed |
| HS-02 | Scale down  | Manual/HPA | Graceful termination | No errors           |
| HS-03 | HPA trigger | CPU load   | Auto scale           | Metrics accurate    |

HS-01: Scale Up

Preconditions:

  • Deployment with replicas=1

Steps:

# Step 1: Check current replicas
kubectl get deployment mcpgateway -o jsonpath='{.spec.replicas}'

# Step 2: Scale up
kubectl scale deployment mcpgateway --replicas=5

# Step 3: Watch pods come up
kubectl get pods -l app=mcpgateway -w

# Step 4: Wait for all ready
kubectl wait --for=condition=ready pod -l app=mcpgateway --timeout=120s

# Step 5: Verify traffic distribution
for i in {1..20}; do
  curl -s "$GATEWAY_URL/health" | jq -r '.hostname // empty'
done | sort | uniq -c

# Step 6: Verify all pods serving traffic

Expected Result:

  • New pods start quickly
  • Become ready and receive traffic
  • Load distributed across all pods
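
The `sort | uniq -c` output from Step 5 can be checked mechanically: every replica should appear at least once. A sketch against synthetic hostnames (three pods here; set `expected_pods` to the real replica count):

```shell
# Balance check over a list of responding hostnames, one per line.
# Synthetic data stands in for the real "$GATEWAY_URL/health" responses.
HOSTS=$(mktemp)
printf 'pod-a\npod-b\npod-c\npod-a\npod-b\npod-c\n' > "$HOSTS"

expected_pods=3
seen=$(sort -u "$HOSTS" | wc -l)
echo "distinct pods seen: $seen / $expected_pods"
if [ "$seen" -eq "$expected_pods" ]; then
  echo "load reached every pod"
else
  echo "WARNING: some pods received no traffic"
fi

rm -f "$HOSTS"
```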

📊 Test Matrix

| Test Case | Lifecycle | Rolling | Node | Scaling | GKE | EKS | AKS |
|-----------|:---------:|:-------:|:----:|:-------:|:---:|:---:|:---:|
| PL-01     | ✓         |         |      |         |     |     |     |
| PL-02     | ✓         |         |      |         |     |     |     |
| PL-03     | ✓         |         |      |         |     |     |     |
| PL-04     | ✓         |         |      |         |     |     |     |
| RD-01     |           | ✓       |      |         |     |     |     |
| RD-02     |           | ✓       |      |         |     |     |     |
| RD-03     |           | ✓       |      |         |     |     |     |
| NF-01     |           |         | ✓    |         |     |     |     |
| NF-02     |           |         | ✓    |         |     |     |     |
| NF-03     |           |         | ✓    |         |     |     |     |
| HS-01     |           |         |      | ✓       |     |     |     |
| HS-02     |           |         |      | ✓       |     |     |     |
| HS-03     |           |         |      | ✓       |     |     |     |

(GKE/EKS/AKS columns are left blank to record pass/fail per environment.)

✅ Success Criteria

  • All 13 test cases pass
  • Pod deletion is graceful (no dropped requests)
  • Liveness/readiness probes work correctly
  • Rolling updates are zero-downtime
  • Rollback works quickly
  • Node drain respects PDB
  • Scaling works correctly
  • HPA triggers appropriately

🔗 Related Files

  • charts/mcpgateway/ - Helm chart
  • charts/mcpgateway/templates/deployment.yaml
  • charts/mcpgateway/templates/pdb.yaml

🔗 Related Issues

Labels

  • SHOULD
  • P2: Important but not vital; high-value items that are not crucial for the immediate release
  • chore: Linting, formatting, dependency hygiene, or project maintenance chores
  • manual-testing: Manual testing / test planning issues
  • ready: Validated, ready-to-work-on items
  • testing: Testing (unit, e2e, manual, automated, etc)
