K8s scenarios: Node NotReady, Pending Pods, Stuck Rollouts #262

@davincios

Goal

We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These three scenarios test infrastructure-level failures where the root cause isn't visible in a single evidence source. The agent has to correlate node conditions with pod states, read scheduler events, and understand deployment rollout mechanics.

Background

Requires the test harness from #260 and the 000-healthy base scenario (which can come from #261, or be created standalone here). These are independent of the other scenario issues and can be delivered in any order.

Scenarios

004-node-not-ready

A node goes unhealthy due to disk pressure, memory pressure, or PID exhaustion. Some pods on that node become unresponsive or get stuck terminating. Pods on other nodes stay healthy.

Key signals: one node with ready: false and a pressure condition set to true in eks_node_health, pods on that node in Unknown phase, and NodeNotReady warning events. The tricky part is that pods on the other nodes are still fine, so the agent must not declare the cluster healthy just because most of it is.
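As a rough illustration of the correlation this scenario asks for, the sketch below joins node health with pod state. It assumes evidence shaped as plain dicts: `ready`, `node_name`, and the Unknown phase come from the description above, while `name`, `conditions`, `phase`, and the function name `summarize_node_failures` are hypothetical and may not match the real EKSNodeHealthTool output.

```python
from typing import Any

# Standard Kubernetes node pressure condition types.
PRESSURE_CONDITIONS = ("DiskPressure", "MemoryPressure", "PIDPressure")


def summarize_node_failures(
    eks_node_health: list[dict[str, Any]],
    pods: list[dict[str, Any]],
) -> dict[str, Any]:
    """Correlate not-ready nodes with the pods stranded on them.

    Assumed (hypothetical) shapes:
      node: {"name": str, "ready": bool, "conditions": {type: bool}}
      pod:  {"name": str, "node_name": str, "phase": str}
    """
    unhealthy_nodes: dict[str, list[str]] = {}
    for node in eks_node_health:
        if node.get("ready", True):
            continue
        conditions = node.get("conditions", {})
        unhealthy_nodes[node["name"]] = [
            c for c in PRESSURE_CONDITIONS if conditions.get(c)
        ]

    stranded_pods = [
        pod["name"]
        for pod in pods
        if pod.get("node_name") in unhealthy_nodes
        and pod.get("phase") == "Unknown"
    ]

    return {
        "unhealthy_nodes": unhealthy_nodes,
        "stranded_pods": stranded_pods,
        # One bad node is enough to rule out a "healthy" verdict,
        # no matter how many other nodes and pods look fine.
        "healthy": not unhealthy_nodes,
    }
```

The `healthy` flag is deliberately driven by the presence of any not-ready node rather than by a majority of healthy pods, which is the failure mode this scenario is meant to catch.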

005-pending-pod

A pod can't be scheduled. The scheduler has no node that satisfies the pod's resource requests, affinity rules, or tolerations. The pod sits in Pending state indefinitely.

Key signals: pod in Pending phase with no node_name assigned, PodScheduled: False condition, FailedScheduling events saying something like "0/3 nodes are available: insufficient cpu". Node health shows high allocatable usage.
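A minimal sketch of how these signals could be checked, assuming pods and events arrive as plain dicts. The `node_name` field, the Pending phase, and the PodScheduled condition come from the description above. The `conditions` list layout and both function names are hypothetical, and the regex only targets the example message quoted above. Real scheduler messages may carry more detail.

```python
import re
from typing import Any, Optional

# Matches the "<available>/<total> nodes are available: <reason>" pattern
# quoted in the scenario description.
_SCHEDULING_RE = re.compile(r"(\d+)/(\d+) nodes are available: (.+)")


def is_unschedulable(pod: dict[str, Any]) -> bool:
    """True when a pod is Pending, unassigned, and PodScheduled is False.

    Assumed (hypothetical) shape:
      {"phase": str, "node_name": str | None,
       "conditions": [{"type": str, "status": str}]}
    """
    conditions = {c["type"]: c["status"] for c in pod.get("conditions", [])}
    return (
        pod.get("phase") == "Pending"
        and not pod.get("node_name")
        and conditions.get("PodScheduled") == "False"
    )


def parse_failed_scheduling(message: str) -> Optional[dict[str, Any]]:
    """Extract node counts and the stated reason from a FailedScheduling message."""
    match = _SCHEDULING_RE.search(message)
    if match is None:
        return None
    return {
        "available": int(match.group(1)),
        "total": int(match.group(2)),
        "reason": match.group(3).rstrip("."),
    }
```

Combining the two checks lets the agent tie the Pending pod to the scheduler's stated reason instead of reporting only that a pod has not started.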

006-deployment-rollout-stuck

A deployment update created a new ReplicaSet, but the new pods can't come up. The old pods are still running fine. The deployment shows unavailable replicas and eventually hits ProgressDeadlineExceeded.

Key signals: deployment with desired > ready, degraded: true, Progressing condition set to False. Mix of old healthy pods and new broken pods. The agent must identify the stuck rollout rather than reporting the deployment as partially healthy.
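The rollout check can be sketched along the same lines. This assumes a deployment dict carrying the `desired`, `ready`, and `degraded` fields named above plus a hypothetical `conditions` list. The real EKSListDeploymentsTool shape may differ, and `rollout_is_stuck` is an illustrative name only.

```python
from typing import Any


def rollout_is_stuck(deployment: dict[str, Any]) -> bool:
    """True when a deployment is short on ready replicas and has stopped progressing.

    Assumed (hypothetical) shape:
      {"desired": int, "ready": int, "degraded": bool,
       "conditions": [{"type": str, "status": str, "reason": str}]}
    """
    desired = deployment.get("desired", 0)
    ready = deployment.get("ready", 0)

    progressing = next(
        (c for c in deployment.get("conditions", []) if c.get("type") == "Progressing"),
        None,
    )
    # Progressing=False is what Kubernetes reports once the progress
    # deadline has passed (reason ProgressDeadlineExceeded).
    stopped_progressing = (
        progressing is not None and progressing.get("status") == "False"
    )

    return desired > ready and stopped_progressing
```

Requiring both signals keeps a rollout that is merely slow (desired > ready while Progressing is still True) from being reported as stuck.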

Done when

  • All three scenarios load and validate through the scenario loader
  • Agent returns the correct root_cause_category for each
  • Agent does not declare healthy when only some pods/nodes are healthy
  • Evidence files match real EKS/Datadog response shapes
  • make lint && make typecheck && make test-cov pass

Reference

  • Test harness: Build the K8s synthetic test harness #260
  • EKS node health shape: app/tools/EKSNodeHealthTool/__init__.py
  • EKS deployment shape: app/tools/EKSListDeploymentsTool/__init__.py
  • EKS events shape: app/tools/EKSEventsTool/__init__.py

Labels: enhancement, help wanted
