K8s scenarios: Node NotReady, Pending Pods, Stuck Rollouts #262

## Goal

We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These three scenarios test infrastructure-level failures where the root cause isn't visible in a single evidence source. The agent has to correlate node conditions with pod states, read scheduler events, and understand deployment rollout mechanics.

## Background

These scenarios require the test harness from #260 and the 000-healthy base scenario (which can come from #261 or be created standalone here). They are independent of the other scenario issues and can be delivered in any order.

## Scenarios

**004-node-not-ready**

A node goes unhealthy due to disk pressure, memory pressure, or PID exhaustion. Some pods on that node become unresponsive or get stuck terminating. Pods on other nodes stay healthy.

Key signals: one node with `ready: false` and a pressure condition set to true in `eks_node_health`, pods on that node in `Unknown` phase, and NodeNotReady warning events. The tricky part is that pods on other nodes are still fine, so the agent must not declare the cluster healthy.
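A minimal sketch of the correlation this scenario expects, assuming a simplified evidence shape. The field names `name`, `disk_pressure`, `memory_pressure`, and `pid_pressure`, and all three helper functions, are hypothetical; the real shape is whatever `EKSNodeHealthTool` returns.

```python
from typing import Any

# Hypothetical pressure keys; check the real names against EKSNodeHealthTool.
PRESSURE_KEYS = ("disk_pressure", "memory_pressure", "pid_pressure")


def find_not_ready_nodes(nodes: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map each not-ready node name to its active pressure conditions."""
    not_ready: dict[str, list[str]] = {}
    for node in nodes:
        if not node.get("ready", False):
            not_ready[node["name"]] = [k for k in PRESSURE_KEYS if node.get(k)]
    return not_ready


def affected_pods(pods: list[dict[str, Any]], node_names: set[str]) -> list[str]:
    """Names of pods in Unknown phase that sit on one of the given nodes."""
    return [
        pod["name"]
        for pod in pods
        if pod.get("node_name") in node_names and pod.get("phase") == "Unknown"
    ]


def cluster_is_healthy(nodes: list[dict[str, Any]], pods: list[dict[str, Any]]) -> bool:
    """Healthy only if every node is ready and every pod is Running."""
    all_nodes_ready = not find_not_ready_nodes(nodes)
    return all_nodes_ready and all(pod.get("phase") == "Running" for pod in pods)
```

The point of `cluster_is_healthy` is that a single bad node outweighs any number of healthy pods elsewhere, which is the behaviour the scenario checks for.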

**005-pending-pod**

A pod can't be scheduled. The scheduler has no node that satisfies the pod's resource requests, affinity rules, or tolerations. The pod sits in Pending state indefinitely.

Key signals: pod in `Pending` phase with no node_name assigned, `PodScheduled: False` condition, and FailedScheduling events with a message like "0/3 nodes are available: 3 Insufficient cpu". Node health shows that allocatable CPU or memory is almost fully used.
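As a rough illustration of how these signals combine, the sketch below flags an unschedulable pod and pulls the node counts out of a FailedScheduling message. The dictionary keys (`conditions`, `reason`, `pod`, `message`) and the function name are assumptions for illustration, not the actual `EKSEventsTool` shape.

```python
import re
from typing import Any

# Matches the leading part of a scheduler message such as
# "0/3 nodes are available: 3 Insufficient cpu."
FAILED_SCHEDULING_RE = re.compile(r"(\d+)/(\d+) nodes are available: (.+)")


def diagnose_pending_pod(pod: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Summarize why a pod is Pending, using hypothetical evidence keys."""
    unscheduled = pod.get("phase") == "Pending" and not pod.get("node_name")
    condition_false = any(
        c.get("type") == "PodScheduled" and c.get("status") == "False"
        for c in pod.get("conditions", [])
    )
    report: dict[str, Any] = {
        "unschedulable": bool(unscheduled and condition_false),
        "nodes_available": None,
        "nodes_total": None,
        "detail": None,
    }
    for event in events:
        # Only consider FailedScheduling events that refer to this pod.
        if event.get("reason") != "FailedScheduling" or event.get("pod") != pod.get("name"):
            continue
        match = FAILED_SCHEDULING_RE.search(event.get("message", ""))
        if match:
            report["nodes_available"] = int(match.group(1))
            report["nodes_total"] = int(match.group(2))
            report["detail"] = match.group(3).strip()
            break
    return report
```

Requiring both the missing node assignment and the `PodScheduled: False` condition avoids flagging a pod that is Pending only because its image is still being pulled.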

**006-deployment-rollout-stuck**

A deployment update created a new ReplicaSet, but the new pods can't come up. The old pods are still running fine. The deployment shows unavailable replicas and eventually hits `ProgressDeadlineExceeded`.

Key signals: deployment with more desired than ready replicas, `degraded: true`, and the Progressing condition set to False. The evidence shows a mix of old healthy pods and new broken pods. The agent must identify the stuck rollout rather than declaring partial health.
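The stuck-rollout check can be sketched as follows, assuming a deployment record with hypothetical `desired`, `ready`, and `conditions` keys (the real shape comes from `EKSListDeploymentsTool`). The `rollout_is_stuck` helper is illustrative only.

```python
from typing import Any


def rollout_is_stuck(deployment: dict[str, Any]) -> bool:
    """True when ready replicas lag desired and the progress deadline was exceeded."""
    desired = deployment.get("desired", 0)
    ready = deployment.get("ready", 0)
    # Find the Progressing condition, if the evidence includes one.
    progressing = next(
        (c for c in deployment.get("conditions", []) if c.get("type") == "Progressing"),
        None,
    )
    deadline_exceeded = (
        progressing is not None
        and progressing.get("status") == "False"
        and progressing.get("reason") == "ProgressDeadlineExceeded"
    )
    return desired > ready and deadline_exceeded
```

A deployment that is merely mid-rollout also has fewer ready than desired replicas, so the Progressing condition is what separates a slow rollout from a stuck one.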

## Done when

- All three scenarios load and validate through the scenario loader
- Agent returns the correct root_cause_category for each
- Agent does not declare the cluster healthy when only some pods/nodes are healthy
- Evidence files match real EKS/Datadog response shapes
- `make lint && make typecheck && make test-cov` pass

## Reference

- Test harness: #260
- EKS node health shape: `app/tools/EKSNodeHealthTool/__init__.py`
- EKS deployment shape: `app/tools/EKSListDeploymentsTool/__init__.py`
- EKS events shape: `app/tools/EKSEventsTool/__init__.py`
