Goal
We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These three scenarios test infrastructure-level failures where the root cause isn't visible in a single evidence source. The agent has to correlate node conditions with pod states, read scheduler events, and understand deployment rollout mechanics.
Background
Requires the test harness from #260 and the 000-healthy base scenario (which can come from #261, or be created standalone here). These are independent of the other scenario issues and can be delivered in any order.
Scenarios
004-node-not-ready
A node goes unhealthy due to disk pressure, memory pressure, or PID exhaustion. Some pods on that node become unresponsive or get stuck terminating. Pods on other nodes stay healthy.
Key signals: one node with ready: false and a pressure condition set to true in eks_node_health, pods on that node in Unknown phase, NodeNotReady warning events. The tricky part is that pods on other nodes are still fine, so the agent must not declare the cluster healthy.
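To make the correlation concrete, here is a minimal sketch of how the three signals could be joined per node. The evidence shape is an assumption for illustration: only eks_node_health, ready, the Unknown phase, and the NodeNotReady reason come from the scenario text. The keys conditions, pods, events, name, and node_name, and the function find_not_ready_nodes, are hypothetical and may not match the real tool output.

```python
# Hypothetical evidence fixture; the real EKS/Datadog response shape may differ.
SAMPLE_EVIDENCE = {
    "eks_node_health": [
        {
            "name": "node-a",
            "ready": True,
            "conditions": {"DiskPressure": False, "MemoryPressure": False, "PIDPressure": False},
        },
        {
            "name": "node-b",
            "ready": False,
            "conditions": {"DiskPressure": True, "MemoryPressure": False, "PIDPressure": False},
        },
    ],
    "pods": [
        {"name": "api-1", "node_name": "node-a", "phase": "Running"},
        {"name": "api-2", "node_name": "node-b", "phase": "Unknown"},
    ],
    "events": [
        {"type": "Warning", "reason": "NodeNotReady", "node_name": "node-b"},
    ],
}


def find_not_ready_nodes(evidence):
    """Return one finding per not-ready node, joined with its pods and events."""
    findings = {}
    for node in evidence["eks_node_health"]:
        if node["ready"]:
            continue  # healthy nodes are not evidence of cluster-wide health
        name = node["name"]
        # Pressure conditions that are currently true on this node.
        pressures = sorted(k for k, v in node["conditions"].items() if v)
        # Pods scheduled on this node that have lost contact with the kubelet.
        unknown_pods = [
            p["name"]
            for p in evidence["pods"]
            if p["node_name"] == name and p["phase"] == "Unknown"
        ]
        # Whether a NodeNotReady warning event references this node.
        has_event = any(
            e["reason"] == "NodeNotReady" and e.get("node_name") == name
            for e in evidence["events"]
        )
        findings[name] = {
            "pressures": pressures,
            "unknown_pods": unknown_pods,
            "node_not_ready_event": has_event,
        }
    return findings
```

A non-empty result means at least one node is unhealthy, regardless of how many pods elsewhere are still Running.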
005-pending-pod
A pod can't be scheduled. The scheduler has no node that satisfies the pod's resource requests, affinity rules, or tolerations. The pod sits in Pending state indefinitely.
Key signals: pod in Pending phase with no node_name assigned, PodScheduled: False condition, FailedScheduling events saying something like "0/3 nodes are available: insufficient cpu". Node health shows high allocatable usage.
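As a rough illustration, the pending-pod signals could be checked as below. This is a sketch under assumptions: the message format follows the example quoted above, and the keys phase, conditions, name, pod, reason, and message, plus both function names, are hypothetical rather than taken from the real tools.

```python
import re

# Matches messages shaped like "0/3 nodes are available: insufficient cpu".
_AVAILABILITY = re.compile(r"(\d+)/(\d+) nodes are available: (.+)")


def parse_failed_scheduling(message):
    """Extract node counts and the stated reason, or None if the text does not match."""
    match = _AVAILABILITY.search(message)
    if match is None:
        return None
    available, total, reason = match.groups()
    return {
        "available": int(available),
        "total": int(total),
        "reason": reason.strip().rstrip("."),
    }


def is_unschedulable(pod, events):
    """True when the pod shows all three signals: Pending, unscheduled, FailedScheduling."""
    if pod.get("phase") != "Pending" or pod.get("node_name"):
        return False
    # The PodScheduled condition must be explicitly False, not merely absent.
    if pod.get("conditions", {}).get("PodScheduled") is not False:
        return False
    return any(
        e.get("reason") == "FailedScheduling" and e.get("pod") == pod.get("name")
        for e in events
    )
```

Requiring all three signals together is a deliberate choice in this sketch: a pod that is briefly Pending during normal startup should not be reported as a scheduling failure.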
006-deployment-rollout-stuck
A deployment update created a new ReplicaSet, but the new pods can't come up. The old pods are still running fine. The deployment shows unavailable replicas and eventually hits ProgressDeadlineExceeded.
Key signals: deployment with desired > ready, degraded: true, Progressing condition set to False. Mix of old healthy pods and new broken pods. The agent must identify the stuck rollout rather than declaring partial health.
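The stuck-rollout signals can be combined in a similar way. The following is a minimal sketch, assuming a flattened evidence shape: desired, ready, and degraded come from the scenario text, while name, conditions, deployment, the per-pod ready flag, and the function rollout_is_stuck are hypothetical.

```python
def rollout_is_stuck(deployment, pods):
    """True when the deployment is behind, degraded, not progressing,
    and its pods are a mix of ready and not-ready."""
    behind = deployment["desired"] > deployment["ready"]
    degraded = bool(deployment.get("degraded", False))
    # Progressing must be explicitly False (for example after the progress deadline passes).
    not_progressing = deployment.get("conditions", {}).get("Progressing") is False
    # Collect readiness of pods owned by this deployment; a stuck rollout has both values.
    readiness = {
        bool(p["ready"]) for p in pods if p.get("deployment") == deployment["name"]
    }
    mixed = readiness == {True, False}
    return behind and degraded and not_progressing and mixed
```

The mixed-readiness check is what separates a stuck rollout from a total outage: if every pod were broken, the old ReplicaSet would not be serving traffic either.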
Done when
All three scenarios load and validate through the scenario loader
Agent returns the correct root_cause_category for each
Agent does not declare the cluster healthy when only some pods or nodes are healthy
Evidence files match real EKS/Datadog response shapes
make lint && make typecheck && make test-cov pass
Reference
app/tools/EKSNodeHealthTool/__init__.py
app/tools/EKSListDeploymentsTool/__init__.py
app/tools/EKSEventsTool/__init__.py