Goal
We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These three failure modes are what every on-call engineer sees weekly. If the agent can't nail these, nothing else matters.
Background
Requires the test harness from #260. These are the simplest K8s failures to diagnose because the signal is usually right there in the pod status and events. That makes them a good starting point for validating that the full pipeline works.
Also includes the 000-healthy base scenario that all other scenarios inherit from.
Scenarios
000-healthy
The baseline. A healthy EKS cluster with all pods running, zero restarts, no warning events, all deployments at desired replica count, all nodes ready. This is the base scenario that others inherit from via base: 000-healthy in their scenario.yml.
The agent should return root_cause_category: healthy.
Full evidence set: eks_pods, eks_events, eks_deployments, eks_node_health, datadog_logs, datadog_monitors.
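The inheritance described above can be sketched as a simple overlay: start from the base scenario's evidence and let the child replace only the fixtures it overrides. This is a minimal sketch, not the actual scenario loader. The in-memory dictionary layout, the `evidence` key, and the `resolve_scenario` helper are assumptions made for illustration; only the `base` field and the six evidence names come from this issue.

```python
from typing import Any, Dict, List

# Evidence names taken from the 000-healthy description.
EVIDENCE_KEYS = [
    "eks_pods", "eks_events", "eks_deployments",
    "eks_node_health", "datadog_logs", "datadog_monitors",
]


def resolve_scenario(name: str, scenarios: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return the full evidence set for `name`, following `base` links.

    Hypothetical helper: the real loader may resolve inheritance differently.
    """
    chain: List[Dict[str, Any]] = []
    seen: List[str] = []
    current = name
    while current is not None:
        if current in seen:
            raise ValueError(f"circular base reference at {current!r}")
        seen.append(current)
        chain.append(scenarios[current])
        current = scenarios[current].get("base")

    # Apply the root base first so that children override it.
    evidence: Dict[str, Any] = {}
    for scenario in reversed(chain):
        evidence.update(scenario.get("evidence", {}))
    return evidence


# Placeholder fixtures: each value only records which scenario supplied it.
SCENARIOS = {
    "000-healthy": {
        "evidence": {key: {"source": "000-healthy"} for key in EVIDENCE_KEYS},
    },
    "001-crashloop-backoff": {
        "base": "000-healthy",
        "evidence": {
            "eks_pods": {"source": "001-crashloop-backoff"},
            "eks_events": {"source": "001-crashloop-backoff"},
        },
    },
}

resolved = resolve_scenario("001-crashloop-backoff", SCENARIOS)
```

With this layout a failure scenario only has to ship the fixtures that differ from the healthy baseline, and everything else (node health, deployments, monitors) stays at its healthy value.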
001-crashloop-backoff
Container crashes repeatedly. Kubelet backs off restarts exponentially. Inherits from 000-healthy, overrides the pod and event fixtures.
Key signals: container in waiting state with reason: CrashLoopBackOff, restart count in double digits, BackOff warning events, application crash logs in Datadog.
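As a rough illustration of the pod-level signals, the check below looks for a container in `waiting` state with `reason: CrashLoopBackOff` and a double-digit restart count. It assumes the `eks_pods` fixture mirrors the Kubernetes Pod object (`status.containerStatuses[].state` and `restartCount`), which this issue does not confirm. The function name and the threshold of 10 are illustrative choices.

```python
from typing import Any, Dict, List


def crashlooping_containers(pod: Dict[str, Any], min_restarts: int = 10) -> List[str]:
    """Names of containers waiting in CrashLoopBackOff with many restarts.

    Assumes a Kubernetes-style pod dict; min_restarts=10 encodes the
    "restart count in double digits" signal and is an assumption.
    """
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff" and cs.get("restartCount", 0) >= min_restarts:
            names.append(cs["name"])
    return names


# Illustrative pod, not a real fixture from the repository.
crashing_pod = {
    "metadata": {"name": "checkout-7d9f"},
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 14,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {}},
            },
        ]
    },
}

crashloop_hits = crashlooping_containers(crashing_pod)
```

The BackOff warning events and the Datadog crash logs would be checked separately. The pod status alone shows that the container is looping, not why.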
002-oom-killed
Container exceeds its memory limit. The OOM killer terminates it with exit code 137. Inherits from 000-healthy.
Key signals: container in terminated state with reason: OOMKilled, exitCode: 137, memory allocation failure logs. This one matters because it looks like a crash but the fix is different (increase limits vs fix the code).
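To make the distinction from a plain crash concrete, here is a small sketch that flags only containers terminated with `reason: OOMKilled` and `exitCode: 137` (128 plus signal 9, SIGKILL). It assumes a Kubernetes-style pod dict. It also checks `lastState`, because a restarted container usually reports the kill there. Whether the fixtures include `lastState` is an assumption, and the helper name is hypothetical.

```python
from typing import Any, Dict, List

OOM_EXIT_CODE = 128 + 9  # 137: the process was killed with SIGKILL (signal 9)


def oom_killed_containers(pod: Dict[str, Any]) -> List[str]:
    """Names of containers whose current or previous state is OOMKilled/137."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled" and terminated.get("exitCode") == OOM_EXIT_CODE:
                names.append(cs["name"])
                break  # do not count the same container twice
    return names


# Illustrative pod: one OOM-killed container and one ordinary crash.
oom_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "restartCount": 3,
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "api",
                "restartCount": 2,
                "state": {"terminated": {"reason": "Error", "exitCode": 1}},
            },
        ]
    },
}

oom_hits = oom_killed_containers(oom_pod)
```

Only `worker` is flagged. The `api` container also terminated, but with `reason: Error`, which points at the code rather than the memory limit.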
003-image-pull-backoff
The container image can't be pulled. Wrong tag, missing registry credentials, or the image doesn't exist. Inherits from 000-healthy.
Key signals: container in waiting state with reason: ImagePullBackOff, events showing ErrImagePull and Failed to pull image. No application logs because the container never starts.
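The three signals for this scenario can be gathered in one pass. This is a sketch under assumptions: the pod dict follows the Kubernetes Pod shape, events are dicts with a `message` field, and Datadog logs arrive as a plain list. None of these shapes is confirmed by this issue, and `image_pull_signals` is a hypothetical helper.

```python
from typing import Any, Dict, List


def image_pull_signals(
    pod: Dict[str, Any],
    events: List[Dict[str, Any]],
    logs: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Summarize the image-pull signals: pod status, pull-failure events, empty logs."""
    waiting_reasons = [
        (cs.get("state", {}).get("waiting") or {}).get("reason")
        for cs in pod.get("status", {}).get("containerStatuses", [])
    ]
    pull_events = [
        event for event in events
        if "Failed to pull image" in event.get("message", "")
        or "ErrImagePull" in event.get("message", "")
    ]
    return {
        "waiting_image_pull_backoff": "ImagePullBackOff" in waiting_reasons,
        "pull_failure_events": len(pull_events),
        # The container never started, so no application logs are expected.
        "no_application_logs": len(logs) == 0,
    }


# Illustrative inputs; the image name and messages are made up for the example.
pull_pod = {
    "status": {
        "containerStatuses": [
            {"name": "app", "restartCount": 0,
             "state": {"waiting": {"reason": "ImagePullBackOff"}}},
        ]
    },
}
pull_events = [
    {"type": "Warning", "reason": "Failed",
     "message": 'Failed to pull image "example.invalid/app:missing-tag": not found'},
    {"type": "Warning", "reason": "Failed", "message": "Error: ErrImagePull"},
    {"type": "Normal", "reason": "Scheduled", "message": "Successfully assigned pod"},
]

pull_summary = image_pull_signals(pull_pod, pull_events, logs=[])
```

Treating the empty log set as a signal in its own right helps separate this case from 001 and 002, where the container does start and leaves logs behind.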
Done when
All four scenarios load and validate through the scenario loader
Agent returns healthy for 000 and the correct category for 001, 002, 003
Evidence files match real EKS/Datadog response shapes
make lint && make typecheck && make test-cov pass
Reference
tests/synthetic/rds_postgres/000-healthy/
app/tools/EKSListPodsTool/__init__.py
app/tools/EKSEventsTool/__init__.py