K8s scenarios: CrashLoopBackOff, OOMKilled, ImagePullBackOff #261

## Goal

We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These three failure modes are among the ones on-call engineers run into most often. If the agent can't diagnose these reliably, nothing else matters.

## Background

Requires the test harness from #260. These are the simplest K8s failures to diagnose because the signal is usually right there in the pod status and events. That makes them a good starting point for validating that the full pipeline works.

Also includes the `000-healthy` base scenario that all other scenarios inherit from.

## Scenarios

**000-healthy**

The baseline. A healthy EKS cluster with all pods running, zero restarts, no warning events, all deployments at desired replica count, all nodes ready. This is the base scenario that others inherit from via `base: 000-healthy` in their scenario.yml.

The agent should return `root_cause_category: healthy`.

Full evidence set: eks_pods, eks_events, eks_deployments, eks_node_health, datadog_logs, datadog_monitors.
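
The inheritance described above could work roughly like this. This is a minimal sketch, not the harness from #260: `resolve_evidence` is a hypothetical helper, the fixture contents are placeholders, and replacing whole evidence sources (rather than deep-merging them) is an assumption.

```python
from copy import deepcopy

# Evidence sources listed for 000-healthy in this issue.
EVIDENCE_SOURCES = (
    "eks_pods",
    "eks_events",
    "eks_deployments",
    "eks_node_health",
    "datadog_logs",
    "datadog_monitors",
)


def resolve_evidence(base, overrides):
    """Hypothetical helper: start from the base fixtures and replace
    whole evidence sources with the child scenario's overrides."""
    unknown = set(overrides) - set(EVIDENCE_SOURCES)
    if unknown:
        raise ValueError(f"unknown evidence sources: {sorted(unknown)}")
    resolved = deepcopy(base)            # never mutate the shared baseline
    resolved.update(deepcopy(overrides))
    return resolved


# Placeholder fixtures; real ones would mirror EKS/Datadog responses.
healthy = {name: {"items": []} for name in EVIDENCE_SOURCES}
crashloop = resolve_evidence(healthy, {"eks_pods": {"items": [{"name": "api-0"}]}})
```

With this approach a child scenario only ships the fixtures it changes, and everything else stays identical to the healthy baseline.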

**001-crashloop-backoff**

Container crashes repeatedly. Kubelet backs off restarts exponentially. Inherits from 000-healthy, overrides the pod and event fixtures.

Key signals: container in `waiting` state with `reason: CrashLoopBackOff`, restart count in double digits, BackOff warning events, application crash logs in Datadog.
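
A fixture check for these pod-level signals might look like the sketch below. It assumes the pod fixture follows the Kubernetes `status.containerStatuses` layout; the actual shape returned by `EKSListPodsTool` may differ. `crashloop_containers` and the `min_restarts` threshold of 10 (for "double digits") are hypothetical.

```python
def crashloop_containers(pod, min_restarts=10):
    """Return names of containers waiting in CrashLoopBackOff
    with at least `min_restarts` restarts."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if (
            waiting.get("reason") == "CrashLoopBackOff"
            and cs.get("restartCount", 0) >= min_restarts
        ):
            hits.append(cs["name"])
    return hits


# Illustrative pod, not taken from a real cluster.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 14,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            },
            {"name": "sidecar", "restartCount": 0, "state": {"running": {}}},
        ]
    }
}
flagged = crashloop_containers(sample_pod)
```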

**002-oom-killed**

Container exceeds its memory limit. The kernel OOM killer terminates it with SIGKILL, so the container exits with code 137 (128 + 9). Inherits from 000-healthy.

Key signals: container in `terminated` state with `reason: OOMKilled`, `exitCode: 137`, memory allocation failure logs. This one matters because it looks like an ordinary crash, but the fix is different: increase the memory limit rather than fix the code.
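
The OOM signals can be checked the same way. The sketch below assumes Kubernetes-style container statuses; `oom_killed_containers` is a hypothetical name. It also looks at `lastState`, since a container that has already been restarted normally reports its previous termination there. Whether the fixtures include `lastState` is an assumption.

```python
SIGKILL = 9
OOM_EXIT_CODE = 128 + SIGKILL  # 137: killed by SIGKILL


def oom_killed_containers(pod):
    """Return names of containers whose current or previous
    termination was an OOM kill."""
    hits = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = (cs.get(key) or {}).get("terminated") or {}
            if (
                terminated.get("reason") == "OOMKilled"
                and terminated.get("exitCode") == OOM_EXIT_CODE
            ):
                hits.append(cs["name"])
                break  # one hit per container is enough
    return hits


# Illustrative pod: "worker" was OOM-killed, "api" exited with an app error.
oom_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "state": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "api",
                "state": {"terminated": {"reason": "Error", "exitCode": 1}},
            },
        ]
    }
}
oom_hits = oom_killed_containers(oom_pod)
```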

**003-image-pull-backoff**

The container image can't be pulled. Wrong tag, missing registry credentials, or the image doesn't exist. Inherits from 000-healthy.

Key signals: container in `waiting` state with `reason: ImagePullBackOff`, events showing `ErrImagePull` and `Failed to pull image`. No application logs because the container never starts.
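
For this scenario the events carry most of the signal. A rough sketch of scanning them follows, assuming each event is a dict with Kubernetes-style `type`, `reason`, `message`, and `involvedObject` fields; the real `EKSEventsTool` shape may differ, and `image_pull_failures` is a hypothetical helper. It matches markers in both `reason` and `message` because the pull error text often appears only in the message.

```python
PULL_MARKERS = ("ErrImagePull", "ImagePullBackOff", "Failed to pull image")


def image_pull_failures(events):
    """Return sorted names of objects with image pull warning events."""
    names = set()
    for ev in events:
        if ev.get("type") != "Warning":
            continue
        text = f"{ev.get('reason', '')} {ev.get('message', '')}"
        if any(marker in text for marker in PULL_MARKERS):
            names.add(ev.get("involvedObject", {}).get("name", "<unknown>"))
    return sorted(names)


# Illustrative events, not captured from a real cluster.
sample_events = [
    {
        "type": "Warning",
        "reason": "Failed",
        "message": 'Failed to pull image "registry.example.com/api:v9"',
        "involvedObject": {"name": "api-0"},
    },
    {
        "type": "Warning",
        "reason": "Failed",
        "message": "Error: ErrImagePull",
        "involvedObject": {"name": "api-0"},
    },
    {
        "type": "Normal",
        "reason": "Scheduled",
        "message": "Successfully assigned default/web-0",
        "involvedObject": {"name": "web-0"},
    },
]
pull_failures = image_pull_failures(sample_events)
```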

## Done when

- All four scenarios load and validate through the scenario loader
- Agent returns `healthy` for 000 and the correct category for 001, 002, 003
- Evidence files match real EKS/Datadog response shapes
- `make lint && make typecheck && make test-cov` pass

## Reference

- Test harness: #260
- RDS healthy scenario for reference: `tests/synthetic/rds_postgres/000-healthy/`
- EKS pod response shape: `app/tools/EKSListPodsTool/__init__.py`
- EKS events response shape: `app/tools/EKSEventsTool/__init__.py`
