K8s scenarios: CrashLoopBackOff, OOMKilled, ImagePullBackOff #261

@davincios

Goal

We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These three failure modes are what every on-call engineer sees weekly. If the agent can't nail these, nothing else matters.

Background

Requires the test harness from #260. These are the simplest K8s failures to diagnose because the signal is usually right there in the pod status and events, which makes them a good starting point for validating that the full pipeline works.

Also includes the 000-healthy base scenario that all other scenarios inherit from.

Scenarios

000-healthy

The baseline. A healthy EKS cluster with all pods running, zero restarts, no warning events, all deployments at desired replica count, all nodes ready. This is the base scenario that others inherit from via base: 000-healthy in their scenario.yml.
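The inheritance described above can be sketched as a simple overlay. This is only an illustration: it assumes a scenario's fixtures replace the base fixtures wholesale, one evidence key at a time, which may not be how the loader from #260 merges them. The function name resolve_scenario and all fixture fields are hypothetical.

```python
from copy import deepcopy
from typing import Any


def resolve_scenario(base: dict[str, Any], overrides: dict[str, Any]) -> dict[str, Any]:
    """Return the base evidence with each overridden fixture replaced wholesale.

    Assumption: overrides are keyed by evidence name (eks_pods, eks_events, ...).
    """
    resolved = deepcopy(base)
    for evidence_key, fixture in overrides.items():
        resolved[evidence_key] = deepcopy(fixture)
    return resolved


# Hypothetical, heavily simplified fixtures for illustration only.
healthy = {
    "eks_pods": [{"name": "api-0", "phase": "Running", "restarts": 0}],
    "eks_events": [],
    "eks_deployments": [{"name": "api", "desired": 1, "ready": 1}],
}

crashloop = resolve_scenario(
    healthy,
    {
        "eks_pods": [{"name": "api-0", "phase": "Running", "restarts": 14}],
        "eks_events": [{"type": "Warning", "reason": "BackOff"}],
    },
)
```

Anything a scenario does not override (here eks_deployments) comes through unchanged from the base, and the base itself is never mutated.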

The agent should return root_cause_category: healthy.

Full evidence set: eks_pods, eks_events, eks_deployments, eks_node_health, datadog_logs, datadog_monitors.
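As a rough sketch of what "healthy" means across that evidence, the check below encodes the baseline conditions listed above. The field names (phase, restart_count, ready_replicas, desired_replicas, ready) are assumptions made for the example and are not taken from the real tool response shapes.

```python
from typing import Any


def is_healthy(evidence: dict[str, Any]) -> bool:
    """True when every baseline condition for 000-healthy holds."""
    pods_ok = all(
        pod["phase"] == "Running" and pod["restart_count"] == 0
        for pod in evidence["eks_pods"]
    )
    no_warnings = all(event["type"] != "Warning" for event in evidence["eks_events"])
    deployments_ok = all(
        dep["ready_replicas"] == dep["desired_replicas"]
        for dep in evidence["eks_deployments"]
    )
    nodes_ok = all(node["ready"] for node in evidence["eks_node_health"])
    return pods_ok and no_warnings and deployments_ok and nodes_ok


# Hypothetical minimal evidence; real fixtures carry far more detail.
baseline = {
    "eks_pods": [{"name": "api-0", "phase": "Running", "restart_count": 0}],
    "eks_events": [{"type": "Normal", "reason": "Started"}],
    "eks_deployments": [{"name": "api", "ready_replicas": 2, "desired_replicas": 2}],
    "eks_node_health": [{"name": "node-a", "ready": True}],
}
degraded = {**baseline, "eks_events": [{"type": "Warning", "reason": "BackOff"}]}
```

A single warning event is enough to make is_healthy(degraded) return False, which is the behaviour the failure scenarios rely on when they override the event fixture.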

001-crashloop-backoff

The container crashes repeatedly, and the kubelet backs off restarts exponentially (the delay is capped at five minutes). Inherits from 000-healthy and overrides the pod and event fixtures.

Key signals: container in waiting state with reason: CrashLoopBackOff, restart count in double digits, BackOff warning events, application crash logs in Datadog.
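A minimal sketch of reading the pod-level signal, assuming the fixture follows the Kubernetes pod status layout (status.containerStatuses[].state.waiting.reason and restartCount). The actual shape returned by EKSListPodsTool may differ, and the function name and pod data are hypothetical.

```python
from typing import Any


def crashlooping_containers(pod: dict[str, Any]) -> dict[str, int]:
    """Map container name to restart count for containers waiting in CrashLoopBackOff."""
    found: dict[str, int] = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        waiting = status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            found[status["name"]] = status.get("restartCount", 0)
    return found


# Hypothetical pod with one crashing container and one healthy sidecar.
pod = {
    "metadata": {"name": "checkout-7d9f"},
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 14,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            },
            {"name": "sidecar", "restartCount": 0, "state": {"running": {}}},
        ]
    },
}

print(crashlooping_containers(pod))  # {'app': 14}
```

The restart count is returned alongside the name so a caller can decide how many restarts count as significant instead of hard-coding a threshold here.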

002-oom-killed

Container exceeds its memory limit. The OOM killer terminates it with exit code 137. Inherits from 000-healthy.

Key signals: container in terminated state with reason: OOMKilled, exitCode: 137, memory allocation failure logs. This one matters because it looks like a crash but the fix is different: raise the memory limit or reduce memory usage, rather than fix a crashing code path.
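The OOMKilled signal can be sketched the same way, again assuming the Kubernetes pod status layout rather than the real tool response. Exit code 137 is 128 plus signal 9 (SIGKILL). The sketch checks both state and lastState because a container that has already been restarted typically reports its previous termination under lastState. The function name and pod data are hypothetical.

```python
from typing import Any

SIGKILL = 9
OOM_EXIT_CODE = 128 + SIGKILL  # 137


def oom_killed_containers(pod: dict[str, Any]) -> list[str]:
    """Names of containers whose current or previous termination was an OOM kill."""
    names: list[str] = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        for slot in ("state", "lastState"):
            terminated = (status.get(slot) or {}).get("terminated") or {}
            if (
                terminated.get("reason") == "OOMKilled"
                and terminated.get("exitCode") == OOM_EXIT_CODE
            ):
                names.append(status["name"])
                break  # avoid listing the same container twice
    return names


# Hypothetical pod: the worker was OOM-killed and is running again.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "worker",
                "restartCount": 3,
                "state": {"running": {}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "crasher",
                "restartCount": 5,
                "state": {"terminated": {"reason": "Error", "exitCode": 1}},
            },
        ]
    }
}

print(oom_killed_containers(pod))  # ['worker']
```

The second container shows why the distinction matters: it also terminated, but with reason Error, so it belongs to the crash case and not to the memory-limit case.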

003-image-pull-backoff

The container image can't be pulled. Wrong tag, missing registry credentials, or the image doesn't exist. Inherits from 000-healthy.

Key signals: container in waiting state with reason: ImagePullBackOff, events showing ErrImagePull and Failed to pull image. No application logs because the container never starts.
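A sketch of combining the pod and event signals for this scenario. It assumes pods follow the Kubernetes pod status layout and that events carry type and message fields. Both are assumptions about the fixtures, not the confirmed EKSListPodsTool or EKSEventsTool shapes, and the image name is made up.

```python
from typing import Any

PULL_WAITING_REASONS = {"ImagePullBackOff", "ErrImagePull"}


def image_pull_failed(pod: dict[str, Any], events: list[dict[str, Any]]) -> bool:
    """True when a container is stuck pulling its image and an event confirms it."""
    waiting_on_pull = any(
        ((status.get("state") or {}).get("waiting") or {}).get("reason")
        in PULL_WAITING_REASONS
        for status in pod.get("status", {}).get("containerStatuses", [])
    )
    pull_event_seen = any(
        "Failed to pull image" in event.get("message", "")
        or "ErrImagePull" in event.get("message", "")
        for event in events
    )
    return waiting_on_pull and pull_event_seen


# Hypothetical pod and events; the image reference is invented for the example.
pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 0,
                "state": {"waiting": {"reason": "ImagePullBackOff"}},
            }
        ]
    }
}
events = [
    {"type": "Warning", "message": 'Failed to pull image "example.invalid/app:v9"'},
    {"type": "Warning", "message": "Error: ErrImagePull"},
]

print(image_pull_failed(pod, events))  # True
```

Requiring both signals is a design choice for the sketch. Because the container never starts, there are no application logs to fall back on, so the pod status and events are the only evidence available.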

Done when

  • All four scenarios load and validate through the scenario loader
  • Agent returns healthy for 000 and the correct category for 001, 002, 003
  • Evidence files match real EKS/Datadog response shapes
  • make lint && make typecheck && make test-cov pass

Reference

  • Test harness: Build the K8s synthetic test harness #260
  • RDS healthy scenario for reference: tests/synthetic/rds_postgres/000-healthy/
  • EKS pod response shape: app/tools/EKSListPodsTool/__init__.py
  • EKS events response shape: app/tools/EKSEventsTool/__init__.py

Labels: enhancement, help wanted
