K8s scenarios: Eviction, DNS failures, Probe failures, Quota limits

## Goal

We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These four scenarios are the hard ones. Each failure looks like something simpler on the surface, and the agent has to reason its way to the actual root cause. This is where we measure whether the agent can reason, not just pattern-match.

## Background

Requires the test harness from #260 and the 000-healthy base scenario (which can come from #261, or be created standalone here). These are independent of the other scenario issues and can be delivered in any order.

All four scenarios include Axis 2 annotations (`ruling_out_keywords`, `required_queries`) so we can score not just whether the agent got the right answer, but whether it investigated the right signals and dismissed the right alternatives.
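To make the Axis 2 idea concrete, here is a minimal sketch of how such annotations could be scored. Only the field names `ruling_out_keywords` and `required_queries` come from this issue; `Axis2Annotations`, `score_axis2`, and the plain substring matching are assumptions for illustration, and the real suite in `test_suite_axis2.py` may work differently.

```python
from dataclasses import dataclass, field


@dataclass
class Axis2Annotations:
    # Hypothetical container; the real scenario schema may differ.
    ruling_out_keywords: list[str] = field(default_factory=list)
    required_queries: list[str] = field(default_factory=list)


def score_axis2(annotations: Axis2Annotations,
                agent_output: str,
                executed_queries: list[str]) -> dict:
    """Report which annotated keywords and queries the agent missed.

    Empty lists in the result mean the agent satisfied every annotation.
    Keyword matching is a case-insensitive substring check (an assumption).
    """
    text = agent_output.lower()
    missing_keywords = [
        k for k in annotations.ruling_out_keywords if k.lower() not in text
    ]
    executed = set(executed_queries)
    missing_queries = [
        q for q in annotations.required_queries if q not in executed
    ]
    return {
        "missing_keywords": missing_keywords,
        "missing_queries": missing_queries,
    }
```

Returning the misses rather than a single boolean keeps failures readable: a test can assert both lists are empty and still show exactly what the agent skipped.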

## Scenarios

**007-evicted-pods**

Pods get evicted because the node is running out of ephemeral storage. This looks like a pod crash, but the cause is node-level resource pressure, not application code, so the fix targets the node rather than the application.

Key signals: pods with `phase: Failed`, `reason: Evicted`, messages about low ephemeral storage. Node health shows disk pressure. The agent must say eviction, not crash.

Axis 2: agent output must mention that this is a node-level issue and not a container crash.
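The distinction the agent has to draw can be sketched against Kubernetes-style pod and node objects. The field names (`status.phase`, `status.reason`, `containerStatuses`, node `conditions`) follow the Kubernetes API; the helper names and the returned labels are hypothetical and are not the project's actual `root_cause_category` values.

```python
def classify_failed_pod(pod: dict) -> str:
    """Separate a kubelet eviction from a container crash loop."""
    status = pod.get("status", {})
    # Evicted pods are marked Failed at the pod level with reason Evicted.
    if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
        return "evicted"
    # A crash loop shows up per container, in the waiting state.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return "container_crash"
    return "unknown"


def node_has_disk_pressure(node: dict) -> bool:
    """True when the node reports the DiskPressure condition as active."""
    conditions = node.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "DiskPressure" and c.get("status") == "True"
        for c in conditions
    )
```

An eviction verdict is most convincing when both checks agree: the pod says `Evicted` and the node it ran on reports disk pressure.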

**008-dns-resolution-failure**

CoreDNS is broken. Application pods can't resolve service names, so they get connection timeouts and 503s. The symptom looks like a network issue or a downstream service being down, but the actual cause is DNS.

Key signals: CoreDNS pods unhealthy or restarting, application logs showing "no such host" and "i/o timeout" on DNS lookups. The agent must identify DNS as the root cause, not blame the network or the downstream service.

Axis 2: agent output must specifically call out DNS resolution, not generic networking.
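As an illustration of why "i/o timeout" alone is not enough to conclude DNS, the sketch below only flags a log line when the error is attached to a name lookup. The regex and the `is_dns_failure` helper are assumptions for illustration; the sample lines in the usage note mimic Go-style resolver errors and are not taken from the scenario fixtures.

```python
import re

# A lookup error names the host being resolved before the failure reason.
# A plain connection timeout has no "lookup <name>" part.
DNS_ERROR = re.compile(r"lookup \S+.*(no such host|i/o timeout)")


def is_dns_failure(log_line: str) -> bool:
    """True when the line reports a failed name lookup, not a generic timeout."""
    return bool(DNS_ERROR.search(log_line))
```

For example, `dial tcp: lookup payments.default.svc.cluster.local on 10.96.0.10:53: no such host` matches, while `dial tcp 10.0.0.5:8080: i/o timeout` does not, because the second line never attempted a lookup.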

**009-probe-failure**

Liveness or readiness probes are failing. The pod shows as Running (and its containers are in fact running), but it's not healthy. Liveness probe failures cause container restarts; readiness probe failures remove the pod from service endpoints. This is confusing because the pod status says Running.

Key signals: pod Running with high restart count, Unhealthy warning events with probe failure messages (HTTP 503, TCP connection refused), application logs showing the health endpoint is erroring. The agent must not declare healthy just because the pod is Running.

Axis 2: agent output must distinguish probe failure from CrashLoopBackOff.
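A minimal sketch of the check the agent should be making: the pod phase is only trusted after the events have been inspected. The event shape (`type`, `reason`, `message`) follows Kubernetes events, and kubelet probe messages begin with "Liveness probe failed" or "Readiness probe failed". The function name and return labels are hypothetical.

```python
def diagnose_probe_state(pod: dict, events: list[dict]) -> str:
    """Classify a pod using probe events first and pod phase second."""
    unhealthy_messages = [
        e.get("message", "")
        for e in events
        if e.get("type") == "Warning" and e.get("reason") == "Unhealthy"
    ]
    # Liveness failures restart the container, so report them first.
    if any(m.startswith("Liveness probe failed") for m in unhealthy_messages):
        return "liveness_probe_failure"
    # Readiness failures only remove the pod from service endpoints.
    if any(m.startswith("Readiness probe failed") for m in unhealthy_messages):
        return "readiness_probe_failure"
    # Only with no probe warnings does a Running phase count as healthy.
    phase = pod.get("status", {}).get("phase")
    return "healthy" if phase == "Running" else "unknown"
```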

**010-resource-quota-exceeded**

The namespace hit its ResourceQuota. Existing pods are healthy, but new pods can't be created. A deployment wants 5 replicas but only has 2 because the quota blocks creation of the remaining 3.

Key signals: events with `reason: FailedCreate` and messages about exceeding quota. Deployment shows desired > ready. Existing pods are healthy. This is the hardest one because the running pods are genuinely fine. The problem is invisible unless you check the events and quota.

Axis 2: agent output must identify the quota as the blocker, not resource limits on the pod spec.
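The combination of signals can be sketched as follows. `spec.replicas`, `status.readyReplicas`, and the `FailedCreate` event reason follow the Kubernetes API; matching on the substring "exceeded quota" is an assumption about the event message wording, and `quota_blocks_rollout` is a hypothetical helper.

```python
def quota_blocks_rollout(deployment: dict, events: list[dict]) -> bool:
    """True when a deployment is short of replicas and quota events explain why."""
    desired = deployment.get("spec", {}).get("replicas", 0)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    # Pod creation is rejected at admission, so the evidence is in events,
    # not in the status of the pods that are already running.
    quota_events = [
        e for e in events
        if e.get("reason") == "FailedCreate"
        and "exceeded quota" in e.get("message", "")
    ]
    return desired > ready and bool(quota_events)
```

Requiring both conditions avoids blaming the quota for an ordinary slow rollout, where desired exceeds ready but no creation was rejected.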

## Done when

- All four scenarios load and validate through the scenario loader
- Agent returns the correct root_cause_category for each
- Axis 2 test suite scores reasoning quality (`ruling_out_keywords` matched, `required_queries` audited)
- Agent distinguishes these subtle failures from their simpler lookalikes
- Difficulty 3 scenarios pass or are tracked as xfail gaps
- `make lint && make typecheck && make test-cov` pass

## Reference

- Test harness: #260
- Axis 2 pattern: `tests/synthetic/rds_postgres/test_suite_axis2.py`
- SelectiveBackend pattern: `tests/synthetic/mock_grafana_backend/selective_backend.py`

K8s scenarios: Eviction, DNS failures, Probe failures, Quota limits #263
