K8s scenarios: Eviction, DNS failures, Probe failures, Quota limits #263

@davincios

Goal

We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These four scenarios are the hard ones. Each failure looks like something simpler on the surface, and the agent has to reason its way to the actual root cause. This is where we measure whether the agent can think, not just pattern-match.

Background

Requires the test harness from #260 and the 000-healthy base scenario (which can come from #261, or be created standalone here). These are independent of the other scenario issues and can be delivered in any order.

All four scenarios include Axis 2 annotations (ruling_out_keywords, required_queries) so we can score not just whether the agent got the right answer, but whether it investigated the right signals and dismissed the right alternatives.
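As a rough illustration of how the Axis 2 annotations could be scored, here is a minimal sketch. Only the names ruling_out_keywords and required_queries come from this issue. The Axis2Annotations class, the score_axis2 function, the substring matching rule, and the idea that the harness records executed query names as strings are all assumptions, not the harness API from #260.

```python
from dataclasses import dataclass, field


@dataclass
class Axis2Annotations:
    # Field names come from the issue text; their exact semantics are assumed.
    ruling_out_keywords: list = field(default_factory=list)
    required_queries: list = field(default_factory=list)


def score_axis2(annotations, agent_output, executed_queries):
    """Report which annotated keywords and queries the agent covered."""
    text = agent_output.lower()
    # Assumption: a keyword counts as matched on a case-insensitive substring hit.
    matched = [k for k in annotations.ruling_out_keywords if k.lower() in text]
    ran = set(executed_queries)
    missing = [q for q in annotations.required_queries if q not in ran]
    return {
        "keywords_matched": matched,
        "queries_missing": missing,
        "passed": len(matched) == len(annotations.ruling_out_keywords) and not missing,
    }
```

Substring matching is the simplest possible rule and is easy to game; a real scorer would probably need something stricter.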

Scenarios

007-evicted-pods

Pods get evicted because the node is running out of ephemeral storage. This looks like a pod crash, but the cause is node-level resource pressure, not application code. The fix is completely different.

Key signals: pods with phase: Failed, reason: Evicted, and messages about low ephemeral storage. The node reports disk pressure (the DiskPressure condition). The agent must say eviction, not crash.

Axis 2: agent output must mention that this is a node-level issue and not a container crash.
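The signals above can be sketched as a small check over pod and node JSON. The fields status.phase, status.reason, and the DiskPressure node condition are standard Kubernetes fields. The function name classify_failed_pod and the returned category strings are hypothetical and are not the project's root_cause_category values.

```python
def classify_failed_pod(pod, node_conditions):
    """Separate a kubelet eviction from an ordinary pod failure."""
    status = pod.get("status", {})
    evicted = status.get("phase") == "Failed" and status.get("reason") == "Evicted"
    if not evicted:
        return "other"
    # Node conditions carry string statuses ("True"/"False"), not booleans.
    disk_pressure = any(
        c.get("type") == "DiskPressure" and c.get("status") == "True"
        for c in node_conditions
    )
    return "node_eviction_disk_pressure" if disk_pressure else "node_eviction"


# Illustrative data only; the message text is an example, not a captured log.
pod = {
    "status": {
        "phase": "Failed",
        "reason": "Evicted",
        "message": "The node was low on resource: ephemeral-storage.",
    }
}
conditions = [{"type": "DiskPressure", "status": "True"}]
print(classify_failed_pod(pod, conditions))
```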

008-dns-resolution-failure

CoreDNS is broken. Application pods can't resolve service names, so they get connection timeouts and 503s. The symptom looks like a network issue or a downstream service being down, but the actual cause is DNS.

Key signals: CoreDNS pods unhealthy or restarting, application logs showing "no such host" and "i/o timeout" on DNS lookups. The agent must identify DNS as the root cause, not blame the network or the downstream service.

Axis 2: agent output must specifically call out DNS resolution, not generic networking.
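One way to separate DNS failures from generic network timeouts in application logs is to require that the error occurred during a lookup. The sketch below assumes Go-style resolver error strings ("lookup <host> ... no such host"); the pattern, the function name, and the sample log lines are illustrative assumptions, and other runtimes format these errors differently.

```python
import re

# Matches a resolver error: "lookup <host> ..." followed by a DNS failure reason.
# A bare "i/o timeout" without "lookup" is treated as a plain connection timeout.
DNS_FAILURE = re.compile(r"lookup \S+.*(no such host|i/o timeout)")


def count_dns_failures(log_lines):
    """Count log lines that look like failed DNS lookups."""
    return sum(1 for line in log_lines if DNS_FAILURE.search(line))


# Illustrative log lines, not captured output.
sample_logs = [
    "dial tcp: lookup payments.default.svc.cluster.local on 10.96.0.10:53: no such host",
    "dial tcp: lookup payments on 10.96.0.10:53: read udp 10.244.1.7:41234->10.96.0.10:53: i/o timeout",
    "dial tcp 10.0.0.5:8080: i/o timeout",
]
print(count_dns_failures(sample_logs))
```

The third sample line is a connection timeout to an already-resolved address, so it is deliberately not counted as a DNS failure.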

009-probe-failure

Liveness or readiness probes are failing. The pod shows as Running (and the container is in fact running), but it is not healthy. Liveness probe failures cause container restarts; readiness probe failures remove the pod from Service endpoints. This is confusing because the pod status still says Running.

Key signals: pod Running with a high restart count (liveness) or not Ready (readiness), Unhealthy warning events with probe failure messages (HTTP 503, TCP connection refused), and application logs showing the health endpoint is erroring. The agent must not declare the pod healthy just because it is Running.

Axis 2: agent output must distinguish probe failure from CrashLoopBackOff.
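The distinction can be sketched by checking Unhealthy events before looking at container state. Event reason Unhealthy, messages beginning with "Liveness probe failed" or "Readiness probe failed", and the CrashLoopBackOff waiting reason are standard Kubernetes signals. The function name diagnose_running_pod and the returned labels are hypothetical.

```python
def diagnose_running_pod(pod, events):
    """Tell probe failures apart from CrashLoopBackOff and from a healthy pod."""
    probe_events = [
        e for e in events
        if e.get("reason") == "Unhealthy" and "probe failed" in e.get("message", "")
    ]
    # "Liveness probe failed: ..." -> "liveness"
    kinds = sorted({e["message"].split(" probe failed")[0].lower() for e in probe_events})
    if kinds:
        return "probe_failure:" + "+".join(kinds)

    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return "crash_loop"

    # Running alone is not enough: every container must also report ready.
    running = pod.get("status", {}).get("phase") == "Running"
    if running and statuses and all(cs.get("ready") for cs in statuses):
        return "healthy"
    return "unknown"
```

Checking events first matters because a container that is repeatedly killed by a failing liveness probe can itself end up in restart back-off, so container state alone can be misleading.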

010-resource-quota-exceeded

The namespace hit its ResourceQuota. Existing pods are healthy, but new pods can't be created. A deployment wants 5 replicas but only has 2 because the quota blocks creation of the remaining 3.

Key signals: events with reason: FailedCreate and messages about exceeding quota. The Deployment shows more desired replicas than ready replicas. Existing pods are healthy. This is the hardest one because the running pods are genuinely fine. The problem is invisible unless you check the events and the quota.

Axis 2: agent output must identify the quota as the blocker, not resource limits on the pod spec.
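A minimal sketch of the check described above: compare desired and ready replicas, then look for a FailedCreate event whose message names a quota. The regular expression assumes the usual "exceeded quota: <name>, requested: ..., used: ..., limited: ..." message layout. The function name find_quota_block, the returned dictionary shape, and the sample objects are hypothetical.

```python
import re

QUOTA_MSG = re.compile(
    r"exceeded quota: (?P<quota>[^,]+), requested: (?P<requested>.+?), "
    r"used: (?P<used>.+?), limited: (?P<limited>.+)$"
)


def find_quota_block(deployment, events):
    """Return quota details if a ResourceQuota is blocking replica creation."""
    desired = deployment.get("spec", {}).get("replicas", 0)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    if ready >= desired:
        return None  # nothing is missing, so no quota block to explain
    for event in events:
        if event.get("reason") != "FailedCreate":
            continue
        match = QUOTA_MSG.search(event.get("message", ""))
        if match:
            return {
                "quota": match["quota"],
                "limited": match["limited"],
                "missing_replicas": desired - ready,
            }
    return None


# Illustrative objects mirroring the 5-desired / 2-ready case in this scenario.
deployment = {"spec": {"replicas": 5}, "status": {"readyReplicas": 2}}
events = [{
    "reason": "FailedCreate",
    "message": 'Error creating: pods "web-5d9c7b6f4-" is forbidden: exceeded quota: '
               "compute-quota, requested: requests.cpu=500m, used: requests.cpu=1, "
               "limited: requests.cpu=1",
}]
print(find_quota_block(deployment, events))
```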

Done when

  • All four scenarios load and validate through the scenario loader
  • Agent returns the correct root_cause_category for each
  • Axis 2 test suite scores reasoning quality (ruling_out_keywords matched, required_queries audited)
  • Agent distinguishes these subtle failures from their simpler lookalikes
  • Difficulty 3 scenarios pass or are tracked as xfail gaps
  • make lint && make typecheck && make test-cov all pass

Reference
