Goal
We want OpenSRE to be the most reliable agent for diagnosing Kubernetes failures. These four scenarios are the hard ones. Each failure looks like something simpler on the surface, and the agent has to reason its way to the actual root cause. This is where we measure whether the agent can think, not just pattern match.
Background
Requires the test harness from #260 and the 000-healthy base scenario (which can come from #261, or be created standalone here). These scenarios are independent of the other scenario issues and can be delivered in any order.
All four scenarios include Axis 2 annotations (ruling_out_keywords, required_queries) so we can score not just whether the agent got the right answer, but whether it investigated the right signals and dismissed the right alternatives.
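As a rough illustration of how the Axis 2 annotations could be scored, here is a minimal sketch. Only the names ruling_out_keywords and required_queries come from this issue. The Axis2Annotations class, the score_axis2 function, and the result keys are hypothetical and are not the harness API from #260.

```python
from dataclasses import dataclass, field


@dataclass
class Axis2Annotations:
    # Field names mirror the annotation names in this issue; the class is hypothetical.
    ruling_out_keywords: list[str] = field(default_factory=list)
    required_queries: list[str] = field(default_factory=list)


def score_axis2(annotations: Axis2Annotations, agent_output: str, queries_made: list[str]) -> dict:
    """Check which ruling-out keywords appear in the output and which required queries were skipped."""
    text = agent_output.lower()
    matched = [k for k in annotations.ruling_out_keywords if k.lower() in text]
    made = set(queries_made)
    missing = [q for q in annotations.required_queries if q not in made]
    return {
        "keywords_matched": matched,
        "queries_missing": missing,
        # Pass only if every alternative was dismissed and every required signal was queried.
        "passed": len(matched) == len(annotations.ruling_out_keywords) and not missing,
    }
```

Plain substring matching is the simplest possible check; a real suite may want something stricter so that an agent cannot pass by merely echoing keywords.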
Scenarios
007-evicted-pods
Pods get evicted because the node is running out of ephemeral storage. This looks like a pod crash, but the cause is node-level resource pressure, not application code. The fix is completely different.
Key signals: pods with phase: Failed, reason: Evicted, messages about low ephemeral storage. Node health shows disk pressure. The agent must diagnose an eviction, not a crash.
Axis 2: agent output must mention that this is a node-level issue and not a container crash.
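The key signals for this scenario can be sketched as a check over Kubernetes-style pod and node objects. This assumes the fixtures expose the standard status.phase, status.reason, and status.conditions fields as plain dicts. The function names and returned labels are hypothetical and are not the project's root_cause_category values.

```python
def is_evicted(pod: dict) -> bool:
    """True when the pod status carries the eviction markers named in the key signals."""
    status = pod.get("status", {})
    return status.get("phase") == "Failed" and status.get("reason") == "Evicted"


def node_has_disk_pressure(node: dict) -> bool:
    """True when the node reports a DiskPressure condition with status "True"."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == "DiskPressure" and cond.get("status") == "True":
            return True
    return False


def diagnose(pod: dict, node: dict) -> str:
    # Labels are illustrative only.
    if not is_evicted(pod):
        return "not_evicted"
    if node_has_disk_pressure(node):
        return "node_pressure_eviction"
    return "eviction_cause_unconfirmed"
```

Keeping the pod check and the node check separate mirrors the reasoning the agent is expected to show: the pod status says what happened, and the node condition says why.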
008-dns-resolution-failure
CoreDNS is broken. Application pods can't resolve service names, so they get connection timeouts and 503s. The symptom looks like a network issue or a downstream service being down, but the actual cause is DNS.
Key signals: CoreDNS pods unhealthy or restarting, application logs showing "no such host" and "i/o timeout" on DNS lookups. The agent must identify DNS as the root cause, not blame the network or the downstream service.
Axis 2: agent output must specifically call out DNS resolution, not generic networking.
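One way to separate DNS failures from generic network errors in application logs is sketched below. The two marker strings come from the key signals above. The rule that an "i/o timeout" only counts when the line also mentions a lookup is an assumption about the log format, and is_dns_failure is a hypothetical helper.

```python
def is_dns_failure(line: str) -> bool:
    """Classify one log line as a DNS resolution failure."""
    lowered = line.lower()
    if "no such host" in lowered:
        return True
    # A bare "i/o timeout" can be any slow connection, so require a lookup context.
    return "lookup" in lowered and "i/o timeout" in lowered


def count_dns_failures(lines: list[str]) -> int:
    """Count the DNS failure lines in a batch of application logs."""
    return sum(1 for line in lines if is_dns_failure(line))
```

The lookup requirement matters for this scenario because the lookalike failure (a slow or unreachable downstream service) also produces timeouts.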
009-probe-failure
Liveness or readiness probes are failing. The pod shows as Running (it is running), but it's not healthy. Liveness probe failures cause restarts; readiness probe failures remove the pod from service endpoints. This is confusing because the pod status says Running.
Key signals: pod Running with high restart count, Unhealthy warning events with probe failure messages (HTTP 503, TCP connection refused), application logs showing the health endpoint is erroring. The agent must not declare the pod healthy just because it is Running.
Axis 2: agent output must distinguish probe failure from CrashLoopBackOff.
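The distinction between a probe failure and CrashLoopBackOff can be sketched from the pod status and its events. This assumes Kubernetes-style dicts with status.containerStatuses and events that carry type and reason fields. The classify function and its labels are hypothetical.

```python
def classify(pod: dict, events: list[dict]) -> str:
    """Separate a Running pod with failing probes from a pod stuck in CrashLoopBackOff."""
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return "crash_loop"
    probe_events = [
        e for e in events
        if e.get("type") == "Warning" and e.get("reason") == "Unhealthy"
    ]
    # Running is not the same as healthy: Unhealthy events override the phase.
    if status.get("phase") == "Running" and probe_events:
        return "probe_failure"
    return "unknown"
```

This sketch treats the container's waiting reason as decisive. A fuller check would also look at the restart count and at whether the failing probe is liveness or readiness.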
010-resource-quota-exceeded
The namespace hit its ResourceQuota. Existing pods are healthy, but new pods can't be created. A deployment wants 5 replicas but only has 2 because the quota blocks creation of the remaining 3.
Key signals: events with reason: FailedCreate and messages about exceeding quota. Deployment shows desired > ready. Existing pods are healthy. This is the hardest one because the running pods are genuinely fine. The problem is invisible unless you check the events and quota.
Axis 2: agent output must identify the quota as the blocker, not resource limits on the pod spec.
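The quota scenario can be sketched as a comparison of desired and ready replicas, gated on a quota-related FailedCreate event. This assumes Kubernetes-style dicts (spec.replicas, status.readyReplicas) and that the event message contains the phrase "exceeded quota". The quota_shortfall helper is hypothetical.

```python
def quota_shortfall(deployment: dict, events: list[dict]) -> int:
    """Return how many replicas are blocked by quota, or 0 if quota is not implicated."""
    desired = deployment.get("spec", {}).get("replicas", 0)
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    blocked_by_quota = any(
        e.get("reason") == "FailedCreate"
        and "exceeded quota" in e.get("message", "").lower()
        for e in events
    )
    if blocked_by_quota and desired > ready:
        return desired - ready
    return 0
```

With the numbers from this scenario (5 desired, 2 ready, and a quota event), the shortfall is 3. Without the event the function returns 0, which reflects that a replica gap alone does not prove the quota is the blocker.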
Done when
- All four scenarios load and validate through the scenario loader
- Agent returns the correct root_cause_category for each
- Axis 2 test suite scores reasoning quality (ruling_out_keywords matched, required_queries audited)
- Agent distinguishes these subtle failures from their simpler lookalikes
- Difficulty 3 scenarios pass or are tracked as xfail gaps
- make lint && make typecheck && make test-cov pass
Reference
- tests/synthetic/rds_postgres/test_suite_axis2.py
- tests/synthetic/mock_grafana_backend/selective_backend.py