test(synthetic): add 8 Kubernetes RCA scenarios #661

Merged: davincios merged 4 commits into Tracer-Cloud:main from hamzzaaamalik:k8s-synthetic-scenarios on Apr 19, 2026

Conversation

@hamzzaaamalik

Collaborator

Summary

  • Adds 8 Kubernetes failure scenarios to the synthetic test suite introduced in #583 (feat: add Kubernetes synthetic RCA test harness), covering the most common production K8s issues: out-of-memory crashes, bad image tags, pending pods, broken health probes, quota exhaustion, DNS failures, node failures, and stuck rollouts.
  • Each scenario is a small folder of fixture files (alert + evidence + expected answer) that the agent investigates end-to-end.
  • The agent passes all 9 scenarios on a clean run.
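As a rough illustration of what a structural check over these scenario folders might look like, here is a minimal sketch. It is not the suite's actual validator: `REQUIRED_FILES`, `missing_fixture_files`, and `validate_suite` are hypothetical names, and the required-file list only includes the two files named in this thread (`scenario.yml`, `answer.yml`).

```python
from pathlib import Path

# Assumption: only scenario.yml and answer.yml are named in this PR thread;
# the real harness may require more files (alert, evidence fixtures).
REQUIRED_FILES = ("scenario.yml", "answer.yml")


def missing_fixture_files(scenario_dir: Path, required=REQUIRED_FILES) -> list[str]:
    """Return the required file names that are absent from one scenario folder."""
    return [name for name in required if not (scenario_dir / name).is_file()]


def validate_suite(root: Path) -> dict[str, list[str]]:
    """Map each scenario folder name to the required files it lacks.

    Folders with nothing missing are omitted, so an empty dict means
    every scenario folder has the assumed minimum set of files.
    """
    problems: dict[str, list[str]] = {}
    for scenario_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_fixture_files(scenario_dir)
        if missing:
            problems[scenario_dir.name] = missing
    return problems
```

Calling `validate_suite(Path("tests/synthetic/eks"))` from the repository root would list any folder that lacks one of the assumed files.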

Why this matters

  • Gives the team automated coverage for the K8s diagnosis pipeline: any future agent change can be checked against these scenarios before shipping.
  • Each scenario is designed to be hard: healthy pods sit alongside the broken one, and surface symptoms hide the real cause, so passing them really tests the agent.

One thing to fix later

Two scenarios (quota and rollout-stuck) needed a small workaround because EKS tool output is currently dropped before reaching the agent (known issue from #583). For now I mirrored the key signals into Datadog logs so the agent can see them. Once the EKS evidence wiring is fixed in a follow-up, the workaround can be removed.
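
To make the workaround concrete, here is a minimal sketch of mirroring a Kubernetes-style event into a log-shaped record. Everything in it is an assumption for illustration: `mirror_event_to_log`, the field names, the `source:eks_event_mirror` tag, and the sample event are hypothetical and do not reflect the repository's actual fixture schema.

```python
def mirror_event_to_log(event: dict, service: str) -> dict:
    """Convert a Kubernetes-style event into a log-shaped record.

    Hypothetical schema: the keys below are illustrative, not the
    fixture format used by the synthetic suite.
    """
    return {
        "message": f"{event['reason']}: {event['message']}",
        "service": service,
        # Treat Warning events as errors so they stand out among healthy logs.
        "status": "error" if event.get("type") == "Warning" else "info",
        "tags": [f"kube_namespace:{event['namespace']}", "source:eks_event_mirror"],
    }


# Illustrative event for the quota scenario (values are made up).
quota_event = {
    "type": "Warning",
    "reason": "FailedCreate",
    "message": "pods is forbidden: exceeded quota",
    "namespace": "payments",
}

mirrored = mirror_event_to_log(quota_event, service="checkout-api")
```

Tagging mirrored records with a distinct source would make them easy to find and delete once the EKS evidence wiring is fixed.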

Test plan

  • Structural validation passes
  • Full suite scoring run: 9/9 pass
  • Reviewer can re-run to confirm

@greptile-apps

greptile-apps Bot commented Apr 19, 2026

Contributor

Greptile Summary

Adds 8 synthetic Kubernetes RCA scenarios (OOMKilled, ImagePullBackOff, pending/unschedulable, liveness-probe killing, resource-quota exceeded, DNS failure, node-not-ready, stuck rollout) to the existing test suite from #583. Each scenario follows the base-override fixture model and includes alert, evidence, and graded answer.yml files.

Prior review concerns (contradictory deployment counters in 002/008, missing available_evidence in 005/006, root-cause tokens in ruling_out_keywords for 004/008) appear to have been addressed. One pattern remains: scenarios 003, 005, and 006 still carry trivially-positive ruling_out_keywords tokens that would appear in any correct answer, giving Axis 2 scoring a free pass without confirming the agent actually reasoned through alternative hypotheses.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 quality improvements to scoring metadata, not runtime or data-correctness issues.

All P0/P1 issues from previous review rounds (contradictory ready/unavailable counts, missing available_evidence, root-cause tokens in ruling_out_keywords for 004 and 008) are resolved. The only remaining finding is a P2 concern about weak ruling_out_keywords in three answer.yml files — a scoring-signal quality issue that doesn't affect test correctness or the agent's pass/fail outcome.

tests/synthetic/eks/003-pending-insufficient-resources/answer.yml, tests/synthetic/eks/005-resource-quota-exceeded/answer.yml, tests/synthetic/eks/006-dns-resolution-failure/answer.yml — ruling_out_keywords should use negative-framing tokens.
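
A small sketch of why positive-evidence tokens give a free pass, assuming the scorer does a case-insensitive substring match (the real Axis 2 scorer is not shown in this thread, so `matched_keywords` and the sample answer are hypothetical):

```python
def matched_keywords(answer: str, keywords: list[str]) -> list[str]:
    """Return the keywords found in the answer (case-insensitive substring match)."""
    text = answer.lower()
    return [kw for kw in keywords if kw.lower() in text]


# Hypothetical agent answer for scenario 003: it names the cause but never
# says which alternative hypotheses were ruled out.
answer_without_ruling_out = (
    "Pod is Pending: both nodes are Ready=True but their allocatable CPU is nearly zero."
)

positive_tokens = ["nodes", "Ready"]               # current tokens for 003
negative_tokens = ["not node failure", "not OOM"]  # tokens suggested in the review

print(matched_keywords(answer_without_ruling_out, positive_tokens))  # ['nodes', 'Ready']
print(matched_keywords(answer_without_ruling_out, negative_tokens))  # []
```

Under that assumption, the positive tokens match an answer that never considers alternatives, while the negative-framing tokens match only when the answer states the ruling-out explicitly.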

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/eks/003-pending-insufficient-resources/answer.yml | ruling_out_keywords ("nodes", "Ready") are trivially positive-evidence tokens, not negative-framing checks that confirm the agent ruled out node failure or OOM. |
| tests/synthetic/eks/004-liveness-probe-killing/answer.yml | ruling_out_keywords now correct ("not OOM", "exit code 0") after prior review feedback; scenario and fixtures look consistent. |
| tests/synthetic/eks/005-resource-quota-exceeded/answer.yml | ruling_out_keywords ("existing pods", "Ready") are positive-evidence tokens that trivially appear in any correct diagnosis; doesn't confirm the agent ruled out scheduler/capacity or other hypotheses. |
| tests/synthetic/eks/006-dns-resolution-failure/answer.yml | ruling_out_keywords ("Ready", "restart") are trivial tokens for this scenario; better negative-framing tokens would be "not OOM", "not crashloop", or "not probe". |
| tests/synthetic/eks/008-deployment-rollout-stuck/answer.yml | ruling_out_keywords corrected to "not OOM", "not quota", "old ReplicaSet" per prior review feedback. Looks good. |
| tests/synthetic/eks/005-resource-quota-exceeded/scenario.yml | available_evidence now explicitly listed including Datadog workaround entries; prior concern about missing field is resolved. |
| tests/synthetic/eks/006-dns-resolution-failure/scenario.yml | available_evidence explicitly declared with datadog_logs/monitors; EKS-level fixtures fall back to healthy base, correctly expressing the adversarial all-pods-healthy signal. |

Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/synthetic/eks/003-pending-insufficient-resources/answer.yml
Line: 17-19

Comment:
**`ruling_out_keywords` are positive-evidence tokens, not negative-framing**

`nodes` and `Ready` will appear trivially in any correct diagnosis of this scenario (the answer naturally mentions "both nodes are Ready=True but their allocatable CPU is nearly zero"). They carry no signal that the agent specifically ruled out an alternative hypothesis (e.g. node failure, OOM, image pull).

Compare the fixed usage in scenarios 004 and 008: `"not OOM"`, `"exit code 0"`, `"not quota"`. For scenario 003 the intended ruling-out claim is that the nodes themselves are healthy and the block is CPU capacity, so tokens like `"not node failure"` or `"not OOM"` would actually test that conclusion. The same pattern applies to scenario 005 (`existing pods`, `Ready`) and scenario 006 (`Ready`, `restart`).

How can I resolve this? If you propose a fix, please make it concise.

Reviews (4): Last reviewed commit: "fix(synthetic): make available_evidence ..."

Comment thread tests/synthetic/eks/002-image-pull-backoff/eks_deployments.json Outdated
Comment thread tests/synthetic/eks/008-deployment-rollout-stuck/answer.yml Outdated
Comment thread tests/synthetic/eks/008-deployment-rollout-stuck/eks_deployments.json Outdated
Comment thread tests/synthetic/eks/004-liveness-probe-killing/answer.yml Outdated
Comment thread tests/synthetic/eks/005-resource-quota-exceeded/scenario.yml
@davincios davincios merged commit a01f6a1 into Tracer-Cloud:main Apr 19, 2026
11 checks passed

2 participants