test(synthetic): add 5 EKS noise scenarios #666

Merged: rrajan94 merged 4 commits into Tracer-Cloud:main from hamzzaaamalik:eks-noise-scenarios on Apr 20, 2026

hamzzaaamalik (Collaborator) commented Apr 19, 2026

Summary

  • Adds 5 noise/false-positive scenarios to the EKS synthetic suite. Each one looks like a real failure on the surface but the cluster is actually healthy — the agent should NOT diagnose a problem.
  • Brings EKS noise coverage closer to RDS Postgres (which already has 5).

The 5 scenarios

| # | Scenario | The trap |
|---|---|---|
| 009 | noisy-healthy-restart-recovered | One pod restarted an hour ago - currently fine |
| 010 | red-herring-old-rollout | Old pod being torn down looks broken - new pods are healthy |
| 011 | recovered-rollout | A bad rollout already auto-rolled back - deployment is stable |
| 012 | pending-recovered | A pod was Pending - autoscaler added a node, now Running |
| 013 | spurious-alert-storm | A 30-second node flap caused many warnings - all recovered |

Real-LLM scoring (mock backends)

5/5 pass on the latest evidence state.

Note: 011 and 013 were initially flaky — the agent returned unknown / infrastructure on some runs. Mirroring critical recovery context into datadog_logs (the diagnose step weights Datadog evidence more heavily than EKS events for confident classification) got both to consistently pass.

The underlying behaviour is still worth tightening: the diagnose prompt in app/nodes/root_cause_diagnosis/prompt_builder.py could weight current-state signals above historical Warning events more aggressively. Happy to follow up in a separate PR.

What's in each scenario folder

  • scenario.yml - base inheritance from 000-healthy + adversarial signal metadata
  • alert.json - symptom-level alert (e.g. KubernetesPodCrashLooping)
  • answer.yml - expected category healthy, with negative-evidence rule-outs (e.g. "not OOM", "not insufficient")
  • 1–3 evidence file overrides showing the noise (everything else inherits from 000-healthy)
  • eks_events.json is kept strictly Warning-typed; recovery context lives in datadog_logs
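
To make the role of answer.yml concrete, here is a minimal sketch of how a diagnosis could be checked against an answer key. It is not the suite's actual scoring code: `AnswerKey` and `score_diagnosis` are hypothetical names, the matching is plain case-insensitive substring search, and the field names are borrowed from the checks named elsewhere in this PR (`required_keywords`, `forbidden_categories`, `ruling_out_keywords`).

```python
from dataclasses import dataclass, field


@dataclass
class AnswerKey:
    """Hypothetical stand-in for a parsed answer.yml; the schema is an assumption."""
    expected_category: str
    required_keywords: list[str] = field(default_factory=list)
    forbidden_categories: list[str] = field(default_factory=list)
    ruling_out_keywords: list[str] = field(default_factory=list)


def score_diagnosis(category: str, narrative: str, key: AnswerKey) -> list[str]:
    """Return failure reasons; an empty list means the run passes."""
    text = narrative.lower()
    failures = []
    if category in key.forbidden_categories:
        failures.append(f"forbidden category: {category}")
    elif category != key.expected_category:
        failures.append(f"expected {key.expected_category}, got {category}")
    # Required keywords and negative-evidence rule-outs must both appear in the narrative.
    for kw in key.required_keywords:
        if kw.lower() not in text:
            failures.append(f"missing required keyword: {kw}")
    for kw in key.ruling_out_keywords:
        if kw.lower() not in text:
            failures.append(f"missing rule-out: {kw}")
    return failures


key = AnswerKey(
    expected_category="healthy",
    required_keywords=["recovered"],
    forbidden_categories=["infrastructure"],
    ruling_out_keywords=["not OOM"],
)
ok = score_diagnosis("healthy", "Pod recovered an hour ago; not OOM.", key)
bad = score_diagnosis("infrastructure", "Node flap indicates failure.", key)
print(ok)   # no failures
print(bad)  # forbidden category plus two missing keywords
```

Returning a list of reasons rather than a boolean keeps flaky runs (such as the early 011 and 013 results) easier to debug, since each failed run says which check it tripped.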

Test plan

  • pytest tests/synthetic/eks/ - all 14 scenarios load
  • Real-LLM run - 5/5 pass
  • Reviewer can re-run scoring to verify

greptile-apps (bot, Contributor) commented Apr 19, 2026

Greptile Summary

Adds 5 EKS noise/false-positive scenarios (009–013) to the synthetic test suite, bringing EKS coverage in line with RDS Postgres. All 33 new files are test data only (JSON evidence fixtures, YAML scenario metadata and answer keys) with no changes to production code.

  • P1 — scenario 012 answer.yml: model_response claims the recovered pod has been "Running and Ready for over 15 minutes", but the latest evidence timestamp is 10:30:00Z — only ~82 seconds after started_at (10:28:38Z). An evaluator cross-checking evidence timestamps against the narrative will find the assertion unsupportable; the log, pod, or event files need a later timestamp to make the arithmetic hold.
  • P2 — scenarios 011 and 012 eks_events.json: Raw Kubernetes event message fields embed \"see datadog_logs\" cross-reference hints. These are not realistic (real controller messages don't contain inter-tool navigation pointers) and reduce the adversarial challenge for scenarios marked scenario_difficulty: 3.

Confidence Score: 4/5

Safe to merge with one scenario (012) containing a provably unsupportable timestamp claim in its expected model response.

One P1 finding remains: scenario 012's model_response asserts a 15-minute runtime that the evidence cannot support — this breaks the internal consistency contract these test fixtures rely on and will produce misleading evaluation results. The P2 cross-reference hints are addressable but not blocking. All prior review thread concerns (deletion_timestamp, log timestamp math, evidence attribution) appear resolved in this revision.

Affected files: tests/synthetic/eks/012-pending-recovered/answer.yml (15-minute claim), tests/synthetic/eks/011-recovered-rollout/eks_events.json and tests/synthetic/eks/012-pending-recovered/eks_events.json (navigational hints in event messages)

Important Files Changed

| Filename | Overview |
|---|---|
| tests/synthetic/eks/009-noisy-healthy-restart-recovered/answer.yml | Expected answer for stale-restart noise scenario; timestamps now internally consistent (09:25Z start, 10:30Z log, >60 min claim correct). |
| tests/synthetic/eks/010-red-herring-old-rollout/eks_pods.json | Old pod now has deletion_timestamp set, resolving previous inconsistency with answer.yml validated claim. |
| tests/synthetic/eks/011-recovered-rollout/answer.yml | Evidence attribution corrected: DeploymentRollback claim now cites datadog_logs rather than eks_events; consistent with the events file. |
| tests/synthetic/eks/012-pending-recovered/answer.yml | P1: model_response claims pod has been Running "over 15 minutes" but latest evidence timestamp is only ~82 seconds after started_at; claim is not verifiable from the provided evidence. |
| tests/synthetic/eks/011-recovered-rollout/eks_events.json | P2: Event message embeds "see datadog_logs" cross-reference, reducing adversarial difficulty for a scenario marked difficulty-3. |
| tests/synthetic/eks/012-pending-recovered/eks_events.json | P2: FailedScheduling message embeds "see datadog_logs" cross-reference, same navigational hint pattern as scenario 011. |
| tests/synthetic/eks/013-spurious-alert-storm/answer.yml | Known gap (agent classifies as infrastructure) documented in PR; forbidden category retained intentionally as a future regression gate. |
| tests/synthetic/eks/013-spurious-alert-storm/eks_events.json | Three Warning-typed events covering the 30-second flap window; no Normal events mistakenly placed under warning_events. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[scenario_loader.py] --> B{base: 000-healthy?}
    B -- Yes --> C[Merge scenario.yml on top of base]
    B -- No --> D[Use scenario.yml directly]
    C --> E[Resolve evidence files\nscenario dir first, then base dir]
    D --> E
    E --> F[alert.json]
    E --> G[answer.yml]
    E --> H[Evidence overrides\neks_pods / eks_events / etc]
    H --> I{File in scenario dir?}
    I -- Yes --> J[Use scenario override]
    I -- No --> K[Fall back to 000-healthy]
    J --> L[K8sScenarioFixture]
    K --> L
    F --> L
    G --> L
    L --> M[test_suite.py scoring]
    M --> N{required_keywords\nforbidden_categories\nruling_out_keywords}
    N -- Pass --> O[healthy ✓]
    N -- Fail --> P[Test failure ✗]
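
The override-then-fallback step in the flowchart can be sketched roughly as follows. This is an illustration under assumptions, not the code in scenario_loader.py: `resolve_evidence` is a hypothetical helper, and the demo builds a throwaway directory tree instead of reading the real fixtures.

```python
from pathlib import Path
import tempfile


def resolve_evidence(name: str, scenario_dir: Path, base_dir: Path) -> Path:
    """Prefer the scenario's own file; otherwise fall back to the base scenario."""
    override = scenario_dir / name
    if override.is_file():
        return override
    fallback = base_dir / name
    if fallback.is_file():
        return fallback
    raise FileNotFoundError(f"{name} not found in {scenario_dir} or {base_dir}")


# Demo on a temporary tree: the scenario overrides events but inherits pods.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    base = root / "000-healthy"
    scenario = root / "012-pending-recovered"
    base.mkdir()
    scenario.mkdir()
    (base / "eks_pods.json").write_text("[]")
    (base / "eks_events.json").write_text("[]")
    (scenario / "eks_events.json").write_text("[]")

    events_source = resolve_evidence("eks_events.json", scenario, base).parent.name
    pods_source = resolve_evidence("eks_pods.json", scenario, base).parent.name

print(events_source)  # 012-pending-recovered
print(pods_source)    # 000-healthy
```

Raising when neither directory has the file is a design choice of this sketch; the real loader may handle a missing evidence file differently.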

Comments Outside Diff (1)

  1. tests/synthetic/eks/012-pending-recovered/answer.yml, line 734

    P1 "Over 15 minutes" claim not supported by evidence timestamps

    The model_response states the pod "has been Running and Ready for over 15 minutes", but the latest evidence timestamp across all files for this scenario is 2026-04-18T10:30:00Z (the startup log), and the pod's started_at is 2026-04-18T10:28:38Z — a gap of ~82 seconds. No evidence in the scenario supports a 15-minute runtime claim, so any evaluator or LLM that cross-checks timestamps against the narrative will find the assertion unsupportable. Either advance the latest log/event timestamp to at least 10:43:38Z to make the arithmetic hold, or change the narrative to "over 1 minute" / "recently transitioned to Running".
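
The arithmetic behind this finding can be reproduced with the two timestamps quoted in the comment. The helper name `claim_supported` is hypothetical; it only checks whether the evidence window is long enough for a "running for over N minutes" claim.

```python
from datetime import datetime, timedelta, timezone

# Timestamps quoted in the review comment for scenario 012.
started_at = datetime(2026, 4, 18, 10, 28, 38, tzinfo=timezone.utc)
latest_evidence = datetime(2026, 4, 18, 10, 30, 0, tzinfo=timezone.utc)


def claim_supported(start: datetime, latest: datetime, claimed: timedelta) -> bool:
    """True only if the evidence shows strictly more runtime than the claim."""
    return latest - start > claimed


observed_seconds = (latest_evidence - started_at).total_seconds()
threshold = (started_at + timedelta(minutes=15)).strftime("%H:%M:%SZ")
supported = claim_supported(started_at, latest_evidence, timedelta(minutes=15))

print(observed_seconds)  # 82.0
print(threshold)         # 10:43:38Z
print(supported)         # False
```

Because the check is strict, evidence stamped exactly at 10:43:38Z would show 15 minutes of runtime but not "over 15 minutes"; a slightly later timestamp avoids that edge case.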

Prompt To Fix All With AI
This is a comment left during a code review.
Path: tests/synthetic/eks/011-recovered-rollout/eks_events.json
Line: 10

Comment:
**Cross-reference hint embedded in raw Kubernetes event message**

The `ProgressDeadlineExceeded` event message ends with `"(deployment has since auto-rolled back; see datadog_logs)"`. The same pattern appears in scenario 012's `FailedScheduling` event: `"(cluster autoscaler subsequently added a node; pod is now Running — see datadog_logs)"`. In a real cluster these fields contain only the controller message — no cross-source navigation hints. Embedding `"see datadog_logs"` effectively hands the agent the answer to "which evidence source should I consult next?", quietly reducing the adversarial challenge for two scenarios marked `scenario_difficulty: 3`. Consider removing the explicit cross-reference and letting the agent infer the next tool call from the event timestamps alone.

How can I resolve this? If you propose a fix, please make it concise.
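
One way to keep such hints from creeping back into fixtures is a small lint over event messages. The sketch below assumes each event is a JSON object with a `message` field, which is a guess at the eks_events.json schema; `HINT_PATTERN` and `find_hinted_messages` are hypothetical names, and the sample messages are illustrative apart from the parenthetical quoted in the comment.

```python
import json
import re

# Assumed list of evidence-source names a real controller message would not mention.
HINT_PATTERN = re.compile(
    r"\bsee\s+(datadog_logs|eks_pods|eks_events)\b", re.IGNORECASE
)


def find_hinted_messages(events: list[dict]) -> list[str]:
    """Return every event message that contains a cross-source navigation hint."""
    return [
        e.get("message", "")
        for e in events
        if HINT_PATTERN.search(e.get("message", ""))
    ]


# Illustrative events; only the parenthetical hint is taken from the review comment.
sample = json.loads("""[
  {"reason": "ProgressDeadlineExceeded", "type": "Warning",
   "message": "Progress deadline exceeded (deployment has since auto-rolled back; see datadog_logs)"},
  {"reason": "FailedScheduling", "type": "Warning",
   "message": "0/3 nodes are available: 3 Insufficient cpu."}
]""")

hinted = find_hinted_messages(sample)
print(len(hinted))  # 1
```

A check like this could run as an ordinary pytest case over every scenario directory, failing when a difficulty-3 fixture names another evidence source in an event message.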

Reviews (4): Last reviewed commit: "fix(synthetic): correct stale 09:45Z ref..."

Comment thread tests/synthetic/eks/010-red-herring-old-rollout/eks_pods.json
Comment thread tests/synthetic/eks/010-red-herring-old-rollout/eks_events.json Outdated
Comment thread tests/synthetic/eks/011-recovered-rollout/answer.yml Outdated
Comment thread tests/synthetic/eks/009-noisy-healthy-restart-recovered/datadog_logs.json Outdated
rrajan94 merged commit b08f5de into Tracer-Cloud:main on Apr 20, 2026
8 checks passed