test(synthetic): add 5 EKS noise scenarios by hamzzaaamalik · Pull Request #666 · Tracer-Cloud/opensre

hamzzaaamalik · 2026-04-19T13:13:54Z

Summary

Adds 5 noise/false-positive scenarios to the EKS synthetic suite. Each one looks like a real failure on the surface but the cluster is actually healthy — the agent should NOT diagnose a problem.
Brings EKS noise coverage closer to RDS Postgres (which already has 5).

The 5 scenarios

| # | Scenario | The trap |
| --- | --- | --- |
| 009 | noisy-healthy-restart-recovered | One pod restarted an hour ago - currently fine |
| 010 | red-herring-old-rollout | Old pod being torn down looks broken - new pods are healthy |
| 011 | recovered-rollout | A bad rollout already auto-rolled back - deployment is stable |
| 012 | pending-recovered | A pod was Pending - autoscaler added a node, now Running |
| 013 | spurious-alert-storm | A 30-second node flap caused many warnings - all recovered |

Real-LLM scoring (mock backends)

5/5 pass on the latest evidence state.

Note: 011 and 013 were initially flaky — the agent returned unknown / infrastructure on some runs. Mirroring critical recovery context into datadog_logs (the diagnose step weights Datadog evidence more heavily than EKS events for confident classification) got both to consistently pass.

The underlying behaviour is still worth tightening: the diagnose prompt in app/nodes/root_cause_diagnosis/prompt_builder.py could weight current-state signals above historical Warning events more aggressively. Happy to follow up in a separate PR.

What's in each scenario folder

scenario.yml - base inheritance from 000-healthy + adversarial signal metadata
alert.json - symptom-level alert (e.g. KubernetesPodCrashLooping)
answer.yml - expected category healthy, with negative-evidence rule-outs (e.g. "not OOM", "not insufficient")
1–3 evidence file overrides showing the noise (everything else inherits from 000-healthy)
eks_events.json is kept strictly Warning-typed; recovery context lives in datadog_logs
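
As a rough sketch of how an answer key of this kind could be checked against an agent's diagnosis: the key names `required_keywords`, `forbidden_categories`, and `ruling_out_keywords` are taken from the review flowchart for this PR, while `expected_category`, the function name, the dict shape, and the case-insensitive substring matching are assumptions made for illustration, not the actual logic in the test suite.

```python
from typing import Any


def score_diagnosis(answer_key: dict[str, Any], category: str, response: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the check passed."""
    failures: list[str] = []
    text = response.lower()

    # The predicted category must match and must not be explicitly forbidden.
    if category != answer_key["expected_category"]:
        failures.append(f"expected category {answer_key['expected_category']!r}, got {category!r}")
    if category in answer_key.get("forbidden_categories", []):
        failures.append(f"forbidden category {category!r}")

    # Required and rule-out phrases are matched as case-insensitive substrings.
    for field in ("required_keywords", "ruling_out_keywords"):
        for phrase in answer_key.get(field, []):
            if phrase.lower() not in text:
                failures.append(f"missing {field} phrase {phrase!r}")
    return failures


# Hypothetical answer key, loosely modelled on a noise scenario.
key = {
    "expected_category": "healthy",
    "forbidden_categories": ["infrastructure", "resource_exhaustion"],
    "required_keywords": ["recovered"],
    "ruling_out_keywords": ["not OOM"],
}
ok = score_diagnosis(key, "healthy", "Pod recovered an hour ago; this is not OOM.")
bad = score_diagnosis(key, "infrastructure", "Node flap suggests an infrastructure fault.")
```

Returning the reasons rather than a bare boolean makes a flaky scenario easier to debug, since a failed run shows whether the category or a missing rule-out phrase was responsible.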

Test plan

pytest tests/synthetic/eks/ - all 14 scenarios load
Real-LLM run - 5/5 pass
Reviewer can re-run scoring to verify

greptile-apps · 2026-04-19T13:17:45Z

Greptile Summary

Adds 5 EKS noise/false-positive scenarios (009–013) to the synthetic test suite, bringing EKS coverage in line with RDS Postgres. All 33 new files are test data only (JSON evidence fixtures, YAML scenario metadata and answer keys) with no changes to production code.

P1 — scenario 012 answer.yml: model_response claims the recovered pod has been "Running and Ready for over 15 minutes", but the latest evidence timestamp is 10:30:00Z — only ~82 seconds after started_at (10:28:38Z). An evaluator cross-checking evidence timestamps against the narrative will find the assertion unsupportable; the log, pod, or event files need a later timestamp to make the arithmetic hold.
P2 — scenarios 011 and 012 eks_events.json: Raw Kubernetes event message fields embed \"see datadog_logs\" cross-reference hints. These are not realistic (real controller messages don't contain inter-tool navigation pointers) and reduce the adversarial challenge for scenarios marked scenario_difficulty: 3.

Confidence Score: 4/5

Safe to merge with one scenario (012) containing a provably unsupportable timestamp claim in its expected model response.

One P1 finding remains: scenario 012's model_response asserts a 15-minute runtime that the evidence cannot support — this breaks the internal consistency contract these test fixtures rely on and will produce misleading evaluation results. The P2 cross-reference hints are addressable but not blocking. All prior review thread concerns (deletion_timestamp, log timestamp math, evidence attribution) appear resolved in this revision.

tests/synthetic/eks/012-pending-recovered/answer.yml (15-minute claim), tests/synthetic/eks/011-recovered-rollout/eks_events.json and tests/synthetic/eks/012-pending-recovered/eks_events.json (navigational hints in event messages)

Important Files Changed

| Filename | Overview |
| --- | --- |
| tests/synthetic/eks/009-noisy-healthy-restart-recovered/answer.yml | Expected answer for stale-restart noise scenario; timestamps now internally consistent (09:25Z start, 10:30Z log, >60 min claim correct). |
| tests/synthetic/eks/010-red-herring-old-rollout/eks_pods.json | Old pod now has `deletion_timestamp` set, resolving previous inconsistency with `answer.yml` validated claim. |
| tests/synthetic/eks/011-recovered-rollout/answer.yml | Evidence attribution corrected: DeploymentRollback claim now cites `datadog_logs` rather than `eks_events`; consistent with the events file. |
| tests/synthetic/eks/012-pending-recovered/answer.yml | P1: `model_response` claims pod has been Running "over 15 minutes" but latest evidence timestamp is only ~82 seconds after `started_at`; claim is not verifiable from the provided evidence. |
| tests/synthetic/eks/011-recovered-rollout/eks_events.json | P2: Event message embeds "see datadog_logs" cross-reference, reducing adversarial difficulty for a scenario marked difficulty-3. |
| tests/synthetic/eks/012-pending-recovered/eks_events.json | P2: FailedScheduling message embeds "see datadog_logs" cross-reference, same navigational hint pattern as scenario 011. |
| tests/synthetic/eks/013-spurious-alert-storm/answer.yml | Known gap (agent classifies as `infrastructure`) documented in PR; forbidden category retained intentionally as a future regression gate. |
| tests/synthetic/eks/013-spurious-alert-storm/eks_events.json | Three Warning-typed events covering the 30-second flap window; no Normal events mistakenly placed under `warning_events`. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[scenario_loader.py] --> B{base: 000-healthy?}
    B -- Yes --> C[Merge scenario.yml on top of base]
    B -- No --> D[Use scenario.yml directly]
    C --> E[Resolve evidence files\nscenario dir first, then base dir]
    D --> E
    E --> F[alert.json]
    E --> G[answer.yml]
    E --> H[Evidence overrides\neks_pods / eks_events / etc]
    H --> I{File in scenario dir?}
    I -- Yes --> J[Use scenario override]
    I -- No --> K[Fall back to 000-healthy]
    J --> L[K8sScenarioFixture]
    K --> L
    F --> L
    G --> L
    L --> M[test_suite.py scoring]
    M --> N{required_keywords\nforbidden_categories\nruling_out_keywords}
    N -- Pass --> O[healthy ✓]
    N -- Fail --> P[Test failure ✗]
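
The override-then-fallback step in the flowchart can be sketched as below. This is a minimal illustration under assumptions: `resolve_evidence_file` is a hypothetical helper, not a function confirmed to exist in `scenario_loader.py`, and the real loader may resolve files differently.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_evidence_file(name: str, scenario_dir: Path, base_dir: Optional[Path]) -> Path:
    """Prefer the scenario's own copy of an evidence file, else fall back to the base."""
    override = scenario_dir / name
    if override.is_file():
        return override
    if base_dir is not None and (base_dir / name).is_file():
        return base_dir / name
    raise FileNotFoundError(f"{name} not found in {scenario_dir} or its base")


# Demo on a throwaway directory tree; file names mirror the ones in this PR.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "000-healthy"
    scenario = Path(tmp) / "012-pending-recovered"
    base.mkdir()
    scenario.mkdir()
    (base / "eks_pods.json").write_text("{}")
    (base / "eks_events.json").write_text("{}")
    (scenario / "eks_events.json").write_text("{}")  # scenario-level override

    events_source = resolve_evidence_file("eks_events.json", scenario, base).parent.name
    pods_source = resolve_evidence_file("eks_pods.json", scenario, base).parent.name
```

Here `events_source` resolves to the scenario directory and `pods_source` falls back to `000-healthy`, which is why each noise scenario only needs to ship the one to three files that differ from the healthy baseline.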

Comments Outside Diff (1)

tests/synthetic/eks/012-pending-recovered/answer.yml, line 734

"Over 15 minutes" claim not supported by evidence timestamps

The model_response states the pod "has been Running and Ready for over 15 minutes", but the latest evidence timestamp across all files for this scenario is 2026-04-18T10:30:00Z (the startup log), and the pod's started_at is 2026-04-18T10:28:38Z — a gap of ~82 seconds. No evidence in the scenario supports a 15-minute runtime claim, so any evaluator or LLM that cross-checks timestamps against the narrative will find the assertion unsupportable. Either advance the latest log/event timestamp to at least 10:43:38Z to make the arithmetic hold, or change the narrative to "over 1 minute" / "recently transitioned to Running".
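
The arithmetic behind this comment can be reproduced with a short check. The two timestamps and the 15-minute claim come from the comment itself; the helper and variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(ts: str) -> datetime:
    """Parse a Z-suffixed ISO 8601 timestamp as timezone-aware UTC."""
    return datetime.strptime(ts, TS_FORMAT).replace(tzinfo=timezone.utc)


started_at = parse_utc("2026-04-18T10:28:38Z")       # pod started_at
latest_evidence = parse_utc("2026-04-18T10:30:00Z")  # latest log timestamp

observed_runtime = latest_evidence - started_at
claimed_runtime = timedelta(minutes=15)

gap_seconds = int(observed_runtime.total_seconds())
claim_supported = observed_runtime >= claimed_runtime
# Earliest evidence timestamp that would make the 15-minute claim hold.
earliest_supporting = (started_at + claimed_runtime).strftime(TS_FORMAT)
```

The observed runtime is 82 seconds, so the claim is unsupported until some evidence file carries a timestamp of `2026-04-18T10:43:38Z` or later.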


tests/synthetic/eks/011-recovered-rollout/eks_events.json, line 10

Cross-reference hint embedded in raw Kubernetes event message

The `ProgressDeadlineExceeded` event message ends with `"(deployment has since auto-rolled back; see datadog_logs)"`. The same pattern appears in scenario 012's `FailedScheduling` event: `"(cluster autoscaler subsequently added a node; pod is now Running — see datadog_logs)"`. In a real cluster these fields contain only the controller message — no cross-source navigation hints. Embedding `"see datadog_logs"` effectively hands the agent the answer to "which evidence source should I consult next?", quietly reducing the adversarial challenge for two scenarios marked `scenario_difficulty: 3`. Consider removing the explicit cross-reference and letting the agent infer the next tool call from the event timestamps alone.

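
A fixture lint along the following lines could catch this pattern before review. It is a sketch under assumptions: the event dict shape (`reason` and `message` keys) and the helper name are hypothetical, the first message is only the parenthetical quoted in the comment above, and the second message is an invented example of a hint-free scheduler message.

```python
import re

# Evidence source names mentioned in this PR; an event message should not point at them.
EVIDENCE_SOURCES = ("datadog_logs", "eks_pods", "eks_events")
HINT_PATTERN = re.compile(r"\bsee\s+(" + "|".join(EVIDENCE_SOURCES) + r")\b", re.IGNORECASE)


def find_navigation_hints(events: list[dict]) -> list[tuple[str, str]]:
    """Return (reason, referenced_source) for each event whose message names another source."""
    hits = []
    for event in events:
        match = HINT_PATTERN.search(event.get("message", ""))
        if match:
            hits.append((event.get("reason", "?"), match.group(1)))
    return hits


events = [
    {
        "reason": "ProgressDeadlineExceeded",
        "message": "(deployment has since auto-rolled back; see datadog_logs)",
    },
    {
        # Hypothetical hint-free message for contrast.
        "reason": "FailedScheduling",
        "message": "0/3 nodes are available: 3 Insufficient cpu.",
    },
]
hints = find_navigation_hints(events)
```

Only the first event is flagged. A pattern this narrow will miss paraphrased hints, so it complements manual review of difficulty-3 scenarios and does not replace it.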

Reviews (4): Last reviewed commit: "fix(synthetic): correct stale 09:45Z ref..."

test(synthetic): add 5 EKS noise scenarios

401ba42

greptile-apps Bot reviewed Apr 19, 2026

Comment thread tests/synthetic/eks/010-red-herring-old-rollout/eks_pods.json

Comment thread tests/synthetic/eks/010-red-herring-old-rollout/eks_events.json Outdated

Comment thread tests/synthetic/eks/009-noisy-healthy-restart-recovered/datadog_logs.json

test(synthetic): add 5 EKS noise scenarios

369091a

greptile-apps Bot reviewed Apr 19, 2026

Comment thread tests/synthetic/eks/011-recovered-rollout/answer.yml Outdated

test(synthetic): add 5 EKS noise scenarios

c6ad041

greptile-apps Bot reviewed Apr 19, 2026

Comment thread tests/synthetic/eks/009-noisy-healthy-restart-recovered/datadog_logs.json Outdated

fix(synthetic): correct stale 09:45Z reference in 009 log message

42f793a

rrajan94 merged commit b08f5de into Tracer-Cloud:main Apr 20, 2026
8 checks passed

rrajan94 merged 4 commits into Tracer-Cloud:main from hamzzaaamalik:eks-noise-scenarios
