feat: add Kubernetes synthetic RCA test harness #583
Mirror the existing RDS Postgres synthetic suite under tests/synthetic/ for Kubernetes workloads so future K8s scenarios (crash-loop, OOM-killed, image-pull-backoff, etc.) can plug in as scenario directories rather than one-off test files.

Harness components:
* tests/synthetic/k8s_schemas.py: TypedDicts, controlled vocabularies and validators for K8s scenario metadata, alert envelopes, and every declared evidence source (eks_pods, eks_events, eks_deployments, eks_node_health, eks_pod_logs, datadog_logs, datadog_monitors).
* tests/synthetic/mock_eks_backend/ and mock_datadog_backend/: fixture backends that satisfy runtime-checkable Protocols and return the exact envelope shape the real tool functions under app/tools/EKS*/ and app/tools/DataDog*/ produce. Each package ships a straight fixture backend for Axis 1 and a selective variant that records every tool invocation for Axis 2 reasoning-quality scoring.
* tests/synthetic/eks/scenario_loader.py: scenario discovery, base scenario inheritance (single-level, chained bases rejected), evidence file fallback from scenario dir to base dir, typed ScenarioFixture.
* tests/synthetic/eks/run_suite.py: CLI runner plus score_trajectory / score_reasoning / score_result functions mirroring the RDS harness, with an _EVIDENCE_KEY_MAP hook for future refinement.
* tests/synthetic/eks/test_suite.py: unit tests for the loader, backend shapes, scorer, and scenario inheritance, plus an end-to-end smoke test that drives run_investigation() with a canned plan (no real LLM call required).
* tests/synthetic/eks/test_suite_axis2.py: pytest parametrisation for adversarial reasoning tests using the selective backends.
* tests/synthetic/eks/000-healthy/: placeholder scenario directory with scenario_difficulty: 0 so the parametrised level1..4 collections stay empty until the first real failure scenarios land under Tracer-Cloud#261 / Tracer-Cloud#262 / Tracer-Cloud#263. Its only job is to exercise the full harness wiring end-to-end.
* Makefile: adds test-k8s-synthetic target, mirroring test-rds-synthetic.

Agent-side wiring for the mock backends:
* app/nodes/plan_actions/detect_sources.py: extend the EKS and Datadog paths to accept a pre-injected _backend dict key the same way Grafana already does. Backend-only mode deliberately does not set connection_verified, so only the five fixture-supported EKS tools activate (list_eks_pods, get_eks_events, list_eks_deployments, get_eks_node_health, get_eks_pod_logs) and the six unsupported ones stay quiet.
* app/tools/EKSListClustersTool/__init__.py: introduce _eks_available_or_backend helper for the wired tools; make _eks_creds safe against missing role_arn (backend-only call path).
* app/tools/EKS*Tool/__init__.py (5 wired tools) and app/tools/DataDog{Logs,Monitors}Tool/__init__.py: add an eks_backend / datadog_backend kwarg to each tool function and short-circuit to the mock when present, matching the pattern the Grafana tools already use.

Known follow-ups flagged as separate issues before this PR was opened:
* Issue for EKS evidence mappers in app/nodes/investigate/processing/post_process.py: the EKS tools' output is currently dropped by merge_evidence because the EVIDENCE_MAPPERS registry has no entries for them. Until that lands, the end-to-end smoke test in this PR only asserts on datadog_* evidence keys.
* Issue for is_clearly_healthy _INVESTIGATED_EVIDENCE_KEYS: the frozenset used by the healthy short-circuit has no eks_* entries, so pure-EKS healthy scenarios will not fast-path out of the reasoning LLM call. Also independent of this PR; flagged for a follow-up.
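
The backend-only gating described in the agent-side wiring above can be sketched roughly as follows. This is an illustrative approximation, not the code in the PR: the function bodies and the exact shape of the `sources` dict are assumptions, and only the key names `connection_verified` and `_backend` come from the description.

```python
from typing import Any


def eks_available(sources: dict[str, dict[str, Any]]) -> bool:
    """Gate assumed for the unwired tools: a verified real connection only."""
    return bool(sources.get("eks", {}).get("connection_verified"))


def eks_available_or_backend(sources: dict[str, dict[str, Any]]) -> bool:
    """Gate assumed for the wired tools: real connection OR injected mock."""
    eks = sources.get("eks", {})
    return bool(eks.get("connection_verified")) or eks.get("_backend") is not None


# Backend-only mode: a mock is injected, connection_verified is never set.
backend_only = {"eks": {"cluster_name": "demo", "_backend": object()}}
# Real-credential mode: no mock, connection verified.
real_creds = {"eks": {"cluster_name": "demo", "connection_verified": True}}
```

Under these assumptions, a backend-only `sources` dict activates only tools gated by `eks_available_or_backend`, while tools gated by `eks_available` stay inactive and cannot reach a real cluster.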
Greptile Summary
This PR adds a Kubernetes synthetic RCA test harness. All three concerns from the previous review round have been addressed.
Confidence Score: 5/5
Safe to merge: all three prior review concerns are fully resolved; remaining findings are P2 cleanup items that don't affect correctness.
All production-path changes are strictly additive (new kwarg defaults, new if-backend-not-None short-circuits, shared availability helpers). The harness is well-tested with 21 passing unit/integration tests, no new runtime dependencies, and backward compatibility is preserved. Remaining findings are dead code and a duplicated constant in test infrastructure.
tests/synthetic/eks/run_suite.py (unused ResolvedBackends), tests/synthetic/mock_datadog_backend/backend.py (duplicated _ERROR_KEYWORDS), Makefile (missing .PHONY entry).
CodeQL py/ineffectual-statement (7 alerts) and py/unused-global-variable
(1 alert):
* Remove the trailing `...` Ellipsis literal from every Protocol method
body in tests/synthetic/mock_eks_backend/backend.py (5 methods) and
tests/synthetic/mock_datadog_backend/backend.py (2 methods). In
Python a docstring is itself a valid function body, so the `...` was
a dead expression that CodeQL flagged as ineffectual.
* Remove the unused _BASE_SCENARIO_YML module-level constant from
tests/synthetic/eks/test_suite.py. It was left over from an earlier
refactor where the template was inlined per test method.
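
A minimal illustration of why the `...` was redundant, using a hypothetical `PodLister` Protocol rather than the PR's actual backend definitions: a docstring alone is a complete method body, and `runtime_checkable` `isinstance` checks are unaffected.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class PodLister(Protocol):
    """Hypothetical protocol, for illustration only."""

    def list_pods(self, namespace: str) -> dict[str, Any]:
        """Return a pod-listing envelope for ``namespace``."""
        # No trailing ``...`` here: the docstring is the whole body.


class FakeLister:
    """Structurally satisfies PodLister without inheriting from it."""

    def list_pods(self, namespace: str) -> dict[str, Any]:
        return {"namespace": namespace, "pods": []}


satisfies = isinstance(FakeLister(), PodLister)
```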
Greptile P2 follow-ups:
* detect_sources.py: when a backend is pre-injected under
resolved_integrations["aws"]["_backend"] but the alert's annotations
do not carry a cluster_name / eks_cluster key, fall back to the first
entry in cluster_names on the integration dict. Without this
fallback, future synthetic scenarios that forget to put the cluster
name in commonAnnotations silently produce zero EKS tool activity
with no diagnostic.
* Lift the backend-aware availability helpers out of tool-specific
modules and into a new shared app/tools/utils/availability.py. The
previous layout had _eks_available_or_backend defined in
EKSListClustersTool/__init__.py and imported by 5 other EKS tools,
and _dd_available_or_backend defined in DataDogLogsTool/__init__.py
and imported by DataDogMonitorsTool. The new file exposes
eks_available_or_backend and datadog_available_or_backend at the
module level, and every wired tool imports directly from the utils
module. No behaviour change — pure relocation of the two helpers.
* Clarify the set-membership semantics of TrajectoryScore.sequencing_ok
in tests/synthetic/eks/run_suite.py. The field stays named
sequencing_ok for parallelism with the RDS synthetic suite (which
also uses the set-membership check), but the comment now makes it
explicit that ordering is intentionally not enforced because actions
run in parallel and completion order is non-deterministic.
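
The set-membership semantics in the last bullet can be illustrated with a small sketch. `TrajectoryCheck` and `check_trajectory` are hypothetical names; the real `TrajectoryScore` and `score_trajectory` in run_suite.py may carry more fields and logic.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryCheck:
    """Hypothetical stand-in for the harness's TrajectoryScore."""

    sequencing_ok: bool
    missing: list[str] = field(default_factory=list)


def check_trajectory(expected: list[str], executed: list[str]) -> TrajectoryCheck:
    # Membership only: actions run in parallel, so completion order is
    # non-deterministic and is intentionally not compared.
    executed_set = set(executed)
    missing = [action for action in expected if action not in executed_set]
    return TrajectoryCheck(sequencing_ok=not missing, missing=missing)


expected = ["list_eks_pods", "get_eks_events"]
reordered = check_trajectory(expected, ["get_eks_events", "list_eks_pods"])
incomplete = check_trajectory(expected, ["list_eks_pods"])
```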
All three gate commands still pass locally:
make lint → All checks passed!
make typecheck → Success: no issues found in 341 source files
make test-cov → 2137 passed, 1 skipped (pre-existing pyenv-shim
failures in tests/cli_smoke_test.py are unchanged
and pass in CI)
Targeted tests also pass:
pytest tests/synthetic/eks/ → 21 passed, 5 skipped
pytest tests/tools/test_eks_* → 51 passed
pytest tests/tools/test_datadog_* → 30 passed
Relates to Tracer-Cloud#260.
Fixes #260
Describe the changes you have made in this PR -
Adds a fully-wired Kubernetes synthetic RCA test harness under `tests/synthetic/eks/`, mirroring the existing RDS Postgres suite at `tests/synthetic/rds_postgres/`. Future Kubernetes failure scenarios (issues #261, #262, #263) can now plug in as scenario directories rather than hand-rolled one-off test files, and `detect_sources` plus the wired EKS and Datadog tools transparently accept injected fixture backends in test mode while leaving real-credential behaviour untouched.
What this PR adds
New files: harness infrastructure (under `tests/synthetic/`)
* `tests/synthetic/k8s_schemas.py`: controlled vocabularies (4 engines, 6 workload types, 11 failure modes, 7 evidence sources, 7 trajectory actions), TypedDicts for the alert envelope and every evidence fixture, a `K8sScenarioEvidence` dataclass, and validators following the same patterns as `tests/synthetic/schemas.py` in the RDS suite. Kept deliberately separate from the RDS schemas so the two suites can evolve independently.
* `tests/synthetic/mock_eks_backend/`: `EKSBackend` `@runtime_checkable` Protocol plus `FixtureEKSBackend` and `SelectiveEKSBackend`. The fixture backend exposes `list_pods`, `get_events`, `list_deployments`, `get_node_health`, `get_pod_logs`, each returning the exact envelope the real tool function in `app/tools/EKS*/` produces. The selective subclass records every tool invocation into an audit log for Axis 2 reasoning-quality scoring.
* `tests/synthetic/mock_datadog_backend/`: `DatadogBackend` Protocol plus `FixtureDatadogBackend` and `SelectiveDatadogBackend`, wrapping `datadog_logs.json` / `datadog_monitors.json` fixtures in the envelopes that `query_datadog_logs` and `query_datadog_monitors` return in production.
* `tests/synthetic/eks/scenario_loader.py`: mirror of `tests/synthetic/rds_postgres/scenario_loader.py`: `load_all_scenarios`, `load_scenario`, single-level base inheritance with chained-inheritance rejection, file-level evidence fallback from scenario directory to base directory.
* `tests/synthetic/eks/run_suite.py`: CLI runner plus scorer. `TrajectoryScore` / `ReasoningScore` / `ScenarioScore` dataclasses and `score_trajectory` / `score_reasoning` / `score_result` functions parallel the RDS suite. `run_scenario` builds `resolved_integrations` with both `aws` (EKS) and `datadog` entries containing the injected `_backend` objects, then delegates to `run_investigation`.
* `tests/synthetic/eks/test_suite.py`: pytest coverage. Loader validation, schema compliance, mock backend shape assertions, scorer unit tests, a `TestScenarioInheritance` class mirroring the RDS suite (metadata inheritance, evidence fallback, local override, chained-inheritance rejection, missing-base rejection), and a `TestHarnessEndToEnd` class that drives the full `run_investigation` pipeline against the placeholder with a monkey-patched planner (no real LLM call required). The parametrised `test_level1..4_scenario` tests use `skipif` guards so they collect zero cases while only the placeholder scenario exists.
* `tests/synthetic/eks/test_suite_axis2.py`: Axis 2 pytest module using the selective backends and `score_reasoning` for adversarial reasoning-quality checks. Parameter set is empty until scenarios declare `ruling_out_keywords` or `required_queries`.
* `tests/synthetic/eks/000-healthy/`: placeholder scenario directory with `scenario_difficulty: 0` so it stays out of the level-1..4 parametrizations. Contains `scenario.yml`, `alert.json`, `answer.yml`, and every declared evidence fixture (`eks_pods.json`, `eks_events.json`, `eks_deployments.json`, `eks_node_health.json`, `datadog_logs.json`, `datadog_monitors.json`). `eks_pod_logs` is deliberately omitted from `available_evidence` to exercise the "missing evidence source raises ValueError" path in the mock backend.
Modified files: minimal agent-side wiring
The harness has to feed the fixture backends into the real pipeline the same way the RDS synthetic suite feeds `FixtureGrafanaBackend` into the Grafana tools. That pattern was not yet present for EKS or Datadog, so:
* `app/nodes/plan_actions/detect_sources.py`: extend the EKS and Datadog paths to accept a pre-injected `_backend` key, matching how the existing Grafana path already works. The EKS integration continues to live under `resolved_integrations["aws"]` as before; `detect_sources` now reads `_backend` from that dict and propagates it to `sources["eks"]["_backend"]`. Same treatment for Datadog. Critically, backend-only mode deliberately does NOT set `connection_verified: True`. This keeps the 6 unwired EKS tools and the 4 unwired Datadog tools inactive in test mode, so a scenario cannot accidentally trigger a real AWS or Datadog call.
* `app/tools/EKSListClustersTool/__init__.py`: introduce a new `_eks_available_or_backend(sources)` helper that returns `True` when either `connection_verified` or `_backend` is present. Only the 5 wired tools import this helper; the other 6 EKS tools keep using the existing `_eks_available` check. Also relaxes `_eks_creds` from `eks["role_arn"]` to `eks.get("role_arn", "")` so it no longer raises KeyError when called from the backend-only path.
* 5 EKS tools (`EKSListPodsTool`, `EKSEventsTool`, `EKSListDeploymentsTool`, `EKSNodeHealthTool`, `EKSPodLogsTool`): each gets an `eks_backend: Any = None` kwarg added to its function signature (plus an `eks_backend = eks.get("_backend")` line in its `extract_params` helper). The function body short-circuits to `eks_backend.<method>(...)` when the kwarg is set, before any call to `build_k8s_clients`. The result is cast to `dict[str, Any]` to satisfy mypy's `warn_return_any`. `role_arn` was also loosened from positional-required to default-empty to support the backend-only call path.
* `app/tools/DataDogLogsTool/__init__.py` and `app/tools/DataDogMonitorsTool/__init__.py`: add a `_dd_available_or_backend` helper in `DataDogLogsTool` (imported by `DataDogMonitorsTool`), plus a `datadog_backend: Any = None` kwarg on each tool function and a short-circuit to `datadog_backend.query_logs(...)` / `datadog_backend.query_monitors(...)` before the real `make_client` path. Same cast pattern as the EKS tools.
* `Makefile`: add a `test-k8s-synthetic` target mirroring `test-rds-synthetic`.
Out of scope for this PR
* Real Kubernetes failure scenarios (#261: CrashLoopBackOff, OOMKilled, ImagePullBackOff; #262: Node NotReady, Pending Pods, Stuck Rollouts; #263: Eviction, DNS failures, Probe failures, Quota limits). The issue description is explicit: "This issue covers the test harness itself, not the individual scenarios." This PR ships the harness plus a single `000-healthy` placeholder whose only job is to prove the harness wiring works end-to-end. Difficulty-level parametrizations stay empty until those issues add real scenarios.
* Wiring the 6 remaining EKS tools and 4 remaining Datadog tools. Only the tools whose output corresponds to a declared evidence source in the issue scope are wired; this keeps the blast radius minimal. When future scenarios need additional sources, the same short 3-line pattern copied from the wired tools applies.
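
For readers unfamiliar with the pattern mentioned above, here is a hedged sketch of what the short-circuit might look like in one tool function. `StubEKSBackend`, the envelope fields, and the simplified `list_eks_pods` signature are all hypothetical; the real tool functions take more parameters and fall through to `build_k8s_clients`.

```python
from typing import Any, cast


class StubEKSBackend:
    """Hypothetical fixture backend returning a canned envelope."""

    def list_pods(self, cluster_name: str, namespace: str) -> dict[str, Any]:
        return {"cluster": cluster_name, "namespace": namespace, "pods": []}


def list_eks_pods(
    cluster_name: str,
    namespace: str = "default",
    role_arn: str = "",  # default-empty so the backend-only path can omit it
    eks_backend: Any = None,  # additive kwarg; None keeps the real path
) -> dict[str, Any]:
    # Short-circuit to the injected mock before any real client is built.
    if eks_backend is not None:
        return cast(
            dict[str, Any],
            eks_backend.list_pods(cluster_name=cluster_name, namespace=namespace),
        )
    # The real AWS/Kubernetes path is deliberately elided in this sketch.
    raise RuntimeError("real EKS path not available in this sketch")


envelope = list_eks_pods("demo-cluster", eks_backend=StubEKSBackend())
```

Because the kwarg defaults to `None`, existing callers that never pass `eks_backend` keep taking the real-credential path unchanged.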
Pre-existing gaps flagged separately (not fixed in this PR)
While building the harness I identified two pre-existing gaps in the existing EKS plumbing. Both are NOT part of the harness scope, but #261 / #262 / #263 will need them to work end-to-end. Both have been filed as standalone bug reports before this PR was opened so they can be picked up in parallel by any contributor:
* #581, `[BUG] EKS tool output silently dropped by merge_evidence — no mappers in post_process.py`: `EVIDENCE_MAPPERS` in `app/nodes/investigate/processing/post_process.py` has mappers for Grafana, Datadog, CloudWatch, S3, Lambda, GitHub, Honeycomb, Coralogix and Vercel, but no entries for any `list_eks_*` / `get_eks_*` / `describe_eks_*` action name. Tool output is silently discarded by `merge_evidence()`. Small additive fix (~50 lines in a single file). Complete proposed patch in the issue body.
* #582, `[BUG] is_clearly_healthy short-circuit never fires for pure-EKS healthy states`: `_INVESTIGATED_EVIDENCE_KEYS` in `app/nodes/root_cause_diagnosis/evidence_checker.py` has no `eks_*` entries, so a pure-Kubernetes healthy state never short-circuits out of the reasoning LLM. Five-line fix. Depends on #581 landing first. Complete proposed patch in the issue body.
Because these are pre-existing and independent, this PR's end-to-end smoke test (`TestHarnessEndToEnd::test_placeholder_runs_through_full_pipeline`) only asserts on `datadog_*` evidence keys: the existing Datadog mappers are enough to trigger the healthy short-circuit for the 000-healthy placeholder. A scope-note in that test's class docstring points forward to the follow-up issues so the EKS assertions can be enabled once #581 lands.
Testing
All three gate commands pass locally on Python 3.12 and are the same commands CI runs:
Targeted verification of the new suite and every touched tool file:
The end-to-end smoke test drives the full LangGraph pipeline against 000-healthy with a canned planner plan, asserting that `detect_sources` picks up the injected backends, the executor routes each of `list_eks_pods` / `get_eks_events` / `list_eks_deployments` / `get_eks_node_health` / `query_datadog_logs` / `query_datadog_monitors` to its fixture mock, Datadog evidence flows into state, and `diagnose_root_cause` returns `root_cause_category: healthy` via the healthy short-circuit without any real LLM call. No Anthropic or OpenAI API key is required: `monkeypatch.setenv("ANTHROPIC_API_KEY", "sk-test-dummy")` is enough to satisfy `LLMSettings` because every real LLM call in the run is either mocked (plan_actions) or bypassed (diagnose_root_cause healthy short-circuit).
Screenshots of the UI changes (If any) -
N/A: no user-facing UI changes. This PR touches only test infrastructure plus the backend-injection seams in a handful of tool files and `detect_sources`. Production behaviour against real EKS / Datadog credentials is identical; the new code paths are only reachable when a `_backend` object is present in the integration dict, which is only produced by the synthetic harness.
Impact analysis
* `_eks_available_or_backend`, `_dd_available_or_backend`) or an additive kwarg with a default (`eks_backend=None`, `datadog_backend=None`) that only matters when explicitly set. Real-credential investigations take the unchanged code path.
* `if eks_backend is not None:` / `if datadog_backend is not None:` check at the top of its function, taken once per invocation.
* `.env` writes, no credentials stored anywhere.
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
Problem solved: OpenSRE already has an RDS Postgres synthetic suite under `tests/synthetic/rds_postgres/` that gives the team a reproducible, offline way to benchmark the agent's root-cause reasoning against fixture data, with no real cloud calls, no flaky APIs, and clear pass/fail per scenario. There was no equivalent for Kubernetes, so any improvement to the K8s investigation path could not be measured the same way. This PR adds the parallel infrastructure so follow-up issues can drop in K8s failure scenarios as scenario directories.
Alternatives considered:
* Mock at the Kubernetes Python SDK layer instead of the tool-function layer. Rejected: mocking `build_k8s_clients` would force every test to reason about raw K8s SDK objects rather than the higher-level tool response shapes the pipeline actually consumes. The issue text is explicit that "Response shapes must match what the EKS tools in `app/tools/EKS*/` and Datadog tools return", which points at the tool-function layer as the right seam.
* Mock at the `run_investigation` entry point using `unittest.mock.patch`. Rejected: it only exercises the glue and gives zero confidence that real scenarios will actually drive `detect_sources` → `plan_actions` → executor → tools correctly, which is the whole point of a synthetic suite.
* Wire backends into every EKS and Datadog tool upfront, even the ones the initial placeholder does not declare. Rejected to minimise blast radius. Only the 5 EKS tools and 2 Datadog tools whose output corresponds to a declared evidence source in the issue scope are wired. The unwired tools continue to use `connection_verified` alone as their availability gate, so they stay completely inactive in test mode and cannot accidentally hit real AWS or Datadog. When future scenarios need additional sources, the same short 3-line pattern is trivial to copy.
Why this implementation:
* `_backend` injection seam that the Grafana tools already use. No new abstractions are introduced; the pattern is one the team has already accepted during the RDS harness work, which minimises cognitive load for reviewers.
* `scenario_difficulty: 0`. The `test_level1..4_scenario` parametrizations collect zero cases until real failure scenarios land in follow-up issues. The placeholder's only job is to exercise the harness plumbing once, not to grade an LLM on a synthetic case the LLM was not trained or evaluated on.
Key components and their jobs:
* `K8sScenarioEvidence` + validators (`k8s_schemas.py`): structural gate between raw JSON fixture files and the typed container the loader returns. Raises `ValueError` with file-qualified context on any malformed fixture.
* `K8sScenarioFixture` + `load_scenario` / `load_all_scenarios` (`scenario_loader.py`): discover, validate, and return a typed snapshot of a scenario directory. Handles single-level base inheritance and file-level evidence fallback.
* `FixtureEKSBackend` / `FixtureDatadogBackend`: satisfy runtime-checkable Protocols, wrap scenario fixtures in the exact envelopes the real tool functions produce, and raise `ValueError` when the caller requests a source the scenario did not declare.
* `SelectiveEKSBackend` / `SelectiveDatadogBackend`: subclass their non-selective counterparts and record each tool invocation into an audit log for Axis 2 reasoning-quality checks.
* `run_scenario` in `run_suite.py`: builds the `resolved_integrations` dict with either fresh fixture backends or pre-built selective backends, delegates to `run_investigation`, collects any audit log produced by selective backends, and calls `score_result`.
* `score_trajectory` / `score_reasoning` / `score_result`: Axis 1 correctness (category, keywords, forbidden terms, required evidence sources, trajectory efficiency) and Axis 2 adversarial reasoning (ruling-out keywords plus required-query audit). Direct parallel of the RDS scorer with `_EVIDENCE_KEY_MAP` adjusted for K8s.
* `detect_sources.py` EKS and Datadog path changes: accept `_backend` in the incoming `resolved_integrations` dict, propagate it to the relevant `sources[...]` entry, and deliberately skip setting `connection_verified` in backend-only mode so only fixture-aware tools activate.
* `*_backend: Any = None` kwarg that short-circuits to the mock when set. Real-credential behaviour is completely untouched.
Checklist before requesting a review