fix: plumb EKS tool output into state evidence via post_process mappers #584
The EKS investigation tools under app/tools/EKS*/ already return their
results in a consistent envelope shape, but merge_evidence() in
app/nodes/investigate/processing/post_process.py has no EVIDENCE_MAPPERS
entries for any list_eks_* / get_eks_* / describe_eks_* action name.
The mapper dict contains entries for Grafana, Datadog, CloudWatch, S3,
Lambda, GitHub, Honeycomb, Coralogix and Vercel — EKS was missed.
As a result, successfully-executed EKS tools silently drop their data
on the floor: merge_evidence loops over execution_results, looks up
the action name in EVIDENCE_MAPPERS, gets None back, and never stores
the result in state["evidence"]. Downstream, diagnose_root_cause builds
its prompt from a state dict that has zero eks_* keys regardless of
what the tools returned. The agent effectively investigates Kubernetes
incidents without access to the data it just gathered.
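The failure mode described above can be sketched as follows. This is a minimal illustration, not the code in post_process.py: the ActionExecutionResult fields (success, data), the dict shape of execution_results, the query_datadog_logs action name, and the tool field names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ActionExecutionResult:
    """Hypothetical stand-in for the real result type; field names are assumed."""
    success: bool
    data: dict[str, Any] = field(default_factory=dict)


def _map_datadog_logs(data: dict[str, Any]) -> dict[str, Any]:
    # Simplified stand-in for one of the existing mappers.
    return {"datadog_logs": data.get("logs", [])}


# Registry as it looked before the patch: no EKS action names are present.
EVIDENCE_MAPPERS: dict[str, Callable[[dict[str, Any]], dict[str, Any]]] = {
    "query_datadog_logs": _map_datadog_logs,  # illustrative action name
}


def merge_evidence(
    evidence: dict[str, Any],
    execution_results: dict[str, ActionExecutionResult],
) -> dict[str, Any]:
    merged = dict(evidence)
    for action_name, result in execution_results.items():
        if not result.success:
            continue  # failed actions contribute nothing
        mapper = EVIDENCE_MAPPERS.get(action_name)
        if mapper is None:
            continue  # no mapper registered: the result is silently dropped
        merged.update(mapper(result.data))
    return merged


results = {
    "query_datadog_logs": ActionExecutionResult(True, {"logs": ["timeout"]}),
    "list_eks_pods": ActionExecutionResult(True, {"pods": [{"name": "api-1"}]}),
}
evidence = merge_evidence({}, results)
```

In this sketch the list_eks_pods call succeeded, yet evidence ends up with only the Datadog key, which is the behaviour the patch sets out to fix.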
This patch closes the gap for the five fixture-supported EKS tools:
* list_eks_pods → eks_pods / eks_failing_pods / eks_high_restart_pods / eks_total_pods
* get_eks_events → eks_events / eks_total_warning_count
* list_eks_deployments → eks_deployments / eks_degraded_deployments / eks_total_deployments
* get_eks_node_health → eks_node_health / eks_not_ready_count / eks_total_nodes
* get_eks_pod_logs → eks_pod_logs / eks_pod_logs_pod_name / eks_pod_logs_namespace
Each mapper follows the same shape as the existing Datadog mappers
(_map_datadog_logs, _map_datadog_monitors, etc.): read the fields the
corresponding tool function populates, write them to dedicated state
keys. Registered in the EVIDENCE_MAPPERS dict alongside the existing
entries.
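A mapper in this style might look like the sketch below. The state keys on the left come from the list above; the tool-side field names read on the right (pods, high_restart_pods, total_pods, total_warning_count) are assumptions for illustration, as is the EKS_MAPPERS name, which stands in for the new entries added to EVIDENCE_MAPPERS.

```python
from typing import Any


def _map_eks_pods(data: dict[str, Any]) -> dict[str, Any]:
    # Input field names are assumed, not copied from the tool's return statement.
    return {
        "eks_pods": data.get("pods", []),
        "eks_failing_pods": data.get("failing_pods", []),
        "eks_high_restart_pods": data.get("high_restart_pods", []),
        "eks_total_pods": data.get("total_pods", 0),
    }


def _map_eks_events(data: dict[str, Any]) -> dict[str, Any]:
    return {
        "eks_events": data.get("warning_events", []),
        "eks_total_warning_count": data.get("total_warning_count", 0),
    }


# Hypothetical slice of the registry: action name -> mapper function.
EKS_MAPPERS = {
    "list_eks_pods": _map_eks_pods,
    "get_eks_events": _map_eks_events,
}

populated = _map_eks_pods(
    {
        "pods": [{"name": "api-1"}, {"name": "api-2"}],
        "failing_pods": [{"name": "api-2"}],
        "high_restart_pods": [],
        "total_pods": 2,
    }
)
# A partial or empty tool result falls back to empty lists and zero counts.
defaults = _map_eks_pods({})
```

Because every read goes through .get() with a default, an empty or partial result yields empty collections and zero counts rather than raising KeyError.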
build_evidence_summary() gets matching elif branches so the tracker
message ("eks:3 pods (0 failing)", "eks:2 deployments (1 degraded)",
...) mirrors the existing Datadog / Grafana branches.
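As a rough sketch of how such branches could produce those strings: summarize_eks below is a hypothetical helper, not the real build_evidence_summary(), and the total_pods and total_deployments field names are assumed.

```python
from typing import Any, Optional


def summarize_eks(action_name: str, data: dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: return one tracker fragment, or None if nothing applies."""
    if action_name == "list_eks_pods" and data.get("failing_pods") is not None:
        failing = len(data["failing_pods"])
        return f"eks:{data.get('total_pods', 0)} pods ({failing} failing)"
    elif (
        action_name == "list_eks_deployments"
        and data.get("degraded_deployments") is not None
    ):
        degraded = len(data["degraded_deployments"])
        return f"eks:{data.get('total_deployments', 0)} deployments ({degraded} degraded)"
    return None


healthy_pods = summarize_eks("list_eks_pods", {"total_pods": 3, "failing_pods": []})
one_degraded = summarize_eks(
    "list_eks_deployments",
    {"total_deployments": 2, "degraded_deployments": [{"name": "web"}]},
)
```

The healthy case still yields a fragment with a zero count, so the tracker shows that the tool ran.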
Wiring the remaining six EKS tools (list_eks_clusters,
list_eks_namespaces, describe_eks_cluster, describe_eks_addon,
get_eks_deployment_status, get_eks_nodegroup_health) is deliberately
out of scope until a scenario needs them — same rationale as the scope
boundary in the parallel synthetic K8s harness work under Tracer-Cloud#260.
A new test module at tests/nodes/investigate/test_post_process.py
covers:
* each EKS action is registered in EVIDENCE_MAPPERS (5 cases)
* each mapper returns the expected keys when given a representative
successful tool result
* each mapper returns sensible defaults when fields are missing
* merge_evidence() skips failed results (ActionExecutionResult with
success=False) for EKS actions the same way it does for other tools
* build_evidence_summary() emits the expected human-readable strings
for every EKS tool
Fixes Tracer-Cloud#581.
Greptile Summary
This PR wires five EKS investigation tools into the evidence pipeline by adding mapper functions,
Confidence Score: 5/5
Safe to merge: purely additive change with no modifications to existing code paths.
All five mapper field names were verified against the corresponding EKS tool return statements and match exactly. Defaults are sensible (empty list / 0). The EVIDENCE_MAPPERS registry entries and build_evidence_summary branches are consistent with each other and with existing patterns. 19 new tests cover mapper correctness, failed-result skipping, missing-fields defaults, and summary output. No existing mappers, keys, or code paths are modified.
No files require special attention.
Pull request overview
This PR fixes missing post-processing plumbing for EKS investigation tools so their successful outputs are merged into state["evidence"] and reflected in the investigator’s evidence summary (Fixes #581).
Changes:
- Add five EKS evidence mapper functions and register them in EVIDENCE_MAPPERS.
- Extend build_evidence_summary() with EKS-specific summary branches.
- Add unit tests covering EKS mapper registration, merge behavior, and summary rendering.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| app/nodes/investigate/processing/post_process.py | Adds EKS mappers, registers them, and emits EKS collection summaries so tool output is preserved and visible downstream. |
| tests/nodes/investigate/test_post_process.py | Adds focused tests verifying EKS evidence mapping/merging and evidence-summary strings. |
| tests/nodes/investigate/__init__.py | Package marker for the new test module. |
        summary_parts.append(
            f"eks:{data.get('total_nodes', 0)} nodes ({not_ready} not ready)"
        )
    elif action_name == "get_eks_pod_logs" and data.get("logs"):
build_evidence_summary() only emits the get_eks_pod_logs summary when data.get("logs") is truthy. Since the EKS tool returns logs as a string, a successful call that returns an empty string (valid when the container has no log output in the requested window) will currently produce no EKS summary entry, which makes the tracker output look like the tool never ran. Consider checking data.get("logs") is not None (or checking for the presence of the logs key) and emitting eks:0 log lines from <pod> when empty, to stay consistent with the other EKS branches that report zero counts.
Suggested change:
-    elif action_name == "get_eks_pod_logs" and data.get("logs"):
+    elif action_name == "get_eks_pod_logs" and data.get("logs") is not None:
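The two conditions in this suggestion differ only when logs is an empty string. A small sketch of the difference, assuming a pod_name field on the tool result and counting lines with str.splitlines() (how the real branch counts lines is not shown in this review):

```python
from typing import Any, Optional


def emits_with_truthiness(data: dict[str, Any]) -> bool:
    # Current condition: an empty string is falsy, so the branch is skipped.
    return bool(data.get("logs"))


def emits_with_none_check(data: dict[str, Any]) -> bool:
    # Suggested condition: only a missing key (or explicit None) skips the branch.
    return data.get("logs") is not None


def log_summary(data: dict[str, Any]) -> Optional[str]:
    """Hypothetical summary fragment using the suggested condition."""
    logs = data.get("logs")
    if logs is None:
        return None
    line_count = len(logs.splitlines())  # "" has zero lines
    return f"eks:{line_count} log lines from {data.get('pod_name', 'unknown')}"


quiet_pod = {"logs": "", "pod_name": "payments-api-7f9"}
quiet_summary = log_summary(quiet_pod)
```

With the suggested check, a successful call that returned no output still produces a zero-count entry instead of disappearing from the tracker.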
Fixes #581
Describe the changes you have made in this PR -
The EKS investigation tools under app/tools/EKS*/ already return their results in a consistent envelope shape, but merge_evidence() in app/nodes/investigate/processing/post_process.py has no EVIDENCE_MAPPERS entries for any of the list_eks_* / get_eks_* / describe_eks_* action names. The mapper dict contains entries for Grafana, Datadog, CloudWatch, S3, Lambda, GitHub, Honeycomb, Coralogix and Vercel; EKS was missed.
As a result, successfully-executed EKS tools silently drop their data on the floor: merge_evidence loops over execution_results, looks up the action name in EVIDENCE_MAPPERS, gets None back, and never stores the result in state["evidence"]. Downstream, diagnose_root_cause builds its prompt from a state dict that has zero eks_* keys regardless of what the tools returned. The agent effectively investigates Kubernetes incidents without access to the data it just gathered, which is the opposite of what the EKS toolset was added for.
build_evidence_summary() had the same gap: no elif branches for EKS, so the tracker message emitted at the end of node_investigate never mentioned any EKS activity either.
What this PR changes
app/nodes/investigate/processing/post_process.py: five new mapper functions, five new registry entries, and five new summary branches. No other files under app/ are touched.
New mapper functions (placed next to the existing _map_datadog_* mappers, same shape):
* _map_eks_pods(data) → {eks_pods, eks_failing_pods, eks_high_restart_pods, eks_total_pods}
* _map_eks_events(data) → {eks_events, eks_total_warning_count}
* _map_eks_deployments(data) → {eks_deployments, eks_degraded_deployments, eks_total_deployments}
* _map_eks_node_health(data) → {eks_node_health, eks_not_ready_count, eks_total_nodes}
* _map_eks_pod_logs(data) → {eks_pod_logs, eks_pod_logs_pod_name, eks_pod_logs_namespace}
Each one reads the fields the corresponding tool function populates (verified by reading each tool's return statement in app/tools/EKS*Tool/__init__.py) and writes them to dedicated state keys with .get(..., <sensible default>) so a missing or partial tool result does not raise.
The mappers are registered in EVIDENCE_MAPPERS alongside the existing entries.
New elif branches in build_evidence_summary() make the tracker output mirror the existing Grafana / Datadog style:
* list_eks_pods → eks:3 pods (1 failing)
* get_eks_events → eks:2 warning events
* list_eks_deployments → eks:2 deployments (1 degraded)
* get_eks_node_health → eks:3 nodes (1 not ready)
* get_eks_pod_logs → eks:42 log lines from payments-api-7f9
Scope boundary
Wiring the remaining six EKS tools (list_eks_clusters, list_eks_namespaces, describe_eks_cluster, describe_eks_addon, get_eks_deployment_status, get_eks_nodegroup_health) is deliberately out of scope for this PR. Those tools surface supplementary data (cluster inventory, addon versions, nodegroup health metadata) that no current investigation or synthetic scenario consumes. When a future scenario or code path needs one, the same 5-line mapper pattern applies and can be added in a follow-up PR.
Testing
All three gate commands pass locally on Python 3.12.
The baseline before this PR was 2137 passed. The delta of 19 matches the 19 new tests added in tests/nodes/investigate/test_post_process.py.
The new test module covers:
* TestEKSMappersRegistered: asserts each of the 5 EKS action names is present in EVIDENCE_MAPPERS (one test per action, so adding a sixth mapper later catches removal regressions cleanly).
* TestListEKSPodsMapper: feeds a representative list_eks_pods result (3 pods, 1 failing, 0 high-restart) and asserts all four eks_* keys appear with the expected values; a second test covers the "missing fields" defaults path.
* TestGetEKSEventsMapper: a Warning-events-present case (OOMKilled) and a healthy no-events case.
* TestListEKSDeploymentsMapper: mixed case with one healthy and one degraded deployment, asserts the degraded_deployments subset is the right one.
* TestGetEKSNodeHealthMapper: two nodes, one not ready, asserts eks_not_ready_count == 1 and eks_total_nodes == 2.
* TestGetEKSPodLogsMapper: three-line log fixture with a fatal marker, asserts the marker round-trips to eks_pod_logs.
* TestMergeEvidenceSkipsFailedResults: passes a failed ActionExecutionResult for list_eks_pods and asserts no eks_* keys appear in the returned evidence dict (matches the existing if not result.success: continue behaviour at post_process.py:349).
* TestBuildEvidenceSummaryEKS: six cases, one per tool summary branch, asserting the exact substring the tracker will emit.
Screenshots of the UI changes (If any) -
N/A — no user-facing UI changes. This PR touches the evidence post-processing layer only. Production behaviour for Grafana, Datadog, CloudWatch and every other existing evidence source is identical: the change is additive.
Impact analysis
elif branches. No existing keys, functions or code paths are modified. Real-credential EKS investigations that were previously silently discarding tool output will now populate state["evidence"]["eks_*"] as they should always have done. If any downstream consumer was coincidentally relying on the absence of eks_* keys (unlikely but possible), they would see the new keys appear, hence flagged in this section rather than assumed to be a non-issue.
.env writes, no credentials handled.
Related issues
* #260 (Build the K8s synthetic test harness; parallel PR #583, feat: add Kubernetes synthetic RCA test harness): the synthetic harness depends on EKS evidence actually flowing into state["evidence"] before diagnose_root_cause runs. Without the mappers added here, scenarios declaring required_evidence_sources: [eks_pods, eks_events, ...] in their answer.yml fail the scorer's evidence check regardless of whether the agent called the right tools. Once both #583 and this PR land, the end-to-end smoke test in the harness PR can be extended to assert on the eks_* evidence keys.
* #582 ([BUG] is_clearly_healthy short-circuit never fires for pure-EKS healthy states; eks_* keys missing from _INVESTIGATED_EVIDENCE_KEYS): sister gap in app/nodes/root_cause_diagnosis/evidence_checker.py. _INVESTIGATED_EVIDENCE_KEYS has no eks_* entries, so is_clearly_healthy never fires the healthy short-circuit for pure-Kubernetes healthy states. Filed as a separate bug and handled in its own PR to keep scopes clean. The ordering is: this PR first (so state["evidence"] actually gets eks_* keys), then #582 (so the short-circuit recognises them).
* #261 (K8s scenarios: CrashLoopBackOff, OOMKilled, ImagePullBackOff) / #262 (K8s scenarios: Node NotReady, Pending Pods, Stuck Rollouts) / #263 (K8s scenarios: Eviction, DNS failures, Probe failures, Quota limits): real Kubernetes failure scenarios. These depend on the plumbing in this PR to run end-to-end and grade correctly.
Code Understanding and AI Usage
Did you use AI assistance (ChatGPT, Claude, Copilot, etc.) to write any part of this code?
If you used AI assistance:
Explain your implementation approach:
Problem solved: EKS investigation tools were silently dropping their output at the merge_evidence step because the EVIDENCE_MAPPERS registry in post_process.py had no entries for any EKS action name. This was discovered while scoping the Kubernetes synthetic test harness work (#260): I noticed the harness would never see EKS evidence in state["evidence"] no matter what the tools returned, traced it back to the mapper gap, and filed this issue so the fix could be reviewed as a standalone scoped change rather than bundled into the harness PR.
Alternatives considered:
* Add a fallback in merge_evidence that stores raw tool results under <action_name>_raw keys when no mapper exists. Rejected: it hides the real bug (missing mapper) and creates a different problem (diagnose_root_cause does not look at arbitrary *_raw keys), so the agent still would not see the data.
* Write a single generic EKS mapper that stores the entire result dict under one key per tool. Rejected: breaks the pattern every other evidence family follows (one dict key per conceptual evidence slice, e.g. datadog_logs and datadog_error_logs are separate keys), and makes downstream consumers in diagnose_root_cause / build_diagnosis_prompt harder to write.
* Fix all 11 EKS tools' mappers in one pass, including the six tools whose output no current scenario consumes. Rejected to keep the PR focused. The unused six can land in a follow-up whenever a scenario needs them. Shipping the five that matter now unblocks #260 / #261 / #262 / #263 immediately.
Why this implementation:
* Follows the exact pattern of the existing Datadog mappers. _map_datadog_logs and _map_datadog_monitors are the closest analogues because Datadog tools return similar envelope shapes (a top-level dict with source, available, typed fields, error). Each new mapper mirrors that style so the whole file remains visually consistent and reviewers already know what to look for.
* Uses .get() with sensible defaults rather than data["key"] so a partial tool result (e.g. an EKS call that succeeded structurally but returned zero pods) populates the evidence dict cleanly with empty collections instead of raising KeyError. The existing Datadog mappers do the same thing (data.get("logs", []), data.get("total", 0)).
* build_evidence_summary branches use data.get(X) is not None rather than truthiness for list fields that can legitimately be empty on healthy scenarios (warning_events=[], failing_pods=[]). This avoids the tracker silently swallowing healthy-state summaries. The counting cases ("eks:3 pods (0 failing)") are more informative than an empty summary line.
Key components and their jobs:
* _map_eks_pods(data): projects a list_eks_pods tool result into four state keys. eks_total_pods is the raw count, eks_pods is the full list, eks_failing_pods is the subset whose phase is not Running or Succeeded, eks_high_restart_pods is the subset whose containers have more than 3 restarts. These subset splits are computed by the tool itself in app/tools/EKSListPodsTool/__init__.py:81-83, so the mapper just passes them through.
* _map_eks_events(data): projects get_eks_events output. eks_events is the Warning-event list, eks_total_warning_count is the raw count. The tool filters at source so only type == "Warning" events appear.
* _map_eks_deployments(data): projects list_eks_deployments. eks_deployments is all deployments, eks_degraded_deployments is the subset where unavailable > 0 or ready < desired (computed by the tool at app/tools/EKSListDeploymentsTool/__init__.py:70-72).
* _map_eks_node_health(data): projects get_eks_node_health. eks_node_health is the full node list (with flattened string condition fields, per the tool's output shape), eks_not_ready_count is the integer count of nodes whose Ready condition is not "True" (computed by the tool at app/tools/EKSNodeHealthTool/__init__.py:72).
* _map_eks_pod_logs(data): projects get_eks_pod_logs. Flat string log output plus the pod name and namespace so downstream consumers can cite the source.
The EVIDENCE_MAPPERS registry additions are inserted alphabetically-adjacent to the existing Datadog block, matching the file's implicit grouping-by-evidence-family ordering.
The new build_evidence_summary branches are placed in the same Datadog-adjacent block so they show up in the same relative position when the tracker emits them.
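The tool-side subset rules listed under the key components can be sketched as below. This is not the code in the EKS tool modules: the helper names and the per-item field names (phase, restarts, unavailable, ready, desired) are assumptions, and restarts is treated as a single flattened integer per pod.

```python
from typing import Any

Item = dict[str, Any]


def split_pods(pods: list[Item]) -> tuple[list[Item], list[Item]]:
    # Failing: phase is neither Running nor Succeeded.
    failing = [p for p in pods if p.get("phase") not in ("Running", "Succeeded")]
    # High restart: more than 3 restarts (assumed flattened per-pod count).
    high_restart = [p for p in pods if p.get("restarts", 0) > 3]
    return failing, high_restart


def degraded_deployments(deployments: list[Item]) -> list[Item]:
    return [
        d
        for d in deployments
        if d.get("unavailable", 0) > 0 or d.get("ready", 0) < d.get("desired", 0)
    ]


def not_ready_count(nodes: list[Item]) -> int:
    # Condition fields are flattened strings, so compare against "True".
    return sum(1 for n in nodes if n.get("ready") != "True")


pods = [
    {"name": "a", "phase": "Running", "restarts": 0},
    {"name": "b", "phase": "Failed", "restarts": 5},
    {"name": "c", "phase": "Running", "restarts": 4},
]
failing, high_restart = split_pods(pods)
degraded = degraded_deployments(
    [
        {"name": "web", "desired": 3, "ready": 3, "unavailable": 0},
        {"name": "api", "desired": 2, "ready": 1, "unavailable": 1},
    ]
)
unready = not_ready_count([{"name": "n1", "ready": "True"}, {"name": "n2", "ready": "False"}])
```

Since the tools already perform these splits, the mappers only need to copy the resulting fields into state keys.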