[BUG] EKS tool output silently dropped by merge_evidence — no mappers in post_process.py #581

@ebrahim-sameh

Description

Summary

EKS investigation tools (list_eks_pods, get_eks_events, list_eks_deployments, get_eks_node_health, get_eks_pod_logs, and others under app/tools/EKS*/) execute successfully and return correctly shaped result dicts, but merge_evidence() in app/nodes/investigate/processing/post_process.py silently drops their output. The cause is that the EVIDENCE_MAPPERS registry has no entries for any list_eks_* / get_eks_* / describe_eks_* action name. As a result, state["evidence"] never gets any eks_* keys, and diagnose_root_cause builds its prompt without the pod status, events, deployment health, or node conditions gathered earlier in the investigation.

Expected vs actual behavior

Expected: When list_eks_pods runs during an investigation and returns its standard envelope

{
    "source": "eks",
    "available": True,
    "cluster_name": "...",
    "namespace": "...",
    "total_pods": 3,
    "pods": [...],
    "failing_pods": [...],
    "high_restart_pods": [...],
    "error": None,
}

the relevant fields should be written into state["evidence"] under dedicated keys (eks_pods, eks_failing_pods, eks_high_restart_pods, eks_total_pods), matching the pattern the Grafana and Datadog tools already follow via _map_grafana_logs, _map_grafana_metrics, _map_datadog_logs, _map_datadog_monitors, etc. diagnose_root_cause should then see EKS evidence when building its diagnosis prompt.

Actual: merge_evidence() at app/nodes/investigate/processing/post_process.py:346-360 looks up each executed action name in the EVIDENCE_MAPPERS dict:

mapper = EVIDENCE_MAPPERS.get(action_name)
if mapper:
    evidence.update(mapper(result.data))

If mapper is None, the result is simply not merged into evidence. There is no fall-through that stores the raw dict. The EVIDENCE_MAPPERS dict contains entries for the Grafana toolset, the Datadog toolset, get_cloudwatch_logs, S3 tools, Lambda tools, GitHub tools, Honeycomb, Coralogix, and Vercel — but no entries for any of the 11 EKS tools (describe_eks_addon, describe_eks_cluster, get_eks_deployment_status, get_eks_events, get_eks_node_health, get_eks_nodegroup_health, get_eks_pod_logs, list_eks_clusters, list_eks_deployments, list_eks_namespaces, list_eks_pods).
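The drop can be reproduced in isolation. The following is a minimal, self-contained sketch, not the code in post_process.py: the `Result` class, the body of `_map_datadog_logs`, the one-entry registry, and the `merge_evidence_sketch` name are stand-ins assumed for illustration. Only the lookup-and-skip logic mirrors the snippet quoted above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Result:
    """Stand-in for the real action result object (hypothetical shape)."""
    success: bool
    data: dict[str, Any] = field(default_factory=dict)


def _map_datadog_logs(data: dict) -> dict:
    # Illustrative mapper body; the real mapper may differ.
    return {"datadog_logs": data.get("logs", [])}


# Stand-in registry: there is no key for any EKS action.
EVIDENCE_MAPPERS: dict[str, Callable[[dict], dict]] = {
    "query_datadog_logs": _map_datadog_logs,
}


def merge_evidence_sketch(executed: dict[str, Result]) -> dict[str, Any]:
    """Apply the same lookup-and-skip logic as the quoted snippet."""
    evidence: dict[str, Any] = {}
    for action_name, result in executed.items():
        mapper = EVIDENCE_MAPPERS.get(action_name)
        if mapper:
            evidence.update(mapper(result.data))
        # No else branch: a result without a mapper is never stored.
    return evidence


executed = {
    "query_datadog_logs": Result(True, {"logs": ["timeout"]}),
    "list_eks_pods": Result(True, {"total_pods": 3, "pods": ["a", "b", "c"]}),
}
evidence = merge_evidence_sketch(executed)
print(evidence)  # prints {'datadog_logs': ['timeout']}
```

Both actions succeed, yet only the Datadog result reaches the evidence dict, and nothing signals that the EKS result was skipped.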

build_evidence_summary() at post_process.py:395-474 has the same gap: it has dedicated elif branches for Grafana, Datadog, S3, Lambda, CloudWatch, Honeycomb, Coralogix, and diagnostic-code tools, but none for EKS. The tracker message emitted at the end of node_investigate therefore never reports anything from EKS runs, even when the tools executed and returned data.

Downstream consequence: diagnose_root_cause reads state["evidence"] to build its prompt. Because the EKS tools' output was discarded, the agent effectively investigates every Kubernetes incident without access to the data it just gathered. It will diagnose as if no EKS telemetry existed — the opposite of the behavior the EKS toolset was added for.

Steps to reproduce

  1. Clone the repo and install dev dependencies: pip install -e ".[dev]".

  2. Run the following check in a Python shell from the repo root:

    from app.nodes.investigate.processing.post_process import EVIDENCE_MAPPERS
    
    eks_action_prefixes = ("list_eks", "get_eks", "describe_eks")
    eks_mappers = [k for k in EVIDENCE_MAPPERS if k.startswith(eks_action_prefixes)]
    datadog_mappers = [k for k in EVIDENCE_MAPPERS if k.startswith("query_datadog")]
    grafana_mappers = [k for k in EVIDENCE_MAPPERS if k.startswith("query_grafana")]
    
    print(f"EKS mappers registered:     {eks_mappers}")
    print(f"Datadog mappers registered: {datadog_mappers}")
    print(f"Grafana mappers registered: {grafana_mappers}")

  3. Observe the output:

    EKS mappers registered:     []
    Datadog mappers registered: ['query_datadog_logs', 'query_datadog_monitors', 'query_datadog_events', 'query_datadog_all']
    Grafana mappers registered: ['query_grafana_logs', 'query_grafana_traces', 'query_grafana_metrics', 'query_grafana_alert_rules', 'query_grafana_service_names']
    

    EKS is the only major evidence family with zero mappers.

  4. Integration-level reproducer: run any investigation flow in which the planner selects list_eks_pods (for example a Kubernetes alert where the aws integration supplies role_arn and the alert annotations include cluster_name). Log state["evidence"] after node_investigate returns. No eks_* keys will appear regardless of what the tool returned.
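For the integration-level check in step 4, a small helper can make the absence explicit. This is a sketch only: `eks_evidence_keys` is a hypothetical name, and the sample `state` dict is invented to resemble the buggy outcome, not captured from a real run.

```python
from __future__ import annotations


def eks_evidence_keys(state: dict) -> list[str]:
    """Return the sorted evidence keys that start with 'eks_'."""
    evidence = state.get("evidence") or {}
    return sorted(key for key in evidence if key.startswith("eks_"))


# Illustrative state shaped like the bug: Datadog evidence present, EKS absent.
state = {"evidence": {"datadog_logs": ["timeout"]}}
print(eks_evidence_keys(state))  # prints []
```

Once the mappers are registered, the same call on a state produced by a run that executed list_eks_pods should return a non-empty list.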

Can you reproduce it consistently?

Yes

How often does it occur?

Every time

Operating system

Linux

Logs and error output

N/A — the failure is silent. The EKS tool itself logs a line like

[eks] list_eks_pods cluster=payments-prod-eks ns=payments

and returns its populated dict successfully. The drop happens in the post-processing layer, which emits no log when a mapper is missing. This is a large part of why the gap has been easy to miss.
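To illustrate how this silent path could be surfaced, here is a hedged sketch of a helper that reports executed actions with no registered mapper. It is not part of the proposed fix and does not exist in the repository: `unmapped_actions`, the logger name, and the warning text are all assumptions made for illustration.

```python
from __future__ import annotations

import logging
from typing import Iterable, Mapping

logger = logging.getLogger("post_process_sketch")  # hypothetical logger name


def unmapped_actions(executed_names: Iterable[str], mappers: Mapping[str, object]) -> list[str]:
    """Return executed action names that have no registered mapper, logging each one."""
    missing = sorted(name for name in executed_names if name not in mappers)
    for name in missing:
        logger.warning("no evidence mapper for action %r; result not merged", name)
    return missing


# Stand-in registry with a single Datadog entry and no EKS entries.
registry = {"query_datadog_logs": lambda data: data}
missing = unmapped_actions(["query_datadog_logs", "list_eks_pods"], registry)
print(missing)  # prints ['list_eks_pods']
```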

Additional context

Proposed fix (additive, low-risk, suitable for a first-time contributor — follows the exact pattern of the existing _map_datadog_logs / _map_datadog_monitors mappers):

  1. Add five mapper functions to app/nodes/investigate/processing/post_process.py:

    def _map_eks_pods(data: dict) -> dict:
        return {
            "eks_pods": data.get("pods", []),
            "eks_failing_pods": data.get("failing_pods", []),
            "eks_high_restart_pods": data.get("high_restart_pods", []),
            "eks_total_pods": data.get("total_pods", 0),
        }
    
    
    def _map_eks_events(data: dict) -> dict:
        return {
            "eks_events": data.get("warning_events", []),
            "eks_total_warning_count": data.get("total_warning_count", 0),
        }
    
    
    def _map_eks_deployments(data: dict) -> dict:
        return {
            "eks_deployments": data.get("deployments", []),
            "eks_degraded_deployments": data.get("degraded_deployments", []),
            "eks_total_deployments": data.get("total_deployments", 0),
        }
    
    
    def _map_eks_node_health(data: dict) -> dict:
        return {
            "eks_node_health": data.get("nodes", []),
            "eks_not_ready_count": data.get("not_ready_count", 0),
            "eks_total_nodes": data.get("total_nodes", 0),
        }
    
    
    def _map_eks_pod_logs(data: dict) -> dict:
        return {
            "eks_pod_logs": data.get("logs", ""),
            "eks_pod_logs_pod_name": data.get("pod_name", ""),
            "eks_pod_logs_namespace": data.get("namespace", ""),
        }

  2. Register each in the EVIDENCE_MAPPERS dict alongside the existing entries:

    "list_eks_pods": _map_eks_pods,
    "get_eks_events": _map_eks_events,
    "list_eks_deployments": _map_eks_deployments,
    "get_eks_node_health": _map_eks_node_health,
    "get_eks_pod_logs": _map_eks_pod_logs,

  3. Add corresponding elif branches to build_evidence_summary() so the tracker reports EKS activity in the same "source:count ..." style the Grafana and Datadog branches use. Example:

    elif action_name == "list_eks_pods" and data.get("pods") is not None:
        failing = len(data.get("failing_pods") or [])  # tolerate an explicit None
        summary_parts.append(f"eks:{data.get('total_pods', 0)} pods ({failing} failing)")
    elif action_name == "get_eks_events" and data.get("warning_events") is not None:
        summary_parts.append(f"eks:{data.get('total_warning_count', 0)} warning events")
    # ... one branch per EKS tool

  4. Add a unit test that feeds a canned ActionExecutionResult(success=True, data={...}) into merge_evidence for each of the five EKS action names and asserts that the expected keys appear in the returned evidence dict. There is no test file for post_process.py yet, so creating tests/nodes/investigate/test_post_process.py with parametrized cases (one per tool) is a reasonable entry point. If Datadog mapper tests exist elsewhere in the suite, they can serve as a reference; otherwise the pattern is straightforward.
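The test described in step 4 might look roughly like the sketch below. So that it runs on its own, it defines stand-ins rather than importing from the app: the `ActionExecutionResult` dataclass carries only the two fields named above, the `merge_evidence` signature (a dict of action name to result) is an assumption, and only two of the five mappers are copied from step 1 to keep it short. A real test would import these from post_process.py instead.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class ActionExecutionResult:
    """Stand-in with only the two fields the issue mentions."""
    success: bool
    data: dict[str, Any] = field(default_factory=dict)


def _map_eks_pods(data: dict) -> dict:
    return {
        "eks_pods": data.get("pods", []),
        "eks_failing_pods": data.get("failing_pods", []),
        "eks_high_restart_pods": data.get("high_restart_pods", []),
        "eks_total_pods": data.get("total_pods", 0),
    }


def _map_eks_events(data: dict) -> dict:
    return {
        "eks_events": data.get("warning_events", []),
        "eks_total_warning_count": data.get("total_warning_count", 0),
    }


EVIDENCE_MAPPERS = {"list_eks_pods": _map_eks_pods, "get_eks_events": _map_eks_events}


def merge_evidence(executed: dict[str, ActionExecutionResult]) -> dict[str, Any]:
    """Stand-in for the real function; the signature is assumed."""
    evidence: dict[str, Any] = {}
    for action_name, result in executed.items():
        mapper = EVIDENCE_MAPPERS.get(action_name)
        if mapper:
            evidence.update(mapper(result.data))
    return evidence


# One case per tool: (action name, canned tool output, keys expected in evidence).
CASES = [
    (
        "list_eks_pods",
        {"pods": [{"name": "api-0"}], "failing_pods": [], "high_restart_pods": [], "total_pods": 1},
        {"eks_pods", "eks_failing_pods", "eks_high_restart_pods", "eks_total_pods"},
    ),
    (
        "get_eks_events",
        {"warning_events": [{"reason": "BackOff"}], "total_warning_count": 1},
        {"eks_events", "eks_total_warning_count"},
    ),
]


def test_eks_mappers_populate_evidence() -> None:
    for action_name, data, expected_keys in CASES:
        evidence = merge_evidence({action_name: ActionExecutionResult(True, data)})
        assert expected_keys.issubset(evidence), action_name


test_eks_mappers_populate_evidence()
```

Under pytest, the `CASES` list could be passed to `pytest.mark.parametrize` so that each tool reports as its own test case; the plain loop is used here only to avoid a third-party dependency in the sketch.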

Why this matters now: the Kubernetes synthetic test harness being built under #260 assumes EKS tool output flows into state["evidence"] before diagnose_root_cause runs. The follow-up scenario issues #261, #262, #263 will declare required_evidence_sources: [eks_pods, eks_events, ...] in their answer.yml files. Until this is fixed, those scenarios will fail the scorer's evidence check even when the agent calls the right tools.

Related: #260, #261, #262, #263. Scope of this fix: the five fixture-supported tools above. Wiring the remaining six EKS tools (list_eks_clusters, list_eks_namespaces, describe_eks_cluster, describe_eks_addon, get_eks_deployment_status, get_eks_nodegroup_health) is out of scope for this issue and can be handled in a separate follow-up once scenarios that need them exist.
