Summary
The AWS Billing integration for Elastic Agent appears to stop and start repeatedly, which causes a large number of GetCostAndUsage API calls. This in turn incurs significant charges.
Description
We have recently been retesting the Billing part of the AWS Integration for Elastic Agent in our K8s environment.
The Elastic Agents run as a DaemonSet in the k8s cluster and are all managed by Fleet.
The versions we are running are as follows:
Elastic Agent/Cluster Version: 8.7.1
AWS Integration Version: 1.42.0
EKS version: 1.25
We have configured the Billing settings with a period of 24 hours.
Unfortunately, since re-enabling the Billing facet of the AWS Integration we have observed the AWS Billing process repeatedly stopping and starting. Here is an excerpt from the Elastic Agent logs:
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-.......... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138198' exited with code '-1'
[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-......................: Starting: spawned pid '138687'
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-..................... (STARTING->HEALTHY): Healthy
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-........................... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138687' exited with code '-1'
[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-.........................: Starting: spawned pid '139045'
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-......... (STARTING->HEALTHY): Healthy
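To put a number on the restart loop, the log excerpt above can be tallied with a short script. This is a minimal sketch, not part of Elastic Agent: the regular expressions are written against the exact line shapes in the excerpt, the function name `summarize_restarts` is made up for illustration, and the unit names stay truncated exactly as they were pasted.

```python
import re
from collections import Counter

# Patterns written against the excerpt above; other agent versions may log differently.
STATE_CHANGE = re.compile(
    r"Unit state changed (?P<unit>\S+) \((?P<old>\w+)->(?P<new>\w+)\)"
)
SPAWNED = re.compile(
    r"Spawned new unit (?P<unit>\S+): Starting: spawned pid '(?P<pid>\d+)'"
)

# The six lines from the excerpt, copied verbatim (unit names are elided in the source).
SAMPLE_LOG = """\
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-.......... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138198' exited with code '-1'
[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-......................: Starting: spawned pid '138687'
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-..................... (STARTING->HEALTHY): Healthy
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-........................... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138687' exited with code '-1'
[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-.........................: Starting: spawned pid '139045'
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-......... (STARTING->HEALTHY): Healthy
"""


def summarize_restarts(lines):
    """Count state transitions and collect the PIDs of spawned units."""
    transitions = Counter()
    pids = []
    for line in lines:
        match = STATE_CHANGE.search(line)
        if match:
            transitions[(match["old"], match["new"])] += 1
            continue
        match = SPAWNED.search(line)
        if match:
            pids.append(int(match["pid"]))
    return transitions, pids


if __name__ == "__main__":
    transitions, pids = summarize_restarts(SAMPLE_LOG.splitlines())
    for (old, new), count in sorted(transitions.items()):
        print(f"{old}->{new}: {count}")
    print("spawned pids:", pids)
```

Each `HEALTHY->STOPPED` transition followed by a new PID is one restart of the billing unit. Running the same tally over a full day of logs would show how often the unit is being respawned.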
As a result of these restarts, we are seeing repeated GetCostAndUsage API calls flagged in CloudTrail.

This activity is subsequently reflected in the large Cost Explorer spikes we are seeing.

This $1,000+ cost spike is unfortunately entirely related to the AWS Billing Integration.
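For a sense of scale, the spend can be converted back into an approximate request count. This is a rough sketch under a stated assumption: Cost Explorer API requests are billed at a flat $0.01 per paginated request, which is AWS's published price as far as we know and should be checked against the current pricing page. The helper names and the one-day window in the usage example are hypothetical; the issue does not state how long the spike lasted.

```python
# Assumption: flat per-request pricing for the Cost Explorer API.
# 1 cent per paginated request; verify against current AWS pricing.
PRICE_PER_REQUEST_CENTS = 1


def requests_for_spend(spend_usd: int, price_cents: int = PRICE_PER_REQUEST_CENTS) -> int:
    """Number of billed requests that would add up to spend_usd."""
    return (spend_usd * 100) // price_cents


def average_rate_per_minute(requests: int, days: int) -> float:
    """Average requests per minute if the requests were spread over `days` days."""
    return requests / (days * 24 * 60)


if __name__ == "__main__":
    billed = requests_for_spend(1000)
    print("requests behind a $1,000 charge:", billed)
    # Hypothetical window: assumes the whole spike happened within one day.
    print("average per minute over 1 day:", round(average_rate_per_minute(billed, 1), 1))
```

Under that pricing assumption, $1,000 corresponds to roughly 100,000 billed requests, far more than a collector with a 24-hour period would be expected to make.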
We initially thought that the issue could be related to the pod resource limits which are set up for our Elastic Agents (see below).

However, we monitored the Elastic Agent pods using watch during the period when the Billing Integration was enabled, and this did not indicate that the resource limits were being hit.

We also watched for events in the namespace where the elastic-agents are running and we didn't observe any tell-tale events such as OOM.
In the snapshot from CloudTrail above, I have highlighted that we quickly hit the threshold for GetCostAndUsage API calls. This is represented by the error code ThrottlingException.
Below is more detail on the ThrottlingException error:

Note that the group by dimension keys which are referenced in this event are not always the same.
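The share of throttled calls can be tallied from exported CloudTrail records. The following is a minimal sketch, assuming the events are available as CloudTrail-style JSON with a top-level `Records` list and the standard `eventName` and `errorCode` fields. The sample records are invented purely for illustration and are not taken from our account, and `tally_calls` is a hypothetical helper name.

```python
import json
from collections import Counter


def tally_calls(records, event_name="GetCostAndUsage"):
    """Count outcomes per errorCode for one API; calls without errorCode count as 'Success'."""
    outcomes = Counter()
    for record in records:
        if record.get("eventName") != event_name:
            continue
        outcomes[record.get("errorCode", "Success")] += 1
    return outcomes


# Invented sample data in the shape of a CloudTrail log file.
SAMPLE_EXPORT = json.dumps({
    "Records": [
        {"eventName": "GetCostAndUsage"},
        {"eventName": "GetCostAndUsage", "errorCode": "ThrottlingException"},
        {"eventName": "GetCostAndUsage", "errorCode": "ThrottlingException"},
        {"eventName": "DescribeInstances"},
    ]
})

if __name__ == "__main__":
    records = json.loads(SAMPLE_EXPORT)["Records"]
    for outcome, count in tally_calls(records).most_common():
        print(f"{outcome}: {count}")
```

Grouping the same records by their request parameters would also show how the group-by dimension keys vary between calls.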