AWS Billing Integration causes large spike in CostExplorer charges #7350

@AndyDevman

Summary

The AWS Billing Integration for Elastic Agent appears to stop and start repeatedly, which causes a large number of GetCostAndUsage API calls. This in turn incurs significant charges.

Description

We have recently been retesting the Billing part of the AWS Integration for Elastic Agent in our K8s environment.

We have the Elastic Agents running as a daemonset in the k8s cluster and they are all managed by Fleet.

The versions we are running are as follows:

Elastic Agent/Cluster Version: 8.7.1
AWS Integration Version: 1.42.0
EKS version: 1.25

We have configured the Billing settings with a period of 24 hours.

Unfortunately, since re-enabling the Billing facet of the AWS Integration, we have observed repeated stopping and starting of the AWS Billing process. Here is an excerpt from the Elastic Agent logs:

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-.......... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138198' exited with code '-1'

[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-......................: Starting: spawned pid '138687'

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-..................... (STARTING->HEALTHY): Healthy

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-........................... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138687' exited with code '-1'

[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-.........................: Starting: spawned pid '139045'

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-......... (STARTING->HEALTHY): Healthy
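To put a number on the restarts, the excerpt above could be scanned with a small script. This is only a sketch: it assumes the log lines keep the wording shown in the excerpt, and the function name `count_billing_restarts` and the `unit_marker` default are hypothetical helpers, not part of Elastic Agent.

```python
# Sample lines copied from the excerpt above (unit IDs redacted as in the original).
SAMPLE_LOG = [
    "[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-.......... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138198' exited with code '-1'",
    "[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-......................: Starting: spawned pid '138687'",
    "[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-..................... (STARTING->HEALTHY): Healthy",
    "[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-........................... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138687' exited with code '-1'",
    "[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-.........................: Starting: spawned pid '139045'",
    "[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-......... (STARTING->HEALTHY): Healthy",
]


def count_billing_restarts(lines, unit_marker="metrics-billing"):
    """Count spawn and stop events for units whose name contains unit_marker.

    Returns a (spawns, stops) tuple. Matching is done on substrings taken
    from the excerpt, so it is an assumption that other agent versions
    log the same phrases.
    """
    spawns = 0
    stops = 0
    for line in lines:
        if unit_marker not in line:
            continue  # ignore other integrations' units
        if "Spawned new unit" in line:
            spawns += 1
        elif "(HEALTHY->STOPPED)" in line:
            stops += 1
    return spawns, stops


spawns, stops = count_billing_restarts(SAMPLE_LOG)
print(f"billing unit spawned {spawns} times, stopped {stops} times")
```

Run against a full log file (for example by passing an open file object instead of `SAMPLE_LOG`), the same function would give the restart count over a longer window, which can then be compared with the number of API calls seen in AWS.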

As a result, we are seeing repeated GetCostAndUsage API calls flagged in CloudTrail.

image

This activity is subsequently reflected in the large Cost Explorer spikes we are seeing.

image

This $1,000+ cost spike is unfortunately entirely related to the AWS Billing Integration.
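For a sense of scale, the dollar figure can be converted into an approximate request count. The sketch below assumes the published Cost Explorer API price of $0.01 per paginated request; that price and the helper name `implied_requests` are assumptions here and should be checked against current AWS pricing.

```python
# Assumption: Cost Explorer API requests are billed at $0.01 each.
PRICE_PER_REQUEST_USD = 0.01


def implied_requests(total_cost_usd, price_per_request=PRICE_PER_REQUEST_USD):
    """Return the approximate number of billed API requests behind a charge."""
    return round(total_cost_usd / price_per_request)


# A $1,000 spike corresponds to roughly this many GetCostAndUsage requests.
print(implied_requests(1000))
```

With a 24 hour collection period, a count of this size is far beyond what a once-a-day collection would be expected to generate, which is consistent with the restarts triggering extra collections.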

We initially thought that the issue could be related to the pod resource limits which are set up for our Elastic Agents (see below).

image

However, we monitored the Elastic Agent pods using watch while the Billing Integration was enabled, and this didn't indicate that the resource limits were being hit.

image

We also watched for events in the namespace where the elastic-agents are running and we didn't observe any tell-tale events such as OOM.

In the CloudTrail snapshot above, I have highlighted that we quickly hit the threshold for GetCostAndUsage API calls. This is represented by the error code ThrottlingException.

Below is more detail on the ThrottlingException error:

image

Note that the group by dimension keys which are referenced in this event are not always the same.
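The CloudTrail observations above could be summarised by tallying GetCostAndUsage events by error code. This is a minimal sketch that works on already-parsed CloudTrail records (dictionaries with the standard `eventName` and `errorCode` fields); the sample records and the `"Success"` label for events without an error code are made up for illustration.

```python
from collections import Counter

# Hypothetical records, shaped like entries in a CloudTrail log file.
SAMPLE_RECORDS = [
    {"eventName": "GetCostAndUsage", "eventSource": "ce.amazonaws.com"},
    {"eventName": "GetCostAndUsage", "eventSource": "ce.amazonaws.com"},
    {"eventName": "GetCostAndUsage", "eventSource": "ce.amazonaws.com",
     "errorCode": "ThrottlingException"},
    {"eventName": "DescribeInstances", "eventSource": "ec2.amazonaws.com"},
]


def tally_get_cost_and_usage(records):
    """Count GetCostAndUsage events, grouped by errorCode.

    Events without an errorCode field are counted under "Success",
    which is a label chosen here, not a CloudTrail value.
    """
    tally = Counter()
    for record in records:
        if record.get("eventName") != "GetCostAndUsage":
            continue  # skip unrelated API calls
        tally[record.get("errorCode", "Success")] += 1
    return tally


print(dict(tally_get_cost_and_usage(SAMPLE_RECORDS)))
```

Grouping the same records by `eventTime` instead would show whether the bursts of calls line up with the unit restarts in the agent logs.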
