AWS Billing Integration causes large spike in CostExplorer charges

### Summary

The AWS Billing integration for Elastic Agent appears to stop and start repeatedly, which causes a large number of `GetCostAndUsage` API calls. These calls in turn incur significant charges.


### Description

We have recently been retesting the Billing part of the AWS Integration for Elastic Agent in our K8s environment.

We have the Elastic Agents running as a daemonset in the k8s cluster and they are all managed by Fleet.

The versions we are running are as follows:

- Elastic Agent/Cluster Version: 8.7.1
- AWS Integration Version: 1.42.0
- EKS version: 1.25

We have configured the Billing settings with a period of 24 hours.


Unfortunately, since re-enabling the Billing facet of the AWS Integration, we have observed repeated stopping and starting of the AWS Billing process. Here is an excerpt from the Elastic Agent logs.

```
[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-.......... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138198' exited with code '-1'

[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-......................: Starting: spawned pid '138687'

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-..................... (STARTING->HEALTHY): Healthy

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-........................... (HEALTHY->STOPPED): Suppressing FAILED state due to restart for '138687' exited with code '-1'

[elastic_agent][info] Spawned new unit aws/metrics-default-aws/metrics-billing-.........................: Starting: spawned pid '139045'

[elastic_agent][info] Unit state changed aws/metrics-default-aws/metrics-billing-......... (STARTING->HEALTHY): Healthy
```
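For anyone trying to quantify the restart loop from their own logs, a minimal Python sketch along these lines could count how often each unit is respawned. The regular expression is an assumption based only on the log format in the excerpt above, and `count_spawns` is a hypothetical helper, not part of Elastic Agent.

```python
import re
from collections import Counter

# Assumption: the pattern mirrors the "Spawned new unit ..." lines in the
# excerpt above; other Elastic Agent versions may log a different format.
# Trailing dots (used for redaction in the excerpt) are stripped from the unit.
SPAWN_RE = re.compile(
    r"Spawned new unit (?P<unit>\S+?)\.*: Starting: spawned pid '(?P<pid>\d+)'"
)


def count_spawns(lines):
    """Return a Counter mapping unit name -> number of times it was spawned."""
    counts = Counter()
    for line in lines:
        match = SPAWN_RE.search(line)
        if match:
            counts[match.group("unit")] += 1
    return counts
```

Feeding it the agent log lines (for example, read from a file) gives a per-unit respawn count, which can then be compared against the configured 24-hour collection period.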


As a result, we are seeing repeated `GetCostAndUsage` API calls flagged in CloudTrail.

![image](https://github.com/elastic/integrations/assets/46032005/47e26ec6-81a6-4ba7-a08f-3b3fa93ba33c)



This activity is subsequently reflected in the large Cost Explorer spikes we are seeing.

![image](https://github.com/elastic/integrations/assets/46032005/fc5b15b0-5c54-40c2-97a5-e1ebdf9962b9)

This $1,000+ cost spike is unfortunately entirely related to the AWS Billing Integration.
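For a rough sense of scale, the sketch below converts a Cost Explorer API charge into an implied request count. It assumes the published price of $0.01 per Cost Explorer API request, which should be checked against current AWS pricing. The function name is hypothetical, and the figure is an estimate, not a number taken from our bill.

```python
# Assumption: Cost Explorer API requests are billed at $0.01 each.
# Verify against the current AWS Cost Management pricing page before relying on it.
PRICE_PER_REQUEST_USD = 0.01


def implied_request_count(total_cost_usd, price_per_request_usd=PRICE_PER_REQUEST_USD):
    """Estimate how many billed API requests correspond to a given charge."""
    if price_per_request_usd <= 0:
        raise ValueError("price_per_request_usd must be positive")
    return round(total_cost_usd / price_per_request_usd)


if __name__ == "__main__":
    # Under the assumed price, a $1,000 spike implies on the order of 100,000 requests.
    print(implied_request_count(1000))
```

Under that assumption, a $1,000 spike corresponds to roughly 100,000 billed requests, far more than a single collection per 24-hour period would be expected to generate.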


We initially thought that the issue could be related to the pod resource limits which are set up for our Elastic Agents (see below).

![image](https://github.com/elastic/integrations/assets/46032005/d7ad8cf5-901b-4f70-952a-771d33ce82c4)

However, we monitored the Elastic Agent pods using `watch` during the period where we enabled the Billing Integration, and this didn't indicate that the resource limits were being hit.

![image](https://github.com/elastic/integrations/assets/46032005/d9fb8430-75ee-479d-b8b7-88c45740ffc1)

We also watched for events in the namespace where the elastic-agents are running and we didn't observe any tell-tale events such as OOM.


In the CloudTrail snapshot above, I have highlighted that we quickly hit the threshold for `GetCostAndUsage` API calls. This is represented by the error code `ThrottlingException`.

Below is more detail on the `ThrottlingException` error:

![image](https://github.com/elastic/integrations/assets/46032005/1ea0dbd9-66a9-49f9-907c-adf1c82bac6b)

Note that the group-by dimension keys referenced in this event are not always the same.
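To summarise exported CloudTrail events rather than reading them one by one, a sketch like the following could tally `GetCostAndUsage` calls by error code and by group-by key. It assumes each record is a parsed CloudTrail event (a Python dict) with `eventName`, an optional `errorCode`, and `requestParameters` containing a `GroupBy` list. The `GroupBy` layout in particular is an assumption and may not match the actual events, and `summarise_cost_explorer_calls` is a hypothetical helper.

```python
from collections import Counter


def summarise_cost_explorer_calls(records):
    """Tally GetCostAndUsage events by error code and by group-by key.

    `records` is an iterable of parsed CloudTrail event dicts.
    Returns (total_calls, errors_by_code, calls_by_group_key).
    """
    total = 0
    errors = Counter()
    group_keys = Counter()
    for record in records:
        if record.get("eventName") != "GetCostAndUsage":
            continue
        total += 1
        # Successful calls carry no errorCode; label them "OK" for the tally.
        errors[record.get("errorCode", "OK")] += 1
        # Assumption: requestParameters mirrors the API request shape.
        params = record.get("requestParameters") or {}
        for group in params.get("GroupBy") or []:
            group_keys[group.get("Key")] += 1
    return total, errors, group_keys
```

The share of `ThrottlingException` entries in the second return value gives a quick indication of how often the request threshold was hit.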

