
Significant memory usage increase for AWS Operator with 1.18 #42310

@antonipp

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.18.2 and lower than v1.19.0

What happened?

We noticed a significant regression in memory usage for the AWS Operator after upgrading from 1.17 to 1.18. We started getting consistent OOMs during scale-ups (on the order of hundreds of nodes) that previously went unnoticed in our AWS clusters, with the Operator memory limit set to 5 GiB, which used to be more than enough.

I managed to catch a pprof heap profile of the Operator shortly before an OOM happened, it looks like this:

[pprof heap profile screenshot: AWS Route Table objects dominating the heap]

So there are millions of AWS Route Table objects taking all the memory.

Looking further into it, I found #37229, added in 1.18, which introduced a Route Tables refresh for every single instance during the resync operation:

routeTables, err := m.api.GetRouteTables(ctx)

Moreover, routeTableFilters are never set, so it fetches all route tables.

Running aws ec2 describe-route-tables > route-tables.json in one of our affected accounts, we get 12MiB:

$ du -h route-tables.json
 12M	route-tables.json

It's of course serialized differently in the Operator's memory, but this gives some sense of the amount of data retrieved on each call.

Looking at CloudTrail, we see hundreds of these calls per Operator (and we even get rate-limited from time to time!).
So it's now very easy for us to blow up memory, because Route Tables responses are quite large. Even with the VPC filter added on main, the number of calls and the number of results will still be very high.

To summarize, this is quite disruptive for our operations because it severely limits our ability to rapidly upscale clusters: the Operator very quickly runs out of memory. We have temporarily increased its memory limit; however, we'd like to find a proper solution for this.

Route tables are extremely static objects which almost never change. Why should they be refreshed for every single instance, hundreds of times? Moreover, it looks like the goal of this logic is "When creating a new ENI in AWS, trying the best to select a subnet with the same route table as the host's primary ENI" - but in our case this is unnecessary, because our Cilium-managed ENIs are always in separate subnets from the host ENIs (we manage capacity differently for hosts vs pods). Could we have a way to turn this logic off completely?

How can we reproduce the issue?

Cilium Version

Daemon: 1.18.2 e359538840 2025-09-25T14:38:13+02:00 go version go1.24.7 X:boringcrypto linux/arm64

Kernel Version

n/a

Kubernetes Version

n/a

Regression

1.17

Sysdump

No response

Relevant log output

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • affects/v1.18: This issue affects the v1.18 branch
  • area/agent: Cilium agent related
  • area/operator: Impacts the cilium-operator component
  • kind/bug: This is a bug in the Cilium logic
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack
  • kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium
