Significant memory usage increase for AWS Operator with 1.18 #42310
Description
Is there an existing issue for this?
- I have searched the existing issues
Version
equal or higher than v1.18.2 and lower than v1.19.0
What happened?
We noticed a significant regression in memory usage for the AWS Operator after upgrading from 1.17 to 1.18. We started getting consistent OOMs on previously unnoticeable scale-ups in our AWS clusters (order of magnitude of 100s of nodes) with Operator memory set to 5GiB, which used to be more than enough in the past.
I managed to catch a pprof heap profile of the Operator shortly before an OOM happened. It shows millions of AWS Route Table objects taking up all the memory.
Looking further into it, I found #37229, added in 1.18, which introduced a Route Tables refresh for every single instance in the resync operation:
cilium/pkg/aws/eni/instances.go, line 222 at commit 80a4025:

```go
routeTables, err := m.api.GetRouteTables(ctx)
```
Moreover, routeTableFilters is never set, so the call fetches all route tables in the account.
Running aws ec2 describe-route-tables > route-tables.json in one of our affected accounts, we get 12MiB:
```shell
$ du -h route-tables.json
12M	route-tables.json
```
It's of course serialized differently in Operator memory, but this gives some sense of the size of the data retrieved on each call.
Looking at CloudTrail, we see hundreds of these calls per Operator (and we even get rate-limited from time to time!).
So it's very easy for us to blow up memory now because Route Tables are quite large. Even with the VPC filter added on main, the number of calls and the number of results will still be very high.
To summarize, this is quite disruptive for our operations: it severely limits our ability to rapidly upscale clusters, since the Operator runs out of memory very quickly. We temporarily increased the memory limit, but we'd like to find a proper solution.
Route tables are extremely static objects that almost never change. Why should they be refreshed for every single instance, hundreds of times? Moreover, it looks like the goal of this logic is "When creating a new ENI in AWS, trying the best to select a subnet with the same route table as the host's primary ENI" - but in our case this is unnecessary, because our Cilium-managed ENIs are always in separate subnets from the host ENIs (we manage capacity differently for hosts vs pods). Could we have a way to completely turn off this logic?
How can we reproduce the issue?
Cilium Version
Daemon: 1.18.2 e359538840 2025-09-25T14:38:13+02:00 go version go1.24.7 X:boringcrypto linux/arm64
Kernel Version
n/a
Kubernetes Version
n/a
Regression
1.17
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
- Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- I agree to follow this project's Code of Conduct