
Significant memory usage increase for AWS Operator with 1.18 #42310

@antonipp

Is there an existing issue for this?

  • I have searched the existing issues

Version

equal or higher than v1.18.2 and lower than v1.19.0

What happened?

We noticed a significant regression in memory usage for the AWS Operator after upgrading from 1.17 to 1.18. We started getting consistent OOMs during scale-ups (on the order of hundreds of nodes) that previously went unnoticed in our AWS clusters, with the Operator memory limit set to 5 GiB, which used to be more than enough.

I managed to catch a pprof heap profile of the Operator shortly before an OOM happened, it looks like this:

[pprof heap profile screenshot: AWS Route Table objects dominating the heap]

So there are millions of AWS Route Table objects taking all the memory.

Looking further into it, I found #37229, added in 1.18, which introduced a Route Tables refresh for every single instance during the resync operation:

routeTables, err := m.api.GetRouteTables(ctx)

Moreover, routeTableFilters are never set, so it fetches all route tables.

Running aws ec2 describe-route-tables > route-tables.json in one of our affected accounts, we get 12MiB:

$ du -h route-tables.json
 12M	route-tables.json

It's of course serialized differently in the Operator's memory, but this gives some sense of the amount of data retrieved on each call.

Looking at CloudTrail, we see hundreds of these calls per Operator (and we even get rate-limited from time to time!).
So it's now very easy for us to blow up memory, because Route Tables responses are quite large. Even with the VPC filter added on main, the number of calls and the number of results will still be very high.

To summarize, this is quite disruptive for our operations because it severely limits our ability to rapidly upscale clusters: the Operator very quickly runs out of memory. We have temporarily increased its memory limit; however, we'd like to find a proper solution for this.

Route tables are extremely static objects which almost never change. Why should they be refreshed for every single instance, hundreds of times? Moreover, it looks like the goal of this logic is "When creating a new ENI in AWS, trying the best to select a subnet with the same route table as the host's primary ENI" - but in our case this is unnecessary, because our Cilium-managed ENIs are always in separate subnets from the host ENIs (we manage capacity differently for hosts vs pods). Could we have a way to turn this logic off completely?

How can we reproduce the issue?

Cilium Version

Daemon: 1.18.2 e359538840 2025-09-25T14:38:13+02:00 go version go1.24.7 X:boringcrypto linux/arm64

Kernel Version

n/a

Kubernetes Version

n/a

Regression

1.17

Sysdump

No response

Relevant log output

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct

Labels

  • affects/v1.18: This issue affects the v1.18 branch
  • area/agent: Cilium agent related
  • area/operator: Impacts the cilium-operator component
  • kind/bug: This is a bug in the Cilium logic
  • kind/community-report: This was reported by a user in the Cilium community, e.g. via Slack
  • kind/regression: This functionality worked fine before, but was broken in a newer release of Cilium
