Skip to content

ipam/crd: Fix ENI leak due to miscounting of empty interface slots#21800

Merged
aanm merged 3 commits intocilium:masterfrom
ctripcloud:fix-eni-leak-upstream
Oct 25, 2022
Merged

ipam/crd: Fix ENI leak due to miscounting of empty interface slots#21800
aanm merged 3 commits intocilium:masterfrom
ctripcloud:fix-eni-leak-upstream

Conversation

@jaffcheng
Copy link
Copy Markdown
Contributor

@jaffcheng jaffcheng commented Oct 19, 2022

please see commit msg

Fixes: #21747

Fix ENI leak in Alibaba due to miscounting of empty interface slots

Currently, in CRD ipam modes, before the operator creates a new interface
(or ENI for brevity), it checks the number of remaining empty ENI slots
to ensure the max ENI limit does not exceed. However, the check is made
against the field `AllocationAction.AvailableInterfaces`, which is set as
`(empty ENI slots) + (attached ENIs that have not yet reached the IP address capacity)`
in `PrepareIPAllocation`. This makes the ENI limit check ineffective in
situations where an ENI with free space somehow can't be assigned with more
IPs, e.g. the subnet of ENI is exhausted.
As a result, the operator continuously tries to create and attach ENIs to
the instance, the createInterface calls would succeed but attachInterface calls
might fail due to ENI limit, and ENIs are leaked and exhausted eventually.

The cause of miscouting might be the ambiguous field name `AvailableInterfaces`,
which might be interpreted with different meanings in different contexts:
  1. empty ENI slots, in interface creation context
  2. attached ENIs with IP space, in IP allocation context
  3. sum of 1. and 2., in node ipam stats context

This patch fixes this by checking max ENI limit against the correct value,
and also replaces `AllocationAction.AvailableInterfaces`
with clearer names `InterfaceCandidates` `EmptyInterfaceSlots` to avoid potential
misunderstanding in the future.

Fixes: cilium#21747
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng jaffcheng requested review from a team as code owners October 19, 2022 09:11
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 19, 2022
@jaffcheng jaffcheng requested a review from qmonnet October 19, 2022 09:11
@qmonnet qmonnet added kind/bug This is a bug in the Cilium logic. release-note/bug This PR fixes an issue in a previous release of Cilium. area/alibaba Impacts Alibaba based IPAM. and removed kind/bug This is a bug in the Cilium logic. labels Oct 19, 2022
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Oct 19, 2022
Copy link
Copy Markdown
Member

@qmonnet qmonnet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good as far as I can tell, but I'm not familiar with Alibaba's IPAM.

From the issue report I understand the bug has been present since at least 1.10, is this fix something that we should consider for backports?

@qmonnet qmonnet added the upgrade-impact This PR has potential upgrade or downgrade impact. label Oct 19, 2022
Deprecate `cilium_operator_ipam_available_interfaces` and add the following metrics:
* cilium_operator_ipam_interface_candidates
* cilium_operator_ipam_empty_interface_slots

Related: cilium#21747
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng jaffcheng force-pushed the fix-eni-leak-upstream branch from c14f045 to f35bf0b Compare October 19, 2022 15:26
@jaffcheng
Copy link
Copy Markdown
Contributor Author

Thanks for the comment.

From the issue report I understand the bug has been present since at least 1.10, is this fix something that we should consider for backports?

Yes, I think this needs to be backported. I'm a little surprised this issue hasn't been reported by another user, since it should affect AWS ENI as well if I understand correctly.

Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix makes sense to me, thanks! Should we go a step further and ensure that we deallocate the ENI if we fail to attach it?

Copy link
Copy Markdown
Contributor

@tommyp1ckles tommyp1ckles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Azure changes LGTM

Delete ENI after a failed AttachNetworkInterface API call.

Fixes: cilium#21747
Signed-off-by: Jaff Cheng <jaff.cheng.sh@gmail.com>
@jaffcheng
Copy link
Copy Markdown
Contributor Author

Should we go a step further and ensure that we deallocate the ENI if we fail to attach it?

@christarazi Thanks for the advice! Just realized we already have ENI rollback logic, but alibabacloud missed it after a failed AttachNetworkInterface API call. This could explain why AWS users haven't run into this.
Added a commit to fix this, please take another look.

@christarazi
Copy link
Copy Markdown
Member

Looks good, thanks!

@aanm
Copy link
Copy Markdown
Member

aanm commented Oct 21, 2022

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Oct 21, 2022
@aanm aanm merged commit 0b7919a into cilium:master Oct 25, 2022
@jaffcheng jaffcheng deleted the fix-eni-leak-upstream branch October 25, 2022 09:44
@pchaigno
Copy link
Copy Markdown
Member

pchaigno commented Nov 7, 2022

Removing backport labels here as this PR has upgrade impact and deprecations. Please yell if some part of the PR does need to be bsckported.

@christarazi
Copy link
Copy Markdown
Member

@jaffcheng Do we need to do the same thing for AWS ENI?

@jaffcheng
Copy link
Copy Markdown
Contributor Author

I think this fix should also affect AWS ENI. IIUC, if an AWS node encounters this, the cilium-operator without this PR would keep creating and attaching(limit error) and deleting an ENI, but no ENI leak would happen.

@christarazi
Copy link
Copy Markdown
Member

@jaffcheng Could you open a PR to resolve discrepancies between AWS and Alibaba?

@jaffcheng
Copy link
Copy Markdown
Contributor Author

@jaffcheng Could you open a PR to resolve discrepancies between AWS and Alibaba?

@christarazi Sorry for my wording not being clear enough, I think this PR should have fixed the ENI slot miscounting problem for AWS:

118c148#diff-30611d880237aa1610d4ba922a8639e304d4f1e398b98418bdd4200ee4536066R225-R252

The discrepancies between Alibaba and AWS are fixed in this commit:
8e23fe4

@christarazi
Copy link
Copy Markdown
Member

Ah perfect, thanks @jaffcheng!

HadrienPatte added a commit that referenced this pull request Apr 2, 2026
Remove `cilium_operator_ipam_ips` and `cilium_operator_ipam_available_interfaces`
IPAM metrics that have been deprecated and marked for deletion
respectively since 1.15 and 1.14.

Followup to #21800

Followup to #24776

Signed-off-by: Hadrien Patte <hadrien.patte@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/alibaba Impacts Alibaba based IPAM. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. upgrade-impact This PR has potential upgrade or downgrade impact.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

alibabacloud: ENI leak when IP addresses run out in a subnet

6 participants