Protect ENI and Azure IPAM from misbehaving cloud APIs by tgraf · Pull Request #11231 · cilium/cilium

tgraf · 2020-04-29T16:42:54Z

See individual commits

maintainer-s-little-helper · 2020-04-29T16:42:56Z

Please set the appropriate release note label.

qmonnet · 2020-04-30T08:55:26Z

test-me-please

qmonnet · 2020-04-30T09:43:37Z

PR looks good but node_manager unit tests are failing, can you please have a look?

…ync was successful

Cloud APIs can get into a bad state. This could result in the operator being restarted. If that happens and the Cloud API synchronization then fails, the CiliumNode resource would have its status overwritten. This is not desirable. Require a successful Cloud API sync before updating the CiliumNode resource.

Fixes: #11052
Signed-off-by: Thomas Graf <thomas@cilium.io>

The initial synchronization is blocking but did not return the error so far. Treat the initial synchronization as critical. If that can't succeed, restart the operator to indicate the problem. IP allocation will not succeed anyway.

Signed-off-by: Thomas Graf <thomas@cilium.io>

…nstable

It is possible for the cloud APIs used by the operator to get into a state where POST and PATCH operations still succeed while GET operations fail. This can result in the operator continuously creating resources while being unable to ever synchronize the state successfully. Require a successful synchronization of all resources in order to continue performing mutating operations.

Signed-off-by: Thomas Graf <thomas@cilium.io>

christarazi

Looks good to me! Have some non-blocking nits, feel free to skip.

christarazi · 2020-05-20T21:54:07Z

+		if n.retry != nil {
+			n.retry.Trigger()
+		}
+		return fmt.Errorf("instances API is unstable. Blocking mutating operations. See logs for details.")


Nit: we can use errors.New() for simple string errors. This can help reduce the usage and hopefully the need to import fmt, but I assume that we already use fmt.Errorf for good reasons in this file. Feel free to ignore if so.

I'm keeping fmt.Errorf() for now to keep it consistent with the rest of the code in this file. I think we can definitely start using errors.New() and errors.Wrap() but we should make a consistent decision to do so after some discussion because I'm sure it will have implications on assumptions that all contributors are currently making.

christarazi · 2020-05-20T21:54:27Z

-	n.instancesAPI.Resync(ctx)
+	resyncTime := n.instancesAPI.Resync(ctx)
+	if resyncTime.IsZero() {
+		return fmt.Errorf("Initial synchronization with instances API failed")


Nit: we can use errors.New() for simple string errors. This can help reduce the usage and hopefully the need to import fmt, but I assume that we already use fmt.Errorf for good reasons in this file. Feel free to ignore if so.

coveralls · 2020-05-20T22:14:44Z

Coverage decreased (-0.02%) to 37.09% when pulling a8f0d9e on pr/tgraf/fix-eni-sync into f829636 on master.

tgraf · 2020-05-22T14:30:24Z

test-me-please

ungureanuvladvictor

Just one question related to logging; otherwise I will let the current reviewers handle this PR.

ungureanuvladvictor · 2020-05-23T20:27:21Z

+func (n *NodeManager) Start(ctx context.Context) error {
 	// Trigger the initial resync in a blocking manner
-	n.instancesAPI.Resync(ctx)
+	resyncTime := n.instancesAPI.Resync(ctx)


I assume the Resync call stack will at some point log the error from the cloud provider? If so, maybe add something like "please check previous logs for errors" to the fmt.Errorf below?

tgraf added kind/bug This is a bug in the Cilium logic. priority/high This is considered vital to an upcoming release. labels Apr 29, 2020

tgraf requested a review from a team as a code owner April 29, 2020 16:42

maintainer-s-little-helper Bot added the dont-merge/needs-release-note label Apr 29, 2020

tgraf added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 29, 2020

maintainer-s-little-helper Bot removed the dont-merge/needs-release-note label Apr 29, 2020

qmonnet approved these changes Apr 30, 2020

aanm approved these changes May 1, 2020

aanm added kind/bug This is a bug in the Cilium logic. and removed kind/bug This is a bug in the Cilium logic. labels May 1, 2020

tgraf marked this pull request as draft May 1, 2020 16:17

qmonnet requested changes May 1, 2020

Comment thread pkg/ipam/node.go

tgraf added 3 commits May 20, 2020 23:44

tgraf force-pushed the pr/tgraf/fix-eni-sync branch from 0bfa3c0 to a8f0d9e Compare May 20, 2020 21:44

tgraf marked this pull request as ready for review May 20, 2020 21:44

tgraf requested review from a team as code owners May 20, 2020 21:44

tgraf requested a review from a team May 20, 2020 21:44

christarazi approved these changes May 20, 2020

tgraf requested a review from qmonnet May 22, 2020 14:28

ungureanuvladvictor reviewed May 23, 2020

aanm approved these changes May 25, 2020

qmonnet approved these changes May 27, 2020

tgraf merged commit b5c5ca9 into master May 27, 2020

tgraf deleted the pr/tgraf/fix-eni-sync branch May 27, 2020 14:53

tgraf added needs-backport/1.7 and removed needs-backport/1.6 labels May 27, 2020

tklauser mentioned this pull request Jun 3, 2020

v1.7 backports 2020-06-03 #11855

Merged

christarazi mentioned this pull request Jun 4, 2020

v1.7 backports 2020-06-04 #11906

Merged

joestringer mentioned this pull request Jun 8, 2020

v1.7 backports 2020-06-08 #11971

Merged

joestringer added backport-pending/1.7 and removed needs-backport/1.7 labels Jun 8, 2020

nebril mentioned this pull request Jun 30, 2020

v1.7 backports 2020-06-30 #12337

Merged

joestringer added backport-pending/1.7 and removed needs-backport/1.7 labels Jun 30, 2020

qmonnet added backport-done/1.7 and removed backport-pending/1.7 labels Aug 3, 2020

7 participants