Skip to content

Protect ENI and Azure IPAM from misbehaving cloud APIs#11231

Merged
tgraf merged 3 commits intomasterfrom
pr/tgraf/fix-eni-sync
May 27, 2020
Merged

Protect ENI and Azure IPAM from misbehaving cloud APIs#11231
tgraf merged 3 commits intomasterfrom
pr/tgraf/fix-eni-sync

Conversation

@tgraf
Copy link
Copy Markdown
Contributor

@tgraf tgraf commented Apr 29, 2020

See individual commits

@tgraf tgraf added kind/bug This is a bug in the Cilium logic. priority/high This is considered vital to an upcoming release. labels Apr 29, 2020
@tgraf tgraf requested a review from a team as a code owner April 29, 2020 16:42
@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

2 similar comments
@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@maintainer-s-little-helper
Copy link
Copy Markdown

Please set the appropriate release note label.

@tgraf tgraf added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Apr 29, 2020
@qmonnet
Copy link
Copy Markdown
Member

qmonnet commented Apr 30, 2020

test-me-please

@qmonnet
Copy link
Copy Markdown
Member

qmonnet commented Apr 30, 2020

PR looks good but node_manager unit tests are failing, can you please have a look?

@aanm aanm added kind/bug This is a bug in the Cilium logic. and removed kind/bug This is a bug in the Cilium logic. labels May 1, 2020
@tgraf tgraf marked this pull request as draft May 1, 2020 16:17
Comment thread pkg/ipam/node.go
tgraf added 3 commits May 20, 2020 23:44
…ync was successful

Cloud APIs can get into a bad state. This could result in the operator
being restarted. If that happens and the Cloud API synchronization then
failed the CiliumNode resource would have its status overwritten. This
is not desirable. Require a sucessful Cloud API sync before updating the
CiliumNode resource.

Fixes: #11052

Signed-off-by: Thomas Graf <thomas@cilium.io>
The initial synchronization is blocking but did not return the error so
far. Treat the initial synchronization as critical. If that can't
succeed, restart the operator to indicate the problem.

IP allocation will not succeed anyway.

Signed-off-by: Thomas Graf <thomas@cilium.io>
…nstable

It is possible for the cloud APIs being used by the operator to get into
a state where POST and PATCH operations are still succeeding while GET
operations are failing. This can result in the operator to continously
creating resources while being unable to ever synchronize the state
successfully. Require a successful synchronization of all resources in
order to continue performing mutating operations.

Signed-off-by: Thomas Graf <thomas@cilium.io>
@tgraf tgraf force-pushed the pr/tgraf/fix-eni-sync branch from 0bfa3c0 to a8f0d9e Compare May 20, 2020 21:44
@tgraf tgraf marked this pull request as ready for review May 20, 2020 21:44
@tgraf tgraf requested review from a team as code owners May 20, 2020 21:44
@tgraf tgraf requested a review from a team May 20, 2020 21:44
Copy link
Copy Markdown
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Have some non-blocking nits, feel free to skip.

Comment thread pkg/ipam/node.go
if n.retry != nil {
n.retry.Trigger()
}
return fmt.Errorf("instances API is unstable. Blocking mutating operations. See logs for details.")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can use errors.New() for simple string errors. This can help reduce the usage and hopefully the need to import fmt, but I assume that we already use fmt.Errorf for good reasons in this file. Feel free to ignore if so.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm keeping fmt.Errorf() for now to keep it consistent with the rest of the code in this file. I think we can definitely start using errors.New() and errors.Wrap() but we should make a consistent decision to do so after some discussion because I'm sure it will have implications on assumptions that all contributors are currently making.

Comment thread pkg/ipam/node_manager.go
n.instancesAPI.Resync(ctx)
resyncTime := n.instancesAPI.Resync(ctx)
if resyncTime.IsZero() {
return fmt.Errorf("Initial synchronization with instances API failed")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we can use errors.New() for simple string errors. This can help reduce the usage and hopefully the need to import fmt, but I assume that we already use fmt.Errorf for good reasons in this file. Feel free to ignore if so.

@coveralls
Copy link
Copy Markdown

Coverage Status

Coverage decreased (-0.02%) to 37.09% when pulling a8f0d9e on pr/tgraf/fix-eni-sync into f829636 on master.

@tgraf tgraf requested a review from qmonnet May 22, 2020 14:28
@tgraf
Copy link
Copy Markdown
Contributor Author

tgraf commented May 22, 2020

test-me-please

Copy link
Copy Markdown
Member

@ungureanuvladvictor ungureanuvladvictor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one q related to logging, otherwise will let current reviewers to handle this PR.

Comment thread pkg/ipam/node_manager.go Outdated
func (n *NodeManager) Start(ctx context.Context) error {
// Trigger the initial resync in a blocking manner
n.instancesAPI.Resync(ctx)
resyncTime := n.instancesAPI.Resync(ctx)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the Resync call stack at some point it will log the error from the cloud provider? If yes maybe add in the lower fmt.Errorf something like "please check previous logs for errors"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug This is a bug in the Cilium logic. priority/high This is considered vital to an upcoming release. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants