Protect ENI and Azure IPAM from misbehaving cloud APIs#11231
Conversation
|
Please set the appropriate release note label. |
2 similar comments
|
Please set the appropriate release note label. |
|
Please set the appropriate release note label. |
|
test-me-please |
|
PR looks good but |
…ync was successful Cloud APIs can get into a bad state. This could result in the operator being restarted. If that happens and the Cloud API synchronization then failed the CiliumNode resource would have its status overwritten. This is not desirable. Require a sucessful Cloud API sync before updating the CiliumNode resource. Fixes: #11052 Signed-off-by: Thomas Graf <thomas@cilium.io>
The initial synchronization is blocking but did not return the error so far. Treat the initial synchronization as critical. If that can't succeed, restart the operator to indicate the problem. IP allocation will not succeed anyway. Signed-off-by: Thomas Graf <thomas@cilium.io>
…nstable It is possible for the cloud APIs being used by the operator to get into a state where POST and PATCH operations are still succeeding while GET operations are failing. This can result in the operator to continously creating resources while being unable to ever synchronize the state successfully. Require a successful synchronization of all resources in order to continue performing mutating operations. Signed-off-by: Thomas Graf <thomas@cilium.io>
0bfa3c0 to
a8f0d9e
Compare
christarazi
left a comment
There was a problem hiding this comment.
Looks good to me! Have some non-blocking nits, feel free to skip.
| if n.retry != nil { | ||
| n.retry.Trigger() | ||
| } | ||
| return fmt.Errorf("instances API is unstable. Blocking mutating operations. See logs for details.") |
There was a problem hiding this comment.
Nit: we can use errors.New() for simple string errors. This can help reduce the usage and hopefully the need to import fmt, but I assume that we already use fmt.Errorf for good reasons in this file. Feel free to ignore if so.
There was a problem hiding this comment.
I'm keeping fmt.Errorf() for now to keep it consistent with the rest of the code in this file. I think we can definitely start using errors.New() and errors.Wrap() but we should make a consistent decision to do so after some discussion because I'm sure it will have implications on assumptions that all contributors are currently making.
| n.instancesAPI.Resync(ctx) | ||
| resyncTime := n.instancesAPI.Resync(ctx) | ||
| if resyncTime.IsZero() { | ||
| return fmt.Errorf("Initial synchronization with instances API failed") |
There was a problem hiding this comment.
Nit: we can use errors.New() for simple string errors. This can help reduce the usage and hopefully the need to import fmt, but I assume that we already use fmt.Errorf for good reasons in this file. Feel free to ignore if so.
|
test-me-please |
ungureanuvladvictor
left a comment
There was a problem hiding this comment.
Just one q related to logging, otherwise will let current reviewers to handle this PR.
| func (n *NodeManager) Start(ctx context.Context) error { | ||
| // Trigger the initial resync in a blocking manner | ||
| n.instancesAPI.Resync(ctx) | ||
| resyncTime := n.instancesAPI.Resync(ctx) |
There was a problem hiding this comment.
I assume the Resync call stack at some point it will log the error from the cloud provider? If yes maybe add in the lower fmt.Errorf something like "please check previous logs for errors"?
See individual commits