Ensure completion.WaitGroup always has a timeout#44731
Ensure completion.WaitGroup always has a timeout#44731jrajahalme merged 9 commits intocilium:mainfrom
Conversation
ba42c21 to
dbbae1e
Compare
|
/test |
1 similar comment
|
/test |
dbbae1e to
89ef52c
Compare
|
/test |
1 similar comment
|
/test |
youngnick
left a comment
There was a problem hiding this comment.
LGTM from SIG-Servicemesh.
|
Thanks for looking into this. After working on #43953 and #44328 for quite a while to debug/fix other issues (the latter to make backports of the fix easier to do, as we agreed on), I'm not too sure this approach of passing both a timeout and a context to the WaitGroup is the most optimal and cleanest one. Given we already pass in a context, I think its much cleaner to just do roughly the same as proposed in #43953 and have the WithTimeout outside the WaitGroup to more cleanly separate them. That way we can pass in the context from the WithTimeout into the WaitGroup - and thats exactly whats being done today in a few places eg. this where it makes sense. I agree some other places deserve a timeout, but that could also just as well be in the actual envoy/xds code that has the actual side-effect, in order to properly clean up those resources as well. Doesn't hurt to have it both places, but we should figure out what the exact root cause of #44543 is (I'm still not 100% sure we have that, but I might be wrong), and ensure that in case it happens, we don't leak neither goroutines, memory, or block the ipcache - no matter if its indefinitely or for 1+min. |
89ef52c to
a48ee36
Compare
Remove the unused `*completion.Completion` parameter from AckingResourceMutatorRevertFunc. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Refactor addVersionCompletion as addCurrentVersionCompletion without the version parameter that was always just 'm.version' (i.e., the current version). addCurrentVersionCompletion now inlines checking for ACK'ed nodes, and only installs completions for non-ACK'ed nodes. Use m.addCurrentVersionCompletion for m.useCurrent, since the prototype is now the same. Add maybeAddCurrentVersionCompletion to cut down on repeated code. Remove blocks for futile pendingCompletion reuse checks, as we are always allocating a new pendingCompletion anyway. Move currentVersionAcked to tests as it is no longer used in production code. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Use different timeouts depending on whether a success or failure is expected. This speeds up the runtime of 'go test ./pkg/envoy/xds/.' from ~15 seconds to ~2.5 seconds. Running with '-count=10' locally a few times did not reveal any flakes. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Use the definition of 'updateEnvoy' directly in the if statment so that becomes evident that the else condition is never executed since the if condition is always true if the else condition is true. Add comment about revert functions not being passed up by ApplyPolicyMapChanges. Remove dead code for UseCurrentNetworkPolicy. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
beedade to
eae0b5f
Compare
Add an id function that can be used to identify the remaining resources the Completion is waiting for. When the completion.WaitGroup context times out, return the blocking Completion's remaining resources in the returned error. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Change UpdatePolicyMaps() to Wait before returning and return the error from Wait. Remove the unused 2nd paramerer. Add endpoint-policy-update-timeout to the context for the proxyWaitGroup. If an error is returned, make the sole caller re-trigger the policy updates with an empty WaitGroup, so that if there are no other queued events, then the policy update is done immediately again. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
eae0b5f to
5fdad4c
Compare
|
/test |
Add a new CLI option
--endpoint-policy-update-timeout, and corresponding helm valueendpointPolicyUpdateTimeoutDurationdefaulting to 10 seconds. This is used on the context passed tocompletion.NewWaitGroup, so that by default endpointmanager waits for up to 10 seconds for Envoypolicy updates to be applied and acknowledged by Envoy.
Change
UpdatePolicyMaps()to Wait before returning and return the error fromWait. Remove the unused 2nd parameter. Addendpoint-policy-update-timeoutto the context for the proxyWaitGroup. If an error is returned, make the sole caller re-trigger the policy updates with an empty WaitGroup, so that if there are no other queued events, then the policy update is done immediately again.When the
completion.WaitGroupcontext times out, return the blocking Completion's remaining resources in the returned error.Only wait for ACKs from xDS nodes not ACKed yet.
Fixes: #44543