Skip to content

Ensure completion.WaitGroup always has a timeout#44731

Merged
jrajahalme merged 9 commits intocilium:mainfrom
jrajahalme:completion-waitgroup-no-wait-cancel
Mar 30, 2026
Merged

Ensure completion.WaitGroup always has a timeout#44731
jrajahalme merged 9 commits intocilium:mainfrom
jrajahalme:completion-waitgroup-no-wait-cancel

Conversation

@jrajahalme
Copy link
Copy Markdown
Member

@jrajahalme jrajahalme commented Mar 11, 2026

Add a new CLI option --endpoint-policy-update-timeout, and corresponding helm value endpointPolicyUpdateTimeoutDuration defaulting to 10 seconds. This is used on the context passed to completion.NewWaitGroup, so that by default endpointmanager waits for up to 10 seconds for Envoy
policy updates to be applied and acknowledged by Envoy.

Change UpdatePolicyMaps() to Wait before returning and return the error from Wait. Remove the unused 2nd parameter. Add endpoint-policy-update-timeout to the context for the proxyWaitGroup. If an error is returned, make the sole caller re-trigger the policy updates with an empty WaitGroup, so that if there are no other queued events, then the policy update is done immediately again.

When the completion.WaitGroup context times out, return the blocking Completion's remaining resources in the returned error.

Only wait for ACKs from xDS nodes not ACKed yet.

Fixes: #44543

@jrajahalme jrajahalme added kind/bug This is a bug in the Cilium logic. area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. labels Mar 11, 2026
@jrajahalme jrajahalme requested review from a team as code owners March 11, 2026 12:10
@jrajahalme jrajahalme requested review from sayboras and vipul-21 March 11, 2026 12:10
@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 11, 2026
@jrajahalme jrajahalme marked this pull request as draft March 11, 2026 12:10
@github-actions github-actions bot added the sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. label Mar 11, 2026
@jrajahalme jrajahalme added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Mar 11, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 11, 2026
@jrajahalme jrajahalme force-pushed the completion-waitgroup-no-wait-cancel branch from ba42c21 to dbbae1e Compare March 11, 2026 12:33
@jrajahalme
Copy link
Copy Markdown
Member Author

/test

1 similar comment
@jrajahalme
Copy link
Copy Markdown
Member Author

/test

@jrajahalme jrajahalme force-pushed the completion-waitgroup-no-wait-cancel branch from dbbae1e to 89ef52c Compare March 11, 2026 16:27
@jrajahalme
Copy link
Copy Markdown
Member Author

/test

1 similar comment
@cilium-ariane
Copy link
Copy Markdown

cilium-ariane bot commented Mar 11, 2026

/test

Copy link
Copy Markdown
Contributor

@youngnick youngnick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from SIG-Servicemesh.

@odinuge
Copy link
Copy Markdown
Member

odinuge commented Mar 12, 2026

Thanks for looking into this.

After working on #43953 and #44328 for quite a while to debug/fix other issues (the latter to make backports of the fix easier to do, as we agreed on), I'm not too sure this approach of passing both a timeout and a context to the WaitGroup is the most optimal and cleanest one.

Given we already pass in a context, I think its much cleaner to just do roughly the same as proposed in #43953 and have the WithTimeout outside the WaitGroup to more cleanly separate them. That way we can pass in the context from the WithTimeout into the WaitGroup - and thats exactly whats being done today in a few places eg. this where it makes sense.

I agree some other places deserve a timeout, but that could also just as well be in the actual envoy/xds code that has the actual side-effect, in order to properly clean up those resources as well. Doesn't hurt to have it both places, but we should figure out what the exact root cause of #44543 is (I'm still not 100% sure we have that, but I might be wrong), and ensure that in case it happens, we don't leak neither goroutines, memory, or block the ipcache - no matter if its indefinitely or for 1+min.

@jrajahalme jrajahalme force-pushed the completion-waitgroup-no-wait-cancel branch from 89ef52c to a48ee36 Compare March 18, 2026 15:20
Remove the unused `*completion.Completion` parameter from
AckingResourceMutatorRevertFunc.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Refactor addVersionCompletion as addCurrentVersionCompletion without the
version parameter that was always just 'm.version' (i.e., the current
version). addCurrentVersionCompletion now inlines checking for ACK'ed
nodes, and only installs completions for non-ACK'ed nodes.

Use m.addCurrentVersionCompletion for m.useCurrent, since the prototype
is now the same.

Add maybeAddCurrentVersionCompletion to cut down on repeated code.

Remove blocks for futile pendingCompletion reuse checks, as we are always
allocating a new pendingCompletion anyway.

Move currentVersionAcked to tests as it is no longer used in production
code.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Use different timeouts depending on whether a success or failure is
expected. This speeds up the runtime of 'go test ./pkg/envoy/xds/.' from
~15 seconds to ~2.5 seconds. Running with '-count=10' locally a few times
did not reveal any flakes.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Use the definition of 'updateEnvoy' directly in the if statment so that
becomes evident that the else condition is never executed since the if
condition is always true if the else condition is true.

Add comment about revert functions not being passed up by
ApplyPolicyMapChanges.

Remove dead code for UseCurrentNetworkPolicy.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the completion-waitgroup-no-wait-cancel branch from beedade to eae0b5f Compare March 26, 2026 09:34
Add an id function that can be used to identify the remaining resources
the Completion is waiting for.

When the completion.WaitGroup context times out, return the blocking
Completion's remaining resources in the returned error.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Change UpdatePolicyMaps() to Wait before returning and return the error
from Wait. Remove the unused 2nd paramerer. Add
endpoint-policy-update-timeout to the context for the proxyWaitGroup. If
an error is returned, make the sole caller re-trigger the policy updates
with an empty WaitGroup, so that if there are no other queued events,
then the policy update is done immediately again.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme force-pushed the completion-waitgroup-no-wait-cancel branch from eae0b5f to 5fdad4c Compare March 26, 2026 09:41
@jrajahalme jrajahalme added the affects/v1.18 This issue affects v1.18 branch label Mar 26, 2026
@jrajahalme
Copy link
Copy Markdown
Member Author

/test

@jrajahalme jrajahalme removed the request for review from nddq March 30, 2026 12:37
@jrajahalme jrajahalme added this pull request to the merge queue Mar 30, 2026
Merged via the queue into cilium:main with commit 4314040 Mar 30, 2026
80 of 83 checks passed
@jrajahalme jrajahalme deleted the completion-waitgroup-no-wait-cancel branch March 30, 2026 14:44
@maintainer-s-little-helper maintainer-s-little-helper bot added ready-to-merge This PR has passed all tests and received consensus from code owners to merge. labels Mar 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects/v1.18 This issue affects v1.18 branch area/proxy Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers. kind/bug This is a bug in the Cilium logic. needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

identity updater is wedged forever when missing NPDS ack from envoy

10 participants