completion: fix memory leak in waitGroup by cancelling context on Wait#44328
Merged: jrajahalme merged 1 commit into cilium:main on Mar 5, 2026
Conversation
Force-pushed from de1bc5d to 588ef3b
Member (Author):
/test

1 similar comment

Member (Author):
/test
Member (Author):
Looks like a combination of flakes and #44334. I'll rebase once main is in a better state.
Force-pushed from 588ef3b to d2eed03
Member (Author):
/test
This waitGroup implementation creates a new WithCancel context for each
wait group, but it never closes it. The actual behavior of the leak depends on
the underlying context, but either way it is a leak, and it is especially bad
under a lot of churn. The number of bytes leaked per second is tiny, but over
hours and hours of uptime it can grow drastically.
Given we use context.WithCancel for the context passed in here, we end
up adding a separate map entry to that parent context for each
iteration. For the usages where the parent context is cancelled this is fine,
but that is not the case everywhere, hence the leak.
I've manually inspected the 'children' field of the 'cancelCtx' struct
used in 'UpdatePolicyMap' and can indeed see its length increasing
over time.
This effectively adds a separate Cancel that can be used to clean up
the WaitGroup in case Wait is never executed and the user does not
know whether the context was closed or not. Most use cases can rely
on Wait closing the context, or alternatively check whether the context was
cancelled before executing Wait.
Fixes: 551377f ("envoy: Make NACK cancel the WaitGroup")
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
Force-pushed from d2eed03 to cc5eca8
Member (Author):
/test

Member (Author):
/test
jrajahalme approved these changes on Mar 5, 2026

jrajahalme (Member) left a comment:
This fixes the most glaring memory leak; will look for cases where a created waitgroup is not waited on in a followup to clean up the remaining loose ends.
Member (Author):
Thanks @jrajahalme! Do you mind adding backport labels for v1.19, v1.18, and v1.17?
Member (Author):
Ping @jrajahalme, can you add the backport labels here? I'll get #43953 going again once the agreed-upon backport of this version is done.
See the commit message for more info. This fixes a gnarly memory leak. I'm not sure this is the best way to fix it, and we need to look into the guarantees and write better docs, but it's a start. I also want to run the tests to see if anything depends on this in a different way.
This is an alternative to #43953 that is lower lift and easier to backport.
I also partially bisected this down to where the WithCancel was introduced, and didn't spend more time digging into that. This most likely affects all versions of Cilium, although in practice it mostly affects environments with a lot of churn.
I've also updated the tests to "document" the existing behavior and how the new cancel works together with it.