Skip to content

completion: fix memory leak in waitGroup by cancelling context on Wait#44328

Merged
jrajahalme merged 1 commit intocilium:mainfrom
odinuge:odinuge/fix-ctx-leak-v2
Mar 5, 2026
Merged

completion: fix memory leak in waitGroup by cancelling context on Wait#44328
jrajahalme merged 1 commit intocilium:mainfrom
odinuge:odinuge/fix-ctx-leak-v2

Conversation

@odinuge
Copy link
Copy Markdown
Member

@odinuge odinuge commented Feb 13, 2026

See commit message for more info. This fixes a gnarly memory leak. Not sure if this is the best way to fix, and we need to look into the guarantees and write better docs - but its a start. I also want to run the tests to see if anything depend on this in a different way.

Alternative to #43953 thats lower lift and easier to backport.

I also partially bisected this down to where the WithCancel was introduced, and didn't spend more time digging into that. This most likely affect all versions of cilium, although in practice its mostly affecting environments with a lot of churn.

I've also updated tests to "document" the existing behavior and how the new cancel works together with it.

Fix a slow memory leak triggered by incremental policy updates

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Feb 13, 2026
@odinuge odinuge force-pushed the odinuge/fix-ctx-leak-v2 branch from de1bc5d to 588ef3b Compare February 13, 2026 08:32
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Feb 13, 2026

/test

1 similar comment
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Feb 13, 2026

/test

@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Feb 13, 2026

Looks like a combination of flakes and #44334. I'll rebase once main is in a better state.

@odinuge odinuge force-pushed the odinuge/fix-ctx-leak-v2 branch from 588ef3b to d2eed03 Compare February 20, 2026 10:34
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Feb 20, 2026

/test

This waitGroup implementation creates a new WithCancel context for each
wait group - but it never closes it. The actual behavior/leak depend on
the underlying context - but no matter, its a leak. Its also pretty bad
during a lot of churn. The amount of bytes/sec is pretty tiny, but over
hours and hours of uptime, this can increase drastically..

Given we use context.WithCancel for the context passed in here, we end
up adding a separate map entry to that parent context for each
iteration. For the usages where the parent context is cancelled its fine
- but thats not the case in all places, hence the leak.

I've manually inspected the 'children' field of the 'cancelCtx' struct
used in the 'UpdatePolicyMap' and can indeed see the length is increasing
over time.

This effectively adds a separatate Cancel that can be used to clean up
the WaitGroup in case Wait is never executed, and the user explicitly
does not know if the context was closed or not. Most use cases can rely
on Wait closing the context, alternatively checking if the context was
cancelled before exucuting Wait.

Fixes: 551377f ("envoy: Make NACK cancel the WaitGroup")
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
@odinuge odinuge force-pushed the odinuge/fix-ctx-leak-v2 branch from d2eed03 to cc5eca8 Compare March 4, 2026 07:41
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Mar 4, 2026

/test

@odinuge odinuge marked this pull request as ready for review March 4, 2026 07:57
@odinuge odinuge requested a review from a team as a code owner March 4, 2026 07:57
@odinuge odinuge requested a review from jrajahalme March 4, 2026 07:57
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Mar 4, 2026

/test

Copy link
Copy Markdown
Member

@jrajahalme jrajahalme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes the most glaring memory leak; will look for cases where a created waitgroup is not waited on in a followup to clean up the remaining loose ends.

@jrajahalme jrajahalme added the release-note/bug This PR fixes an issue in a previous release of Cilium. label Mar 5, 2026
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Mar 5, 2026
@jrajahalme jrajahalme added this pull request to the merge queue Mar 5, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 5, 2026
@jrajahalme jrajahalme added this pull request to the merge queue Mar 5, 2026
Merged via the queue into cilium:main with commit 18211a4 Mar 5, 2026
82 checks passed
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Mar 5, 2026
@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Mar 6, 2026

Thanks @jrajahalme! Do you mind adding backport labels to v1.19, v1.18 and v1.17?

@odinuge
Copy link
Copy Markdown
Member Author

odinuge commented Mar 11, 2026

ping @jrajahalme can you add the backport labels here? I'll get #43953 back going once the agreed on backport of this version is done.

@jrajahalme jrajahalme added needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Mar 24, 2026
@jrajahalme jrajahalme added the needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch label Mar 24, 2026
@tklauser tklauser mentioned this pull request Mar 25, 2026
5 tasks
@tklauser tklauser added backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. and removed needs-backport/1.19 This PR / issue needs backporting to the v1.19 branch labels Mar 25, 2026
@github-actions github-actions bot added backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. and removed backport-pending/1.19 The backport for Cilium 1.19.x for this PR is in progress. labels Mar 25, 2026
@viktor-kurchenko viktor-kurchenko added backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. and removed needs-backport/1.18 This PR / issue needs backporting to the v1.18 branch labels Mar 30, 2026
@viktor-kurchenko viktor-kurchenko added backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. and removed needs-backport/1.17 This PR / issue needs backporting to the v1.17 branch labels Mar 30, 2026
@github-actions github-actions bot added backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. and removed backport-pending/1.17 The backport for Cilium 1.17.x for this PR is in progress. backport-pending/1.18 The backport for Cilium 1.18.x for this PR is in progress. labels Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-done/1.17 The backport for Cilium 1.17.x for this PR is done. backport-done/1.18 The backport for Cilium 1.18.x for this PR is done. backport-done/1.19 The backport for Cilium 1.19.x for this PR is done. ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/bug This PR fixes an issue in a previous release of Cilium.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants