
completion: fix memory leak in waitGroup #43953

Closed
odinuge wants to merge 2 commits into cilium:main from odinuge:odinuge/fix-ctx-leak

Conversation

@odinuge (Member) commented Jan 23, 2026

See commit message for more info. This fixes a gnarly memory leak. I'm not sure this is the best way to fix it, and we need to look into the guarantees and write better docs - but it's a start. I also want to run the tests to see if anything depends on this in a different way.

I also partially bisected this down to where the WithCancel was introduced, but didn't spend more time digging into that. This most likely affects all versions of Cilium, although in practice it mostly affects environments with a lot of churn.

I've also updated tests to "document" the existing behavior and how the new cancel works together with it.

Fix a slow memory leak triggered by incremental policy updates

@maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Jan 23, 2026
@odinuge (Member Author) commented Jan 23, 2026

/test

func NewWaitGroup(ctx context.Context) (*WaitGroup, context.CancelFunc) {
	ctx2, cancel := context.WithCancel(ctx)
-	return &WaitGroup{ctx: ctx2, cancel: cancel}
+	return &WaitGroup{ctx: ctx2, cancel: cancel}, cancel
@odinuge (Member Author):

Another alternative I'll look into is effectively doing something like an atomic.Bool, but I'm not sure it's worth the rewrite, tbh.

@jrajahalme (Member):

Since this will need to be backported, it is worth keeping the memory leak fix minimal and leaving any cleanups, clarifications, etc. for another PR against main. On this note, it would be a bit less invasive to not add a second return parameter here, but to add a new Cancel() function instead, e.g.,

// Cancel releases resources held for the WaitGroup
func (wg *WaitGroup) Cancel() {
	wg.cancel()
	// Wait clears pending completions; it does not actually block,
	// since we cancelled above.
	wg.Wait()
}

I'd also add an explicit wg.cancel() call to the end of Wait():

@@ -107,6 +119,7 @@ Loop:
 		}
 	}
 	wg.pendingCompletions = nil
+	wg.cancel()
 	return err
 }

Then make use of the new Cancel() at prepareForProxyUpdates:

diff --git a/pkg/endpoint/regenerationcontext.go b/pkg/endpoint/regenerationcontext.go
index b90714d137..0bd1e6cef4 100644
--- a/pkg/endpoint/regenerationcontext.go
+++ b/pkg/endpoint/regenerationcontext.go
@@ -76,7 +76,11 @@ type datapathRegenerationContext struct {
 
 func (ctx *datapathRegenerationContext) prepareForProxyUpdates(parentCtx context.Context) {
 	completionCtx, completionCancel := context.WithTimeout(parentCtx, EndpointGenerationTimeout)
-	ctx.proxyWaitGroup = completion.NewWaitGroup(completionCtx)
+	wg := completion.NewWaitGroup(completionCtx)
+	ctx.proxyWaitGroup = wg
 	ctx.completionCtx = completionCtx
-	ctx.completionCancel = completionCancel
+	ctx.completionCancel = func() {
+		completionCancel()
+		wg.Cancel()
+	}
 }

@odinuge (Member Author):

I'm mostly worried that ^ somewhat hides the fact that this has to be closed to avoid leaking memory - and people are not used to calling .Close() on a Go waitGroup. For a backport it's probably fine, though. I'm still not sure it will really be easier to backport, since we still need to be very careful not to leak memory or introduce correctness bugs. I can give it a stab in a separate PR and see how it works.

Contributor:

I'm not that familiar with this library, but couldn't we just call cancel when we finish waiting (i.e. after we iterate all the pending completions and set wg.pendingCompletions = nil) as an easy way to have a backportable fix?

@odinuge force-pushed the odinuge/fix-ctx-leak branch 2 times, most recently from 9598c37 to 7a764f7, on January 23, 2026 13:05
@odinuge (Member Author) commented Jan 23, 2026

/test

1 similar comment
@odinuge (Member Author) commented Jan 26, 2026

/test

@odinuge (Member Author) commented Jan 26, 2026

Seems like mostly flakes.

@odinuge (Member Author) commented Jan 27, 2026

/ci-l3-l4

@odinuge (Member Author) commented Jan 27, 2026

/ci-clustermesh

@odinuge (Member Author) commented Jan 27, 2026

Seems like #43949 broke the tests - so I'll have to rebase.

@odinuge (Member Author) commented Feb 5, 2026

Got busy with other work this week, so I'll try to rebase and update the tests tomorrow!

This waitGroup implementation creates a new WithCancel context for each
wait group - but never closes it. The actual behavior/leak depends on
the underlying context - but either way, it's a leak, and it's pretty
bad during a lot of churn. The bytes/sec are tiny, but over hours and
hours of uptime this can grow drastically.

Given we use context.WithCancel for the context passed in here, we end
up adding a separate map entry to that parent context for each
iteration. For the usages where the parent context is cancelled this is
fine - but that's not the case everywhere, hence the leak.

I've manually inspected the 'children' field of the 'cancelCtx' struct
used in 'UpdatePolicyMap' and can indeed see that its length is
increasing over time.

Fixes: 551377f ("envoy: Make NACK cancel the WaitGroup")
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>

The current API is a bit ambiguous around reuse of these wait groups.
This commit documents the current behavior and adds tests for it. As far
as I can see there is no real reuse of these, so it should be mostly
fine. We could change the API, but I think it's fine for now to keep it
as-is, since there might be some reason for it to work like this.

Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
@odinuge force-pushed the odinuge/fix-ctx-leak branch from 7a764f7 to 6f4ef74 on February 6, 2026 07:51
@odinuge (Member Author) commented Feb 6, 2026

/test

@odinuge (Member Author) commented Feb 6, 2026

Detected unexpected endpoint BPF program removal.

This does seem like a flake.

unable to update bpf map" module=agent.datapath.ipcache-bpf-listener ipAddr="{IP:0.0.0.0 Mask:00000000}" identity="{Source:local overwrittenLegacySource: ID:world-ipv4 _:[] modifiedByLegacyAPI:false shadowed:false}" modification=Upsert error="update map cilium_ipcache_v2: update: cannot allocate memory

This also.

@odinuge (Member Author) commented Feb 6, 2026

/ci-clustermesh

@odinuge (Member Author) commented Feb 6, 2026

/ci-l7

@odinuge (Member Author) commented Feb 6, 2026

/ci-kpr

@odinuge marked this pull request as ready for review February 6, 2026 14:24
@odinuge requested review from a team as code owners February 6, 2026 14:24
@odinuge (Member Author) commented Feb 6, 2026

Also cc @jrajahalme since you have been working on this stuff.

@odinuge (Member Author) commented Feb 6, 2026

/ci-kpr

@jrajahalme (Member) left a comment

Thanks for finding this leak! Suggesting an alternative that is likely a bit easier to backport:


if err := mgr.waitForProxyCompletions(proxyWaitGroup); err != nil {
mgr.logger.Warn("Failed to apply L7 proxy policy changes. These will be re-applied in future updates.", logfields.Error, err)
}
cancel()
@jrajahalme (Member):

If we add a cancel() call to the WaitGroup.Wait() it does not need to be added here.

@jrajahalme (Member):

wait, it is actually needed since waitForProxyCompletions() skips calling proxyWaitGroup.Wait() if proxyWaitGroup's context has errored:

diff --git a/pkg/endpoint/endpoint.go b/pkg/endpoint/endpoint.go
index 6b68a7cbc0..d08df9375b 100644
--- a/pkg/endpoint/endpoint.go
+++ b/pkg/endpoint/endpoint.go
@@ -614,6 +614,7 @@ func (e *Endpoint) waitForProxyCompletions(proxyWaitGroup *completion.WaitGroup)
 
 	err := proxyWaitGroup.Context().Err()
 	if err != nil {
+		proxyWaitGroup.Cancel()
 		return fmt.Errorf("context cancelled before waiting for proxy updates: %w", err)
 	}
 
diff --git a/pkg/endpointmanager/manager.go b/pkg/endpointmanager/manager.go
index d8625fa758..c14d967f1a 100644
--- a/pkg/endpointmanager/manager.go
+++ b/pkg/endpointmanager/manager.go
@@ -178,6 +178,7 @@ func (mgr *endpointManager) WithPeriodicEndpointRegeneration(ctx context.Context
 func (mgr *endpointManager) waitForProxyCompletions(proxyWaitGroup *completion.WaitGroup) error {
 	err := proxyWaitGroup.Context().Err()
 	if err != nil {
+		proxyWaitGroup.Cancel()
 		return fmt.Errorf("context cancelled before waiting for proxy updates: %w", err)
 	}
 

@odinuge (Member Author):

If we add a cancel() call to the WaitGroup.Wait() it does not need to be added here.

Yeah, this is indeed the place where we see the most leaks.

@odinuge (Member Author):

wait, it is actually needed since waitForProxyCompletions() skips calling proxyWaitGroup.Wait() if proxyWaitGroup's context has errored:

Yeah, that's why I don't fully like the idea of putting the cancel inside Wait - since that would then require every single completion.WaitGroup to have Wait (or separately Close) called in order not to leak.

@jrajahalme (Member):

Adding the wg.cancel() call to the end of WaitGroup.Wait() also makes it clear that the WaitGroup is not to be reused, as any further completions will not be waited for after the first Wait() call completes.

@jrajahalme added labels on Feb 10, 2026: affects/v1.16 (This issue affects the v1.16 branch), needs-backport/1.17, needs-backport/1.18, needs-backport/1.19 (needs backporting to the v1.17/v1.18/v1.19 branches), kind/bug (This is a bug in the Cilium logic.), area/proxy (Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.), release-note/bug (This PR fixes an issue in a previous release of Cilium.)
@maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Feb 10, 2026
@odinuge (Member Author) commented Feb 13, 2026

Thanks for the comments @jrajahalme and @tommyp1ckles. Agreed that a simpler approach is probably just as sane, and also much easier to backport. I opened #44328 so that we can do that first, and then potentially revisit this solution later 😄 I'll just close this one out.

@odinuge marked this pull request as draft February 13, 2026 08:48
@odinuge (Member Author) commented Mar 10, 2026

Initial fix merged in #44328 now. Waiting for that to be backported before looking at this again.

@odinuge (Member Author) commented Mar 24, 2026

I'll close this in favor of #44731

@odinuge closed this Mar 24, 2026