
completion: fix memory leak in waitGroup #43953

Closed
odinuge wants to merge 2 commits into cilium:main from odinuge:odinuge/fix-ctx-leak

Conversation

@odinuge (Member) commented Jan 23, 2026

See commit message for more info. This fixes a gnarly memory leak. I'm not sure this is the best way to fix it, and we need to look into the guarantees and write better docs - but it's a start. I also want to run the tests to see if anything depends on this in a different way.

I also partially bisected this down to where the WithCancel was introduced, but didn't spend more time digging into that. This most likely affects all versions of Cilium, although in practice it mostly affects environments with a lot of churn.

I've also updated tests to "document" the existing behavior and how the new cancel works together with it.

Fix a slow memory leak triggered by incremental policy updates

@maintainer-s-little-helper bot added the dont-merge/needs-release-note-label label (The author needs to describe the release impact of these changes.) on Jan 23, 2026
@odinuge (Member Author) commented Jan 23, 2026

/test

func NewWaitGroup(ctx context.Context) (*WaitGroup, context.CancelFunc) {
	ctx2, cancel := context.WithCancel(ctx)
-	return &WaitGroup{ctx: ctx2, cancel: cancel}
+	return &WaitGroup{ctx: ctx2, cancel: cancel}, cancel
@odinuge (Member Author):

Another alternative I'll look into is effectively doing something like an atomic.Bool, but I'm not sure it's worth the rewrite, tbh.

@jrajahalme (Member):

Since this will need to be backported, it is worth keeping the memory leak fix minimal and leaving any cleanups, clarifications, etc. for another PR against main. On this note, it would be a bit less invasive to not add a second return parameter here, but to add a new Cancel() function instead, e.g.,

// Cancel releases resources held for the WaitGroup
func (wg *WaitGroup) Cancel() {
	wg.cancel()
	// Wait clears pending completions; it does not actually block,
	// since we cancelled above.
	wg.Wait()
}

I'd also add an explicit wg.cancel() call to the end of Wait():

@@ -107,6 +119,7 @@ Loop:
 		}
 	}
 	wg.pendingCompletions = nil
+	wg.cancel()
 	return err
 }

Then make use of the new Cancel() at prepareForProxyUpdates:

diff --git a/pkg/endpoint/regenerationcontext.go b/pkg/endpoint/regenerationcontext.go
index b90714d137..0bd1e6cef4 100644
--- a/pkg/endpoint/regenerationcontext.go
+++ b/pkg/endpoint/regenerationcontext.go
@@ -76,7 +76,11 @@ type datapathRegenerationContext struct {
 
 func (ctx *datapathRegenerationContext) prepareForProxyUpdates(parentCtx context.Context) {
 	completionCtx, completionCancel := context.WithTimeout(parentCtx, EndpointGenerationTimeout)
-	ctx.proxyWaitGroup = completion.NewWaitGroup(completionCtx)
+	wg := completion.NewWaitGroup(completionCtx)
+	ctx.proxyWaitGroup = wg
 	ctx.completionCtx = completionCtx
-	ctx.completionCancel = completionCancel
+	ctx.completionCancel = func() {
+		completionCancel()
+		wg.Cancel()
+	}
 }

@odinuge (Member Author):

I'm mostly worried that ^ somewhat hides the fact that this has to be closed to avoid leaking memory - and people are not used to calling .Close() on a Go waitGroup. For a backport it's probably fine, though. I'm still not sure it will really be easier to backport, since we still need to be very careful not to leak memory or introduce correctness bugs. I can give it a stab in a separate PR and see how it works.

Contributor:

I'm not that familiar with this library, but couldn't we just call cancel when we finish waiting (i.e. after we iterate all the pending completions and set wg.pendingCompletions = nil) as an easy way to have a backportable fix?

@odinuge force-pushed the odinuge/fix-ctx-leak branch 2 times, most recently from 9598c37 to 7a764f7, on January 23, 2026 13:05
@odinuge (Member Author) commented Jan 23, 2026

/test

1 similar comment
@odinuge (Member Author) commented Jan 26, 2026

/test

@odinuge (Member Author) commented Jan 26, 2026

Seems like mostly flakes.

@odinuge (Member Author) commented Jan 27, 2026

/ci-l3-l4

@odinuge (Member Author) commented Jan 27, 2026

/ci-clustermesh

@odinuge (Member Author) commented Jan 27, 2026

Seems like #43949 broke the tests - so I'll have to rebase.

@odinuge (Member Author) commented Feb 5, 2026

Got busy with other work this week, so I'll try to rebase and update the tests tomorrow!

This waitGroup implementation creates a new WithCancel context for each
wait group - but never closes it. The actual behavior/leak depends on
the underlying context - but either way, it's a leak, and it's pretty
bad during a lot of churn. The bytes/sec are tiny, but over hours and
hours of uptime this can grow drastically.

Given we use context.WithCancel for the context passed in here, we end
up adding a separate map entry to that parent context for each
iteration. For the usages where the parent context is cancelled this is
fine - but that's not the case everywhere, hence the leak.

I've manually inspected the 'children' field of the 'cancelCtx' struct
used in 'UpdatePolicyMap' and can indeed see that its length is
increasing over time.

Fixes: 551377f ("envoy: Make NACK cancel the WaitGroup")
Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>

The current API is a bit ambiguous around reuse of these wait groups.
This commit documents the current behavior and adds tests for it. As far
as I can see there is no real reuse of these, so it should be mostly
fine. We could change the API, but I think it's fine for now to keep it
as-is, since there might be some reason for it to work like this.

Signed-off-by: Odin Ugedal <odin@ugedal.com>
Signed-off-by: Odin Ugedal <ougedal@palantir.com>
@odinuge force-pushed the odinuge/fix-ctx-leak branch from 7a764f7 to 6f4ef74 on February 6, 2026 07:51
@odinuge (Member Author) commented Feb 6, 2026

/test

@odinuge (Member Author) commented Feb 6, 2026

Detected unexpected endpoint BPF program removal.

This does seem like a flake.

unable to update bpf map" module=agent.datapath.ipcache-bpf-listener ipAddr="{IP:0.0.0.0 Mask:00000000}" identity="{Source:local overwrittenLegacySource: ID:world-ipv4 _:[] modifiedByLegacyAPI:false shadowed:false}" modification=Upsert error="update map cilium_ipcache_v2: update: cannot allocate memory

This also.

@odinuge (Member Author) commented Feb 6, 2026

/ci-clustermesh

@odinuge (Member Author) commented Feb 6, 2026

/ci-l7

@odinuge (Member Author) commented Feb 6, 2026

/ci-kpr

@odinuge marked this pull request as ready for review February 6, 2026 14:24
@odinuge requested review from a team as code owners February 6, 2026 14:24
@odinuge (Member Author) commented Feb 6, 2026

Also cc @jrajahalme since you have been working on this stuff.

@odinuge (Member Author) commented Feb 6, 2026

/ci-kpr

@jrajahalme (Member) left a comment

Thanks for finding this leak! Suggesting an alternative that is likely a bit easier to backport:


if err := mgr.waitForProxyCompletions(proxyWaitGroup); err != nil {
mgr.logger.Warn("Failed to apply L7 proxy policy changes. These will be re-applied in future updates.", logfields.Error, err)
}
cancel()
@jrajahalme (Member):

If we add a cancel() call to the WaitGroup.Wait() it does not need to be added here.

@jrajahalme (Member):

wait, it is actually needed since waitForProxyCompletions() skips calling proxyWaitGroup.Wait() if proxyWaitGroup's context has errored:

diff --git a/pkg/endpoint/endpoint.go b/pkg/endpoint/endpoint.go
index 6b68a7cbc0..d08df9375b 100644
--- a/pkg/endpoint/endpoint.go
+++ b/pkg/endpoint/endpoint.go
@@ -614,6 +614,7 @@ func (e *Endpoint) waitForProxyCompletions(proxyWaitGroup *completion.WaitGroup)
 
 	err := proxyWaitGroup.Context().Err()
 	if err != nil {
+		proxyWaitGroup.Cancel()
 		return fmt.Errorf("context cancelled before waiting for proxy updates: %w", err)
 	}
 
diff --git a/pkg/endpointmanager/manager.go b/pkg/endpointmanager/manager.go
index d8625fa758..c14d967f1a 100644
--- a/pkg/endpointmanager/manager.go
+++ b/pkg/endpointmanager/manager.go
@@ -178,6 +178,7 @@ func (mgr *endpointManager) WithPeriodicEndpointRegeneration(ctx context.Context
 func (mgr *endpointManager) waitForProxyCompletions(proxyWaitGroup *completion.WaitGroup) error {
 	err := proxyWaitGroup.Context().Err()
 	if err != nil {
+		proxyWaitGroup.Cancel()
 		return fmt.Errorf("context cancelled before waiting for proxy updates: %w", err)
 	}
 

@odinuge (Member Author):

If we add a cancel() call to the WaitGroup.Wait() it does not need to be added here.

Yeah, this is indeed the place where we see the most leaks.

@odinuge (Member Author):

wait, it is actually needed since waitForProxyCompletions() skips calling proxyWaitGroup.Wait() if proxyWaitGroup's context has errored:

Yeah, that's why I don't fully like the idea of putting the cancel inside Wait - since that would then require every single completion.WaitGroup to have Wait (or separately Close) called in order not to leak.

@jrajahalme (Member):

Adding the wg.cancel() call to the end of WaitGroup.Wait() also makes it clear that the WaitGroup is not to be reused, as any further completions will not be waited for after the first Wait() call completes.

@jrajahalme added labels on Feb 10, 2026: affects/v1.16 (This issue affects the v1.16 branch), needs-backport/1.17, needs-backport/1.18, needs-backport/1.19 (needs backporting to the v1.17/v1.18/v1.19 branches), kind/bug (This is a bug in the Cilium logic.), area/proxy (Impacts proxy components, including DNS, Kafka, Envoy and/or XDS servers.), release-note/bug (This PR fixes an issue in a previous release of Cilium.)
@maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label label on Feb 10, 2026
@odinuge (Member Author) commented Feb 13, 2026

Thanks for the comments @jrajahalme and @tommyp1ckles. Agreed that a simpler approach is probably just as sane, and also much easier to backport. I opened #44328 so that we can do that first, and then potentially revisit this solution later 😄 I'll just close this one out.

@odinuge marked this pull request as draft February 13, 2026 08:48
@odinuge (Member Author) commented Mar 10, 2026

Initial fix merged in #44328 now. Waiting for that to be backported before looking at this again.

@odinuge (Member Author) commented Mar 24, 2026

I'll close this in favor of #44731

@odinuge closed this Mar 24, 2026