Skip to content

Only garbage collect composed resource watches#6395

Merged
negz merged 3 commits intocrossplane:mainfrom
negz:watchful
Apr 25, 2025
Merged

Only garbage collect composed resource watches#6395
negz merged 3 commits intocrossplane:mainfrom
negz:watchful

Conversation

@negz
Copy link
Copy Markdown
Member

@negz negz commented Apr 22, 2025

Description of your changes

Fixes #5151
Fixes #5400
Fixes #6396 1

Realtime compositions use a garbage collector to cleanup watches. We do this in part to keep memory consumption down, and in part to avoid watching a type that no longer exists.

Say for example some XRs of kind: A compose a resource of kind: B. The kind: A controller will watch kind: B. Later the XRs are updated to a new Composition. They no longer compose kind: B. The provider for kind: B is uninstalled, so it's no longer a valid type.

If Crossplane doesn't stop the watch for kind: B the watch's underlying informer will spew errors about being unable to list kind: B. This doesn't break anything per-se but it'll at a minimum flood Crossplane's logs with scary errors.

Realtime composition has four types of watches: claims, XRs, composed resources, and composition revisions. Before this commit the garbage collector would GC them all. Now it only GCs composed resource watches.

There's no reason to GC the other types of watches. Those watches should never be stopped while the XR controller is still running; they'll be stopped when the XR controller stops running.

The garbage collector uses a cache-backed client to determine what watches to stop. The cache can be stale. So it's possible that the garbage collector will stop a watch for a type that's actually in use. For example if the type should be watched, then shouldn't, then should again in rapid succession. e.g. If there's only one XR of kind: A and it composes one resource of kind: B, then stops composing it, then quickly starts composing it again.

Right now we can tolerate this for composed resource watches because XRs are still reconciled every poll interval - the watch'll be restarted anytime any XR that composes that resource type is reconciled at poll interval.

We can't tolerate this for claims, XRs, and composition revisions. If these watches are garbage collected nothing would restart them until Crossplane itself restarts (or their XRD was recreated).

I have:

Need help with this checklist? See the cheat sheet.

Footnotes

  1. I pushed a second commit that fixes this issue - the one that causes the realtime composition E2E tests to fail so often. See the commit message for details on that.

Realtime compositions use a garbage collector to cleanup watches. We do
this in part to keep memory consumption down, and in part to avoid
watching a type that no longer exists.

Say for example some XRs of kind: A compose a resource of kind: B. The
kind: A controller will watch kind: B. Later the XRs are updated to a
new Composition. They no longer compose kind: B. The provider for kind:
B is uninstalled, so it's no longer a valid type.

If Crossplane doesn't stop the watch for kind: B the watch's underlying
informer will spew errors about being unable to list kind: B. This
doesn't break anything per-se but it'll at a minimum flood Crossplane's
logs with scary errors.

Realtime composition has four types of watches: claims, XRs, composed
resources, and composition revisions. Before this commit the garbage
collector would GC them all. Now it only GCs composed resource watches.

There's no reason to GC the other types of watches. Those watches should
never be stopped while the XR controller is still running; they'll be
stopped when the XR controller stops running.

The garbage collector uses a cache-backed client to determine what
watches to stop. The cache can be stale. So it's possible that the
garbage collector will stop a watch for a type that's actually in use.
For example if the type should be watched, then shouldn't, then should
again in rapid succession. e.g. If there's only one XR of kind: A and it
composes one resource of kind: B, then stops composing it, then quickly
starts composing it again.

Right now we can tolerate this for composed resource watches because XRs
are still reconciled every poll interval - the watch'll be restarted
anytime any XR that composes that resource type is reconciled at poll
interval.

We can't tolerate this for claims, XRs, and composition revisions. If
these watches are garbage collected nothing would restart them until
Crossplane itself restarts (or their XRD was recreated).

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz requested a review from a team as a code owner April 22, 2025 02:00
@negz negz requested review from jbw976 and turkenh April 22, 2025 02:00
Copy link
Copy Markdown
Contributor

@bobh66 bobh66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@negz
Copy link
Copy Markdown
Member Author

negz commented Apr 22, 2025

FYI the realtime composition E2Es failed due to #6396, which I plan to look at separately.

Pushed a commit that hopefully fixes #6396.

Copy link
Copy Markdown
Member

@jbw976 jbw976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code change looks as described in the PR body, thanks for those details and it sounds reasonable.

Question in reference to not stopping watches for composed resources:

This doesn't break anything per-se but it'll at a minimum flood Crossplane's logs with scary errors

would this same statement apply to any scenarios for not stopping other types of watches? i.e. with the change in this PR, are there scenarios you can think of where this flood of scary errors could happen?

@negz
Copy link
Copy Markdown
Member Author

negz commented Apr 22, 2025

i.e. with the change in this PR, are there scenarios you can think of where this flood of scary errors could happen?

When you use a cache-backed controller-runtime client to read some type of object, controller-runtime starts a watch in a background Goroutine. The watch populates the cache. If something goes wrong with the watch you'll see a lot of logs like this:

2025-04-22T18:22:14Z	DEBUG	crossplane	Watch error - probably due to CRD being uninstalled	{"error": "failed to list *composed.Unstructured: nopresources.nop.crossplane.io is forbidden: User \"system:serviceaccount:crossplane-system:crossplane\" cannot list resource \"nopresources\" in API group \"nop.crossplane.io\" at the cluster scope"}

Most of this log comes from the underlying client-go machinery, but we prefix it with the Watch error - probably due to CRD being uninstalled to give folks a hint as to what's (probably) happening. Like the prefix says, the most likely reason for the watch to fail is because the CRD has been removed.

So with our dynamic watches, the challenge is to make sure we stop the watch (informer) ASAP when it's not needed anymore. Ideally before the CRD is uninstalled, so that folks never see this log line. We do that two ways.

The garbage collector this PR touches runs every 60 seconds and stops watches for composed resources that nothing composes anymore.

// GarbageCollectCustomResourceInformers garbage collects informers for custom
// resources (e.g. Crossplane XRs, claims and composed resources) when the CRD
// that defines them is deleted. The garbage collector runs until the supplied
// context is cancelled.
func (e *ControllerEngine) GarbageCollectCustomResourceInformers(ctx context.Context) error {

There's also another garbage collector that watches for CRD deletes, and stops related watches.

This PR doesn't really change any of that for other types of watches (claims, XRs, CompositionRevisions). Those should all be stopped when the XRD is deleted, not before, not after. If the CRD for the XR gets deleted before the XRD is deleted you'd get these scary logs, but that'd be the least of your problems. 🤷‍♂️

@negz negz requested review from bobh66 and jbw976 April 23, 2025 06:23
@negz negz marked this pull request as draft April 23, 2025 07:05
@negz
Copy link
Copy Markdown
Member Author

negz commented Apr 23, 2025

My latest commit seems to have somehow broken realtime composition selection. Switching to draft while I figure that out.

Edit: Figured it out. I was passing &unstructructured.Unstructured{} to NewStoppableSource, but it was important that for revisions it actually be a CompositionRevision.

rev, ok := e.Object.(*v1.CompositionRevision)
if !ok {
// should not happen
return

negz added 2 commits April 23, 2025 19:11
Before this commit, calling a StoppableSource's Stop method would
potentially start a new informer.

We stop a source by removing its EventHandler from the relevant
informer. Previously each StoppableSource had a map of all informers
(for some reason - not sure why I did this) despite only needing one. So
they'd call GetInformers to get the informer they wanted to remove the
event handler from.

The problem is that GetInformer will implicitly try to start a new
informer if the informer isn't already started. So to stop a source we'd
first need to start its informer. That's not always possible - e.g. if
Crossplane doesn't have RBAC access to the informer's type, or the CRD
for the informer's type no longer exists.

Now each StoppableSource only contains the informer for the type they're
a source of events for, so they don't need to call GetInformer to stop a
watch.

Signed-off-by: Nic Cope <nicc@rk0n.org>
@negz negz marked this pull request as ready for review April 24, 2025 04:23
@negz
Copy link
Copy Markdown
Member Author

negz commented Apr 24, 2025

@jbw976 @bobh66 @turkenh PTAL. This should now fix the issue that causes the RealtimeCompositions E2E test to fail often.

Copy link
Copy Markdown
Member

@jbw976 jbw976 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome @negz, great debugging effort through some complicated code interactions! 🫡

@negz negz merged commit e776324 into crossplane:main Apr 25, 2025
38 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Can't stop composed resource watches when XRD is deleted Crossplane fails to synchronize claims with XRs cannot get composite resource

3 participants