Only garbage collect composed resource watches#6395
Conversation
Realtime compositions use a garbage collector to cleanup watches. We do this in part to keep memory consumption down, and in part to avoid watching a type that no longer exists. Say for example some XRs of kind: A compose a resource of kind: B. The kind: A controller will watch kind: B. Later the XRs are updated to a new Composition. They no longer compose kind: B. The provider for kind: B is uninstalled, so it's no longer a valid type. If Crossplane doesn't stop the watch for kind: B the watch's underlying informer will spew errors about being unable to list kind: B. This doesn't break anything per-se but it'll at a minimum flood Crossplane's logs with scary errors. Realtime composition has four types of watches: claims, XRs, composed resources, and composition revisions. Before this commit the garbage collector would GC them all. Now it only GCs composed resource watches. There's no reason to GC the other types of watches. Those watches should never be stopped while the XR controller is still running; they'll be stopped when the XR controller stops running. The garbage collector uses a cache-backed client to determine what watches to stop. The cache can be stale. So it's possible that the garbage collector will stop a watch for a type that's actually in use. For example if the type should be watched, then shouldn't, then should again in rapid succession. e.g. If there's only one XR of kind: A and it composes one resource of kind: B, then stops composing it, then quickly starts composing it again. Right now we can tolerate this for composed resource watches because XRs are still reconciled every poll interval - the watch'll be restarted anytime any XR that composes that resource type is reconciled at poll interval. We can't tolerate this for claims, XRs, and composition revisions. If these watches are garbage collected nothing would restart them until Crossplane itself restarts (or their XRD was recreated). Signed-off-by: Nic Cope <nicc@rk0n.org>
jbw976
left a comment
There was a problem hiding this comment.
The code change looks as described in the PR body, thanks for those details and it sounds reasonable.
Question in reference to not stopping watches for composed resources:
This doesn't break anything per-se but it'll at a minimum flood Crossplane's logs with scary errors
would this same statement apply to any scenarios for not stopping other types of watches? i.e. with the change in this PR, are there scenarios you can think of where this flood of scary errors could happen?
When you use a cache-backed controller-runtime client to read some type of object, controller-runtime starts a watch in a background Goroutine. The watch populates the cache. If something goes wrong with the watch you'll see a lot of logs like this: Most of this log comes from the underlying client-go machinery, but we prefix it with the So with our dynamic watches, the challenge is to make sure we stop the watch (informer) ASAP when it's not needed anymore. Ideally before the CRD is uninstalled, so that folks never see this log line. We do that two ways. The garbage collector this PR touches runs every 60 seconds and stops watches for composed resources that nothing composes anymore. crossplane/internal/engine/engine.go Lines 504 to 508 in b0cc38c There's also another garbage collector that watches for CRD deletes, and stops related watches. This PR doesn't really change any of that for other types of watches (claims, XRs, CompositionRevisions). Those should all be stopped when the XRD is deleted, not before, not after. If the CRD for the XR gets deleted before the XRD is deleted you'd get these scary logs, but that'd be the least of your problems. 🤷♂️ |
|
Edit: Figured it out. I was passing crossplane/internal/controller/apiextensions/definition/handlers.go Lines 44 to 47 in b0cc38c |
Before this commit, calling a StoppableSource's Stop method would potentially start a new informer. We stop a source by removing its EventHandler from the relevant informer. Previously each StoppableSource had a map of all informers (for some reason - not sure why I did this) despite only needing one. So they'd call GetInformers to get the informer they wanted to remove the event handler from. The problem is that GetInformer will implicitly try to start a new informer if the informer isn't already started. So to stop a source we'd first need to start its informer. That's not always possible - e.g. if Crossplane doesn't have RBAC access to the informer's type, or the CRD for the informer's type no longer exists. Now each StoppableSource only contains the informer for the type they're a source of events for, so they don't need to call GetInformer to stop a watch. Signed-off-by: Nic Cope <nicc@rk0n.org>
Signed-off-by: Nic Cope <nicc@rk0n.org>
Description of your changes
Fixes #5151
Fixes #5400
Fixes #6396 1
Realtime compositions use a garbage collector to cleanup watches. We do this in part to keep memory consumption down, and in part to avoid watching a type that no longer exists.
Say for example some XRs of kind: A compose a resource of kind: B. The kind: A controller will watch kind: B. Later the XRs are updated to a new Composition. They no longer compose kind: B. The provider for kind: B is uninstalled, so it's no longer a valid type.
If Crossplane doesn't stop the watch for kind: B the watch's underlying informer will spew errors about being unable to list kind: B. This doesn't break anything per-se but it'll at a minimum flood Crossplane's logs with scary errors.
Realtime composition has four types of watches: claims, XRs, composed resources, and composition revisions. Before this commit the garbage collector would GC them all. Now it only GCs composed resource watches.
There's no reason to GC the other types of watches. Those watches should never be stopped while the XR controller is still running; they'll be stopped when the XR controller stops running.
The garbage collector uses a cache-backed client to determine what watches to stop. The cache can be stale. So it's possible that the garbage collector will stop a watch for a type that's actually in use. For example if the type should be watched, then shouldn't, then should again in rapid succession. e.g. If there's only one XR of kind: A and it composes one resource of kind: B, then stops composing it, then quickly starts composing it again.
Right now we can tolerate this for composed resource watches because XRs are still reconciled every poll interval - the watch'll be restarted anytime any XR that composes that resource type is reconciled at poll interval.
We can't tolerate this for claims, XRs, and composition revisions. If these watches are garbage collected nothing would restart them until Crossplane itself restarts (or their XRD was recreated).
I have:
earthly +reviewableto ensure this PR is ready for review.Added or updated e2e tests.Linked a PR or a docs tracking issue to document this change.Addedbackport release-x.ylabels to auto-backport this PR.Followed the API promotion workflow if this PR introduces, removes, or promotes an API.Need help with this checklist? See the cheat sheet.
Footnotes
I pushed a second commit that fixes this issue - the one that causes the realtime composition E2E tests to fail so often. See the commit message for details on that. ↩