Implement dynamic watch manager responsible for event dispatch to watching controllers #490
Conversation
This looks extremely promising! I may have missed some notes b/c it's late, but overall this should really simplify things and get us 90% of the way to removing the need for finalizers. The last step being wiping the constraints cached in OPA when a template is deleted.
ctrl.SetLogger(crzap.Logger(true))

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	NewCache: dynamiccache.New,
If the two managers share the same cache, the parent manager needs to have some way of making sure child managers don't un-watch resources that the parent manager is interested in (e.g. Constraint Templates here).
The link is broken for me. Where is this handled? I kept an eye out, but couldn't find anything obvious.
Let me make sure I understand the question.
Within a watch manager, with multiple registrars watching the same resource, the watch manager performs reference counting and only removes the informer when the last registrar unregisters. That happens in this code in doRemoveWatch:
// Skip if there are additional watchers that would prevent us from removing it
m := wm.managedKinds.Get() // Not a deadlock but beware if assumptions change...
if _, ok := m[gvk]; ok {
	return nil
}
If you're asking about having multiple watch managers in-process, they would be unaware of each other. I'd say that is an unsupported use-case - should we document that to make it more obvious? Do you have another idea?
Also one more clarification: we no longer need multiple controller-runtime managers in a single process. The parent/child manager relationship previously used has been removed.
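The reference-counting behavior described above can be sketched in miniature. This is a toy illustration only: GVK, watchManager, and the registrar names are stand-ins, not the actual gatekeeper types.

```go
package main

import "fmt"

// GVK stands in for schema.GroupVersionKind.
type GVK string

// watchManager tracks which registrars still need an informer for each kind.
// An informer is only removed when its last registrar unregisters.
type watchManager struct {
	watchers map[GVK]map[string]bool // gvk -> set of registrar names
}

func newWatchManager() *watchManager {
	return &watchManager{watchers: map[GVK]map[string]bool{}}
}

func (wm *watchManager) addWatch(registrar string, gvk GVK) {
	if wm.watchers[gvk] == nil {
		wm.watchers[gvk] = map[string]bool{}
		fmt.Println("starting informer for", gvk) // first watcher: start informer
	}
	wm.watchers[gvk][registrar] = true
}

func (wm *watchManager) removeWatch(registrar string, gvk GVK) {
	delete(wm.watchers[gvk], registrar)
	// Skip if there are additional watchers that would prevent us from removing it.
	if len(wm.watchers[gvk]) > 0 {
		return
	}
	delete(wm.watchers, gvk)
	fmt.Println("removing informer for", gvk) // last watcher gone: remove informer
}

func main() {
	wm := newWatchManager()
	wm.addWatch("constraint-ctl", "FooConstraint")
	wm.addWatch("config-ctl", "FooConstraint")
	wm.removeWatch("constraint-ctl", "FooConstraint") // informer kept: config-ctl still watching
	wm.removeWatch("config-ctl", "FooConstraint")     // now the informer is removed
}
```

As noted in the thread, this only holds within a single watch manager; two managers in one process would not see each other's counts.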
The controller runtime manager and the watch manager share the same cache right?
What happens if I add a sync for constraint templates and then remove it. Is the watch that the controller manager established preserved somehow?
In practice they wouldn't affect each other, as the watch manager is only dealing in unstructured informers and the constrainttemplate controller would be using a structured informer. These are completely separate in the InformerCache.
I could think of safeguards to put in but perhaps we should handle those as followup tasks?
Ah, I see the unstructured/structured non-interaction thing.
I think safeguards would be good as this seems like a very subtle behavioral difference to rely on and I wonder if the fact that they are different caches is officially part of the contract.
That being said, I'm fine with deferring more explicit safeguards to a follow-on change. Hopefully sooner rather than later.
	log.Error(err, "invalid constraint GroupVersion", "gvk", gvk)
	return reconcile.Result{}, nil
}
// TODO(OREN) do we need to ask the watch manager if we are registered for this kind?
Asking the watch manager seems superfluous, as the channel wouldn't have received the event w/o registration.
A scenario I was considering was when events are in-flight in the various queues and are received after we have deregistered the watch. I think concretely for this controller it will be fine - de-registration happens after the parent template is removed from OPA, by which point either constraints should have been removed by the garbage collector or if their cache lags, a lookup in OPA should fail and we should drop the request.
...which I believe we were going to address within the constraint framework, correct?
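The drop-the-stale-request behavior described above can be illustrated with a tiny sketch. Everything here is made up for illustration (fakeOPA, request); the real check would be a lookup against the OPA cache inside the reconciler:

```go
package main

import "fmt"

// request stands in for a reconcile.Request for some constraint kind.
type request struct{ name string }

// fakeOPA is a stand-in for the OPA cache; the real lookup is a client call.
type fakeOPA struct{ constraints map[string]bool }

// handle drops requests whose constraint is no longer known to OPA,
// e.g. late events still in flight after the watch was deregistered.
func handle(opa fakeOPA, req request) string {
	if !opa.constraints[req.name] {
		return "dropped" // stale in-flight event: nothing to reconcile
	}
	return "processed"
}

func main() {
	opa := fakeOPA{constraints: map[string]bool{"live-constraint": true}}
	fmt.Println(handle(opa, request{"live-constraint"}))
	fmt.Println(handle(opa, request{"stale-constraint"}))
}
```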
syncAdder := syncc.Adder{Opa: opa}
// Events will be used to receive events from dynamic watches registered
// via the registrar below.
events := make(chan event.GenericEvent, 1024)
Is there an upper limit to how many sources we can have?
There is not. Each controller has its own channel to receive event fan-out from the watch manager. The channel is buffered only to provide some cushion in case the controller is slow to consume from it. Consumption consists of encoding the event and putting it into a work queue, so it is not blocked on reconciliation time and I do not expect this to become an issue.
The current event distribution is a bit naive and can be improved:
// Distribute the event
for _, w := range watchers {
	select {
	case w <- e:
		// TODO(OREN) add timeout
	case <-stop:
	}
}
^ theoretically a slow consumer that ends up blocking would cause event distribution to block for other consumers as well. This can be improved with a goroutine per consumer in distributeEvent, but at the cost of additional complexity.
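A minimal sketch of the goroutine-per-consumer variant mentioned above, assuming a per-send timeout. All names here are illustrative, not the actual distributeEvent; the trade-off is extra goroutines and the possibility of dropped or reordered deliveries to a stuck watcher:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type event struct{ name string }

// distributeEvent fans an event out to every watcher without letting one slow
// consumer block the others: each send happens in its own goroutine and gives
// up after a timeout, so a blocked channel only loses its own delivery.
func distributeEvent(e event, watchers []chan event, stop <-chan struct{}) {
	var wg sync.WaitGroup
	for _, w := range watchers {
		wg.Add(1)
		go func(w chan event) {
			defer wg.Done()
			select {
			case w <- e:
			case <-time.After(100 * time.Millisecond): // drop on slow consumer
			case <-stop:
			}
		}(w)
	}
	wg.Wait()
}

func main() {
	fast := make(chan event, 1)
	slow := make(chan event) // unbuffered and never read: simulates a stuck consumer
	stop := make(chan struct{})
	distributeEvent(event{"constraint-updated"}, []chan event{fast, slow}, stop)
	fmt.Println((<-fast).name) // the fast consumer still got the event
}
```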
Known open tasks before final review:

New stuff!
Also, need to look at the transactionality of config set changes to cached data.
Force-pushed 1c49883 to ad16949
@maxsmythe this should be ready for review.
maxsmythe
left a comment
Some discussion around the watch manager and I still have a question about how to avoid removing informers for controllers not using the watch manager, otherwise mostly nits.
pkg/watch/registrar.go
Outdated
"fmt"
"sync"

"sigs.k8s.io/controller-runtime/pkg/event"
nit: consolidate import groups
I've been keeping separation between standard library and external imports. Is that acceptable or do you prefer a flat list mixing both?
That separation is fine, there are some places where there are more than 2 import blocks though.
Sorry about that, these are artifacts of GoLand / goimports introducing breaks and me not paying much attention to the collapsed blocks. I've fixed where you called them out.
Ack, thanks. I hear you WRT the goimports thing. I get the same thing and try to stay on it, but it's definitely not 100% success rate on my end either.
shomron
left a comment
@maxsmythe I've reworked the replay threading, please take another look. Thanks again for all your time!
Notes on latest changes:
pkg/watch/replay.go
Outdated
case <-ctx.Done():
	return nil
case req := <-wm.replayRequests:
	byRegistrar, inProgress := m[req.gvk]
Should inProgress be triggered by whether the replay is in byRegistrar?
What happens if Registrar A adds a watch, Registrar B adds a watch (and gets replay), and then Registrar C adds a watch? Will a replay trigger?
You're right, this looks like a bug. I'll fix and add a corresponding test scenario. Good catch!
Extended test (which initially failed 👍) and fixed.
pkg/watch/manager_test.go
Outdated
}

// Verify ReplaceWatch replaces the set of watched resources for a registrar. New watches will be added,
// unneeded watches will be removed, and watches that haven't changed will remain unchanged.
maxsmythe
left a comment
1 comment on the replay code, otherwise LGTM
Force-pushed f2098c2 to 818e468
One issue: were we going to add a NOTICE file due to the forking?
@maxsmythe above I wrote "...Will followup with comprehensive NOTICE in a separate PR." The fork is documented and original copyright is maintained. Reading through the Apache guidelines, it's not clear to me that it's necessary. They say:
... and in bold:
controller-runtime does not include a NOTICE file we would duplicate. We include their original license and give attribution to the fork. I'm happy to follow your guidance on this and/or address this at a later time with legal consultation. WDYT?
Force-pushed 3bbe5be to 5ff95d9
@maxsmythe ok, I've moved the controller-runtime forked code under

LGTM
constraintsCache *ConstraintsCache) reconcile.Reconciler {
	return &ReconcileConstraint{
		Client: mgr.GetClient(),
		// Separate reader and writer because manager's default client bypasses the cache for unstructured resources.
Separate reader and writer because manager's default client bypasses the cache for unstructured resources.
Is there any documentation around this?
Could there be any issues with reader.Get or reader.List when reading from the cache? E.g. data freshness?
I don't know that it's documented, but the code is here: https://github.com/kubernetes-sigs/controller-runtime/blob/v0.5.0/pkg/manager/manager.go#L328-L330
Regarding data freshness - we should be good. Although the watch.Manager does not block for cache sync when adding a new watch, if you do a reader.Get or reader.List, controller-runtime will block on WaitForCacheSync here:
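To illustrate the reader/writer split in miniature, here is a stdlib-only sketch. The Reader and Writer interfaces are simplified stand-ins for controller-runtime's client.Reader and client.Client, not the real types, and the in-memory maps stand in for the cache and the API server:

```go
package main

import "fmt"

// Reader and Writer mimic the split between a cache-backed reader and a
// direct API writer (illustrative interfaces, not the real ones).
type Reader interface{ Get(key string) (string, error) }
type Writer interface{ Update(key, val string) error }

// cacheReader serves reads from a local (synced) cache.
type cacheReader struct{ data map[string]string }

func (c cacheReader) Get(key string) (string, error) {
	v, ok := c.data[key]
	if !ok {
		return "", fmt.Errorf("not found: %s", key)
	}
	return v, nil
}

// apiWriter writes straight through, as the manager's default client does for
// unstructured resources.
type apiWriter struct{ store map[string]string }

func (a apiWriter) Update(key, val string) error {
	a.store[key] = val
	return nil
}

// reconciler keeps the reader and writer separate, mirroring the PR comment:
// reads go through the cache, writes go straight to the API server.
type reconciler struct {
	reader Reader
	writer Writer
}

func (r *reconciler) reconcile(key string) error {
	cur, err := r.reader.Get(key)
	if err != nil {
		return err
	}
	return r.writer.Update(key, cur+"-reconciled")
}

func main() {
	store := map[string]string{}
	r := &reconciler{
		reader: cacheReader{data: map[string]string{"tmpl": "v1"}},
		writer: apiWriter{store: store},
	}
	if err := r.reconcile("tmpl"); err != nil {
		panic(err)
	}
	fmt.Println(store["tmpl"])
}
```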
| client.Client | ||
| cs *watch.ControllerSwitch | ||
| reader client.Reader | ||
|
|
What's the reason for removing this for sync?
The ControllerSwitch was only needed in controllers that would conflict with the teardown logic in main.go:L194. The sync controller isn't writing status or removing finalizers, so there shouldn't be any contention to worry about.
Implement dynamic watch manager responsible for event dispatch to watching controllers
This commit is a significant change in how dynamic watches are implemented and routed. Previously, the watch manager would
manage the lifecycle of subordinate controllers via a subordinate controller-runtime Manager which was rebuilt whenever
the set of watched resources would change. This would dispose of informer caches and require a full resync from the API server.
This approach worked around the inability of controller-runtime to remove active informers and controllers running within a Manager.
Here we introduce a new approach where there is only a single controller-runtime Manager and controllers are fixed.
Instead, the informer cache has been forked to allow removal of informers once they are no longer needed.
Since controllers still cannot be removed and event handlers cannot be detached from a running informer, all events for
dynamically-watched resources are routed through the watch manager and distributed to the interested parties.
There is now a single constraint controller for all constraint kinds, and a single sync controller for all synced resources.
* Fork controller-runtime 0.5.0's cache.Cache into a new DynamicCache, supporting removal of informers from its informer map.
* Introduce non-blocking API to DynamicCache to support adding informers without waiting for them to sync.
* Rewrite watch.Manager to add/remove informers in the dynamicCache. Informer events will be sent to the watch manager,
and distributed to any watching registrars.
* Constraint and Sync controllers are now singletons which demultiplex incoming events and process multiple resource kinds.
* Add tests for the config controller verifying cache correctness and interactions with config set changes.
* Sync controller will now drop events for resources that are no longer in the active watch set.
* Removed ordering constraints between ConstraintTemplate and Constraint controllers - deletion can be processed out-of-order.
Signed-off-by: Oren Shomron <shomron@gmail.com>
┌────────────────────────────────────────────┐ ┌─────────────────────────────────────┐
│ watch.Manager │ │ Informer Cache │
│ │ │ │
│ ┌────────────────┐ │ │ │
│┌──────┬──Fan-out───│ EventHandler │◀─────┼─────┐ │┌───────────────────────────────────┐│
││ │ └────────────────┘ │ ├─────┤│ Informer ││
││ │ │ │ │└───────────────────────────────────┘│
││ ▼ │ │ │┌───────────────────────────────────┐│
││ ┌──────┐ ┌──────────────────────────┐ │ ├─────┤│ Informer ││
││ │Events├ ─ ─│Registrar (constraint-ctl)│ │ │ │└───────────────────────────────────┘│
││ └──────┘ └──────────────────────────┘ │ │ │┌───────────────────────────────────┐│
││ ┌──────┐ ┌──────────────────────────┐ │ └─────┤│ Informer ││
│└─▶│Events├ ─ ─│Registrar (config-ctl) │ │ │└───────────────────────────────────┘│
│ └───┬──┘ └──────────────────────────┘ │ │ │
└───────┼────────────────────────────────────┘ └─────────────────────────────────────┘
│
│
│ source.Channel
│
│
│ ┌─────────────────────┐
│ │Constraint Controller│
│ │ (Single) │
│ │ │
└──▶│ │
│ │
│ │
└─────────────────────┘
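The source.Channel edge in the diagram amounts to the controller draining the manager-fed channel into its work queue. A stdlib-only sketch of that consumption loop follows; genericEvent and consume are illustrative, and the real wiring goes through controller-runtime's source.Channel and a workqueue:

```go
package main

import "fmt"

// genericEvent stands in for event.GenericEvent.
type genericEvent struct{ kind, name string }

// consume drains the events channel (fed by the watch manager's fan-out) and
// enqueues one reconcile key per event, like source.Channel feeding a queue.
func consume(events <-chan genericEvent, queue chan<- string) {
	for e := range events {
		queue <- e.kind + "/" + e.name
	}
	close(queue)
}

func main() {
	events := make(chan genericEvent, 1024) // buffered, as in the PR
	queue := make(chan string, 1024)
	events <- genericEvent{"FooConstraint", "deny-all"}
	events <- genericEvent{"BarConstraint", "allow-some"}
	close(events)
	consume(events, queue)
	for key := range queue {
		fmt.Println(key)
	}
}
```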
sozercan
left a comment
Ran load tests against v3.1-beta7, master (from 3/19, commit 5fe2a5), and dynamic-cache; results seem roughly consistent.
@maxsmythe / @sozercan / @ritazh thank you all for your time and thoughtful reviews! Are there any remaining concerns, or are we good to merge?
Mergeable from my view. I can merge after Rita's LGTM.
Relates to discussion in #473