
Implement dynamic watch manager responsible for event dispatch to watching controllers#490

Merged
maxsmythe merged 1 commit into open-policy-agent:master from shomron:dynamic-cache
Mar 20, 2020
Conversation

@shomron
Contributor

@shomron shomron commented Feb 21, 2020

Implement dynamic watch manager responsible for event dispatch to watching controllers

This commit is a significant change in how dynamic watches are implemented and routed. Previously, the watch manager
managed the lifecycle of subordinate controllers via a subordinate controller-runtime Manager, which was rebuilt whenever
the set of watched resources changed. This disposed of informer caches and required a full resync from the API server.
This approach worked around controller-runtime's inability to remove active informers and controllers from a running Manager.

Here we introduce a new approach where there is only a single controller-runtime Manager and controllers are fixed.
Instead, the informer cache has been forked to allow removal of informers once they are no longer needed.
Since controllers still cannot be removed and event handlers cannot be detached from a running informer, all events for
dynamically-watched resources are routed through the watch manager and distributed to the interested parties.
There is now a single constraint controller for all constraint kinds, and a single sync controller for all synced resources.

  • Fork controller-runtime 0.5.0's cache.Cache into a new DynamicCache, supporting removal of informers from its informer map.
  • Introduce non-blocking API to DynamicCache to support adding informers without waiting for them to sync.
  • Rewrite watch.Manager to add/remove informers in the dynamicCache. Informer events will be sent to the watch manager,
    and distributed to any watching registrars.
  • Constraint and Sync controllers are now singletons which demultiplex incoming events and process multiple resource kinds.
  • Add tests for the config controller verifying cache correctness and interactions with config set changes.
  • Sync controller will now drop events for resources that are no longer in the active watch set.
  • Removed ordering constraints between ConstraintTemplate and Constraint controllers - deletion can be processed out-of-order.

Signed-off-by: Oren Shomron &lt;shomron@gmail.com&gt;

┌────────────────────────────────────────────┐           ┌─────────────────────────────────────┐
│               watch.Manager                │           │           Informer Cache            │
│                                            │           │                                     │
│                    ┌────────────────┐      │           │                                     │
│┌──────┬──Fan-out───│  EventHandler  │◀─────┼─────┐     │┌───────────────────────────────────┐│
││      │            └────────────────┘      │     ├─────┤│             Informer              ││
││      │                                    │     │     │└───────────────────────────────────┘│
││      ▼                                    │     │     │┌───────────────────────────────────┐│
││  ┌──────┐    ┌──────────────────────────┐ │     ├─────┤│             Informer              ││
││  │Events├ ─ ─│Registrar (constraint-ctl)│ │     │     │└───────────────────────────────────┘│
││  └──────┘    └──────────────────────────┘ │     │     │┌───────────────────────────────────┐│
││  ┌──────┐    ┌──────────────────────────┐ │     └─────┤│             Informer              ││
│└─▶│Events├ ─ ─│Registrar (config-ctl)    │ │           │└───────────────────────────────────┘│
│   └───┬──┘    └──────────────────────────┘ │           │                                     │
└───────┼────────────────────────────────────┘           └─────────────────────────────────────┘
        │
        │
        │ source.Channel
        │
        │
        │   ┌─────────────────────┐
        │   │Constraint Controller│
        │   │      (Single)       │
        │   │                     │
        └──▶│                     │
            │                     │
            │                     │
            └─────────────────────┘

Relates to discussion in #473

@maxsmythe
Contributor

This looks extremely promising! I may have missed some notes b/c it's late, but overall this should really simplify things and get us 90% of the way to removing the need for finalizers. The last step being wiping the constraints cached in OPA when a template is deleted.

ctrl.SetLogger(crzap.Logger(true))

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
NewCache: dynamiccache.New,
Contributor

If the two managers share the same cache, the parent manager needs to have some way of making sure child managers don't un-watch resources that the parent manager is interested in (e.g. Constraint Templates here).

Contributor Author

I believe that is handled here?

Contributor

The link is broken for me. Where is this handled? I kept an eye out, but couldn't find anything obvious.

Contributor Author

Let me make sure I understand the question.
Within a watch manager, with multiple registrars watching the same resource, the watch manager performs reference counting and only removes the informer when the last registrar unregisters. That happens in this code in doRemoveWatch:

// Skip if there are additional watchers that would prevent us from removing it
m := wm.managedKinds.Get() // Not a deadlock but beware if assumptions change...
if _, ok := m[gvk]; ok {
	return nil
}

If you're asking about having multiple watch managers in-process, they would be unaware of each other. I'd say that is an unsupported use-case - should we document that to make it more obvious? Do you have another idea?

Also one more clarification - we no longer need multiple controller-runtime managers in a single process. The parent/child manager relationship previously used has been removed.

Contributor

The controller runtime manager and the watch manager share the same cache right?

What happens if I add a sync for constraint templates and then remove it. Is the watch that the controller manager established preserved somehow?

Contributor Author

In practice they wouldn't affect each other, as the watch manager only deals in unstructured informers and the constrainttemplate controller uses a structured informer. These are completely separate in the InformerCache.
I could think of safeguards to put in but perhaps we should handle those as followup tasks?

Contributor

Ah, I see the unstructured/structured non-interaction thing.

I think safeguards would be good as this seems like a very subtle behavioral difference to rely on and I wonder if the fact that they are different caches is officially part of the contract.

That being said, I'm fine with deferring more explicit safeguards to a follow-on change. Hopefully sooner rather than later.

log.Error(err, "invalid constraint GroupVersion", "gvk", gvk)
return reconcile.Result{}, nil
}
// TODO(OREN) do we need to ask the watch manager if we are registered for this kind?
Contributor

Asking the watch manager seems superfluous, as the channel wouldn't have received the event w/o registration.

Contributor Author

A scenario I was considering was when events are in-flight in the various queues and are received after we have deregistered the watch. I think concretely for this controller it will be fine - de-registration happens after the parent template is removed from OPA, by which point either constraints should have been removed by the garbage collector or if their cache lags, a lookup in OPA should fail and we should drop the request.

@shomron
Contributor Author

shomron commented Feb 25, 2020

The last step being wiping the constraints cached in OPA when a template is deleted.

...which I believe we were going to address within the constraint framework, correct?

syncAdder := syncc.Adder{Opa: opa}
// Events will be used to receive events from dynamic watches registered
// via the registrar below.
events := make(chan event.GenericEvent, 1024)
Member

Is there an upper limit to how many sources we can have?

Contributor Author

There is not. Each controller has their own channel to receive event fan-out from the watch manager. The channel is buffered only to provide some cushion in case the controller is slow to consume from that channel. Consumption is comprised of encoding the event and putting it into a work queue, so it is not blocked on reconciliation time and I do not expect this to become an issue.

The current event distribution is a bit naive and can be improved:

	// Distribute the event
	for _, w := range watchers {
		select {
		case w <- e:
		// TODO(OREN) add timeout
		case <-stop:
		}
	}

^ theoretically a slow consumer that ends up blocking would cause event distribution to block for other consumers as well. This can be improved with a goroutine per consumer in distributeEvent, but at the cost of additional complexity.

@shomron
Contributor Author

shomron commented Feb 26, 2020

Known open tasks before final review:

  • Reintroduce the controllerSwitch mutex in controllers. This enables disabling controllers (which may have events already in their queues) during shutdown when main.go does status teardown.
  • Determine whether a NOTICE file is needed to call out the controller-runtime cache fork. - Will followup with comprehensive NOTICE in a separate PR. The fork is documented and original copyright is maintained.
  • Unit tests for the new functionality
  • Look into config controller - can data wipe be done on just the difference of the watched set?

@shomron
Contributor Author

shomron commented Mar 4, 2020

New stuff!

  • Reintroduced the controller switch. We had discussed using the watch manager as a switch; however, that would not address reconciles already mid-way through, whose status updates race the main teardown logic.
  • The config controller now passes a filtering OpaData client to the sync controller. This client drops any resources that are not in the current watch set as specified by the config controller, which addresses any ReconcileRequests still in-flight when the watch set changes.
  • Tests :) Fixed tests in the various controllers. Added unit and integration tests for the watch manager. Still would like a few more (to test replay).
  • Tests :( Investigating test failures in the forked dynamic cache. I believe this may be a version mismatch (forked from 0.5.0, but our dependencies are still 0.4.0).

Also - need to look at the transactionality of config set changes to cached data.

@shomron shomron changed the title from "WIP: Implement watch.Manager using dynamic cache" to "Implement dynamic watch manager responsible for event dispatch to watching controllers" on Mar 9, 2020
@shomron shomron force-pushed the dynamic-cache branch 3 times, most recently from 1c49883 to ad16949, on March 11, 2020
@shomron
Contributor Author

shomron commented Mar 11, 2020

@maxsmythe this should be ready for review.

Contributor

@maxsmythe maxsmythe left a comment

Some discussion around the watch manager, and I still have a question about how to avoid removing informers for controllers not using the watch manager; otherwise mostly nits.

"fmt"
"sync"

"sigs.k8s.io/controller-runtime/pkg/event"
Contributor

nit: consolidate import groups

Contributor Author

I've been keeping separation between standard library and external imports. Is that acceptable or do you prefer a flat list mixing both?

Contributor

That separation is fine, there are some places where there are more than 2 import blocks though.

Contributor Author

Sorry about that, these are artifacts of GoLand / goimports introducing breaks and me not paying much attention to the collapsed blocks. I've fixed where you called them out.

Contributor

Ack, thanks. I hear you WRT the goimports thing. I get the same thing and try to stay on it, but it's definitely not 100% success rate on my end either.

Contributor Author

@shomron shomron left a comment

@maxsmythe I've reworked the replay threading, please take another look. Thanks again for all your time!

@shomron
Contributor Author

shomron commented Mar 16, 2020

Notes on latest changes:

  • replay events now performed asynchronously
  • doAddWatch: Only fetch informer for first watcher
  • doRemoveWatch: remove registrar from watch list even when it is not the last watcher
  • replaceWatches: fix bug that prevented watches from being added if there was another registrar watching them
  • tests: reset controller-runtime metrics registry between tests to avoid duplicate registration errors
  • add test for ReplaceWatch
  • watch.Set: remove double-locking

case <-ctx.Done():
return nil
case req := <-wm.replayRequests:
byRegistrar, inProgress := m[req.gvk]
Contributor

Should inProgress be triggered by whether the replay is in byRegistrar?

What happens if Registrar A adds a watch, Registrar B adds a watch (and gets replay), and then Registrar C adds a watch? Will a replay trigger?

Contributor Author

You're right, this looks like a bug. I'll fix and add a corresponding test scenario. Good catch!

Contributor Author

Extended test (which initially failed 👍) and fixed.

}

// Verify ReplaceWatch replaces the set of watched resources for a registrar. New watches will be added,
// unneeded watches will be removed, and watches that haven't change will remain unchanged.
Contributor

nit: change -> changed

Contributor Author

Fixed.

Contributor

@maxsmythe maxsmythe left a comment

1 comment on the replay code, otherwise LGTM

@shomron shomron force-pushed the dynamic-cache branch 2 times, most recently from f2098c2 to 818e468, on March 17, 2020
Contributor

@maxsmythe maxsmythe left a comment

LGTM

Well done!

@maxsmythe
Contributor

maxsmythe commented Mar 18, 2020

One issue: were we going to add a NOTICE file due to the forking?

@shomron
Contributor Author

shomron commented Mar 18, 2020

@maxsmythe above I wrote ...Will followup with comprehensive NOTICE in a separate PR. The fork is documented and original copyright is maintained.

Reading through the Apache guidelines - it's not clear to me that it's necessary.

They say:

NOTICE is reserved for a certain subset of legally required notifications which are not satisfied by either the text of LICENSE or the presence of licensing information embedded within the bundled dependency. Aside from Apache-licensed dependencies which supply NOTICE files of their own, it is uncommon for a dependency to require additions to NOTICE.

Copyright notifications which have been relocated from source files (rather than removed) must be preserved in NOTICE. However, elements such as the copyright notifications embedded within BSD and MIT licenses need not be duplicated in NOTICE -- it suffices to leave those notices in their original locations.

It is important to keep NOTICE as brief and simple as possible, as each addition places a burden on downstream consumers.

... and in bold:

Do not add anything to NOTICE which is not legally required.

controller-runtime does not include a NOTICE file we would duplicate. We include their original license and give attribution to the fork.

I'm happy to follow your guidance on this and/or address this at a later time with legal consultation.

WDYT?

@shomron shomron force-pushed the dynamic-cache branch 2 times, most recently from 3bbe5be to 5ff95d9, on March 18, 2020
@shomron
Contributor Author

shomron commented Mar 18, 2020

@maxsmythe ok, I've moved the controller-runtime forked code under third_party, added that to various Dockerfiles and Makefiles, marked modified code inline with a disclaimer, brought in the original controller-runtime LICENSE file, and added a new top-level NOTICE file. I also threw in a .dockerignore for good measure. Let me know how this looks.

@maxsmythe
Contributor

LGTM

constraintsCache *ConstraintsCache) reconcile.Reconciler {
return &ReconcileConstraint{
Client: mgr.GetClient(),
// Separate reader and writer because manager's default client bypasses the cache for unstructured resources.
Member

Separate reader and writer because manager's default client bypasses the cache for unstructured resources.

Is there any documentation around this?

Could there be any issues with reader.Get or reader.List when reading from the cache? E.g. data freshness?

Contributor Author

I don't know that it's documented, but the code is here: https://github.com/kubernetes-sigs/controller-runtime/blob/v0.5.0/pkg/manager/manager.go#L328-L330

Regarding data freshness - we should be good. Although the watch.Manager does not block for cache sync when adding a new watch, if you do a reader.Get or reader.List, controller-runtime will block on WaitForCacheSync here:

if !cache.WaitForCacheSync(syncStop, i.Informer.HasSynced) {

client.Client
cs *watch.ControllerSwitch
reader client.Reader

Member

What's the reason for removing this for sync?

Contributor Author

The ControllerSwitch was only needed in controllers that would conflict with the teardown logic in main.go:L194. The sync controller isn't writing status or removing finalizers, so there shouldn't be any contention to worry about.

Member

@sozercan sozercan left a comment

Ran load tests against v3.1-beta7, master (from 3/19, commit 5fe2a5), and dynamic-cache; results seem consistent across all three.

@shomron
Contributor Author

shomron commented Mar 20, 2020

@maxsmythe / @sozercan / @ritazh thank you all for your time and thoughtful reviews! Are there any remaining concerns, or are we good to merge?

@maxsmythe
Contributor

Mergeable from my view. I can merge after Rita's LGTM

Member

@ritazh ritazh left a comment

LGTM
