add optional mutation checks for shared informer cache by deads2k · Pull Request #27784 · kubernetes/kubernetes

deads2k · 2016-06-21T17:08:09Z

We need to make sure that no one is mutating caches if they're using a shared informer. It is important that whatever is tracking those changes gets the object before anyone else possibly could.

This adds the ability to track the original objects in the cache and their current values. Go doesn't have an exit hook or a way to say "wait for non-daemon go-funcs to complete before exit", so this runs a gofunc on a loop that can panic the entire process. It's gated behind an env var.
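As a rough illustration of the env-var gate described above, here is a minimal sketch. The variable name `EXAMPLE_CACHE_MUTATION_DETECTOR`, the `newDetector` helper, and the `getenv` parameter are assumptions made for this example; the PR text only says the check is gated behind an env var, and the tracking detector's body is left empty here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// exampleEnvVar is a hypothetical name chosen for this sketch; the PR
// description does not state the real variable name.
const exampleEnvVar = "EXAMPLE_CACHE_MUTATION_DETECTOR"

// mutationDetector lists two methods that appear in the review excerpts.
type mutationDetector interface {
	AddObject(obj interface{})
	CompareObjects()
}

// dummyDetector is the no-op used when the check is disabled.
type dummyDetector struct{}

func (dummyDetector) AddObject(obj interface{}) {}
func (dummyDetector) CompareObjects()           {}

// trackingDetector stands in for the real detector; its logic is omitted.
type trackingDetector struct{ name string }

func (*trackingDetector) AddObject(obj interface{}) {}
func (*trackingDetector) CompareObjects()           {}

// newDetector returns the no-op unless the variable parses as true.
// getenv is injected so the gate can be exercised without touching the
// process environment.
func newDetector(name string, getenv func(string) string) mutationDetector {
	enabled, err := strconv.ParseBool(getenv(exampleEnvVar))
	if err != nil || !enabled {
		return dummyDetector{}
	}
	return &trackingDetector{name: name}
}

func main() {
	// With the real environment the result depends on how the process was started.
	fmt.Printf("%T\n", newDetector("*api.Pod", os.Getenv))

	// Forcing the gate on selects the tracking implementation.
	on := func(string) string { return "true" }
	fmt.Printf("%T\n", newDetector("*api.Pod", on))
}
```

With this shape, production binaries pay nothing unless the variable is set, and test scripts can opt in by exporting it.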

@derekwaynecarr did I get the right spots to make sure that e2e runs with this flag?
@smarterclayton @kubernetes/rh-cluster-infra

derekwaynecarr · 2016-06-21T17:09:37Z

I think you need to also edit hack/jenkins/e2e-runner.sh / cc @pwittrock

derekwaynecarr · 2016-06-21T17:10:44Z

actually, i am not sure which jenkins script is used, containerized or not. @fejta or @spxtr?

stevekuznetsov · 2016-06-21T17:48:27Z

pkg/controller/framework/mutation_detector.go

How many of these threads are running? How will the panic interact with GC? Is there a chance of running into an issue like #23210 ?

> How many of these threads are running? How will the panic interact with GC? Is there a chance of running into an issue like #23210 ?

I don't think that is the same as a timer.

I'm not sure what the panic does to GC, but you'll notice that we aren't attempting to recover from it. We want to die. We'll set this env var during testing to try to flush out naughty controllers.
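A minimal sketch of the kind of polling loop being discussed, assuming one goroutine per detector. `pollUntilStopped` and `countChecks` are hypothetical names, and this version uses a single `time.Ticker` instead of calling `time.After` on each pass; that is a substitution for illustration, not what the PR does.

```go
package main

import (
	"fmt"
	"time"
)

// pollUntilStopped calls check immediately and then once per period until
// stopCh is closed. There is deliberately no recover(): a panic raised by
// check on this goroutine terminates the whole process.
func pollUntilStopped(period time.Duration, stopCh <-chan struct{}, check func()) {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		check()
		select {
		case <-stopCh:
			return
		case <-ticker.C:
		}
	}
}

// countChecks runs the loop for roughly runFor and reports how many times
// the check function was invoked.
func countChecks(period, runFor time.Duration) int {
	stopCh := make(chan struct{})
	done := make(chan struct{})
	calls := 0
	go func() {
		defer close(done)
		pollUntilStopped(period, stopCh, func() { calls++ })
	}()
	time.Sleep(runFor)
	close(stopCh)
	<-done // the loop has exited, so reading calls is safe
	return calls
}

func main() {
	fmt.Println("checks run:", countChecks(time.Millisecond, 20*time.Millisecond))
}
```

A single ticker keeps one timer alive for the lifetime of the loop, and closing `stopCh` ends the goroutine.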

Would this code catch the bug in deployment we recently had?

> Would this code catch the bug in deployment we recently had?

There were two, both related to annotation map mutation on shallow copies. One was a consistent failure (and this code was actually how we found it). The other would have been a flake, but the window for catching it would be small since it changed and changed back very quickly.
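To make the shallow-copy hazard concrete, here is a small sketch. The `deployment` type and both copy helpers are hypothetical stand-ins, not the real API types or the actual deployment controller code.

```go
package main

import "fmt"

// deployment is a hypothetical stand-in for an API object that carries an
// annotation map.
type deployment struct {
	Name        string
	Annotations map[string]string
}

// shallowCopy copies the struct value only. The map header is copied, so
// both structs still refer to the same underlying map.
func shallowCopy(d *deployment) *deployment {
	c := *d
	return &c
}

// deepCopy also clones the map, so the copy can be modified safely.
func deepCopy(d *deployment) *deployment {
	c := *d
	if d.Annotations != nil {
		c.Annotations = make(map[string]string, len(d.Annotations))
		for k, v := range d.Annotations {
			c.Annotations[k] = v
		}
	}
	return &c
}

func main() {
	cached := &deployment{Name: "web", Annotations: map[string]string{"revision": "1"}}

	s := shallowCopy(cached)
	s.Annotations["revision"] = "2"             // writes through to the cached object
	fmt.Println(cached.Annotations["revision"]) // prints 2

	d := deepCopy(cached)
	d.Annotations["revision"] = "3"             // cached object is untouched
	fmt.Println(cached.Annotations["revision"]) // prints 2
}
```

The struct assignment looks like a copy, which is why this kind of mutation is easy to miss in review.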

I have a hard time convincing myself this is worth it when transient changes will probably slip by and even constant changes that happen rarely will merely result in flaky behavior.

OTOH I don't have better ideas.

In pkg/controller/framework/mutation_detector.go:

+// cacheObj holds the actual object and a copy
+type cacheObj struct {
+	cached interface{}
+	copied interface{}
+}
+
+func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {
+	// we DON'T want protection from panics. If we're running this code, we want to die
+	go func() {
+		for {
+			d.CompareObjects()
+
+			select {
+			case <-stopCh:
+				return
+			case <-time.After(d.period):

> I have a hard time convincing myself this is worth it when transient changes will probably slip by and even constant changes that happen rarely will merely result in flaky behavior.
>
> OTOH I don't have better ideas.

This has caught problems, both in openshift and kube. The problems aren't imaginary and they are difficult for reviewers to catch.

OK. I will stop objecting.

fejta · 2016-06-21T18:33:44Z

For the time being jenkins calls dockerized-e2e-runner.sh which in turn calls e2e-runner.sh

0xmichalis · 2016-06-23T08:41:02Z

pkg/controller/framework/mutation_detector.go

+func (dummyMutationDetector) CompareObjects() {
+}
+
+// cacheMutationDetector gives a way to detect if a cached object has been mutated


s/cacheMutationDetector/defaultCacheMutationDetector/

nit: . missing at EOL

0xmichalis · 2016-06-23T08:44:55Z

Good idea. Two nits aside, lgtm

pwittrock · 2016-06-23T22:10:09Z

@deads2k where was this added? Would it be part of the node e2e ginkgo tests? Does it need to be added to them?

deads2k · 2016-06-24T11:42:11Z

> @deads2k where was this added? Would it be part of the node e2e ginkgo tests? Does it need to be added to them?

I've tried to add it to all the e2e runs, but I'm not certain which of the numerous scripts needs to be modified.

pwittrock · 2016-06-24T16:14:10Z

@vishh could you take a look to make sure this does what we need for node e2e?

vishh · 2016-06-24T21:15:00Z

@deads2k You need to edit test/e2e_node/e2e_service.go as well.

vishh · 2016-06-24T21:15:34Z

BTW is there an issue that warrants this PR?

smarterclayton · 2016-06-24T23:40:23Z

It's primarily defensive, but the implications of a cache mutation are very serious and would be almost impossible to diagnose in production. So we will soak test with this. I believe we've already found controllers that mutated their caches.

vishh · 2016-06-25T00:28:55Z

Ack. This feature has to be made discoverable somehow.

lavalamp · 2016-07-14T23:14:17Z

pkg/controller/framework/shared_informer.go

+		listerWatcher:         lw,
+		objectType:            objType,
+		fullResyncPeriod:      resyncPeriod,
+		cacheMutationDetector: NewCacheMutationDetector(fmt.Sprintf("%v", reflect.TypeOf(objType))),


use %T and don't import reflect
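For reference, Go's `%T` verb prints the dynamic type of its operand (lowercase `%t` formats booleans), so it yields the same string as formatting `reflect.TypeOf` with `%v`. A quick sketch with a hypothetical `pod` type:

```go
package main

import (
	"fmt"
	"reflect"
)

// pod is a hypothetical placeholder type for this comparison.
type pod struct{}

// typeNameViaReflect is the form used in the diff hunk above.
func typeNameViaReflect(obj interface{}) string {
	return fmt.Sprintf("%v", reflect.TypeOf(obj))
}

// typeNameViaVerb produces the same string without importing reflect.
func typeNameViaVerb(obj interface{}) string {
	return fmt.Sprintf("%T", obj)
}

func main() {
	obj := &pod{}
	fmt.Println(typeNameViaReflect(obj)) // *main.pod
	fmt.Println(typeNameViaVerb(obj))    // *main.pod
}
```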

lavalamp · 2016-07-14T23:54:44Z

This is a good goal but can we brainstorm about other designs?

@rsc If go had a "const" keyword we wouldn't need to do things like this...

deads2k · 2016-07-15T14:45:55Z

> This is a good goal but can we brainstorm about other designs?

I'm open to other suggestions, but this is all I came up with. Since everything gets direct access to the struct, there's no spot to intercept a mutation. We caught one from the deployment controller with it. This still doesn't catch deletions, but those are a little harder to do accidentally.

deads2k · 2016-10-13T12:58:38Z

@jayunit100 Is there a way for me to make the CI e2e tests use this flag on their processes? I've added it to local-up-cluster.sh and test.sh which I think covers people's "normal" local machine startup and test-go and test-integration.

k8s-ci-robot · 2016-10-13T13:24:43Z

Jenkins verification failed for commit dee1f2a761b906acc7d1da86056bd8aa0630cebc. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

deads2k · 2016-10-13T17:12:49Z

@Kargakis ping

0xmichalis · 2016-10-13T20:00:46Z

pkg/client/cache/mutation_detector.go

+
+	lock        sync.Mutex
+	cachedObjs  []cacheObj
+	failureFunc func(message string)


add godoc that if this is not specified, we panic

> add godoc that if this is not specified, we panic

Sure, though that's the desired behavior. This func only exists for unit testing.

Then call that out
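A sketch of how such a hook might be documented and used, assuming a pared-down `detector` struct and a hypothetical `report` method; only the `failureFunc` field name and its signature come from the diff.

```go
package main

import "fmt"

// detector is a hypothetical, reduced version of the struct in the diff;
// only the failure hook is modelled.
type detector struct {
	// failureFunc exists for unit tests only. When it is nil, a detected
	// mutation panics, which is the intended behavior outside tests.
	failureFunc func(message string)
}

// report hands the message to the test hook if one is set and panics otherwise.
func (d *detector) report(message string) {
	if d.failureFunc != nil {
		d.failureFunc(message)
		return
	}
	panic(message)
}

func main() {
	// In a unit test the hook records the failure instead of killing the process.
	var got []string
	d := &detector{failureFunc: func(m string) { got = append(got, m) }}
	d.report("cache object was modified")
	fmt.Println(got)

	// Without the hook the same call panics; recover is used here only so
	// the example can show that.
	defer func() { fmt.Println("recovered:", recover()) }()
	(&detector{}).report("cache object was modified")
}
```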

0xmichalis · 2016-10-13T20:05:29Z

pkg/client/cache/mutation_detector.go

+
+	d.lock.Lock()
+	defer d.lock.Unlock()
+	d.cachedObjs = append(d.cachedObjs, cacheObj{cached: obj, copied: copiedObj})


Don't we want to dedup?

> Don't we want to dedup?

I think we want a reference to every reference that has existed since we don't know which reference any particular observer (and naughty mutator) may have gotten.

0xmichalis · 2016-10-13T20:06:23Z

pkg/client/cache/shared_informer.go

 	for _, d := range obj.(Deltas) {
 		switch d.Type {
 		case Sync, Added, Updated:
+			s.cacheMutationDetector.AddObject(d.Object)


Don't we also want to remove on Deleted?

> Don't we also want to remove on Deleted?

Deletion doesn't guarantee references are gone. If we end up with memory pressure problems during runs, we could consider that.

0xmichalis · 2016-10-14T09:32:38Z

Aside from the comment for failureFunc, this LGTM

deads2k · 2016-10-18T13:19:55Z

> Aside from the comment for failureFunc, this LGTM

comment added.

k8s-ci-robot · 2016-10-18T14:07:20Z

Jenkins GKE smoke e2e failed for commit aee54ae. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-github-robot · 2016-10-18T20:58:41Z

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

k8s-ci-robot · 2016-10-18T21:22:56Z

Jenkins GCI GKE smoke e2e failed for commit aee54ae. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

k8s-github-robot · 2016-10-18T21:38:57Z

Automatic merge from submit-queue

googlebot added the cla: yes label Jun 21, 2016

k8s-github-robot assigned mikedanese Jun 21, 2016

k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jun 21, 2016

stevekuznetsov reviewed Jun 21, 2016

deads2k force-pushed the catch-mutators branch from 0cea212 to e825a54 Compare June 21, 2016 17:49

deads2k force-pushed the catch-mutators branch from e825a54 to 07e9700 Compare June 21, 2016 18:56

0xmichalis reviewed Jun 23, 2016

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2016

mikedanese assigned 0xmichalis and vishh and unassigned mikedanese Jul 14, 2016

vishh assigned lavalamp Jul 14, 2016

lavalamp reviewed Jul 14, 2016

vishh removed their assignment Jul 19, 2016

0xmichalis mentioned this pull request Jul 21, 2016

update node controller to use shared pod informer #24841

Merged

deads2k closed this Sep 6, 2016

deads2k deleted the catch-mutators branch September 6, 2016 17:23

deads2k restored the catch-mutators branch September 6, 2016 17:24

deads2k reopened this Sep 6, 2016

0xmichalis mentioned this pull request Sep 14, 2016

List pods for deployment just once #32487

Closed

k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2016

jessfraz modified the milestone: v1.4 Oct 6, 2016

deads2k force-pushed the catch-mutators branch from dc51257 to dee1f2a Compare October 13, 2016 12:56

k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 13, 2016

deads2k force-pushed the catch-mutators branch from dee1f2a to 5a70651 Compare October 13, 2016 13:42

deads2k added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Oct 13, 2016

0xmichalis reviewed Oct 13, 2016


add optional mutation checks for shared informer cache

aee54ae

deads2k force-pushed the catch-mutators branch from 5a70651 to aee54ae Compare October 18, 2016 13:19

deads2k added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 18, 2016

k8s-github-robot merged commit 4b7024e into kubernetes:master Oct 18, 2016

deads2k deleted the catch-mutators branch February 1, 2017 17:25
