Skip to content

add optional mutation checks for shared informer cache#27784

Merged
k8s-github-robot merged 1 commit intokubernetes:masterfrom
deads2k:catch-mutators
Oct 18, 2016
Merged

add optional mutation checks for shared informer cache#27784
k8s-github-robot merged 1 commit intokubernetes:masterfrom
deads2k:catch-mutators

Conversation

@deads2k
Copy link
Copy Markdown
Contributor

@deads2k deads2k commented Jun 21, 2016

We need to make sure that no one is mutating caches if they're using a shared informer. It is important that whatever is tracking those changes gets the object before anyone else possibly could.

This adds the ability to track the original objects in the cache and their current values. Go doesn't have an exit hook or a way to say "wait for non-daemon go-funcs to complete before exit", so this runs a gofunc on a loop that can panic the entire process. It's gated behind an env var.

@derekwaynecarr did I get the right spots to make sure that e2e runs with this flag?
@smarterclayton @kubernetes/rh-cluster-infra


This change is Reviewable

@derekwaynecarr
Copy link
Copy Markdown
Member

I think you need to also edit hack/jenkins/e2e-runner.sh / cc @pwittrock

@derekwaynecarr
Copy link
Copy Markdown
Member

actually, i am not sure which jenkins script is used, containerized or not. @fejta or @spxtr?

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jun 21, 2016
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many of these threads are running? How will the panic interact with GC? Is there a chance of running into an issue like #23210 ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many of these threads are running? How will the panic interact with GC? Is there a chance of running into an issue like #23210 ?

I don't think that is the same as a timer.

I'm not sure what the panic does to GC, but you'll notice that we aren't attempting to recover from it. We want to die. We'll set this env var during testing to try to flush out naughty controllers.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this code catch the bug in deployment we recently had?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this code catch the bug in deployment we recently had?

There were two, both related to annotation map mutation on shallow copies. One was a consistent failure (and this code was actually how we found it). The other would have been a flake, but the window for catching it would be small since it change and changed back very quickly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hard time convincing myself this is worth it when transient
changes will probably slip by and even constant changes that happen rarely
will merely result in flaky behavior.

OTOH I don't have better ideas.

On Thu, Aug 18, 2016 at 5:42 AM, David Eads notifications@github.com
wrote:

In pkg/controller/framework/mutation_detector.go
#27784 (comment)
:

+// cacheObj holds the actual object and a copy
+type cacheObj struct {

  • cached interface{}
  • copied interface{}
    +}

+func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {

  • // we DON'T want protection from panics. If we're running this code, we want to die
  • go func() {
  •   for {
    
  •       d.CompareObjects()
    
  •       select {
    
  •       case <-stopCh:
    
  •           return
    
  •       case <-time.After(d.period):
    

Would this code catch the bug in deployment we recently had?

There were two, both related to annotation map mutation on shallow copies.
One was a consistent failure (and this code was actually how we found it).
The other would have been a flake, but the window for catching it would be
small since it change and changed back very quickly.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/kubernetes/pull/27784/files/0cea212f3d60e5ceb403607af009ef1633d7bbfa#r75300013,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAnglrlqXS87dNKbNNRrZHDefT7B1W2lks5qhFMdgaJpZM4I69qi
.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a hard time convincing myself this is worth it when transient
changes will probably slip by and even constant changes that happen rarely
will merely result in flaky behavior.

OTOH I don't have better ideas.

This has caught problems, both in openshift and kube. The problems aren't imaginary and they are difficult for reviewers to catch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I will stop objecting.

On Thu, Aug 18, 2016 at 1:08 PM, David Eads notifications@github.com
wrote:

In pkg/controller/framework/mutation_detector.go
#27784 (comment)
:

+// cacheObj holds the actual object and a copy
+type cacheObj struct {

  • cached interface{}
  • copied interface{}
    +}

+func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {

  • // we DON'T want protection from panics. If we're running this code, we want to die
  • go func() {
  •   for {
    
  •       d.CompareObjects()
    
  •       select {
    
  •       case <-stopCh:
    
  •           return
    
  •       case <-time.After(d.period):
    

I have a hard time convincing myself this is worth it when transient
changes will probably slip by and even constant changes that happen rarely
will merely result in flaky behavior.

OTOH I don't have better ideas.

This has caught problems, both in openshift and kube. The problems aren't
imaginary and they are difficult for reviewers to catch.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/kubernetes/kubernetes/pull/27784/files/0cea212f3d60e5ceb403607af009ef1633d7bbfa#r75379572,
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAngloIYVTdJqYfHbnO9ywwCNWu8BAE5ks5qhLvPgaJpZM4I69qi
.

@fejta
Copy link
Copy Markdown
Contributor

fejta commented Jun 21, 2016

For the time being jenkins calls dockerized-e2e-runner.sh which in turn calls e2e-runner.sh

func (dummyMutationDetector) CompareObjects() {
}

// cacheMutationDetector gives a way to detect if a cached object has been mutated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/cacheMutationDetector/defaultCacheMutationDetector/

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: . missing at EOL

@0xmichalis
Copy link
Copy Markdown
Contributor

Good idea. Two nits aside, lgtm

@pwittrock
Copy link
Copy Markdown
Member

@deads2k where was this added? Would it be part of the node e2e ginkgo tests? Does it need to be added to them?

@deads2k
Copy link
Copy Markdown
Contributor Author

deads2k commented Jun 24, 2016

@deads2k where was this added? Would it be part of the node e2e ginkgo tests? Does it need to be added to them?

I've tried to add them it to all the e2e runs, but I'm not certain which of the numerous scripts needs to be modified.

@pwittrock
Copy link
Copy Markdown
Member

@vishh could you take a look to make sure this does what we need for node e2e?

@vishh
Copy link
Copy Markdown
Contributor

vishh commented Jun 24, 2016

@deads2k You need to edit test/e2e_node/e2e_service.go as well.

@vishh
Copy link
Copy Markdown
Contributor

vishh commented Jun 24, 2016

BTW is there an issue that warrants this PR?

@smarterclayton
Copy link
Copy Markdown
Contributor

It's primarily defensive, but the implications of a cache mutation are very
serious and would be almost impossible to diagnose in production. So we
will soak test with this. I believe we've already found controllers that
mutated their caches.

On Fri, Jun 24, 2016 at 5:15 PM, Vish Kannan notifications@github.com
wrote:

BTW is there an issue that warrants this PR?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#27784 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/ABG_p1v-1AC_UTuxpg3zSY5aEwXYmkbxks5qPEkEgaJpZM4I69qi
.

@vishh
Copy link
Copy Markdown
Contributor

vishh commented Jun 25, 2016

Ack. This feature has to be made discover-able somehow.

On Fri, Jun 24, 2016 at 4:40 PM, Clayton Coleman notifications@github.com
wrote:

It's primarily defensive, but the implications of a cache mutation are very
serious and would be almost impossible to diagnose in production. So we
will soak test with this. I believe we've already found controllers that
mutated their caches.

On Fri, Jun 24, 2016 at 5:15 PM, Vish Kannan notifications@github.com
wrote:

BTW is there an issue that warrants this PR?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<
#27784 (comment)
,
or mute the thread
<
https://github.com/notifications/unsubscribe/ABG_p1v-1AC_UTuxpg3zSY5aEwXYmkbxks5qPEkEgaJpZM4I69qi

.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#27784 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AGvIKKN2-hhHu5kgKgWQjYgf9vszaHLCks5qPGsHgaJpZM4I69qi
.

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 9, 2016
@mikedanese mikedanese assigned 0xmichalis and vishh and unassigned mikedanese Jul 14, 2016
listerWatcher: lw,
objectType: objType,
fullResyncPeriod: resyncPeriod,
cacheMutationDetector: NewCacheMutationDetector(fmt.Sprintf("%v", reflect.TypeOf(objType))),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use %t and don't import reflect

@lavalamp
Copy link
Copy Markdown
Contributor

This is a good goal but can we brainstorm about other designs?

@rsc If go had a "const" keyword we wouldn't need to do things like this...

@deads2k
Copy link
Copy Markdown
Contributor Author

deads2k commented Jul 15, 2016

This is a good goal but can we brainstorm about other designs?

I'm open to other suggestions, but this is all I came up with. Since everything gets direct access to the struct, there's no spot to intercept a mutation. We caught one from the deployment controller with it. This still doesn't catch deletions, but those are a little harder to do accidentally.

@deads2k deads2k closed this Sep 6, 2016
@deads2k deads2k deleted the catch-mutators branch September 6, 2016 17:23
@deads2k deads2k restored the catch-mutators branch September 6, 2016 17:24
@deads2k deads2k reopened this Sep 6, 2016
@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2016
@jessfraz jessfraz modified the milestone: v1.4 Oct 6, 2016
@deads2k
Copy link
Copy Markdown
Contributor Author

deads2k commented Oct 13, 2016

@jayunit100 Is there a way for me to make the CI e2e tests use this flag on their processes? I've added it to local-up-cluster.sh and test.sh which I think covers people's "normal" local machine startup and test-go and test-integration.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 13, 2016
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Jenkins verification failed for commit dee1f2a761b906acc7d1da86056bd8aa0630cebc. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@deads2k deads2k added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Oct 13, 2016
@deads2k
Copy link
Copy Markdown
Contributor Author

deads2k commented Oct 13, 2016

@Kargakis ping


lock sync.Mutex
cachedObjs []cacheObj
failureFunc func(message string)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add godoc that if this is not specified, we panic

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add godoc that if this is not specified, we panic

Sure though that's desired behavior. This func only exists for unit testing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then call that out


d.lock.Lock()
defer d.lock.Unlock()
d.cachedObjs = append(d.cachedObjs, cacheObj{cached: obj, copied: copiedObj})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to dedup?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to dedup?

I think we want a reference to every reference that has existed since we don't know which reference any particular observer (and naughty mutator) may have gotten.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see

for _, d := range obj.(Deltas) {
switch d.Type {
case Sync, Added, Updated:
s.cacheMutationDetector.AddObject(d.Object)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also want to remove on Deleted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we also want to remove on Deleted?

Deletion doesn't guarantee references are gone. If we end up with memory pressure problems during runs, we could consider that.

@0xmichalis
Copy link
Copy Markdown
Contributor

Aside the comment for failureFunc, this LGTM

@deads2k deads2k added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 18, 2016
@deads2k
Copy link
Copy Markdown
Contributor Author

deads2k commented Oct 18, 2016

Aside the comment for failureFunc, this LGTM

comment added.

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Jenkins GKE smoke e2e failed for commit aee54ae. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot
Copy link
Copy Markdown

@k8s-bot test this [submit-queue is verifying that this PR is safe to merge]

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Jenkins GCI GKE smoke e2e failed for commit aee54ae. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-github-robot
Copy link
Copy Markdown

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 4b7024e into kubernetes:master Oct 18, 2016
@deads2k deads2k deleted the catch-mutators branch February 1, 2017 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.